Searching PDF Files: Technical Notes

For a general overview that will be sufficient for most users, see Searching PDF Files.

For those who are interested in how the program retrieves text from PDF files, the following notes provide technical details.

PDF is a rendered format consisting of (x,y) coordinates and snippets of text. Converter programs have to use a set of heuristics to try to rebuild what the source text must have been. Nota Bene uses several different programs to do this: sometimes one program does a better job than the others, and we try to detect when that is the case.

More specifically:

Even PDFs that look like regular text are not simply text strings. They are instead a “rendered” format with instructions on where to draw characters, with no guarantee that anything appears in any particular order. These files are thus completely unreadable if opened in a text editor (they are instead what is called “binary”), as noted below. The precise way that these files are encoded varies wildly. Some (such as those produced directly from a text file) have offsets for each character, while others have offsets for each word or each line. Still others (such as those created using OCR) are composed both of images (of characters, words, or whole pages) and of extracted text strings. In addition to what appears to be regular text, these files can of course also contain pictures, illustrations, and other non-textual elements.

Getting the text from these files requires an external program to convert them to searchable text. If the files were scanned using OCR, the extracted text may not match the image (for example, “modern” could come out as “modem”) – there is nothing that can be done about this. But even if the text is composed of characters, conversion can be extremely complicated, especially if the PDF files are composed of offsets plus individual characters, since the program has to combine those into words, something that is very difficult, particularly for justified text. The primary failure in these cases is that words are run together in one long character string.
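To illustrate why per-character offsets are hard to turn back into words, here is a minimal sketch of the kind of heuristic a converter might use. It is not the method used by NB or by any particular filter. The tuple layout, the function name join_glyphs, and the gap_ratio threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical glyph record: (x position, width, character), all on one baseline.
Glyph = Tuple[float, float, str]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, inserting a space wherever the horizontal
    gap between two glyphs exceeds gap_ratio times the glyph width."""
    out = []
    prev_end = None
    # A PDF gives no ordering guarantee, so sort by x position first.
    for x, width, ch in sorted(glyphs, key=lambda g: g[0]):
        if prev_end is not None and x - prev_end > gap_ratio * width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# Normal spacing: the gap between "o" and "b" is wide enough to count as a space.
normal = [(13, 5, "b"), (0, 5, "t"), (18, 5, "e"), (5, 5, "o")]
# Tightly set text: the same gap shrinks below the threshold.
tight = [(0, 5, "t"), (5, 5, "o"), (11, 5, "b"), (16, 5, "e")]

print(join_glyphs(normal))  # to be
print(join_glyphs(tight))   # tobe
```

The second call shows the run-in failure described above: when spacing varies from line to line, as it does in justified text, no single threshold separates every word correctly.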

NB uses the best open-source program we have found to rebuild the text from PDF files. After extraction, we check whether there seem to be run-in words; if there are, we run the file through another filter to see whether it does any better. Which filter is used for this second attempt depends on which PDF-extraction filter is installed on your system (these come either from installing PDF readers such as Acrobat Reader or another program, or from the operating system).
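The check-and-retry sequence can be sketched as follows. This is only an illustration under stated assumptions, not NB's actual code: the notes do not say how run-in words are detected, so the test used here (the share of implausibly long "words"), the 25-character limit, the 5% threshold, and the names run_in_fraction and extract_with_fallback are all hypothetical. The two extractors are passed in as plain functions that take a file path and return text.

```python
from typing import Callable, Optional, Tuple

Extractor = Callable[[str], str]  # hypothetical: file path in, extracted text out


def run_in_fraction(text: str, max_word_len: int = 25) -> float:
    """Share of whitespace-separated tokens that are suspiciously long."""
    words = text.split()
    if not words:
        return 0.0
    long_words = sum(1 for w in words if len(w) > max_word_len)
    return long_words / len(words)


def extract_with_fallback(
    path: str,
    primary: Extractor,
    secondary: Optional[Extractor] = None,
    threshold: float = 0.05,
) -> Tuple[str, str]:
    """Return (text, which extractor produced it)."""
    text = primary(path)
    score = run_in_fraction(text)
    if score <= threshold or secondary is None:
        return text, "primary"
    # The first result looks run-in: try the second filter and keep its
    # output only if it actually scores better.
    retry = secondary(path)
    if run_in_fraction(retry) < score:
        return retry, "secondary"
    return text, "primary"
```

Keeping the primary result unless the second filter clearly improves on it reflects the point that the second attempt is a retry, not a replacement.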

In general, files that can be extracted using the first program are better formatted than files that require a second filter. Specifically, with the first program each entry is a page; with the second filter, an entry may be a page or it may be the entire file.

But sometimes two attempts are not enough – text simply cannot be reliably rebuilt from the file, as we have confirmed by testing still other filters (for example, we have encountered files that have resisted extraction using four different filters, including those from Adobe, who created PDF). Fortunately, these difficult-to-convert files seem to be rather rare. But it’s important that users have some sense of which files might have obvious conversion problems.

Orbis+ tries to give some sense of how much actual searchable text there is in a file, displaying the results graphically under “Textishness (PDFs)” in the indexer. For files that have mostly text content, and for which the primary extraction engine works, these estimates should be a reasonably good guide. But if the PDF has many short pages, partially blank pages, or pages with lots of traditional images in addition to the text, the estimates might be low, even though all the existing text is searchable (for example, all the text in a file with 26% textishness may actually be searchable).
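These notes do not say how the textishness figure is computed. Purely to illustrate why short or blank pages can pull such an estimate down, here is a hypothetical measure that compares the text extracted from each page with an assumed "full page" of 2000 characters. The function name, the formula, and the 2000-character figure are all assumptions for this sketch, not Orbis's calculation.

```python
from typing import List, Optional


def textishness(pages: List[str], full_page_chars: int = 2000) -> Optional[int]:
    """Hypothetical estimate: average page fullness as a percentage.

    Each page scores extracted characters / full_page_chars, capped at 1.0.
    Returns None when there are no pages to measure.
    """
    if not pages:
        return None
    scores = [min(len(p.strip()) / full_page_chars, 1.0) for p in pages]
    return round(100 * sum(scores) / len(scores))


# One full page, one blank page, one short page, one image-only page.
pages = ["x" * 2000, "", "x" * 500, ""]
print(textishness(pages))  # 31
```

In this example every character that exists in the file was extracted and is searchable, yet the score is only 31, because the measure counts blank and short pages against the file.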

Files that require the second filter are harder to estimate accurately. These are shown in red, either with a percentage or with “??%” if the figure is unknown, to alert the user to the possibility that retrieval from them may be difficult. However, in most cases (but not all), files flagged in red can in fact be searched; indeed, those with a low “textishness” rating may even be 100% text.

Orbis offers other options for you to assess searchability. In addition to showing the textishness in the indexer log (by clicking the green + when indexing, or going to File, Nexus Options, View Log File), you can click the three view buttons (the magnifying glasses to the right) to see the file: (a) The first one opens the PDF file in the default viewer. (b) The second one shows the actual (binary, unreadable) data of the PDF file as it would appear if opened in a text editor. (c) The third one shows the actual text that has been extracted from the file.