Searching PDF Files (Orbis+)

 

In general, Orbis+ can search PDF files. The program uses a wide variety of tools to extract text from PDF files so that the text can be searched. This works well for the vast majority of PDF files. However, text cannot be extracted from some files, and those files therefore cannot be searched. Fortunately, such difficult-to-convert files seem to be rare. In addition, for files that were created by scanning and converted with OCR, the text conversion may not be 100% accurate. For example, the scanned word “modern” could come out as “modem.” In this case, the file could be searched, but a search for “modern” would fail to find this instance of the word.
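The OCR problem can be illustrated outside the program with a short Python sketch. It is not how Orbis+ searches; the function names and the sample sentence are made up for illustration. It shows an exact word search missing the misread word, and a similarity check from Python's standard difflib module that would flag it as a near miss.

```python
import difflib
import re

def exact_hits(word, text):
    """Return start positions of whole-word, case-insensitive matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

def near_misses(word, text, cutoff=0.7):
    """Return words in the text that are similar to, but not equal to, word."""
    word = word.lower()
    vocabulary = sorted(set(re.findall(r"[A-Za-z]+", text.lower())))
    close = difflib.get_close_matches(word, vocabulary, n=5, cutoff=cutoff)
    return [w for w in close if w != word]

# Hypothetical OCR output in which "modern" was misread as "modem".
extracted = "The modem art gallery opens on Monday."

print(exact_hits("modern", extracted))   # no exact match is found
print(near_misses("modern", extracted))  # "modem" shows up as a near miss
```

If you suspect OCR errors in an important document, searching for a likely misreading (here, "modem") is a simple manual check.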

 

You can create a textbase that includes PDF files and use it for your searches. It is likely that all of your PDF files will be searchable. However, if your textbase includes files that cannot be converted to text, your searches will not retrieve text from those files. For example, a search for "medieval" might find text in several PDF documents but miss an instance of "medieval" in one document that could not be properly converted.
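The effect can be pictured with a small Python sketch. This is only a model of the idea, not the Orbis+ implementation: the file names, the dictionary layout, and the search_textbase function are all hypothetical. A file whose text could not be extracted contributes nothing to the results, whether or not it contains the word.

```python
def search_textbase(term, extracted):
    """Return (files containing term, files that could not be searched).

    `extracted` maps a file name to its extracted text, or to None when
    no text could be extracted (a hypothetical stand-in for a textbase).
    """
    term = term.lower()
    hits, unsearchable = [], []
    for name, text in extracted.items():
        if not text:
            # Nothing was extracted, so the search cannot see this file.
            unsearchable.append(name)
        elif term in text.lower():
            hits.append(name)
    return hits, unsearchable

textbase = {
    "castles.pdf": "A survey of medieval castles.",
    "guilds.pdf": "Medieval guilds and trade.",
    "scan_0042.pdf": None,  # conversion failed; may still contain the word
    "notes.pdf": "Modern architecture notes.",
}

hits, unsearchable = search_textbase("medieval", textbase)
print(hits)          # files where the word was found
print(unsearchable)  # files the search could not look inside
```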

 

If you would like to assess the searchability of the documents in your textbase, Orbis offers an indexer log:

 

1. Open the indexer log by clicking the green + when indexing, or by clicking File, Nexus Options, View Log File.
2. If a file is shown in red, there might (or might not) be problems retrieving text from it. It is unlikely that you will see any files shown in red, but if you do, click on the file, and then use the three view buttons (the magnifying glasses at the top right) to see the file.
   - The first view button opens the PDF file in the default viewer.
   - The second one shows the raw (binary, largely unreadable) data of the PDF file, as it would appear if opened in a text editor (for use by techies only).
   - The third one shows the actual text that has been extracted from the file. If this viewer shows that the text has been extracted, then the file is searchable.
3. The Textishness (PDFs) column gives some sense of how much actual searchable text there is in a file, as opposed to non-searchable material such as images and other graphics. However, if the PDF has many short pages, partially blank pages, or pages with many images in addition to the text, the estimate might be low even though all of the existing text is searchable. Use the view buttons described above to see what text has been extracted from the file.
4. The red X at the top right of the indexer log can be used to delete a file from the textbase.
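How the Textishness estimate is calculated is not described here. The Python sketch below uses an invented text-density score (the function name, the formula, and the 2000-character "full page" figure are all assumptions) only to show why a file with many short pages can score low even when all of its text has been extracted.

```python
def text_density(page_texts, full_page_chars=2000):
    """Hypothetical score: average fraction of a 'full' page that is text.

    This is NOT the Orbis+ Textishness formula; it is a stand-in to show
    how a per-page estimate behaves on short or mostly blank pages.
    """
    if not page_texts:
        return 0.0
    fractions = [
        min(len(text.strip()) / full_page_chars, 1.0) for text in page_texts
    ]
    return sum(fractions) / len(fractions)

# Three pages densely filled with text.
dense_report = ["x" * 2000] * 3

# Three pages that each hold one caption; every caption is fully extracted.
photo_album = ["Short caption."] * 3

print(text_density(dense_report))  # high score
print(text_density(photo_album))   # low score, yet all the text is searchable
```

A low score on the second file says nothing about whether its text is searchable, which is why the third view button is the more reliable check.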

 

For detailed technical information, see PDF Files: Technical Notes.

See also:

Orbis+

Orbis Indexer