Searchable Image PDFs
HHS is now using a PDF feature that makes the text inside an image PDF (i.e. a PDF that contains primarily images rather than actual text) searchable and selectable, just as though it were ordinary text.

This is made possible by using Optical Character Recognition (OCR) software to examine the images for any text they might contain, and then create a text layer below the image that matches the image. When you then use a PDF reader to do a search, it knows enough to search this text layer and highlight that area on the screen, approximately in the position occupied by the text in the image.
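As a rough illustration of the idea, the text layer can be pictured as a list of recognized words, each with the rectangle it occupies on the page image. The sketch below is not how any particular PDF reader or OCR product works internally; the `Word` class, the `find_word` function, and the coordinates are all hypothetical, made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR-recognized word and where it sits on the page image."""
    text: str
    x: float       # left edge, in made-up page units
    y: float       # top edge
    width: float
    height: float


def find_word(text_layer, query):
    """Return the rectangle of every word containing the query, ignoring case."""
    needle = query.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in text_layer
        if needle in w.text.lower()
    ]


# A made-up text layer for one page; a real one comes from the OCR software.
page = [
    Word("Summer", 72.0, 100.0, 60.0, 14.0),
    Word("Vacation", 136.0, 100.0, 70.0, 14.0),
    Word("Rentals", 210.0, 100.0, 58.0, 14.0),
]

hits = find_word(page, "vacation")
print(hits)  # the rectangle a reader would highlight over the image
```

The search only ever looks at the recognized words, never at the image, so anything the OCR step missed or misread cannot be found this way.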

For example, below is a screen snapshot of a search for the word "vacation" (not case sensitive) in an issue of the Beachcomber, showing that it found the word in an advertisement in what is actually an image of one of the pages.
[Screenshot: ocr find]

We think that this can be a useful tool, but there are some aspects of it that users should be aware of. First, we think that it is remarkable that the OCR software (ABBYY PDF Transformer+) can do this at all, considering the poor quality of a lot of the pages - and therefore the images - of the Beachcomber newspapers. But because of the poor quality, often the OCR translation is not perfect, and sometimes pretty poor.

For example, below is another screen snapshot from the same search for the word "vacation," showing another page. You can clearly see that the word "Vacation" is on the page, but it was not found by the search. What's up with that?
[Screenshot: ocr find]

This happens because of the poor quality of the pages (especially when the pages are not straight and flat). So it does call into question the reliability of doing a search, especially the danger of false negatives - i.e. searching for something and not finding it when it really is there.
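To show how a single misread letter is enough to cause a false negative, here is a small sketch using Python's standard `difflib` module. The sample text is invented, and the helper names `exact_hits` and `fuzzy_hits` and the 0.8 cutoff are assumptions chosen for illustration; this is not a feature of the PDF reader.

```python
import difflib
import re


def exact_hits(text, query):
    """Words that contain the query exactly, ignoring case."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if query.lower() in w.lower()]


def fuzzy_hits(text, query, cutoff=0.8):
    """Words whose similarity to the query is at least `cutoff` (0 to 1)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    q = query.lower()
    return [
        w for w in words
        if difflib.SequenceMatcher(None, q, w.lower()).ratio() >= cutoff
    ]


# Made-up OCR output in which an "i" was misread as "l" and an "o" as "0".
ocr_text = "Plan your summer vacatlon at the Seaside M0tel"

print(exact_hits(ocr_text, "vacation"))  # → []
print(fuzzy_hits(ocr_text, "vacation"))  # → ['vacatlon']
```

An approximate comparison like this can only help when the word was recognized with a few wrong letters. It does nothing for text that was never converted at all.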

Our only advice is to not rely on this tool too much. You can get a rough idea of the quality of the text conversion on a particular page by using the key combination CTRL-A in the PDF reader. This is the hotkey combo for "select all" and it will show you all the text on that page that was recognized as text.

For example, below is the same page after doing a CTRL-A, showing that the headline portion containing the word "vacation" was not converted to text, which is why it was not found in the search.
[Screenshot: ocr find]

It also shows how the text layer doesn't always line up perfectly with the image itself. If you copied and pasted that text into a text editor, you would see that the text of the article below the headline in question is pretty complete and really has all the letters, right up to the left margin.

But often the text that is recognized is pretty complete, as shown on another page, below.
[Screenshot: ocr find]
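The select-all check described above can also be approximated very roughly in a script, once the text of a page has been copied and pasted out of the PDF reader. The sketch below simply measures what share of the pasted tokens look like ordinary words; the function name `wordlike_fraction` and both sample strings are made up for illustration, and the measure is a crude assumption, not a real accuracy score.

```python
def wordlike_fraction(page_text):
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = page_text.split()
    if not tokens:
        return 0.0  # nothing was recognized on this page
    good = 0
    for token in tokens:
        core = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if core.isalpha():                 # letters only counts as word-like
            good += 1
    return good / len(tokens)


# Invented samples: one clean line and one with typical OCR damage.
clean = "The ferry leaves the dock at noon, weather permitting."
noisy = "Th3 f~rry |eaves t#e d0ck at n00n, we@ther perm1tting."

print(wordlike_fraction(clean))
print(wordlike_fraction(noisy))
```

A low score suggests the recognized text on that page is badly garbled. A high score does not prove the page is fully searchable, because text that was skipped entirely, like the headline in the earlier example, never appears in the pasted text at all. Numbers and hyphenated words also count against the score here, so treat it as a hint only.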

So, we think the tool is useful, but don't expect too much.