Scanned documents now join Flash and .pdf files on the list of non-text-based formats that Google’s robots can index. In an announcement today on Google’s Official Blog, the company described how it is using Optical Character Recognition (OCR) technology to convert images of text into actual text that can be searched and indexed. This significant addition to Google’s arsenal will make many documents on the internet that previously seemed inaccessible readily available and easily searchable by the masses.
Google’s previous methods of indexing scanned documents relied on page and file titles and other unreliable sources of metadata in an attempt to index these search-engine-unfriendly images. If your Google search returns results that include a scanned document, you’ll still be able to view it in its original form as a .pdf file, as well as in the OCR’d text version, which is available through a ‘View As HTML’ link.