Methodology for Searching Adobe Acrobat Portable Data Format Files Based on Content Relevance
Original Publication Date: 2000-Apr-01
Included in the Prior Art Database: 2003-Jun-19
A methodology is described that enables users to search Adobe* Acrobat* Reader Portable Document Format (PDF) files and be presented with search results that exploit document structure and content relevance. There are many methodologies for determining the importance of terms within a search index, such as counting the number of times a word appears in a document, but to date no methodology exists as described herein. This methodology is based on the premise that bookmarks are easier to identify within an Acrobat PDF file than other objects, so the content contained within the bookmarks can be indexed separately, enabling users to limit their search to bookmarks. In this way the methodology overcomes the inherent difficulty of searching Acrobat PDF files.
When users search an Acrobat PDF file for information, they are searching a document whose information is often structured. For example, documents typically contain a title, headings, table captions, figure captions, notes, and other data elements that represent structure. These structural elements convey importance as determined by the document's author. A heading, for example, is a signpost the author created to mark the beginning of a new topic or the continuation of a topic, and to help the user navigate through the information.
But when a full-text search index is built, the text of the heading is simply added to the index and the relevance of the heading is lost: the structure no longer exists, because the index is simply a list of all the words that appear in the PDF file. When a user searches for a word, they are presented with a list of results based on how often that word appears in the PDF file.
The author's intent is lost. If a word appears in a heading, that appearance should matter more to the user than an appearance in a paragraph, but full-text search indexes are usually based on how often a term appears and do not capture relevance. Words that appear in a heading should be valued (ranked) more highly than words that appear in a paragraph, but this is not implemented in most search technology used with Acrobat PDF files.
Search engines do not capture the relevance of the structural elements contained in an Acrobat PDF file because of the file's complex nature. An Acrobat PDF file can be described as a collection of objects that identify the elements of the file. Creating a search index is difficult because a search engine must sort through many lines of programming code to identify the elements that belong in the index, and the nature of these elements, such as the properties that mark a heading (for example, font and typeface), is buried within that code and is very difficult to extract into an index.
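The premise described above, indexing bookmark text separately from body text so that a search can be limited to bookmarks or can rank bookmark hits more highly, can be illustrated with a minimal Python sketch. This is not the disclosed implementation. It assumes that bookmark titles and body text have already been extracted from each PDF file, and the names build_index, search, SAMPLE_DOCUMENTS, and BOOKMARK_WEIGHT, along with the weight value of 5.0, are hypothetical choices made only for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed boost for bookmark hits; the source does not specify any value.
BOOKMARK_WEIGHT = 5.0
BODY_WEIGHT = 1.0

# Hypothetical sample data standing in for text already extracted from PDFs.
SAMPLE_DOCUMENTS = {
    "guide.pdf": {
        "bookmarks": ["Installing the printer", "Troubleshooting"],
        "body": "Installing the printer takes five minutes. Connect the cable.",
    },
    "notes.pdf": {
        "bookmarks": ["Release notes"],
        "body": "The printer driver changed. "
                "The printer queue and printer spooler were fixed.",
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build separate inverted indexes for bookmark text and body text."""
    index = {"bookmarks": defaultdict(Counter), "body": defaultdict(Counter)}
    for name, doc in documents.items():
        for title in doc["bookmarks"]:
            for term in tokenize(title):
                index["bookmarks"][term][name] += 1
        for term in tokenize(doc["body"]):
            index["body"][term][name] += 1
    return index


def search(index, term, bookmarks_only=False):
    """Return (document, score) pairs, highest score first.

    Bookmark occurrences are weighted more heavily than body occurrences.
    With bookmarks_only=True, body text is ignored entirely.
    """
    term = term.lower()
    scores = Counter()
    for name, count in index["bookmarks"].get(term, {}).items():
        scores[name] += BOOKMARK_WEIGHT * count
    if not bookmarks_only:
        for name, count in index["body"].get(term, {}).items():
            scores[name] += BODY_WEIGHT * count
    # Sort by descending score, then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    idx = build_index(SAMPLE_DOCUMENTS)
    print(search(idx, "printer"))
    print(search(idx, "printer", bookmarks_only=True))
```

In this sample, "printer" occurs three times in the body of notes.pdf but only once in the body of guide.pdf, so a purely frequency-based ranking would favor notes.pdf. Because the word also appears in a guide.pdf bookmark, the weighted search ranks guide.pdf first, and the bookmarks-only search returns guide.pdf alone.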