Methodology for Searching Adobe Acrobat Portable Data Format Files Based on Content Relevance

IP.com Disclosure Number: IPCOM000014438D
Original Publication Date: 2000-Apr-01
Included in the Prior Art Database: 2003-Jun-19
Document File: 2 page(s) / 34K

Publishing Venue

IBM

Abstract

A methodology is described that enables users to search Adobe* Acrobat* Portable Document Format (PDF) files and be presented with search results that exploit document structure and content relevance.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Many methodologies exist for determining the importance of terms within a
search index, such as counting the number of times a word appears in a
document, but to date none works as described herein. This methodology rests
on the premise that bookmarks are simpler to identify within an Acrobat PDF
file than other objects, so the content contained within each bookmark can be
indexed separately, enabling users to limit their search to bookmarks. In
this way it overcomes the inherent difficulty of searching Acrobat PDF files.
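
The idea of indexing bookmark content separately can be illustrated with a
minimal sketch. It is not the disclosure's implementation: it assumes the
bookmark titles and their associated text have already been extracted from
the PDF file into a hypothetical list of (bookmark title, body text) pairs,
and the names tokenize, build_indexes and search are invented for
illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_indexes(sections):
    """Build separate inverted indexes for bookmark titles and body text.

    `sections` is a hypothetical list of (bookmark_title, body_text) pairs,
    assumed to have been extracted from a PDF file beforehand.
    """
    bookmark_index = defaultdict(set)
    body_index = defaultdict(set)
    for section_id, (title, body) in enumerate(sections):
        for word in tokenize(title):
            bookmark_index[word].add(section_id)
        for word in tokenize(body):
            body_index[word].add(section_id)
    return bookmark_index, body_index


def search(term, bookmark_index, body_index, bookmarks_only=False):
    """Return sorted section ids matching `term`, optionally bookmarks only."""
    term = term.lower()
    hits = set(bookmark_index.get(term, set()))
    if not bookmarks_only:
        hits |= body_index.get(term, set())
    return sorted(hits)


# Illustrative data only; not taken from any real PDF file.
sections = [
    ("Installing the printer", "Unpack the device and connect the power cable."),
    ("Troubleshooting", "If the printer does not respond, check the cable."),
]
bookmark_index, body_index = build_indexes(sections)
```

With this data, a search for "printer" restricted to bookmarks matches only
the first section, while an unrestricted search also matches the second
section, where the word appears only in the body text.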

When users search an Acrobat Portable Document Format (PDF) file for
information, they are searching a document whose content is often structured.
For example, documents typically contain a title, headings, table captions,
figure captions, notes and other data elements that represent structure.
These structural elements carry importance assigned by the document's author.
A heading, for instance, is created to mark the beginning of a new topic or
the continuation of a topic, and to help the user navigate through the
information. It is an important signpost for the reader.

When a full-text search index is built, however, the text of the heading is
simply added to the index and the relevance of the heading is lost: the
structure no longer exists, because the index is simply a list of all words
that appear in the PDF file. When a user searches for a word, they are
presented with a list of results based on how often that word appears in the
PDF file. The author's intent is lost. An appearance of the word in a heading
should matter more to the user than an appearance in a paragraph, but
full-text search indexes are usually based on how often a term appears and do
not capture relevance. Words that appear in a heading should be valued
(ranked) more highly than words that appear in a paragraph, but most search
technology used with Acrobat PDF files does not implement this.
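
The difference between frequency-only ranking and ranking that values
heading occurrences more highly can be shown with a short sketch. The weight
values, the dictionary layout of each document, and the function names below
are assumptions made for illustration; the disclosure does not specify them.

```python
import re

# Assumed weights: the source says only that heading words should rank
# higher than paragraph words, not by how much.
HEADING_WEIGHT = 5.0
PARAGRAPH_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(term, heading, paragraphs):
    """Weighted count of `term`, valuing heading hits above paragraph hits."""
    term = term.lower()
    heading_hits = tokenize(heading).count(term)
    paragraph_hits = sum(tokenize(p).count(term) for p in paragraphs)
    return HEADING_WEIGHT * heading_hits + PARAGRAPH_WEIGHT * paragraph_hits


def rank(term, documents):
    """Return names of matching documents, highest weighted score first."""
    scored = [(score(term, d["heading"], d["paragraphs"]), d["name"])
              for d in documents]
    scored.sort(key=lambda pair: -pair[0])
    return [name for value, name in scored if value > 0]


# Illustrative data only; each entry is a hypothetical extracted section.
documents = [
    {"name": "B", "heading": "General notes",
     "paragraphs": ["A backup is useful. Keep a backup offsite. Test each backup."]},
    {"name": "A", "heading": "Backup procedures",
     "paragraphs": ["Run the job nightly."]},
]
ranking = rank("backup", documents)
```

Here section B contains "backup" three times in its paragraph and section A
contains it once, in the heading. A frequency-only ranking would place B
first; with the assumed weights, A scores 5.0 against 3.0 for B and is
ranked first.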

Search engines do not capture the relevance of the structural elements
contained in an Acrobat PDF file because of th...