Method to generate indexing information for structured documents using document models

IP.com Disclosure Number: IPCOM000015496D
Original Publication Date: 2001-Dec-01
Included in the Prior Art Database: 2003-Jun-20
Document File: 5 page(s) / 69K

Publishing Venue

IBM

Abstract

This disclosure presents a novel approach for generating application-specific information that enhances a text-search index with information about the structure of the indexed documents. In particular, so-called document models describe a mapping from application-defined document structures to searchable fields. The recognition of document structures can thus be separated from the text search subsystem itself.

This is the abbreviated version, containing approximately 26% of the total text.

Background

    In many applications that deal with large collections of electronic documents, a subsystem that provides generic text search is an important component. Generic text search allows a user to retrieve documents by specifying ad hoc search terms that the documents contain. To achieve this, the text search subsystem maintains a text search index containing all potential search terms for a given document collection, together with information about the individual occurrences of those terms in the documents. To allow for focused searches and good result rankings, however, term occurrences often need to be related to the structure of the document, because terms that appear in salient document parts, such as the title or abstract, should be weighted higher than terms that appear only in the body. Moreover, searches in structured documents may be restricted to certain document structures or fields.

    From the point of view of the text search subsystem, we use the simple concept of a text field to support structure-related queries. Before explaining it, we first review some basic notions commonly used to describe text-search systems.

    An index term is the basic unit of an index: the unit for which information is stored and can be retrieved quickly. In text-search systems this is most often the normalized form of a word, but it can also be a multi-word term or any other information that can be associated with positions in documents.

    An abstracted document is a sequence of index terms together with their position information, ordered by position. The position information encodes where an index term occurs within a document, for example as a word (or character) count.
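    As a minimal sketch of this definition, the following Python snippet turns a text into an abstracted document. It assumes that index terms are lowercased words and that positions are word counts starting at 0; the function name `abstract_document` and the normalization rule are illustrative choices, not part of the disclosure.

```python
import re
from typing import List, Tuple

# One occurrence of an index term: (index term, position).
# Assumption: the position is a word count starting at 0.
Occurrence = Tuple[str, int]


def abstract_document(text: str) -> List[Occurrence]:
    """Return the index terms of `text`, ordered by word position."""
    words = re.findall(r"\w+", text)
    # Lowercasing stands in for whatever normalization a real system applies.
    return [(word.lower(), pos) for pos, word in enumerate(words)]


print(abstract_document("Mayor Opens New Bridge"))
# [('mayor', 0), ('opens', 1), ('new', 2), ('bridge', 3)]
```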

    An abstract text index for a document collection is a mapping from a set of index terms to sets of pairs (document identifier, position), where each such pair represents an occurrence of that term in the indexed document collection.
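    The mapping described above can be sketched as a dictionary from index terms to sets of (document identifier, position) pairs. The sketch assumes lowercased words as index terms and word counts as positions; `build_index` and the sample documents are hypothetical and serve only as an illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# One occurrence of a term: (document identifier, position).
Posting = Tuple[str, int]


def build_index(collection: Dict[str, str]) -> Dict[str, Set[Posting]]:
    """Map every index term to the set of its occurrences in the collection."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for doc_id, text in collection.items():
        # Assumption: index terms are lowercased words, positions are word counts.
        for pos, word in enumerate(re.findall(r"\w+", text)):
            index[word.lower()].add((doc_id, pos))
    return dict(index)


docs = {"d1": "Mayor opens bridge", "d2": "Bridge closed by mayor"}
index = build_index(docs)
print(sorted(index["bridge"]))
# [('d1', 2), ('d2', 0)]
```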

    Now, a text field in a document is an identity associated with a (possibly non-contiguous) range of positions. As an example, consider a newspaper page as a document. We could define a text field that includes all headlines. A search within a certain text field can be expressed by restricting a query using the text field's identifier, e.g., SEARCH "Guiliani" WITHIN "Headlines-ID". Text fields may also be associated with a data type in order to support dat...
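    The newspaper example can be sketched on top of such an index. The snippet below is an illustration under stated assumptions, not the disclosed method: a text field is modeled as a field identifier mapped, per document, to a list of half-open position ranges, and the ranges are written by hand here. How document models would produce them is not covered by this sketch. The names `build_index` and `search_within` and the sample pages are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, int]   # (document identifier, position)
Span = Tuple[int, int]      # half-open position range [start, end)


def build_index(collection: Dict[str, str]) -> Dict[str, Set[Posting]]:
    """Map lowercased words to their (document, word position) occurrences."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for doc_id, text in collection.items():
        for pos, word in enumerate(re.findall(r"\w+", text)):
            index[word.lower()].add((doc_id, pos))
    return dict(index)


def search_within(index: Dict[str, Set[Posting]],
                  fields: Dict[str, Dict[str, List[Span]]],
                  term: str, field_id: str) -> Set[str]:
    """Return documents in which `term` occurs inside the given text field."""
    hits: Set[str] = set()
    for doc_id, pos in index.get(term.lower(), set()):
        spans = fields.get(field_id, {}).get(doc_id, [])
        if any(start <= pos < end for start, end in spans):
            hits.add(doc_id)
    return hits


pages = {
    "page1": "Mayor opens bridge The mayor cut the ribbon today Bridge toll rises",
    "page2": "Weather stays mild Residents asked the mayor about flooding",
}
# Hand-written field definition: on page1 the headlines cover two
# non-contiguous ranges (words 0-2 and 9-11); on page2 only words 0-2.
fields = {"Headlines-ID": {"page1": [(0, 3), (9, 12)], "page2": [(0, 3)]}}

index = build_index(pages)
print(search_within(index, fields, "mayor", "Headlines-ID"))
# {'page1'}
```

    Keeping the field ranges separate from the term index mirrors the separation described in the abstract: the index itself only stores term occurrences, while the structure information is supplied from outside.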