Method to generate indexing information for structured documents using document models
Original Publication Date: 2001-Dec-01
Included in the Prior Art Database: 2003-Jun-20
A novel approach is presented for generating application-specific information that enhances a text-search index with information about the structure of the indexed documents. In particular, so-called document models are used to describe a mapping from application-defined document structures to searchable fields. The recognition of document structures can thus be separated from the text search subsystem itself.

Background

In many applications that deal with large collections of electronic documents, a subsystem that provides generic text search is an important component. Generic text search allows a user to retrieve documents by specifying ad hoc search terms that the documents contain. To achieve this functionality, the text search subsystem maintains a text search index containing all putative search terms for a given document collection, together with information about the individual occurrences of those terms in the documents.

To allow for focused searches and good result rankings, however, it is often necessary to relate the occurrences of terms to the structure of the given document. This accounts for the fact that terms appearing in salient document parts, such as the title or abstract, need to be weighted higher than terms that appear only in the body. Moreover, searches in structured documents may be restricted to certain document structures or fields.

From the point of view of the text search subsystem, we use the simple concept of a text field to support structure-related queries. Before we explain this concept, let us first review some basic concepts commonly used to explain text-search systems.
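As a concrete picture of the background above, the following minimal Python sketch shows one way a text search index could record, for each term occurrence, the text field it falls in, and how a document model could map application-defined structures to those fields. It is an illustration under stated assumptions, not the method of this disclosure: the names `DOCUMENT_MODEL`, `build_index`, and `search`, the structure names, and the field weights are all hypothetical.

```python
from collections import defaultdict

# Hypothetical document model: maps application-defined structure names
# to a searchable text field and a ranking weight (all values are assumptions).
DOCUMENT_MODEL = {
    "headline": ("title", 3.0),
    "summary": ("abstract", 2.0),
    "content": ("body", 1.0),
}


def build_index(docs):
    """Return {term: [(doc_id, field, position), ...]} for all documents."""
    index = defaultdict(list)
    for doc_id, parts in docs.items():
        for structure, text in parts.items():
            if structure not in DOCUMENT_MODEL:
                continue  # structures without a mapping are not indexed here
            field, _ = DOCUMENT_MODEL[structure]
            for pos, term in enumerate(text.lower().split()):
                index[term].append((doc_id, field, pos))
    return index


def search(index, term, field=None):
    """Rank documents by summed field weights; optionally restrict to one field."""
    weights = {f: w for f, w in DOCUMENT_MODEL.values()}
    scores = defaultdict(float)
    for doc_id, occ_field, _ in index.get(term.lower(), []):
        if field is None or occ_field == field:
            scores[doc_id] += weights[occ_field]
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "d1": {"headline": "Text search index", "content": "An index stores terms"},
    "d2": {"headline": "Document models", "content": "Models map structure to an index"},
}
index = build_index(docs)
print(search(index, "index"))                 # ranked across all fields
print(search(index, "index", field="title"))  # restricted to the title field
```

In this sketch the index itself knows only about fields such as `title` and `body`. Knowledge of the application's own structure names stays in the document model, which mirrors the separation between structure recognition and the text search subsystem described above.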