Browse Prior Art Database

Storage Structures and Agorithms for High-Volume Text Indexing

IP.com Disclosure Number: IPCOM000110772D
Original Publication Date: 1994-Jan-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 101K

Publishing Venue

IBM

Related People

Shoens, K: AUTHOR

Abstract

Database systems that support full-text queries use indices to achieve acceptable performance. Practical text retrieval systems invariably use inverted files, which consist of a list of words, with each word indicating the list of positions where the word occurs. While several variations on the inverted file idea have been used, a typical implementation might use a B-tree to store the list of words and a second file to represent the word positions, called the postings file. In some systems, the postings simply consist of the number of times that the word occurs in each document. Other systems record the position of each word occurrence in each document so that queries searching for words within a limited distance of other words can be answered.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Storage Structures and Agorithms for High-Volume Text Indexing

      Database systems that support full-text queries use indices to
achieve acceptable performance.  Practical text retrieval systems
invariably use inverted files, which consist of a list of words, with
each word indicating the list of positions where the word occurs.
While several variations on the inverted file idea have been used, a
typical implementation might use a B-tree to store the list of words
and a second file to represent the word positions, called the
postings file.  In some systems, the postings simply consist of the
number of times that the word occurs in each document.  Other systems
record the position of each word occurrence in each document so that
queries searching for words within a limited distance of other words
can be answered.

      This invention exploits the pattern of information update that
we see.  Typically, a large corpus of information is maintained, with
some fraction (typically 20% or less) of the corpus modified,
deleted, or inserted each day.  The idea is to use a two-part
postings file to take advantage of the incremental update.  The first
part of the postings file is a linear file containing the postings of
sufficiently old information.  The linear file contains word
occurrences for each word packed as tightly as possible.  It is space
efficient and fast to read, but not directly updatable.  The second
part of the postings file is a hash-partitioned fixed-block size
scheme containing three-part tuples:  [wordid, document, position].

      The wordid is an integer value assigned to each unique word.
Wordids are determined by looking up the actual English word in a
B-tree index.  As new words are discovered, they are assigned new
unique wordids and added to the B-tree.  Wordids are used as array
indices in the word table.  Each entry in the word table indicates
where information for the word is stored.  The example below shows
what these entries look like:

      The first row shows a word that has entries in the linear file,
but as yet has no entries added to the hash file.  The second entry
shows a word that was not present in the linear file, but has started
accumulating entries in the hash file.  The third entry shows a more
common word that has entries in both the linear file and the hash
file.  The document and occurrence counts are useful for approximate
matching queries and can be ignored for the purposes of this
invention.

      The number of partitions for the hash file is chosen to be
large enough to sp...