Maintaining Excluded Word-Count in a Cross Referencing Index Tool

IP.com Disclosure Number: IPCOM000111230D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 75K

Publishing Venue

IBM

Related People

Gadd, RJ: AUTHOR

Abstract

Disclosed is a method for text index-generation, whereby the cross referencing tool excludes common words of low search interest that would take up excessive storage but keeps a count of them instead. This count is economical in storage space but indicates their presence and volume in the text.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      In a text cross-reference tool, the index generation step works
by scanning all the input data and creating index records for every
word that is not found in an exclusion table.  This technique is
intended to reduce the size of the index data by omitting words that
are very common or otherwise considered not worth indexing.  But
words that are not indexed cannot be found in the index.  This
might seem obvious, but a user of a cross-reference tool expects it
to find words, so if a word is not found the user takes this to mean
that no instances of that word occur in the input data.  This is a
serious shortcoming, since the user might make a decision based on
the apparent absence of a word.  Indexing every word is not an
option: very many instances of words are excluded, and without
exclusion the index size would be unmanageable.  Likewise, keeping
information relating to all words encountered, even if subsequently
discarded, could mean a large increase in processing time and disk
space.
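
      The shortcoming can be illustrated with a minimal Python sketch.
The names EXCLUDED and build_index, the tokenizer, and the sample text
are hypothetical and chosen only for illustration; the disclosure does
not specify a data structure or an implementation language.

```python
import re
from collections import defaultdict

# Hypothetical exclusion table; a real tool would load a much larger list.
EXCLUDED = {"the", "a", "of", "and"}

def build_index(lines):
    """Map each non-excluded word to the line numbers where it occurs."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in EXCLUDED:
                index[word].append(lineno)
    return dict(index)

lines = ["The cat and the dog", "A dog of the house"]
index = build_index(lines)

# An excluded word and a genuinely absent word look identical to the user.
print(index.get("dog", []))    # [1, 2]
print(index.get("the", []))    # []
print(index.get("zebra", []))  # []
```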

      The solution is to maintain, for each word excluded because it
appears in an exclusion table, a count of the number of times that
word is NOT indexed.  When the user searches for a given word, the
tool searches both the index data AND the list of excluded words.  It
then either shows the user the summary information about the
references to that word, if any are found, or shows the number of
times that the word was discarded in the index preparation step.
This gives the user the details of every instance of any word that
is indexed, and also makes the user aware that many words are
present in the input data but were discarded.
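
      A minimal Python sketch of this scheme follows.  It is an
illustration under stated assumptions, not the tool's actual
implementation: build_index_with_counts, lookup, the exclusion table,
the tokenizer, and the sample text are all hypothetical, and in-memory
dictionaries stand in for whatever index format the tool really uses.

```python
import re
from collections import Counter, defaultdict

EXCLUDED_WORDS = {"the", "a", "of", "and"}  # hypothetical exclusion table

def build_index_with_counts(lines, excluded=EXCLUDED_WORDS):
    """Return (index, excluded_counts).

    index maps each indexed word to the line numbers where it occurs;
    excluded_counts maps each excluded word to how often it was skipped.
    """
    index = defaultdict(list)
    excluded_counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word in excluded:
                excluded_counts[word] += 1  # one integer per excluded word
            else:
                index[word].append(lineno)
    return dict(index), excluded_counts

def lookup(word, index, excluded_counts):
    """Search the index first, then the excluded-word counts."""
    word = word.lower()
    if word in index:
        return f"'{word}' indexed at lines {index[word]}"
    if word in excluded_counts:
        n = excluded_counts[word]
        return f"'{word}' excluded from index, {n} occurrence(s) discarded"
    return f"'{word}' not present in input"

sample = ["The cat and the dog", "A dog of the house"]
word_index, skipped = build_index_with_counts(sample)
print(lookup("dog", word_index, skipped))
print(lookup("the", word_index, skipped))
print(lookup("zebra", word_index, skipped))
```

Because only one counter is kept per excluded word, the extra storage
in this sketch grows with the size of the exclusion table rather than
with the number of occurrences in the input data.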

The following sections describe the...