Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Algorithm to Identify Common Phrases in Source Text

IP.com Disclosure Number: IPCOM000120549D
Original Publication Date: 1991-May-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 3 page(s) / 151K

Publishing Venue

IBM

Related People

Timms Jr, GD: AUTHOR

Abstract

The size of a text string can be reduced by storing commonly used words and phrases in a dictionary, and replacing each word or phrase in the text by an entry number for that word or phrase in the dictionary. National-language text can easily be parsed to identify its component words using finite state machines and standard compiler technology. It is less obvious (without prior knowledge of the text) how to identify and select a set of phrases that should be stored in the dictionary to most increase text compression. This invention describes an heuristic technique, based on the string-recognition part of the Liv-Zempel algorithm, that can identify and select a suitable set of phrases to improve data compression with reasonable resource investment.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 41% of the total text.

Algorithm to Identify Common Phrases in Source Text

      The size of a text string can be reduced by storing
commonly used words and phrases in a dictionary, and replacing each
word or phrase in the text by an entry number for that word or phrase
in the dictionary.  National-language text can easily be parsed to
identify its component words using finite state machines and standard
compiler technology.  It is less obvious (without prior knowledge of
the text) how to identify and select a set of phrases that should be
stored in the dictionary to most increase text compression.  This
invention describes an heuristic technique, based on the
string-recognition part of the Liv-Zempel algorithm, that can
identify and select a suitable set of phrases to improve data
compression with reasonable resource investment.

      The compression process begins by scanning all text to identify
words.  Each new word found is assigned a unique identifier and added
to the word/phrase table (a classical hash table structure, similar
to the symbol table commonly used for compilers).  A use count in the
word/phrase table entry is incremented for each word that already
exists in the table.  In addition to updating the word/phrase table,
an intermediate form of the text is created by concatenating the
identifiers for the words that make up the text.

      The algorithm chosen to identify common phrases is a modified
version of the Liv-Zempel algorithm that scans the intermediate form
of the text and treats each word identifier as a separate
"character".  The process begins by scanning the intermediate text to
build a tree structure where the nodes represent words and links tie
together phrases in the original text.

      The Liv-Zempel algorithm is modified to limit the tree to a
fixed size (64K entries, in this implementation) and terminate the
scan when either all intermediate text is scanned or the tree area is
full (whichever comes first). The tree size limit is an arbitrary
number chosen to simplify implementation and restrict the amount of
time/effort used to perform phrase analysis.

      Liv-Zempel is also modified to avoid filling the tree area with
low-use phrases by storing only the most frequently used words (up to
255 of them in this implementation) as nodes in the tree; any other
word is considered a delimiter that ends the previous phrase (and is
not stored in the tree).  This is a more significant algorithm
change, because it automatically ignores many phrases that are not
frequently used.

      After the tree is built, the intermediate text is scanned again
and matched against the tree to increment a use count in each node
matched against a word in the intermediate text.  This extension to
standard Liv-Zempel processing annotates the tree with a use count
for each phrase and subphrase (subset of a longer phrase) found in
the text.  The use counts could be incremented as the tree was built
but are more accurate b...