# Fast Algorithm for Evaluating Word Sequence Statistics in Large Text Corpora by Small Computers

IP.com Disclosure Number: IPCOM000100226D
Original Publication Date: 1990-Mar-01
Included in the Prior Art Database: 2005-Mar-15
Document File: 3 page(s) / 89K

IBM

## Related People

Bandara, U: AUTHOR [+3]

## Abstract

The described algorithm allows evaluating word sequence statistics in large text corpora without searching combinations.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 63% of the total text.

The algorithm serves to identify sequential textual word
combinations of orders up to n.  Example: n=3 identifies monogram,
bigram and trigram sequences.
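
To make "orders up to n" concrete, the following minimal sketch lists every run of 1 to n consecutive words in a short record. It only illustrates the definition and is not the disclosed algorithm; the function name `word_sequences` and the sample sentence are assumptions made for this example.

```python
from typing import List, Tuple


def word_sequences(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of 1..n consecutive words, lowest order first.

    Hypothetical helper for illustration; not part of the disclosure.
    """
    sequences: List[Tuple[str, ...]] = []
    for order in range(1, n + 1):
        # A record of len(words) words holds len(words) - order + 1 runs.
        for start in range(len(words) - order + 1):
            sequences.append(tuple(words[start:start + order]))
    return sequences


# With n=3 a three-word record yields monograms, bigrams and one trigram.
for seq in word_sequences("the cat sat".split(), 3):
    print(len(seq), " ".join(seq))
```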

The steps of the algorithm are:
1.   A unique positive number, hereinafter called "key", is
assigned to each word in the system's dictionary.
2.   The words are ordered in a search table along with the
corresponding key.
3.   The text is resolved to smaller elements (such as
sentences, paragraphs, etc.), hereinafter called "records".
4.   Each word in a record is looked up in the table. If the
word exists in the table, its key is used to transcribe the series
of words in the record to a series of numbers. If a particular word
is not in the system vocabulary, it is assigned the key -1.
5.   The series of keys is then transcribed to a series of
bits, with the conditions being as follows: 1 if key > 0, 0 if key
< 0.  In this step, correspondence to the initial key is retained.
6.   A bit frame is constructed with a width of n+1 bits, where
n is the required order of the combination of words.
7.   The bit frame is moved along the series of bits, and the
relevant decima...
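
The steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: all function names are hypothetical, keys are numbered from 1 in sorted order, records are naive sentences split on ".", and words are lowercased. The frame width n+1 follows step 6. Because the description of step 7 is cut off in this abbreviated text, `slide_frame` only yields the bits and keys under each frame position and does not attempt to reproduce what the algorithm does with them.

```python
from typing import Dict, Iterator, List, Tuple

UNKNOWN_KEY = -1  # step 4: key for words outside the system vocabulary


def build_search_table(vocabulary: List[str]) -> Dict[str, int]:
    """Steps 1-2: give each dictionary word a unique positive key.

    Numbering from 1 in sorted order is an assumption of this sketch.
    """
    return {w: k for k, w in enumerate(sorted(set(vocabulary)), start=1)}


def split_into_records(text: str) -> List[List[str]]:
    """Step 3: resolve the text into records (here: sentences split on '.')."""
    records = []
    for sentence in text.split("."):
        words = sentence.lower().split()
        if words:
            records.append(words)
    return records


def transcribe(record: List[str], table: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Steps 4-5: words -> keys -> bits, kept index-aligned with the keys."""
    keys = [table.get(word, UNKNOWN_KEY) for word in record]
    bits = [1 if key > 0 else 0 for key in keys]
    return keys, bits


def slide_frame(keys: List[int], bits: List[int], n: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Steps 6-7 (partial): move a frame n+1 bits wide along the bit series."""
    width = n + 1
    for start in range(len(bits) - width + 1):
        yield bits[start:start + width], keys[start:start + width]


table = build_search_table(["the", "cat", "sat", "on", "mat"])
record = split_into_records("The cat sat on the rug. The mat sat.")[0]
keys, bits = transcribe(record, table)
print(keys)  # [5, 1, 4, 3, 5, -1]
print(bits)  # [1, 1, 1, 1, 1, 0]
for frame_bits, frame_keys in slide_frame(keys, bits, n=2):
    print(frame_bits, frame_keys)
```

Keeping the bit series index-aligned with the key series is one simple way to retain the correspondence to the initial keys that step 5 requires.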