Handling Names and Numerical Expressions in an n-gram Language Model

IP.com Disclosure Number: IPCOM000113873D
Original Publication Date: 1994-Oct-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 97K

Publishing Venue

IBM

Related People

Brown, PF: AUTHOR [+6]

Abstract

In a number of pattern recognition problems it is important to know the probability that a sequence of words is well-formed English. Examples of such problems include speech recognition, spelling correction, and machine translation. A common method of estimating the probability that a sequence of words is well-formed English is with an n-gram model as described in [*]. In an n-gram model the probability of the ith word given the previous i-1 words is estimated to be equal to the probability of the ith word given the previous n-1 words. The invention described herein includes a method of improving n-gram model estimates of the probability that a word will follow (not necessarily immediately) a name or numerical expression.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Handling Names and Numerical Expressions in an n-gram Language Model

      In a number of pattern recognition problems it is important to
know the probability that a sequence of words is well-formed English.
Examples of such problems include speech recognition, spelling
correction, and machine translation.  A common method of estimating the
probability that a sequence of words is well-formed English is with
an n-gram model as described in [*].  In an n-gram model the
probability of the ith word given the previous i-1 words is estimated
to be equal to the probability of the ith word given the previous n-1
words.  The invention described herein includes a method of improving
n-gram model estimates of the probability that a word will follow
(not necessarily immediately) a name or numerical expression.
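
      As an illustration of the kind of estimate involved, a trigram
probability can be computed from counts.  The corpus handling, helper
names, and maximum-likelihood form below are assumptions made for this
sketch; they are not taken from the disclosure.

    from collections import defaultdict

    def train_trigram_counts(sentences):
        # Collect trigram counts and bigram (context) counts from
        # tokenized sentences.
        trigrams = defaultdict(int)
        bigrams = defaultdict(int)
        for words in sentences:
            padded = ["<s>", "<s>"] + words + ["</s>"]
            for i in range(2, len(padded)):
                w1, w2, w3 = padded[i - 2], padded[i - 1], padded[i]
                trigrams[(w1, w2, w3)] += 1
                bigrams[(w1, w2)] += 1
        return trigrams, bigrams

    def trigram_prob(w3, w1, w2, trigrams, bigrams):
        # Maximum-likelihood estimate of P(w3 | w1, w2); zero if the
        # two-word context was never seen.
        if bigrams[(w1, w2)] == 0:
            return 0.0
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]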

      In order to avoid having extremely large vocabularies, many
natural language processing systems, such as speech recognizers and
machine translation systems, include single digits in their
vocabularies, but do not include longer digit sequences as single
vocabulary items.  For example, the digit sequence 1234 is typically
represented as a sequence of four distinct words 1, 2, 3, and 4.
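
      To make this encoding concrete, the hypothetical helper below
(not taken from the disclosure) splits every all-digit token into
single-digit words, so that only the ten digits 0-9 need to appear in
the vocabulary rather than a separate entry for every possible number.

    def split_digits(tokens):
        # Replace each all-digit token with its individual digits.
        out = []
        for tok in tokens:
            if tok.isdigit():
                out.extend(list(tok))  # "123" -> ["1", "2", "3"]
            else:
                out.append(tok)
        return out

    # split_digits(["He", "weighs", "123", "pounds"])
    #   -> ["He", "weighs", "1", "2", "3", "pounds"]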

The problem with this method of encoding digits is that when
predicting a word after a multiple-digit number the preceding context
can be lost.  For example, a trigram model only looks back two words.
This means that the trigram model estimate of the probability that
the word "pounds" will follow the sequence "He weighs 123" is equal
to the trigram model estimate that the word "pounds" will follow the
sequence "A New York hotel costs 123", since in both cases the
previous two words are the digits 2 and 3.  The fact that the word
preceding 123 is "weighs" in the first sequence and "costs" in the
second sequence clearly should affect the probability that the word
after 123 will be "pounds".  Unfortunately, as long as a multi-digit
sequence is encoded as a sequence of single digits, this fact cannot
be taken into account in a trigram language model.
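
      Under the single-digit encoding, the loss of context is easy to
see: a trigram model conditions only on the last two words, and those
two words are identical in both sentences.  The check below is an
illustrative sketch under that assumed encoding, not code from the
disclosure.

    weighs = ["He", "weighs", "1", "2", "3"]
    costs = ["A", "New", "York", "hotel", "costs", "1", "2", "3"]

    # A trigram model predicts the next word from the last two words only.
    print(weighs[-2:])  # ['2', '3']
    print(costs[-2:])   # ['2', '3']
    # Both contexts are the digits 2 and 3, so the model assigns the same
    # probability to "pounds" after either sentence, even though "weighs"
    # strongly suggests "pounds" and "costs" does not.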

      Another problem with an n-gram language model is that it takes
a very large corpus of text to get accurate estimates of n-gram
probabilities.  However, one should expect that the probability that a
particular word will follow 123 is close to the probability that
that word will follow 234.  Yet since there are a tremendous
number (infinitely many, in fact) of sequences of numbers, it is
difficult to reliably estimate the probability distribution for wo...