Algorithm for Measuring the Probability of Infrequent Events

IP.com Disclosure Number: IPCOM000119618D
Original Publication Date: 1991-Feb-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 7 page(s) / 252K

Publishing Venue

IBM

Related People

deSouza, P: AUTHOR [+3]

Abstract

A technique is described whereby an algorithm estimates the probability of infrequent events, such as unobserved or rarely seen m-grams as used in speech recognition applications. Although the technique described pertains to speech recognition, the concept can be applied to any field in which sparse data is a problem.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 29% of the total text.

Algorithm for Measuring the Probability of Infrequent Events

      Typically, speech recognition systems incorporate some kind of
language model, where an important concern is the estimation of
so-called m-gram probabilities (1-3).  Estimating m-gram
probabilities is difficult because the number of m-grams grows
exponentially with m.  (The term m-gram refers to a sequence of
length m; the sequences may be word sequences, class sequences, or
other sequences.)  For example, even with a moderate vocabulary of
five thousand words, there are a staggering 125 billion distinct
word 3-grams.  Word 3-gram probabilities (1) can be approximated by
a weighted linear combination of 1-gram, 2-gram, and 3-gram relative
frequencies, where the weights vary according to the reliability of
the individual components.
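
      As an illustration, such an interpolation might be sketched as
follows.  The function and the fixed weights below are hypothetical
(the disclosure does not specify them; in practice the weights would
be tuned, e.g., on held-out data, to reflect the reliability of each
component), and the count tables are assumed to be Counter objects
built from training text.

    from collections import Counter

    def interpolated_trigram_prob(u, v, w, unigrams, bigrams, trigrams,
                                  total_words, w1=0.2, w2=0.3, w3=0.5):
        # Relative frequencies of the 1-gram, 2-gram, and 3-gram.
        f1 = unigrams[w] / total_words if total_words else 0.0
        f2 = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
        f3 = (trigrams[(u, v, w)] / bigrams[(u, v)]
              if bigrams[(u, v)] else 0.0)
        # Weighted linear combination; the weights sum to 1.
        return w1 * f1 + w2 * f2 + w3 * f3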

      In (3), m-gram probabilities are estimated without the need
for approximate linear combinations of lower-order gram
probabilities.  Here, the maximum likelihood probability estimates
are assumed to be unbiased for frequent m-grams (those seen more
than k times, where k is typically about five).  For m-grams seen
one to k times, the maximum likelihood probability estimates are
assumed to be biased upwards, and they are reduced using a
discounting formula attributed to Turing.  The total probability
(mass) so discounted is then distributed among the m-grams which did
not occur at all.  The maximum likelihood estimate for these unseen
m-grams would be 0.0, which is biased downwards.
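
      A minimal sketch of this style of discounting in Python,
assuming the usual Turing discount, which replaces an observed count
r by r* = (r+1)Nr+1/Nr, where Nr denotes the number of m-grams seen
exactly r times (the function below and the choice k = 5 are
illustrative, not the disclosure's own code):

    from collections import Counter

    def turing_discounted_probs(counts, k=5):
        # counts: mapping from each observed m-gram to its frequency r.
        N = sum(counts.values())          # sample size
        Nr = Counter(counts.values())     # Nr[r] = #m-grams seen r times
        probs, freed_mass = {}, 0.0
        for mgram, r in counts.items():
            if r > k:
                probs[mgram] = r / N      # frequent: trusted as unbiased
            else:
                r_star = (r + 1) * Nr[r + 1] / Nr[r]   # Turing discount
                probs[mgram] = r_star / N
                freed_mass += (r - r_star) / N
        # freed_mass is the discounted probability to be distributed
        # among the m-grams that did not occur at all.
        return probs, freed_mass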

      Although the Turing method is appealing, it hinges on the
discounting formula and is valid only insofar as the computed
discounts are valid.  The estimation method has many appropriate
applications in speech recognition.  However, Turing's formula is
inappropriate if rNr is not a monotonically decreasing function of
r, where Nr denotes the number of m-grams which occurred precisely r
times in some finite sample of data.  In this case, the use of (3)
is unwise.
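
      The condition is straightforward to test before committing to
the Turing discounts; a small sketch, assuming Nr is supplied as a
count-of-counts table (a dict or Counter):

    def turing_formula_applicable(Nr, k=5):
        # rNr must decrease monotonically over the range of counts
        # the discounting formula touches (r = 1, ..., k+1).
        values = [r * Nr.get(r, 0) for r in range(1, k + 2)]
        return all(a > b for a, b in zip(values, values[1:]))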

      The algorithm described herein provides an alternative
discounting formula, which is appropriate even when Turing's is not,
but which is used similarly and therefore preserves the intuitive
appeal and general approach used in (3).

      Consider a sample of N m-grams and let Nr denote the number of
distinct m-grams occurring precisely r times in that sample.  The
maximum likelihood estimate of the probability of an m-gram seen r
times is its relative frequency:

                             Pr = r/N

      In particular, Po = 0, which is seriously biased.  Assuming Pr
is satisfactory for r > k, a new estimate is formed for Pr, r <= k.
It is assumed that the m-grams have independent bino...
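
      For reference, the sample size N, the count-of-counts table Nr,
and the maximum likelihood estimates Pr can all be computed directly
from the raw m-gram counts; a brief, illustrative sketch:

    from collections import Counter

    def count_of_counts(counts):
        # counts: mapping from each observed m-gram to its frequency r.
        N = sum(counts.values())        # total number of m-grams sampled
        Nr = Counter(counts.values())   # Nr[r] = #m-grams seen r times
        return N, Nr

    def ml_estimate(r, N):
        # Relative-frequency estimate Pr = r/N; unseen m-grams (r = 0)
        # therefore get Po = 0.
        return r / N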