Variable N-Gram Method For Statistical Language Processing

IP.com Disclosure Number: IPCOM000048891D
Original Publication Date: 1982-Apr-01
Included in the Prior Art Database: 2005-Feb-09
Document File: 1 page(s) / 12K

Publishing Venue

IBM

Related People

Damerau, FJ: AUTHOR

Abstract

The processing of natural language symbol strings by a statistical method using fixed length n-grams and associated frequencies requires an expensive computation to determine the string of maximum likelihood when ambiguities are possible. This method eliminates the computation.

This is the abbreviated version, containing approximately 52% of the total text.

There are a number of applications, speech recognition among them, in which a statistical model of a language is useful in accomplishing some data processing task. The normal way to construct such a model is to fix on some level of approximation, expressed as the number of symbols in the strings for which statistics are gathered. A model based on the statistics of three-grams is called a three-gram model, and in general a model based on strings of length n is an n-gram model. Each symbol string is stored together with its relative frequency. In use, one is given an arbitrary string of symbols, some of which are ambiguous (i.e., it is not certain which member of some restricted set has actually occurred), and wants to find the sequence of n-grams that provides the most likely covering of the input symbol string, thereby resolving the ambiguities present in the original. For example, a speech recognizer might produce a string of symbols which at some point contains either an n or an m, the recognition logic being unable to decide between them. The statistical language model should then be able to establish that one possibility is more likely than the other....
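
The fixed-length approach described above can be illustrated with a minimal sketch. It shows only the conventional three-gram baseline, not the variable n-gram method of this disclosure, whose details are not reproduced in this abbreviated text. The function names (trigram_counts, score, resolve), the toy word list, the space-padding of word boundaries, the add-one smoothing, and the scoring of a candidate as the sum of log relative frequencies of its overlapping three-grams are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log

def trigram_counts(words):
    """Count character three-grams; spaces mark word boundaries (an assumption)."""
    counts = Counter()
    for word in words:
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts

def score(candidate, counts, total, alpha=1.0, vocab=27 ** 3):
    """Sum of log smoothed relative frequencies of the three-grams covering candidate.

    alpha and vocab (26 letters plus space, cubed) are illustrative smoothing
    constants so that unseen three-grams do not produce log(0).
    """
    padded = f"  {candidate} "
    return sum(
        log((counts[padded[i:i + 3]] + alpha) / (total + alpha * vocab))
        for i in range(len(padded) - 2)
    )

def resolve(slots, counts):
    """Pick the most likely reading of an ambiguous string.

    slots holds one string per position, listing the symbols that may have
    occurred there. Every combination is enumerated and scored.
    """
    total = sum(counts.values())
    candidates = ("".join(p) for p in product(*slots))
    return max(candidates, key=lambda c: score(c, counts, total))

# Toy training data (hypothetical); the third symbol is either n or m.
counts = trigram_counts(["hand", "band", "sand", "land", "wind", "kind", "ham", "and", "end"])
print(resolve(["h", "a", "nm", "d"], counts))  # hand
```

Because resolve scores every combination, the number of candidates is the product of the alternatives at each ambiguous position. This brute-force search is one plausible reading of the expensive computation that the disclosure says fixed-length n-gram models require.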