Clustered Language Models

IP.com Disclosure Number: IPCOM000121508D
Original Publication Date: 1991-Sep-01
Included in the Prior Art Database: 2005-Apr-03
Document File: 2 page(s) / 75K

Publishing Venue

IBM

Related People

Mercer, RL: AUTHOR [+3]

Abstract

A general clustering algorithm for clustering linguistic strings into equivalence classes is given along with several linguistic metrics; each metric defines a specific clustering algorithm.

Clustered Language Models

      One approach to language modeling is the equivalence-class
method, wherein the probability distribution of a word given a
preceding string of words is approximated by the probability of the
word given the equivalence class that contains the preceding string.
The major task in this approach is the determination of suitable
equivalence classes.  We describe here a family of clustering
algorithms for determining such classes.  The algorithms are
string-space variants of the familiar k-means algorithm, which
clusters points of n-dimensional Euclidean space.  Three ingredients
are required: (1) a dissimilarity measure, (2) a definition of
centroid, and (3) an initializing partition.
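
      As an illustration, the following is a minimal sketch of the
string-space k-means skeleton just described (Python; the names are
hypothetical, since the disclosure specifies the scheme rather than
an implementation).  The three ingredients enter as parameters.

def cluster_strings(strings, init_partition, dissimilarity, centroid,
                    max_iters=100):
    # strings        -- the strings to be clustered
    # init_partition -- initial partition of the strings into clusters
    # dissimilarity  -- nonnegative function on pairs of strings
    # centroid       -- maps a cluster to a representative string
    clusters = [list(c) for c in init_partition]
    for _ in range(max_iters):
        centers = [centroid(c) for c in clusters]
        # Reassign each string to the cluster whose centroid is nearest.
        new_clusters = [[] for _ in centers]
        for s in strings:
            j = min(range(len(centers)),
                    key=lambda i: dissimilarity(s, centers[i]))
            new_clusters[j].append(s)
        new_clusters = [c for c in new_clusters if c]  # drop empty clusters
        if new_clusters == clusters:  # partition unchanged: converged
            break
        clusters = new_clusters
    return clusters

      When only a dissimilarity is available, a natural centroid is
the medoid, i.e., the member of a cluster with least total
dissimilarity to the other members; this choice is an assumption, as
the disclosure's definition of centroid is not reproduced in this
excerpt.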
Dissimilarities

      Let X be the set of strings over the alphabet (vocabulary) V.
Any metric on X will serve as a dissimilarity, but weaker definitions
are useful.  We regard any nonnegative function ρ: X × X → R as a
dissimilarity so long as ρ(x,y) = 0 whenever x = y.  For such a
function to be useful for linguistic clustering it must somehow
utilize prior information (or side information) about the language;
we give two methods for doing this.  Clearly, different
dissimilarities can be combined.
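
      For example, a weighted sum realizes such a combination and
preserves both nonnegativity and the ρ(x,x) = 0 property (a sketch;
the disclosure does not specify a combination rule):

def combine(rho1, rho2, a=1.0, b=1.0):
    # Weighted sum of two dissimilarities; for a, b >= 0 the result
    # is again nonnegative and vanishes whenever x = y.
    return lambda x, y: a * rho1(x, y) + b * rho2(x, y)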
Method 1: Tree-based dissimilarity

      Let T be a decision tree for predicting the next word from the
last k words.  Then T: X → P, where P is the set of probability
distributions over the vocabulary.  The dissimilarity between two
strings x and y is then defined by the tree as the dissimilarity
between the distributions T(x) and T(y).  There are several useful e...
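
      A sketch of this tree-based dissimilarity follows.  Because the
excerpt is truncated, the symmetrized Kullback-Leibler divergence
used here is only one plausible measure between the two predicted
distributions, not necessarily one of those given in the original
text.

import math

def tree_dissimilarity(tree, x, y, k):
    # tree maps the last k words of a string to a smoothed, strictly
    # positive probability distribution over the vocabulary,
    # represented as a dict from word to probability.
    px = tree(x[-k:])
    py = tree(y[-k:])
    def kl(p, q):
        # Kullback-Leibler divergence D(p || q); q > 0 wherever p > 0
        # is guaranteed by smoothing.
        return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)
    return kl(px, py) + kl(py, px)  # symmetrized KL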