Construction of Gaussian Seed Clusters from a Context-Dependent Acoustic Tree for Use in a Speech Recognition System

IP.com Disclosure Number: IPCOM000106786D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 2 page(s) / 116K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

In [1] a procedure is given for constructing a vector quantiser for speech recognition purposes. The vector quantiser is based on a mixture of diagonal Gaussian distributions which are created in two stages. First, seed distributions are derived using an expensive Euclidean clustering process. And second, the seed distributions are refined via K-means diagonal Gaussian clustering.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 43% of the total text.

      In [1]  a procedure is given for constructing a vector
quantiser for speech recognition purposes.  The vector quantiser is
based on a mixture of diagonal Gaussian distributions which are
created in two stages.  First, seed distributions are derived using
an expensive Euclidean clustering process.  Second, the seed
distributions are refined via K-means diagonal Gaussian clustering.
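
      The second stage can be illustrated with a minimal sketch.  It
is not the procedure of [1], whose details are not given here; it
assumes hard assignment of each vector to its most likely diagonal
Gaussian, followed by re-estimation of the means and per-dimension
variances.  The function names, the iteration count and the variance
floor are assumptions made for illustration only.

```python
import math


def diag_gaussian_log_density(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def refine_seeds(vectors, seeds, iterations=10, var_floor=1e-3):
    """Refine seed Gaussians by K-means-style hard reassignment.

    vectors: list of equal-length lists of floats.
    seeds:   list of (mean, variance) pairs, one per cluster.
    The iteration count and variance floor are illustrative assumptions.
    """
    clusters = [(list(m), list(v)) for m, v in seeds]
    for _ in range(iterations):
        # Assignment step: each vector goes to its most likely Gaussian.
        members = [[] for _ in clusters]
        for x in vectors:
            best = max(
                range(len(clusters)),
                key=lambda k: diag_gaussian_log_density(x, *clusters[k]),
            )
            members[best].append(x)
        # Update step: re-estimate mean and per-dimension variance.
        for k, group in enumerate(members):
            if not group:
                continue  # an empty cluster keeps its previous parameters
            n = len(group)
            dim = len(group[0])
            mean = [sum(x[d] for x in group) / n for d in range(dim)]
            var = [
                max(sum((x[d] - mean[d]) ** 2 for x in group) / n, var_floor)
                for d in range(dim)
            ]
            clusters[k] = (mean, var)
    return clusters
```

      Because the assignment uses the Gaussian log density rather
than plain Euclidean distance, a cluster with a small variance in
some dimension penalises deviations in that dimension more heavily.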

In [2] a faster (cheaper) method is given for obtaining the Gaussian
seeds, but a modest amount of Euclidean clustering is still required.
In [3] an alternative is described in which Gaussian seeds are
obtained via "leafemic" trees.  A leafemic tree is a decision tree
which maps a phone in context to a Markov model which reflects the
properties of the acoustic labels expected when the phone is
articulated in the given context.

      The invention described below is an extremely efficient and
accurate method for obtaining seed Gaussians using context-dependent
acoustic trees.  It differs from [1,2] in that no expensive Euclidean
clustering is required, and it is superior to [3] in two respects.
First, [3] cannot be used when no leafemic tree exists:  this
happens during the initial stages of the construction of leafemic
models and when non-leafemic models, such as phonetic models, are
being used.  The present invention does not suffer from these
limitations.  Second, and more important, leafemic trees are
determined from labels, not from the acoustic vectors being modelled,
and labels provide a poor (quantised) estimate of acoustic vectors.
For this reason leafemic trees are suboptimal classifiers of acoustic
vectors into seed clusters.  The present invention creates seed
clusters from context-dependent acoustic trees which are determined
directly from the acoustic vectors, not from labels.

      Assume that some training data from several different speakers
has been recorded, signal processed, and Viterbi aligned against
phoneme-based Markov word models as described in [1,4].  Phoneme-based
models include but are not limited to leafemic models.
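
      The idea of taking seed clusters directly from acoustic vectors
can be sketched as follows.  This excerpt does not show how the tree
is grown or how its leaf statistics are computed, so the sketch simply
assumes that every aligned frame has already been assigned to a tree
leaf, and that one diagonal Gaussian seed is estimated per leaf.  The
function name, the (leaf, vector) input format and the variance floor
are hypothetical.

```python
from collections import defaultdict


def seed_gaussians_by_leaf(frames, var_floor=1e-3):
    """Estimate one diagonal Gaussian seed per tree leaf.

    frames: iterable of (leaf_id, vector) pairs, where leaf_id is an
            assumed identifier of the tree leaf reached by the frame.
    Returns a dict mapping leaf_id to a (mean, variance) pair.
    """
    groups = defaultdict(list)
    for leaf_id, vec in frames:
        groups[leaf_id].append(vec)

    seeds = {}
    for leaf_id, vecs in groups.items():
        n = len(vecs)
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        # Floor the variance so a leaf with few frames stays usable.
        var = [
            max(sum((v[d] - mean[d]) ** 2 for v in vecs) / n, var_floor)
            for d in range(dim)
        ]
        seeds[leaf_id] = (mean, var)
    return seeds
```

      The statistics here come from the acoustic vectors themselves,
which is the property the text contrasts with label-based leafemic
trees.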

      Assume further the existence of some phonetically meaningful
questions which may be used to construct phonological trees.  These
questions, which may be applied to any phone P in the neighborhood of
the frame being processed, usually take the form "is P a member of
the set S?"  Here S denotes a set containing one or more phonetic
phones having something in common.  The necessary sets may be
obtained from almost any phonetics textbook.  For present purposes,
"word boundary" is also considered to be a phonetic phone.

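      A question of the form "is P a member of the set S?" can be
sketched as a simple set-membership test.  The phone symbols, the
set names and the use of "#" for the word boundary are assumptions
made for illustration; so is the choice to treat positions outside
the phone sequence as word boundaries.

```python
# Hypothetical phone sets; a real system would take them from a
# phonetics reference.
PHONE_SETS = {
    "nasal": {"M", "N", "NG"},
    "voiceless_stop": {"P", "T", "K"},
    "word_boundary": {"#"},  # the word boundary is treated as a phone
}


def ask(phones, center, offset, set_name):
    """Answer: is the phone at center + offset a member of the named set?

    phones: list of phone symbols for one word.
    Positions outside the list count as the word-boundary symbol "#".
    """
    pos = center + offset
    phone = phones[pos] if 0 <= pos < len(phones) else "#"
    return phone in PHONE_SETS[set_name]
```

      For the phone sequence K AE N with the middle phone as the
current position, ask(phones, 1, 1, "nasal") is true because the
following phone is N, and ask(phones, 2, 1, "word_boundary") is true
because the position after the last phone falls outside the word.
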
      Similarly, assume the existence of some phonetically
meaningful questions that can be applied to any word W in the
vicinity of the frame being processed.  These questions usually take
the form "is W a member of the set T?", where T denotes a set...