
Determination of Acoustic Prototypes for Speech Recognition via Top-Down Clustering

IP.com Disclosure Number: IPCOM000106744D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 132K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

In discrete-parameter speech recognition systems, a vector quantiser outputs an acoustic label stream over time. In one prominent approach to speech recognition [1,2], each label is characterised by a "prototype" consisting of a mixture of about 20 diagonal Gaussian distributions, and the label output identifies the prototype which maximises the likelihood of a corresponding acoustic parameter vector.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 43% of the total text.

Determination of Acoustic Prototypes for Speech Recognition via Top-Down Clustering

      In discrete-parameter speech recognition systems, a vector
quantiser outputs an acoustic label stream over time.  In one
prominent approach to speech recognition [1,2], each label is
characterised by a "prototype" consisting of a mixture of about 20
diagonal Gaussian distributions, and the label output identifies the
prototype which maximises the likelihood of a corresponding acoustic
parameter vector.
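
The labelling rule just described can be illustrated with a minimal sketch. It assumes each prototype is stored as a list of (weight, mean, variance) triples, one per diagonal Gaussian; the function names, the label names, and the toy two-component prototypes are hypothetical and are not taken from [1,2].

```python
import math

def log_diag_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_likelihood(x, prototype):
    """Log likelihood of x under a prototype given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_diag_gaussian(x, m, v) for w, m, v in prototype]
    peak = max(terms)
    # Log-sum-exp keeps the sum stable when the individual likelihoods are tiny.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def label_vector(x, prototypes):
    """Return the label whose prototype maximises the likelihood of x."""
    return max(prototypes, key=lambda name: log_mixture_likelihood(x, prototypes[name]))

# Toy data (hypothetical): two labels, two Gaussians each instead of about 20.
prototypes = {
    "LABEL_A": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "LABEL_B": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}
```

Applying `label_vector` to each parameter vector in turn would yield the label stream; a real system would use many more labels and Gaussians per prototype.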

The quality of the resulting label stream clearly depends on the set
of Gaussian distributions chosen to represent each label.  In [1,2]
seed Gaussians are obtained via a K-means clustering algorithm
starting from random seeds; the number of clusters is reduced
iteratively from some high initial number to the required number of
about 20.  Other methods for choosing the seed Gaussians are
suggested in [3,4].  However they are selected, the seed Gaussians
are then refined into the final set of Gaussians by K-means
clustering with a diagonal Gaussian distance measure.
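
The refinement step can be sketched as follows. This is an illustration only: the text does not define the "diagonal Gaussian distance measure", so the sketch assumes it is the negative log likelihood under each cluster's diagonal Gaussian (up to a constant). The variance floor, the iteration count, and all function names are likewise assumptions.

```python
import math

def diag_gaussian_distance(x, mean, var):
    """Assumed distance: -2 * log likelihood of x, dropping the constant term."""
    return sum(math.log(v) + (xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))

def estimate(vectors, var_floor=1e-3):
    """Mean and floored per-dimension variance of a non-empty list of vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var

def refine(vectors, seeds, iterations=10):
    """K-means refinement of seed Gaussians, each given as a (mean, var) pair."""
    gaussians = list(seeds)
    for _ in range(iterations):
        clusters = [[] for _ in gaussians]
        for x in vectors:
            # Assign x to the Gaussian at the smallest distance.
            nearest = min(
                range(len(gaussians)),
                key=lambda k: diag_gaussian_distance(x, *gaussians[k]),
            )
            clusters[nearest].append(x)
        # Re-estimate each Gaussian; an empty cluster keeps its previous parameters.
        gaussians = [estimate(c) if c else g for c, g in zip(clusters, gaussians)]
    return gaussians
```

Unlike plain Euclidean K-means, this distance takes each cluster's variances into account, so a broad cluster can claim vectors that lie nearer to the mean of a tight one.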

These traditional methods place no restrictions on which acoustic
vectors may belong to which cluster.  In the present invention,
cluster membership is severely restricted: the members of each
cluster are constrained to belong to the same context class.  The
definition of the context classes and a method for constructing them
are specified below.  Also described are algorithms for obtaining
the label prototypes and for the subsequent labelling.

      It will be assumed that some training data has been recorded,
signal processed, and Viterbi aligned against phoneme-based Markov
word models as described in [1,2].

      It will be further assumed that there exist some
phonetically meaningful questions which may be used to construct
phonological trees.  These questions, which may be applied to any
phone P in the neighbourhood of the frame being processed, usually
take the form "is P a member of the set S?".  Here S denotes a set
containing one or more phonetic phones having something in common.
The necessary sets may be obtained from almost any phonetic text
book.  For present purposes, "word boundary" is also considered to be
a phonetic phone.
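
A question of the form "is P a member of the set S?" can be represented as an offset into the phone context plus a named phone set. The sketch below is a hypothetical encoding: the phone symbols, the set names, and the use of "|" for the word boundary are illustrative assumptions, not the inventory used by the authors.

```python
from dataclasses import dataclass

# Hypothetical phone sets; real sets would be taken from a phonetics textbook.
PHONE_SETS = {
    "nasal": frozenset({"M", "N", "NG"}),
    "voiceless_stop": frozenset({"P", "T", "K"}),
    "boundary": frozenset({"|"}),  # the word boundary is treated as a phone
}

@dataclass(frozen=True)
class Question:
    offset: int    # -1 = phone just before the current phone, +1 = just after
    set_name: str  # key into PHONE_SETS

    def ask(self, context, centre):
        """Answer the question for the phone at context[centre + offset]."""
        position = centre + self.offset
        if not 0 <= position < len(context):
            return False  # nothing at that position, so the answer is "no"
        return context[position] in PHONE_SETS[self.set_name]

# Example context: a hypothetical word with boundaries on both sides.
context = ["|", "K", "AE", "N", "|"]
centre = 2  # index of the phone containing the frame being processed
```

For instance, `Question(-1, "voiceless_stop").ask(context, centre)` asks whether the phone immediately preceding the current one is a voiceless stop. Each node of a phonological tree would hold one such question.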

The procedure begins in the same manner as for the construction of
context-dependent prototypes [5].  Specifically, the following steps
are performed.

(1) Using the existing Viterbi alignments, tag each parameter
vector in the training data with (a) the identity of the arc against
which the vector was aligned, (b) the phonetic phone which contained
that arc, (c) the N phonetic phones which preceded that phone, and
(d) the N phonetic phones which followed it.  A typical value for N
is 5.

(2) Perform Steps 3-6 for each arc A in the arc inventory.

(3) Extract from the tagged data of Step 1 all the data which were
aligned with arc A.

(4) Construct a phonological tree from the data extracted in Step 3,
so as to maximi...