Extraction of Dynamic and Static Acoustic Information for Use in a Discrete-Parameter HMM Speech Recognition System

IP.com Disclosure Number: IPCOM000111118D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 102K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+3]

Abstract

Reference [1], among others, observed that when static speech spectral information is augmented with dynamic spectral information, improved speech recognition is obtained. References [2,3] used one stream of acoustic labels to capture static spectral information and another to capture dynamic information. In all three cases, however, the authors chose arbitrary time-dependent spectral functions as their dynamic spectral features.

Extraction of Dynamic and Static Acoustic Information for Use in
a Discrete-Parameter HMM Speech Recognition System

      Reference [1], among others, observed that when static speech
spectral information is augmented with dynamic spectral information,
improved speech recognition is obtained.  References [2,3]  used one
stream of acoustic labels to capture static spectral information and
another to capture dynamic information.  In all three cases, however,
the authors chose arbitrary time-dependent spectral functions as
their dynamic spectral features.

      In the invention described below, the use of multiple label
streams is adopted, but prior work is improved upon by (a) computing
optimal linear time-dependent functions to serve as dynamic spectral
features, and (b) increasing the amount of acoustic training data
available from a new speaker by augmenting it with properly
normalized data from a previous (reference) speaker.

      It is assumed that some training data exists for both the
subject and reference speakers, and that it has been signal-processed
into a series of acoustic parameter vectors; typically there would be
one vector for about every 10 ms of speech.  These vectors will be
referred to as unspliced vectors.  It is further assumed that these
parameter vectors have been aligned against hidden Markov models
representing the training scripts, and that these Markov models are
constructed from sub-word units called phones.
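
      For concreteness, a minimal Python/NumPy sketch of producing such
unspliced vectors is given below.  The disclosure does not specify the
signal processing actually used; the log-spectral features, 10 ms frame
length, vector dimension, and function name here are illustrative
assumptions only.

    import numpy as np

    def unspliced_vectors(samples, sample_rate=16000, frame_ms=10, dim=20):
        # Illustrative only: slice the waveform into ~10 ms frames and map
        # each frame to a fixed-length acoustic parameter vector (here a
        # truncated log-magnitude spectrum; the disclosure's features differ).
        samples = np.asarray(samples, dtype=float)
        frame_len = int(sample_rate * frame_ms / 1000)
        vectors = []
        for i in range(len(samples) // frame_len):
            frame = samples[i * frame_len:(i + 1) * frame_len]
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
            vectors.append(np.log(spectrum[:dim] + 1e-6))
        return np.array(vectors)          # shape: (n_frames, dim)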

The following steps are performed.

Step 1.  Using the reference speaker's unspliced vectors, create
label prototypes as described in [4]:  Cluster each phone iteratively
from about 50 Euclidean clusters down to about 20, then cluster into
the same number of diagonal Gaussian clusters, and hence obtain label
prototypes in the form of Gaussian mixtures.
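
      The following sketch illustrates one way Step 1 might look for a
single phone: plain Euclidean k-means followed by fitting a diagonal
Gaussian to each resulting cluster.  The iterative 50-to-20 cluster
reduction of [4] is not reproduced, and the function and parameter
names are hypothetical.

    import numpy as np

    def phone_prototypes(phone_vectors, n_clusters=20, n_iter=10, seed=0):
        # One phone's unspliced vectors -> Euclidean k-means clusters ->
        # a diagonal Gaussian per cluster (together, a crude mixture).
        X = np.asarray(phone_vectors, dtype=float)
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            for k in range(n_clusters):
                if np.any(assign == k):
                    centers[k] = X[assign == k].mean(axis=0)
        prototypes = []
        for k in range(n_clusters):
            members = X[assign == k]
            if len(members):
                prototypes.append({"weight": len(members) / len(X),
                                   "mean": members.mean(axis=0),
                                   "var": members.var(axis=0) + 1e-6})
        return prototypes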

Step 2.  Using the unspliced parameter vectors, compute and apply a
speaker-standardizing transformation as in [5], so as to minimize
spectral differences between the subject and reference speakers.
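
      Reference [5] defines the actual speaker-standardizing
transformation; the sketch below substitutes a simple per-dimension
affine transform that maps the subject speaker's global mean and
variance onto the reference speaker's, purely for illustration.

    import numpy as np

    def standardize_to_reference(subject_vecs, reference_vecs):
        # Per-dimension affine map: subject mean/variance -> reference
        # mean/variance.  A stand-in for the transformation of [5].
        S = np.asarray(subject_vecs, dtype=float)
        R = np.asarray(reference_vecs, dtype=float)
        scale = R.std(axis=0) / (S.std(axis=0) + 1e-6)
        shift = R.mean(axis=0) - scale * S.mean(axis=0)
        return S * scale + shift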

Step 3.  Combine the transformed subject and reference unspliced
vectors, and use them to create new label prototypes starting from
the clusters of Step 1, as detailed in [5].
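
      A rough sketch of Step 3 follows: the transformed subject vectors
are pooled with the reference vectors, and the Step 1 prototype means
are refined on the pooled data by a few k-means-style passes before the
diagonal Gaussians are refitted.  The actual re-estimation procedure is
the one detailed in [5]; this is only a simplified stand-in.

    import numpy as np

    def reestimate_prototypes(prototypes, pooled_vecs, n_iter=5):
        # Refine Step 1 prototypes on the pooled (transformed subject +
        # reference) vectors; k-means-style passes seeded with the old means.
        X = np.asarray(pooled_vecs, dtype=float)
        means = np.array([p["mean"] for p in prototypes], dtype=float)
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            for k in range(len(means)):
                if np.any(assign == k):
                    means[k] = X[assign == k].mean(axis=0)
        new_prototypes = []
        for k in range(len(means)):
            members = X[assign == k]
            if len(members):
                new_prototypes.append({"weight": len(members) / len(X),
                                       "mean": members.mean(axis=0),
                                       "var": members.var(axis=0) + 1e-6})
        return new_prototypes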

Step 4.  Using the prototypes of Step 3, label the subject speaker's
unspliced vectors.  This stream of labels conveys static spectral
information.
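
      Labeling in Step 4 can be sketched as a maximum-likelihood
assignment of each unspliced vector to a prototype.  For simplicity,
each candidate label below is scored by a single diagonal Gaussian
rather than the full mixture of [4].

    import numpy as np

    def label_vectors(vectors, prototypes):
        # Assign each unspliced vector the index of the highest-likelihood
        # diagonal-Gaussian prototype; the result is the static label stream.
        X = np.asarray(vectors, dtype=float)
        labels = []
        for x in X:
            best, best_ll = None, -np.inf
            for idx, p in enumerate(prototypes):
                ll = (np.log(p["weight"])
                      - 0.5 * np.sum(np.log(2.0 * np.pi * p["var"]))
                      - 0.5 * np.sum((x - p["mean"]) ** 2 / p["var"]))
                if ll > best_ll:
                    best, best_ll = idx, ll
            labels.append(best)
        return labels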

Step 5.  Create spliced parameter vectors for each speaker by
concatenating up to about 9 contiguous unspliced vectors, as
described in [4].
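
      A sketch of the splicing operation of Step 5, assuming a
symmetric window of four frames on each side (nine contiguous
unspliced vectors in total, with edge frames repeated at the
utterance boundaries):

    import numpy as np

    def splice(vectors, context=4):
        # Concatenate each frame with its `context` left and right
        # neighbours: 2*context + 1 = 9 contiguous unspliced vectors.
        X = np.asarray(vectors, dtype=float)
        padded = np.vstack([np.repeat(X[:1], context, axis=0),
                            X,
                            np.repeat(X[-1:], context, axis=0)])
        return np.array([padded[i:i + 2 * context + 1].ravel()
                         for i in range(len(X))])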

Step 6.  Using the reference speaker's spliced vectors, compute the
principal discriminating eigenvectors for separating the phones, as
detailed in [4].  These...
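
      The description of Step 6 is truncated above; the sketch below
treats the principal discriminating eigenvectors as the solution of a
linear-discriminant-style eigenproblem on the spliced vectors
(between-phone versus within-phone scatter).  The actual computation
is the one detailed in [4]; the number of eigenvectors retained here
is an arbitrary illustrative choice.

    import numpy as np

    def discriminating_eigenvectors(spliced_vecs, phone_ids, n_keep=50):
        # Linear-discriminant-style sketch: eigenvectors of pinv(Sw) @ Sb,
        # where Sw/Sb are within-phone and between-phone scatter matrices.
        X = np.asarray(spliced_vecs, dtype=float)
        y = np.asarray(phone_ids)
        overall_mean = X.mean(axis=0)
        d = X.shape[1]
        Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
        for phone in np.unique(y):
            Xp = X[y == phone]
            mp = Xp.mean(axis=0)
            Sw += (Xp - mp).T @ (Xp - mp)
            diff = (mp - overall_mean)[:, None]
            Sb += len(Xp) * (diff @ diff.T)
        eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
        order = np.argsort(-eigvals.real)
        return eigvecs[:, order[:n_keep]].real   # leading eigenvectors as columns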