Iterative Statistics Training and Multonic Baseform Adaptation for Automatic Speech Recognition

IP.com Disclosure Number: IPCOM000106580D
Original Publication Date: 1993-Nov-01

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR (and 5 others)

Abstract

A set of hidden Markov word models derived from a large amount of data obtained from several speakers can be personalized to a new speaker in a number of ways. These strategies are evaluated for the training of both the multonic coefficients and the fenonic parameters of a set of speaker-independent multonic baseforms. Iterative training allows the models to adjust to the new speaker's pronunciation after only a relatively small amount of data from this new speaker.


      In an automatic speech recognition system such as described in
[1], the pronunciation of each word is represented by a hidden Markov
model composed of a sequence of elementary units drawn from an
alphabet A_eu.  In [2], for example, the elementary units are fenones
taking their values in an alphabet of typical size 300, while in [3],
the elementary units are multones drawn from an alphabet of typical
size 5000.  The construction of the resulting fenonic [2] and
multonic [3] baseforms is typically carried out using a large
database of speech (e.g., 20,000 sentences) provided by several
talkers.  Thus, the resulting baseforms are speaker-independent.
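
      For illustration, such a baseform can be represented simply as
a word label together with a sequence of elementary-unit identifiers,
each identifier indexing a small HMM of its own in the alphabet A_eu.
The Python sketch below is a minimal illustration only; the type
names and the example unit sequence are hypothetical, not taken from
the systems of [2] or [3].

    from dataclasses import dataclass
    from typing import List

    import numpy as np

    # Typical alphabet sizes quoted in the text.
    FENONE_ALPHABET_SIZE = 300    # fenonic units, as in [2]
    MULTONE_ALPHABET_SIZE = 5000  # multonic units, as in [3]

    @dataclass
    class ElementaryUnit:
        """One unit of the alphabet A_eu: a small HMM of its own."""
        transitions: np.ndarray   # (n_states, n_states) transition probs
        outputs: np.ndarray       # (n_states, n_labels) output probs

    @dataclass
    class Baseform:
        """A word model: a concatenation of elementary units."""
        word: str
        units: List[int]          # indices into the alphabet A_eu

    # A made-up fenonic baseform: the word model is the left-to-right
    # concatenation of the unit HMMs indexed 17, 42, 42, 105 and 3.
    example = Baseform(word="speech", units=[17, 42, 42, 105, 3])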

      In a speaker-dependent speech recognition system such as
considered in [4], each new speaker then provides an adequate amount
of training data for the construction of his/her personal vector
quantizer prototypes.  In [5], it was found that a small portion of
this data could be advantageously utilized for training the multonic
coefficients of multonic baseforms to the new speaker.  In
particular, this approach was found valuable to more accurately model
potential idiosyncrasies in the pronunciation of the new speaker.  No
attempt was made, however, to make use of the training data to also
train the Markov model probabilities (or fenonic parameters).  This
is the object of the present article.
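
      For concreteness, the vector quantizer prototypes mentioned
above are commonly obtained by clustering the new speaker's acoustic
feature vectors.  The sketch below uses plain k-means for this
purpose; it is a generic stand-in, not the specific procedure of [4],
and the function name and parameter values are illustrative.

    import numpy as np

    def train_vq_prototypes(features, n_prototypes=200, n_iters=20,
                            seed=0):
        """Cluster one speaker's feature vectors into VQ prototypes
        using plain k-means.

        features: (n_frames, dim) acoustic feature vectors.
        Returns an (n_prototypes, dim) codebook.
        """
        rng = np.random.default_rng(seed)
        # Initialize the codebook from randomly chosen frames.
        pick = rng.choice(len(features), size=n_prototypes, replace=False)
        codebook = features[pick].copy()
        for _ in range(n_iters):
            # Assign every frame to its nearest prototype.
            dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            # Move each prototype to the mean of its assigned frames.
            for k in range(n_prototypes):
                members = features[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
        return codebook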

      Four ways can be identified in which both the multonic
coefficients and the fenonic parameters can be trained to the new
speaker.  To initialize each algorithm, it is assumed that a set of
speaker-independent multonic baseforms has been constructed as
described in [3], so that both speaker-independent fenonic parameters
and speaker-independent multonic coefficients are available.  All
training is done using the forward-backward algorithm for the
parameter estimation of hidden Markov models [1], with deleted
interpolation performed in the last iteration [1] to avoid
overfitting the estimated parameters to the training data.  A sketch
of this training loop appears after the list below.

1.  (Multonic Coefficients First) This corresponds to first doing
    baseform adaptation as proposed in [5], and then training the
    statistics using the updated multonic baseforms.

2.  (Fenonic Parameters First) This corresponds to first training the
    statistics as in [1], and then performing baseform adaptation using
    the new fenonic parameters.
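
      Whichever ordering is chosen, the core of each strategy is the
forward-backward re-estimation referred to above.  The sketch below
implements Baum-Welch updates for a discrete-output HMM, starting
from the speaker-independent statistics and smoothing the final
speaker-dependent estimates back toward them with a single fixed
weight; this one-weight smoothing is a simplification standing in for
the deleted interpolation of [1], and all names and parameter values
are illustrative.

    import numpy as np

    def forward_backward(A, B, pi, obs):
        """One E-step for a discrete-output HMM (no scaling, so
        suitable only for short utterances).

        A:   (S, S) transition probabilities
        B:   (S, V) output (label) probabilities
        pi:  (S,)  initial state distribution
        obs: (T,)  observed label indices
        Returns expected transition and output counts.
        """
        S, T = len(pi), len(obs)
        alpha = np.zeros((T, S))
        beta = np.zeros((T, S))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):                  # forward pass
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):         # backward pass
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        evidence = alpha[-1].sum()             # P(obs | model)
        trans_counts = np.zeros_like(A)
        for t in range(T - 1):                 # expected transitions
            trans_counts += (alpha[t][:, None] * A
                             * B[:, obs[t + 1]] * beta[t + 1]) / evidence
        gamma = alpha * beta / evidence        # state occupancies
        out_counts = np.zeros_like(B)
        for t in range(T):
            out_counts[:, obs[t]] += gamma[t]
        return trans_counts, out_counts

    def adapt(A_si, B_si, pi, utterances, n_iters=5, lam=0.7):
        """Baum-Welch iterations started from speaker-independent
        statistics (A_si, B_si), then smoothed back toward them."""
        A, B = A_si.copy(), B_si.copy()
        for _ in range(n_iters):
            tc, oc = np.zeros_like(A), np.zeros_like(B)
            for obs in utterances:             # accumulate counts
                t_c, o_c = forward_backward(A, B, pi, obs)
                tc += t_c
                oc += o_c
            A = tc / tc.sum(axis=1, keepdims=True)   # M-step
            B = oc / oc.sum(axis=1, keepdims=True)
        # One-weight stand-in for deleted interpolation: mix the new
        # speaker-dependent estimates with the speaker-independent ones.
        return lam * A + (1 - lam) * A_si, lam * B + (1 - lam) * B_si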