
Baseform Adaptation using Multonic Markov Word Models

IP.com Disclosure Number: IPCOM000106551D
Original Publication Date: 1993-Nov-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 2 page(s) / 103K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

Personalizing a set of hidden Markov word models, derived from a large amount of data obtained from several speakers, to a new speaker is a recurring problem in automatic speech recognition. This article shows how the multonic coefficients of a set of speaker-independent multonic baseforms can be trained to reflect the new speaker's pronunciation using only a relatively small amount of data from that speaker.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Baseform Adaptation using Multonic Markov Word Models

      In an automatic speech recognition system such as the one
described in [1], the pronunciation of each word is represented by a
hidden Markov model composed of a sequence of elementary units drawn
from an alphabet A_eu.  In [2], for example, the elementary units are
fenones taking their values in an alphabet of typical size 300.  In
[3], the elementary units are multones drawn from an alphabet of
typical size 5000.
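
      As a purely illustrative picture of this representation, the
sketch below encodes a word baseform as a sequence of elementary-unit
indices drawn from an alphabet A_eu.  The alphabet sizes follow the
typical figures quoted above; the class name, field names, and the
example unit sequence are hypothetical and not taken from [1]-[3].

from dataclasses import dataclass
from typing import List

FENONE_ALPHABET_SIZE = 300    # typical fenonic alphabet size [2]
MULTONE_ALPHABET_SIZE = 5000  # typical multonic alphabet size [3]

@dataclass
class Baseform:
    """A word pronunciation as a sequence of elementary units from A_eu."""
    word: str
    units: List[int]  # indices into the chosen elementary-unit alphabet

# Made-up fenonic baseform for the word "speech"; the unit IDs are arbitrary.
speech = Baseform(word="speech", units=[17, 204, 86, 113, 5, 244])
assert all(0 <= u < FENONE_ALPHABET_SIZE for u in speech.units)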

      The construction of such fenonic [2] and multonic baseforms [3]
is typically carried out using a large database of speech (e.g.,
20,000 sentences) provided by several talkers.  Thus, the resulting
baseforms are speaker-independent.  In a speaker-dependent speech
recognition system such as the one considered in [4], each new
speaker then provides roughly 2,000 sentences of training data for
the construction of his or her personal vector quantizer prototypes
and Markov model probabilities.  There is, however, no provision for
customizing the baseforms themselves to the new speaker.  As a
result, the system may fail to accurately model idiosyncrasies in the
new speaker's pronunciation.

      These shortcomings can be directly traced to the construction
procedure for fenonic baseforms used in [4].  As it is inherently a
discrete optimization process, it does not lend itself easily to
adaptation schemes.  Adaptation would have to be carried out through
a computationally expensive stack search procedure which would likely
require a large amount of training data from the new speaker.  On the
other hand, the construction of multonic baseforms, initialized from
a set of fenonic baseforms, is a continuous optimization process [3].
This makes it possible to consider the adaptation of multonic
baseforms to a new speaker with only a relatively small amount of
training data.
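
      To make this contrast concrete, the sketch below treats a
fenonic baseform as a fixed sequence of discrete unit IDs and a
multonic baseform as a set of real-valued coefficient vectors over
the fenonic alphabet.  This parameterization, and every name in the
sketch, is an assumption made for illustration only; the precise form
of the multonic coefficients is given in [3].

import numpy as np

n_fenones = 300
rng = np.random.default_rng(0)

# Fenonic baseform: a fixed sequence of fenone IDs.  Adapting it means
# replacing discrete units, i.e., a combinatorial (stack) search.
fenonic_baseform = [17, 204, 86, 113, 5, 244]

# Multonic baseform: one real-valued coefficient vector per unit over the
# fenonic alphabet.  Adapting it means moving continuous parameters, which
# even a small amount of new-speaker data can inform.
multonic_coeffs = rng.dirichlet(np.ones(n_fenones), size=len(fenonic_baseform))
print(multonic_coeffs.shape)  # (6, 300): continuous parameters per unit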

      To initialize the baseform adaptation procedure, it is assumed
that a set of speaker-independent multonic baseforms has been
constructed as described in [3], so that both speaker-independent
fenonic parameters and speaker-independent multonic coefficients are
available.  The fenonic parameters will be kept constant throughout
the procedure, while the multonic coefficients will be trained using
the new speaker's data.  The following steps are performed for each
word w in the vocabulary.

1.  Define S as the set of all multonic coefficients for the word w,
    and Y as the set of all label se...
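
      Since the remaining steps are truncated in this excerpt, the
sketch below only illustrates the general shape of the per-word loop
described above: the speaker-independent fenonic parameters stay
fixed, and the multonic coefficients of each word are re-estimated
from statistics gathered on the new speaker's utterances of that
word.  The normalized-count update used here is an assumed EM-style
re-estimate, not the disclosure's exact formula, and all function and
variable names are hypothetical.

import numpy as np

def adapt_word_coefficients(si_coeffs, expected_counts, prior_mass=5.0):
    """Re-estimate the multonic coefficients of one word.

    si_coeffs       : (n_multones, n_fenones) speaker-independent coefficients
    expected_counts : (n_multones, n_fenones) expected fenone counts obtained
                      by aligning the new speaker's utterances of the word
                      against the fixed fenonic models (alignment not shown)
    prior_mass      : pseudo-count weight given to the speaker-independent
                      coefficients, so sparsely observed words fall back on them
    """
    updated = expected_counts + prior_mass * si_coeffs
    return updated / updated.sum(axis=1, keepdims=True)  # renormalize per multone

# Toy usage: one word whose baseform has 4 multones over a 300-fenone alphabet.
rng = np.random.default_rng(0)
si_coeffs = rng.dirichlet(np.ones(300), size=4)
counts = rng.poisson(0.5, size=(4, 300)).astype(float)
adapted = adapt_word_coefficients(si_coeffs, counts)
assert np.allclose(adapted.sum(axis=1), 1.0)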