Efficient Construction of a Supervised Speaker-Independent Acoustic Label Alphabet for Automatic Speech Recognition

IP.com Disclosure Number: IPCOM000106685D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 161K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

Recently a fast algorithm was described to derive speaker-dependent acoustic prototypes for use in speech recognition systems using sub-word acoustic hidden Markov models. The present article shows that a modified algorithm directly leads to the efficient construction of a supervised, speaker-independent acoustic label alphabet. The new algorithm allows for a drastic reduction in the amount of time necessary to construct such an alphabet.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.

      In an automatic speech recognition system such as described in
[1], the pronunciation of each word is represented by a hidden Markov
model, composed from a small inventory A_sw of sub-word acoustic
models [2], each representing an allophone, a phone, a syllable, etc.
[3].  Each sub-word model is in turn composed of a sequence of
elementary units (e.g., fenones) drawn from an alphabet A_eu.  As a
result, these sub-word models can be derived automatically given some
acoustic labels [4] in an alphabet A_al resulting from a large
inventory of acoustic prototypes.  In [5], for example, the sub-word
models are allophonic leafforms.
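The hierarchy just described (words built from sub-word models in
A_sw, each in turn a sequence of elementary units from A_eu) can be
sketched as follows.  The class names, fenone ids, and the example
word are illustrative assumptions, not taken from this article:

```python
# Hypothetical sketch of the model hierarchy: a word HMM is a sequence
# of sub-word models drawn from A_sw; each sub-word model is a sequence
# of elementary units (fenones) drawn from A_eu.
from dataclasses import dataclass

@dataclass
class SubWordModel:
    name: str       # e.g., an allophone, phone, or syllable identifier
    fenones: list   # sequence of elementary-unit ids from A_eu

@dataclass
class WordModel:
    spelling: str
    sub_words: list  # sequence of SubWordModel instances from A_sw

# Illustrative instantiation: the word "cat" as three phone-level
# sub-word models, each expanded into a short fenone sequence.
cat = WordModel("cat", [
    SubWordModel("K",  [12, 12, 47]),
    SubWordModel("AE", [3, 3, 3, 88]),
    SubWordModel("T",  [51, 9]),
])

# The elementary-unit sequence for the whole word is the concatenation
# of the fenone sequences of its sub-word models.
fenone_sequence = [f for sw in cat.sub_words for f in sw.fenones]
```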

      In order to derive speaker-independent sub-word models, one
needs a speaker-independent acoustic label alphabet A_al which can be
used to characterize utterances obtained from several different
speakers.  In [6], a strategy was proposed to achieve a suitably fine
partition of the acoustic space, but, as pointed out in [7], the
resulting algorithm is computationally intensive and does not
incorporate any supervision.  In contrast, the method of [7] is
efficient and does relate the HMM allophonic leafforms to their
acoustic manifestations, but it has the drawback of producing a
speaker-dependent acoustic prototype set.  As a result, it is
inappropriate for constructing a speaker-independent acoustic label
alphabet.

      The present article extends the strategy of [7] to derive an
acoustic prototype set using data obtained from several speakers.
The resulting approach incorporates supervision and is intrinsically
amenable to a fast implementation.  As in [7], the algorithm is
organized around a clustering step followed by a pruning step.  The
purpose of the first step is to grow a binary clustering tree from
multi-speaker speech data.  This tree completely exposes, one
elementary unit at a time, the relevant inter-relationships between
all potential clusters.  The purpose of the second step is to prune
this tree according to some appropriate criterion, again using the
data obtained from the same speakers.  Only a desired number of
leaves is retained, thus forming the final clusters.
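The grow-then-prune scheme can be sketched as follows.  This is a
minimal illustration, not the article's algorithm: the data are
one-dimensional stand-ins for pooled multi-speaker acoustic vectors,
and both the split rule (thresholding at the mean) and the pruning
criterion (smallest increase in within-cluster distortion) are
assumptions made here for concreteness.

```python
# Sketch: grow a binary clustering tree over pooled data, then prune
# it back to a desired number of leaves.  A tree is either a list of
# samples (a leaf) or a (left, right) tuple of subtrees.

def distortion(xs):
    """Sum of squared deviations from the cluster mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def grow(xs, min_size=2):
    """Recursively split the samples at their mean until clusters
    become too small to split further."""
    if len(xs) < 2 * min_size:
        return xs
    m = sum(xs) / len(xs)
    left = [x for x in xs if x <= m]
    right = [x for x in xs if x > m]
    if not left or not right:
        return xs
    return (grow(left, min_size), grow(right, min_size))

def leaves(node):
    if isinstance(node, tuple):
        return leaves(node[0]) + leaves(node[1])
    return [node]

def prune(tree, n_leaves):
    """Collapse the cheapest pair of sibling leaves until only
    n_leaves leaves remain."""
    while len(leaves(tree)) > n_leaves:
        tree = collapse_cheapest(tree)
    return tree

def collapse_cheapest(tree):
    # Among internal nodes whose children are both leaves, find the
    # one whose merge increases total distortion the least.
    best = {"cost": None, "path": None}

    def visit(node, path):
        if not isinstance(node, tuple):
            return
        l, r = node
        if not isinstance(l, tuple) and not isinstance(r, tuple):
            cost = distortion(l + r) - distortion(l) - distortion(r)
            if best["cost"] is None or cost < best["cost"]:
                best["cost"], best["path"] = cost, path
        visit(l, path + (0,))
        visit(r, path + (1,))

    def rebuild(node, path):
        if path == best["path"]:
            return node[0] + node[1]  # merge the two leaf lists
        if not isinstance(node, tuple):
            return node
        return (rebuild(node[0], path + (0,)),
                rebuild(node[1], path + (1,)))

    visit(tree, ())
    return rebuild(tree, ())
```

For example, growing a tree over the samples [1.0, 1.1, 1.2, 5.0,
5.1, 9.0, 9.2] yields three leaves; pruning it to two leaves merges
the two clusters near 5 and 9, since joining them costs less
distortion than absorbing the cluster near 1.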

      The procedure is initialized as detailed in [7], except that
speech has been recorded for some adequate number of speakers N.  (A
common choice is 4 < N ≤ 10).  The following steps are performed
for each elementary...