Derivation of Supervised Acoustic Prototypes with Enhanced Discrimination Power

IP.com Disclosure Number: IPCOM000106670D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 146K

Publishing Venue

IBM

Related People

Bellegarda, JR: AUTHOR [+2]

Abstract

The performance of speech recognition systems that use sub-word acoustic hidden Markov models hinges heavily on the discriminative power of the set of acoustic prototypes. This article presents an algorithm that derives a set of maximum-discrimination acoustic prototypes for each sub-word unit, based on the confusability matrix between all sub-word units.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 34% of the total text.


      In an automatic speech recognition system such as that
described in [1], the pronunciation of each word is represented by a
hidden Markov model composed of models drawn from a small inventory
A_sw of sub-word acoustic models [2], each representing an allophone,
a phone, a syllable, etc. [3].  Each sub-word model is in turn
composed of a sequence of elementary units drawn from a small
alphabet A_eu.  In [4], for example, the sub-word models are
allophonic leafforms, composed of fenones taking their values in an
alphabet A_eu of approximate size 300.
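The two-level composition described above can be sketched in a few lines. This is a minimal illustration only: the alphabets, sub-word spellings, and function names below are invented for the example, not taken from the article, and the sketch shows only the model topology, not the HMM statistics.

```python
# Sketch (hypothetical names and values): a word model is built by
# concatenating sub-word models, each of which is a sequence of
# elementary units (fenones) from a small alphabet A_eu.

# Toy elementary-unit alphabet A_eu; the article's alphabet in [4]
# has approximate size 300.
A_eu = [f"f{i}" for i in range(10)]

# Toy inventory A_sw of sub-word models, each "spelled" as a fenone
# sequence (values are illustrative, not from the article).
A_sw = {
    "AE": ["f1", "f3", "f3"],
    "K":  ["f7", "f2"],
    "T":  ["f5", "f5", "f8"],
}

def word_model(pronunciation):
    """Concatenate sub-word models into the fenone sequence that
    defines the word-level hidden Markov model topology."""
    return [f for unit in pronunciation for f in A_sw[unit]]

# A "cat"-like pronunciation K-AE-T expands to its fenone string:
print(word_model(["K", "AE", "T"]))
```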

      The advantage of using such sub-word models is that they can be
derived automatically, given acoustic labels [5] produced from a
large inventory of acoustic prototypes A_ap.  As is well known
(e.g., [6]), the performance of the recognizer depends heavily on the
adequacy of the acoustic prototypes in A_ap, especially for
continuous-speech tasks where co-articulation effects may be severe.
In [6], a strategy was proposed to achieve a suitably fine partition
of the acoustic space.  Subsequently, in [7], a more efficient
algorithm was developed by incorporating supervision, in the sense of
relating the HMM allophonic leafforms to their acoustic
manifestations.  However, this supervision was introduced at the
elementary-unit level (in this case, the fenone level), without
regard to the inter-unit discrimination power of the resulting set of
prototypes.
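Acoustic labeling against a prototype inventory amounts to nearest-neighbor quantization of each acoustic frame. The sketch below assumes Euclidean distance and toy two-dimensional prototype vectors; the prototype names, vectors, and distance choice are illustrative, not the article's.

```python
import math

# Sketch: label each acoustic frame with the identity of its nearest
# prototype in A_ap.  The prototypes here are made-up 2-D vectors;
# real systems use high-dimensional spectral feature vectors.
A_ap = {
    "p0": (0.0, 0.0),
    "p1": (1.0, 0.0),
    "p2": (0.0, 1.0),
}

def label(frame):
    """Return the name of the prototype closest (Euclidean distance)
    to the frame, producing the label stream from which the sub-word
    models are derived."""
    return min(A_ap, key=lambda p: math.dist(A_ap[p], frame))

frames = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]
print([label(f) for f in frames])  # one prototype label per frame
```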

      The present article proposes a new strategy that is better able
to enhance discrimination between different elementary units, and
thereby to ultimately improve the recognition accuracy of the system.
Instead of exposing, one elementary unit at a time, the relevant
inter-relationships between all potential clusters within this
elementary unit, we first isolate, for each elementary unit, the set
of elementary units which appear to be the most confusable with this
elementary unit, based on some Viterbi criterion.  (Note that this
set will normally include the current elementary unit itself.)  Then,
we expose the relevant inter-relationships between all potential
clusters within each set, and finally prune the tree thus formed so
as to prevent the final clusters for the current elementary unit from
containing data attached to other elementary units within the same
confusable set.  In this manner, not only is the cluster selection
naturally supervised, but it is also geared toward producing clusters
which maximally discriminate the current elementary uni...
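The strategy outlined above can be sketched as follows. All data here are invented for illustration: the confusability values, the top-k rule for forming the confusable set, and the cluster representation are assumptions, and the pruning step is reduced to its essence, namely discarding any candidate cluster for the current unit that contains data attached to other units of its confusable set.

```python
# Sketch (illustrative data): C[u][v] measures how often unit u is
# confused with unit v under some Viterbi criterion.
C = {
    "f1": {"f1": 0.7, "f2": 0.2, "f3": 0.1},
    "f2": {"f1": 0.3, "f2": 0.6, "f3": 0.1},
    "f3": {"f1": 0.0, "f2": 0.1, "f3": 0.9},
}

def confusable_set(unit, k=2):
    """Top-k most confusable units for `unit`; as the article notes,
    this set will normally include the unit itself."""
    return set(sorted(C[unit], key=lambda v: C[unit][v], reverse=True)[:k])

def prune(clusters, unit, confusables):
    """Keep only candidate clusters whose data all belong to `unit`,
    discarding clusters contaminated by data attached to other units
    of the same confusable set."""
    return [c for c in clusters
            if all(owner == unit or owner not in confusables
                   for owner, _ in c)]

# Candidate clusters as (owning unit, data point) pairs.
clusters = [
    [("f1", 0.2), ("f1", 0.3)],   # pure f1 data: survives pruning
    [("f1", 0.5), ("f2", 0.6)],   # mixes in confusable f2: pruned
]
print(prune(clusters, "f1", confusable_set("f1")))
```

The surviving clusters are, by construction, free of data from competing units in the confusable set, which is what gears the selection toward maximum inter-unit discrimination.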