
Self-Normalization vs. Statistics Update for Efficient Use of Additional Training Data

IP.com Disclosure Number: IPCOM000113415D
Original Publication Date: 1994-Aug-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 121K

Publishing Venue

IBM

Related People

Bellegarda, JR: AUTHOR [+2]

Abstract

In a speaker-dependent speech recognition system, it is of paramount importance to make adequate use of all the sentences uttered by each speaker, including the additional data that results from the on-going use of the system by each newly trained user.  Four different ways of exploiting this additional data are isolated in a practical environment such as Tangora, the IBM experimental speech recognizer.  Two of these solutions are viable alternatives for improving recognition accuracy.

This is the abbreviated version, containing approximately 45% of the total text.

Self-Normalization vs. Statistics Update for Efficient Use of Additional Training Data

      We consider a speaker-dependent speech recognition system such
as described in (1), where the pronunciation of each word is
represented by a hidden Markov model.  These Markov word models are
further assumed to be built from a small inventory A_sw of sub-word
acoustic models (2), each representing an allophone, a phone, a
syllable, etc. (3).  Finally, we assume that each sub-word model
consists of a sequence of elementary units (e.g., fenones) drawn
from a small alphabet A_eu.  Typically, the size of the alphabet
A_sw is approximately 2000, while the size of the alphabet A_eu is
approximately 300.
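
      As a rough illustration of this hierarchy, the three levels
(word model, sub-word model drawn from A_sw, elementary unit drawn
from A_eu) might be laid out as in the Python sketch below.  The
sketch is not part of the original disclosure: the class names, the
example lexicon entry, and the identifier values are assumptions
made purely for illustration.

    from dataclasses import dataclass
    from typing import Dict, List

    # Sizes quoted in the text: about 2000 sub-word models, about 300 fenones.
    SIZE_A_SW = 2000   # size of the sub-word alphabet A_sw
    SIZE_A_EU = 300    # size of the elementary-unit (fenone) alphabet A_eu

    @dataclass
    class SubWordModel:
        # One entry of A_sw (an allophone, a phone, a syllable, etc.),
        # spelled as a sequence of elementary units (fenones) from A_eu.
        fenone_ids: List[int]   # each id lies in range(SIZE_A_EU)

    @dataclass
    class WordModel:
        # A Markov word model: the pronunciation of one vocabulary word,
        # given as a sequence of sub-word models drawn from A_sw.
        subword_ids: List[int]  # each id lies in range(SIZE_A_SW)

    # A hypothetical lexicon mapping each vocabulary word to its word model.
    lexicon: Dict[str, WordModel] = {
        "data": WordModel(subword_ids=[412, 87]),   # ids chosen arbitrarily
    }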

      The following is concerned with the potential improvement in
recognition that may result as more and more training sentences are
gathered for each user, for example from the on-going use of the
system by each newly trained speaker.  Building on our recent work
in (4), where we presented a new class of algorithms for making use
of previously acquired data while training a new speaker, we can
isolate four ways in which the additional data may be exploited.

      The new class of algorithms disclosed in (4) is based on a
speaker normalization procedure which maps a reference speaker's
acoustic feature space onto the new speaker's acoustic feature space.
The reference data thus transformed can be used to supplement the
original training data supplied by the new speaker.  This strategy
has a number of benefits, including a drastic reduction in the amount
of training data required from new speakers.  It also opens up the
following four possibilities for making use of the additional
training data (a code sketch of these options follows the list below).

1.  (Transformation Update) The additional data may be merged with
    the original new speaker's data and the transformation may be
    re-computed using the same reference speaker.

2.  (Self-Normalization) A completely new transformation may be
    computed using the original new speaker's data as reference data
    and the additional data as new data.

3.  (Statistics Update) The transformation may be kept the same and
    the statistics may be updated by re-computing transition and
    output probabilities on the total amount of new speaker's data
    available.

4.  (Self-Normalization with Statistics Update) A self-normalizing
    transformation may be computed as in 2.  and subsequently the
    statistic...
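
      The Python sketch below is offered only as an illustration of
how these four options relate to one another; it is not part of the
original disclosure.  Since the disclosure does not give the form of
the normalization mapping of (4), a simple matching of per-dimension
feature means and variances stands in for it, and retrain_statistics
is only a placeholder for the re-estimation of transition and output
probabilities actually performed in Tangora.  The text of option 4
is cut off in this abbreviated version; its treatment below follows
the heading, which indicates that the self-normalization of option 2
is followed by a statistics update.

    import numpy as np

    def estimate_transform(ref_feats, new_feats):
        # Stand-in for the speaker-normalization procedure of (4): match
        # the per-dimension mean and variance of the reference feature
        # space to the new speaker's feature space.  Both inputs are
        # (n_frames, dim) arrays and need not be aligned or equal length.
        ref_mean, ref_std = ref_feats.mean(axis=0), ref_feats.std(axis=0) + 1e-8
        new_mean, new_std = new_feats.mean(axis=0), new_feats.std(axis=0) + 1e-8
        scale = new_std / ref_std
        shift = new_mean - ref_mean * scale
        return scale, shift

    def apply_transform(feats, transform):
        # Map feature vectors from the reference space into the target space.
        scale, shift = transform
        return feats * scale + shift

    def retrain_statistics(word_models, feats):
        # Placeholder: re-estimate the transition and output probabilities
        # of the hidden Markov word models on `feats` (e.g., by
        # forward-backward training).  Not specified in this excerpt.
        raise NotImplementedError

    # ref_data:       training data from the reference speaker
    # orig_new_data:  the new speaker's original (small) training set
    # extra_new_data: additional data from on-going use of the system

    def option1_transformation_update(ref_data, orig_new_data, extra_new_data):
        # (1) Merge the additional data with the original new-speaker data
        #     and re-compute the transformation against the same reference.
        merged_new = np.vstack([orig_new_data, extra_new_data])
        return estimate_transform(ref_data, merged_new)

    def option2_self_normalization(orig_new_data, extra_new_data):
        # (2) Compute a completely new transformation, treating the
        #     original new-speaker data as reference data and the
        #     additional data as new data.
        return estimate_transform(orig_new_data, extra_new_data)

    def option3_statistics_update(word_models, orig_new_data, extra_new_data):
        # (3) Keep the existing transformation unchanged and re-compute
        #     transition and output probabilities on the total amount of
        #     the new speaker's data available.
        pooled = np.vstack([orig_new_data, extra_new_data])
        return retrain_statistics(word_models, pooled)

    def option4_self_norm_with_update(word_models, orig_new_data, extra_new_data):
        # (4) Compute a self-normalizing transformation as in (2), then
        #     update the statistics.  Pooling the transformed original
        #     data with the additional data is this sketch's assumption,
        #     since the original text is truncated here.
        transform = option2_self_normalization(orig_new_data, extra_new_data)
        pooled = np.vstack([apply_transform(orig_new_data, transform),
                            extra_new_data])
        return retrain_statistics(word_models, pooled)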