
Simplified Speaker Normalization Procedure for Efficient Training with Limited Data

IP.com Disclosure Number: IPCOM000111012D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 113K

Publishing Venue

IBM

Related People

Bellegarda, JR: AUTHOR [+2]

Abstract

In a speaker-dependent speech recognition system, it is often advantageous to make use of previously acquired reference data while training a new speaker to the system. A more efficient version of a recent speaker normalization algorithm is proposed for implementation in a practical environment such as Tangora, the IBM experimental speech recognizer.


Simplified Speaker Normalization Procedure for Efficient Training
with Limited Data

      In a speaker-dependent speech recognition system, it is often
advantageous to make use of previously acquired reference data while
training a new speaker to the system.  A more efficient version of a
recent speaker normalization algorithm is proposed for implementation
in a practical environment such as Tangora, the IBM experimental
speech recognizer.

      Consider a speaker-dependent speech recognition system such as
that described in [1], where the pronunciation of each word is
represented by a hidden Markov model.  These Markov word models are
further assumed to be composed of a small inventory Asw of sub-word
acoustic models [2], each representing an allophone, a phone, a
syllable, etc. [3].  Finally, it is assumed that each sub-word model
comprises a sequence of elementary units (e.g., fenones) drawn from a
small alphabet Aeu.
Typically, the size of the alphabet Asw is approximately 2000, while
the size of the alphabet Aeu is approximately 300.
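
      For illustration only, the hierarchy just described might be
organized as in the following Python sketch; the class and field
names are hypothetical and do not correspond to actual Tangora data
structures.

    from dataclasses import dataclass
    from typing import List

    # Approximate alphabet sizes quoted in the text above.
    SIZE_ASW = 2000   # sub-word acoustic models (allophones, phones, syllables, ...)
    SIZE_AEU = 300    # elementary units (e.g., fenones)

    @dataclass
    class ElementaryUnit:
        """One elementary unit (e.g., a fenone) drawn from the alphabet Aeu."""
        unit_id: int                   # index into Aeu, 0 <= unit_id < SIZE_AEU

    @dataclass
    class SubWordModel:
        """One sub-word acoustic model from the alphabet Asw, realized as a
        sequence of elementary units."""
        model_id: int                  # index into Asw, 0 <= model_id < SIZE_ASW
        units: List[ElementaryUnit]    # the fenone string for this model

    @dataclass
    class WordModel:
        """Hidden Markov word model assembled from sub-word models."""
        spelling: str
        sub_words: List[SubWordModel]  # baseform of the word in terms of Asw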

      The following is concerned with making use of previously
acquired reference data while training a new speaker to the system.
In [4], a general speaker normalization procedure for achieving this
result is presented.  This strategy has a number of benefits,
including a drastic reduction in the amount of training data required
from new speakers, a decrease in the amount of training CPU time
necessary for new speakers, and improved vector-quantization accuracy
in discrete parameter systems by effectively increasing the amount of
training data.  However, it also incurs increased computational
complexity compared with, for example, the non-normalizing,
single-speaker procedure described in [5].  In a practical situation
such as the Tangora environment, this added complexity may negate the
leverage brought about by the additional training data by
complicating the implementation of the algorithm.
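
      To make this trade-off concrete, the following Python sketch
shows, under stated assumptions, how normalized reference data could
be pooled with the new speaker's limited data; the function name and
the piecewise_linear_transform callable are hypothetical stand-ins
for the procedure of [4], whose details are not reproduced here.

    import numpy as np

    def pooled_training_data(new_speaker_frames, reference_frames,
                             piecewise_linear_transform):
        """Pool the new speaker's limited data with normalized reference data.

        new_speaker_frames:         (n, D) acoustic frames from the new speaker.
        reference_frames:           (N, D) frames from a previously trained speaker.
        piecewise_linear_transform: hypothetical callable mapping one reference
                                    frame into the new speaker's acoustic space,
                                    standing in for the normalization of [4].
        """
        normalized = np.apply_along_axis(piecewise_linear_transform, 1,
                                         reference_frames)
        # The extra cost comes from transforming (and, in [4], re-clustering)
        # the N reference frames; the benefit is an effective training set of
        # size n + N instead of only n.
        return np.vstack([new_speaker_frames, normalized])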

      The present article reports on two ways to cut back on the
amount of computations and thereby considerably improve execution
speed.

      First, the re-clustering of the transformed data from the
original untransformed seeds is unnecessary.  This step was
originally motivated by the fact that the piecewise linear
transformation described in [4] does not preserve Euclidean distance.
Thus, the cluster centroids for the transformed data are generally
different from the initial cluster centroids for the untransformed
data.  However, these centroids are only useful to come up with
preliminary prototype distributions for each elementary model in the
alphabet Aeu.  As long as the transformation does not entail drastic
changes in the clusters, it is conceivable that the original cluster
centroids could be used for this purpose without loss of labelling
accuracy.
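
      The following Python sketch contrasts, for illustration, the
original path (re-clustering the transformed data from the
untransformed seeds) with the simplification proposed here (reusing
the original centroids directly); all names are hypothetical and the
clustering loop is only a schematic k-means-style refinement.

    import numpy as np

    def preliminary_prototypes(reference_frames, original_centroids,
                               transform, recluster=False, n_iter=5):
        """Seed preliminary prototype distributions for the units in Aeu.

        reference_frames:   (N, D) reference-speaker acoustic frames.
        original_centroids: (K, D) centroids computed on the untransformed data.
        transform:          hypothetical piecewise-linear mapping applied
                            frame by frame (the normalization of [4]).
        recluster:          True mimics the original procedure; False applies
                            the simplification proposed in this article.
        """
        if not recluster:
            # Simplification: assume the transformation does not change the
            # cluster structure drastically, so the original centroids remain
            # adequate seeds for the prototype distributions.
            return original_centroids

        # Original, more expensive path: refine the untransformed seeds with a
        # few k-means-style passes over the transformed reference data.
        transformed = np.apply_along_axis(transform, 1, reference_frames)
        centroids = original_centroids.astype(float).copy()
        for _ in range(n_iter):
            dist = np.linalg.norm(transformed[:, None, :] - centroids[None, :, :],
                                  axis=2)
            labels = dist.argmin(axis=1)
            for k in range(centroids.shape[0]):
                members = transformed[labels == k]
                if len(members):
                    centroids[k] = members.mean(axis=0)
        return centroids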

      Second, it is possible to use a smaller amount of transformed
reference...