Least Squares Projection for Speaker-Normalization in Speech Recognition Systems

IP.com Disclosure Number: IPCOM000111224D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 86K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

It is desirable that speech recognition systems be able to make use of previous speech when encountering new speakers. For this reason, procedures are described in [1,2] which attempt to map a new speaker's acoustic parameter vectors as closely as possible to corresponding vectors from a reference speaker. The first step in [2] is to project the acoustic parameter vectors of a new speaker into a reference speaker's vector space (usually of lower dimension) using a least-squares linear mapping. This mapping is crucial: if it is inaccurate, information is permanently lost, resulting in poor acoustic processing and a high error rate. The procedure below details an iterative algorithm for obtaining a much improved initial mapping.


      In [2] the initial speaker-normalizing mapping is computed via
least squares from a collection of matched vector pairs.  Matching
vectors are identified by tags, which are obtained with extreme
precision in [1].  Despite this careful tagging, vectors with
identical tags are often not acoustically similar, owing to speaker
variability.
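As a minimal sketch (not the disclosure's implementation), the initial mapping can be obtained with an ordinary least-squares solve over the matched pairs; the dimensions, variable names, and use of NumPy here are illustrative assumptions:

```python
import numpy as np

# Hypothetical shapes: X holds the new speaker's tagged vectors
# (n x d_new), Y the matched reference-speaker vectors (n x d_ref),
# with the reference space of lower dimension (d_ref < d_new).
rng = np.random.default_rng(0)
n, d_new, d_ref = 200, 20, 12
X = rng.standard_normal((n, d_new))
Y = rng.standard_normal((n, d_ref))

# Solve min_A ||X A - Y||^2 (Frobenius norm); A projects new-speaker
# vectors into the lower-dimensional reference space.
A, *_ = np.linalg.lstsq(X, Y, rcond=None)
projected = X @ A  # projected new-speaker vectors, shape (n, d_ref)
```

Random data is used only to make the sketch self-contained; in the procedure of [2], X and Y would come from tag-matched acoustic parameter vectors.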

      The following iterative algorithm uses tagged vector pairs to
get started, but then performs iterative refinements ignoring the
tags, so as to obtain an improved mapping.  By ignoring the tags, the
mapping is much less influenced by the noise attributable to
dissimilar vectors with identical tags.  In addition, if the ultimate
purpose of mapping a new and reference speaker together is to label
the new speaker using reference-speaker prototypes, it makes more
sense to compute the mapping on the basis of matched vector/prototype
pairs, as the algorithm does, instead of matched vector/vector pairs.
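The refinement idea in this paragraph — project with the current mapping, match each projected vector to its nearest reference prototype, and re-fit the mapping from the resulting vector/prototype pairs — can be sketched as below. The function name, argument names, and use of NumPy are assumptions, not the disclosure's code:

```python
import numpy as np

def refine_mapping(X, prototypes, A, iters=5):
    """Iteratively refine the projection A, ignoring tags.

    X          : new speaker's vectors, shape (n, d_new)
    prototypes : reference-speaker prototypes, shape (k, d_ref)
    A          : current linear mapping, shape (d_new, d_ref)
    """
    for _ in range(iters):
        Z = X @ A  # project the new speaker with the current mapping
        # Squared Euclidean distance from each projected vector to
        # each prototype; pick the nearest prototype as the target.
        d2 = ((Z[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
        targets = prototypes[d2.argmin(axis=1)]
        # Re-fit A by least squares on the vector/prototype pairs.
        A, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return A
```

Because the matching step uses only acoustic distance, dissimilar vectors that happen to share a tag no longer pull the mapping off course.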

The following steps are performed.

Step 1.  Perform Step 2 for each sub-word unit (e.g. phone) in the
alphabet of such units.

Step 2.  Perform iterative K-means Euclidean clustering of the
reference speaker's acoustic parameter vectors as described in [3] so
as to partition the reference data into sub-classes, each with its
own centroid.
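A plain Euclidean K-means of the kind Step 2 calls for (the specifics of [3] are omitted here) might look like the following sketch; the function name, parameters, and initialization scheme are illustrative assumptions:

```python
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Euclidean K-means: partition one sub-word unit's reference
    vectors into k sub-classes, each with its own centroid."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Recompute each non-empty sub-class centroid.
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Running this once per sub-word unit (Step 1) yields the per-unit sub-classes and centroids that serve as the reference prototypes in the later steps.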

Step 3.  Compute the least-squares linear mapping from the new
speaker's vectors to the reference speaker's vectors on the ba...