Adaptation of Acoustic Spectral Prototypes by Maximum Likelihood Estimation of Affine Linear Transformations of Gaussian Mixtures

IP.com Disclosure Number: IPCOM000113468D
Original Publication Date: 1994-Aug-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 4 page(s) / 116K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

Reference speech aligned to a phone is modeled as having a known mixture of Gaussian densities. New speech is modeled as a mixand-dependent noisy affine linear transformation of reference speech. Adaptation consists of estimating the noise covariance and the collection of affine linear transformations by maximum likelihood based on alignment-matched data.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 44% of the total text.

      Reference speech aligned to a phone is modeled as having a
known mixture of Gaussian densities.  New speech is modeled as a
mixand-dependent noisy affine linear transformation of reference
speech.  Adaptation consists of estimating the noise covariance and
the collection of affine linear transformations by maximum likelihood
based on alignment-matched data.

(1)  A completely specified Gaussian mixture model for each phone to
be represented by a prototype (aka "Z-prototype", as in (*)):

   $$p(x) = \sum_{i=1}^{k} p_i \, g(x \mid \mu_i, \Sigma_i) \eqno(1)$$

where $x$ is a spectral vector and $g$ is a $d$-dimensional Gaussian
density.  The mixand probabilities $p_i$, the mean vectors $\mu_i$,
and the nonsingular covariance matrices $\Sigma_i$ are all assumed
known.  This information is available in the case of (plentiful)
reference data.
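
      As an illustration of equation (1), the following minimal
sketch evaluates the mixture density at a spectral vector.  It
assumes NumPy; the function names and the two-component example
prototype are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """d-dimensional Gaussian density g(x | mean, cov), cov nonsingular."""
    d = mean.shape[0]
    diff = x - mean
    # Solve cov * z = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)))

def mixture_density(x, weights, means, covs):
    """p(x) = sum_i p_i g(x | mu_i, Sigma_i), as in equation (1)."""
    return sum(w * gaussian_density(x, m, c)
               for w, m, c in zip(weights, means, covs))

# Hypothetical two-mixand prototype in d = 2 dimensions (illustrative values).
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0], [3.0, 1.0]])
covs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.3], [0.3, 0.5]]])
density_at_origin = mixture_density(np.zeros(2), weights, means, covs)
```

Working with the log-determinant and a linear solve is a numerical
convenience chosen for this sketch; the disclosure only requires that
the covariance matrices be nonsingular.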

(2)  Matched pairs of spectral vectors $(X_t, Y_t)$, $t = 1, \ldots, N$,
from reference and new speech.  Such pairing of spectral vectors does
not occur naturally, but we can and do manufacture it, e.g. by
associating two vectors if they belong to the same leafemic phone
based on Viterbi alignments.  We arbitrarily choose one of the many
pairings possible between two sets of vectors.  This may be done by
any of the following methods: (a) reduce the larger set by choosing a
random subset to match the size of the smaller, and associate pairs
based on a random permutation; (b) reduce both sets to their
centroids; (c) match every vector of each set with every vector of
the other set; etc.  The abundance and reliability of such pairs is
largely determined by the amount of new speech data available.
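
      The three pairing schemes listed above (random subset with
random permutation, centroids, and all pairs) can be sketched as
follows for a single phone.  This is a hedged illustration assuming
NumPy; the function names are hypothetical, and `ref` and `new` stand
for the reference and new-speech vectors already aligned to the same
phone, stored one vector per row.

```python
import numpy as np

def pair_random_subset(ref, new, rng):
    """Shrink the larger set to a random subset; pair in random order."""
    n = min(len(ref), len(new))
    # Sampling without replacement yields a random subset in random order,
    # i.e. a random permutation when the set already has size n.
    ref_idx = rng.choice(len(ref), size=n, replace=False)
    new_idx = rng.choice(len(new), size=n, replace=False)
    return ref[ref_idx], new[new_idx]

def pair_centroids(ref, new):
    """Reduce both sets to their centroids, giving a single pair."""
    return ref.mean(axis=0, keepdims=True), new.mean(axis=0, keepdims=True)

def pair_all(ref, new):
    """Match every reference vector with every new-speech vector."""
    X = np.repeat(ref, len(new), axis=0)   # each ref row repeated len(new) times
    Y = np.tile(new, (len(ref), 1))        # new rows cycled len(ref) times
    return X, Y
```

In this sketch the all-pairs scheme produces `len(ref) * len(new)`
pairs, so it grows quickly with the amount of data, whereas the
centroid scheme yields only one pair per phone.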

(3)  The assumption that the $X_t$ are independent samples from the
known mixture density $p(x)$, and the assumption that the random
mixture index $I_t$ is such that $Y_t$ is related to the pair
$(X_t, I_t)$ by

   $$Y_t = A_j X_t + B_j + \epsilon_t, \qquad j = I_t, \eqno(2)$$

where $A_j$ is a $d \times d$ matrix and $B_j$ is a $d \times 1$
vector, and where the $\epsilon_t$ are mutually independent zero-mean
Gaussian noise errors with common covariance matrix $\Gamma$,
independent of the class index $I_t$.
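
      To make the generative assumption of equation (2) concrete,
the sketch below draws synthetic matched pairs from it: a mixture
index is drawn with the mixand probabilities, the reference vector is
drawn from the corresponding Gaussian, and the new-speech vector is
its affine image plus Gaussian noise.  It assumes NumPy; the function
name and argument layout are hypothetical.

```python
import numpy as np

def sample_pairs(n, weights, means, covs, A, B, gamma, rng):
    """Draw n pairs (X_t, Y_t) and indices I_t from the assumed model."""
    k, d = means.shape
    idx = rng.choice(k, size=n, p=weights)            # mixture index I_t
    X = np.empty((n, d))
    Y = np.empty((n, d))
    for t, j in enumerate(idx):
        X[t] = rng.multivariate_normal(means[j], covs[j])
        eps = rng.multivariate_normal(np.zeros(d), gamma)  # common noise covariance
        Y[t] = A[j] @ X[t] + B[j] + eps                # equation (2) with j = I_t
    return X, Y, idx
```

Synthetic data of this kind is only a convenience for checking an
estimation routine against known transformations; in the disclosure
the pairs come from alignment-matched speech.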

      Classify each vector $Z_t = (X_t, Y_t)$ by choosing the index
as $I_t = \arg\max_j p(j \mid X_t)$.  In each of the $k$ classes,
estimate $A_j$, $B_j$, $1 \le j \le k$, by the maximum likelihood
estimates in multivariate Gaussian regression.  Estimate $\Gamma$ as
the covariance matrix of the residual errors, i.e. of
$Y_t - A_{I_t} X_t - B_{I_t}$ for all $t = 1, \ldots, N$.
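
      The classification and regression steps just described can be
sketched as below.  This is a minimal illustration assuming NumPy,
not the authors' implementation: with a common Gaussian noise
covariance, the maximum likelihood regression estimates in each class
coincide with ordinary least squares, which is what the sketch uses.
The function names, the fallback to an identity map for classes with
too few vectors, and the division of the residual cross-product by N
are assumptions of this sketch.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Row-wise log g(X_t | mean, cov)."""
    d = mean.shape[0]
    diff = X - mean
    maha = np.einsum("ti,ti->t", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def classify(X, weights, means, covs):
    """I_t = argmax_j p(j | X_t); posterior is proportional to p_j g(X_t | mu_j, Sigma_j)."""
    scores = np.stack([np.log(w) + log_gaussian(X, m, c)
                       for w, m, c in zip(weights, means, covs)], axis=1)
    return scores.argmax(axis=1)

def initialize(X, Y, labels, k):
    """Per-class least-squares fit of Y ~ A_j X + B_j, then residual covariance."""
    n, d = X.shape
    A = np.tile(np.eye(d), (k, 1, 1))
    B = np.zeros((k, d))
    resid = np.empty((n, d))
    for j in range(k):
        mask = labels == j
        count = int(mask.sum())
        if count > d:  # assumed guard: sparse classes keep the identity map
            design = np.hstack([X[mask], np.ones((count, 1))])
            coef, *_ = np.linalg.lstsq(design, Y[mask], rcond=None)
            A[j], B[j] = coef[:d].T, coef[d]
        resid[mask] = Y[mask] - X[mask] @ A[j].T - B[j]
    gamma = resid.T @ resid / n   # assumed normalization by N
    return A, B, gamma
```

A typical call under these assumptions would be
`labels = classify(X, weights, means, covs)` followed by
`A, B, gamma = initialize(X, Y, labels, k)`.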

ITERATE the following two steps.

      E-STEP - Using the fixed means, covariances, and class
probabilities $\mu_j$, $\Sigma_j$, $p_j$ together with the latest
version of the estimated transformations and noise covariance
$A_j$, $B_j$, $\Gamma$, construct the conditional mean v...