
Speaker-Independent Labeling in Discrete-Parameter Continuous Speech Recognition Systems

IP.com Disclosure Number: IPCOM000111254D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 4 page(s) / 123K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

In discrete-parameter speech recognition systems, acoustic labels are emitted at regular intervals (typically 10 ms) by an acoustic processor. In one prominent approach [1], each label is characterized by a mixture of Gaussian distributions intended to model the acoustic parameters associated with that label. Usually, the parameters of the component Gaussians are computed from a sample of training data provided by the speaker to be recognized [1-3].


Speaker-Independent Labeling in Discrete-Parameter Continuous Speech
Recognition Systems

      In discrete-parameter speech recognition systems, acoustic
labels are emitted at regular intervals (typically 10 ms) by an
acoustic processor.  In one prominent approach [1], each label is
characterized by a mixture of Gaussian distributions intended to
model the acoustic parameters associated with that label.  Usually,
the parameters of the component Gaussians are computed from a sample
of training data provided by the speaker to be recognized [1-3].
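
      The labelling operation described above amounts to a
maximum-likelihood vector quantiser.  The sketch below is not part of
the original disclosure; the function and variable names, and the use
of full-covariance mixture components, are assumptions made purely
for illustration.  It assigns each 10 ms parameter vector the label
whose Gaussian mixture gives the highest likelihood.

    import numpy as np
    from scipy.stats import multivariate_normal

    def label_frames(frames, mixtures):
        """Assign each frame the label whose mixture scores it highest.

        frames   : (T, D) array, one acoustic parameter vector per 10 ms
        mixtures : dict mapping label -> list of (weight, mean, cov)
        """
        labels = []
        for x in frames:
            best_label, best_score = None, -np.inf
            for label, components in mixtures.items():
                # Mixture likelihood: weighted sum of component densities.
                score = sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                            for w, m, c in components)
                if score > best_score:
                    best_label, best_score = label, score
            labels.append(best_label)
        return labels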

      The invention below specifies a procedure for computing
speaker-independent Gaussians that may be used for all speakers in a
Markov-model-based recognition system, thus eliminating the need for
the usual vector-quantiser training data.  For best results, the
procedure should be repeated separately for male and female speakers.

The following steps are performed.

1.  Record a large number of training sentences from a wide variety
    of speakers.  There should be at least 15,000 sentences
    (preferably more) and at least 10 speakers (preferably more).
    Each contributing speaker should provide a sufficient number of
    sentences to allow speaker-dependent labelling to be performed as
    described in [1-3].

2.  Compute speaker-dependent Gaussians for each training speaker
    [1-3]  and label their training data with their own personal
    Gaussians.

3.  Train and align each speaker's training data against the Markov
    models used by the recogniser [4].

4.  Using the alignments from Step 3, tag each frame of each speaker
    with the identity of the arc against which it was aligned.  These
    tags indicate the label that we would like the vector quantiser
    to output when presented with the associated frame.

5.  Compute discriminating eigenvectors as in [1], using a sample of
    the available training data.  This sample should be stratified
    across speakers and should consist of at least 5000 sentences.

6.  Splice and project the parameter vectors of each speaker into
    P dimensions as described in [1], using the eigenvectors from
    Step 5 (a sketch of this operation follows the list).  A
    reasonable value for P is 80.  This projection is intended to
    reduce the size of the acoustic parameter vectors as much as
    possible while minimizing the loss of relevant information
    (relevant means useful for discriminating between labels).

7.  Using the projected data from Step 6, and the same sample size as
    in Step 5, compute the eigenvectors and eigenvalues of the
    average within-class covariance matrix (see the sketch after this
    list).  The average within-class covariance matrix is obtained by
    computing the covariance matrix for each label separately, and
    then forming their weighted average, where the weights are equal
    to the relative frequencies of the labels.

8.  Rescale the eigenvectors of Step 7 by dividing each eigenv...
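
      The splice-and-project operation of Step 6 is sketched below.
This is not the code of [1]; the splice width, the padding at the
sentence boundaries, and all names are illustrative assumptions.
Each frame is concatenated with its neighbouring frames, and the
spliced vector is projected onto the first P discriminating
eigenvectors from Step 5.

    import numpy as np

    def splice_and_project(frames, eigvecs, context=4, P=80):
        """Splice each frame with +/- `context` neighbours, then project.

        frames  : (T, D) array of acoustic parameter vectors
        eigvecs : (D * (2*context + 1), >= P) matrix whose columns are
                  the discriminating eigenvectors of Step 5
        returns : (T, P) array of projected vectors
        """
        T, D = frames.shape
        # Repeat the edge frames so every frame has a full context window.
        padded = np.vstack([np.repeat(frames[:1], context, axis=0),
                            frames,
                            np.repeat(frames[-1:], context, axis=0)])
        spliced = np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
        return spliced @ eigvecs[:, :P]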
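
      The average within-class covariance matrix of Step 7 can be
computed as sketched below.  Again this is an illustrative
reconstruction rather than the original code, and the names are
assumptions.  The covariance matrix of each label is weighted by the
relative frequency of that label, and the eigenvectors and
eigenvalues of the weighted average are then obtained.

    import numpy as np

    def within_class_eigenanalysis(vectors, tags):
        """Eigen-decomposition of the average within-class covariance.

        vectors : (N, P) array of projected vectors from Step 6
        tags    : length-N sequence of label tags from Step 4
        returns : (eigenvalues, eigenvectors) of the weighted average of
                  the per-label covariance matrices
        """
        vectors = np.asarray(vectors)
        tags = np.asarray(tags)
        N, P = vectors.shape
        avg_cov = np.zeros((P, P))
        for label in np.unique(tags):
            members = vectors[tags == label]
            if len(members) < 2:
                continue  # too few frames to estimate a covariance
            weight = len(members) / N   # relative frequency of the label
            avg_cov += weight * np.cov(members, rowvar=False)
        # The average is symmetric, so eigh returns real eigenpairs.
        eigenvalues, eigenvectors = np.linalg.eigh(avg_cov)
        return eigenvalues, eigenvectors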