
Construction of Context Dependent Label Prototypes for Use in a Speech Recognition System

IP.com Disclosure Number: IPCOM000106678D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 2 page(s) / 108K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

In discrete-parameter speech recognition systems, a vector quantiser outputs an acoustic label at regular intervals. In one prominent approach to speech recognition [1], each label is characterized by a "prototype" consisting of a mixture of diagonal Gaussian distributions, and the label output identifies the prototype which maximizes the likelihood of a corresponding acoustic parameter vector. There is one prototype mixture per label, and it does not depend on the phonetic context of the frame being labelled.



      The invention below generalizes the concept of Gaussian mixture
prototypes so as to make them context-dependent.  Instead of having
one mixture per label, there are several mixtures per label, one of
which will be selected to assess the likelihood of an acoustic
vector.  The appropriate mixture is determined from the phonetic
context of the corresponding frame.
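      As a rough illustration of this idea, the labelling step with
context-selected mixtures might be sketched as follows.  All names,
data structures, and the mixture-selection callback here are
hypothetical, not taken from the disclosure:

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of vector x under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def mixture_log_likelihood(x, mixture):
    """Log likelihood of x under a mixture of diagonal Gaussians.

    mixture: list of (weight, mean_vector, variance_vector) tuples.
    """
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in mixture]
    mx = max(logs)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def label_frame(x, context, prototypes, select_mixture):
    """Output the label whose context-selected mixture maximizes
    the likelihood of the acoustic parameter vector x."""
    return max(
        prototypes,
        key=lambda label: mixture_log_likelihood(
            x, select_mixture(prototypes[label], context)
        ),
    )
```

The essential difference from the conventional scheme is the
select_mixture step: with one mixture per label it is the identity,
whereas here it picks among several mixtures using the phonetic
context of the frame.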

      The idea of context-dependent acoustic modelling is not new: it
forms the basis of the acoustic Markov word models previously
described in [2].  But in [2] it is the word models that are context
dependent, not the label prototypes as advocated here.

      Others have also recently suggested the use of
context-dependent prototypes [3].  The invention below differs from
[3]  in the following principal ways.  First, we advocate different
acoustic parameters.  Second, we obtain context-dependency rules via
decision trees which maximize Gaussian likelihoods, whereas in [3]
they seek to minimize Euclidean distances.  Third, in [3] decision
trees are constructed for each vector element separately, whereas
here a single decision tree is built for the entire vector.  Fourth, the
prototypes in [3] consist of single Gaussians.  The present invention
uses a mixture of diagonal Gaussians.  Fifth, in [3]  the
context-dependency rules cover only one phoneme on each side of the
frame being modelled, whereas the method below covers several on each
side: typically five.
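      The second difference above can be made concrete with a small
sketch of likelihood-based node splitting: at each tree node, choose
the phonetic question whose yes/no split of the training samples
maximizes the total Gaussian log likelihood of the two child sets.
The one-dimensional data layout and question representation below
are illustrative assumptions, not taken from the disclosure:

```python
import math

def gaussian_loglik(data):
    """Log likelihood of 1-D samples under their own ML Gaussian fit."""
    n = len(data)
    mean = sum(data) / n
    var = max(sum((x - mean) ** 2 for x in data) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_question(samples, questions):
    """samples: list of (context, value); questions: list of predicates
    on the context.  Return the question whose split gives the largest
    gain in total log likelihood over the unsplit node."""
    best, best_gain = None, -math.inf
    base = gaussian_loglik([v for _, v in samples])
    for q in questions:
        yes = [v for c, v in samples if q(c)]
        no = [v for c, v in samples if not q(c)]
        if not yes or not no:
            continue  # question does not actually split the data
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - base
        if gain > best_gain:
            best, best_gain = q, gain
    return best
```

In the method described here the fit would be over the entire
acoustic vector with a single tree, rather than per element as in
[3]; the 1-D case above is only for brevity.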

      Assume that some training data has been recorded, signal
processed, and Viterbi aligned against phoneme-based Markov word
models as described in [1,4].

      Assume further the existence of some phonetically
meaningful questions which may be used to construct phonological
trees.  These questions, which may be applied to any phone P in the
neighborhood of the frame being processed, usually take the form "is
P a member of the set S?".  Here S denotes a set containing one or
more phonetic phones having something in common.  The necessary sets
may be obtained from almost any phonetic text book.  For present
purposes, "word boundary" is also considered to be a phonetic phone.

The following steps are performed.

1.  Using the existing Viterbi alignments, tag each parameter vector
    in the training data with (1) the identity of the arc against
    which the vector w...