Context-Dependent Length-Constrained Phonetic Models for the Acoustic Detailed Match
Original Publication Date: 1995-Jun-01
Included in the Prior Art Database: 2005-Mar-30
Bahl, LR: AUTHOR [+1]
A new algorithm is disclosed for improving the accuracy of speech recognition systems. This is done by improving the acoustic modelling by incorporating length information in the models.
In the IBM* Speech Recognition System, recognition proceeds in two
stages: the first stage performs a fast acoustic match using simple
models to provide a shortlist of candidate words at a given time.
To discriminate between these words in the process of picking the
best, the second stage incorporates a detailed acoustic match using
more sophisticated models for the phones. Typically, the model for a
word is made up by concatenating the models of its constituent
phones, and to account for coarticulation and similar features, the
model for each phone is usually made to depend on the phonetic
context in which it occurs (both the phonetic context within the
word, and across word boundaries) [2,3]. These special instances of
a phone, in a given context, are called allophones. Typically
however, the models corresponding to different allophones of a phone
differ only in their output distributions, and not in their topology
or minimum lengths. This is a potential drawback because the length
of the acoustic sequence corresponding to a phone also varies as a
function of the context in which it occurs. In this disclosure, a
method is described to construct allophonic models in which the machine
topology, minimum lengths and output distributions are all made to
depend on the context in which the phone occurs. The use of these
new context-dependent models results in a reduction in the overall error rate.
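The construction described above, concatenating per-phone models that are conditioned on their phonetic context, can be sketched as follows. All class and function names here are illustrative, not IBM's actual implementation:

```python
# Hypothetical sketch of building a word model by concatenating
# context-dependent phone (allophone) models.

from dataclasses import dataclass

@dataclass
class AllophoneModel:
    phone: str
    left_ctx: str          # preceding phone (may cross a word boundary)
    right_ctx: str         # following phone
    min_length: int = 1    # minimum number of acoustic frames
    # topology and output distributions would also live here

def word_model(phones, left_boundary="sil", right_boundary="sil"):
    """Concatenate allophone models, conditioning each phone on its
    phonetic context within the word and across word boundaries."""
    ctx = [left_boundary] + phones + [right_boundary]
    return [AllophoneModel(phone=ctx[i], left_ctx=ctx[i - 1], right_ctx=ctx[i + 1])
            for i in range(1, len(ctx) - 1)]

models = word_model(["k", "ae", "t"])  # e.g., the word "cat"
print([(m.left_ctx, m.phone, m.right_ctx) for m in models])
# [('sil', 'k', 'ae'), ('k', 'ae', 't'), ('ae', 't', 'sil')]
```

The point of the disclosure is that `min_length` and the topology, not just the output distributions, vary per allophone.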
The algorithm incorporates the following features:
1. The starting point is the phonological rule tree described in
   [2]. Briefly, a tree is grown for each phone that
separates instances of the phone, depending on the context in
which it occurs. Hence the leaves of the tree represent the
allophones of that particular phone.
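A minimal sketch of such a tree for one phone: internal nodes ask yes/no questions about the phonetic context, and each leaf identifies an allophone. The questions and allophone names below are invented for illustration; the actual questions come from the phonological rules of [2]:

```python
# Toy phonological rule tree for phone /t/.

def is_vowel(p):
    return p in {"aa", "ae", "ih", "iy", "uw", "eh"}

class Node:
    def __init__(self, question=None, yes=None, no=None, leaf=None):
        self.question, self.yes, self.no, self.leaf = question, yes, no, leaf

def classify(node, left_ctx, right_ctx):
    """Pour one context down the tree and return its allophone leaf."""
    while node.leaf is None:
        node = node.yes if node.question(left_ctx, right_ctx) else node.no
    return node.leaf

# Split first on whether the right context is a vowel, then on the left.
tree_t = Node(
    question=lambda l, r: is_vowel(r),
    yes=Node(question=lambda l, r: is_vowel(l),
             yes=Node(leaf="t_flap"),        # intervocalic /t/
             no=Node(leaf="t_release")),
    no=Node(leaf="t_unreleased"))

print(classify(tree_t, "ae", "ih"))  # t_flap
```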
2. Next, the training data is poured down this tree, and the label
sequences corresponding to each allophone are collected.
Subsequently, the distribution of the lengths of these sequences is
obtained at each allophone. In order to model this length
distribution, the minimum length of the model corr...
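The step above can be sketched as follows. Since the disclosure is truncated here, the rule used to derive the minimum length from the length distribution (a low percentile, below) is an assumption:

```python
# Sketch of step 2: pour training label sequences down the tree, collect
# the length distribution at each allophone, and derive a per-allophone
# minimum length. Choosing a low percentile as the minimum is an
# assumption; the disclosure does not specify the exact rule.

from collections import defaultdict

def length_histograms(samples, classify):
    """samples: iterable of (left_ctx, label_sequence, right_ctx);
    classify: maps a phonetic context to its allophone leaf."""
    hist = defaultdict(list)
    for left, labels, right in samples:
        hist[classify(left, right)].append(len(labels))
    return hist

def min_lengths(hist, percentile=0.05):
    """Set each allophone's minimum length near the low end of its
    observed length distribution."""
    mins = {}
    for allophone, lengths in hist.items():
        lengths = sorted(lengths)
        idx = int(percentile * (len(lengths) - 1))
        mins[allophone] = lengths[idx]
    return mins

# Toy data: six training instances all landing on one allophone leaf.
samples = [("ae", ["x"] * n, "ih") for n in (3, 4, 4, 5, 7, 9)]
hist = length_histograms(samples, classify=lambda l, r: "t_flap")
print(min_lengths(hist))  # {'t_flap': 3}
```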