Automatic Weighting Function Generation for Dynamic Time Warping Speech Recognition

IP.com Disclosure Number: IPCOM000100488D
Original Publication Date: 1990-Apr-01
Included in the Prior Art Database: 2005-Mar-15
Document File: 3 page(s) / 122K

Publishing Venue

IBM

Related People

Grice, DG: AUTHOR [+4]

Abstract

This invention proposes a method for identifying the parts of similar words that differentiate them from each other. Given a set of confused references from an initial uniform match, the word is "listened to" again (matched) with particular attention paid to the parts of the words that make them unique. The weighting function is derived dynamically and takes into account the actual confused words and the speech patterns of the speaker. The cost of doing this is minimal in terms of processing time, and the method reduces the error rate of the recognizer.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Automatic speech recognition using Dynamic Time Warping is
performed by finding the least-cost path through pairs of speech
patterns (called templates).  The algorithm finds the total cost of
matching an incoming template (the unknown word) with each of the
reference templates.  The reference that yields the lowest score
is considered to be the spoken word.  (A threshold is often set so
that inputs whose best score is too high are rejected as not
understood.)
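
      As a minimal sketch of the uniform match described above, the
following Python computes the least-cost warping path between an unknown
template and each reference, picks the lowest score, and rejects the
input if even the best score exceeds a threshold.  The function names,
the Euclidean frame distance, the step pattern and the toy one-element
feature vectors are all assumptions made for illustration; the
disclosure does not specify them.

```python
from math import inf

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (one time slice each)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_cost(unknown, reference):
    """Total cost of the least-cost warping path between two templates."""
    n, m = len(unknown), len(reference)
    # cost[i][j] = cheapest alignment of unknown[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(unknown[i - 1], reference[j - 1])
            # Assumed step pattern: advance in either template, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(unknown, references, reject_threshold):
    """Return (word, scores); word is None when the best score is too high."""
    scores = {word: dtw_cost(unknown, tmpl) for word, tmpl in references.items()}
    best = min(scores, key=scores.get)
    word = best if scores[best] <= reject_threshold else None
    return word, scores

# Toy templates (hypothetical values): one consonant slice, then vowel slices.
references = {
    "GO": [[5.0], [1.0], [1.0], [1.0]],
    "NO": [[3.0], [1.0], [1.0], [1.0]],
}
unknown = [[4.8], [1.0], [1.0]]  # shorter than the references; warping absorbs this

word, scores = recognize(unknown, references, reject_threshold=1.0)
print(word, scores)
```

      Every time slice contributes to the total with the same weight
here, which is the behavior the weighting function is meant to improve
on.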

      In traditional recognition systems each time "slice" of a
reference template is given equal weight, or importance, in computing
the total cost of matching the input.  When the vocabulary of
reference templates contains similar sounding words, such as GO and
NO, errors can result because the bulk of each word is the vowel "O".
The traditional method would, therefore, emphasize the precise way
the "O" was said instead of giving the initial consonant the most
attention. What is described here is a method to automatically
identify the parts of the references that differentiate them from
each other.  This is a dynamic process since the differentiating part
of a word depends on the word it is confused with.  In the case of
NO and GO, the "N" of NO is more important than the "O".  For NO
and NEW, however, the "O" sound would be more important.  The ability
to identify the differentiating portion of the word allows an
additional processing step which can increase the recognition
accuracy when similar sounding words are encountered.
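
      The effect of weighting individual time slices can be
illustrated with a small sketch.  The weights below are chosen by hand
to emphasize the initial consonant of GO and NO; how the disclosure
derives them automatically is not reproduced in this abbreviated text,
so nothing here should be read as the authors' weighting function.  The
scalar features, the template values and the name weighted_dtw_cost are
likewise assumptions.

```python
from math import inf

def weighted_dtw_cost(unknown, reference, weights=None):
    """DTW cost in which reference slice j contributes weights[j] * distance.

    weights=None gives every slice equal weight (the traditional match).
    Templates are lists of scalar features to keep the sketch short.
    """
    n, m = len(unknown), len(reference)
    if weights is None:
        weights = [1.0] * m
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weights[j - 1] * abs(unknown[i - 1] - reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical templates: first slice is the consonant, the rest the vowel.
GO = [5.0, 1.0, 1.0, 1.0, 1.0]
NO = [3.0, 1.6, 1.6, 1.6, 1.6]
# "GO" spoken with a vowel that happens to sit closer to the NO reference.
spoken = [4.8, 1.55, 1.55, 1.55, 1.55]

# Hand-picked weights (assumption): stress the consonant, average weight 1.
emphasis = [4.0, 0.25, 0.25, 0.25, 0.25]

uniform = {"GO": weighted_dtw_cost(spoken, GO), "NO": weighted_dtw_cost(spoken, NO)}
weighted = {"GO": weighted_dtw_cost(spoken, GO, emphasis),
            "NO": weighted_dtw_cost(spoken, NO, emphasis)}

print("uniform :", min(uniform, key=uniform.get), uniform)
print("weighted:", min(weighted, key=weighted.get), weighted)
```

      With equal weights the four vowel slices outvote the single
consonant slice and NO scores lower; with the emphasis on the first
slice, GO scores lower.  For a different confused pair such as NO and
NEW, the weights would have to stress the vowel instead.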

      The recognition process in such a system would contain three
stages.  The first is a crude matching process which eliminates many
of the words in the vocabulary as not being likely candidates.  The
second stage is a detailed match of those words that remain.  The
third stage is the weighted matching process described here.  If,
after the second stage, there is possible confusion as to the
correct word (because the scores of the best matches were too close),
then this ambiguity remover is invoked.
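
      The control flow of the three stages might look like the
following sketch.  The scoring functions are passed in as parameters
and the ones used in the example are deliberately crude stand-ins (no
time warping, hand-picked weights); the parameters keep and margin, and
every name in the block, are assumptions and not taken from the
disclosure.

```python
def recognize_three_stage(unknown, references, crude, detailed, weighted,
                          keep=3, margin=0.5):
    """Return (word, stage), where stage names the step that decided."""
    # Stage 1: crude match; keep only the `keep` most likely candidates.
    shortlist = sorted(references, key=lambda w: crude(unknown, references[w]))[:keep]

    # Stage 2: detailed match of the words that remain.
    scores = {w: detailed(unknown, references[w]) for w in shortlist}
    best = min(scores, key=scores.get)

    # Candidates whose score is within `margin` of the best are "confused".
    confused = [w for w in shortlist if scores[w] - scores[best] < margin]
    if len(confused) < 2:
        return best, "detailed"

    # Stage 3: ambiguity remover; re-match only the confused words.
    rescored = {w: weighted(unknown, references[w]) for w in confused}
    return min(rescored, key=rescored.get), "weighted"

# Stand-in scorers (assumptions, for illustration only).
def crude(u, r):
    return abs(len(u) - len(r))            # compare durations only

def detailed(u, r):
    return sum(abs(a - b) for a, b in zip(u, r))

def weighted(u, r):
    w = [4.0] + [0.25] * (len(r) - 1)      # hand-picked: stress the first slice
    return sum(wi * abs(a - b) for wi, a, b in zip(w, u, r))

references = {
    "GO":   [5.0, 1.0, 1.0, 1.0, 1.0],
    "NO":   [3.0, 1.6, 1.6, 1.6, 1.6],
    "STOP": [9.0, 7.0, 1.0, 8.0, 8.0, 8.0, 2.0, 2.0],
    "YES":  [6.0, 2.0, 9.0, 9.0, 9.0],
}

ambiguous = [4.8, 1.55, 1.55, 1.55, 1.55]  # GO and NO score close together
clear = [6.0, 2.0, 9.0, 9.0, 9.0]          # unmistakably YES

print(recognize_three_stage(ambiguous, references, crude, detailed, weighted))
print(recognize_three_stage(clear, references, crude, detailed, weighted))
```

      The third stage runs only when at least two candidates survive
within the margin, so its extra cost is paid only for inputs that are
actually ambiguous.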

      For the sake of example, the case of two confused...