Automatic Error Detection With Markov Word Models in Automatic Speech Recognition

IP.com Disclosure Number: IPCOM000036552D
Original Publication Date: 1989-Oct-01
Included in the Prior Art Database: 2005-Jan-29
Document File: 4 page(s) / 47K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

In accordance with the invention, parameters derived or derivable during the process of recognizing speech with Markov word models are used in determining if an error in recognition has occurred.

This is the abbreviated version, containing approximately 44% of the total text.

In performing speech recognition with Markov models, a plurality of labels (or fenemes) are defined, each of which represents a respective sound type that may be associated with an interval of time, for example, one centisecond. Each word in a vocabulary is typically defined by a Markov model that includes (a) a plurality of states, (b) transitions, each of which extends from a state to a state, (c) a probability associated with each transition, and (d) a probability indicating the likelihood that a particular feneme is produced at a particular transition. The various probabilities are determined during a training session.
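The four components (a) through (d) map naturally onto a small data structure. The following is a minimal sketch only: the names `Transition`, `WordModel` and `is_consistent`, the three-state layout, the feneme labels "A" and "B", and all probability values are assumptions made for illustration and do not come from the disclosure, in which the probabilities result from training.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transition:
    src: int                   # state the transition leaves
    dst: int                   # state the transition enters
    prob: float                # (c) probability of taking this transition
    output: Dict[str, float]   # (d) P(feneme | this transition)


@dataclass
class WordModel:
    word: str
    n_states: int                  # (a) states are numbered 0 .. n_states - 1
    transitions: List[Transition]  # (b) transitions between states

    def is_consistent(self, tol: float = 1e-9) -> bool:
        """Check that the stored probabilities form proper distributions."""
        leaving = [0.0] * self.n_states
        for t in self.transitions:
            if not (0 <= t.src < self.n_states and 0 <= t.dst < self.n_states):
                return False
            # Feneme probabilities on each transition should sum to 1.
            if abs(sum(t.output.values()) - 1.0) > tol:
                return False
            leaving[t.src] += t.prob
        # A state either has no outgoing transitions (for example a final
        # state) or its outgoing transition probabilities sum to 1.
        return all(m == 0.0 or abs(m - 1.0) <= tol for m in leaving)


# Hypothetical model: state 0 mostly emits "A", state 1 mostly emits "B".
model = WordModel(
    word="ab",
    n_states=3,
    transitions=[
        Transition(0, 0, 0.5, {"A": 0.9, "B": 0.1}),
        Transition(0, 1, 0.5, {"A": 0.9, "B": 0.1}),
        Transition(1, 1, 0.5, {"A": 0.1, "B": 0.9}),
        Transition(1, 2, 0.5, {"A": 0.1, "B": 0.9}),
    ],
)
print(model.is_consistent())
```

Attaching the feneme distribution to the transition rather than to the state follows item (d), which speaks of a feneme being produced at a particular transition.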

In the recognition process, spoken input is converted into a string of fenemes by an acoustic processor. A match is then performed to determine which word or words, based on their respective Markov models, have the highest probability of having produced the feneme string. In performing the match, several approximations may be employed to provide a "fast approximate" match. Alternatively, the information contained in the Markov word models may be used without employing approximations, to yield a "detailed" match. Preferably, the "fast" match eliminates the more improbable words from consideration, thereby providing a short list of candidate words. For each candidate word, an approximate acoustic match score is computed, and all candidate words are ranked according to their respective match scores.
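The ranking step can be sketched as follows. The disclosure does not say which approximations the fast match uses, so `approx_score` below is a deliberately crude stand-in that ignores the order of states altogether; it is not the approximation used by the authors. The names `approx_score`, `short_list` and `FLOOR`, the tuple layout `(src, dst, prob, output)`, and the toy vocabulary are all assumptions made for illustration.

```python
import math

FLOOR = 1e-10  # assumed floor so an unseen feneme does not produce log(0)


def approx_score(transitions, fenemes):
    """Stand-in approximate score: best output probability per feneme,
    taken over all transitions of the word, summed in the log domain."""
    total = 0.0
    for f in fenemes:
        best = max(out.get(f, 0.0) for _src, _dst, _p, out in transitions)
        total += math.log(max(best, FLOOR))
    return total


def short_list(models, fenemes, size):
    """Rank all words by approximate score and keep the `size` best."""
    scored = [(approx_score(tr, fenemes), word) for word, tr in models.items()]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:size]]


# Hypothetical vocabulary; each transition is (src, dst, prob, output).
models = {
    "ab": [(0, 0, 0.5, {"A": 0.9, "B": 0.1}), (0, 1, 0.5, {"A": 0.9, "B": 0.1}),
           (1, 1, 0.5, {"A": 0.1, "B": 0.9}), (1, 2, 0.5, {"A": 0.1, "B": 0.9})],
    "ba": [(0, 0, 0.5, {"B": 0.9, "A": 0.1}), (0, 1, 0.5, {"B": 0.9, "A": 0.1}),
           (1, 1, 0.5, {"B": 0.1, "A": 0.9}), (1, 2, 0.5, {"B": 0.1, "A": 0.9})],
    "cc": [(0, 0, 0.5, {"C": 1.0}), (0, 1, 0.5, {"C": 1.0})],
}

candidates = short_list(models, ["A", "A", "B", "B"], size=2)
print(candidates)
```

With this stand-in, "ab" and "ba" receive the same score because state order is ignored, while "cc" is eliminated. That is the intended division of labor: the cheap pass prunes improbable words, and the surviving candidates are left for a more exact comparison.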

Following the fast match, a detailed match is preferably performed on the candidate words in the short list. A language model which considers the context of words may also be applied to further assure correct recognition.

Note that the detailed match is performed time frame by time frame over the sequence of fenemes constituting the utterance.

At any particular point in the utterance, the probability associated with each state in the word model can be computed, and the state with maximum probability noted. In this way, a "state sequence" is obtained specifying the most probable state at each point in time during the utterance. This state sequence usually increases linearly for a correct match. A "match profile" is produced during the detailed match. The profile is computed on a frame-by-frame basis, and its value at any point in time is essentially just the sum of all the state probabilities at that time. In practice, the profile values are divided by their respective expected values under the assumption that the match is correct. Consequently, when the match is indeed correct, the profile should be linear; departures from linearity indicate errors. The end of the profile is taken to be that point in time where the probability of the final state in the w...
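The state sequence and match profile described above can be sketched with a standard forward pass over the feneme string. This is an illustrative sketch under several assumptions rather than the authors' procedure: the expected profile values are simply passed in by the caller, since the available text does not say how they are obtained, and the largest deviation from a least-squares line is used as one possible measure of departure from linearity, a choice the text does not specify. The names `detailed_match`, `normalize` and `max_line_residual` and the toy model are hypothetical.

```python
def detailed_match(transitions, n_states, fenemes):
    """Forward pass, one frame at a time.

    Returns (state_sequence, profile): the most probable state after each
    frame, and the sum of all state probabilities after each frame.
    Each transition is a tuple (src, dst, prob, output).
    """
    alpha = [0.0] * n_states
    alpha[0] = 1.0  # assume the word starts in state 0
    state_sequence, profile = [], []
    for f in fenemes:
        nxt = [0.0] * n_states
        for src, dst, p, out in transitions:
            nxt[dst] += alpha[src] * p * out.get(f, 0.0)
        alpha = nxt
        state_sequence.append(max(range(n_states), key=lambda s: alpha[s]))
        profile.append(sum(alpha))
    return state_sequence, profile


def normalize(profile, expected):
    """Divide each profile value by its expected value (supplied by caller)."""
    return [v / e for v, e in zip(profile, expected)]


def max_line_residual(values):
    """Largest absolute deviation from the least-squares line through
    the points (t, values[t]); 0.0 means perfectly linear."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values)) / var_t
    intercept = mean_v - slope * mean_t
    return max(abs(v - (slope * t + intercept)) for t, v in enumerate(values))


# Hypothetical three-state word model and feneme string.
transitions = [
    (0, 0, 0.5, {"A": 0.9, "B": 0.1}), (0, 1, 0.5, {"A": 0.9, "B": 0.1}),
    (1, 1, 0.5, {"A": 0.1, "B": 0.9}), (1, 2, 0.5, {"A": 0.1, "B": 0.9}),
]
state_sequence, profile = detailed_match(transitions, 3, ["A", "A", "B", "B"])
print(state_sequence, profile)
```

A caller would apply `max_line_residual` to the state sequence or to the normalized profile and flag a possible recognition error when the residual exceeds some threshold; the threshold itself would have to be chosen empirically and is not given here.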