
Growing Phonetic Baseforms From Multiple Utterances in Speech Recognition

IP.com Disclosure Number: IPCOM000039977D
Original Publication Date: 1987-Sep-01
Included in the Prior Art Database: 2005-Feb-01
Document File: 2 page(s) / 14K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+4]

Abstract

The most likely sequence of hidden Markov model phones which constitute a vocabulary word is determined by (a) generating a string Si (where 1 ≤ i ≤ n) of labels (speech prototype vectors) for each of n utterances of a word; (b) determining the probability of each string Si given a prescribed sequence Pj of phones; (c) computing from these an aggregate probability Pragg; (d) multiplying Pragg by the prior probability of Pj to provide a joint probability; (e) repeating steps (a) through (d) for each of a plurality of phone sequences Pj; and (f) by iterative stack decoding, determining which phone sequence has the best joint probability (for the Si strings) above a prescribed threshold. The stack decoding involves determining a first probability measure based on acoustics and a second probability measure based on a language model.



Growing Phonetic Baseforms From Multiple Utterances in Speech Recognition

The most likely sequence of hidden Markov model phones which constitute a vocabulary word is determined by (a) generating a string Si (where 1 ≤ i ≤ n) of labels (speech prototype vectors) for each of n utterances of a word; (b) determining the probability of each string Si given a prescribed sequence Pj of phones; (c) computing from these an aggregate probability Pragg; (d) multiplying Pragg by the prior probability of Pj to provide a joint probability; (e) repeating steps (a) through (d) for each of a plurality of phone sequences Pj; and (f) by iterative stack decoding, determining which phone sequence has the best joint probability (for the Si strings) above a prescribed threshold. The stack decoding involves determining a first probability measure based on acoustics and a second probability measure based on a language model. Both measures are based on phones rather than words -- the language model preferably being generated to indicate the likelihood of a given phone following a prior sequence of phones.

A speech recognizer includes an acoustic processor which, in response to an utterance, generates a string of labels. The labels are selected from an alphabet of labels. Each label corresponds to a prototype vector in which each vector component corresponds to a feature of speech (for example, energy in a particular frequency band during a prescribed interval of time).

The speech recognizer also stores a set of Markov model phones, each having a plurality of states connected by transitions. Each transition has a probability and, for at least some transitions, there are probabilities of labels being generated thereat. The probabilities are determined from a training session during which known strings of labels are produced in response to the uttering of known phone sequences. The transition and label probabilities assigned to the phone models are used in determining the acoustic probability that a given phone model (or sequence of phone models) produces some string of labels.

A second probability measure involves the phone language model (PLM). Phone sequences are viewed as m-grams (i.e., sequences of m phones). The number of times each m-gram occurs in so...
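
The scoring of candidate baseforms in steps (a) through (f) can be sketched as follows. This is a non-authoritative Python sketch: the form of the aggregate probability Pragg in step (c) is not spelled out here, so the sketch assumes it is the product of the n per-utterance probabilities (a sum in the log domain); the iterative stack decoding that grows candidates is replaced by a plain loop over an externally supplied candidate list, and the helper names (acoustic_log_prob, prior_log_prob) are hypothetical.

import math

def best_baseform(candidates, label_strings, acoustic_log_prob, prior_log_prob,
                  log_threshold):
    """Score candidate phone sequences against n label strings of the same word.

    candidates        -- candidate phone sequences Pj (here an external list; the
                         disclosure grows them by iterative stack decoding)
    label_strings     -- the strings S1..Sn produced by the acoustic processor
    acoustic_log_prob -- hypothetical helper: log Pr(Si | Pj), e.g. a forward pass
    prior_log_prob    -- hypothetical helper: log prior probability of Pj
    log_threshold     -- the prescribed threshold of step (f), in the log domain
    """
    best, best_joint = None, -math.inf
    for phones in candidates:
        # (b) probability of each string Si given the phone sequence Pj
        per_string = [acoustic_log_prob(s, phones) for s in label_strings]
        # (c) aggregate the n string probabilities -- assumed here to be their
        # product, i.e. a sum of log-probabilities
        log_pr_agg = sum(per_string)
        # (d) multiply by the prior of Pj to obtain a joint probability
        log_joint = log_pr_agg + prior_log_prob(phones)
        # (f) keep the best joint probability above the prescribed threshold
        if log_joint > log_threshold and log_joint > best_joint:
            best, best_joint = phones, log_joint
    return best, best_joint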
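
The acoustic processor's labeling step can be illustrated by assigning each feature vector to its nearest prototype in the label alphabet. The distance measure and data layout below are assumptions; the disclosure does not specify them.

def label_string(feature_frames, prototypes):
    """Map each feature vector (e.g. band energies over a fixed interval) to the
    index of the nearest prototype vector in the label alphabet.

    feature_frames -- list of feature vectors, one per time interval
    prototypes     -- list of prototype vectors defining the label alphabet
    The Euclidean distance used here is an assumption.
    """
    def sq_dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    return [min(range(len(prototypes)),
                key=lambda k: sq_dist(frame, prototypes[k]))
            for frame in feature_frames]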
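
For the acoustic probability that a phone model produces a string of labels, a standard forward pass over a transition-emitting Markov model can be used. The sketch below assumes a simple arc-list representation and omits null (non-emitting) transitions for brevity; neither choice is fixed by the disclosure.

import math

def forward_log_prob(labels, n_states, arcs, start=0, final=None):
    """Log Pr(label string | phone model) by the forward algorithm.

    arcs -- list of (from_state, to_state, trans_prob, output_probs), where
            output_probs maps each label to its probability on that transition.
    """
    final = n_states - 1 if final is None else final
    NEG = -math.inf

    def log_add(a, b):
        # log(e^a + e^b), computed stably
        if a == NEG:
            return b
        if b == NEG:
            return a
        m = max(a, b)
        return m + math.log1p(math.exp(min(a, b) - m))

    # alpha[s]: log-probability of reaching state s having produced the labels so far
    alpha = [NEG] * n_states
    alpha[start] = 0.0
    for label in labels:
        nxt = [NEG] * n_states
        for s, t, p, out in arcs:
            if alpha[s] != NEG and out.get(label, 0.0) > 0.0:
                nxt[t] = log_add(nxt[t], alpha[s] + math.log(p * out[label]))
        alpha = nxt
    return alpha[final]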
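
For the phone language model, the m-gram probabilities can be estimated from counts of phone sequences. The relative-frequency estimate below is one plausible reading; the disclosure's exact counting procedure and any smoothing or back-off are not given here and are omitted.

from collections import defaultdict

def train_phone_lm(phone_sequences, m=3):
    """Estimate an m-gram phone language model by relative frequency.

    Returns prob(phone, history) ~ Pr(phone | previous m-1 phones).
    Sequences are padded with a start symbol; smoothing is omitted.
    """
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for seq in phone_sequences:
        padded = ["<s>"] * (m - 1) + list(seq)
        for i in range(m - 1, len(padded)):
            history = tuple(padded[i - m + 1:i])
            ngram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1

    def prob(phone, history):
        hist = tuple(history)[-(m - 1):] if m > 1 else ()
        denom = history_counts.get(hist, 0)
        return ngram_counts.get(hist + (phone,), 0) / denom if denom else 0.0

    return prob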