
A method of using multi-word phrases to enhance allophonic contextual information in speech recognition
Disclosure Number: IPCOM000021974D
Original Publication Date: 2004-Feb-18
Included in the Prior Art Database: 2004-Feb-18
Document File: 3 page(s) / 60K

Most speech recognition engines do not take into account the "right-context" beyond the current word boundary. This trades accuracy for speed: such modeling cannot account for coarticulation effects, by which the next word may alter the pronunciation of the current one. Short, common words that often occur together are likely to be spoken as a single unit, with strong coarticulation effects. For those cases where it is both needed and practical (for example, when decoding digit strings), we change the vocabulary units to include "phrases". These coarticulations then occur "in-token", which allows them to be modeled with appropriate contextual allophonic models.
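
As a rough illustration of the idea of "phrases" as vocabulary units, the sketch below extends a word lexicon with multi-word tokens. Everything in it is an assumption made for illustration: the transcriptions, the names WORD_LEXICON and add_phrases, and in particular the choice to build a phrase pronunciation by simply concatenating the word pronunciations. The disclosure does not say how phrase pronunciations are obtained.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical word lexicon; these transcriptions are illustrative only.
WORD_LEXICON: Dict[str, List[str]] = {
    "one": ["w", "ah", "n"],
    "nine": ["n", "ay", "n"],
    "eight": ["ey", "t"],
}


def add_phrases(
    lexicon: Dict[str, List[str]],
    phrases: Sequence[Tuple[str, ...]],
) -> Dict[str, List[str]]:
    """Return a copy of `lexicon` extended with one token per phrase.

    Assumption: a phrase is pronounced as the concatenation of its words,
    so phones on either side of the former word boundary become ordinary
    in-token neighbours.
    """
    extended = dict(lexicon)
    for words in phrases:
        token = "_".join(words)
        extended[token] = [phone for word in words for phone in lexicon[word]]
    return extended


extended = add_phrases(WORD_LEXICON, [("one", "nine"), ("nine", "eight")])
print(extended["one_nine"])
```

Inside the token "one_nine", the final /n/ of "one" has the initial /n/ of "nine" as its immediate right neighbour, so a context-dependent model can be selected for it without looking past a vocabulary-unit boundary.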

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 48% of the total text.


A method of using multi-word phrases to enhance allophonic contextual information in speech recognition

Speech recognition systems often model words using their transcriptions into a phonetic alphabet, as illustrated in Figure 1.

Figure 1: phonetic transcriptions

     However, since the acoustic evidence of a phone's realization depends strongly on its context (the neighbouring phones), lower-level models (often called allophones) are built that take this phonetic context into account. For example, instead of having a single mathematical model that represents the behavior of, say, phone /a/, we have models of /a/ when preceded by /r/ and followed by /t/, models of /a/ in word-final position, models of /a/ when the third phone to its right is /s/, etc. Whether a particular context is worthy of its own model can be determined by automatic procedures that maximize some information-theoretic criterion.
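
     The notion of "the phone n positions to the left or right" can be made concrete with a small helper. This is only a sketch: the function name context_phone and the convention of returning None outside the transcription are assumptions, and the transcription is the one the text uses for "matter".

```python
from typing import List, Optional


def context_phone(phones: List[str], index: int, offset: int) -> Optional[str]:
    """Return the phone `offset` positions away from `phones[index]`.

    Negative offsets look left, positive offsets look right. None is
    returned when the position falls outside the transcription.
    """
    j = index + offset
    if 0 <= j < len(phones):
        return phones[j]
    return None


matter = ["m", "a", "t", "uh", "r"]
i = matter.index("a")

print(context_phone(matter, i, -1))  # phone immediately to the left
print(context_phone(matter, i, +3))  # third phone to the right
print(context_phone(matter, i, +4))  # past the end of the transcription
```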

     One convenient way to store these models is in a tree, where the root node is the phone (/a/ in our example) and each further node is a "question" asked about the context, of the form "is the phone [n] positions to the [left|right] of the current phone one of X, Y...Z?". Many questions can be asked in sequence, as the tree-growing algorithm sees fit in order to maximize the informativeness of the tree. At each resulting leaf we have a model of the phone in a particular context, also known as an "allophone", e.g. /a/ with /t/ to its left and /r/ 3 phones to the right, unless it is a word-final /a/. This model is trained on speech segments that match that phone in that context.

     Figure 2 shows such a tree (a made-up example), in which the system has found it useful to distinguish five allophonic variants of phone /a/, together with the questions to ask about the phonetic context of an instance in order to determine which variant applies. For example, the allophone /a2/ is used for instances where the phone to the left (P-1) is not an unvoiced plosive (/p/, /t/ or /k/) and the phone 3 positions to the right (P+3) is /r/. That would be the allophone used in "matter" /m a t uh r/, for example.
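
     The /a2/ path described above can be sketched as a small decision tree. Since the figure itself is not reproduced here, only the two questions leading to /a2/ come from the text; the other leaves (a1, a3) are hypothetical placeholders, and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass(frozen=True)
class Leaf:
    allophone: str  # name of the context-dependent model


@dataclass(frozen=True)
class Question:
    offset: int                # -1 = P-1, +3 = P+3
    phone_set: FrozenSet[str]  # "is the phone at that offset one of ...?"
    yes: "Node"
    no: "Node"


Node = Union[Leaf, Question]


def classify(node: Node, phones: List[str], index: int) -> str:
    """Walk from the root to a leaf, answering one context question per node."""
    while isinstance(node, Question):
        j = index + node.offset
        neighbour = phones[j] if 0 <= j < len(phones) else None
        node = node.yes if neighbour in node.phone_set else node.no
    return node.allophone


UNVOICED_PLOSIVES = frozenset({"p", "t", "k"})

# Partial, made-up tree for /a/: only the path to "a2" follows the text.
TREE_A = Question(
    offset=-1,
    phone_set=UNVOICED_PLOSIVES,
    yes=Leaf("a1"),  # hypothetical leaf
    no=Question(
        offset=3,
        phone_set=frozenset({"r"}),
        yes=Leaf("a2"),
        no=Leaf("a3"),  # hypothetical leaf
    ),
)

print(classify(TREE_A, ["m", "a", "t", "uh", "r"], 1))  # a2
```

     In "matter", P-1 is /m/, which is not an unvoiced plosive, and P+3 is /r/, so the walk ends at the /a2/ leaf.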

Figure 2: an example of allophonic tree and contextual questions



     At decoding time, when examining a hypothesized string of words, the engine looks up the phonetic transcriptions of these words. To match the acoustic signal to a phone in the current word, it looks up the phonetic context of that phone (in the current hypothesis) and selects the corresponding allophonic model. For example, with an allophonic tree, we start at the root node for the current phone and trickle down the tree, the answer to the question at each node determining which branch to follow, until we reach a leaf that gives the proper allophonic model for acoustic matching.
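
     The context lookup for a partial hypothesis can be sketched as follows. This is a simplified illustration, not the engine's actual procedure: the lexicon entries, the function name contexts_for_last_word, and the fixed span of three phones are all assumptions. Positions that fall beyond the last hypothesized word are reported as None.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon; transcriptions are illustrative only.
LEXICON: Dict[str, List[str]] = {
    "at": ["a", "t"],
    "matter": ["m", "a", "t", "uh", "r"],
}

Context = Dict[int, Optional[str]]


def contexts_for_last_word(
    hypothesis: List[str],
    lexicon: Dict[str, List[str]],
    span: int = 3,
) -> List[Tuple[str, Context]]:
    """Collect the phonetic context of every phone in the last hypothesized word.

    Left context may reach into earlier words of the hypothesis. Right
    context ends with the last word, because nothing has been hypothesized
    after it yet; such positions map to None.
    """
    history = [p for word in hypothesis[:-1] for p in lexicon[word]]
    current = lexicon[hypothesis[-1]]
    sequence = history + current

    results: List[Tuple[str, Context]] = []
    for i in range(len(history), len(sequence)):
        context: Context = {}
        for offset in range(-span, span + 1):
            if offset == 0:
                continue
            j = i + offset
            context[offset] = sequence[j] if 0 <= j < len(sequence) else None
        results.append((sequence[i], context))
    return results


for phone, context in contexts_for_last_word(["at", "matter"], LEXICON):
    print(phone, context)
```

     For the final /r/ of "matter" every right-hand offset is None, while its left-hand offsets are fully known; these context values are what the tree questions would be answered with.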

     The problem addressed here is that, at decoding time, the right-hand context words (words that have yet to occur or be hypothesized) is...