
Construction of Markov Word Models for Computer Recognition of Continuous Speech

IP.com Disclosure Number: IPCOM000106603D
Original Publication Date: 1993-Nov-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 114K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

In some speech recognition systems, the pronunciation of a word is represented by a Markov model. This document describes how to obtain Markov word models for continuous speech.

This is the abbreviated version, containing approximately 50% of the total text.

Construction of Markov Word Models for Computer Recognition of Continuous Speech

      In some speech recognition systems, the pronunciation of a word
is represented by a Markov model.  This document describes how to
obtain Markov word models for continuous speech.

      Assume the existence of phonetic Markov word models, such as
those described in (1).  Also assume that some speech has been
recorded, and signal-processed (2) to produce a vector of P ear-model
parameters every T ms.  Good choices for P and T are 20 and 10
respectively.  Finally, we shall assume that each parameter vector
has been augmented with its total energy, and replaced with a spliced
parameter vector (3) obtained by concatenating the N augmented
vectors centered around the replaced vector.  A good choice for N is
9.  A reasonable amount of training data is 2000 sentences per
speaker, and a reasonable number of speakers is 10.
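
      As a concrete illustration of the splicing described above, the
following sketch (in Python with NumPy, which is not part of the
original disclosure) assumes that the signal-processed data are
already available as one row of P ear-model parameters per 10-ms
frame, together with a per-frame total energy; the function and
variable names are hypothetical.

    import numpy as np

    def splice_parameter_vectors(params, energy, n_splice=9):
        """Augment each P-dimensional parameter vector with its total
        energy, then replace it with the concatenation of the n_splice
        augmented vectors centered on it (edges padded by repeating the
        first and last frames).  With P = 20 and n_splice = 9 the
        result has 9 * 21 = 189 dimensions per frame."""
        augmented = np.hstack([params, energy[:, None]])   # (frames, P + 1)
        half = n_splice // 2
        padded = np.pad(augmented, ((half, half), (0, 0)), mode="edge")
        frames, dim = augmented.shape
        spliced = np.empty((frames, n_splice * dim))
        for t in range(frames):
            # rows t .. t + n_splice - 1 of the padded array are centered on frame t
            spliced[t] = padded[t:t + n_splice].reshape(-1)
        return spliced

    # Example: 20 ear-model parameters every 10 ms, 300 frames (3 seconds)
    params = np.random.randn(300, 20)
    energy = np.random.rand(300)
    spliced = splice_parameter_vectors(params, energy)     # shape (300, 189)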

      The procedure is as follows.  Note that the terms "baseform"
and "Markov model" are used synonymously.

1.  Label all the training data via the method of phonetic
    supervision as described in (3).

2.  Create a singleton fenemic baseform (4) for each lexeme in the
    training data.  There is one lexeme for each distinct
    word-pronunciation in the data.

3.  Using those utterances from which singleton baseforms were not
    created, compute transition and output probabilities for the
    singleton baseforms via the forward-backward algorithm (1); a
    sketch of this training step appears after the procedure.

4.  Using the trained statistics from Step 3, and all the training
    data, create a (grown) fenemic baseform (5,6) for each lexeme in
    the training data.

5.  Using all the training data, compute transition and output
    probabilities for the grown fenemic baseforms of Step 4 via the
    forward-backward algorithm.  These statistics are required in the
    first execution of Steps 10-11.

6.  Compute speaker-dependent transition and output probabilities for
    the phonetic baseforms, using all the available training data for
    each speaker, and the forward-backward algorithm (1).

7.  Using the statistics of Step 6, Viterbi align (1) the training
    data against the corresponding phonetic baseforms; an alignment
    sketch appears after the procedure.

8.  Using the Viterbi alignment of Step 7, determine the end-points
    of every phonetic phone and function lexeme (7) in the training
    scripts.

9.  Using the phone/lexeme end-points, extract the label sequences
    corresponding to every function lexeme in the training data, and
    every phonetic phone not embedded in a function lexeme.  Tag each
    such sequence with its phonetic phone context:  typically the 5
    preceding and 5 following phones.

10. Using the label sequences of Step 9, and the current fenemic
    statistics, construct a tree of phonological rules (8,9) for each
    phonetic phone and function lexeme; a simplified tree-growing
    sketch appears after the procedure.

11. Construct a fenemic baseform (leafform) for each leaf of ea...
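
      The forward-backward training referred to in Steps 3, 5 and 6
can be illustrated by the following minimal sketch.  It re-estimates
the transition and output probabilities of a discrete-output Markov
model from sequences of acoustic labels; this is a generic Baum-Welch
formulation under those assumptions, not the exact fenemic-baseform
trainer of reference (1), and all names are hypothetical.

    import numpy as np

    def forward_backward_train(sequences, n_states, n_labels, n_iter=10, seed=0):
        """Baum-Welch re-estimation of transition (A) and output (B)
        probabilities of a discrete-output Markov model.
        sequences: list of 1-D integer arrays of acoustic labels."""
        rng = np.random.default_rng(seed)
        A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
        B = rng.random((n_states, n_labels)); B /= B.sum(axis=1, keepdims=True)
        pi = np.full(n_states, 1.0 / n_states)   # initial distribution, kept fixed here
        for _ in range(n_iter):
            A_num = np.zeros_like(A); B_num = np.zeros_like(B)
            gamma_sum = np.zeros(n_states)
            for y in sequences:
                T = len(y)
                # forward pass with per-frame scaling to avoid underflow
                alpha = np.zeros((T, n_states)); scale = np.zeros(T)
                alpha[0] = pi * B[:, y[0]]
                scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
                for t in range(1, T):
                    alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
                    scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
                # backward pass, reusing the same scaling factors
                beta = np.zeros((T, n_states)); beta[-1] = 1.0
                for t in range(T - 2, -1, -1):
                    beta[t] = (A @ (B[:, y[t + 1]] * beta[t + 1])) / scale[t + 1]
                gamma = alpha * beta
                gamma /= gamma.sum(axis=1, keepdims=True)
                # accumulate expected transition and output counts
                for t in range(T - 1):
                    xi = alpha[t][:, None] * A * (B[:, y[t + 1]] * beta[t + 1])
                    A_num += xi / xi.sum()
                for t in range(T):
                    B_num[:, y[t]] += gamma[t]
                gamma_sum += gamma.sum(axis=0)
            A = A_num / A_num.sum(axis=1, keepdims=True)
            B = B_num / gamma_sum[:, None]
        return A, B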
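
      The Viterbi alignment of Step 7 and the end-point extraction of
Steps 8 and 9 can be sketched as follows.  The sketch assumes a
left-to-right baseform given as a sequence of states, each with a
discrete output distribution over acoustic labels, and a mapping from
each state to the phone instance it belongs to; it is a generic
formulation under those assumptions rather than the exact procedure
of reference (1).

    import numpy as np

    def viterbi_align(labels, log_A, log_B):
        """Viterbi alignment of a label sequence against a baseform.
        labels : 1-D array of acoustic-label indices (one per frame)
        log_A  : (S, S) log transition probabilities between states
        log_B  : (S, L) log output probabilities of labels given states
        Returns the most likely state index for every frame."""
        T, S = len(labels), log_A.shape[0]
        delta = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        delta[0, 0] = log_B[0, labels[0]]        # assume the baseform starts in state 0
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A   # scores[i, j]: best path into i, then i -> j
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, labels[t]]
        states = np.zeros(T, dtype=int)
        states[-1] = delta[-1].argmax()          # could instead be forced to the final state
        for t in range(T - 2, -1, -1):
            states[t] = back[t + 1, states[t + 1]]
        return states

    def phone_end_points(states, state_to_phone):
        """Turn the per-frame state alignment into (phone instance,
        first frame, last frame) triples, as needed for Steps 8 and 9."""
        phones = [state_to_phone[s] for s in states]
        spans, start = [], 0
        for t in range(1, len(phones) + 1):
            if t == len(phones) or phones[t] != phones[t - 1]:
                spans.append((phones[start], start, t - 1))
                start = t
        return spans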
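
      Step 10 grows, for each phonetic phone (or function lexeme), a
binary decision tree whose questions examine the tagged phone context
of Step 9.  The simplified sketch below picks, at each node, the
context question that most reduces the entropy of the pooled fenemic
label distribution; the procedure of references (8,9) scores splits
with fenemic Markov models instead, and the question set and stopping
rule used here are illustrative assumptions.

    import numpy as np

    def label_entropy_bits(samples, n_labels):
        """Total bits of the pooled fenemic-label distribution of a node."""
        counts = np.bincount(np.concatenate([s for s, _ in samples]),
                             minlength=n_labels)
        p = counts / counts.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum() * counts.sum())

    def grow_context_tree(samples, n_labels, min_samples=50):
        """samples: list of (label_sequence, context) pairs, where context
        maps a position in -5..-1, +1..+5 to the phone found there.
        Returns a nested dict; leaves keep their samples so that a
        leafform can be grown for each leaf (Step 11)."""
        questions = {(pos, phone)        # "is the phone at position pos equal to phone?"
                     for _, ctx in samples for pos, phone in ctx.items()}
        parent_bits = label_entropy_bits(samples, n_labels)
        best, best_gain = None, 0.0
        for pos, phone in questions:
            yes = [sc for sc in samples if sc[1].get(pos) == phone]
            no = [sc for sc in samples if sc[1].get(pos) != phone]
            if len(yes) < min_samples or len(no) < min_samples:
                continue
            gain = parent_bits - (label_entropy_bits(yes, n_labels)
                                  + label_entropy_bits(no, n_labels))
            if gain > best_gain:
                best, best_gain = (pos, phone), gain
        if best is None:
            return {"leaf": samples}
        pos, phone = best
        yes = [sc for sc in samples if sc[1].get(pos) == phone]
        no = [sc for sc in samples if sc[1].get(pos) != phone]
        return {"question": best,
                "yes": grow_context_tree(yes, n_labels, min_samples),
                "no": grow_context_tree(no, n_labels, min_samples)}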