Template Averaging for Adapting a Dynamic Time Warping Speech Recognizer

IP.com Disclosure Number: IPCOM000100492D
Original Publication Date: 1990-Apr-01
Included in the Prior Art Database: 2005-Mar-15
Document File: 5 page(s) / 215K

Publishing Venue

IBM

Related People

Grice, DG: AUTHOR [+4]

Abstract

The article discusses a method for averaging templates (digitized voice patterns), and there are several reasons for wanting to do so. Speech recognition algorithms are pattern matching routines that look for similarities between digitized voice prints (called templates). One way to increase the likelihood of a correct match is to store multiple templates of the same word. Since we never say the same word exactly the same way twice, multiple templates of the same word better cover the different ways that the word can be spoken.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 33% of the total text.

      The problem is that multiple templates require additional
memory, and the matching routines require additional computation to
compare an utterance against every one of them.  Template averaging
saves both memory and time, since it makes it possible to represent
more than one utterance of the same word with just one template.  In
this way you get the potential benefit of multiple templates without
the added memory or computation.
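
      The saving can be illustrated with some rough arithmetic.  The
sketch below is only an illustration: the function name
template_cost and the vocabulary size, template count, frame count
and feature count are all assumed values, not figures from this
article.

```python
def template_cost(words, templates_per_word, frames, features):
    """Return (stored feature values, template comparisons per utterance).

    All four parameters are hypothetical sizes chosen for illustration.
    """
    stored_values = words * templates_per_word * frames * features
    comparisons = words * templates_per_word
    return stored_values, comparisons


# Four stored templates per word versus one averaged template per word.
multi = template_cost(words=50, templates_per_word=4, frames=40, features=8)
single = template_cost(words=50, templates_per_word=1, frames=40, features=8)
print("multiple templates:", multi)
print("averaged template: ", single)
```

      Under these assumptions both the storage and the number of
comparisons per utterance scale directly with the number of
templates kept for each word.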

      Another potential benefit of this method is its use in
adapting a template to a new speaker, or to the same speaker in a new
acoustical environment.  For instance, the training portion of speech
recognition could be eliminated if the user were supplied with a set
of speaker-independent training words.  Then, as the system is used,
these general templates can be dynamically adapted to the natural
characteristics of the user's voice, thus improving the recognition
accuracy of the system.  Through continual adaptation, the templates
come to resemble the user's own over a period of time.  In a similar
manner, if the acoustical environment varies from time to time, the
changes in acoustical information can be reflected in the changing
templates over time.
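
      One way to picture this gradual adaptation is a weighted
running average, sketched below.  This is not necessarily the
averaging rule used in this article: the function adapt_template, the
weight of 0.25, and the assumption that the two templates have
already been time-aligned to the same number of frames are all
hypothetical choices made for illustration.

```python
def adapt_template(base, current, weight=0.25):
    """Blend a current template into a base template, frame by frame.

    Each template is a list of frames; each frame is a list of feature
    values.  Assumes (hypothetically) that both templates are already
    time-aligned, so they have the same number of frames.
    """
    if len(base) != len(current):
        raise ValueError("templates must be aligned to the same length")
    return [
        [(1.0 - weight) * b + weight * c for b, c in zip(base_frame, cur_frame)]
        for base_frame, cur_frame in zip(base, current)
    ]


# Toy data: a general template drifting toward one user's utterances.
general = [[0.0, 4.0], [8.0, 0.0]]
user = [[4.0, 4.0], [8.0, 4.0]]

adapted = general
for _ in range(3):
    adapted = adapt_template(adapted, user)
print(adapted)
```

      With a small weight the base template changes slowly, so a
single unusual utterance has limited effect, while repeated use still
moves the template toward the user's voice.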

      Typically, a word template (voice print) represents n time
frames of feature information; often these features are energy
values in the frequency spectrum.  It is these values that the
dynamic time warping methods use to perform the matching.  Figs. A
and B illustrate two templates, one for the word "no" and the other
for the word "go."  The horizontal axis represents time and the
vertical axis represents increasing frequency.  These two templates
are used as examples throughout this article.  Obviously, you would
not want to average together templates for the words "no" and "go."
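
      As a minimal sketch of how such templates might be matched,
the code below stores a template as a list of time frames and
compares two of them with a textbook dynamic time warping
recurrence.  The city-block frame distance, the unconstrained warping
path, and the toy feature values are assumptions for illustration;
they are not taken from Figs. A and B or from the recognizer
described here.

```python
def frame_distance(a, b):
    """City-block distance between two feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(template, utterance):
    """Accumulated cost of the best time alignment of two templates."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] = best cost aligning the first i frames with the first j.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # utterance frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Toy templates: three frames, two energy values per frame (made up).
pattern_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
pattern_b = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
# The same frames as pattern_a, spoken more slowly (frames repeated).
slow_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

print(dtw_distance(pattern_a, slow_a))
print(dtw_distance(pattern_b, slow_a))
```

      The warping lets the slower utterance line up with pattern_a
despite the different number of frames, while pattern_b, which
starts differently, accumulates a nonzero cost.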

      In order to average two templates, a base template is needed
(the template to be adapted), along with a current template (the one
that will cause the base template to be altered).  There are two
considerations to be made in averaging templates.  The first is the
method by which the feature information from the two templates is
averaged
together.  The second consideration is in taking this ne...