Construction of a Projection Matrix Spanning a Large Time Window for Use in a Speech Recognition System

IP.com Disclosure Number: IPCOM000105823D
Original Publication Date: 1993-Sep-01
Included in the Prior Art Database: 2005-Mar-20
Document File: 2 page(s) / 85K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+3]

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      In one prominent approach to speech recognition [1], the
following acoustic processing is performed.  An acoustic parameter
vector of about 21 elements is computed at regular intervals of about
10 ms.  A spliced parameter vector of about 189 elements is then
associated with each time frame; it is obtained by concatenating
about 9 of the 21-dimensional vectors from a window centered on that
frame.  These spliced vectors are then projected down to about 50
dimensions using discriminating eigenvectors [2].  Thus, the final
parameter vector associated with any given frame t reflects both the
instantaneous character of the signal at time t and the dynamic
properties over a window of about 90 ms centered at time t.
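
The splicing and projection described above can be sketched as
follows.  This is a minimal illustration under stated assumptions,
not the implementation of [1] or [2]: the function names `splice` and
`project` are hypothetical, frames near the utterance edges are
handled by clamping to the first or last frame (the text does not say
how edges are treated), and the projection matrix is filled with
random placeholder values rather than discriminating eigenvectors.

```python
import random

P = 21        # elements per acoustic parameter vector (from the text)
N = 9         # frames spliced together (from the text)
D = 50        # dimension after projection (from the text)


def splice(frames, t, n=N):
    """Concatenate the n frames centered on frame t into one vector.

    Edge handling is an assumption: indices outside the utterance are
    clamped to the first or last frame.  n is assumed to be odd.
    """
    half = n // 2
    last = len(frames) - 1
    spliced = []
    for k in range(t - half, t + half + 1):
        spliced.extend(frames[min(max(k, 0), last)])
    return spliced


def project(matrix, vector):
    """Multiply a D x Q matrix (a list of rows) by a Q-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


rng = random.Random(0)
# 100 frames (about 1 s at 10 ms per frame) of placeholder data.
frames = [[rng.gauss(0.0, 1.0) for _ in range(P)] for _ in range(100)]
# Placeholder projection matrix; NOT the discriminating eigenvectors of [2].
projection = [[rng.gauss(0.0, 1.0) for _ in range(N * P)] for _ in range(D)]

spliced = splice(frames, t=50)
final = project(projection, spliced)
print(len(spliced), len(final))  # prints "189 50"
```

With N = 9 and a 10 ms frame spacing, each spliced vector covers the
roughly 90 ms window mentioned in the text.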

      Unfortunately, some acoustic events in speech cannot be
distinguished on the basis of this information alone.  In extreme
cases, a window of 300 ms may be required.  Yet the above technology
cannot easily be extended to windows of this length: the covariance
matrices involved in the calculation of the discriminating
eigenvectors become so large that (1) there is generally insufficient
data available to estimate them reliably, and (2) they are
unmanageable from an algorithmic point of view.
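
To make the growth concrete, the short calculation below compares the
conventional 90 ms window with a 300 ms window.  It is a rough sketch
under stated assumptions: the helper names are hypothetical, the
window is taken to contain window_ms / 10 frames of 21 elements each,
and the count shown is simply the number of distinct entries in one
symmetric covariance matrix of the spliced vectors.

```python
P = 21             # elements per frame vector (from the text)
FRAME_MS = 10      # frame spacing in ms (from the text)


def spliced_dim(window_ms, p=P, frame_ms=FRAME_MS):
    """Dimension Q of a spliced vector covering window_ms of speech."""
    return (window_ms // frame_ms) * p


def covariance_params(q):
    """Number of distinct entries in a symmetric q x q covariance matrix."""
    return q * (q + 1) // 2


for window_ms in (90, 300):
    q = spliced_dim(window_ms)
    print(window_ms, q, covariance_params(q))
# prints:
# 90 189 17955
# 300 630 198765
```

Under these assumptions the spliced dimension grows from 189 to 630,
and the number of covariance entries to be estimated grows about
elevenfold.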

      The disclosed invention specifies an iterative algorithm for
increasing the conventional window to a more desirable length without
increasing the sizes of any of the covariance matrices or parameter
vectors used in the calculations.

      Assume that some training data has been recorded and
signal-processed, and that a P-dimensional acoustic parameter vector
has been associated with each time frame.  Typically, P would be
about 21, and the time frame would be about 10 ms.

The following steps are performed.

(1)  At each time frame T, concatenate the N P-dimensional vectors
centered on T.  A reasonable value for N is 9.  Let Q = N x P denote
the dimension of the resulting spliced vectors.

(2)  Using the Q-dimensional vectors just created, compute
Q-dimensional discriminating eigenvectors as descri...