Improved Speech Recognition Feature Binning Technique

IP.com Disclosure Number: IPCOM000100426D
Original Publication Date: 1990-Apr-01
Included in the Prior Art Database: 2005-Mar-15
Document File: 2 page(s) / 81K

Publishing Venue

IBM

Related People

Grice, DG: AUTHOR [+4]

Abstract

An automated speech recognizer must extract the critical characteristics from a voice sequence in order to accurately identify that sequence again. A recognition system based on a perceptual model of speech tries to mimic the ear. It attempts to extract those characteristics that seem to be of interest to the ear. Since the ear is sensitive to the frequency content of sound, the Fourier transform is often used by recognizers. A Fast Fourier Transform (FFT) is an effective means to convert a time-domain speech input to a frequency spectral representation. The n magnitude values resulting from the FFT process can be treated as a filter bank. The energy bands are typically grouped, producing a combination logarithmic/linear spacing. This normal spacing is called 'critical binning'.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

       An automated speech recognizer must extract the critical
characteristics from a voice sequence in order to accurately identify
that sequence again.  A recognition system based on a perceptual
model of speech tries to mimic the ear.  It attempts to extract those
characteristics that seem to be of interest to the ear.  Since the
ear is sensitive to the frequency content of sound, the Fourier
transform is often used by recognizers.  A Fast Fourier Transform
(FFT) is an effective means to convert a time-domain speech input to
a frequency spectral representation.  The n magnitude values
resulting from the FFT process can be treated as a filter bank.  The
energy bands are typically grouped, producing a combination
logarithmic/linear spacing.  This normal spacing is called 'critical
binning'.
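
      As a rough illustration of treating the transform magnitudes as
a filter bank and grouping them into bands, the following Python
sketch may help.  The band layout (a few equal-width bands followed
by bands whose width doubles), the function names, and the test tone
are all assumptions made for illustration and are not taken from the
disclosure.  A naive DFT stands in for the FFT to keep the sketch
free of dependencies.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the first n/2 DFT components of a real-valued frame.

    A naive O(n^2) DFT is used here; an FFT yields the same values faster.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def band_edges(n_components, n_linear, width):
    """Hypothetical layout: n_linear equal-width bands, then doubling widths."""
    edges = [0]
    for _ in range(n_linear):
        edges.append(edges[-1] + width)
    while edges[-1] < n_components:
        width *= 2
        edges.append(min(edges[-1] + width, n_components))
    return edges


def bin_energies(mags, edges):
    """Sum squared magnitudes (energy) inside each band [lo, hi)."""
    return [sum(m * m for m in mags[lo:hi])
            for lo, hi in zip(edges, edges[1:])]


# Example: a 64-sample frame holding a pure tone at DFT component 5.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
mags = dft_magnitudes(frame)
edges = band_edges(len(mags), n_linear=4, width=2)
features = bin_energies(mags, edges)
```

      With these assumed settings the 32 magnitudes collapse into
seven band energies, and the test tone's energy lands in the band
covering components 4 and 5.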

      A different grouping method is described here which attempts to
more closely mimic the ear's frequency response characteristics.
This model will 'react' in a manner more similar to the ear.

      A perceptual-based model should only be sensitive to a shift in
frequency if the ear of the speaker/listener is also sensitive to
this shift.  If two sounds are perceived as the same by a listener,
then these two sounds should map to the same model feature.  As an
extension to the critical binning procedure, the proposed binning
technique not only allows for critical band spacing but also
introduces the new concept of frequency masking modeling.

      Frequency masking refers to the fact that certain sounds tend
to mask or drown out other sounds with differing success.
Experiments show several interesting properties that are consistent
across human subjects:
      For moderate masker intensities, tones tend to mask
      neighboring tones most effectively;
      Low frequency tones effectively mask high frequency
      tones, while high frequency tones are less effective in
      masking low frequency tones.

      Although the general concept may be varied, the accompanying
table shows an example of how the proposed binning process might
work.

      Channel   Standard      Pre              Post
                Component     2    4      2    4    8    16   32
      1         6             5           7    8    9    10   11
      2         7-8           6           9    10   11   12   13
      3         9-10          8           11   12   13   ...
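
      One possible reading of the table, which the abbreviated text
does not confirm, is that each channel sums its standard components
at full weight and adds the neighboring Pre and Post components
scaled down by the number heading their column.  The Python sketch
below works through that reading for rows 1 and 2 only; row 3 is
truncated in the source and is left out.  The divisor interpretation,
the function name, and the input magnitudes are all assumptions made
for illustration.

```python
# Rows 1 and 2 of the table, transcribed as component numbers.
TABLE = {
    1: {"standard": [6], "pre": [5], "post": [7, 8, 9, 10, 11]},
    2: {"standard": [7, 8], "pre": [6], "post": [9, 10, 11, 12, 13]},
}

# ASSUMPTION: each column head is a divisor for the component listed under it.
PRE_DIVISORS = [2, 4]
POST_DIVISORS = [2, 4, 8, 16, 32]


def channel_value(row, magnitudes):
    """Combine one table row into a single channel value.

    magnitudes maps a component number to its magnitude.
    """
    total = sum(magnitudes[c] for c in row["standard"])
    for comp, div in zip(row["pre"], PRE_DIVISORS):
        total += magnitudes[comp] / div
    for comp, div in zip(row["post"], POST_DIVISORS):
        total += magnitudes[comp] / div
    return total


# Made-up input: a single unit-strength component at number 9.
magnitudes = {c: 0.0 for c in range(5, 14)}
magnitudes[9] = 1.0
ch1 = channel_value(TABLE[1], magnitudes)  # 9 sits under Post "8" in row 1
ch2 = channel_value(TABLE[2], magnitudes)  # 9 sits under Post "2" in row 2
```

      Under this assumed reading, the same component contributes one
eighth of its magnitude to channel 1 and one half to channel 2, so a
component influences nearby channels more strongly than distant ones.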