
New Signal Conditioning Step for Continuous Speech Recognition

IP.com Disclosure Number: IPCOM000111187D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 4 page(s) / 125K

Publishing Venue

IBM

Related People

Bellegarda, JR: AUTHOR [+3]

Abstract

The success of an automatic speech recognition system is critically dependent on the quality of its signal processing front-end. This is especially true in continuous speech where co-articulation effects may be severe. This work describes a new signal conditioning step which we have found useful for the DARPA Resource Management recognition task.



      The so-called "front-end" of an automatic recognition system
usually refers to the mapping between the acoustic signal received
through the microphone and a multi-dimensional vector space suitably
encompassing the salient features of this signal.  Since these
features are subsequently used for recognition, the quality of the
signal processing is critical to the ultimate performance of the
recognizer.  In the IBM speech recognition system, the basic signal
processing comprises A/D conversion, short term power spectrum
computation, critical band filtering, compressive loudness scaling,
and ear model adaptation [1].  This particular approach has been
shown to favorably mimic the human auditory system [1].
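
      As a rough illustration of these stages, the sketch below
(Python, used here purely for illustration) computes a short term
power spectrum for one frame, pools it into critical band
channels, and applies a compressive log scaling.  The frame
length, the band layout (mel-spaced edges standing in for true
critical bands), and all constants are placeholder assumptions,
not the values of the IBM front-end described in [1].

    import numpy as np

    def frontend_frame(frame, sample_rate=16000, n_bands=18):
        # Short term power spectrum of the windowed frame.
        windowed = frame * np.hamming(len(frame))
        spectrum = np.abs(np.fft.rfft(windowed)) ** 2

        # Pool FFT bins into n_bands critical-band-like channels;
        # mel-spaced edges approximate a Bark-style warping.
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        edges_mel = np.linspace(0.0, mel(sample_rate / 2.0),
                                n_bands + 1)
        edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)

        bands = np.empty(n_bands)
        for i in range(n_bands):
            mask = (freqs >= edges_hz[i]) & (freqs < edges_hz[i + 1])
            bands[i] = spectrum[mask].sum() + 1e-10  # avoid log(0)

        # Compressive loudness scaling (log domain).
        return 10.0 * np.log10(bands)

      Applied frame by frame, this yields the stream of band-power
vectors on which the ear model adaptation then operates.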

      On the other hand, such a front-end turns out to be quite
sensitive to the dynamic range exhibited by the speech utterances.
This is because the long term component of the adaptation maps the
effective dynamic range of each band into a fixed dynamic range,
approximately (30,80) dB.  This causes no problem for the usual IBM
office correspondence dictation task, where the typical office
environment exhibits a dynamic range varying between 25 and 45 dB.
However, it can produce deleterious effects in a quieter
environment, where the noise floor may be very low.  In
particular, this is the case of the DARPA Resource Management (RM)
data [2].  This task exemplifies an exceptionally noise free
environment, since the utterances were digitally recorded in a
sound-isolated recording booth using a headset noise-cancelling
microphone [2].
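
      To see this sensitivity concretely, assume (purely for
illustration) that the long term component linearly rescales each
band's observed range onto the fixed (30,80) dB target; the actual
ear model adaptation of [1] is more elaborate.  With a very low
noise floor, the silence region absorbs most of the target range
and the speech-related events are compressed:

    import numpy as np

    def map_dynamic_range(band_db, target_lo=30.0, target_hi=80.0):
        # Toy stand-in for the long term adaptation: linearly map
        # the band's observed dynamic range onto (30, 80) dB.
        lo, hi = band_db.min(), band_db.max()
        scale = (target_hi - target_lo) / (hi - lo)
        return target_lo + (band_db - lo) * scale

    # Office recording: ~35 dB dynamic range (floor 40, peak 75).
    office = np.array([40.0, 55.0, 70.0, 75.0])
    # Recording booth: ~70 dB dynamic range (floor 5, peak 75).
    booth = np.array([5.0, 55.0, 70.0, 75.0])

    print(map_dynamic_range(office))  # approx [30, 51.4, 72.9, 80]
    print(map_dynamic_range(booth))   # approx [30, 65.7, 76.4, 80]

      The 20 dB spread between the 55 and 75 dB speech events
survives as about 29 dB in the office case but shrinks to about
14 dB in the booth case.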

      Because of the unusual cleanliness of the DARPA recordings, our
standard signal processing may reduce the effective dynamic range
of speech-related events in a (somewhat unnecessary) effort to
characterize instances of perfect silence.  This reduction may in
turn result in a loss of acoustic information and thereby a drop
in the recognition accuracy of our system.  To cope with this
situation, we introduce a signal conditioning step between critical
band filtering and long term adaptation.  Recall from, e.g., [1],
that after critical band filtering a frame of speech is represented
by an N-dimensional vector X containing the power spectrum
amplitudes in each of N critical bands, where typically
17 < N < 20.  This
vector is then converted to a log domain representation before
adaptation processing.
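
      In pipeline terms, the conditioning step therefore operates
on X after critical band filtering, before the log domain
conversion and the long term adaptation.  In the sketch below, a
hypothetical noise-floor clamp stands in for the disclosed
transformation purely to mark where such a step applies; the
clamp is an assumption, not the transformation of this disclosure.

    import numpy as np

    def condition(x, floor=1e-6):
        # HYPOTHETICAL conditioning transformation: clamp each
        # critical band power to a minimum floor so near-perfect
        # silence cannot stretch the band's effective dynamic
        # range.  Stands in for the disclosed transformation.
        return np.maximum(x, floor)

    def process_frame(x):
        # x: N-dimensional vector of critical band power
        # amplitudes (typically 17 < N < 20).
        x = condition(x)            # new conditioning step
        return 10.0 * np.log10(x)   # log domain, then adaptation [1]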

The above procedure is modified by applying to X a conditioning
transformatio...