
Speech Recognition Result Validation according to Signal-to-Noise Ratio

IP.com Disclosure Number: IPCOM000112627D
Original Publication Date: 1994-Jun-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 37K

Publishing Venue

IBM

Related People

Bahl, LR: AUTHOR [+5]

Abstract

An algorithm is disclosed that aims to eliminate spurious voice-command recognitions triggered by lower-level speech (such as that found in background noise). It relies on estimating the Signal-to-Noise Ratio (SNR) of each utterance that triggers a "satisfactory" recognition according to the recognition system's usual measures, and rejecting those found to have a low SNR.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 86% of the total text.

      A command-word recognizer will generally be able to recognize
an utterance as one of a small set of candidate words (the active
vocabulary).  A good recognizer should also be able to detect the
presence of a speech utterance but reject it if it differs too much
from the words of the current active vocabulary.  Such an utterance
is commonly called a "mumble" in speech recognition circles.

      In a noisy environment, and especially if the background noise
consists of lower-level speech, many extraneous sounds may match one
of the vocabulary words closely enough to generate spurious
recognitions instead of being detected as silence or mumbles.

      The SNR is computed as the difference between the signal level
and the noise level, both expressed in dB.
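A minimal sketch of the rejection rule in Python follows. It assumes the signal and noise levels have already been estimated in dB; because both are logarithmic, their power ratio reduces to a subtraction. The function names and the min_snr_db threshold are hypothetical: the disclosure does not state a threshold value, and the 10 dB used below is purely illustrative.

```python
def snr_db(signal_level_db: float, noise_level_db: float) -> float:
    """SNR in dB. Both levels are already in dB, so the ratio is a difference."""
    return signal_level_db - noise_level_db


def accept_recognition(signal_level_db: float, noise_level_db: float,
                       min_snr_db: float) -> bool:
    """Return True if a recognition the recognizer already scored as
    satisfactory should be kept, i.e. its utterance stands at least
    min_snr_db above the background noise.

    min_snr_db is a hypothetical tuning parameter; the disclosure does
    not specify a value.
    """
    return snr_db(signal_level_db, noise_level_db) >= min_snr_db


if __name__ == "__main__":
    # Hypothetical levels: a close-talking command vs. background chatter.
    print(accept_recognition(72.0, 50.0, min_snr_db=10.0))  # True  (22 dB SNR)
    print(accept_recognition(56.0, 50.0, min_snr_db=10.0))  # False (6 dB SNR)
```

With the illustrative 10 dB threshold, a command 22 dB above the noise level is kept, while one only 6 dB above it (as might happen with nearby background speech) is rejected even though the recognizer scored it as satisfactory.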

      The signal level is estimated as the peak energy (95th
percentile) of the signal over the interval that the recognition
system's end-point detection assigns to the utterance.
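The signal-level estimate can be sketched as below, assuming the front end supplies one energy value per frame (in dB) and the end-point detector reports the utterance as a half-open range of frame indices [start, end). The names frame_energy_db, start and end, and the choice of the nearest-rank percentile definition, are assumptions; the disclosure specifies the 95th percentile but not how it is computed.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data are less than or equal to it."""
    if len(values) == 0:
        raise ValueError("no frames to measure")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank; multiply before dividing to avoid rounding surprises.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


def signal_level_db(frame_energy_db: Sequence[float], start: int, end: int,
                    pct: float = 95.0) -> float:
    """Estimate the utterance's peak level as the pct-th percentile of the
    frame energies inside the end-pointed interval [start, end).

    A high percentile rather than the maximum keeps a single click or
    transient from setting the level. With fewer than 20 frames, the
    nearest-rank 95th percentile equals the maximum.
    """
    return percentile_nearest_rank(frame_energy_db[start:end], pct)


if __name__ == "__main__":
    # Hypothetical data: 5 silent frames, a 20-frame utterance containing
    # one 95 dB transient, then 5 more silent frames.
    energies = [40.0] * 5 + [float(v) for v in range(50, 69)] + [95.0] + [40.0] * 5
    print(signal_level_db(energies, 5, 25))  # 68.0
    print(max(energies[5:25]))               # 95.0
```

Here the single 95 dB transient inside the utterance does not set the signal level; the estimate is 68 dB. Because the nearest-rank method returns an actual frame value, taking the percentile before or after converting energies to dB selects the same frame.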

      The noise level measurement relies on dynamic tracking of the
background noise level, by...