Method of Endpoint Detection

IP.com Disclosure Number: IPCOM000107352D
Original Publication Date: 1992-Feb-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 3 page(s) / 84K

Publishing Venue

IBM

Related People

Hashimoto, Y: AUTHOR [+2]

Abstract

This article describes a technique for making a real-time endpoint detector effective in handling unclear utterances by using multiple endpoints and adaptive thresholds for both the background noise level and the speech level.

This is the abbreviated version, containing approximately 52% of the total text.

       This article describes a technique for making a real-time
endpoint detector effective in handling unclear utterances by using
multiple endpoints and adaptive thresholds for both the background
noise level and the speech level.

Process of Estimating Adaptive Thresholds

      To estimate the background noise level, the input log energy is
stored in a ring-buffer of 80 frames.  Whenever the input log energy
falls below Nmax, the ring-buffer and its histogram are updated frame
by frame.  The background noise level is the mode, Nmode, of the
histogram.  The frame period of our system is 9.6 msec, so the
ring-buffer spans about 770 msec and the estimator can follow
time-varying background noise within about 400 msec.
The input log energy P(t) is normalized as follows.
      P(t) = P(t) - Nmode
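
      The noise estimation above can be sketched in Python as
follows.  This is only an illustration under stated assumptions, not
the implementation used in our system: the class name, the method
names, the histogram bin width, the use of the bin center as Nmode,
and the rule for breaking ties between bins are all hypothetical.

```python
from collections import Counter, deque


class NoiseLevelEstimator:
    """Hypothetical sketch: mode of the log energy over recent quiet frames."""

    def __init__(self, n_max, size=80, bin_width=1.0):
        self.n_max = n_max          # frames at or above Nmax are not treated as noise
        self.size = size            # ring-buffer length in frames (80 in the text)
        self.bin_width = bin_width  # assumed histogram resolution, not from the text
        self.ring = deque()         # ring-buffer holding one histogram bin per frame
        self.hist = Counter()       # bin index -> number of buffered frames in that bin

    def update(self, log_energy):
        """Update the ring-buffer and histogram for one frame below Nmax."""
        if log_energy >= self.n_max:
            return
        new_bin = int(log_energy // self.bin_width)
        if len(self.ring) == self.size:
            # The oldest frame leaves the buffer, so it also leaves the histogram.
            old_bin = self.ring.popleft()
            self.hist[old_bin] -= 1
            if self.hist[old_bin] == 0:
                del self.hist[old_bin]
        self.ring.append(new_bin)
        self.hist[new_bin] += 1

    def n_mode(self):
        """Return Nmode as the center of the fullest bin, or None if empty."""
        if not self.hist:
            return None
        # Assumed tie rule: prefer the lower bin when two bins are equally full.
        best = max(self.hist, key=lambda b: (self.hist[b], -b))
        return (best + 0.5) * self.bin_width

    def normalize(self, log_energy):
        """Apply P(t) = P(t) - Nmode; the input is returned unchanged with no estimate."""
        mode = self.n_mode()
        return log_energy if mode is None else log_energy - mode
```

      In this sketch, a step change in the noise level moves Nmode
once the new level occupies just over half of the buffer (41 of 80
frames, or about 394 msec at 9.6 msec per frame), which is consistent
with the response time stated above.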

      To determine the speech level, two kinds of level are
estimated.  One of these, S1, is a comparatively long-term speech
level, which is estimated by applying the same process used for the
background noise level to input log energy that exceeds Nmax.  Its
value is not the mode, but the boundary above which the top 10% of
the histogram lies.  The other level, S2, is a comparatively
short-term speech level, which is estimated as the average of the
input log energy that exceeds a threshold, Max(Smin, Nmode + SNmin),
in each word segment.  This threshold is used to confirm that the
word segment has a sufficiently high energy level to be recognized.
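
      The two speech levels can be sketched in Python as follows.
This is a rough illustration, not our implementation: the function
names, the bin width, the choice of the lower bin edge as the S1
boundary, the reuse of an 80-frame window for S1, and the use of
unnormalized log energy for the S2 threshold are assumptions made for
the example.

```python
from collections import Counter


def long_term_level_s1(energies, n_max, size=80, top_fraction=0.10, bin_width=1.0):
    """Hypothetical S1: boundary above which the top 10% of the histogram lies."""
    # Keep only frames that exceed Nmax, limited to the most recent `size` of them.
    recent = [e for e in energies if e > n_max][-size:]
    if not recent:
        return None
    hist = Counter(int(e // bin_width) for e in recent)
    target = top_fraction * len(recent)
    covered = 0
    # Walk down from the loudest bin until the top fraction of frames is covered.
    for b in sorted(hist, reverse=True):
        covered += hist[b]
        if covered >= target:
            return b * bin_width  # lower edge of the last bin that was included
    return None  # not reached for a non-empty histogram


def short_term_level_s2(segment, n_mode, s_min, sn_min):
    """Hypothetical S2: mean log energy above Max(Smin, Nmode + SNmin) in one word segment."""
    threshold = max(s_min, n_mode + sn_min)
    loud = [e for e in segment if e > threshold]
    if not loud:
        # No frame is loud enough, so the segment gives no short-term level.
        return None
    return sum(loud) / len(loud)
```

      For example, with Nmode = 10, Smin = 15, and SNmin = 8, the
threshold is Max(15, 18) = 18, so a word segment with log energies
5, 12, 30, and 34 gives S2 = (30 + 34) / 2 = 32.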

      In a normal signal-to-noise ratio (S/N) environment, SNstnd,
two-level thresholds are estimated from the normalized log energy in
advance.  There ar...