Robust Voice Activity Detection Based on the Static and Dynamic Energy for the Embedded System

IP.com Disclosure Number: IPCOM000125925D
Original Publication Date: 2005-Jun-22
Included in the Prior Art Database: 2005-Jun-22
Document File: 4 page(s) / 504K

Publishing Venue

Motorola

Related People

Zhaobing Han: AUTHOR [+2]

Abstract

In this paper we propose an effective, robust, and computationally low-cost voice activity detector based on static and dynamic energy (SDEVAD) for embedded automatic speech recognition. SDEVAD uses dynamic energy, derived from the optimal filters for edge detection in image processing, and static energy, derived from the spectrum, to formulate the possible start and end points, and then determines the accurate voice intervals using the latter. In addition, mean and variance analysis is applied to predict possible noise segments, and a pitch method is then used to reject wrongly detected noise portions. To suit different categories of applications, SDEVAD is divided into real-time and batch-mode processing modules. Experimental results show that the SDEVAD batch mode, evaluated on the PCSCSI database (including airport, car, outdoor, and other environments), detects active voice with an accuracy close to that of manual labeling. The real-time mode, evaluated on the MADISON database (lower SNR, extremely diverse environmental noise conditions), performs significantly better than AMR VAD1, AMR VAD2, the Advanced ETSI VAD, and the zero-crossing, multi-band, and noise-model methods, while requiring much less computation.

This is the abbreviated version, containing approximately 15% of the total text.

Robust Voice Activity Detection Based on the Static and Dynamic Energy for the Embedded System[1]

Zhaobing Han, Yaxin Zhang

1. Introduction

Voice Activity Detection (VAD) is important in various speech signal processing applications. In a telecommunication system, actual speech normally occupies 60% of the time in a regular conversation, so VAD allows system resources to be reallocated during periods of speech absence. In Automatic Speech Recognition (ASR) in general, and noise-robust ASR in particular, VAD ensures that the computationally intensive decoder runs only when necessary. This is particularly important in embedded applications (cellular phones, PDAs, etc.), where processing power is limited.

In general, VAD errors fall into two main types: clipping errors and false detection errors. Clipping errors occur when a speech frame is misclassified as a noise frame, which is intolerable in speech encoders because of its effect on speech intelligibility. False detection errors occur when a noise frame is misclassified as a speech frame. Echo cancellation systems are normally sensitive to this type of error because it results in incorrect parameter adaptation.
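The two error types described above can be made concrete with a small frame-level sketch. It is an illustration only, not an evaluation procedure taken from this paper: the function name `vad_error_rates`, the boolean label convention (True for speech, False for noise), and the normalization by the number of reference speech and noise frames are all assumptions.

```python
from typing import Sequence, Tuple


def vad_error_rates(
    reference: Sequence[bool], detected: Sequence[bool]
) -> Tuple[float, float]:
    """Return (clipping_rate, false_detection_rate) for frame-level labels.

    Hypothetical helper: True marks a speech frame, False a noise frame.
    """
    if len(reference) != len(detected):
        raise ValueError("label sequences must have the same length")

    speech_frames = sum(1 for r in reference if r)
    noise_frames = len(reference) - speech_frames

    # Clipping: the reference says speech, the detector says noise.
    clipped = sum(1 for r, d in zip(reference, detected) if r and not d)
    # False detection: the reference says noise, the detector says speech.
    false_detected = sum(1 for r, d in zip(reference, detected) if not r and d)

    # Assumed normalization: each error count is divided by the number of
    # reference frames of the class that was misclassified.
    clipping_rate = clipped / speech_frames if speech_frames else 0.0
    false_detection_rate = false_detected / noise_frames if noise_frames else 0.0
    return clipping_rate, false_detection_rate


# Toy labels: one speech frame is clipped and one noise frame is falsely detected.
reference = [False, False, True, True, True, True, False, False]
detected = [False, True, True, True, True, False, False, False]
clipping, false_detection = vad_error_rates(reference, detected)
print(clipping, false_detection)  # 0.25 0.25
```

Reporting the two rates separately matters because, as noted above, different applications tolerate them differently: a speech encoder is hurt mainly by clipping, while an echo canceller is hurt mainly by false detections.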

A variety of methods have been proposed for VAD over the past several decades. In general, different applications and environments need different algorithms to meet their specific requirements in terms of accuracy, computational complexity, robustness, sensitivity, response time, etc. The approaches include those based on energy threshold [1], pitch detection [2], spectrum analysis, cepstral analysis [3], zero-crossing rate [4][5], periodicity measure, detection in image processing [6][7], cl...