
Method of speech event detection using audiovisual information

IP.com Disclosure Number: IPCOM000005121D
Publication Date: 2001-Aug-15
Document File: 3 page(s) / 92K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method of speech event detection using audiovisual information. Benefits include greatly improved speech event recognition and implementation flexibility.


Background

Conventional speech recognition has reached a degree of maturity, especially in domain-specific spoken dialogue applications. However, system performance suffers from recognition failures caused by environmental noise. In extreme cases, even humans are unable to overcome loud noise. For example, two people may talk face to face in a noisy room and be unable to understand each other, and the problem persists over the telephone when the two people are in different noisy rooms. For people, a visual cue can make the difference between successful and failed communication. Electronically, visual cues have not been conventionally available, and even speech or signal detection is difficult when an extreme amount of noise is present.

Description

The disclosed method addresses the speech-event detection problem with audiovisual cues. Visual cues aid recognition directly through the detection of lip movement (lip recognition). They also confirm that the subject being listened to is actually speaking, a determination that is extremely difficult from the speech signal alone when loud or extreme noise is present.
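
As one illustration of how lip movement might be detected, a simple frame-differencing check over a mouth region of interest is sketched below in Python. The disclosure does not specify its lip-tracking technique, so the ROI inputs and the threshold value here are assumptions, not the disclosed implementation.

    import numpy as np

    def lip_movement(prev_roi, curr_roi, threshold=8.0):
        # prev_roi, curr_roi: grayscale mouth-region images (2-D arrays)
        # from consecutive camera frames, as produced by an assumed
        # upstream face/lip tracker. Reports movement when the mean
        # absolute inter-frame difference exceeds a tuned threshold.
        diff = np.abs(curr_roi.astype(np.float32) - prev_roi.astype(np.float32))
        return float(diff.mean()) > threshold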

The minimal hardware requirement for the platform system is a camera and a microphone. More than one of each device, or an array of microphones and/or cameras, may be used. Platform systems include:

Desktop computer

Laptop

Handheld device

Tablet

Mobile phone

The system analyzes the input signals captured from the camera and the microphone simultaneously (see Figure 1). The camera tracks the subject's lips and detects movement indicating that the subject may be starting to speak. The signal captured from the microphone is used for speech recognition and speech event detection. When the signal from the camera indicates no lip movement, the algorithm assumes the signal from the microphone is background noise.

When the signal from the camera indicates lip movement, the current energy level of the microphone signal (E1) is compared with the latest energy level recorded without lip movement (E0) to determine whether the subject is truly speaking. This comparison avoids false alarms, for example when the subject opens his/her mouth but does not produce any sound. If the energy levels remain unchanged, no speech event is detected. If the energy levels change, the recognition process is started (see Figure 2). Alpha, a calibration factor, can be set automatically from the variance of the energy levels detected during system initialization. By comparing E1 and E0 when lip movement has been detected, reliable speech event detection is achieved.
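
The following Python sketch illustrates this decision logic under stated assumptions: the disclosure does not define the exact energy measure, the formula mapping the initialization variance to Alpha, or the form of the E1/E0 comparison, so frame_energy, calibrate_alpha, and the threshold test E1 > alpha * E0 below are illustrative choices rather than the disclosed implementation.

    import numpy as np

    def frame_energy(frame):
        # Mean squared amplitude of one microphone analysis window
        # (an assumed energy measure; the disclosure does not define one).
        return float(np.mean(np.square(frame)))

    def calibrate_alpha(init_frames):
        # Derive alpha from the variance of the energy levels observed
        # during system initialization; the exact mapping is unspecified,
        # so a simple spread-based scaling is assumed here.
        energies = np.array([frame_energy(f) for f in init_frames])
        return 1.0 + energies.std() / max(energies.mean(), 1e-12)

    class SpeechEventDetector:
        def __init__(self, alpha):
            self.alpha = alpha
            self.E0 = None  # latest energy recorded without lip movement

        def update(self, frame, lip_moving):
            # Returns True when a speech event is detected for this frame.
            E1 = frame_energy(frame)
            if not lip_moving:
                # No lip movement: treat the microphone signal as
                # background noise and refresh the baseline E0.
                self.E0 = E1
                return False
            # Lip movement detected: if the energy is essentially
            # unchanged relative to the noise baseline (e.g., mouth open
            # but no sound), report no event; otherwise start recognition.
            if self.E0 is not None and E1 <= self.alpha * self.E0:
                return False
            return True

In use, calibrate_alpha would run once at start-up on frames known to be free of lip movement, and update would run once per analysis frame, with the lip tracker's output gating the energy comparison.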

Benefits

By using a visual aid (the camera), the difficulty and...