Audio-Visual Training Utilizing Existing Speech Databases

IP.com Disclosure Number: IPCOM000031711D
Published in the IP.com Journal: Volume 4 Issue 11 (2004-11-25)
Included in the Prior Art Database: 2004-Nov-25
Document File: 4 page(s) / 195K

Publishing Venue

Siemens

Related People

Juergen Carstens: CONTACT

Abstract

Current speech recognition techniques are based on Hidden Markov Models (HMMs) or neural networks. Both kinds of technology require large databases to learn the characteristics of each word or phoneme in a so-called training process. These databases must be transcribed, i.e., each utterance has to be assigned its textual representation. The combination of audio and visual information can improve the recognition rate, especially in noisy environments. This technique is known as lip reading, and for general-purpose, speaker-independent applications the training process requires large audio-visual databases. Up to now, the state of the art in audio-visual speech recognition is not as advanced as in conventional speech recognition. Most of the existing audio-visual corpora are the result of efforts by a few university groups or individual researchers with limited resources, and therefore most of these databases suffer from one or more shortcomings: they contain a single speaker or a small group of speakers, and they usually address simple recognition tasks, such as small-vocabulary automatic speech recognition (ASR) of isolated or connected words. The largest audio-visual database, and the only one suitable for large-vocabulary continuous speech recognition, is the IBM ViaVoice audio-visual database; it is IBM property and is not commercially available to other companies.


Idea: Dr. Tobias Schneider, DE-Kamp-Lintfort; Klaus Lukas, DE-Muenchen; Jesús Fernando Guitarte Pérez, DE-Muenchen

Current speech recognition techniques are based on Hidden Markov Models (HMMs) or neural networks. Both kinds of technology require large databases to learn the characteristics of each word or phoneme in a so-called training process. These databases must be transcribed, i.e., each utterance has to be assigned its textual representation.
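
Why transcription matters can be illustrated with a toy discrete HMM: once the transcription fixes which state (phoneme) produced each frame, maximum-likelihood training reduces to counting and normalising. The sketch below is illustrative only; the function name and the discrete-symbol simplification are not from the disclosure, and real systems train Gaussian-mixture emissions on unaligned data with Baum-Welch.

```python
import numpy as np

# Supervised HMM parameter estimation from a transcribed, frame-aligned
# utterance: because each frame carries a known state label, training is
# just counting transitions and emissions, then normalising the counts.
def train_hmm_supervised(states, observations, n_states, n_symbols):
    trans = np.zeros((n_states, n_states))
    emit = np.zeros((n_states, n_symbols))
    for s, o in zip(states, observations):
        emit[s, o] += 1                        # emission counts
    for s0, s1 in zip(states[:-1], states[1:]):
        trans[s0, s1] += 1                     # transition counts
    # Normalise each row into a probability distribution (guard empty rows).
    trans /= np.maximum(trans.sum(axis=1, keepdims=True), 1)
    emit /= np.maximum(emit.sum(axis=1, keepdims=True), 1)
    return trans, emit
```

Without the transcription, the state labels are unknown and this direct estimation is impossible, which is exactly the difficulty the rest of this disclosure addresses.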

The combination of audio and visual information can improve the recognition rate, especially in noisy environments. This technique is known as lip reading, and for general-purpose, speaker-independent applications the training process requires large audio-visual databases.

Up to now, the state of the art in audio-visual speech recognition is not as advanced as in conventional speech recognition. Most of the existing audio-visual corpora are the result of efforts by a few university groups or individual researchers with limited resources. Therefore, most of these databases suffer from one or more shortcomings: they contain a single speaker or a small group of speakers, and they usually address simple recognition tasks, such as small-vocabulary automatic speech recognition (ASR) of isolated or connected words. The largest audio-visual database, and the only one suitable for large-vocabulary continuous speech recognition, is the IBM ViaVoice audio-visual database. It is IBM property and is not commercially available to other companies.

Two main problems arise when using small audio-visual databases for training. First, since audio-visual databases are quite small in comparison to conventional audio-only databases, the acoustic part of the HMM will be trained more poorly than it would be with the conventional large audio-only databases. Thus it would be advantageous to enable a reuse of the conventional acoustic training information after the training of the visual part. Second, the visual part will also be poorly trained, as there is not enough transcribed video material for proper training.
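
One standard device in the audio-visual ASR literature for reusing an acoustic model trained on large audio-only corpora alongside a separately trained visual model is multi-stream fusion: the per-state log-likelihoods of the two streams are combined with stream weights. The sketch below is a minimal illustration under that assumption; the function name and weight values are hypothetical, not from the disclosure.

```python
# Multi-stream likelihood fusion: the audio stream (trained on large
# audio-only corpora) and the visual stream (trained separately) each
# score a state; their log-likelihoods are combined with stream weights,
# typically biased toward audio in clean conditions.
def fused_log_likelihood(log_p_audio, log_p_visual, w_audio=0.7, w_visual=0.3):
    return w_audio * log_p_audio + w_visual * log_p_visual
```

Because each stream keeps its own model, the acoustic parameters need not be retrained on the small audio-visual corpus, which addresses the first problem above.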

Audio-visual information is continuously generated on TV, on the web through the use of webcams, and nowadays also every day on multimedia mobile phones. Non-transcribed, unlabeled audio-visual information can therefore be obtained easily. The problem is how to use this information for training: because it is neither transcribed nor labeled, the boundaries between phonemes are not known.
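
The missing-boundary problem can be made concrete. Given a transcription and already-trained models, frame-level phoneme boundaries can be recovered by a Viterbi (forced-alignment) search; with untranscribed material, even this step is unavailable. Below is a toy numpy sketch of the Viterbi search over a two-state discrete HMM; all parameters are hypothetical and only illustrate the mechanism.

```python
import numpy as np

# Toy Viterbi search: recover the most likely state (phoneme) sequence for
# an observation sequence, and hence the boundaries between states.
def viterbi(trans, emit, init, obs):
    n_states, T = trans.shape[0], len(obs)
    delta = np.zeros((T, n_states))            # best path probability per state
    back = np.zeros((T, n_states), dtype=int)  # backpointers for traceback
    delta[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        # scores[i, j]: probability of best path ending in i, then moving to j.
        scores = delta[t - 1][:, None] * trans * emit[None, :, obs[t]]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):              # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For example, with self-biased transitions and emissions of 0.9, the observation sequence [0, 0, 1, 1] aligns to the state sequence [0, 0, 1, 1], placing the state boundary between frames 2 and 3. The point is that such an alignment presupposes trained models, which is precisely what a small audio-visual corpus cannot provide.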

So far, these problems have been addressed by hand-labeling the individual words for speaker-dependent, small-vocabulary lip-reading systems. Lip-reading systems have not yet been implemented in commercial solutions, so the databases used for research need not be as large as the database required to train a product providing speaker-independent lip reading with arbitrary vocab...