Audio-Visual Training Utilizing Existing Speech Databases
Original Publication Date: 2004-Nov-25
Included in the Prior Art Database: 2004-Nov-25
Current speech recognition techniques are based on the Hidden Markov Model (HMM) or on neural networks. Both technologies require large databases to learn the characteristics of each word or phoneme in a so-called training process. These databases must be transcribed, i.e., each utterance has to be assigned a textual representation.

The combination of audio and visual information can improve the recognition rate, especially in noisy environments. This technology is called lip reading, and for the training process large audio-visual databases are required for general-purpose, speaker-independent applications. Up to now, the state of the art in audio-visual speech recognition is not as advanced as in conventional speech recognition. Most of the existing audio-visual corpora are the result of efforts by a few university groups or individual researchers with limited resources. Therefore most of these databases suffer from one or more shortcomings: they contain a single speaker or a small group of speakers, and they usually address simple recognition tasks, such as small-vocabulary ASR (Automatic Speech Recognition) of isolated or connected words. The largest audio-visual database, and the only one suitable for large-vocabulary continuous speech recognition, is the IBM ViaVoice audio-visual database. It is IBM property and is not available to other companies for commercial purposes.
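One common way to combine the two modalities mentioned above is feature-level (early) fusion, where synchronized per-frame audio and visual feature vectors are concatenated into a single observation vector before HMM training. The following is a minimal sketch of that idea; the feature dimensions (13-dimensional MFCC audio frames, 6-dimensional lip-region features) and the function name are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature-level (early) fusion: concatenate frame-synchronized
    audio and visual feature vectors into one observation stream.

    audio_feats:  (T, Da) array, e.g. MFCC frames       (assumed)
    visual_feats: (T, Dv) array, e.g. lip-region params (assumed)
    returns:      (T, Da + Dv) fused observation vectors
    """
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must be frame-synchronized")
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Hypothetical example: 100 frames of 13-dim audio and 6-dim visual features
audio = np.random.randn(100, 13)
visual = np.random.randn(100, 6)
fused = fuse_features(audio, visual)
print(fused.shape)  # (100, 19)
```

The fused vectors can then be fed to a standard HMM trainer exactly as audio-only features would be, which is why large transcribed audio-visual corpora are needed for this approach.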