
Visual Speech Recognition by Tracking of Lip Motion

IP.com Disclosure Number: IPCOM000010899D
Published in the IP.com Journal: Volume 3 Issue 2 (2003-02-25)
Included in the Prior Art Database: 2003-Feb-25
Document File: 4 page(s) / 420K

Publishing Venue

Siemens

Related People

Juergen Carstens: CONTACT

Abstract

The invention provides a visual speech recognition system based on lip-reading. A gradient technique is applied which tracks the motion of facial parts involved in articulation with sub-pixel accuracy. To compensate for bulk motion of the head, relative motions are measured (i.e. how the lips open and close). In addition, particular regions of the face are identified which carry significant speech information. The core of this invention lies in the identification of novel features and in the dynamic approach to feature extraction: shape recognition is only required for initialization; from then on, motions are tracked. The novel features carry speech information in addition to lip motion: chin motion (measured as the chin-to-nose distance), motion of the upper lip (measured as the upper-lip-to-nose distance), and motion of the lower lip (measured as the lower-lip-to-chin distance). Previously, mainly the opening of the lips and the width of the mouth have been used. The invention instead proposes a dynamic approach which does not extract shapes but tracks motions.



Visual Speech Recognition by Tracking of Lip Motion

Idea: Dr. Werner Hemmert, DE-Muenchen; Juan Pablo de la Cruz Gutierrez, DE-Muenchen

The invention provides a visual speech recognition system based on lip-reading. A gradient technique is applied which tracks the motion of facial parts involved in articulation with sub-pixel accuracy. To compensate for bulk motion of the head, relative motions are measured (i.e. how the lips open and close). In addition, particular regions of the face are identified which carry significant speech information.
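The gradient technique itself is not specified in detail in this disclosure. As a minimal sketch, the region tracking could be realized with a pyramidal Lucas-Kanade optical-flow step, which also yields sub-pixel positions; the OpenCV call and its parameters below are illustrative assumptions, not the authors' implementation.

import cv2

def track_points(prev_gray, next_gray, prev_pts):
    """Track facial points (nose, lips, chin) between two grayscale frames.

    prev_pts: float32 array of shape (N, 1, 2) with pixel coordinates.
    Returns the new sub-pixel positions and a per-point status flag.
    """
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(21, 21),   # window with many pixels for a stable estimate
        maxLevel=3,         # image pyramid levels to handle larger motions
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
    )
    return next_pts, status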

The core of this invention lies in the identification of novel features and in the dynamic approach to feature extraction.

- Shape recognition is only required for initialization; from then on, motions are tracked.

- Novel features: The invention uses novel features which carry speech information in addition to lip motion (see the sketch after this list):

- chin motion (measured as the chin-to-nose distance)
- motion of the upper lip (measured as the upper-lip-to-nose distance)
- motion of the lower lip (measured as the lower-lip-to-chin distance)
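The following minimal sketch shows how these three relative measurements could be computed from tracked landmark positions; the landmark names are assumptions made for illustration only.

import numpy as np

def relative_features(landmarks):
    """landmarks: dict mapping 'nose', 'upper_lip', 'lower_lip', 'chin'
    to (x, y) positions in the current frame."""
    nose = np.asarray(landmarks["nose"], dtype=float)
    upper = np.asarray(landmarks["upper_lip"], dtype=float)
    lower = np.asarray(landmarks["lower_lip"], dtype=float)
    chin = np.asarray(landmarks["chin"], dtype=float)
    return {
        "chin_to_nose": np.linalg.norm(chin - nose),        # chin motion
        "upper_lip_to_nose": np.linalg.norm(upper - nose),  # upper-lip motion
        "lower_lip_to_chin": np.linalg.norm(lower - chin),  # lower-lip motion
    }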

Previously, mainly the opening of the lips and the width of the mouth have been used as features.

The invention instead proposes a dynamic approach which does not extract shapes but tracks motions. This has two advantages:

- Robustness: Motion estimates are calculated from regions including many pixels, which makes the algorithm robust (a sketch illustrating this follows the list).
- Speed: The extraction of shapes is computationally expensive; for tracking, no shape extraction is required.
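As an illustration of the robustness argument, the sketch below averages a dense optical-flow field over a whole region of interest, so errors of single pixels cancel out; the Farneback flow used here is an assumption, since the disclosure only speaks of a gradient technique.

import cv2

def region_motion(prev_gray, next_gray, roi):
    """Average motion vector (dx, dy) of a rectangular region roi = (x, y, w, h)."""
    x, y, w, h = roi
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Averaging over many pixels makes the estimate robust to local noise.
    return flow[y:y + h, x:x + w].reshape(-1, 2).mean(axis=0)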

For this invention, novel features have been identified which carry a significant part of the speech information (a sketch combining them follows the list). These are:

- Vertical opening of the mouth (upper-lip minus lower-lip motion)
- Motion of the upper lip (subtracting the upper-lip motion from the motion of the face, i.e. of a region including the nose tip)
- Motion of the chin (subtracting the chin motion from the motion of the face, i.e. of a region including the nose tip); the motion of the chin is independent of the motion of the lips, since chin motion is possible even when the mouth stays closed
- Width of the lips (left minus right lip-crease motion)
- Motion of the lower lip relative to the chin (subtracting the lower-lip motion from the chin motion)
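A minimal sketch of how these five features could be assembled from per-region motion estimates (such as those returned by the region tracker above) is given below; the dictionary keys and sign conventions are assumptions for illustration.

def speech_features(m):
    """m: dict of per-region motion vectors (dx, dy) for the current frame."""
    face = m["nose_region"]  # bulk head motion, measured near the nose tip
    return {
        # vertical opening of the mouth (upper-lip minus lower-lip motion)
        "mouth_opening": m["upper_lip"][1] - m["lower_lip"][1],
        # upper-lip motion relative to the face
        "upper_lip_rel": face[1] - m["upper_lip"][1],
        # chin motion relative to the face
        "chin_rel": face[1] - m["chin"][1],
        # width of the lips (left minus right lip-crease motion)
        "lip_width": m["left_lip_crease"][0] - m["right_lip_crease"][0],
        # lower-lip motion relative to the chin
        "lower_lip_rel": m["chin"][1] - m["lower_lip"][1],
    }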

The following section shows examples according to this invention.

A visual speech recognition system using an inexpensive web camera and a computer has been implemented. Regions in the face (nose, upper lip, left lip crease, right lip crease, lower lip, chin) were marked by hand. In future versions of this system, the regions of interest could be labeled automatically by an algorithm which finds faces/lips (e.g. a face recognition system with adaptive graph matching). The gradient algorithm tracks the motion of the regions of interest from image to image. Bulk motion of the head is suppressed by extracting relative motions. Then the inventors high-pass filter the motion da...
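The high-pass filtering step mentioned last could, for example, be applied to each feature trace to remove slow drift such as residual head movement; the filter type, cutoff frequency, and order below are assumptions, not values from the disclosure.

import numpy as np
from scipy.signal import butter, filtfilt

def highpass(feature_trace, fs, cutoff_hz=1.0, order=2):
    """feature_trace: 1-D array of one feature over time; fs: frame rate in Hz."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="highpass")
    return filtfilt(b, a, np.asarray(feature_trace, dtype=float))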