
Segmenting and Recognizing Continuous Speech

IP.com Disclosure Number: IPCOM000080598D
Original Publication Date: 1974-Jan-01
Included in the Prior Art Database: 2005-Feb-27
Document File: 3 page(s) / 16K

Publishing Venue

IBM

Related People

Das, SK: AUTHOR

Abstract

This description relates to a segmentation and recognition method for continuous speech signals. The method achieves high performance compared with other existing methods, and the technique is easily adaptable to various input forms (different filterbank outputs, predictive coefficients, etc.).


The present method has up to four levels of processing. Each level utilizes prototypes calculated by an iterative linear-threshold training procedure. The training procedure operates on multidimensional feature vectors, whose component features are selected from an exhaustive list of features by dimensionality-reduction or ranking techniques. The training procedure always remembers its best past performance, so that an acceptable solution is realized even when the training samples are not linearly separable.
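A minimal sketch of such a trainer follows, in Python with NumPy. The disclosure does not specify the update rule or data formats, so the perceptron-style update, the zero threshold, and all names here are assumptions; the essential point it illustrates is that the best weights found so far are remembered, which yields a usable solution even on non-separable data.

```python
import numpy as np

def train_linear_threshold(X, y, w0=None, epochs=50, lr=1.0):
    """Iteratively seek w such that sign(w . x) matches y in {-1, +1}.

    The best weight vector seen so far is remembered, so an acceptable
    solution is returned even when the samples are not linearly separable.
    """
    w = np.zeros(X.shape[1]) if w0 is None else np.asarray(w0, dtype=float).copy()
    best_w = w.copy()
    best_correct = np.sum(np.sign(X @ w) == y)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:           # misclassified: perceptron-style update
                w = w + lr * y_i * x_i
        correct = np.sum(np.sign(X @ w) == y)
        if correct > best_correct:             # remember the best past performance
            best_correct, best_w = correct, w.copy()
    return best_w
```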

In the first level, basic sound events (e.g., phonemes) are detected. The events may correspond to the steady-state portions of the speech signal as well as to the transitional portions. The detection is accomplished by using one or more prototypes for each event. These prototypes are derived from several hand-labeled examples of the basic sound events. Let the events be labeled a(1), a(2), ..., a(n), where n is the total number of events under consideration. Weight vectors w(a1), w(a2), ..., w(an) are derived by the linear-threshold training technique. (Alternatively, other methods of training can also be used.) To this end, a training set of data is first selected. This set should have several occurrences of all the events. In order to derive w(ai), the feature vectors derived at all occurrences of a(i) are trained against all or several of the feature vectors obtained at a(j), for all j ≠ i, as sketched below. Selection of this particular training set of feature vectors is not crucial, since the derived weight vectors will require retraining at a later stage.
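As a hedged illustration of this one-against-the-rest derivation, the per-event weight vectors could be computed as follows; features_by_event and derive_event_weights are illustrative names not taken from the disclosure, and train_linear_threshold is the trainer sketched earlier.

```python
import numpy as np

def derive_event_weights(features_by_event):
    """features_by_event: dict mapping event label a_i to an (m_i, d) array of
    feature vectors taken at hand-labeled occurrences of that event."""
    weights = {}
    for event, pos in features_by_event.items():
        # Feature vectors from all other events a(j), j != i, serve as negatives.
        neg = np.vstack([v for e, v in features_by_event.items() if e != event])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        weights[event] = train_linear_threshold(X, y)  # trainer from the earlier sketch
    return weights
```

Per the disclosure, the exact split of positives against negatives is not crucial, since each w(ai) is retrained later.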

The weight vector w(ai) is then tested on the training set of utterances. Feature vectors derived at the time samples where a(i) is detected falsely are included in the training set for retraining purposes. The retraining starts from the previously derived weight vector w(ai) and calculates a new weight vector w(ai) that reduces or possibly eliminates the false detections. This process is repeated several times, until there is no further improvement in performance or until a perfect score is achieved on the training set of utterances. Next, this first-stage w(ai) is tested on a test set of utterances that is separate from the training set (different sentences and different speakers, or, if speaker-specific training is being pursued, different sentences but the same speaker) and that is known to have several occurrences of the events under study. If the performance is worse than that obtained on the training set, the test set is included in the...
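The retraining loop described above might be sketched as follows, under the assumption (not stated in the extract) that detection fires wherever w . x exceeds a zero threshold; all names are illustrative, and train_linear_threshold is the earlier sketch.

```python
import numpy as np

def retrain_on_false_detections(w, pos, frames, is_event, rounds=10):
    """Refine w(a_i) by folding falsely detected time samples back in.

    pos:      feature vectors at the true occurrences of a(i)
    frames:   feature vectors at all time samples of the training utterances
    is_event: boolean array, True where a(i) actually occurs
    """
    neg = np.empty((0, frames.shape[1]))
    for _ in range(rounds):
        fired = (frames @ w) > 0                   # where the detector fires
        false_hits = frames[fired & ~is_event]     # falsely detected time samples
        if len(false_hits) == 0:
            break                                  # perfect score on the training set
        neg = np.vstack([neg, false_hits])         # include them for retraining
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        w = train_linear_threshold(X, y, w0=w)     # restart from the previous weights
    return w
```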