
Efficient Method for Large Vocabulary, Discrete Utterance Recognition

IP.com Disclosure Number: IPCOM000045515D
Original Publication Date: 1983-Apr-01
Included in the Prior Art Database: 2005-Feb-07
Document File: 4 page(s) / 53K

Publishing Venue

IBM

Related People

Dixon, NR: AUTHOR (+3 others)

Abstract

This invention relates to an efficient method for speech recognition based on successive comparisons between the sectional average spectrum of a voice sample and a hierarchically stored library. A two-stage recognition process is used. In the first stage, an incoming utterance is matched against the entire vocabulary in order to identify a predetermined number of best-match words. The following paragraphs are directed to the first-stage processing of the method and the significance of utterance duration and sectional frequency spectrum.


Fig. 1 shows a functional diagram of a two-stage recognition system. The first stage portion is the vocabulary limiter of the present disclosure.

First, the speaker's utterance is converted to an electrical analog signal by the microphone (MIKE) and passed to an amplifier (AMP). The amplified signal is directed in parallel to the feature abstraction routine of each stage, where predetermined features or measurements are computed. An example of a typical first-stage feature abstraction routine is shown in Fig. 2, which will be explained later.
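The extracted text does not spell out how the first-stage features are computed beyond naming the sectional average spectrum, so the following sketch is only one plausible interpretation, written in Python (not part of the disclosure): the utterance is cut into a fixed number of contiguous sections and short-time FFT magnitude spectra are averaged within each section. The function name, section count, and frame length are illustrative assumptions.

import numpy as np

# Illustrative sketch only (not from the disclosure): compute a "sectional
# average spectrum" by splitting the utterance into num_sections contiguous
# groups of frames and averaging the FFT magnitude spectrum within each group.
def sectional_average_spectrum(samples, num_sections=8, frame_len=256):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if len(frames) < num_sections:
        raise ValueError("utterance too short for the requested section count")
    spectra = np.array([np.abs(np.fft.rfft(f)) for f in frames])
    sections = np.array_split(spectra, num_sections, axis=0)
    # One averaged spectrum per section; the rows form the coarse stage-1 feature.
    return np.array([s.mean(axis=0) for s in sections])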

The recognizer operates in two modes: (1) training mode and (2) recognition mode. During training, the abstracted features for each word are stored in FEATURE STORAGE at each stage, which is typically a random-access memory (RAM) device. During recognition, the utterance is compared with each of the stored word prototypes, and a predetermined number (typically 100) of the best-matched words is selected at the first stage. Exhaustive matching is then performed on those selected words at the second stage, and ultimately a single best-matched word is recognized.
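As a rough illustration of this two-mode, two-stage flow (and not the disclosed implementation), the Python sketch below models FEATURE STORAGE as in-memory dictionaries and uses Euclidean distance as a stand-in for both the coarse first-stage match and the detailed second-stage match; the class name, the feature-extractor callables, and the distance measure are all assumptions.

import numpy as np

class TwoStageRecognizer:
    """Minimal sketch of a two-stage recognizer (assumptions noted above)."""

    def __init__(self, coarse_fn, detailed_fn, n_best=100):
        self.coarse_fn = coarse_fn        # cheap stage-1 feature abstraction
        self.detailed_fn = detailed_fn    # detailed stage-2 feature abstraction
        self.n_best = n_best              # number of stage-1 survivors
        self.coarse_store = {}            # word -> stage-1 prototype features
        self.detailed_store = {}          # word -> stage-2 prototype features

    def train(self, word, samples):
        # Training mode: abstract and store features for each stage.
        self.coarse_store[word] = self.coarse_fn(samples)
        self.detailed_store[word] = self.detailed_fn(samples)

    def recognize(self, samples):
        # Recognition mode, stage 1: rank the whole vocabulary with the cheap
        # features and keep only the n_best closest words (vocabulary limiter).
        coarse = self.coarse_fn(samples)
        ranked = sorted(self.coarse_store,
                        key=lambda w: np.linalg.norm(coarse - self.coarse_store[w]))
        candidates = ranked[:self.n_best]
        # Stage 2: exhaustive, detailed matching over the survivors only.
        detailed = self.detailed_fn(samples)
        return min(candidates,
                   key=lambda w: np.linalg.norm(detailed - self.detailed_store[w]))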

In the first stage, two features are used: utterance duration and sectional frequency spectra. No further operation is performed for words whose durations differ greatly from that of the unknown utterance. Let w(u) and w(p(i)) be the durations of the utterance and the i-th prototype, respectively, where i = 1 to M and M is the number of prototypes (the vocabulary size). The duration decision is |w(u) - w(p(i))| < q·w(u), where q is a predetermined fraction. For example, if q = 0.3, a 30% allowance is made for speaking-rate and other forms of duration variation.

The set of N prototypes most similar to the input utterance is then determined (here, N = 100, as noted above). Each of the N words is compared with the input utterance in the second stage, using a much more detailed description of each word. The operations per word take more time, but the total time needed does not increase substantially, since stage one, the vocabulary limiter, has reduced the number of candidate words enough for the system to handle in real time. In the second stage, the word which is closest (smallest distance) to the input utterance will probably be selected as the...
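To make the duration test concrete, here is a small Python sketch (an illustration, not the original implementation) that keeps exactly those prototypes satisfying |w(u) - w(p(i))| < q·w(u); the function name and the example durations are made up.

def duration_survivors(w_u, prototype_durations, q=0.3):
    # Keep prototype i only if |w(u) - w(p(i))| < q * w(u).
    return [i for i, w_p in enumerate(prototype_durations)
            if abs(w_u - w_p) < q * w_u]

# With w(u) = 500 ms and q = 0.3, only durations strictly between
# 350 ms and 650 ms survive the test.
print(duration_survivors(500, [300, 400, 500, 640, 700]))  # -> [1, 2, 3]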