Tracking the Pitch of a Digital Speech Signal

IP.com Disclosure Number: IPCOM000081253D
Original Publication Date: 1974-Apr-01
Included in the Prior Art Database: 2005-Feb-27
Document File: 3 page(s) / 50K

Publishing Venue

IBM

Related People

Bakis, R: AUTHOR [+2]

Abstract

One of the fundamental acoustic parameters required for the study of stress in human speech is pitch, the fundamental frequency of the vocal cord oscillations. Stress is well known to provide important cues for speech recognition. This disclosure describes a pitch-tracking method that works in the frequency domain, which is already a popular basis for many acoustic processors, but that can be implemented at higher speed than the spectral analysis techniques used by others, with no discernible loss of accuracy.

At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.

Pitch detection in voiced speech has previously been attempted in the time domain by several hardware and software methods. Each of these suffers from the detection of false pitch peaks caused by harmonics and similar effects. If only the pitch frequency is required, and not the actual interval between individual pitch peaks, a frequency-domain approach may be used. This involves obtaining short-time spectra, that is, the magnitudes of the Fourier transforms of segments of speech. The segments must satisfy two criteria: the frequency resolution must be sufficient to discern pitch, and the short-time signal being transformed must contain enough pitch periods for pitch effects to become evident.

When these two criteria are met, cepstral techniques (i.e., taking the Fourier transform of the log of the power spectrum) are known to yield reasonably accurate and useful results. The resulting nonlinear frequency scale and the high cost of implementation, however, make cepstral techniques somewhat less attractive.
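The cepstral approach mentioned above can be illustrated with a minimal sketch. This is not the method of this disclosure; it only shows the comparison technique in its textbook form. The function names (fft, cepstral_pitch), the Hann window, the 60 to 400 Hz search range, and the 12.8 kHz sampling rate used in the usage note are all assumptions chosen for illustration.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) of one frame from the peak of its cepstrum.

    fmin and fmax bound the search and are illustrative assumptions.
    """
    n = len(frame)
    # Periodic Hann window (an assumed choice of window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    spectrum = fft([s * w for s, w in zip(frame, window)])
    # Log of the power spectrum; the small constant avoids log(0).
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in spectrum]
    # Fourier transform of the log power spectrum. The input is real and
    # symmetric, so only the real part of the result is significant.
    cepstrum = [c.real for c in fft(log_power)]
    # Search quefrencies (in samples) that correspond to fmax down to fmin.
    q_lo = int(fs / fmax)
    q_hi = min(int(fs / fmin), n // 2)
    q_peak = max(range(q_lo, q_hi + 1), key=lambda q: cepstrum[q])
    return fs / q_peak
```

Because the estimate is fs / q_peak, equal steps in quefrency give unequal steps in frequency. At an assumed 12.8 kHz sampling rate, quefrencies 64 and 65 correspond to 200 Hz and about 196.9 Hz, while 32 and 33 correspond to 400 Hz and about 387.9 Hz. This is the nonlinear frequency scale referred to above, and the second transform per frame accounts for part of the cost.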

In the present method, the conditions given above for the spectral representation are assumed to hold. A windowed fast Fourier transform of length N is taken of the digitized speech every S seconds. For a sampling interval delta t, each spectrum therefore covers N delta t seconds of data. Usually S is less than N delta t, which implies that the same time-domain data contributes to several spectra. N must be selected so that several pitch periods fall within N delta t. Nominal values are N delta t = 40 msec and S = 10 msec, as illustrated by Fig. 1.
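The framing described above can be sketched as follows, using the nominal 40 msec frame and 10 msec step, so that consecutive frames overlap. This is a minimal illustration and not the disclosed implementation: the function names, the Hann window, and the zero-padding to a power of two are assumptions. The padding is needed only when the frame length in samples is not already a power of two.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def short_time_log_spectra(samples, dt, frame_sec=0.040, step_sec=0.010):
    """Return (spectra, nfft): one log-magnitude spectrum per frame.

    dt is the sampling interval in seconds (delta t in the text),
    frame_sec is N * dt, and step_sec is S.
    """
    n = round(frame_sec / dt)           # samples per frame (N)
    hop = round(step_sec / dt)          # samples between frame starts
    nfft = 1 << (n - 1).bit_length()    # next power of two >= n (assumption)
    # Periodic Hann window (an assumed choice of window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    spectra = []
    for start in range(0, len(samples) - n + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(n)]
        frame += [0.0] * (nfft - n)     # zero-pad if n is not a power of two
        half = fft(frame)[: nfft // 2 + 1]   # keep non-negative frequencies
        # Small constant avoids log(0) in empty bins.
        spectra.append([math.log(abs(c) + 1e-12) for c in half])
    return spectra, nfft
```

With an assumed sampling rate of 12.8 kHz, the frame is exactly 512 samples, the step is 128 samples, and bin k of each spectrum corresponds to k * 25 Hz. Each sample away from the ends of the signal then falls into four successive frames.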

The log of the magnitude of the short-time windowed discrete Fourier transform of some segment of data is shown in Fig. 2.

It is well known that if an idealized linear model of the vocal...