Original Publication Date: 2002-Nov-20
Included in the Prior Art Database: 2003-Jun-21
Background

Signal processing, especially for speech recognition, involves filtering of time data. The standard approach takes advantage of the Fast Fourier Transform (FFT) and converts a time frame (typically 10 ms) into the frequency domain, where speech-relevant features such as the formants are extracted. Although the FFT is an extremely efficient algorithm for deriving a spectrum, it suffers from a number of fundamental deficiencies. Temporal resolution is limited by the length of the time frame, and the frame length T in turn limits the frequency resolution to 1/T (for example, 100 Hz for a 10 ms frame). In the human auditory system, however, time and frequency resolution are not constant. Whereas frequency resolution decreases with frequency (in the range above 500 Hz), temporal resolution (at medium and high levels) increases. The time-frequency resolution of the inner ear seems to be optimized for the best performance in both domains by means of:

1. Asymmetrical filter shapes

The shapes of the cochlear filters are extremely asymmetrical. Whereas the low-frequency slope is shallow (6 dB/oct), the high-frequency slope is very steep (up to 100 dB/oct). The excellent frequency resolution of the hearing system is therefore mainly provided by the steep high-frequency slope. From a signal processing point of view, the 6 dB/oct low-frequency slope operates as a temporal differentiator and the steep high-frequency slope as a high-order low-pass filter. This low-pass filter is strongly damped, providing optimal preservation of the time signal. This filtering strategy effectively overcomes the main shortcoming of conventional band-pass filters with symmetrical slopes, namely long ringing.

2. Temporal differentiation and rectification

As stated in the previous section, the 6 dB/oct low-frequency slope of the cochlear filter operates as a temporal differentiator of the time signal.
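The statement that a 6 dB/oct slope acts as a temporal derivative can be checked with a short numerical sketch. It assumes a discrete first difference, y[n] = x[n] - x[n-1], as a stand-in for the ideal differentiator, and an arbitrary 16 kHz sample rate; the function names are hypothetical and not part of the original disclosure. Doubling the frequency of a test tone roughly doubles the output amplitude, which corresponds to 20·log10(2) ≈ 6.02 dB per octave.

```python
import math

def first_difference(x):
    """Discrete stand-in for d/dt: y[n] = x[n] - x[n-1]."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

def rms(x):
    """Root-mean-square value of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def differentiator_gain_db(freq_hz, fs_hz=16000.0, n_samples=1600):
    """RMS gain (dB) of the first-difference filter for a unit sine at freq_hz.

    n_samples spans a whole number of periods for the test tones used below,
    so the RMS values are not biased by a partial period.
    """
    x = [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples + 1)]
    y = first_difference(x)
    return 20 * math.log10(rms(y) / rms(x[1:]))

# Gain increase over one octave (100 Hz -> 200 Hz).
slope = differentiator_gain_db(200.0) - differentiator_gain_db(100.0)
print(f"slope: {slope:.2f} dB/oct")  # slope: 6.02 dB/oct
```

The first difference only approximates a true differentiator well below the Nyquist frequency: its gain is 2·sin(πf/fs) rather than 2πf/fs, so the measured slope falls slightly short of 6.02 dB and flattens as f approaches fs/2.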
Because of this differentiation, an impulse reaching the inner ear is transformed into two impulses of opposite polarity, one of which is evidently redundant. The transduction system of the inner ear, in turn, performs a fairly sharp rectification: it responds to only one of the two pulses and discards the other. Consequently, no redundant information is processed; conversely, the rectification performed by the sensory cells loses no information. Temporal differentiation combined with rectification therefore provides an optimal strategy for lossless coding of temporal information: stimuli at all frequencies are coded by one excitation per period. This, in turn, provides optimal coding of impulses (in speech, plosives).
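The differentiate-then-rectify scheme can be illustrated with a minimal sketch. It assumes a discrete first difference for the 6 dB/oct cochlear slope and ideal half-wave rectification for the sensory-cell transduction; both are simplifications, and the helper names (`first_difference`, `half_wave_rectify`, `count_excitations`, `excitations_for_tone`) are hypothetical. The sketch shows that a short pulse becomes two opposite-polarity spikes, that rectification keeps only one of them, and that a pure tone of any frequency yields one excitation per period.

```python
import math

def first_difference(x):
    """Discrete stand-in for temporal differentiation: y[n] = x[n] - x[n-1]."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

def half_wave_rectify(x):
    """Keep positive excursions only (idealized sensory-cell transduction)."""
    return [max(v, 0.0) for v in x]

def count_excitations(x, threshold=1e-9):
    """Count contiguous runs of samples above threshold; one run = one excitation."""
    count, active = 0, False
    for v in x:
        above = v > threshold
        if above and not active:
            count += 1
        active = above
    return count

def excitations_for_tone(freq_hz, fs_hz=16000.0, duration_s=0.1):
    """Differentiate and rectify a pure tone, then count the excitations."""
    n_samples = int(round(fs_hz * duration_s))
    # Start at a trough (-cos) so the differentiated tone begins at a zero
    # crossing and the window contains no partial excitation at its edges.
    x = [-math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples + 1)]
    return count_excitations(half_wave_rectify(first_difference(x)))

# A short rectangular pulse: differentiation yields two opposite-polarity spikes...
pulse = [0.0] * 5 + [1.0] * 4 + [0.0] * 5
d = first_difference(pulse)
print([v for v in d if v != 0.0])               # [1.0, -1.0]
# ...and rectification keeps exactly one of them.
print(count_excitations(half_wave_rectify(d)))  # 1

# Pure tones: expect freq_hz * duration_s excitations, i.e. one per period.
for f in (100.0, 500.0, 2000.0):
    print(f, excitations_for_tone(f))
```

Counting runs of positive samples is only a crude proxy for neural excitations, but it makes the point of the text concrete: after rectification, the number of excitations tracks the number of stimulus periods regardless of frequency.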