
Estimating menu choice selection from choices recorded in a single audio file
Disclosure Number: IPCOM000023314D
Original Publication Date: 2004-Mar-29
Included in the Prior Art Database: 2004-Mar-29
Document File: 2 page(s) / 9K

Publishing Venue



Disclosed are a number of methods for estimating which option from an IVR menu a caller has selected via interruption when listening to options played in a single wavefile.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



This disclosure describes how to estimate which option a caller has selected in an interactive voice response (IVR) application when the caller interrupts a single wavefile that contains a list of the options. The interruption can occur when the caller speaks a recognized interruption keyword or phrase (such as "That's it") or any out-of-grammar phrase. Naturally, if the caller speaks an in-grammar option, the system selects that option. The known solutions to the problem of interpreting an interruption are
(1) using text-to-speech (TTS) rather than recorded speech to produce the audio, because the system can easily tell which word is playing at the time of interruption, and
(2) using a set of audio files rather than a single audio file, because the system can easily tell which audio file is playing at the time of interruption.

The drawback of using TTS is that the speech output is of inferior quality to professionally recorded speech. Using multiple audio files has a similar drawback: the files must be spliced together during playback, and the transition from one file to the next sometimes does not sound right to a listener.

There are several ways to attack this problem. The methods vary in how much manual work they require.


For this method, a developer would listen to the audio file (using a media player) and note the times at which the boundaries between options fall. The options and their associated times would then be entered in some form the application can interpret (a table, extensions to VoiceXML tags, etc.). At run time, if the caller interrupted an audio menu file with a non-option utterance, the system would note how far into the audio file the interruption occurred, look up that time in the option/time information, and then take a system action. Potential system actions are:

1. Accept the option defined for the time of the interruption.
2. Prompt the caller using the option in the prompt (e.g., "Was that vanilla?").
3. If the interruption is within a programmed distance from an option boundary, ask a disambiguating question using the two options around the boundary in the prompt (e.g., "Was that vanilla or peach?").
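
The lookup and the three system actions above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the table layout, the option names after "peach", the start times, the 0.25-second boundary distance, and the function name interpret_interruption are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical option/time table a developer might enter by hand:
# (start time of the option in seconds from the start of the file, option name).
MENU = [(0.0, "vanilla"), (1.2, "peach"), (2.3, "chocolate")]

# Hypothetical "programmed distance" from a boundary, in seconds.
BOUNDARY_DISTANCE = 0.25


def interpret_interruption(t, menu=MENU, distance=BOUNDARY_DISTANCE, confirm=False):
    """Map an interruption time t (seconds into the audio) to a system action."""
    starts = [start for start, _ in menu]
    # Index of the option whose segment contains t (clamped to the first option).
    i = max(bisect_right(starts, t) - 1, 0)
    option = menu[i][1]

    # Action 3: close to the boundary with the previous option.
    if i > 0 and t - starts[i] <= distance:
        return ("disambiguate", f"Was that {menu[i - 1][1]} or {option}?")
    # Action 3: close to the boundary with the next option.
    if i + 1 < len(menu) and starts[i + 1] - t <= distance:
        return ("disambiguate", f"Was that {option} or {menu[i + 1][1]}?")
    # Action 2: confirm the single option defined for this time.
    if confirm:
        return ("confirm", f"Was that {option}?")
    # Action 1: accept the option defined for this time.
    return ("accept", option)


if __name__ == "__main__":
    for t in (0.6, 1.3, 1.8):
        print(t, interpret_interruption(t))
```

Whether an application accepts outright or confirms first (the confirm flag here) is a design choice the sketch leaves to the caller; a larger boundary distance trades more disambiguating questions for fewer wrong selections.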


For this method, a developer would view the wave file of the menu prompt and mark sections o...