
Natural Sounding Voice Prompts in a Voice Response Unit System

IP.com Disclosure Number: IPCOM000117815D
Original Publication Date: 1996-Jun-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 4 page(s) / 118K

Publishing Venue

IBM

Related People

Butler, ND: AUTHOR [+4]

Abstract

Many existing Voice Response Unit (VRU) platforms, such as IBM* CallPath DirectTalk/6000*, offer the ability to concatenate prerecorded voice segments to 'read out' specific information to the user. For example, the VRU retrieves an account balance of 100 pounds, retrieves the appropriate segments for '100' and 'Pounds sterling', concatenates them, and plays the result to the caller.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 42% of the total text.

      Many existing Voice Response Unit (VRU) platforms, such as IBM*
CallPath DirectTalk/6000*, offer the ability to concatenate
prerecorded voice segments to 'read out' specific information to the
user.  For example, the VRU retrieves an account balance of 100
pounds, retrieves the appropriate segments for '100' and 'Pounds
sterling', concatenates them, and plays the result to the caller.
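As a rough illustration of the plain concatenation described above, the following Python sketch joins prerecorded segments in order with no adjustment of pitch. The segment store, the placeholder byte strings, and the name build_prompt are assumptions made for this example only; they are not taken from DirectTalk/6000 or any other VRU product.

```python
# Hypothetical segment store: segment name -> raw audio bytes.
# The byte strings are placeholders standing in for recorded waveforms.
SEGMENTS = {
    "100": b"<100>",
    "Pounds sterling": b"<Pounds sterling>",
}


def build_prompt(names, store=SEGMENTS):
    """Join prerecorded segments end to end, in the order given.

    Nothing is done to the pitch contour of any segment, which is the
    'blind' concatenation the disclosure sets out to improve on.
    """
    return b"".join(store[name] for name in names)


# Account balance example: '100' followed by 'Pounds sterling'.
prompt = build_prompt(["100", "Pounds sterling"])
```

Each segment is played exactly as it was recorded, whatever its position in the message.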

      Unfortunately, simple waveform concatenation means that the
outgoing message does not carry the appropriate linguistic
information -- specifically the right pitch contour -- to help the
user understand what is being said.  In some cases, e.g., for longer
digit strings like account codes, this may even lead to confusion and
frustration as the caller must listen to the outgoing prompt
repeatedly.

      In addition, any naturalness gained by prerecording a real
speaker is lost through the blind concatenation of waveforms.

      The 'intonation' -- specifically the pitch movement --
associated with digit and date strings is relatively simple:  all
digits will tend to be said on a similar pitch (though this is not
always true), except for digits in pre-pausal or final positions.

      The solution described here covers two possible approaches:  a
low-cost approach based on waveform concatenation, but with
context-specific variants of each digit; and a hybrid approach that
combines prerecorded segments with inserted, synthesized
context-specific variants.

The Low-Cost Approach

      This proposal involves recording different versions of the same
digit, each spoken with a different pitch movement, so that
concatenation can produce output that is more appropriate, more
communicatively efficient, and more natural-sounding.

Background

NOTE: the description below is essentially language-specific (for
English); but the principle extends to other languages as detailed
under Segment Recording and Other Languages.

      During system set-up, a number of versions of each digit are
recorded on different pitch contours:  both level and moving.  These
are described below:
  1.  Basic Version
  Three versions of each digit are recorded with the following pitch
movement:
                   Pitch Movement               Positional Variant
                   High Level                            A
                   Fall-Rise                             B
                   Fall                                  C

For standard digit strings (phone numbers, etc.):
               e.g., 1234 maps to positional variants AAAC
               e.g., 402-498-1234 maps to AAB AAB AAAC
               i.e.: Variant A is used in non-final positions;
               variant B is used before a p...
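The two mappings shown above (1234 to AAAC, and 402-498-1234 to AAB AAB AAAC) can be reproduced with a short Python sketch. Because the rule text is cut off, the sketch rests on assumptions read off the examples: a hyphen or space marks a pause, the last digit of every non-final group takes variant B, the last digit of the whole string takes variant C, and all other digits take variant A. The function names assign_variants and segment_keys are hypothetical.

```python
import re


def assign_variants(number: str) -> list[str]:
    """Return one string of variant labels (A/B/C) per digit group.

    Assumption: groups are separated by hyphens or whitespace, and each
    separator corresponds to a pause in the spoken string.
    """
    groups = [g for g in re.split(r"[-\s]+", number.strip()) if g]
    for group in groups:
        if not re.fullmatch(r"[0-9]+", group):
            raise ValueError(f"not a digit group: {group!r}")

    labelled = []
    for index, group in enumerate(groups):
        is_last_group = index == len(groups) - 1
        # C (fall) closes the whole string; B (fall-rise) closes any
        # earlier group; A (high level) covers every other digit.
        closing = "C" if is_last_group else "B"
        labelled.append("A" * (len(group) - 1) + closing)
    return labelled


def segment_keys(number: str) -> list[tuple[str, str]]:
    """Pair each digit with its variant label, e.g. ('4', 'A').

    The pairs could serve as lookup keys into a store of recorded
    digit variants; such a store is not modelled here.
    """
    digits = re.sub(r"[-\s]+", "", number.strip())
    labels = "".join(assign_variants(number))
    return list(zip(digits, labels))
```

For instance, assign_variants("402-498-1234") yields the groups AAB, AAB and AAAC, and segment_keys("1234") pairs the digits 1, 2 and 3 with variant A and the digit 4 with variant C.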