Architecture for Speech Synthesis from Text Recognition Methods

IP.com Disclosure Number: IPCOM000111913D
Original Publication Date: 1994-Apr-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 4 page(s) / 124K

Publishing Venue

IBM

Related People

Sharman, RA: AUTHOR

Abstract

Disclosed is an architecture for the synthesis of speech from a text unit utilising speech recognition. Proposed is a complete synthesis/recognition architecture, the exploitation of particular recognition models (HMM's) in a generative sense for synthesis and development, and use of a complementary set of tools for speech processing which may be applied to both synthesis and recognition. Included is the "training" method of speech recognition to provide speaker-dependent modelling to produce distinctive output under the control of parameterised models.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 51% of the total text.

      Producing high quality speech output from only text input is a
difficult, if not impossible, task.  Current text-to-speech systems,
run on low-cost platforms, use linguistically based rule processing
for initial text analysis and phonetic transcription, followed by
either rule-based formant prediction or concatenative diphone
segment methods for acoustic output.  Neither method solves the
following problems:  producing truly natural sounding speech, the
ability to produce new "voice types" easily, adaptability to
different dialects, or adaptability to different languages.
Consequently, nearly all text-to-speech systems sound like robots
speaking.  The speech often scores well in intelligibility tests,
but is tiring to listen to, unnatural sounding, and not suitable for
telephone interfaces.  Consequently much development has tended to
follow the "pre-recorded voice segment" approach, as in DT/2 and
DT/6, where a pre-prepared voice response is stored on disk in
digital form, and played back over the telephone, possibly
concatenated with other segments, to form a response to an incoming
user call.  As applications grow, and the type and quantity of data
that needs to be read out over the telephone increases, the
requirements on data storage become unacceptable.  In addition, the
time taken to design and implement new applications becomes
prohibitive, since many messages must be pre-recorded.
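The "pre-recorded voice segment" approach described above can be sketched as follows.  This is a minimal illustration only: the segment names and sample data are hypothetical placeholders, not the actual DT/2 or DT/6 implementation.

```python
# Hypothetical store of pre-recorded voice segments, each a buffer of
# digitised audio samples held "on disk" (here, simple Python lists).
SEGMENTS = {
    "your_balance_is": [0.1, 0.2, 0.1],  # placeholder sample values
    "five":            [0.2, 0.2],
    "dollars":         [0.3, 0.1],
}

def build_response(segment_names):
    """Concatenate stored voice segments into one playable buffer."""
    buffer = []
    for name in segment_names:
        if name not in SEGMENTS:
            # Every message fragment must have been recorded in advance,
            # which is exactly the authoring burden the text describes.
            raise KeyError(f"no pre-recorded segment for '{name}'")
        buffer.extend(SEGMENTS[name])
    return buffer

# A response to an incoming call is formed by concatenation, e.g.
# "your balance is" + "five" + "dollars":
response = build_response(["your_balance_is", "five", "dollars"])
```

The weakness the disclosure targets is visible here: every possible fragment must be recorded and stored ahead of time, so storage and authoring effort grow with the application.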

      The principle proposed here to alleviate the conditions
described is to use speech modelling techniques developed for large
vocabulary speaker-dependent speech RECOGNITION to develop
customisable speaker-dependent speech SYNTHESIS models.  The result
of this approach would be to create a speaker-dependent speech
parameter set for a given speaker, which would be used by a speech
output generator to produce sounds "like" the prototype speaker.
The benefits would be: easy adaptation to another speaker, to
provide a variety of speaker types; the possibility of modelling
regional accents and dialects without an almost complete rework of
the underlying system; and possible automatic, or semi-automatic,
extension to other languages, also without requiring a complete
rework of system design and development.
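The idea of running a recognition model "in reverse" can be sketched as below: an HMM trained for recognition is walked generatively, emitting an acoustic parameter from each state's output distribution until the model exits.  The class, parameter values, and single-Gaussian outputs are illustrative assumptions for this sketch, not the disclosure's actual models.

```python
import random

class PhoneHMM:
    """Toy left-to-right HMM for one phone, used generatively.

    In recognition, such a model scores incoming acoustic frames; here
    the same structure is sampled to *produce* frames.  A speaker-
    dependent training pass would set `means` (and, in practice, full
    output distributions) from that speaker's recordings.
    """

    def __init__(self, means, stay_prob, rng):
        self.means = means          # one mean acoustic parameter per state
        self.stay_prob = stay_prob  # self-loop (stay-in-state) probability
        self.rng = rng

    def generate(self):
        """Emit one acoustic frame per time step until the model exits."""
        frames, state = [], 0
        while state < len(self.means):
            # Draw an output frame from this state's Gaussian model;
            # the self-loop lets each state span a variable duration.
            frames.append(self.rng.gauss(self.means[state], 0.1))
            if self.rng.random() >= self.stay_prob:
                state += 1          # advance to the next state
        return frames

rng = random.Random(0)              # fixed seed for a repeatable sketch
hmm = PhoneHMM(means=[1.0, 2.0, 3.0], stay_prob=0.5, rng=rng)
frames = hmm.generate()             # acoustic track for one phone
```

Re-training the models on a different speaker's data would change only the parameter sets, not the generator, which is the source of the adaptability benefits listed above.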

      The method proposed is to...