Techniques for Modifying Prosodic Information in a Text-to-Speech System

IP.com Disclosure Number: IPCOM000114750D
Original Publication Date: 1995-Jan-01
Included in the Prior Art Database: 2005-Mar-29
Document File: 2 page(s) / 34K

Publishing Venue

IBM

Related People

Nishimura, M: AUTHOR

Abstract

Disclosed is a technique for modifying prosodic information in a text-to-speech synthesis system by using a sample of speech. When the generated prosody of the text-to-speech system needs to be modified, it is very difficult to teach the system correct prosody. By analyzing a sample of speech, such prosodic information as phonetic duration, pitch pattern, and stress pattern can be estimated automatically, and these prosodic parameters are used instead of the generated prosody. They are also used to retrain the prosodic models of the text-to-speech synthesis system.


      Phonetic durations are estimated by using phonetic Hidden
Markov Models (HMMs) of the kind used for continuous speech
recognition.  Because the spoken text is known, the corresponding
sequence of phonetic HMMs is aligned with the speech sample by using
the Viterbi algorithm, and each phonetic duration is read off the
resulting alignment.  Pitch patterns are estimated by using a
conventional pitch detector, modified to keep the estimates within
the original speaker's range.  Stress patterns are calculated from
the raw power of each frame.
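
The duration step can be illustrated with a minimal sketch of Viterbi
forced alignment.  It is not the disclosed implementation: it assumes
one HMM state per phone, equal cost for staying in a phone or advancing
to the next one, and a precomputed table of per-frame log-likelihoods.
The function name `forced_align` and the table layout are hypothetical.

```python
import math


def forced_align(frame_loglik):
    """Align a known phone sequence to speech frames with Viterbi.

    frame_loglik[t][j] is the log-likelihood of frame t under the j-th
    phone of the known text (assumed to be supplied by phonetic HMMs).
    Simplifying assumptions: one state per phone, and transition
    probabilities are ignored (stay and advance cost the same).
    Returns the number of frames assigned to each phone.
    """
    n_frames = len(frame_loglik)
    n_phones = len(frame_loglik[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    neg_inf = -math.inf
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]

    # The path must start in the first phone.
    score[0][0] = frame_loglik[0][0]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else neg_inf
            if advance > stay:
                score[t][j] = advance + frame_loglik[t][j]
                advanced[t][j] = True
            else:
                score[t][j] = stay + frame_loglik[t][j]

    # Backtrace from the last phone at the last frame, counting frames.
    durations = [0] * n_phones
    j = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        durations[j] += 1
        if advanced[t][j]:
            j -= 1
    return durations
```

Multiplying each returned frame count by the analysis frame shift (for
example 10 ms, an assumed value) gives the phonetic duration in time
units.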

      When the three sets of prosodic parameters generated by the
text-to-speech synthesis system (phonetic durations, pitch patterns,
and stress patterns) are replaced with those extracted from the
speech sample, the prosody of the synthesized speech becomes very
natural.
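
The pitch, stress, and replacement steps can be sketched as follows.
This is an illustration under stated assumptions, not the disclosed
method: the disclosure does not say how pitch is kept within the
speaker's range, so simple clamping is assumed here; pitch and stress
are reduced to one mean value per phone; and the dictionary layout and
all function names are hypothetical.

```python
def frame_power(samples, frame_len):
    """Raw power (mean squared amplitude) of each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def clamp_pitch(f0_track, lo_hz, hi_hz):
    """Keep voiced pitch values inside [lo_hz, hi_hz].

    Assumption: values <= 0 mark unvoiced frames and are returned as 0.0.
    """
    return [0.0 if f0 <= 0 else min(max(f0, lo_hz), hi_hz) for f0 in f0_track]


def per_phone_mean(track, durations, skip_zero=False):
    """Average a per-frame track over each phone's aligned frames."""
    means, start = [], 0
    for d in durations:
        segment = track[start:start + d]
        if skip_zero:
            segment = [v for v in segment if v > 0]
        means.append(sum(segment) / len(segment) if segment else 0.0)
        start += d
    return means


def replace_prosody(generated, durations, f0_track, power_track, lo_hz, hi_hz):
    """Overwrite generated prosody with values extracted from the sample.

    `generated` is a list of per-phone dicts (hypothetical layout) with
    the keys "duration", "pitch", and "stress".
    """
    pitch = per_phone_mean(
        clamp_pitch(f0_track, lo_hz, hi_hz), durations, skip_zero=True
    )
    stress = per_phone_mean(power_track, durations)
    return [
        {**g, "duration": d, "pitch": p, "stress": s}
        for g, d, p, s in zip(generated, durations, pitch, stress)
    ]
```

Here `durations` is assumed to be the per-phone frame counts obtained
from the alignment, and `f0_track` the output of any conventional pitch
detector with one value per frame.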