Method of Interval Prosody Target Prediction
Original Publication Date: 2002-Oct-21
Included in the Prior Art Database: 2003-Jun-21

    The prosody model is a very important component of Text-to-Speech (TTS) technology and is strongly related to the naturalness of synthetic voices. Because some factors are unknown, and because known and unknown factors interact, it is unreasonable to predict the prosody parameters with precise point estimation algorithms. On the other hand, in current TTS technology, a TTS system given the same input text always generates the same synthesized voice, so it sounds too uniform, whereas a human being never speaks the same sentence in the same way twice. This disclosure describes a method of interval prosody target prediction that lets a TTS system generate different voices for the same input text while remaining perceptually acceptable. This makes a TTS system behave more like a real speaker, and hence sound more natural.


A method to generate interval prosody target predictions instead of precise point approximations of the prosody targets. A method to generate different voices that are still perceptually acceptable, given the same input text, in a TTS system using the interval target predictions.
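The contrast between a point target and an interval target can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the method of this disclosure: the interval is taken as the cluster mean plus or minus a multiple of the standard deviation, a target is drawn uniformly from that interval, and the names `predict_interval`, `sample_target` and the pitch values are hypothetical.

```python
import random
import statistics

def predict_interval(cluster_values, width=1.0):
    """Return (low, high) centred on the cluster mean.

    Assumption: the interval half-width is `width` times the population
    standard deviation of the values observed for this context cluster.
    """
    mean = statistics.mean(cluster_values)
    std = statistics.pstdev(cluster_values)
    return mean - width * std, mean + width * std

def sample_target(interval, rng):
    """Draw one prosody target uniformly from inside the interval."""
    low, high = interval
    return rng.uniform(low, high)

# Hypothetical pitch values (Hz) observed for one context cluster.
pitch_cluster = [118.0, 122.0, 125.0, 120.0, 115.0]
interval = predict_interval(pitch_cluster)

# Each synthesis pass may draw a different target from the same interval,
# whereas a point estimate would always return the cluster mean.
rng = random.Random(0)
targets = [sample_target(interval, rng) for _ in range(3)]
```

Under these assumptions, repeated synthesis of the same text gets slightly different pitch targets, all of which stay inside the range observed for that context.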


In a general TTS system, the prosody model plays an important role in generating natural voices. For a parameter-based speech synthesizer, it provides the prosody parameters (such as pitch, duration and energy values), and the synthesizer uses these parameters to generate the voice for the given text. For a concatenative speech synthesizer, it provides the prosody parameters of the speech segments, and the synthesizer selects each segment for concatenation from the candidates according to these parameters. If the prosody parameters are predicted improperly or incorrectly, the generated voices or the selected segments will have the wrong prosody, and the synthetic speech will sound unnatural.
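The concatenative case can be sketched as follows. This is only an illustration under stated assumptions: the text says that segments are selected by the prosody parameters but does not give a cost function, so the summed relative difference used here, the function name `select_segment` and the candidate values are all hypothetical.

```python
def select_segment(candidates, target):
    """Pick the candidate whose prosody parameters are closest to the target.

    Assumption: closeness is the sum of relative absolute differences over
    the parameters named in `target` (pitch, duration, energy).
    """
    def cost(segment):
        return sum(abs(segment[k] - target[k]) / target[k] for k in target)
    return min(candidates, key=cost)

# Hypothetical candidate segments for one speech unit.
candidates = [
    {"id": "a", "pitch": 110.0, "duration": 0.20, "energy": 60.0},
    {"id": "b", "pitch": 121.0, "duration": 0.18, "energy": 64.0},
    {"id": "c", "pitch": 140.0, "duration": 0.25, "energy": 70.0},
]

# Point target supplied by the prosody model (Hz, seconds, dB).
target = {"pitch": 120.0, "duration": 0.18, "energy": 65.0}

best = select_segment(candidates, target)
```

With these values, candidate "b" is selected because it is closest to the target on all three parameters. A badly predicted target would pull the selection toward a segment with the wrong prosody.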

[Figure: Text -> Text Analysis -> Prosody Model -> Speech Synthesis -> Speech; intermediate labels: Phonetic, Annotated, Parameters]

Fig. 1 The overview of a TTS system

Currently, there are two approaches to generating prosody parameters: rule-based and statistics-based. The former is based on linguistic and phonetic research results and describes general prosody phenomena under certain contexts with certain prosody parameters. The latter automatically clusters similar prosody phenomena under certain contexts from a corpus (real speech) using decision trees, neural networks or similar technologies, and then uses the average values of the prosody parameters in each cluster to represent that class. Both approaches eventually give a precise prediction of the prosody parameters for a given text input, e.g. pitch, duration and energy values. We can interpret the former as a special case of the latter in which the conditions are predefined rather than automatically ge...