
Creating rules automatically for adapting phonetic forms to a speaker in a TTS system

IP.com Disclosure Number: IPCOM000030965D
Original Publication Date: 2004-Sep-02
Included in the Prior Art Database: 2004-Sep-02
Document File: 2 page(s) / 56K

Publishing Venue

IBM

Abstract

Concatenative TTS systems splice together short voice samples extracted from recordings of a real speaker, in order to match a target speaker-independent phonetic transcription derived from the input text. When the target phonetic forms output by the front-end do not match the recorded speaker pronunciations, the output signal is degraded. We use a set of speaker-dependent rules to map the front-end output pronunciations into speaker-adapted ones. The rules are produced from a decision tree trained on the speaker-dependent recorded data.


I. Introduction

Text-To-Speech (TTS) systems convert input text into an output acoustic signal simulating natural speech. Concatenative TTS systems splice together short voice samples extracted from recordings of a real speaker, so that the concatenation corresponds to the phonetic transcription of the input text.

These phonetic transcriptions are generated from the input text by an automatic system, called a Front-End, most often driven by linguistic rules. The Front-End is therefore speaker-independent, and the phonetic forms it produces are identical for all simulated speakers.

This is somewhat inadequate, since it does not take into account speaker-specific ways of pronouncing words (regional accent, speaking style, personal habits). When the phonetic forms output by the speaker-independent Front-End do not match the recorded speaker pronunciations, the output signal has to be rebuilt from non-consecutive recorded samples, which decreases the overall quality of the rendition.

II. Rule-based transformation of phonetic transcriptions

We use a set of speaker-dependent rules to map the Front-End output pronunciations into speaker-adapted ones, thereby reducing the mismatch between the predicted phonetic transcription and the speaker recordings available in the splicing database. The rules are produced from a decision tree trained on speaker-dependent input data. This method derives from the well-known use of decision trees for spelling-to-sound derivation. Such trees are normally built from pairs of word spellings and their corresponding phonetic forms. Here, the decision tree is instead trained on pairs of "Front-End phonetic transcription" versus "speaker realization of the word pronunciation". Relevant mapping rules are then extracted from the most informative branches of the tree.
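
As an illustration of this training step, the minimal sketch below fits a decision tree that predicts the speaker-realized phone from the Front-End phone and its immediate left and right context. It assumes the two transcriptions have already been aligned phone-by-phone; the phone symbols, the "-" deletion marker and the use of scikit-learn are illustrative assumptions, not part of the original system.

# Sketch only: decision-tree training on aligned (Front-End, speaker) phone pairs.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Aligned phone sequences for one hypothetical training sentence.
# "_" is the blank phone marking word boundaries; "-" marks a phone the
# speaker did not realize.
frontend = ["l", "e", "_", "p", "@", "t", "i", "_", "s", "a"]
speaker  = ["l", "e", "_", "p", "-", "t", "i", "_", "s", "a"]

def context_windows(phones, width=1):
    # Build (left context, focus phone, right context) windows, padding the edges.
    padded = ["#"] * width + phones + ["#"] * width
    return [padded[i:i + 2 * width + 1] for i in range(len(phones))]

X_sym = context_windows(frontend)          # categorical features per phone
y = speaker                                # target: phone realized by the speaker

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_sym)

tree = DecisionTreeClassifier(min_samples_leaf=2)
tree.fit(X, y)

# Branches whose leaves predict a phone different from the input focus phone
# are candidate speaker-dependent mapping rules.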

At runtime, the Front-End analyzes the input text and proposes speaker-independent phonetic pronunciations. These are then transformed by the speaker-dependent rule-based system into speaker-adapted pronunciations. Finally, the speaker-adapted transcription is used to retrieve the proper recorded speech segments from the speaker-dependent database.
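
Once the rules have been extracted, the runtime transformation itself can be very simple. The following sketch applies hypothetical context-sensitive rules keyed on (left phone, focus phone, right phone) triples; the rule format, the "-" deletion marker and the example rule are assumptions made only for illustration.

def apply_rules(phones, rules):
    # Rewrite each phone whose (left, focus, right) context in the original
    # Front-End transcription matches a speaker-dependent rule.
    padded = ["#"] + list(phones) + ["#"]
    adapted = []
    for i in range(1, len(padded) - 1):
        context = (padded[i - 1], padded[i], padded[i + 1])
        replacement = rules.get(context, padded[i])
        if replacement != "-":          # "-" means the speaker drops the phone
            adapted.append(replacement)
    return adapted

# Hypothetical rule: this speaker drops the schwa between "p" and "t".
rules = {("p", "@", "t"): "-"}
print(apply_rules(["p", "@", "t", "i"], rules))    # -> ['p', 't', 'i']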

III. Tree training

Each sentence of the script recorded by the speaker gets two phonetic transcriptions: the first is predicted by the TTS Front-End from the sentence text, while the second is the pronunciation actually realized by the speaker during the recording, transcribed manually. In both cases, the whole sentence is treated as one unit, in order to cope with phenomena occurring at word boundaries (such as the French "liaison"), and word boundaries are encoded as a special "blank" phone.
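
A minimal sketch of this sentence-level encoding is given below; the words and phone symbols are illustrative and are not taken from the actual recording script.

def encode_sentence(word_phones):
    # Flatten per-word phone lists into a single sentence-level sequence,
    # inserting the special blank phone "_" at every word boundary.
    sentence = []
    for i, phones in enumerate(word_phones):
        if i > 0:
            sentence.append("_")
        sentence.extend(phones)
    return sentence

# Hypothetical example where the speaker realizes a liaison consonant at a
# word boundary, so the two sentence-level transcriptions differ there.
frontend_words = [["l", "e"], ["a", "m", "i"]]          # Front-End prediction
speaker_words  = [["l", "e", "z"], ["a", "m", "i"]]     # speaker adds a /z/ liaison
print(encode_sentence(frontend_words))    # ['l', 'e', '_', 'a', 'm', 'i']
print(encode_sentence(speaker_words))     # ['l', 'e', 'z', '_', 'a', 'm', 'i']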

The two transcriptions are semi-automatic...