A Two-Stages Concatenation Speech Synthesis Scheme for Low-Tier Cellular Phone based Application

IP.com Disclosure Number: IPCOM000132018D
Original Publication Date: 2005-Nov-29
Included in the Prior Art Database: 2005-Nov-29
Document File: 4 page(s) / 125K

Publishing Venue

Motorola

Related People

Dong-Jian Yue: AUTHOR

Abstract

Although high-quality TTS engines based on concatenative speech synthesis have been developed and applied successfully in many products (such as call center and information inquiry systems), the limited memory and computational power of many embedded products, such as most low-tier phones, hinders their implementation. Taking into account speech quality, memory storage, computational complexity, and reuse of the CELP-based vocoder module (generally resident on the DSP of almost all cellular phones), this paper describes a two-stage concatenation speech synthesis scheme for low-tier phone applications. In the two-stage framework, all back-end TTS processing is divided into two phases, parameter concatenation and waveform synthesis, which are carried out by the MCU (Micro-Controller Unit) and the DSP of the mobile phone, respectively. Furthermore, a novel four-case smooth concatenation method is proposed to smooth the concatenation of AUs (Acoustic Units) efficiently.

This is the abbreviated version, containing approximately 15% of the total text.

1. Introduction

Concatenation-based speech synthesis has become the predominant approach and has gradually matured. For some server-based or baseline TTS systems with a large footprint, it is even hard to distinguish the synthetic speech from natural speech.

In almost all concatenation synthesis systems, many short segments of prerecorded speech from a single speaker (here called Acoustic Units, or AUs) are stored and concatenated to synthesize new utterances. A large inventory of acoustic units and considerable smoothing (or modification) at the joins are therefore generally required to eliminate or reduce audible discontinuities between AUs and to achieve high-quality, natural-sounding synthetic speech. Although memory storage and computational power are no longer a problem on most desktop-class and larger systems, these resources remain severely limited in many embedded products (such as cellular phones, especially low-tier phones), which hinders the implementation and application of high-quality TTS engines.
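To make the idea of concatenating AUs with smoothing at the joins concrete, the following is a minimal sketch in Python. It uses a plain linear cross-fade on raw samples, which is a generic textbook technique and not the four-case smooth concatenation method proposed in this paper. The names `crossfade_concat`, `synthesize`, and `TOY_INVENTORY`, as well as the sample values, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def crossfade_concat(units: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Join sample sequences, linearly cross-fading `overlap` samples at each join.

    Assumption: each unit is a sequence of waveform samples. A real system
    would more likely operate on vocoder parameters than on raw samples.
    """
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)      # weight of the incoming unit, rising toward 1
            idx = len(out) - n + i     # position in the tail of the output so far
            out[idx] = (1.0 - w) * out[idx] + w * unit[i]
        out.extend(unit[n:])           # append the part of the unit that was not blended
    return out


# Hypothetical toy inventory: two "units" of constant samples.
TOY_INVENTORY: Dict[str, List[float]] = {
    "a": [1.0, 1.0, 1.0, 1.0],
    "b": [0.0, 0.0, 0.0, 0.0],
}


def synthesize(labels: Sequence[str],
               inventory: Dict[str, List[float]],
               overlap: int = 2) -> List[float]:
    """Look up each label in the inventory and concatenate with cross-fading."""
    return crossfade_concat([inventory[label] for label in labels], overlap)


if __name__ == "__main__":
    print(synthesize(["a", "b"], TOY_INVENTORY, overlap=2))
```

With `overlap=0` the function reduces to plain concatenation, which leaves an abrupt step at the join. A nonzero overlap replaces that step with a short ramp, at the cost of shortening the output by `overlap` samples per join.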

To implement high-quality TTS engines based on concatenative speech synthesis in most embedded products, the search for a solution must focus on two aspects: footprint compression and computational cost reduction. As is well known, vocoders are resident on most embedded products, and in particular on all mobile phones. A natural choice is therefore to use the vocoder algorithm to compress the speech signals of the TTS acoustic unit inventory into a compact database in a preprocessing step. With the compressed AU inventory, synthetic speech may be generated by first retrieving the AUs and decompressing them with the resident vocoder for concatenation, then applying the general speech modification or smoothing conca...