Original Publication Date: 2008-Apr-23
Included in the Prior Art Database: 2008-Apr-23
Document File: 2 page(s) / 81K

Publishing Venue



This disclosure proposes a method to classify spoken words as correctly or incorrectly pronounced based on the stress pattern of their constituent syllables. Such a classification can be used to (1) evaluate and improve the spoken language skills of an individual, (2) aid Automatic Speech Recognition (ASR) systems, and (3) assist automatic language understanding. The disclosure proposes a paradigm shift: it defines syllable stress evaluation as a phone-level speech recognition task instead of the traditional syllable-level speech analysis task.

This is the abbreviated version, containing approximately 52% of the total text.


Names: Om D Deshmukh, Himanshu Pant, Vivek Tyagi

A spoken word can be thought of as a composition of various phones. One or more adjacent phones are grouped into linguistic units called syllables. The speech production mechanism used to produce these phones, and hence their acoustic manifestation, changes depending on whether the phones appear in a stressed or an unstressed syllable. The disclosed approach uses this information to train an ASR system that can learn the acoustic differences between stressed and unstressed instances of a given phone. The approach assumes that all the phones in a stressed syllable are stressed and all the phones in an unstressed syllable are unstressed. This assumption is made only for convenience in training and evaluation with the ASR system; it can be relaxed or modified with minimal effect on the overall performance of the approach. All previous syllable stress evaluation techniques use ASR systems with a standard phone-set, which does not distinguish between stressed and unstressed manifestations of a given phone.
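The labelling assumption above (every phone inherits the stress of its syllable) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `label_phones`, the "0"/"1" stress markers, and the phone symbols in the example are all hypothetical choices made for this sketch.

```python
from typing import List, Set

# Hypothetical markers for this sketch: "1" = stressed, "0" = unstressed.
STRESSED, UNSTRESSED = "1", "0"


def label_phones(syllables: List[List[str]], stressed: Set[int]) -> List[str]:
    """Propagate syllable-level stress to every phone in each syllable.

    `syllables` is a word given as a list of syllables, each a list of
    phone symbols. `stressed` holds the indices of the stressed syllables.
    """
    for index in stressed:
        if not 0 <= index < len(syllables):
            raise ValueError(f"syllable index {index} out of range")

    labelled = []
    for index, syllable in enumerate(syllables):
        marker = STRESSED if index in stressed else UNSTRESSED
        # Every phone in the syllable receives the same marker.
        labelled.extend(phone + marker for phone in syllable)
    return labelled


# Illustrative three-syllable word with stress on the second syllable.
word = [["b", "ah"], ["n", "ae"], ["n", "ah"]]
print(label_phones(word, {1}))
```

Relaxing the assumption, for example by marking only the vowel of a stressed syllable, would change only the line that chooses `marker`.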

Assume that all the sounds of a given spoken language can be spanned by a total of P phones. The phone-set used in the invented approach then consists of 2*P phones, where each of the P phones has two instances: a stressed instance and an unstressed instance. This phone-set is referred to as the explicit phone-set. A given word has only one correct pronunciation, with the right sequence of phones stressed and all the other phones unstressed; all other pronunciations are incorrect. The training data is collected so that it contains adequate instances of stressed and unstressed occurrences of each phone. Each phone is modeled as a multi-state Hidden Markov Model (HMM) with only left-to-right state transitions. E...
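As a rough illustration of the ideas in this paragraph, the sketch below builds an explicit phone-set of size 2*P, checks a recognized phone sequence against the single correct pronunciation, and constructs a left-to-right transition matrix. Several details are assumptions made for the sketch and are not stated in the disclosure: the "0"/"1" suffix convention, the function names, the exact-match decision rule, the absence of skip transitions, the fixed self-loop probability, and the absorbing final state.

```python
from typing import List, Sequence


def explicit_phone_set(base_phones: Sequence[str]) -> List[str]:
    """Return 2*P symbols: an unstressed ("0") and a stressed ("1")
    instance of each of the P base phones (suffixes are hypothetical)."""
    return [phone + marker for phone in base_phones for marker in ("0", "1")]


def is_correct(recognized: Sequence[str], reference: Sequence[str]) -> bool:
    """Assumed decision rule: the word is correct only if the recognized
    explicit-phone sequence equals the one reference pronunciation."""
    return list(recognized) == list(reference)


def left_to_right_transitions(n_states: int,
                              self_loop: float = 0.5) -> List[List[float]]:
    """Transition matrix with only self-loops and moves to the next state.

    Assumptions of this sketch: no skip transitions, a fixed self-loop
    probability, and an absorbing last state (no explicit exit state).
    """
    if n_states < 1:
        raise ValueError("n_states must be at least 1")
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for state in range(n_states - 1):
        matrix[state][state] = self_loop
        matrix[state][state + 1] = 1.0 - self_loop
    matrix[-1][-1] = 1.0
    return matrix


phones = explicit_phone_set(["b", "ah", "n"])
print(len(phones))  # 2 * P symbols for P = 3 base phones

reference = ["b0", "ah0", "n1", "ae1", "n0", "ah0"]
misplaced = ["b1", "ah1", "n0", "ae0", "n0", "ah0"]
print(is_correct(reference, reference), is_correct(misplaced, reference))

for row in left_to_right_transitions(3):
    print(row)
```

In a trained system the transition probabilities would be estimated from data; the fixed value here only shows which entries of the matrix are allowed to be non-zero.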