Method to build cluster dependent acoustic models for speech recognition
Disclosure Number: IPCOM000210959D
Publication Date: 2011-Sep-19
Document File: 5 page(s) / 42K

Publishing Venue

The Prior Art Database


A method to build cluster-aware acoustic models for non-cluster-aware speech recognizers by applying the principles of speech clustering (separation of speech into broad classes) to the language components of the model (phonetics and language model), retaining some of the performance advantages of cluster-aware speech recognizers.

This is the abbreviated version, containing approximately 44% of the total text.

Method to build cluster dependent acoustic models for speech recognition

One of the most effective technologies in a state-of-the-art speech recognition system is cluster-dependent acoustic modelling of speech. Speech recognizers leverage both acoustic models of speech and a language model; the latter may take the form of a grammar, as in the exemplary architecture:

(Figure: exemplary speech recognition architecture; not reproduced in this text extraction.)

The cluster-dependent acoustic models require an additional two-step approach to speech recognition:

1. First, the speech sample under recognition is classified into a 'cluster' of speakers or another classification of speech (e.g. noisy environment vs. quiet environment).
2. The most suitable acoustic model (according to the broad classification made in step 1) is then leveraged by the recognizer to actually recognize the uttered text.
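
The two steps above can be illustrated with a minimal Python sketch. Every name in it (recognize_cluster_dependent, toy_classifier, toy_recognizers) is hypothetical and not taken from the disclosure; the classifier and recognizers are toy stand-ins, not real engines.

```python
from typing import Callable, Dict, Sequence

# All names below are hypothetical; they are not taken from the disclosure.
Features = Sequence[float]
Recognizer = Callable[[Features], str]


def recognize_cluster_dependent(
    features: Features,
    classify_cluster: Callable[[Features], str],
    recognizers: Dict[str, Recognizer],
) -> str:
    """Step 1: pick a cluster; step 2: decode with that cluster's model."""
    cluster = classify_cluster(features)  # step 1: broad classification
    recognizer = recognizers[cluster]     # select the matching acoustic model
    return recognizer(features)           # step 2: actual recognition


# Toy stand-in for step 1: threshold on the mean feature value (assumption).
def toy_classifier(features: Features) -> str:
    mean = sum(features) / len(features)
    return "noisy" if mean > 0.5 else "quiet"


# Toy stand-ins for step 2: each "recognizer" just reports which model ran.
toy_recognizers: Dict[str, Recognizer] = {
    "noisy": lambda f: "decoded with noisy-environment model",
    "quiet": lambda f: "decoded with quiet-environment model",
}
```

The sketch makes the extra cost visible: the recognizer needs a classifier in front of it and one acoustic model per cluster.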

Similar approaches can also increase recognition accuracy, for example running more than one recognizer in parallel and then comparing the recognition 'scores' (a domain-specific confidence measure for the recognized output).
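
A minimal sketch of the score comparison follows. It assumes a hypothetical interface in which each recognizer returns a (text, score) pair, that a higher score means higher confidence, and that scores are comparable across recognizers; none of this is specified in the disclosure. For brevity the recognizers run one after another rather than in parallel.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: each recognizer returns (text, score), where a
# higher score means higher confidence. Real engines define their own scale.
ScoredRecognizer = Callable[[Sequence[float]], Tuple[str, float]]


def recognize_best_of(
    features: Sequence[float], recognizers: List[ScoredRecognizer]
) -> Tuple[str, float]:
    """Run every recognizer and keep the hypothesis with the highest score."""
    hypotheses = [recognize(features) for recognize in recognizers]
    return max(hypotheses, key=lambda hyp: hyp[1])
```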

The proposed solution is an alternative approach that reproduces the effect of a pseudo cluster-dependent recognition while leveraging both the recognition infrastructure and the build methods of a cluster-independent recognizer.

This alternative approach captures, with lower recognizer complexity, the most evident advantages of multi-cluster decoding (ease of decoding, hence a lower recognition error rate).

A concise introduction to a speech recognition architecture:

Cluster-independent acoustic models are built according to this general procedure:

Speech samples are collected on a computer-readable medium (usually as PCM, Pulse Code Modulation, samples, a time-based representation) by recording a large number (hundreds or thousands) of speakers reading text or talking freely.

Collected speech samples are:

'labeled' (each time frame of speech is associated with a unit of recognition, usually a 'phoneme' of the language) manually, automatically, or by a combination of the two approaches.
'processed' according to a DSP (Digital Signal Processing) algorithm of choice to represent the PCM samples of a time frame in a compact notation called a feature vector.
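
As an illustration of the 'processed' step, the sketch below turns PCM samples into one small feature vector per time frame. The disclosure leaves the DSP algorithm open, so the two features used here (log energy and zero-crossing rate), the function name frame_features, and the default frame length of 160 samples are assumptions chosen for simplicity; production systems typically use richer spectral features.

```python
import math
from typing import List, Sequence


def frame_features(pcm: Sequence[int], frame_len: int = 160) -> List[List[float]]:
    """Split PCM samples into non-overlapping frames and return, per frame,
    a 2-dimensional feature vector: [log energy, zero-crossing rate]."""
    vectors = []
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(s * s for s in frame) / frame_len
        log_energy = math.log(energy + 1.0)  # +1 avoids log(0) on silence
        # Count sign changes between consecutive samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        zcr = crossings / (frame_len - 1)
        vectors.append([log_energy, zcr])
    return vectors
```

Trailing samples that do not fill a whole frame are dropped in this sketch.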

The collected feature vectors and their associated labels make up the training set of the feature space.

The acoustic model is a mathematical representation of the training set of feature vectors; one of the most widely used representations is the multi-dimensional mixture of Gaussians that best models the labeled feature space. Usually each label corresponds to a meaningful sub-unit of speech, such as a sub-phoneme of a certain language.
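
To make the mixture-of-Gaussians representation concrete, the sketch below evaluates the log-likelihood of a feature vector under a Gaussian mixture and picks the best-scoring label. It assumes diagonal covariances and already-estimated parameters; the names gmm_log_likelihood and best_label are hypothetical, and parameter training is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, per-dimension means, per-dimension variances).
# Diagonal covariance is an assumption made here for simplicity.
Component = Tuple[float, Sequence[float], Sequence[float]]


def gmm_log_likelihood(x: Sequence[float], components: List[Component]) -> float:
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    # Log-sum-exp over components, for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def best_label(x: Sequence[float], models: Dict[str, List[Component]]) -> str:
    """Return the label whose GMM gives x the highest log-likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(x, models[label]))
```

In this picture each label (for example a sub-phoneme) owns one mixture, and a frame is assigned to the label whose mixture explains it best.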

Cluster-dependent acoustic models are built according to this general procedure:

The procedure can be considered an extension of the procedure above where:

The feature space is partitioned into 'clusters'; one approach was gender clustering, where the feature space wa...