
Method to disambiguate dictation results using distributed speech recognition engines

IP.com Disclosure Number: IPCOM000029513D
Original Publication Date: 2004-Jul-02
Included in the Prior Art Database: 2004-Jul-02
Document File: 3 page(s) / 39K

Publishing Venue

IBM

Abstract

This invention uses a multimodal browser with C3N (Command and Control & Content Navigation) and a system of distributed speech recognition engines for dictation and grammars; the system creates FSG grammars that query the user to disambiguate the N-best results from dictation.



Large-vocabulary speech recognition systems sometimes produce low-confidence recognition results. Many schemes have been developed for dealing with error correction or ambiguous recognition results. This invention introduces a new model employing simultaneous large-vocabulary and grammar-based speech engines.

This invention describes a system of distributed ASR engines in which the N-best recognition results from a large-vocabulary dictation engine are used to generate grammars for a speaker-independent, grammar-based engine. An application using this system can verbally query the user for the correct result.
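As a rough illustration of the grammar-generation step, the sketch below turns an N-best list into a small finite-state grammar in which the only valid utterances are the alternates themselves. The NBestHypothesis structure and the SRGS-style XML output are assumptions made for this sketch, not details taken from the disclosure.

from dataclasses import dataclass

@dataclass
class NBestHypothesis:
    """One alternate from the dictation engine's N-best list (hypothetical structure)."""
    text: str
    confidence: float  # assumed range 0.0 - 1.0

def build_correction_grammar(alternates: list[NBestHypothesis]) -> str:
    """Build a one-of grammar (SRGS-style XML) whose only valid
    utterances are the N-best alternates, for a grammar-based engine."""
    items = "\n".join(f"      <item>{h.text}</item>" for h in alternates)
    return (
        '<grammar xmlns="http://www.w3.org/2001/06/grammar" root="alts">\n'
        '  <rule id="alts">\n'
        '    <one-of>\n'
        f"{items}\n"
        '    </one-of>\n'
        '  </rule>\n'
        '</grammar>'
    )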

This invention uses a multimodal browser with C3N (Command and Control & Content Navigation) and a system of distributed speech recognition engines for dictation and grammars. The process transcribing the dictated text checks the confidence of the recognized results. When the confidence falls below a certain level, the alternate N-best matches are used to construct a grammar with which the user can be queried to correct the results. This grammar is enabled in a local grammar-based speech engine.
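A minimal sketch of that transcription loop, assuming a hypothetical dictation-engine API that yields per-phrase N-best lists and a local engine with an enable_grammar call, and reusing build_correction_grammar from the sketch above:

CONFIDENCE_THRESHOLD = 0.5  # assumed value; the disclosure does not specify a threshold

def transcribe_with_disambiguation(dictation_engine, local_engine, audio):
    """Transcribe audio; for each low-confidence phrase, enable a grammar
    of its N-best alternates in the local grammar-based engine so the user
    can be queried verbally. Both engine APIs are hypothetical."""
    words, ambiguous = [], []  # running transcript + indices needing correction
    for result in dictation_engine.recognize(audio):  # hypothetical API
        best = result.nbest[0]
        words.append(best.text)
        if best.confidence < CONFIDENCE_THRESHOLD:
            ambiguous.append(len(words) - 1)
            local_engine.enable_grammar(build_correction_grammar(result.nbest))
    return words, ambiguous

The ambiguous indices are what a multimodal front end would use to number and highlight the uncertain words, as described next.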

In a multimodal user interface the ambiguous words are highlighted visually, and grammars are enabled that allow verbal corrections. For example, the user could utter

"I want to travel from Atlanta to New York."

but the place name recognition results have low confidence so text is displayed like this:

I want to travel from 1Atlanta to 2New York.

And the following VoiceXML C3N grammar is enabled in the multimodal browser:

<grammar>
  <rule>
    correct word number <ruleref uri="#digi...
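The grammar is cut off in this abbreviated text. As a hypothetical sketch of the overall shape such a C3N correction grammar could take, the following builds a grammar accepting commands like "correct word number 1 as Newark" from the numbered ambiguous slots; the rule names, the "as <alternate>" phrasing, and the sample alternates are illustrative guesses, not a reconstruction of the truncated original.

def build_c3n_correction_grammar(ambiguous: dict[int, list[str]]) -> str:
    """Build an SRGS-style C3N correction grammar accepting e.g.
    'correct word number 1 as Newark'. Illustrative only."""
    digit_items = "\n".join(f"      <item>{n}</item>" for n in ambiguous)
    alt_items = "\n".join(
        f"      <item>{alt}</item>"
        for alts in ambiguous.values()
        for alt in alts
    )
    return (
        '<grammar xmlns="http://www.w3.org/2001/06/grammar" root="correct">\n'
        '  <rule id="correct">\n'
        '    correct word number <ruleref uri="#digit"/> as <ruleref uri="#alt"/>\n'
        '  </rule>\n'
        '  <rule id="digit">\n'
        '    <one-of>\n'
        f"{digit_items}\n"
        '    </one-of>\n'
        '  </rule>\n'
        '  <rule id="alt">\n'
        '    <one-of>\n'
        f"{alt_items}\n"
        '    </one-of>\n'
        '  </rule>\n'
        '</grammar>'
    )

# Hypothetical alternates for the example above: word 1 (Atlanta) and
# word 2 (New York) each carry their own N-best list.
print(build_c3n_correction_grammar({
    1: ["Atlanta", "Atlantic"],
    2: ["New York", "Newark"],
}))

A fuller version would pair each word number with only that word's alternates; here the digit and alternate rules are kept independent to keep the sketch short.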