
Method to Build Language Model for Customer's Private Text Data

IP.com Disclosure Number: IPCOM000113001D
Original Publication Date: 1994-Jun-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 4 page(s) / 118K

Publishing Venue

IBM

Related People

Bandara, U: AUTHOR [+3]

Abstract

Described is a method which allows the automatic cleaning of any kind of text for the purpose of building trigram language models used in a speech recognizer. For that purpose, isolated trigrams are randomized so that the original text cannot be reconstructed. Subsequently, sentence boundaries are marked in the isolated trigrams using a scheme developed so that the quality of the language model matches that of a model built from the continuous text.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.


      Introduction/State of the Art - The so-called Language Model
(LM) is an essential part of a speech recognizer.  It supplies the
language information that the acoustic channel alone cannot provide
for proper functioning of the system.  The LM contains the
probabilities for all combinations of words in the system's
vocabulary.  These probabilities are computed from a text (called
the training text corpus) that is actually observed in the
application domain (e.g., real medical reports, real correspondence
between insurance companies and their customers, etc.).  Therefore,
in order to build a speech recognition system for a given commercial
application, one has to obtain a sufficiently large text corpus from
that application domain.  Such texts, in their original format,
contain unnecessary and sometimes disturbing strings of characters,
such as format instructions for printers or marks used for archiving
purposes.
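
The trigram probabilities referred to above can be illustrated with
maximum-likelihood estimates computed from trigram and bigram counts.
This is only a sketch; the disclosure does not specify the exact
estimator or any smoothing.

```python
from collections import Counter

def trigram_counts(words):
    """Count all word trigrams in a token sequence."""
    return Counter(zip(words, words[1:], words[2:]))

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2):
    count(w1 w2 w3) / count(w1 w2)."""
    denom = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / denom if denom else 0.0

words = "the patient was examined and the patient was discharged".split()
tri = trigram_counts(words)
bi = Counter(zip(words, words[1:]))
p = trigram_prob(tri, bi, "the", "patient", "was")  # -> 1.0
```

In a real system the counts would come from the full cleaned corpus,
and unseen trigrams would need smoothing rather than the zero
probability returned here.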

      In the process of building a language model, according to the
state of the art, these original texts are analyzed by a trained
person who subsequently writes a program to "clean" the text of
such unnecessary strings of characters.  The cleaning process also
includes the task of marking sentence boundaries, which the LM
treats as normal words when forming trigrams.  Furthermore, the
first word after a boundary should be converted to lower case if it
turns out to be a functional word such as an article (e.g., The >
the).  The job of cleaning a text corpus is not only tedious and
time consuming (costly), but also illegal under federal information
security laws, because the person who does the manual analysis has
to read the text.
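
The boundary-marking and lowercasing steps described above can be
sketched as follows.  The boundary token and the function-word list
are hypothetical; the disclosure does not name them.

```python
import re

# Hypothetical boundary token and function-word list for illustration.
BOUNDARY = "<s>"
FUNCTION_WORDS = {"the", "a", "an"}

def mark_boundaries(text):
    """Insert a boundary token after each sentence and lowercase a
    leading function word such as an article (The -> the)."""
    out = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = sentence.rstrip(".!?").split()
        if tokens and tokens[0].lower() in FUNCTION_WORDS:
            tokens[0] = tokens[0].lower()
        out.extend(tokens + [BOUNDARY])
    return out

tokens = mark_boundaries("The report is ready. A copy was sent.")
# tokens -> ['the', 'report', 'is', 'ready', '<s>',
#            'a', 'copy', 'was', 'sent', '<s>']
```

Treating the boundary token as an ordinary vocabulary word lets the
same trigram counting machinery capture sentence-start and
sentence-end statistics.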

      To overcome these disadvantages, a method is described which
overcomes this barrier without loss of quality in the LM.  The
method has been implemented and applied successfully in real
customer situations to clean private texts and build the LM from
them, without the need for a person to read those texts.

The method consists of three steps:

1.  Pass One: The text is passed through this stage to eliminate any
    character which is not a member of a given alphabet.  (In
    practice, this alphabet hardly varies from customer to
    customer.)

2.  Pass Two: Blank characters are treated as word boundaries and
    word trigrams are isolated.  A set of n trigrams is sh...
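
The continuation of Pass Two is truncated in this abbreviated
version; per the abstract, the isolated trigrams are randomized so
that the original text cannot be reconstructed.  A minimal sketch of
Passes One and Two under that assumption follows.  The alphabet and
the shuffling scheme here are illustrative, not the disclosure's
exact ones.

```python
import random

# Illustrative alphabet; the actual set is customer-specific per the text.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ .,!?'-\n")

def pass_one(text):
    """Pass One: drop every character not in the given alphabet
    (e.g., printer control codes or archive markers)."""
    return "".join(ch for ch in text if ch in ALPHABET)

def pass_two(cleaned_text, seed=None):
    """Pass Two: treat blanks as word boundaries, isolate word
    trigrams, and shuffle them so the running text is lost."""
    words = cleaned_text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    random.Random(seed).shuffle(trigrams)
    return trigrams

cleaned = pass_one("the claim\x0c was settled by the insurer\x07")
trigrams = pass_two(cleaned, seed=0)
```

Because each trigram survives intact while their order is destroyed,
the trigram counts (and hence the LM) are unchanged, but no person
can recover the original running text from the shuffled set.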