Method for Extracting Unknown Words by Using Strings of Character Classes

IP.com Disclosure Number: IPCOM000122933D
Original Publication Date: 1998-Jan-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 4 page(s) / 104K

Publishing Venue

IBM

Related People

Itoh, N: AUTHOR [+2]

Abstract

Disclosed is a method for automatically tokenizing text into words by means of an N-gram model and for extracting candidates for unknown words. In some non-European languages, such as Japanese, text contains no explicit word boundaries. It is therefore necessary to tokenize text into words in order to build a statistical language model. Handling of unknown words is the key to high accuracy in tokenization.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      N-gram-based tokenization can be described as follows.  Let a
sentence be S = c(1) c(2) ... c(n) and let a tokenized text be W =
w(1) w(2) ... w(m), where c(i) and w(j) designate a character and a
word, respectively.  The optimal tokenization W' is given by Equation
1.

From Bayes' law, Equation 2 is obtained.
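
      The equations themselves are not reproduced in this text
version.  A reconstruction of Equations 1 and 2 that is consistent
with the definitions of S, W and W' given here, and is offered only
as an assumption about their original form, would be:

```latex
\begin{align}
W' &= \operatorname*{arg\,max}_{W} \; P(W \mid S) \tag{1} \\
W' &= \operatorname*{arg\,max}_{W} \; \frac{P(S \mid W)\, P(W)}{P(S)} \tag{2}
\end{align}
```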

      P(S) is independent of W, and P(S|W) = 1 when the words of W
concatenate to S and 0 otherwise.  In addition, P(W) is approximated
by the product of word trigram probabilities, namely
P(w(i) | w(i-1)w(i-2)).  The result is shown in Equation 3.
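
      Under the description above, Equation 3 presumably selects the
W that maximizes the product over i of P(w(i) | w(i-1)w(i-2)), taken
over all word sequences that concatenate to S.  The following is a
minimal sketch of that search, not the disclosed implementation.  The
disclosure does not say how the maximum is found, so the memoized
dynamic-programming search is an assumption, as are all names
(tokenize, toy_logprob, UNIGRAM, TRIGRAM), the "<s>" start symbol, the
back-off rule and the toy probabilities.  An unsegmented ASCII string
stands in for Japanese text.

```python
import math
from functools import lru_cache

# Toy model, invented for illustration only (not data from the disclosure).
UNIGRAM = {"the": 0.3, "cat": 0.2, "sat": 0.2, "he": 0.1, "at": 0.1,
           "t": 0.05, "c": 0.05}
TRIGRAM = {("<s>", "the", "cat"): 0.5, ("the", "cat", "sat"): 0.6}
VOCAB = set(UNIGRAM)


def toy_logprob(word, prev1, prev2):
    """log P(word | prev1, prev2); backs off to a scaled unigram (assumption)."""
    p = TRIGRAM.get((prev2, prev1, word), 0.1 * UNIGRAM[word])
    return math.log(p)


def tokenize(sentence, vocab, trigram_logprob, max_len=4):
    """Return (words, score) maximizing the summed trigram log-probabilities.

    Only sequences of vocabulary words that concatenate exactly to
    `sentence` are considered, which plays the role of P(S|W) = 1.
    """
    n = len(sentence)

    @lru_cache(maxsize=None)
    def best(pos, prev2, prev1):
        # Best (score, words) for sentence[pos:], given the two previous words.
        if pos == n:
            return 0.0, ()
        result = (-math.inf, ())
        for end in range(pos + 1, min(n, pos + max_len) + 1):
            word = sentence[pos:end]
            if word not in vocab:
                continue  # unknown words are not handled in this sketch
            tail_score, tail = best(end, prev1, word)
            score = trigram_logprob(word, prev1, prev2) + tail_score
            if score > result[0]:
                result = (score, (word,) + tail)
        return result

    score, words = best(0, "<s>", "<s>")
    return list(words), score


tokens, score = tokenize("thecatsat", VOCAB, toy_logprob)
print(tokens, score)
```

      A sentence that cannot be covered by vocabulary words yields an
empty list with a score of negative infinity in this sketch, which is
one way to see why unknown-word handling matters for tokenization.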

      In order to handle unknown words, patterns, which are strings
of character classes, are created from a base dictionary by replacing
each character with a representative symbol of the class to which the
character belongs.  For example, using a "character type", such as
Kanji, as a class, the characters shown in Fig. 1(a), which mean
"word", are transformed into the pattern "JJ", where "J" designates a
Kanji character.  The pattern "JJ" also represents other words
consisting of two Kanji characters, such as the characters shown in
Fig. 1(b), which mean "sample".  In the same way, a word consisting
of four Katakana characters is transformed into "KKKK", where "K"
denotes a Katakana character.  Other character classifications, for
example those obtained by clustering based on mutual information,
can also be used.  After that, the patterns obtained are classified
in more detail by using key characters, which are selected according
to the measure shown in Equation 4, where M means a pattern and
<c,pos> (or Equation 5) denotes that the character 'c' is located
(or not located) at the position pos in the pattern M.  In other
words, c is the...
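
      The pattern-creation step described above (replacing each
character of a dictionary word with its class symbol) can be sketched
as follows.  This is an illustration under stated assumptions, not
the disclosed implementation: the Unicode block ranges used to detect
character types, the symbols "H" (Hiragana) and "O" (other), the
function names and the sample words are all choices made here.  The
sample words are not necessarily those shown in Fig. 1.  Only "J" and
"K" come from the text.  Key-character selection is not sketched
because Equation 4 is not reproduced in this text version.

```python
from collections import defaultdict


def char_class(ch):
    """Map one character to a class symbol, based on its Unicode block."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (Kanji)
        return "J"
    if 0x30A0 <= code <= 0x30FF:   # Katakana
        return "K"
    if 0x3040 <= code <= 0x309F:   # Hiragana (symbol "H" is an assumption)
        return "H"
    return "O"                     # anything else (symbol "O" is an assumption)


def to_pattern(word):
    """Replace every character of `word` with its class symbol."""
    return "".join(char_class(ch) for ch in word)


def build_patterns(dictionary):
    """Group the words of a base dictionary by their pattern."""
    patterns = defaultdict(list)
    for word in dictionary:
        patterns[to_pattern(word)].append(word)
    return dict(patterns)


# Hypothetical base dictionary, chosen for illustration only.
BASE_DICTIONARY = ["単語", "見本", "テキスト", "食べる"]

for pattern, words in build_patterns(BASE_DICTIONARY).items():
    print(pattern, words)
```

      Grouping words by pattern makes it visible that a single
pattern such as "JJ" stands for every two-Kanji word in the
dictionary, which is why the text goes on to refine patterns with key
characters.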