Method of Segmenting Texts into Words

IP.com Disclosure Number: IPCOM000118239D
Original Publication Date: 1996-Nov-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 4 page(s) / 89K

Publishing Venue

IBM

Related People

Itoh, N: AUTHOR [+2]

Abstract

Disclosed is a method for segmenting a text into words on the basis of human utterance units. In some non-European languages such as Japanese, word boundaries are not stable and grammatical units do not necessarily coincide with human intuition. For accurate segmentation, it is therefore necessary to create a vocabulary set that covers human utterance units.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Fig. 1 shows the proposed method for learning "word boundaries"
in human utterances from relatively small data sets and for
segmenting a large number of texts automatically into "words".  The
learning data consists of a set of paired texts.  One text in each
pair is segmented by a human subject, while the other is segmented
into morphemes (grammatical units) with tags (parts of speech) by the
Japanese Morphological Analyzer (JMA)(*).  For example, let Fig. 2 be
the result of human segmentation, and Fig. 3 be the result of JMA
segmentation for the same text (# and + denote "word boundaries").
In the learning step, the two texts in each pair are compared.  For
each morpheme transition, the number of times the human subject
placed a boundary there is counted (e.g., Fig. 4), and the
word-boundary probability
P(# | PoS(i), Sp(i) --> PoS(i+1), Sp(i+1)) is calculated, where PoS(i)
and Sp(i) denote the part of speech and the character representation
of the i-th morpheme, respectively.
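
      The learning step can be sketched as follows.  This is a minimal
illustration, not the disclosed implementation: it assumes that the
probability is estimated as a relative frequency (boundaries placed by
the human divided by occurrences of the transition), that both
segmentations cover the same character string, and that morphemes are
given as (part of speech, character representation) pairs.  The
function names and the toy sentence are hypothetical and are not taken
from Figs. 2 to 4.

```python
from collections import Counter
from itertools import accumulate


def boundary_offsets(units):
    """Return the character offsets of the internal boundaries of a segmented text."""
    return set(list(accumulate(len(u) for u in units))[:-1])


def learn_boundary_probabilities(pairs):
    """Estimate P(# | PoS(i), Sp(i) --> PoS(i+1), Sp(i+1)) by relative frequency.

    pairs: iterable of (human_words, morphemes), where human_words is a list
    of strings and morphemes is a list of (pos, surface) tuples for the same text.
    """
    seen = Counter()   # how often each morpheme transition occurs
    split = Counter()  # how often the human placed a boundary at it
    for human_words, morphemes in pairs:
        human = boundary_offsets(human_words)
        offset = 0
        for left, right in zip(morphemes, morphemes[1:]):
            offset += len(left[1])  # character position of this transition
            key = (left, right)
            seen[key] += 1
            if offset in human:
                split[key] += 1
    return {key: split[key] / seen[key] for key in seen}


# Toy data (invented for illustration): one human segmentation and one
# morpheme-level segmentation of the same sentence.
pairs = [
    (
        ["私は", "学生です"],
        [("noun", "私"), ("particle", "は"), ("noun", "学生"), ("auxiliary", "です")],
    )
]
probabilities = learn_boundary_probabilities(pairs)
```

      With this toy pair, only the transition from the particle to the
following noun coincides with a human boundary, so it is the only
transition that receives a non-zero estimate.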

      When there is insufficient data for some parameter values,
degenerate parameters (with one or more parameters omitted) are used.
Fig. 5 shows an...
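
      The fallback to degenerate parameters might be sketched as
follows.  The text available here does not say which parameters are
omitted first or how much data counts as sufficient, so the back-off
order (character representations dropped before parts of speech), the
threshold MIN_COUNT, and the default value are all assumptions made
for illustration, as are the function names.

```python
from collections import Counter

MIN_COUNT = 5  # assumed threshold for "sufficient" data


def contexts(left, right):
    """Yield conditioning contexts from most to least specific (assumed order)."""
    (pos_l, sp_l), (pos_r, sp_r) = left, right
    yield (pos_l, sp_l, pos_r, sp_r)   # full parameters
    yield (pos_l, sp_l, pos_r, None)   # right character representation omitted
    yield (pos_l, None, pos_r, sp_r)   # left character representation omitted
    yield (pos_l, None, pos_r, None)   # parts of speech only


def observe(left, right, is_boundary, seen, split):
    """Record one morpheme transition at every level of specificity."""
    for ctx in contexts(left, right):
        seen[ctx] += 1
        if is_boundary:
            split[ctx] += 1


def boundary_probability(left, right, seen, split,
                         min_count=MIN_COUNT, default=0.5):
    """Use the most specific context that has at least min_count observations."""
    for ctx in contexts(left, right):
        if seen[ctx] >= min_count:
            return split[ctx] / seen[ctx]
    return default  # assumed value when no context has enough data


# Toy counts (invented): five noun-to-particle transitions without a boundary.
seen, split = Counter(), Counter()
for noun in ["私", "私", "私", "猫", "猫"]:
    observe(("noun", noun), ("particle", "は"), False, seen, split)

# An unseen noun falls back to a context without the left character representation.
estimate = boundary_probability(("noun", "犬"), ("particle", "は"), seen, split)
```

      Recording every transition at all levels keeps the lookup
simple: the query walks from the full context to the coarsest one and
stops at the first level with enough observations.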