
The automatic detection method for compound words

IP.com Disclosure Number: IPCOM000010374D
Original Publication Date: 2002-Nov-22
Included in the Prior Art Database: 2002-Nov-22
Document File: 2 page(s) / 19K

Publishing Venue

IBM

Abstract

Disclosed is a system that categorizes Japanese words into 'compound words', 'simple words' and 'obscure words'. A program in the system uses some rules for compound words.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 50% of the total text.


Several methods already exist for judging whether a Japanese word is a compound word, but they have the following limitations.

They require a large amount of statistical information, such as word frequencies and the resonance between words. They are also not very accurate, for two main reasons: many words can be judged only by their meaning, and the definition of a compound word is not firmly established.

In contrast, the program in this system does not have these limitations, and it allows users to easily change the definition of compound words. The detailed flow of the system is as follows.

1. To define the standard for compound words, users create a small data file consisting of words and flags. Each flag indicates whether the word is a compound word or not (Fig.1). The more sample words the data file contains, the higher the system's accuracy. The data file usually includes about 1,000 words.

Compound words:
   素粒子物理学 1
   国際人 1
Simple words:

Fig.1 : Sample of the sample dictionary
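
The data file described in step 1 could be read along the following lines. This is only a minimal sketch, not the disclosed implementation: it assumes one entry per line with the word and the flag separated by whitespace, and it assumes that simple words carry the flag 0 (the extract shows only the flag 1 for compound words). The function name `parse_sample_dictionary` and the entry 山 are hypothetical and added purely for illustration.

```python
from typing import Dict


def parse_sample_dictionary(text: str) -> Dict[str, bool]:
    """Parse lines of '<word> <flag>' into a {word: is_compound} mapping.

    Assumption: flag '1' marks a compound word and flag '0' a simple word.
    """
    entries: Dict[str, bool] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ("0", "1"):
            raise ValueError(f"line {line_no}: expected '<word> <flag>'")
        word, flag = parts
        entries[word] = (flag == "1")
    return entries


# The first two entries come from Fig.1; the third is a hypothetical simple word.
SAMPLE = """\
素粒子物理学 1
国際人 1
山 0
"""

sample_dictionary = parse_sample_dictionary(SAMPLE)
```

Keeping the standard in a plain data file like this is what lets users change the definition of a compound word by editing entries rather than the program.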

2. The following are considered features of compound words:
   the word length
   whether the word contains a prefix or suffix
   whether the word contains a proper noun
   whether the word contains an unknown word
   whether the word contains symbol characters
   whether the word consists of Katakana characters
   whether the word consists of alphabet characters
   whether the word is the name of a person
   whether the word is the name of a place
   etc.
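
Some of the features listed above can be computed from the characters of the word alone, as the following sketch illustrates. It is a rough illustration under stated assumptions, not the disclosed program: the affix lists `PREFIXES` and `SUFFIXES` are tiny hypothetical samples, the function names are invented here, and the features that need a lexicon or a morphological analyzer (proper nouns, unknown words, names of persons and places) are deliberately left out.

```python
import unicodedata
from typing import Dict, Union

# Hypothetical, tiny affix lists for illustration only.
PREFIXES = ("非", "超", "再")
SUFFIXES = ("人", "学", "的", "化")


def is_katakana(ch: str) -> bool:
    # Unicode Katakana block U+30A0..U+30FF (includes the long-vowel mark).
    return "\u30a0" <= ch <= "\u30ff"


def is_latin_letter(ch: str) -> bool:
    # ASCII letters and their full-width counterparts.
    return (ch.isascii() and ch.isalpha()) or "Ａ" <= ch <= "Ｚ" or "ａ" <= ch <= "ｚ"


def is_symbol(ch: str) -> bool:
    # Unicode general categories starting with P (punctuation) or S (symbol).
    return unicodedata.category(ch)[0] in ("P", "S")


def extract_features(word: str) -> Dict[str, Union[int, bool]]:
    """Compute the character-level features of a single word."""
    return {
        "length": len(word),
        "has_prefix": len(word) > 1 and word.startswith(PREFIXES),
        "has_suffix": len(word) > 1 and word.endswith(SUFFIXES),
        "has_symbol": any(is_symbol(ch) for ch in word),
        "all_katakana": bool(word) and all(is_katakana(ch) for ch in word),
        "all_alphabet": bool(word) and all(is_latin_letter(ch) for ch in word),
    }


features = extract_features("国際人")
```

Each word is reduced to a small dictionary of feature values, which is one plausible shape for the values the program consumes in the next step.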

The program uses the values of the features above in the...