Method of customizing unknown token in Japanese morphological analysis

IP.com Disclosure Number: IPCOM000019973D
Original Publication Date: 2003-Oct-15
Included in the Prior Art Database: 2003-Oct-15
Document File: 2 page(s) / 21K

Publishing Venue

IBM

Abstract

Provides flexible customization of any numeric token by using regular expression rules in Japanese morphological analysis.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.

Background 

A Japanese morphological analysis breaks Japanese sentences, which are written without spaces or other delimiters, into words, each with a part-of-speech gloss. It uses a word dictionary and Japanese grammatical rules. For example, "2003年7月15日は・・・" is analyzed as follows:

2003    Number (unknown)
年      Suffix-numeric
7       Number (unknown)
月      Suffix-numeric
15      Number (unknown)
日      Suffix-numeric
は      Adposition-particle

Obviously, numbers such as "2003", "7" and "15" are not registered in the dictionary. Fortunately, they can all be assigned the gloss "number" by checking each digit character in the unknown processing logic of a Japanese morphological analysis. On the other hand, numeric tokens (such as date, year, day, time, ordinal, percent, height, depth, length, weight, etc.) cannot be registered in the dictionary either. Unfortunately, the unknown processing logic cannot recognize all kinds of numeric expressions (such as "2003年", "7月", "平成15年7月15日", "11時23分45秒", "3980円" and so on) as numeric words, though some of them can be recognized if specific rules are defined.
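The digit check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosed implementation: the four-entry DICTIONARY, the single-character lookup, and the function name analyze are all hypothetical.

```python
# Hypothetical toy dictionary; a real analyzer uses a full word dictionary
# and grammatical rules for disambiguation.
DICTIONARY = {
    "年": "Suffix-numeric",
    "月": "Suffix-numeric",
    "日": "Suffix-numeric",
    "は": "Adposition-particle",
}


def analyze(text):
    """Return (surface, gloss) pairs using dictionary lookup plus a digit check."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in DICTIONARY:
            # Dictionary lookup succeeded (single characters only in this sketch).
            tokens.append((ch, DICTIONARY[ch]))
            i += 1
        elif ch.isdigit():
            # Unknown processing: collect the whole run of digit characters.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append((text[i:j], "Number (unknown)"))
            i = j
        else:
            # Neither in the dictionary nor a digit.
            tokens.append((ch, "Unknown"))
            i += 1
    return tokens


for surface, gloss in analyze("2003年7月15日は"):
    print(surface, gloss)
```

With this approach "2003" and "年" always come out as two separate tokens, which is the limitation the paragraph above points out.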

Summary of Invention 

To handle any kind of numeric word that cannot be registered in the dictionary, and for which dictionary lookup therefore cannot return a successful match, we used regular expression rules in the unknown processing logic (see figure below).

[Figure: Japanese Morphological Analysis. Labeled components: Lookup, Disambiguation, Unknown Processing, Token candidate list, Word Dictionary, Grammatical Rules, Regular Expression Rules.]
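As a rough illustration of regular expression rules in the unknown processing step, the sketch below uses Python's re module rather than the ICU classes named in this disclosure. The rule table, the gloss names (such as "Numeric-date"), and the function match_unknown are assumptions made for the example. Choosing the longest match is also an assumption, not something the source states.

```python
import re

# Hypothetical, customizable rule table: (gloss, regular expression).
# The gloss names and patterns are illustrative, not taken from the disclosure.
RULES = [
    ("Numeric-date", r"(?:平成|昭和)?\d+年\d+月\d+日"),
    ("Numeric-year", r"(?:平成|昭和)?\d+年"),
    ("Numeric-month", r"\d+月"),
    ("Numeric-time", r"\d+時\d+分\d+秒"),
    ("Numeric-price", r"\d+円"),
    ("Number", r"\d+"),
]
COMPILED = [(gloss, re.compile(pattern)) for gloss, pattern in RULES]


def match_unknown(text, pos, rules=COMPILED):
    """Return (surface, gloss) for the longest rule match at pos, or None."""
    best_end, best_gloss = None, None
    for gloss, pattern in rules:
        m = pattern.match(text, pos)
        # Keep the longest match; on a tie the earlier rule wins.
        if m and (best_end is None or m.end() > best_end):
            best_end, best_gloss = m.end(), gloss
    if best_end is None:
        return None
    return text[pos:best_end], best_gloss


for sample in ["2003年", "7月", "平成15年7月15日", "11時23分45秒", "3980円"]:
    print(match_unknown(sample, 0))
```

Because the rules live in a plain list, a user could add a new kind of numeric token by appending one more (gloss, pattern) pair, without touching the matching logic.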

The regular expression rules are parsed by a parser such as the ICU rule-based break iterator class, and they can be customized. We used the RuleBasedBreakIterator class in ICU4C 2.4 when we implemented this

[This page contains 6...