Method of customizing unknown token in Japanese morphological analysis
Original Publication Date: 2003-Oct-15
Included in the Prior Art Database: 2003-Oct-15
Provides flexible customization of numeric tokens by using regular expression rules in Japanese morphological analysis.
A Japanese morphological analysis breaks Japanese sentences, which contain no spaces or other delimiters, into words, each glossed with a part of speech. It uses a word dictionary and Japanese grammatical rules. For example, "２００３年７月１５日は・・・" is analyzed as follows:
２００３    Number (unknown)
年          Suffix-numeric
７          Number (unknown)
月          Suffix-numeric
１５        Number (unknown)
日          Suffix-numeric
は          ・・・
Obviously, not every number (such as "2003", "7" and "15") can be registered in the dictionary. Fortunately, they can all be labeled as "number" by checking each digit character in the unknown processing logic of a Japanese morphological analysis. On the other hand, numeric tokens in general (dates, years, days, times, ordinals, percentages, heights, depths, lengths, weights, etc.) cannot be registered in the dictionary either. Unfortunately, the unknown processing logic cannot recognize every kind of numeric expression (such as "２００３年", "７月", "平成１５年７月１５日", "１１時２３分４５秒", "３９８０円" and so on) as a numeric word, although some of them can be recognized as numeric words if specific rules are defined.
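The limitation described above can be illustrated with a minimal sketch of digit-based unknown processing. This is not the disclosed implementation: the names `TOY_DICTIONARY`, `is_digit` and `analyze` are hypothetical, the dictionary holds only three entries, and lookup is simplified to single characters. It only shows that a run of digits can be labeled as a number while the surrounding numeric word is still split apart.

```python
import unicodedata

# Hypothetical toy dictionary; a real analyzer would use a full lexicon.
TOY_DICTIONARY = {
    "年": "Suffix-numeric",
    "月": "Suffix-numeric",
    "日": "Suffix-numeric",
}


def is_digit(ch):
    # Unicode category "Nd" covers ASCII digits and fullwidth digits such as "２".
    return unicodedata.category(ch) == "Nd"


def analyze(text):
    """Split text into (surface, gloss) pairs using digit-run unknown processing."""
    tokens = []
    i = 0
    while i < len(text):
        if is_digit(text[i]):
            # Unknown processing: collect the whole run of digit characters.
            j = i
            while j < len(text) and is_digit(text[j]):
                j += 1
            tokens.append((text[i:j], "Number (unknown)"))
            i = j
        else:
            # Simplified single-character dictionary lookup (an assumption).
            gloss = TOY_DICTIONARY.get(text[i], "Unknown")
            tokens.append((text[i], gloss))
            i += 1
    return tokens
```

Calling `analyze("２００３年７月１５日")` yields six separate tokens, so a date such as "２００３年７月１５日" is never returned as a single numeric word by this logic alone.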
Summary of Invention
To handle all kinds of numeric words that cannot be registered in the dictionary, and for which dictionary lookup therefore cannot return a successful match, we use regular expression rules in the unknown processing logic (see figure below).
[Figure: Japanese Morphological Analysis, with components labeled "Token candidate list" and "Regular Expression Rules"]
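As a rough illustration of regular expression rules in the unknown processing logic, the sketch below uses Python's `re` module rather than the ICU rule syntax, so it is a stand-in for the idea and not the disclosed implementation. The rule table `NUMERIC_RULES`, the gloss names and the function `match_numeric_token` are all hypothetical, and choosing the longest match is an assumption made for this example.

```python
import re

# One or more ASCII or fullwidth digits.
DIGITS = r"[0-9０-９]+"

# Hypothetical, customizable rule table: (gloss, pattern).
# When two rules match the same length, the earlier rule wins.
NUMERIC_RULES = [
    ("Date", re.compile(rf"(?:平成)?{DIGITS}年{DIGITS}月{DIGITS}日")),
    ("Time", re.compile(rf"{DIGITS}時{DIGITS}分(?:{DIGITS}秒)?")),
    ("Year", re.compile(rf"(?:平成)?{DIGITS}年")),
    ("Month", re.compile(rf"{DIGITS}月")),
    ("Price", re.compile(rf"{DIGITS}円")),
    ("Number", re.compile(DIGITS)),
]


def match_numeric_token(text, start=0):
    """Return (surface, gloss) for the longest rule match at `start`, or None."""
    best = None
    for gloss, pattern in NUMERIC_RULES:
        m = pattern.match(text, start)
        if m and (best is None or len(m.group()) > len(best[0])):
            best = (m.group(), gloss)
    return best
```

With these rules, `match_numeric_token("３９８０円")` returns the whole string as one token, and adding a new kind of numeric word only requires appending a rule to the table.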
The regular expression rules are parsed by a rule parser such as the ICU rule-based break iterator class, and they are customizable. We used the RuleBasedBreakIterator class in ICU4C 2.4 when implementing this