Browse Prior Art Database

Dictionary for Japanese Morphological Analysis with Regular Expressions

IP.com Disclosure Number: IPCOM000116596D
Original Publication Date: 1995-Oct-01
Included in the Prior Art Database: 2005-Mar-30
Document File: 4 page(s) / 124K

Publishing Venue

IBM

Related People

Itoh, N: AUTHOR [+2]

Abstract

Disclosed is a dictionary for Japanese morphological analysis in which regular expressions as well as (literal) words can be entries. Also described is the use of the dictionary as a descriptor for templates to handle unknown words in morphological analysis.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 42% of the total text.

Dictionary for Japanese Morphological Analysis with Regular Expressions

      Disclosed is a dictionary for Japanese morphological analysis
in which regular expressions as well as (literal) words can be
entries.  Also described is the use of the dictionary as a descriptor
for templates to handle unknown words in morphological analysis.

      The dictionary, which is an enhancement of a trie dictionary
commonly used for Japanese morphological analysis, supports Regular
Expressions (REs) which is supported by UNIX* egrep command.

      A trie dictionary can be represented as a tree where a node
consists of a pair of <c, w> where c is a character and w is a flag
that marks the legal word (wordflag).  Each non-leaf node in the tree
has two branches (pointers) to another node.  When a character in a
string matches the character in a node, the next character in the
word is matched against the character in the node pointed by one of
the pointers (the 'success' pointer).  When the character does not
match the character in a node, the same character is matched against
the character in the node pointed by the other pointer (the 'fail'
pointer).  A string is a legal word when there is a path from the
root node to a node with the wordflag that consists entirely of
success pointers.

      Regular expressions without repetition are easily incorporated
into a trie.  However, an ordinary trie dictionary cannot have
regular expressions with repetition as its entry.  The enhanced trie
dictionary has special nodes which represent the beginning and the
ending of repetition to handle such regular expressions.  Using the
special nodes, a regular expression is registered recursively into
the enhanced trie as follows:
  1.  If there is no repetition in the expression, then the
expression
       is equivalent to a set of literal strings, so it is
incorporated
       into a trie as such a set.
  2.  If there is a repetition R+ (one or more repetition of R where
R
       is an RE) in the expression, then, the expression is
incorporated
       into a trie by the following procedure.
      a.  Transform R into a (sub-) trie.
      b.  Attach a special node for the beginning of repetition (BOR
           node) at the start of the subtrie for R.  BOR has another
           pointer which points to the matching EOR.
      c.  Attach a special node for the end of repetition (EOR node)
at
           the end of the subtrie for R.  EOR has another pointer to
the
           matching BOR.
      d.  Treat the unit (BOR-the subtrie-EOR) as one node and
           incorporate it into the (larger) trie.

      As a result, R and R* are stored on different subtries.  Sort
the tree so that the node for R* is searched before the first node
for the subtrie for R.

      For example, the Figure shows a trie for regular expressions
{(ab|ac)+d, abc}. ...