Method to Distinguish Japanese Words and Foreign Words Written in Katakana Characters

IP.com Disclosure Number: IPCOM000116582D
Original Publication Date: 1995-Oct-01
Included in the Prior Art Database: 2005-Mar-30
Document File: 2 page(s) / 69K

Publishing Venue

IBM

Related People

Ogino, S: AUTHOR

Abstract

Disclosed is a system that uses statistical information to determine whether a word written in Japanese katakana characters is a Japanese word or not. The system includes three main components and their subsidiary parts. An overview of the main components follows: o Data for pre-processing and pre-processing module - The system calculates the probabilities of all substrings that consist of n characters and appear in words in the data for pre-processing, and keeps these probabilities in a database. Each word in the data should have its language category (Japanese or non-Japanese). The data can consist of entries of system dictionaries or morphologically segmented corpora. The pre-processing module has the following four substages.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 53% of the total text.

      Disclosed is a system that uses statistical information to
determine whether a word written in Japanese katakana characters is a
Japanese word or not.  The system includes three main components and
their subsidiary parts.  An overview of the main components follows:
  o  Data for pre-processing and pre-processing module - The system
      calculates the probabilities of all substrings that consist of
      n characters and appear in words in the data for pre-processing,
      and keeps these probabilities in a database.  Each word in the
      data should have its language category (Japanese or
      non-Japanese).  The data can consist of entries of system
      dictionaries or morphologically segmented corpora.  The
      pre-processing module has the following four substages.
     1.  At the first stage, the system reads a word from the data
          for pre-processing and concatenates a head marker and a
          tail marker to the word to make an input string for the
          calculation stage.  For instance, the word "abc" is
          changed to "^abc$" by concatenating a head marker "^" and
          a tail marker "$".
     2.  At the second stage, the system makes all distinct
          substrings that consist of n characters from the input
          string.  Users can define n as a parameter value.  For
          instance, three substrings "^ab", "abc" and "bc$" are made
          from the input string "^abc$" when n is defined as 3.
     3.  At the third stage, the system adds 1 to the number of
          occurrences of each substring made in the second stage,
          and adds 1 to the number of occurrences of each substring
          with its language category.  For instance, the system
          adds 1 to each number of occurrences of "^ab", "abc" and
          "bc$", and also adds 1 to each number of occurrences of
          "^ab":non-Japanese, "abc":non...
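
The three substages shown above can be sketched in Python as follows.
This is a minimal illustration under stated assumptions, not the
disclosed implementation: the function names (add_markers,
distinct_ngrams, count_ngrams) and the in-memory counters standing in
for the database are hypothetical. The sample word "abc" and its
non-Japanese label are the ones used in the text. A substring that
occurs more than once in a single word is counted once for that word,
which is one reading of "all distinct substrings".

```python
from collections import Counter

HEAD_MARKER = "^"  # head marker used in the text's example
TAIL_MARKER = "$"  # tail marker used in the text's example


def add_markers(word):
    """Stage 1: wrap a word with the head and tail markers."""
    return HEAD_MARKER + word + TAIL_MARKER


def distinct_ngrams(marked_word, n):
    """Stage 2: collect the distinct substrings of length n."""
    # A set keeps each substring once per word (assumed reading).
    last_start = len(marked_word) - n
    return {marked_word[i:i + n] for i in range(last_start + 1)}


def count_ngrams(labelled_words, n=3):
    """Stage 3: count each substring overall and per language category.

    labelled_words is an iterable of (word, category) pairs, where
    category is "Japanese" or "non-Japanese".
    """
    total = Counter()        # substring -> number of occurrences
    by_category = Counter()  # (substring, category) -> occurrences
    for word, category in labelled_words:
        for gram in distinct_ngrams(add_markers(word), n):
            total[gram] += 1
            by_category[(gram, category)] += 1
    return total, by_category


# The example from the text: "abc" labelled non-Japanese, n = 3.
total, by_category = count_ngrams([("abc", "non-Japanese")], n=3)
print(sorted(total.items()))
print(sorted(by_category.items()))
```

Keeping two separate counters mirrors the two increments described in
the third stage: one count per substring, and one count per substring
paired with its language category.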