Method for Extracting Domain-Dependent Compound Words

IP.com Disclosure Number: IPCOM000123951D
Original Publication Date: 1999-Aug-01
Included in the Prior Art Database: 2005-Apr-05
Document File: 2 page(s) / 73K

Publishing Venue

IBM

Related People

Nasukawa, T: AUTHOR [+3]

Abstract

Disclosed is a method for automatically extracting from a collection of texts compound words that describe unique concepts in a specific domain.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 54% of the total text.

Method for Extracting Domain-Dependent Compound Words

   Disclosed is a method for automatically extracting from a
collection of texts compound words that describe unique concepts in a
specific domain.

   In order to analyze a large amount of textual data, it is
useful to extract appropriate keywords as units for various types of
statistical analysis.  In addition, since such keywords depend on
the target data, appropriate keywords should be identified by
analyzing the target texts themselves.

   However, most existing methods extract keywords mainly
according to their
  * grammatical features
    (E.g., if a sequence of adjectives and nouns is specified
    as a pattern for a keyword, both "personal computer" and
    "new computer" may be treated as keywords.)
  * co-occurrence within a limited number of words
    (E.g., if words that co-occur frequently within a
    5-word-window are specified as keywords, idiomatic
    expressions such as "take (good) care of" may be treated
    as keywords.)

   Such methods not only miss important keywords but also pick
up too many useless keywords.  In contrast, the disclosed method
uses information on
  * modifier-modifiee relationships
  * distance and frequency of the modifier-modifiee
    relationships
    (Distance: number of words in the modifier-modifiee chain.
    E.g., the distance is 1 if word X directly modifies word Y.
    Frequency: number of occurrences of a modifier-modifiee
    pattern that appeared in the target data.)
to improve the accuracy with which appropriate compound words
are recognized as keywords.
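
   The definitions of distance and frequency given above can be
illustrated with a minimal sketch.  It is not the disclosed
procedure, whose steps are not reproduced here.  It assumes that a
parser has already supplied, for each word, the index of the word
it directly modifies.  The function names chain_patterns and
count_patterns, the heads representation, and the sample sentences
are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def chain_patterns(words, heads):
    """Yield (modifier, modifiee, distance) for each modifier-modifiee chain.

    words: tokens of one sentence.
    heads: heads[i] is the index of the word that words[i] directly
           modifies, or None if words[i] modifies nothing.
           (This input format is an assumption, not part of the disclosure.)
    """
    for i, word in enumerate(words):
        distance = 0
        seen = {i}          # guards against cycles in malformed input
        j = heads[i]
        while j is not None and j not in seen:
            distance += 1   # one more link in the chain
            yield (word, words[j], distance)
            seen.add(j)
            j = heads[j]


def count_patterns(parsed_sentences):
    """Frequency of each (modifier, modifiee, distance) pattern in the data."""
    counts = Counter()
    for words, heads in parsed_sentences:
        counts.update(chain_patterns(words, heads))
    return counts


# Hypothetical, hand-parsed target data.
parsed = [
    (["personal", "computer", "sales"], [1, 2, None]),
    (["new", "personal", "computer"], [2, 2, None]),
    (["personal", "computer"], [1, None]),
]
counts = count_patterns(parsed)
for pattern, freq in counts.most_common():
    print(pattern, freq)
```

   In this toy data "personal" directly modifies "computer"
(distance 1) three times, while "new" does so only once.  How such
counts are turned into a decision about which compound words to
keep is not specified in the text above, so no threshold is
applied here.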

   This method consists of the following steps:
  1.  Apply morphological analysis to the target texts by usi...