Method for Constructing Corpus-Based Thesaurus

IP.com Disclosure Number: IPCOM000114611D
Original Publication Date: 1995-Jan-01
Included in the Prior Art Database: 2005-Mar-29
Document File: 2 page(s) / 35K

Publishing Venue

IBM

Related People

Uramoto, N: AUTHOR

Abstract

Disclosed is a method for locating unknown words in an existing thesaurus using statistical data from large-scale corpora. The input to the method is a word that does not appear in the thesaurus. The output is the part of the thesaurus (a sub-tree) in which the input word should be placed.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 82% of the total text.

Method for Constructing Corpus-Based Thesaurus

      Disclosed is a method for locating unknown words in an existing
thesaurus using statistical data from large-scale corpora.  The input
to the method is a word that does not appear in the thesaurus.  The
output is the part of the thesaurus (a sub-tree) in which the input
word should be placed.

This method consists of the following four parts:
  1.  Co-occurrence data for the unknown word are extracted from the
       corpora.  Example A shows 3-gram data for the input word "bus".
       Each datum is generalized by deleting the unknown word from it.
       The results are called the co-occurrence patterns for the
       unknown word (Example B).
  2.  Co-occurrence data for the words in the thesaurus are extracted
       using the method described in (1), and co-occurrence patterns
       for the words in the thesaurus are created.
  3.  A similarity value is calculated between the co-occurrence
       patterns for the unknown word and the co-occurrence patterns for
       each word in the thesaurus.  A word in the thesaurus is marked
       if the value exceeds a certain threshold.
  4.  Sub-trees are constructed from the marked words.  The sub-tree
       that has the largest number of words is selected as the place
       where the input word should be located.
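
      The four parts above can be sketched in Python as follows.
This is only an illustrative sketch under several assumptions that
are not stated in the disclosure: the similarity measure is taken to
be the Jaccard overlap of the two pattern sets, the thesaurus is
reduced to a hypothetical word-to-parent mapping (parent_of), so that
a "sub-tree" is simply the set of marked words sharing a parent, and
the 3-grams are invented toy data rather than the contents of
Example A.  All function names are likewise hypothetical.

```python
from collections import Counter

def cooccurrence_patterns(trigrams, word, slot="_"):
    """Parts 1-2: take every 3-gram containing `word` and generalize it
    by replacing `word` with a placeholder slot."""
    patterns = Counter()
    for gram in trigrams:
        if word in gram:
            patterns[tuple(slot if w == word else w for w in gram)] += 1
    return patterns

def similarity(p, q):
    """Assumed measure (not from the disclosure): Jaccard overlap of the
    two sets of distinct patterns."""
    a, b = set(p), set(q)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def locate(unknown, trigrams, parent_of, threshold):
    """Parts 3-4: mark thesaurus words whose patterns are similar enough
    to those of `unknown`, then return the parent node covering the
    largest number of marked words (None if nothing is marked)."""
    target = cooccurrence_patterns(trigrams, unknown)
    marked = [
        w for w in parent_of
        if similarity(target, cooccurrence_patterns(trigrams, w)) > threshold
    ]
    votes = Counter(parent_of[w] for w in marked)
    return votes.most_common(1)[0][0] if votes else None

# Toy corpus and thesaurus, invented for illustration only.
trigrams = [
    ("take", "the", "bus"), ("ride", "the", "bus"),
    ("take", "the", "train"), ("ride", "the", "train"),
    ("take", "the", "taxi"),
    ("eat", "the", "apple"), ("peel", "the", "apple"),
]
parent_of = {"train": "vehicle", "taxi": "vehicle", "apple": "fruit"}

result = locate("bus", trigrams, parent_of, threshold=0.4)
print(result)
```

With this toy data, "train" and "taxi" share patterns with "bus"
and are marked, while "apple" shares none, so the node "vehicle" is
selected.  A full implementation would need a real thesaurus tree and
whatever similarity measure the complete disclosure specifies.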

      This method for constructing l...