Word Sense Disambiguation using an Untagged Corpus

IP.com Disclosure Number: IPCOM000110315D
Original Publication Date: 1992-Nov-01
Included in the Prior Art Database: 2005-Mar-25
Document File: 4 page(s) / 186K

Publishing Venue

IBM

Related People

Justeson, JS: AUTHOR [+2]

Abstract

A substantial amount of data is required to derive cues for word sense disambiguation automatically from text databases. The standard approach is to tag words by sense, and then extract cues from their contexts, but this requires a massive amount of manual sense-tagging. This article offers a general word sense disambiguation procedure that eliminates the manual sense-tagging bottleneck.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 40% of the total text.

      A target word is a word with multiple senses that we intend to
disambiguate based on evidence from a large corpus.  Let T be a
target word; let C1,...,Cn be a set of n subcorpora of the total
corpus; and let Si be a sense of T that is much more likely to occur
in subcorpus Ci than in any of the other subcorpora.  Finally, let F
be any feature of a sentence or small text segment containing T,
e.g., a word, a phrase, a syntactic structure, a grammatical pattern,
or a paragraph topic.  F provides evidence concerning the sense of T
if the conditional probability P{sense of T = Si | F is in the
context of T} is significantly different from the prior probability
P{sense of T = Si}; F is an indicator for the sense Si of T if the
conditional probability of Si given F is significantly higher than
the conditional probability of any other sense given F.
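
      The definitions above can be illustrated with a minimal sketch
that estimates the prior and conditional sense probabilities from
counts.  Everything named here is an assumption made for the example:
the function names, the toy data for the target word "bank", and in
particular the fixed margin, which is only a crude stand-in for the
significance test the text calls for.  In the untagged setting the
sense label attached to a segment would come from the subcorpus it
was placed in, not from manual tagging.

```python
from collections import Counter

def sense_probabilities(segments, feature):
    """Estimate P{sense} and P{sense | feature} from labeled segments.

    segments: list of (sense, set_of_features) pairs; here the sense
    label stands for the subcorpus a segment belongs to (assumption).
    """
    prior = Counter(sense for sense, _ in segments)
    with_f = Counter(sense for sense, feats in segments if feature in feats)
    n = len(segments)
    n_f = sum(with_f.values())
    p_prior = {s: c / n for s, c in prior.items()}
    p_cond = {s: (with_f[s] / n_f if n_f else 0.0) for s in prior}
    return p_prior, p_cond

def indicated_sense(p_cond, margin=0.2):
    """Return the sense F indicates, or None.

    The margin is a hypothetical substitute for a real significance
    test between the best and second-best conditional probabilities.
    """
    ranked = sorted(p_cond.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best, p1), (_, p2) = ranked[0], ranked[1]
    return best if p1 - p2 >= margin else None

# Toy data (invented for illustration only).
toy_segments = [
    ("finance", {"loan", "money"}),
    ("finance", {"loan"}),
    ("finance", {"money"}),
    ("shore", {"water"}),
    ("shore", {"water", "loan"}),
]
p_prior, p_cond = sense_probabilities(toy_segments, "loan")
print(p_prior, p_cond, indicated_sense(p_cond))
```

A real implementation would replace the margin with a proper
statistical test and would require a minimum number of occurrences of
F before drawing any conclusion.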

      The proposed algorithm is an iterative application of expanding
sets of cues to sense disambiguation.  Two lists of cues are
maintained: a list of all cues discovered so far (the current list),
and a list of newly discovered cues (the new list).

      The algorithm consists of an initialization step (step 0),
followed by a 3-step iteration:
0.   The current list of cues is initialized, for a target word T, by
defining the criteria for extracting initial subcorpora for a set of
senses S1,...,Sn.
1.   Text segments (typically sentences) that satisfy cues on the
current list are put into two or more separate subcorpora, each
segment being put into the subcorpus that corresponds to the sense
the cues indicate.
2.   Candidate cues not on the current list are tested to determine
whether they discriminate between the current subcorpora.  The new
list comprises those candidates that do discriminate the subcorpora.
3.   If the new list is not empty, return to step 1; otherwise, the
procedure terminates.
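
      The steps above can be sketched roughly as follows.  This is
one possible reading, not the authors' implementation: segments are
reduced to sets of features, cues map a feature to a sense, and the
names bootstrap_cues and simple_discriminates, the max_rounds guard,
and the count and share thresholds are all assumptions introduced for
the example.  The disclosure does not say how segments with
conflicting cues are handled; the sketch simply skips them.

```python
def simple_discriminates(feature, subcorpora, min_count=2, min_share=0.9):
    """Hypothetical step-2 test: the feature must occur at least
    min_count times and almost exclusively in one subcorpus."""
    counts = {sense: sum(feature in feats for feats in segs)
              for sense, segs in subcorpora.items()}
    total = sum(counts.values())
    if total < min_count:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= min_share else None

def bootstrap_cues(segments, seed_cues, discriminates, max_rounds=50):
    """segments: list of feature sets, one per text segment containing T.
    seed_cues: dict mapping feature -> sense (the step-0 criteria).
    discriminates: callable(feature, subcorpora) -> sense or None.
    """
    current = dict(seed_cues)          # the current list of cues
    for _ in range(max_rounds):        # safety bound (assumption)
        # Step 1: put each segment into the subcorpus its cues indicate.
        subcorpora = {}
        for feats in segments:
            senses = {current[f] for f in feats if f in current}
            if len(senses) == 1:       # skip uncued or conflicting segments
                subcorpora.setdefault(senses.pop(), []).append(feats)
        # Step 2: test candidate cues that are not on the current list.
        candidates = set().union(*segments) - set(current)
        new = {}                       # the new list of cues
        for feature in sorted(candidates):
            sense = discriminates(feature, subcorpora)
            if sense is not None:
                new[feature] = sense
        # Step 3: stop when no new cues were found, else iterate.
        if not new:
            break
        current.update(new)
    return current

# Toy data for the target word "bank" (invented for illustration).
toy_segments = [
    {"loan", "interest"}, {"loan", "interest"},
    {"interest", "deposit"}, {"interest", "deposit"},
    {"river", "muddy"}, {"river", "muddy"},
    {"muddy", "erosion"}, {"muddy", "erosion"},
]
seed = {"loan": "finance", "river": "shore"}
cues = bootstrap_cues(toy_segments, seed, simple_discriminates)
print(cues)
```

In this toy run the seed cues first pull in "interest" and "muddy",
and those in turn pull in "deposit" and "erosion", which shows how
the subcorpora and the cue list grow together from one iteration to
the next.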

      Initialization (step 0): Initialization consists of the
selection of a set of criteria that, on a priori grounds or based on
other information, discriminate among senses of a target word T;
discrimination must be much better than chance, but need not be
highly reliable.  Some likely examples include nearby occurrences of
sense-specific words (i.e., words semantically related to specific
senses of the target); translation equivalents for T in translated
text; the overall subject of a work, or the topic of a paragraph; a
title or author; and the date of the work.
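
      As an illustration of the first kind of criterion listed above
(nearby occurrences of sense-specific words), a minimal sketch might
look like this.  The function name seed_sense, the window size, and
the seed word table are hypothetical choices made for the example;
the disclosure does not prescribe them.  In keeping with the text,
such a criterion only needs to be much better than chance, so the
sketch makes no attempt to resolve conflicting evidence.

```python
def seed_sense(tokens, target, seed_words, window=5):
    """Return the sense suggested by seed words near the target word.

    tokens: list of lowercased word tokens for one text segment.
    seed_words: dict mapping a sense-specific word -> sense (assumed).
    Returns None when no seed word is nearby or the evidence conflicts.
    """
    found = set()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), i + window + 1
        # Look at up to `window` tokens on each side of the target.
        for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
            if neighbor in seed_words:
                found.add(seed_words[neighbor])
    return found.pop() if len(found) == 1 else None

# Invented seed table and sentences for the target word "bank".
seed_table = {"loan": "finance", "river": "shore"}
print(seed_sense("she paid the loan at the bank".split(), "bank", seed_table))
print(seed_sense("the bank was closed".split(), "bank", seed_table))
```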

      One well-studied feature that provides such information is...