Method for Inferring Lexical Associations From Textual Co-Occurrences

IP.com Disclosure Number: IPCOM000100960D
Original Publication Date: 1990-Jun-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 2 page(s) / 91K

Publishing Venue

IBM

Related People

Justeson, J: AUTHOR [+2]

Abstract

Disclosed is a new criterion for evaluating lexical associations: a direct probabilistic measure of the likelihood of observed numbers of textual co-occurrences. A standard computational method of establishing associations among n words is to create an association whenever the number of co-occurrences of the n candidates within a corpus segment (sentence, paragraph, chapter, etc.) exceeds a threshold value established by some criterial function. Criterial functions are intended to provide evidence that the number of co-occurrences is beyond what would be expected by chance; current criteria do not guarantee this, while a direct probability measure does.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 51% of the total text.

      Associations between pairs or n-tuples of words are used for a
variety of purposes in natural language processing.  For example,
research on word senses in context indicates that many ambiguous
words can be disambiguated by certain "index" words in the same text
segment; the index is lexically associated with the ambiguous word.
To implement this, an association is posited when the number of
occurrences of the index with the word in the same text segment is
unusually high -- above a value set by some criterial function.
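
      The counting step described above can be sketched as follows.
This is a minimal illustration, not the disclosed implementation: the
function names, the whitespace tokenization, and the fixed integer
`threshold` (standing in for whatever criterial function is used) are
all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(segments):
    """For every unordered word pair, count the segments containing both words.

    Assumption: a segment is a plain string and words are split on
    whitespace after lowercasing (no punctuation handling).
    """
    counts = Counter()
    for segment in segments:
        # Each word counts at most once per segment.
        vocab = sorted(set(segment.lower().split()))
        counts.update(combinations(vocab, 2))
    return counts


def associated_pairs(segments, threshold):
    """Posit an association for each pair whose count exceeds the threshold.

    Assumption: `threshold` is a hypothetical constant; a real criterial
    function would normally depend on the frequencies of the two words.
    """
    counts = cooccurrence_counts(segments)
    return {pair for pair, k in counts.items() if k > threshold}
```

Pairs are stored as alphabetically sorted tuples, so ("bank",
"interest") and ("interest", "bank") are counted as the same pair.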

      Currently, criterial functions normally involve weighted and
scaled ratios of observed to expected numbers of co-occurrences
[1,2].  The intuition behind these criteria is that, if a special
association exists between two words, A and B, then the occurrence of
word A given an occurrence of word B should be more likely than it
would be if word B did not occur, and conversely.  However, any given
ratio of observed to expected co-occurrences corresponds to a wide
range of probabilities that at least as many co-occurrences as
observed would take place under a random distribution of the two
words.  For example, in a corpus of 60,000 sentences -- typically
around 1,000,000 words -- the probability of getting a 3:1 ratio by
chance is about 8% when the expected number of co-occurrences is 1
and the frequencies of the two words are comparable, but is only 1.5
x 10^-7 when the expected number of co-occurrences is 10. Th...
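
      The kind of probability quoted in the example above can be
checked with a short calculation. The sketch below rests on
assumptions that the abbreviated text does not confirm: each word is
counted at most once per sentence, and under a random distribution the
number of sentences containing both words follows a hypergeometric
distribution. The function name `tail_probability` and its parameters
are hypothetical, as are the word frequencies 245 and 775 used in the
note that follows, which were picked so that the expected number of
co-occurrences in 60,000 sentences is about 1 and about 10.

```python
from math import comb


def tail_probability(n_segments, freq_a, freq_b, observed):
    """P(at least `observed` co-occurrences) under random placement.

    Assumed model: word A occurs in `freq_a` of the `n_segments`
    segments, word B in `freq_b`, and B's segments are a uniformly
    random subset, so the overlap is hypergeometric.
    """
    if observed <= 0:
        return 1.0
    upper = min(freq_a, freq_b)
    # Sum the upper tail directly with exact integers, so that very
    # small probabilities are not lost to cancellation.
    numerator = sum(
        comb(freq_a, j) * comb(n_segments - freq_a, freq_b - j)
        for j in range(observed, upper + 1)
    )
    return numerator / comb(n_segments, freq_b)


def expected_cooccurrences(n_segments, freq_a, freq_b):
    """Expected overlap under the same assumed model."""
    return freq_a * freq_b / n_segments
```

Under these assumptions, tail_probability(60000, 245, 245, 3) comes
out at about 0.08, in line with the 8% figure in the text, and
tail_probability(60000, 775, 775, 30) is on the order of 10^-7. Both
cases have the same 3:1 ratio of observed to expected co-occurrences.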