
Learning Optimal Phrase Labels for a Grammar With Respect to A Corpus

IP.com Disclosure Number: IPCOM000120318D
Original Publication Date: 1991-Apr-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 3 page(s) / 109K

Publishing Venue

IBM

Related People

Sharman, RA: AUTHOR

Abstract

Existing grammars use non-terminal labels chosen by humans for intelligibility, tradition, etc. Such classifications are rarely optimal in terms of making just the right decisions needed to parse a corpus. A method of finding an optimum classification is disclosed.


Learning Optimal Phrase Labels for a Grammar With Respect to A Corpus


      What is needed is the set of non-terminal symbols to use for a
Context-Free Phrase-Structure Grammar (CF-PSG) describing a natural
language (NL), such that the grammar has wide coverage, has a low
error rate when parsing, and is not overly expensive in computational
resources.  Linguists normally sidestep this ideal by choosing an
arbitrary set (e.g., NP, VP, PP ...).  Such arbitrary sets owe more
to human readability than to optimal classification.  It is well
known that in the corresponding case of part-of-speech labels (e.g.,
N, V, P ...) the classification is severely non-optimal, because it
has classes of widely differing size and frequency.  The following
method attempts to find an optimal classification.

      METHOD
1.   Collect phrase-types from a corpus of analyzed text (a
treebank).  Assume initially that each non-identical phrase-type is a
unique label.
2.   Calculate the probability of the corpus as the product of the
rule probabilities raised to the rule frequencies:

        Pr(corpus) = \prod_i Pr(rule_i)^{Freq(rule_i)}

3.   For each pair of rules (chosen at random in the worst case, or
more intelligently if possible), recompute the probability of the
corpus as in step (2), treating the combined rules as different
instances of the same rule.
4.   Choose the pairing that causes the least loss of likelihood of
the corpus as the best pairing, and make it permanent in the rule
set.  Repeat from step 3 until no further reductions are possible or
the desired number of categories has been reached.  This occurs when
no loss of likelihood occurs, or when no more rules can be combined.
A sketch of this greedy procedure in code follows below.
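
      As a sketch only: the following Python fragment implements the
greedy loop under one concrete reading of step 2, namely that
Pr(rule) is the rule's relative frequency among all rules sharing its
label, so a phrase-type that is still a class of its own has
probability 1 and every merge of two distinct classes costs some
likelihood.  The name greedy_label_merge and the representation of
phrase-types as tuples of tags are illustrative choices, not part of
the disclosure.

import math
from collections import Counter
from itertools import combinations

def greedy_label_merge(phrase_freq, n_labels):
    # phrase_freq maps a phrase-type (a tuple of tags) to its corpus
    # frequency; each phrase-type starts as its own label (step 1).
    clusters = {i: Counter({p: f})
                for i, (p, f) in enumerate(phrase_freq.items())}

    def class_log_lik(c):
        # Contribution of one label class to log Pr(corpus):
        # sum over its rules of Freq(rule) * log(Freq(rule)/Freq(class)).
        total = sum(c.values())
        return sum(f * math.log(f / total) for f in c.values())

    while len(clusters) > n_labels:
        # Step 3: evaluate every candidate pairing (exhaustively here;
        # the disclosure allows random or smarter pair selection).
        best = None
        for a, b in combinations(clusters, 2):
            merged = clusters[a] + clusters[b]
            loss = (class_log_lik(clusters[a]) + class_log_lik(clusters[b])
                    - class_log_lik(merged))
            if best is None or loss < best[0]:
                best = (loss, a, b)
        if best is None:                  # no rules left to combine
            break
        _, a, b = best
        # Step 4: make the least-costly pairing permanent.
        clusters[a] += clusters.pop(b)
    return clusters

Evaluating all pairs costs O(k^2) likelihood computations per merge;
the disclosure's remark about choosing pairs "more intelligently, if
possible" points at exactly this cost.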

      EXAMPLE
CLUSTERING PHRASES INTO GROUPS TO LEARN PHRASE LABELS

      The method is the following (see the usage sketch after this
list):
1.   Collect phrases from the treebank, using the brackets to denote
a phrase but ignoring the label given.
2.   Use the 265 distinct tags as the words of this corpus; that is,
assume that the treebank tagging of the actual words is good.
3.   Ignore low-frequency phrases.  Here the 1000 most frequent
phrases from 20,000 observed phrases have been selected.
4.   Cluster the 1000 most frequent phrases into 100 phrase groups by
the greedy algorithm, joining the phrases whose combination causes
the least loss of likelihood of the corpus being generated.
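
      As a usage sketch, assuming the greedy_label_merge fragment
above and a hypothetical file treebank.txt holding one bracketed
sentence per line in a simple LABEL(tag tag ...) notation (a real
treebank format would need a proper reader, and only innermost
brackets are caught here):

import re
from collections import Counter

def phrase_types(lines):
    # Steps 1-2: collect innermost bracketed phrases as tag
    # sequences, ignoring the label written before each bracket.
    counts = Counter()
    for line in lines:
        for inner in re.findall(r'\(([^()]+)\)', line):
            counts[tuple(inner.split())] += 1
    return counts

with open('treebank.txt') as f:           # hypothetical input file
    counts = phrase_types(f)
top = dict(counts.most_common(1000))      # step 3: keep frequent phrases
groups = greedy_label_merge(top, 100)     # step 4: 100 phrase groups
for label, members in sorted(groups.items())[:5]:
    print(label, members.most_common(3))  # first few members of a class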

      The result is that some well-known phrase types emerge.  A list
of some of the types discovered is given below.  It is sometimes
difficult to understand the meaning of some of the classes.  The
first few members of the first few classes are given.  It appears to
have discovered certain types of n...