Obtaining a Grammar of Natural Language Which Has Wide Coverage

IP.com Disclosure Number: IPCOM000120657D
Original Publication Date: 1991-May-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 4 page(s) / 110K

Publishing Venue

IBM

Related People

Sharman, RA: AUTHOR

Abstract

Many methods exist for parsing strings according to the grammar of a language. These methods have never succeeded with natural languages because of the lack of a grammar. Only handwritten (incomplete) grammars exist. A method for producing a complete grammar definition is disclosed.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      For every natural language (English, French, etc.) a statement
of the rules of grammar in the form of a context-free phrase-structure
grammar (CF-PSG) is required.  Such grammars have been written by hand
for thirty years, since no method existed for generating or learning
one.  This disclosure describes a method for obtaining such a grammar,
which has been tested and shown to work.  The steps are as follows:
1.   Assume a language is an infinite sequence of words.  A subset of
this infinite sequence is a CORPUS in the language.
2.   Assume that the language is ERGODIC and, therefore, that a corpus
becomes representative of the language as its size grows.  In the
limit, as the corpus size n increases, the corpus approximates the
language ever more closely.
3.   Annotate the corpus to mark the actual phrases used, e.g., noun
phrase, verb phrase, according to some parsing scheme.
4.   Collect all the unique instances of phrase types observed and
their frequencies of occurrence.
5.   Label each unique phrase type with the type of phrase.  The type
of phrase becomes the left-hand side of a rule in a CF-PSG, and the
immediate constituents of the phrase become the right-hand side of
the rule.  The count of occurrences of the rule is used to calculate
the probability of the rule.

      The example shows the process performed on a sample sentence,
and results for a corpus of one million words.
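
      As a separate illustration, steps 3 to 5 can be sketched in
Python.  This is a minimal sketch under stated assumptions, not the
disclosed implementation: the bracketed annotation format, the
constituent labels S, N and V, and the function names are all
hypothetical, and the rule probability is assumed to be the rule count
divided by the total count of rules with the same left-hand side (the
text says only that the count is used to calculate the probability).

```python
from collections import Counter


def parse_bracketed(text):
    """Parse '(S (N Consumers_NN2) (V continued_VVD))' into nested
    (label, children) tuples; leaves are 'word_TAG' strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            label = tokens[pos + 1]
            pos += 2
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume the closing bracket
            return (label, children)
        leaf = tokens[pos]
        pos += 1
        return leaf

    return read()


def symbol(node):
    """Grammar symbol of a node: the phrase label, or the tag of a word."""
    if isinstance(node, tuple):
        return node[0]
    return node.rsplit("_", 1)[1]


def count_rules(tree, counts):
    """Step 4: count each observed phrase type as (lhs, rhs)."""
    if isinstance(tree, tuple):
        label, children = tree
        counts[(label, tuple(symbol(c) for c in children))] += 1
        for child in children:
            count_rules(child, counts)


def estimate_grammar(corpus):
    """Step 5: turn rule counts into probabilities (assumed here to be
    relative frequencies among rules sharing a left-hand side)."""
    counts = Counter()
    for line in corpus:
        count_rules(parse_bracketed(line), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical two-sentence annotated corpus (labels are illustrative).
corpus = [
    "(S (N Consumers_NN2) (V continued_VVD))",
    "(S (N Consumers_NN2) (V spent_VVD heavily_RR))",
]
grammar = estimate_grammar(corpus)
for (lhs, rhs), p in sorted(grammar.items()):
    print(lhs, "->", " ".join(rhs), p)
```

On this toy corpus the rule S -> N V is seen twice and receives
probability 1.0, while V -> VVD and V -> VVD RR are each seen once and
receive probability 0.5.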

      The resulting grammar is an approximation to a true grammar of
the natural language.  It can be used in a probabilistic parser to
yield the most likely parse of a test sentence, under the earlier
assumption that the test data are similar to the training data.
Parsing with a probabilistic grammar is a well-known process; it can
be performed by the CKY algorithm, whose cost is cubic, O(n^3), in the
length of the sentence.
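
      The parsing step can be sketched as a probabilistic (Viterbi)
CKY parser over a sequence of part-of-speech tags.  This is a minimal
sketch, assuming the grammar has already been converted to binary
rules plus rules that rewrite a constituent as a single tag; a grammar
obtained by counting may contain longer right-hand sides and would
need to be binarized first.  The function name, the toy rules and
their probabilities are hypothetical.

```python
import math
from collections import defaultdict


def cky_best_parse(tags, lexical, binary, start="S"):
    """Return (probability, tree) of the most likely parse, or None.

    lexical: {(lhs, tag): prob}      rules A -> tag
    binary:  {(lhs, (b, c)): prob}   rules A -> B C
    """
    n = len(tags)
    # best[(i, j)][A] = (log probability, backpointer) for tags[i:j]
    best = defaultdict(dict)

    # Spans of width 1: apply the rules that rewrite to a single tag.
    for i, tag in enumerate(tags):
        for (lhs, rhs), p in lexical.items():
            if rhs == tag:
                best[(i, i + 1)][lhs] = (math.log(p), tag)

    # Wider spans: try every split point k and every binary rule.
    # The three nested position loops give the O(n^3) behaviour.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for (lhs, (b, c)), p in binary.items():
                    if b in left and c in right:
                        score = math.log(p) + left[b][0] + right[c][0]
                        cell = best[(i, j)]
                        if lhs not in cell or score > cell[lhs][0]:
                            cell[lhs] = (score, (k, b, c))

    if start not in best[(0, n)]:
        return None

    def build(i, j, sym):
        """Follow backpointers to rebuild the best tree."""
        _, back = best[(i, j)][sym]
        if isinstance(back, str):
            return (sym, back)
        k, b, c = back
        return (sym, build(i, k, b), build(k, j, c))

    return math.exp(best[(0, n)][start][0]), build(0, n, start)


# Hypothetical toy grammar; the probabilities are illustrative only.
lexical = {("N", "NN2"): 1.0, ("V", "VVD"): 0.6, ("ADV", "RR"): 1.0}
binary = {("S", ("N", "V")): 1.0, ("V", ("V", "ADV")): 0.4}

print(cky_best_parse(["NN2", "VVD", "RR"], lexical, binary))
```

For the tag sequence NN2 VVD RR the only derivation uses V -> V ADV
(0.4) and V -> VVD (0.6), so the best parse has probability of about
0.24.  A sequence that the toy grammar cannot derive returns None.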

Sample Sentence

      The words in each sentence have been tagged with a
part-of-speech code, using a set of 267 tags.  The sentences have
been parsed into constituents using a set of 64 possible constituent
types.
1.   bare words: Consumers continued to spend heavily for clothes,
household goods and other items in February, reports by the nation's
large retailers indicated Thursday
2.   tagged words:
   Consumers_NN2  continued_VVD  to_TO  spend_VVO  heavily_RR  for_IR
clothes_NN2 ,_,  household_NN1 goo...