Lexical Part of Speech Labeling without a Lexicon for Use in Natural Language Parsing

IP.com Disclosure Number: IPCOM000110129D
Original Publication Date: 1992-Oct-01
Included in the Prior Art Database: 2005-Mar-25
Document File: 3 page(s) / 129K

Publishing Venue

IBM

Related People

Black, EW: AUTHOR [+4]

Abstract

Disclosed is a method for performing lexical categorization without the use of a lexicon, for input to a natural language parser. The method eliminates the considerable labor and the serious lack of portability of lexicon-based labelling, the only existing alternative method for lexical categorization.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 50% of the total text.

      The lexical categorization phase of existing natural language
parsing programs takes the form of "lexical lookup", i.e., a list of
words and their possible parts of speech is consulted for each word
of an input sentence.  But lexical lookup has four significant
drawbacks as a method of lexical categorization:

      (1) The construction of lists (lexicons) of word/label
associations is notoriously expensive in both time and labor.

      (2) Lexicons are by nature seriously incomplete: as finite
lists of words, they cannot accommodate the profusion of technical
terms, neologisms, nonce words, and personal and place names in any
modern language.  A shift in the domain of discourse to be treated by
the parser can easily swamp with novel words a lexicon not
specifically conceived with the new domain in mind.  Even in
"general" or conversational varieties of a modern language,
neologisms crop up daily, and nonce formations particular to each
individual conversation or document are common.

      (3) Lexicons misrepresent the range of appropriate parts of
speech for most words, since a shift of domain, or simply the parsing
of additional data, often adds new possible parts of speech for a
given word and effectively eliminates others.  For example, in news
articles on baseball, the spelling "As" frequently serves as a plural
proper noun naming the Oakland Athletics, and almost never as the
subordinating conjunction used in more formal varieties of English to
mean "because" ("As you already own a copy, I will not give you
another").  In the domain of computer manuals, "on" can serve as an
abstract common noun, as in "Change this setting to on and proceed."

      (4) Applying a lexicon to an accompanying grammar different
from the one for which it was created can be extremely difficult if
the set of parts of speech of the new grammar differs significantly
from that of the old one.
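
      The lexical lookup scheme criticized above can be sketched in
a few lines of Python.  This is only an illustrative sketch, not code
from the disclosure: the toy lexicon, its label names (CONJ, PREP, and
so on), and the function name lexical_lookup are all assumptions made
for the example.  It shows drawback (2), an unknown word receiving no
labels, and drawback (3), a domain-specific reading that the lexicon
never offers.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon (an assumption for illustration only):
# each known word maps to the set of part-of-speech labels it may take.
LEXICON: Dict[str, Set[str]] = {
    "as": {"CONJ", "PREP", "ADV"},
    "on": {"PREP", "ADV"},
    "change": {"VERB", "NOUN"},
    "this": {"DET", "PRON"},
    "setting": {"NOUN", "VERB"},
    "to": {"PREP", "INF"},
    "and": {"CONJ"},
    "proceed": {"VERB"},
}


def lexical_lookup(sentence: List[str]) -> List[Set[str]]:
    """Consult the lexicon for each word of the input sentence.

    Returns one set of candidate labels per word; a word missing
    from the lexicon gets an empty set.
    """
    return [set(LEXICON.get(word.lower(), set())) for word in sentence]


# Drawback (3): in a computer manual, "on" is used as a noun, but the
# lexicon only proposes PREP and ADV, so a parser never sees NOUN.
manual = "Change this setting to on and proceed".split()
labels = lexical_lookup(manual)
on_labels = labels[manual.index("on")]

# Drawback (2): a name absent from the lexicon gets no labels at all.
unknown_labels = lexical_lookup(["Athletics"])[0]
```

      In this sketch the parser downstream would receive no NOUN
candidate for "on" and no candidates at all for "Athletics", however
good its syntactic analysis might be.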

      Together these four limitations of lexicons constitute a
bottleneck in parsing technology: even a parser which conducts
syntactic analysis perfectly, if given the correct range of
part-of-speech labels for the constituent words of a sentence, will
fail if the correct part of speech for an input word is not even
suggested to it.

      The invention below eliminates drawbacks (1) and (4), which
afflict all methods of lexical categorization known to date, namely
those which rely on lexical lookup.  While drawbacks (2) and (3) are
eliminated in
principle, as well, experimentation is needed to produce results
using the invention in order to compare it for overall accura...