Shape Descriptor for Unidentified Words in a Natural Language Processing System

IP.com Disclosure Number: IPCOM000047308D
Original Publication Date: 1983-Oct-01
Included in the Prior Art Database: 2005-Feb-07
Document File: 2 page(s) / 14K

Publishing Venue

IBM

Related People

Damerau, FJ: AUTHOR

This is the abbreviated version, containing approximately 51% of the total text.

This interface enhancement procedure expedites the processing of natural language inquiries to a data base system when the inquiries contain words that are in the data base but not in the system dictionary. It accomplishes this by utilizing the symbolic shapes or patterns of such words to associate them with the columns of the data base in which they are most likely to occur.

Questions contain words that are not in the system dictionary because the columns of the data base which pertain to such words contain too many entries for all of them to be included in the system dictionary. One proposed solution to this problem requires the user to designate, for each unidentified word, the column in the data base to which it pertains. Another proposal requires the user to search all of the columns of the data base to determine the column in which the word occurs.

The present approach differs from the proposals just described and is based upon the observation that in many cases, while the number of different entries in a column may be very large, such entries have a very limited number of characteristic shapes or configurations. For example, telephone numbers have the shape: three numerals followed by a dash or blank and four more numerals. Part numbers may have a shape such as the following: two letters, two numbers, two letters, and two or three numbers. Hence, an interrogating word which has either of these shapes can safely be identified as a telephone number or a part number, as the case may be, without asking the user to classify it or searching the entire data base for a corresponding entry.

In natural language processing systems, dictionaries generally contain more than the part of speech for each word. The exact method of recording additional information varies somewhat, but a very common way is by means of "features" attached to one of the nonterminal nodes. For this reason, the use of shape information to classify unknown words presented to a natural language data base interface may conveniently be discussed in terms of "features", as will be done herein. However, the use of such terminology is not meant to imply that shape analysis is necessarily limited to this particular implementation. Words which are values of a c...
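
As a rough illustration of the shape idea described above, the following Python sketch reduces an unidentified word to a shape string and compares it with the shapes registered for each column. It is only a sketch under stated assumptions: the column names TELEPHONE and PART_NUMBER, the encoding of letters as "A" and numerals as "9", and the helper names shape_of and candidate_columns are hypothetical and do not come from the disclosure. The two shapes are the telephone-number and part-number examples given in the text.

```python
import re


def shape_of(word: str) -> str:
    """Reduce a word to its shape: 'A' for a letter, '9' for a numeral.

    Any other character (dash, blank, ...) is kept unchanged.
    The 'A'/'9' encoding is an assumption made for this sketch.
    """
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in word
    )


# Hypothetical column names mapped to the shapes described in the text,
# written as regular expressions over the shape alphabet.
COLUMN_SHAPES = {
    # three numerals, a dash or blank, four more numerals
    "TELEPHONE": re.compile(r"999[- ]9999"),
    # two letters, two numerals, two letters, two or three numerals
    "PART_NUMBER": re.compile(r"AA99AA999?"),
}


def candidate_columns(word: str) -> list[str]:
    """Return the columns whose registered shape matches the whole word."""
    shape = shape_of(word)
    return [
        column
        for column, pattern in COLUMN_SHAPES.items()
        if pattern.fullmatch(shape)
    ]


if __name__ == "__main__":
    for word in ("555-1234", "AB12CD345", "hello"):
        print(word, shape_of(word), candidate_columns(word))
```

In this sketch a word with no matching shape yields an empty list, in which case a system would presumably have to fall back on one of the other proposals mentioned above, such as asking the user.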