Browse Prior Art Database

Tree-Structured Word Lists for Natural Language Processing

IP.com Disclosure Number: IPCOM000036443D
Original Publication Date: 1989-Sep-01
Included in the Prior Art Database: 2005-Jan-29
Document File: 3 page(s) / 16K

Publishing Venue

IBM

Related People

Maier, M: AUTHOR [+2]

Abstract

In a word list (WL) the essential information, such as words of a specific natural language, is often generated by the application of morphological rules to single lemmas and then it is interpreted as bunches of words having the same stem completed by different suffixes. Such a concept is used for building a tree-structured word list (TWL) containing a significant amount of inflected words instead of lemmas, leaving the application of morphological rules to very special situations (enclitics, etc.). The application of morphological rules to any given word turns out to be excessively time consuming, even if it is an effective strategy from the space saving point of view. Moreover, the crossing of a TWL results to be flexible enough and particularly suitable for spelling correction strategies, i.e.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 46% of the total text.

Page 1 of 3

Tree-Structured Word Lists for Natural Language Processing

In a word list (WL) the essential information, such as words of a specific natural language, is often generated by the application of morphological rules to single lemmas and then it is interpreted as bunches of words having the same stem completed by different suffixes. Such a concept is used for building a tree- structured word list (TWL) containing a significant amount of inflected words instead of lemmas, leaving the application of morphological rules to very special situations (enclitics, etc.). The application of morphological rules to any given word turns out to be excessively time consuming, even if it is an effective strategy from the space saving point of view. Moreover, the crossing of a TWL results to be flexible enough and particularly suitable for spelling correction strategies, i.e., for applications whose goal is to propose existing words as a substitute for any "not- in-the-dictionary" character string found in a text.

The lexical trees, as many as the different initial characters existing in a predefined WL, are organized in two memory levels: the first level consists of a unique slot of contiguous memory, the root of the structure, containing the initial parts of the lexical trees; and the second level consists of a set of slots of contiguous memory, the leaves of the structure, containing the remaining parts of the lexical trees.

The TWLs have monodirectional arcs only, that is, a tree can be entered at its root node and down-navigated, going from any parent node to its children but not vice versa.

The relevant string of a node is given by one or more characters. This is the longest string such that any word which has an initial character string equal to the catenation of the contents of the nodes passed to reach the current one followed by the first character of the relevant string, necessarily has the entire relevant string.

A node in a TWL structure is represented by a string of 32 bits (i.e., 4 bytes).

The first character of the relevant string is contained in bits 0-7. Moreover, a node is marked as "short" or "long" (bit 8 on/off) to indicate if the relevant string consists of one or more characters, and as "local" or "remote" (bit 9 on/off) to indicate if the set of descending nodes is located inside the root or in a leaf. This last distinction is meaningful only for the nodes inside the root. A transit node is a node which constitutes an intermediate or initial node in the path(s) representing one (more) word(s) in the WL (bit 10 on/off). A terminal node is a node which contains the final relevant string of a word, i.e., which ends the path on the tree-structure which represents a word in the WL (bit 11 on/off). A node can be simultaneously "transit" and "terminal". A transit node has at least two children, or it is also a terminal node.

The meaning of bits 12-31 depends upon the four indicator settings. In case of pure terminal (i.e., no-long, n...