Browse Prior Art Database

Methods and apparatus for encoding of explicit morphological knowledge into a full-form lexicons compiled as finite-state automatons for usage in morphological guessing and stemming

IP.com Disclosure Number: IPCOM000011990D
Original Publication Date: 2003-Mar-31
Included in the Prior Art Database: 2003-Mar-31
Document File: 7 page(s) / 86K

Publishing Venue

IBM

Abstract

A program is disclosed that converts an existing Directed Acyclic Word Graph (DAWG) morphological dictionary into a highly efficient morphological guesser without needing access to any linguistic knowledge.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 19% of the total text.

Page 1 of 7

  Methods and apparatus for encoding of explicit morphological knowledge into a full-form lexicons compiled as finite-state automatons for usage in morphological guessing and stemming

A program is disclosed that converts an existing Directed Acyclic Word Graph (DAWG) morphological dictionary into a highly efficient morphological guesser without needing access to any linguistic knowledge.

    Morphosyntactical characteristics of words (such as part-of-speech, plural/singular, ...) in economically significant Indo-European languages can be guessed by postfix pattern. Traditional guessers use explicit linguistic knowledge. In this paper we describe the method how to extract the knowledge from preexisting dictionaries (annotated lists of words), and how to efficiently use finite-state device for guessing morphosyntactic characteristics of out-of-vocabulary words while preserving the correct morphosyntactic information about the words in the dictionary.

    DAWG is efficient finite-state device. Logically it is list of words (surface forms) provided with glosses. The basic idea is to create new DAWG dictionary from already existing DAWG dictionary, by inverting the sequence of characters in each word so that the ending (which is most likely to determine the word type) is examined first. This means that the part of speech and inflection can be determined by examining a minimal set of letters. Furthermore the algorithm which is presented will allow the size of the dictionary to be substantially reduced by pruning multiple nodes which lead to the same gloss and only storing nodes which represent paths for words that do not follow the normal default rules. Our initial trials indicate that 95-97% of the nodes in an unminimized DAWG can be eliminated by this method. Furthermore the algorithm to eliminate unnecessary nodes runs very quickly. Small scale trials have also indicated that it seems to make good guesses for out of vocabulary words.

    This program is conceptually similar to that which is described in the paper by Jan Daciuk entitled "Treatment of Unknown Words", which was published in the Proceedings of the workshop on Implementing Automata WIA'99, Postdam, Germany,
1999. The main difference between this paper and the algorithm described here is: Whereas Daciuk discusses generic Finite State Transducers (FSTs) which can

have non-deterministic transitions and cycles, the algorithm described in this paper only discusses Directed Acyclic Word Graphs (DAWGs). His rules will miss out some annotation possibilities (see his rules 6&7), while

ours will not. However, this is more a tuning factor rather than an inherent advantage of our method. Daciuk does not actually disclose an algorithm, he only lists the rules that the

    algorithm must obey. Many search and categorization systems want to reduce the word in the text to their root or lemma form (e.g. walk, walking and walked would all get treated the same as walk for indexing purposes). This task is com...