Browse Prior Art Database

Splitting a dictionary into two parts so that only nodes likely to be frequently used get pre-loaded.

IP.com Disclosure Number: IPCOM000030471D
Original Publication Date: 2004-Aug-17
Included in the Prior Art Database: 2004-Aug-17
Document File: 2 page(s) / 75K

Publishing Venue

IBM

Abstract

A method is disclosed of only partially loading a finite state dictionary so that the memory required is minimised.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Page 1 of 2

Splitting a dictionary into two parts so that only nodes likely to be frequently used get pre-loaded.

On-Demand Loading of Nodes in a Finite State Dictionary

Disclosed is a method of only partially loading a finite state dictionary so that the memory required is minimised.

Background

    Finite state machines are commonly used in natural language processing applications. In such applications the dictionary is structured as a set of nodes with links between them which are labelled with characters from the alphabet. One of the nodes is identified as the start node , while several are identified as potential end nodes (also known as accepting nodes ) When processing text with such a transducer you process text by following these steps:
1. Start by setting the start node to be the current node.
2. As each character of input text is processed the set of links staring from the current node is examined to see if any are labelled with the current input character. If a matching link is found, the target of the link becomes the new current node.
3. The previous step is repeated until the end of the word is reached or until a character is encountered for which no matching links can be found.
4. If we end up in an accepting node after processing a word, we consider that word valid otherwise we reject the word.

    The benefit of using finite state dictionaries in natural language applications is that the processing algorithm is very simple and hence fast. In addition it is possible to deal with any language, by simply constructing a new finite state machine for that language.

    One disadvantage of finite state dictionaries is that they can be quite large. For example, a finite state dictionary for the English dictionary would typically be several Megabytes in size. Most modern personal computers would have this amount of memory available, but since the memory is typically shared between many applications, loading such a large dictionary into memory can cause significant degradation of other applications. Many novel devices such as mobile phones and PDAs are becoming very popular, these devices could benefit from having NLP dictionaries but they typically don't have enough memory to load an entire finite state dictionary for a real language. Therefore any mechanism which allows only part of a dictionary to be loaded will be very useful.

Observation

    We observed that when processing natural language text, the pattern of which nodes received most visits was far from random. A relatively small number of nodes received virtually all the visits, while other nodes were rarely visited if at all. This means that much of the space allocated for storing the dictionary in main memory is wasted storing nodes which will not get visited.

    It is possible to make some reasonable prediction about which nodes are likely to receive traffic. The definition of the processing algorithm implies that the sta...