Browse Prior Art Database

Method and apparatus for performing decomposition of an input string into the concatenation of non-overlapping non-gapped words or wordformation elements from large electronic dictionary

IP.com Disclosure Number: IPCOM000016103D
Original Publication Date: 2002-Aug-16
Included in the Prior Art Database: 2003-Jun-21
Document File: 3 page(s) / 96K

Publishing Venue

IBM

Abstract

Disclosed is a method of combination of finite-state processing with programming logic in application to specific regular expression, which describes decomposition of the input string into substrings. Run-time matching of this regular expression is done with simultaneous providing of glosses for constituents. Run-time algorithm relies on the dictionary of preprocessed units of the language and uses substring search, which can be efficiently implemented in finite-state electronic dictionaries. While finite state processing ensures high speed of processing, the usage of programming logic gives extensibility of this solution to various linguistic problems. The applications include decomposition of solid compounds in Germanic languages, clitic processing in Romance languages and identification of word boundaries in continuous text. The problem of decomposition of the input string into the concatenation of non-overlapping non-gapped substrings is considered from the point of view of the usage of large finite-state electronic dictionaries in text information retrieval systems, where the speed of processing is critical. Such problems are typically solved by the standard technique of regular expressions. But this standard approach is suitable mainly for the construction of segmentators; to provide constituents with information, stored at the dictionary, will require additional dictionary look-up. Standard approach might also require non-deterministic finite-state processing, which may be not so efficient as deterministic processing. We separate the two key elements of the processing: substring search, and programming logic, which perform functions similar to the backtracking in non-deterministic finite-state processing. Substring search can be done in the finite-state dictionary; one dictionary look-up allows to extract all prefixes of the input string together with their glosses; the dictionary can be implemented as deterministic transducer with the output at final nodes only. Formal description of the decomposition into non-overlapping non-gapped constituents We have four (intersecting) sets of strings: L 0 , L 1 , L 2 , L 3 . For any input string s we need to find its all possible decompositions into one of the following forms, where the plus sign means string’s concatenation: 1. Simple words:

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Page 1 of 3

  Method and apparatus for performing decomposition of an input string into the concatenation of non-overlapping non-gapped words or wordformation elements from large electronic dictionary

Disclosed is a method of combination of finite-state processing with programming logic in application to specific regular expression, which describes decomposition of the input string into substrings. Run-time matching of this regular expression is done with simultaneous providing of glosses for constituents. Run-time algorithm relies on the dictionary of preprocessed units of the language and uses substring search, which can be efficiently implemented in finite-state electronic dictionaries. While finite state processing ensures high speed of processing, the usage of programming logic gives extensibility of this solution to various linguistic problems. The applications include decomposition of solid compounds in Germanic languages, clitic processing in Romance languages and identification of word boundaries in continuous text.

    The problem of decomposition of the input string into the concatenation of non-overlapping non-gapped substrings is considered from the point of view of the usage of large finite-state electronic dictionaries in text information retrieval systems, where the speed of processing is critical. Such problems are typically solved by the standard technique of regular expressions. But this standard approach is suitable mainly for the construction of segmentators; to provide constituents with information, stored at the dictionary, will require additional dictionary look-up. Standard approach might also require non-deterministic finite-state processing, which may be not so efficient as deterministic processing. We separate the two key elements of the processing: substring search, and programming logic, which perform functions similar to the backtracking in non-deterministic finite-state processing. Substring search can be done in the finite-state dictionary; one dictionary look-up allows to extract all prefixes of the input string together with their glosses; the dictionary can be implemented as deterministic transducer with the output at final nodes only. Formal description of the decomposition into non-overlapping non-gapped constituents

    We have four (intersecting) sets of strings: L0 , L1 , L2 , L3 . For any input string s we need to find its all possible d...