
Method for matching terms in rule-based text analysis
Disclosure Number: IPCOM000235832D
Publication Date: 2014-Mar-26
Document File: 5 page(s) / 114K

Publishing Venue

The Prior Art Database


This idea describes a simple formal language for matching sets of terms (simple or multi-word) extracted from a corpus. Expressions in this formal language can be used by linguists or ordinary users in a rule-based text analysis system to provide category descriptions, semantic typing rules, synonymy rules, and other sorts of rules which match terms. Expressions in this language can be compiled into regular expressions.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 30% of the total text.


Method for matching terms in rule-based text analysis


Disclosed is a formal language for simply and conveniently describing sets of related terms in rule-based text analysis. The disclosure shows how expressions in this language can be compiled into regular expressions.

    The expressions consist of a space-separated sequence of one or more elements, where each element can be either

 - a word or part of a word, optionally beginning or ending with a wild card which matches a sequence of zero or more characters, or

 - a wild-card character which matches a sequence of zero or more words.

As an example, consider the following expression in the language:
* air con* *
This expression matches terms such as:
air con, air conditioning, air conditioner, air conditioning fault, faulty air conditioner, fault air conditioning system, …

The expression can be used in a text-analysis system in a rule such as:
* air con* * → air conditioning

which makes any term which matches the expression a synonym of air conditioning. As shown later, such an expression can be used in other sorts of rules.
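
As an illustration, the example expression and synonym rule can be sketched in Python. This is one possible compilation into a regular expression, not necessarily the one used in the disclosure. The names compile_term_expression and parse_synonym_rule are hypothetical, and the sketch assumes that terms are already normalized so that words are separated by single spaces.

```python
import re

WORD_WILDCARD = "*"


def _word_pattern(element: str) -> str:
    """Regex for one word element; a leading or trailing '*' matches zero or more characters."""
    core = element.strip("*")
    if not core:
        raise ValueError(f"not a valid word element: {element!r}")
    prefix = r"\S*" if element.startswith("*") else ""
    suffix = r"\S*" if element.endswith("*") else ""
    # Assumption: a '*' inside the word is not part of the language, so it is escaped.
    return prefix + re.escape(core) + suffix


def compile_term_expression(expression: str) -> re.Pattern:
    """Compile a space-separated term expression into a regular expression."""
    elements = []
    for element in expression.split():
        # Adjacent word wild cards are equivalent to a single one.
        if element == WORD_WILDCARD and elements and elements[-1] == WORD_WILDCARD:
            continue
        elements.append(element)
    if not elements:
        raise ValueError("empty term expression")
    if elements == [WORD_WILDCARD]:
        # A lone word wild card matches any sequence of zero or more words.
        return re.compile(r"(?:\S+(?: \S+)*)?")

    parts = []
    last = len(elements) - 1
    for i, element in enumerate(elements):
        if element == WORD_WILDCARD:
            # Trailing wild card: each extra word is preceded by a space.
            # Leading or inner wild card: each extra word is followed by one.
            parts.append(r"(?: \S+)*" if i == last else r"(?:\S+ )*")
        else:
            parts.append(_word_pattern(element))
            # A separating space is needed whenever another word element follows.
            followed_by_word = i < last and (
                elements[i + 1] != WORD_WILDCARD or i + 1 < last
            )
            if followed_by_word:
                parts.append(" ")
    return re.compile("".join(parts))


def parse_synonym_rule(rule: str):
    """Turn 'expression → target' into a function that rewrites matching terms."""
    expression, target = (side.strip() for side in rule.split("→"))
    pattern = compile_term_expression(expression)

    def apply(term: str) -> str:
        # The whole term must match, not just a substring of it.
        return target if pattern.fullmatch(term) else term

    return apply


to_synonym = parse_synonym_rule("* air con* * → air conditioning")
terms = ["air con", "faulty air conditioner", "hair conditioner", "air filter"]
normalized = [to_synonym(term) for term in terms]
print(normalized)
```

In this sketch the first two terms are rewritten to air conditioning, while hair conditioner and air filter are left unchanged because the whole term has to match the expression.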

I Background

I.1 Context
This idea addresses the area of rule-based text analysis and text mining, more specifically the identification of words and patterns or sequences of words of interest in unstructured text.

I.2 Problem
In text analysis, a text analyzer typically finds terms in continuous text. A term can be a single word (a uniterm) or a sequence of words (multiterm). The list or set of terms found in a document can be matched in various ways against term-based rules; rule matches can be used to assign a document to a category, or to select certain terms from the document, etc. In general such rules may specify individual terms or sets of terms which are to be matched.
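
The simplest form of such a rule, one that lists individual terms explicitly, can be sketched as follows. The categorize function, the category names and the sample terms are all hypothetical and serve only to illustrate how terms found in a document might be matched against term-based rules.

```python
def categorize(document_terms, category_rules):
    """Return the sorted names of the categories whose rule terms occur in the document."""
    found = set(document_terms)
    return sorted(
        name for name, rule_terms in category_rules.items() if found & rule_terms
    )


# Hypothetical rules: each category is described by an explicit set of terms.
category_rules = {
    "climate control": {"air conditioning", "heater"},
    "brakes": {"brake pad", "brake fluid"},
}

# Terms assumed to have been extracted from one document by the text analyzer.
document_terms = ["air conditioning", "noise", "brake pad"]

categories = categorize(document_terms, category_rules)
print(categories)
```

Listing every term explicitly does not scale to open-ended term sets, which is the limitation the problem statement goes on to describe.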

    Given that a single set of rules is not suited to all subject areas and use cases, rule-based text analysis tools typically allow end users to modify and adapt existing rules and/or to create new rules. In such contexts, the problem is to provide a compact, intuitive and flexible way of specifying (possibly infinite) term subsets for use within rules.

I.3 Related art
In many domains which involve specifying strings of characters, wild-card symbols are used to match sequences of characters. Typically the symbol '?' is used to match 0 or 1 character, and '*' to match 0 or more characters. Character wild cards can be used as a way of specifying term subsets. For example, the expression cost* can be used to match terms such as cost, costs, costing, costings, etc.
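
A minimal sketch of this character-level convention in Python is given below. The helper name wildcard_to_regex is hypothetical. The translation follows the reading given above, where '?' matches 0 or 1 character; this differs from Python's standard fnmatch module, in which '?' matches exactly one character.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate character wild cards into a regular expression.

    '?' matches 0 or 1 character and '*' matches 0 or more characters;
    every other character is matched literally.
    """
    pieces = []
    for ch in pattern:
        if ch == "?":
            pieces.append(".?")
        elif ch == "*":
            pieces.append(".*")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))


candidates = ["cost", "costs", "costing", "costings", "coast"]

star = wildcard_to_regex("cost*")
star_matches = [t for t in candidates if star.fullmatch(t)]

question = wildcard_to_regex("cost?")
question_matches = [t for t in candidates if question.fullmatch(t)]

print(star_matches)
print(question_matches)
```

Here cost* selects cost, costs, costing and costings, whereas cost? selects only cost and costs.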

    The present proposal provides a practical means of creating term-matching expressions that include not only wild cards over characters but also wild cards over


Any term expression constructed as described in this disclosure can be translated into an exactly equivalent character-based regular expression, i.e. a regular expression that matches the same set of terms. However, this equivalent form is much more com...