
A key term acquisition method that combines a multi-iterative statistical mechanism, lexicon knowledge, and machine learning

IP.com Disclosure Number: IPCOM000006430D
Publication Date: 2002-Jan-02
Document File: 2 page(s) / 51K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a key term acquisition method that combines a multi-iterative statistical mechanism, lexicon knowledge, and machine learning. Benefits include improved information retrieval on a real-time basis without the requirement for an off-line dictionary or segmentation.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 50% of the total text.


              The algorithm presented in the invention consists of three modules: the Key Term Primitive Selection module, the Multi-iteration Control module, and the Amendment module.

              The term "link" denotes the linkage between two consecutive English words or Chinese characters. The number of links in a sentence is one less than the length of the sentence, measured in words or characters.
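              The link definition above can be illustrated with a minimal Python sketch. The function name links and the sample sentences are hypothetical and are not part of the disclosure; English text is assumed to be split on whitespace and Chinese text into single characters.

```python
def links(units):
    """Return the links of a sentence: every pair of consecutive units."""
    return list(zip(units, units[1:]))


# English units are words; Chinese units are single characters.
english = "key term acquisition method".split()
chinese = list("关键词提取")

print(len(english), len(links(english)))  # 4 units, 3 links
print(len(chinese), len(links(chinese)))  # 5 units, 4 links
```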

              The Key Term Primitive Selection module uses the extend average link strength method to pick out the primitive key term set. Extend average link strength is a statistical value derived from the documents currently being processed. The algorithm constructs a suffix tree whose units are Chinese characters or English words and executes the following procedure (see Figure 1):

1.           All units’ occurrences are counted.

2.           All mutual information is computed and stored in the tree nodes.
3.           The algorithm computes the mutual information of all unit segments.
              a.           The sum of the mutual information is divided by the number of units and multiplied by
                            a function factor to determine the extend average link strength. The unit segment with the
                            highest link strength is the most likely key term in the set (see Figure 2). In Figure 2, the
                            Control Factor Line marks the key term boundaries. The points above the line that have a
                            length greater than 1 belong to the key term sets.
              b.           The algorithm sets a threshold to eliminate the least likely terms and keeps a key
                            term set consisting of one or several key terms per sentence, if any exist.
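              Steps 1 through 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the disclosed implementation: the disclosure does not give the mutual information formula or the function factor, so the sketch assumes pointwise mutual information between adjacent units and a hypothetical factor parameter that defaults to a constant 1.0. It also replaces the suffix tree with plain counters for brevity, and all function names are invented for the example.

```python
import math
from collections import Counter


def link_mutual_information(sentences):
    """Steps 1-2: count units and adjacent pairs, then compute PMI per link.

    Assumption: pointwise mutual information; the source names no formula.
    """
    unit_counts = Counter()
    pair_counts = Counter()
    for units in sentences:
        unit_counts.update(units)
        pair_counts.update(zip(units, units[1:]))
    total_units = sum(unit_counts.values())
    total_pairs = sum(pair_counts.values())
    mi = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        p_a = unit_counts[a] / total_units
        p_b = unit_counts[b] / total_units
        mi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return mi


def average_link_strength(segment, mi, factor=lambda length: 1.0):
    """Step 3a: sum of link MI / number of units * factor(length).

    `factor` stands in for the unspecified function factor (hypothetical).
    """
    total = sum(mi.get(link, 0.0) for link in zip(segment, segment[1:]))
    return total / len(segment) * factor(len(segment))


def candidate_terms(units, mi, threshold, max_len=4):
    """Step 3b: keep segments longer than one unit that score above threshold."""
    scored = []
    for start in range(len(units)):
        last_end = min(start + max_len, len(units))
        for end in range(start + 2, last_end + 1):
            segment = tuple(units[start:end])
            score = average_link_strength(segment, mi)
            if score > threshold:
                scored.append((score, segment))
    scored.sort(reverse=True)  # most likely key term first
    return scored


corpus = [
    "new york is big".split(),
    "i like new york".split(),
    "it is big here".split(),
]
mi = link_mutual_information(corpus)
for score, segment in candidate_terms(corpus[0], mi, threshold=1.0):
    print(round(score, 3), " ".join(segment))
```

              With a constant factor, longer segments can outscore their parts simply by accumulating links, which is presumably why the disclosed method scales the average by a length-dependent function factor; the threshold argument plays the role of the Control Factor Line.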

4.           The Multi-iteration Control module uses repeated processing to determine the most relevant key
              terms.

              a.           The module adopts the rationale that the key terms in the primitive set include extraneous
                            terms and that determining their boundaries is difficult. Adding th...