Method for selecting features based on the subjective Bayes approach

IP.com Disclosure Number: IPCOM000028040D
Publication Date: 2004-Apr-21
Document File: 4 page(s) / 158K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for selecting features based on the subjective Bayes approach for computing the uncertainty of knowledge. Benefits include improved accuracy in information retrieval and pattern recognition tasks.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 71% of the total text.

         The disclosed method is a feature selection algorithm for information retrieval and text categorization. It uses the subjective Bayes approach to compute the uncertainty of knowledge, which expresses the relation between a text feature and a text label. Although the algorithm was originally designed for Chinese text, it is also applicable to other languages.

         A method for performing text categorization is contained in Figure 1.

         A method for performing information retrieval is contained in Figure 2.

         A block diagram illustrating the disclosed method is contained in Figure 3.

         Automatic feature selection methods include the removal of non-informative words according to corpus statistics and the construction of new features that combine lower-level features into higher-level orthogonal dimensions. The disclosed method combines feature selection with knowledge uncertainty reasoning and rests on an explicit theoretical foundation. Furthermore, after the proof and proposition are redefined and the necessary sample data are obtained, the method can be used in many pattern recognition tasks other than text processing.

         Conventional feature selection methods include:

•         Document frequency

•         Information gain

•         Mutual information

•         Chi-square

•         Correlation coefficient

•         Relevancy score

•         Odds ratio

•         Simplified Chi-square

         The disclosed method differs from these methods in its theoretical basis. Proof and proposition are defined according to the practical task; the subjective Bayes approach is then used to reason about knowledge uncertainty and to obtain the sufficiency factor of each text feature.

         The method can serve as a general feature selection approach in pattern recognition: once the proof and proposition are changed, it can be applied to other recognition tasks.

Text categorization:

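         Proof E:               The word occurs in the document

         Proposition H:         The document has the category label

         Proposition ~H:        The document does not have the category label

         According to Bayes' rule,

         $$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} \qquad (1)$$

         $$P({\sim}H \mid E) = \frac{P(E \mid {\sim}H)\,P({\sim}H)}{P(E)} \qquad (2)$$

         Dividing (1) by (2) gives

         $$\frac{P(H \mid E)}{P({\sim}H \mid E)} = \frac{P(E \mid H)}{P(E \mid {\sim}H)} \cdot \frac{P(H)}{P({\sim}H)} \qquad (3)$$

         Here, $P({\sim}H) = 1 - P(H)$, so the odds of a proposition are

         $$O(H) = \frac{P(H)}{P({\sim}H)} = \frac{P(H)}{1 - P(H)} \qquad (4)$$

         So

         $$O(H \mid E) = \frac{P(E \mid H)}{P(E \mid {\sim}H)} \cdot O(H) \qquad (5)$$

         Equation (5) shows that the odds of H change with proof E.

         Let

         $$LS = \frac{P(E \mid H)}{P(E \mid {\sim}H)} \qquad (6)$$

         LS is the sufficiency factor. It expresses the degree to which proof E affects proposition H, and it is the uncertainty of knowledge. When LS = 1, O(H | E) = O(H), so proof E has no effect on H; that is, the occurrence of the word in the document does not affect docume...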
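
         The following is a minimal Python sketch of how a sufficiency factor might be estimated per word from labeled documents. It assumes the factor is the likelihood ratio P(word occurs | category) / P(word occurs | not category), as in the standard subjective Bayes formulation. The function names, the add-alpha smoothing, the toy corpus, and the descending ranking are all assumptions made for illustration; the abbreviated text does not specify an estimator or a selection rule.

```python
from collections import Counter


def sufficiency_factors(docs, labels, category, alpha=1.0):
    """Estimate LS = P(E | H) / P(E | ~H) for every word.

    E: the word occurs in a document; H: the document carries `category`.
    `alpha` is an add-alpha smoothing constant (an assumption of this sketch,
    not part of the disclosure) that keeps the ratio finite for rare words.
    """
    pos_df, neg_df = Counter(), Counter()  # document frequencies per side
    n_pos = n_neg = 0
    for words, label in zip(docs, labels):
        if label == category:
            n_pos += 1
            pos_df.update(set(words))  # count each word once per document
        else:
            n_neg += 1
            neg_df.update(set(words))
    ls = {}
    for w in set(pos_df) | set(neg_df):
        p_e_h = (pos_df[w] + alpha) / (n_pos + 2 * alpha)
        p_e_not_h = (neg_df[w] + alpha) / (n_neg + 2 * alpha)
        ls[w] = p_e_h / p_e_not_h
    return ls


def posterior(prior, ls):
    """Update P(H) to P(H | E) through the odds form O(H | E) = LS * O(H)."""
    odds = ls * prior / (1.0 - prior)
    return odds / (1.0 + odds)


# Toy corpus: tokenized documents with one label each (made-up data).
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "bank", "rate"],
]
labels = ["sports", "sports", "finance", "finance"]

ls = sufficiency_factors(docs, labels, "sports")
# One plausible selection rule (an assumption): keep the words with largest LS.
ranked = sorted(ls, key=ls.get, reverse=True)
print(ranked[:3], posterior(0.5, ls["goal"]))
```

         In this toy corpus, a word that appears in both categories ("team") receives a factor closer to 1 than a word confined to one category ("goal"), and a factor of exactly 1 leaves the prior probability unchanged.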