
Feature Selection for Classification Using Stochastic Complexity

IP.com Disclosure Number: IPCOM000119264D
Original Publication Date: 1991-Jan-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 2 page(s) / 86K

Publishing Venue

IBM

Related People

Dom, BE: AUTHOR [+3]

Abstract

Statistical pattern recognition is often concerned with classifying an object based on a set of features describing the object. In cases where the number of features is large, it is desirable to select some subset of the features to perform the classification. This reduces computation and may also improve the classification error rate. This article describes a technique for selecting the optimal feature subset for classification based on the encoding length of training data comprising a set of feature vectors X along with their corresponding class assignments c, using an appropriate probability model for both the useful and the useless features, and using stochastic complexity (SC).


Feature Selection for Classification Using Stochastic Complexity

      Statistical pattern recognition is often concerned with
classifying an object based on a set of features describing the
object.  In cases where the number of features is large, it is
desirable to select some subset of the features to perform the
classification.  This reduces computation and may also improve the
classification error rate.  This article describes a technique for
selecting the optimal feature subset for classification based on the
encoding length of training data comprising a set of feature
vectors X along with their corresponding class assignments c, using
an appropriate probability model for both the useful and the useless
features, and using stochastic complexity (SC).

      The SC criterion (code length) for model selection is given
by:

          SC = -\log \int p(X, c \mid r) \, p(r \mid \hat{o}) \, dr
               + H(\hat{o})

where p(X,c|r) represents a probability model for the training data
parameterized by a parameter set r, p(r|o) is a prior parameterized
by a short nuisance-parameter vector o, and H is the code length
needed to encode \hat{o}.  The hat (^) denotes optimal (i.e.,
code-length-minimizing) values.
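
      For concreteness, a standard Laplace-approximation (BIC-style)
expansion, well known in the MDL literature although not part of the
extracted text, relates this integral form to a familiar two-part
code length:

```latex
-\log \int p(X, c \mid r)\, p(r \mid \hat{o})\, dr
    \;\approx\; -\log p(X, c \mid \hat{r}) \;+\; \frac{m}{2} \log N
```

where m is the number of free parameters in r and N is the number of
training samples, up to terms that stay bounded as N grows.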

      Specifically, the following method is proposed for feature
selection:  exhaustively search (try all possible feature subsets)
for the subset U that minimizes the following code length:

          L(k) = SC(U, c) + SC(V \mid U)

where each term is the SC criterion above evaluated with the
corresponding model, U denotes the set of useful feature vectors
obtained from X by omitting the useless features, V denotes the
useless feature vectors, p(U|c,r_u) denotes the probability density
model (parameterized by r_u) assumed for the useful observations,
p(V|U,r_v) is the conditional probability model assumed for the
useless features (this uses the fact that the conditional
probability density function of the useless features, given the
useful features, does not depend on c), and k is an index indicating
which subset is being used.
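
      Because the article's models are only partially specified in
this extract, the following Python sketch illustrates the proposed
search under simplifying assumptions: independent per-feature
Gaussian models, the two-part (m/2) log N approximation to SC in
place of the integral form, and an unconditional code for the
useless features standing in for p(V|U,r_v).  All function names are
illustrative, not the article's.

```python
import itertools

import numpy as np


def gaussian_nll(Z):
    """Negative log-likelihood (nats) of the rows of Z under an
    independent Gaussian model with per-column ML mean and variance."""
    n, d = Z.shape
    var = Z.var(axis=0) + 1e-12  # ML variance; guard against zero
    return 0.5 * n * np.sum(np.log(2.0 * np.pi * var) + 1.0)


def two_part_code_length(Z):
    """NLL at the ML estimates plus (m/2) log n for the m = 2d Gaussian
    parameters: the standard asymptotic approximation to SC."""
    n, d = Z.shape
    return gaussian_nll(Z) + 0.5 * (2 * d) * np.log(n)


def subset_code_length(X, c, useful):
    """L(k) for one candidate subset: a class-conditional code for the
    useful columns U plus a code for the useless columns V.  For
    simplicity V is coded unconditionally, i.e. p(V) stands in for
    the article's p(V | U, r_v)."""
    useless = [j for j in range(X.shape[1]) if j not in set(useful)]
    total = 0.0
    for cls in np.unique(c):  # encode U given the class labels c
        total += two_part_code_length(X[c == cls][:, list(useful)])
    if useless:  # encode V without reference to c
        total += two_part_code_length(X[:, useless])
    return total


def select_features(X, c):
    """Exhaustive search over all non-empty feature subsets for the
    one minimizing the total code length."""
    d = X.shape[1]
    subsets = (s for r in range(1, d + 1)
               for s in itertools.combinations(range(d), r))
    return min(subsets, key=lambda s: subset_code_length(X, c, s))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.integers(0, 2, size=200)
    X = np.column_stack([
        c + 0.5 * rng.standard_normal(200),  # useful: depends on class
        rng.standard_normal(200),            # useless: class-independent
        rng.standard_normal(200),            # useless: class-independent
    ])
    print(select_features(X, c))             # typically (0,)
```

      For D features this evaluates 2^D - 1 candidate subsets,
matching the exhaustive search the article prescribes, so it is
practical only for modest D.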

      For models that assume that all the samples are Gaussian and
independent, the useless features can be shown to be a linear
combination of the useful feat...