Feature Selection for Classification Using the MDL Principle

IP.com Disclosure Number: IPCOM000119262D
Original Publication Date: 1991-Jan-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 2 page(s) / 79K

Publishing Venue

IBM

Related People

Dom, BE: AUTHOR [+3]

Abstract

Statistical pattern recognition is often concerned with classifying an object based on a set of features describing the object. In cases where the number of features is large, it is desirable to select some subset of the features to perform the classification. This reduces computation and may also improve the classification error rate. This article describes a technique for selecting the optimal feature subset for classification. It is based on the encoding length of the training data (which includes a set of feature vectors X along with their corresponding class assignments c) using an appropriate probability model for both the useful and the useless features.

Feature Selection for Classification Using the MDL Principle

      Statistical pattern recognition is often concerned with
classifying an object based on a set of features describing the
object.  In cases where the number of features is large, it is
desirable to select some subset of the features to perform the
classification.  This reduces computation and may also improve the
classification error rate.  This article describes a technique for
selecting the optimal feature subset for classification.  It is based
on the encoding length of the training data (which includes a set of
feature vectors X along with their corresponding class assignments c)
using an appropriate probability model for both the useful and the
useless features.  The Minimum Description Length (MDL) principle
states that the best model M for a set of data D is the one that
minimizes the combined description length of M and of D encoded
using M or, equivalently, maximizes the joint probability p(M,D).
For the feature selection
problem, the model includes a specification of which features are
useful for classification and which are not (i.e., are useless), and
the optimal subset is the one for which the code length of the
feature vectors, the parameters, and the class vector c is minimal.
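
      As a minimal sketch of this trade-off (an illustration only;
the Bernoulli data model and the uniform quantization of its
parameter are assumptions, not choices from this article), consider
coding a bit string: transmitting a parameter drawn from a grid of
2**b candidates costs b bits, and the data then cost -log2 p(D|M)
bits, so MDL weighs model precision against goodness of fit.  In
Python:

    import math

    def two_part_code_length(data, p_heads, param_bits):
        # Two-part MDL code: L(M) + L(D|M) = -log2 p(M) - log2 p(D|M).
        # Minimizing this sum is equivalent to maximizing p(M, D).
        heads = sum(data)
        tails = len(data) - heads
        data_bits = -(heads * math.log2(p_heads)
                      + tails * math.log2(1 - p_heads))
        return param_bits + data_bits

    # Compare model classes of different complexity: a finer grid can
    # fit the data better but costs more bits to state the parameter.
    data = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
    for bits in (1, 3, 5):
        grid = [(k + 0.5) / 2**bits for k in range(2**bits)]
        best_p = min(grid,
                     key=lambda p: two_part_code_length(data, p, bits))
        print(bits, best_p,
              round(two_part_code_length(data, best_p, bits), 2))

On this toy input the coarse one-bit grid wins overall: the finer
grids improve the fit by less than their extra parameter cost, which
is exactly the minimization of the combined description length
described above.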

      Specifically, the following method is proposed for feature
selection:  exhaustively search (try all possible feature subsets)
for the subset U that minimizes the following code length:

    L = -log p(U | c, ru) - log p(V | U, rv) + L(ru) + L(rv) + L(c)

where U denotes the set of useful feature vectors obtained from X by
omitting the useless features, V denotes the useless feature vectors,
L(ru), L(rv), and L(c) denote the code lengths of the parameters and
of the class vector c, p(U | c, ru) denotes the probability density
model (parameterized by ru) assumed for the useful observations, and
p(V | U, rv) is the conditional probability model assumed for the
useless features (note, the conditional probability density
function o...
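
      A rough Python sketch of this exhaustive search follows, under
assumptions made only for illustration (none of them fixed by the
article): class-conditional Gaussians stand in for p(U | c, ru); a
single label-independent Gaussian per useless feature stands in for
p(V | U, rv), ignoring its dependence on U; fitted parameters are
charged a (k/2) log2 n penalty; no explicit code for c is counted;
and the helper names are invented for the sketch.

    import itertools
    import math

    import numpy as np

    def gaussian_code_length(values):
        # Bits to code a 1-D sample under its maximum-likelihood
        # Gaussian, plus a (k/2) log2 n penalty for the k = 2 fitted
        # parameters (mean and variance).
        n = len(values)
        var = float(np.var(values)) + 1e-9  # guard zero variance
        data_bits = 0.5 * n * math.log2(2 * math.pi * math.e * var)
        return data_bits + 0.5 * 2 * math.log2(n)

    def subset_code_length(X, c, useful):
        # Useful features are coded with one Gaussian per class
        # (standing in for p(U | c, ru)); useless features with a
        # single label-independent Gaussian (a simplified stand-in
        # for p(V | U, rv)).
        bits = 0.0
        for j in range(X.shape[1]):
            if j in useful:
                for label in np.unique(c):
                    bits += gaussian_code_length(X[c == label, j])
            else:
                bits += gaussian_code_length(X[:, j])
        return bits

    def select_features(X, c):
        # Exhaustive search over all 2**d feature subsets, keeping
        # the subset with the minimal total code length.
        d = X.shape[1]
        subsets = itertools.chain.from_iterable(
            itertools.combinations(range(d), k) for k in range(d + 1))
        return min(subsets, key=lambda s: subset_code_length(X, c, set(s)))

For example, with one class-informative feature and one pure-noise
feature, the search typically keeps only the informative one:

    rng = np.random.default_rng(0)
    c = rng.integers(0, 2, size=200)
    X = np.column_stack([c + rng.normal(0, 0.5, 200),   # informative
                         rng.normal(0, 1.0, 200)])      # noise
    print(select_features(X, c))  # usually (0,)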