
MDL Based Pruning of 2^n Tree Classifiers

IP.com Disclosure Number: IPCOM000109548D
Original Publication Date: 1992-Sep-01
Included in the Prior Art Database: 2005-Mar-24
Document File: 3 page(s) / 154K

Publishing Venue

IBM

Related People

Dom, BE: AUTHOR [+2]

Abstract

The technique described here achieves the dual objectives of decreasing the memory requirement and improving the performance of 2^n-tree classifiers (described in (1,2)). (A more detailed description of this MDL-based pruning technique can be found in (3).) The technique is based on the Minimum Description Length (MDL) principle, proposed in (4,5), as a measure of the goodness of models for describing data. In this technique the classification tree produced by the 2^n-tree training algorithms is considered to be a model for the class assignments of the training data conditioned on the associated feature vectors. A scheme is proposed for encoding the data using this model.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.


       The technique described here achieves the dual objectives
of decreasing the memory requirement and improving the performance of
2^n-tree classifiers (described in (1,2)).  (A more detailed
description of this MDL-based pruning technique can be found in (3).)
The technique is based on the Minimum Description Length (MDL)
principle, proposed in (4,5), as a measure of the goodness of models
for describing data.  In this technique the classification tree
produced by the 2^n-tree training algorithms is considered to be a
model for the class assignments of the training data conditioned on
the associated feature vectors.  A scheme is proposed for encoding
the data using this model.  This scheme consists of a code for the
model (the 2^n-tree in this case) and a code for the class
assignments conditioned on the model and the feature vectors.  The
latter consists of an encoding of the correct class codes for those
training vectors misclassified by the 2^n classification tree.
Associated with this encoding scheme is a total code length,
consisting of two parts corresponding to the two components just
mentioned.  This code length may be considered a measure of the
complexity of the data with respect to the 2^n-tree model class, and
the model that produces the lowest complexity (code length) is
sought.
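The two-part score can be illustrated with a small sketch. Everything here is an assumption for illustration only: the tuple representation, the fixed per-split cost, and binary splits standing in for the 2^n-way splits of the actual classifier. The first term charges for describing the tree; the second charges one class code per misclassified training vector, as in the scheme above.

```python
import math

# A tree is ("leaf", label) or ("split", dim, threshold, low, high);
# binary splits stand in for the 2^n-way splits of the real classifier.
BITS_PER_SPLIT = 2.0  # assumed cost of encoding one split (placeholder)

def classify(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, dim, thr, low, high = tree
    return classify(low if x[dim] < thr else high, x)

def model_bits(tree):
    """L(m|X): one bit per node for leaf/internal, plus a fixed
    placeholder cost for each split's parameters."""
    if tree[0] == "leaf":
        return 1.0
    return 1.0 + BITS_PER_SPLIT + model_bits(tree[3]) + model_bits(tree[4])

def data_bits(tree, samples, num_classes):
    """L(C|m,X): encode the correct class only for the training
    vectors that the tree misclassifies."""
    errors = sum(1 for x, y in samples if classify(tree, x) != y)
    return errors * math.log2(num_classes)

def total_bits(tree, samples, num_classes):
    """Total description length: L(C|m,X) + L(m|X)."""
    return data_bits(tree, samples, num_classes) + model_bits(tree)
```

A one-split tree scored against three two-class samples, one of which it misclassifies, pays 5 bits for the model and 1 bit for the correction under these placeholder costs.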

      This complexity measure is used in an algorithm where the
initial tree is pruned until the total code length begins to
increase.  When a branch is pruned from the tree, the code length of
the model (tree) decreases and the code length of the class codes
conditioned on the model either increases or remains unchanged.
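The pruning step can be sketched bottom-up: after pruning both subtrees of a node, collapse the node to a majority-class leaf whenever that does not increase the total code length. The greedy bottom-up order, the tuple representation, the binary splits, and the per-split cost are all illustrative assumptions, not the disclosure's exact procedure; the code-length helpers are repeated so the sketch stands alone.

```python
import math

# A tree is ("leaf", label) or ("split", dim, threshold, low, high);
# binary splits stand in for the 2^n-way splits of the real classifier.
BITS_PER_SPLIT = 2.0  # assumed cost of encoding one split (placeholder)

def classify(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, dim, thr, low, high = tree
    return classify(low if x[dim] < thr else high, x)

def model_bits(tree):
    if tree[0] == "leaf":
        return 1.0
    return 1.0 + BITS_PER_SPLIT + model_bits(tree[3]) + model_bits(tree[4])

def data_bits(tree, samples, num_classes):
    errors = sum(1 for x, y in samples if classify(tree, x) != y)
    return errors * math.log2(num_classes)

def mdl_prune(tree, samples, num_classes):
    """Bottom-up pruning: collapsing a subtree shrinks the model code
    while the misclassification code grows or stays the same, so keep
    the collapse exactly when the total does not increase."""
    if tree[0] == "leaf":
        return tree
    _, dim, thr, low, high = tree
    low_s = [(x, y) for x, y in samples if x[dim] < thr]
    high_s = [(x, y) for x, y in samples if x[dim] >= thr]
    pruned = ("split", dim, thr,
              mdl_prune(low, low_s, num_classes),
              mdl_prune(high, high_s, num_classes))
    counts = {}
    for _, y in samples:
        counts[y] = counts.get(y, 0) + 1
    majority = max(counts, key=counts.get) if counts else 0
    leaf = ("leaf", majority)
    keep = model_bits(pruned) + data_bits(pruned, samples, num_classes)
    collapse = model_bits(leaf) + data_bits(leaf, samples, num_classes)
    return leaf if collapse <= keep else pruned
```

With these costs a split whose two children predict the same class is always collapsed, while a split that prevents enough misclassifications to outweigh its own encoding cost is kept.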

      The classification trees produced by this pruning technique are
shown experimentally to be both significantly smaller and better at
classifying new feature vectors than the unpruned trees.  The
objective function obtained is:

              L(C, m | X)  =  L(C | m, X)  +  L(m | X)
                                                       (1)
where L(C | m, X) is the encoding length of the class codes
conditioned on the tree (m) and the feature vectors (X) and is given
by:
                            (Image Omitted)
                                                       (2)
Also, L(m | X) is the code length of the tree and is given by:
                            (Image Omitted)
                                                       (3)
where the following definitions apply:
1.  m = the number of bits of precision used to encode the feature
measurements ({x_j}). There are thus 2^m levels for each feature.
Also, the maximum tree depth is m.
2.  n = the number of components of x (i.e., the number of feature
measurements).  This is also the n in the 2^n-tree classi...