Data Preprocessing With Clustering Algorithms

IP.com Disclosure Number: IPCOM000119994D
Original Publication Date: 1991-Mar-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 2 page(s) / 74K

Publishing Venue

IBM

Related People

Herskovits, EH: AUTHOR

Abstract

Many classifiers require the classes to have multivariate normal distributions; this requirement is rarely met by the training data. By preprocessing the data with a clustering algorithm, we obtain clusters that have multivariate normal distributions, at the cost of increasing the number of classes.

This is the abbreviated version, containing approximately 52% of the total text.

      Many classifiers require the classes to have multivariate
normal distributions; this requirement is rarely met by the training
data.  By preprocessing the data with a clustering algorithm, we
obtain clusters that have multivariate normal distributions, at the
cost of increasing the number of classes.
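The preprocessing idea can be sketched as follows. This is a minimal illustration, not the disclosed system: it uses plain k-means as a stand-in for whichever clustering algorithm is chosen, and the names `kmeans`, `split_classes`, and `parent_of`, the fixed `k`, and the synthetic data are all assumptions made for the example. Each labeled class is split into sub-classes, and a map back to the original class is kept.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns a cluster index per point."""
    rng = np.random.default_rng(seed)
    # Start from k distinct training points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its points; leave empty clusters in place.
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def split_classes(X, y, k):
    """Relabel each class as up to k sub-classes; return new labels and a parent map."""
    sub_labels = np.empty(len(y), dtype=int)
    parent_of = {}  # hypothetical helper: sub-class id -> original class label
    next_id = 0
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        assign = kmeans(X[idx], k)
        for j in np.unique(assign):
            sub_labels[idx[assign == j]] = next_id
            parent_of[next_id] = cls
            next_id += 1
    return sub_labels, parent_of

# Synthetic data (assumed): class 0 is bimodal, class 1 is a single blob.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),
    rng.normal([10, 10], 0.3, size=(50, 2)),
    rng.normal([5, -5], 0.3, size=(50, 2)),
])
y = np.array([0] * 100 + [1] * 50)
sub_labels, parent_of = split_classes(X, y, k=2)
print(len(parent_of), "sub-classes from", len(np.unique(y)), "classes")
```

The number of classes grows from the original count to as many as `k` times that count, which is the cost the paragraph above refers to.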

      Previous work on image classification has been based on either
unsupervised methods, such as clustering algorithms, or supervised
methods, such as the maximum-likelihood (ML) algorithm.  The former
suffer from a lack of semantics; that is, the results of clustering
algorithms must themselves be labeled by an expert.  Statistical
classifiers, in contrast, are supervised; that is, there is a
training phase during which images of known composition are presented
to the classifier, which, in turn, generates summary statistics for
later classification of new images.  The primary problem with
statistical methods, such as the ML algorithm, is that, for
computational tractability, they assume that each class has a
multivariate Gaussian distribution of pixel intensities.  This
assumption is often violated by both the training data and the
images to be classified, resulting in poor classification accuracy.
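For illustration, a maximum-likelihood classifier under the multivariate Gaussian assumption might look like the sketch below. The function names, the two-feature synthetic "pixel intensity" vectors, and the integer class labels are assumptions made for the example and do not come from the disclosure. The training phase reduces each class to a mean vector and a covariance matrix; classification picks the class with the highest Gaussian log-likelihood.

```python
import numpy as np

def fit_gaussians(X, y):
    """Training phase: summary statistics (mean, covariance) for each class."""
    stats = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        stats[int(cls)] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return stats

def log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)  # squared Mahalanobis distance
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)

def classify(x, stats):
    """Assign x to the class whose Gaussian gives it the highest likelihood."""
    return max(stats, key=lambda c: log_likelihood(x, *stats[c]))

# Synthetic training set (assumed): two classes that really are Gaussian.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(200, 2)),
    rng.normal([6, 6], 1.0, size=(200, 2)),
])
y = np.array([0] * 200 + [1] * 200)
stats = fit_gaussians(X, y)
print(classify(np.array([0.2, -0.1]), stats), classify(np.array([5.8, 6.3]), stats))
```

When a class is multimodal, a single mean and covariance describe it poorly, which is the failure mode described above.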

      Combining an unsupervised clustering algorithm with a
supervised algorithm yields a classification method that retains the
advantages of each while reducing their disadvantages.  The prototype
system, in the domain of medical-image analysis, preprocesses an
image training set with the ISODATA clustering algorithm before
presenting it to the ML statistical classifier.  The
statistics thus generated classify a new image into many clusters
relative to t...