
Method of Classification of Complex Data Items by Nearest Neighbors

IP.com Disclosure Number: IPCOM000113393D
Original Publication Date: 1994-Aug-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 89K

Publishing Venue

IBM

Related People

Sharman, RA: AUTHOR

Abstract

Some types of data represent items, or samples, which may be individually highly variable but which fall naturally into certain classes. A method is required to (1) determine the classes which best characterise the data, and (2) decide which class a new item (observation, sample) should be placed in. The data is characterised by being (3) highly complex and (4) very large. A method which satisfies all these constraints is sought; additionally, the method must be (5) fast and (6) not take up excessive amounts of storage.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Simple data items can be defined by their numeric value, and
the concepts of a class average (mean) and variance are easily
established.  However, complex data, such as the Fourier
coefficients of a sampled waveform, cannot be averaged in the same
way.  Often the only metric which can be established is the
"distance" of one observation from another.  This is an important
restriction which prevents the use of standard methods.
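
      For example, if each observation is a vector of Fourier
coefficients, the Euclidean distance between two such vectors is one
plausible metric.  The sketch below is illustrative only: the vector
length m and the name obs_distance are assumptions introduced here,
not part of the original disclosure.

 #include <math.h>

 /* Hypothetical distance between two observations, each represented
    as a vector of m coefficients; a pairwise distance like this is
    often the only metric available for complex data. */
 double obs_distance(const double *a, const double *b, int m)
 {
     double sum = 0.0;
     for (int k = 0; k < m; k++) {
         double diff = a[k] - b[k];
         sum += diff * diff;
     }
     return sqrt(sum);
 }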

      The amount of data is important in selecting the method to use.
It is assumed here that very large amounts of data need to be
handled.  For example, if there are 10,000 observations, then a
simple table recording the distance from each point to every other
point would require 10,000 x 10,000 = 100,000,000 cells.  If each
distance is a float occupying only 4 bytes, then 400MB of memory
would be required.  If disk storage were used instead, the algorithm
would be too slow; likewise, if the distances were not stored but
re-calculated on demand, the algorithm would again be too slow.
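
      As a check on the arithmetic, the following small sketch simply
reproduces the storage estimate above:

 #include <stdio.h>

 /* Reproduces the estimate: N x N distances at 4 bytes each. */
 int main(void)
 {
     long long n = 10000;
     long long cells = n * n;                  /* 100,000,000 cells */
     long long bytes = cells * sizeof(float);  /* 400,000,000 bytes */
     printf("%lld cells, %.0f MB\n", cells, bytes / 1e6);
     return 0;
 }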

      The proposed method is to start with the unclustered
observations, where each observation conceptually belongs to a class
of its own.  The clustering algorithm proceeds by finding the nearest
neighbours and merging their classes, thus reducing the total number
of classes.  When the number of classes has been reduced to the
required count, or some evaluation metric (such as the entropy of the
entire classification) has reached the required level, the clustering
process stops.

Assuming an array of observations, x[1..N], and a parallel array of
class labels, cls[1..N], the basic algorithm can be described as
follows:
 N1 = N;                        /* the number of clusters, initially */
 for (i = 1; i <= N; i++)
     cls[i] = i;                /* each observation in a class of its own */
 while (N1 > limit)
 {   distMin = INFINITY;
     for (i = 1; i <= N; i++)
         for (j = i + 1; j <= N; j++)      /* j > i avoids self-pairs */
         {   if (cls[i] == cls[j])
                 continue;                 /* already in the same class */
             d = distance(x[i], x[j]);
             if (d < distMin)
             {   iMin = i;  jMin = j;      /* remember the closest pair */
                 distMin = d;
         }   }
     old = cls[jMin];                      /* merge jMin's class into iMin's */
     for (k = 1; k <= N; k++)
         if (cls[k] == old)
             cls[k] = cls[iMin];
     N1 = N1 - 1;
 }
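
      A complete, compilable version of the same procedure is
sketched below.  This is a minimal illustration, not the original
implementation: the observation type (a fixed-length vector of
doubles), the Euclidean distance, the sample data, and all names are
assumptions introduced here for concreteness.

 #include <math.h>
 #include <stdio.h>

 #define N 6        /* number of observations (illustrative)       */
 #define M 2        /* coefficients per observation (illustrative) */

 /* Assumed representation: each observation is a vector of M doubles. */
 static double x[N][M] = {
     {0.0, 0.0}, {0.1, 0.2}, {5.0, 5.1}, {5.2, 4.9}, {9.0, 9.0}, {9.1, 8.8}
 };
 static int cls[N];                     /* class label of each observation */

 static double distance(const double *a, const double *b)
 {
     double sum = 0.0;
     for (int k = 0; k < M; k++) {
         double diff = a[k] - b[k];
         sum += diff * diff;
     }
     return sqrt(sum);
 }

 int main(void)
 {
     int limit = 3;                     /* required number of clusters */
     int n1 = N;                        /* current number of clusters  */

     for (int i = 0; i < N; i++)
         cls[i] = i;                    /* one class per observation   */

     while (n1 > limit) {
         double distMin = INFINITY;
         int iMin = 0, jMin = 1;
         for (int i = 0; i < N; i++)
             for (int j = i + 1; j < N; j++) {
                 if (cls[i] == cls[j])
                     continue;          /* already merged */
                 double d = distance(x[i], x[j]);
                 if (d < distMin) {
                     distMin = d;  iMin = i;  jMin = j;
                 }
             }
         int old = cls[jMin];           /* relabel jMin's class        */
         for (int k = 0; k < N; k++)
             if (cls[k] == old)
                 cls[k] = cls[iMin];
         n1 = n1 - 1;
     }

     for (int i = 0; i < N; i++)
         printf("observation %d -> class %d\n", i, cls[i]);
     return 0;
 }

Note that the sketch re-computes distances on each pass rather than
storing an N x N table, in keeping with the storage constraint
described above.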

      This algorithm has essent...