THE STABILITY OF A HIERARCHICAL CLUSTERING
Original Publication Date: 1899-Dec-30
Included in the Prior Art Database: 2007-Apr-12
Software Patent Institute
Smith, Stephen P.: AUTHOR [+3]
THE STQBILITY OF R HIERARCHICRL CLUSTERING* Stephen P. Smith and Richard Dubes
THE STQBILITY OF R HIERARCHICRL CLUSTERING*
Stephen P. Smith and Richard Dubes
Department of Computcsr Science
Michigan State University
East Lansing, MI 48824
Clustering algorithms have the annoying habit of finding clusters, even
when the data are generated randomly. Verifying that potential clusters are
indeed the real clusters in some objective sense is receiving more attention as
the number of new clustering algorithms and their application is growing. We
consider one aspect of this question of cluster Validity and study the
feasibility of a variation on a measure of cll~stering stability proposed in the
1 iterature C 1,21.
Exploratory data analysis is for situations when little prior information
is available about a set of data and one wants to nlooK" a+. the data and study
its "structure". Clustering C3-71, also cal led "unsupervised pat tern
recognition" and "classification", is an important tool in exploratory data
analysis. Cluster validity C81 is concerned with the objective interpretat-ion
*Research supported by NSF grant ENG 76-11936 A01
of the results of clustering algorithms and tries to separate "truen structures
from artifacts of clustering algorithms. This paper is restricted to the
stability of clustering structures generated by hierarchical clustering
algorithms operating on rank-order proximity matrices. Most of the
quantitative results in cluster validity have been generated under these
restrict ions C9-141.
R proximity matrix is an N x N symmetric matrix in which each raw and
column represents a data item, or pattern to be clustered, and whore entries,
called proximities, express the degree of similarity (e.g., correlation) or
dissimilarity (e.g., Euclidean distance) between data items. Furthermore, the
original proximities are replaced by their rank orders, with no ties.
fi clustering method, in the form of a suitable algorithm, is applied to
the rank-order proximity matrix and a sequence of N-l nested partitions of the
N data items is formed. The two clustering methods treated in this paper,
single-link and complete-link C15,163, depend only on the ranK order of the
proximities, so the assumption of ordinal data is not restrictive.
dendrogram, or binary tree, records the successive merginq of clusters as the
algorithm proceeds from the disjoint clustering IN data items in N individual
clusters) to the conjoint clustering Call data items in a single cluster),
assuming an aglomerative algorithm is employed. The actual ranK order at which
the mergings occur is also recorded.
By way of example, a ranK-order proximity matrix is shown in Fig. 1 in
which ranK 1 means the closest pair. This matrix was derived by selecting six...