Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

THE STABILITY OF A HIERARCHICAL CLUSTERING

IP.com Disclosure Number: IPCOM000148965D
Original Publication Date: 1899-Dec-30
Included in the Prior Art Database: 2007-Apr-12
Document File: 34 page(s) / 1M

Publishing Venue

Software Patent Institute

Related People

Smith, Stephen P.: AUTHOR [+3]

Abstract

THE STQBILITY OF R HIERARCHICRL CLUSTERING* Stephen P. Smith and Richard Dubes Department of Computcsr Science Michigan State University East Lansing, MI 48824 Clustering algorithms have the annoying habit of finding clusters, even when the data are generated randomly. Verifying that potential clusters are indeed the real clusters in some objective sense is receiving more attention as the number of new clustering algorithms and their application is growing. We consider one aspect of this question of cluster Validity and study the feasibility of a variation on a measure of cll~stering stability proposed in the 1 iterature C 1,21. I. INTRODUCTION Exploratory data analysis is for situations when little prior information is available about a set of data and one wants to nlooK" a+. the data and study its "structure". Clustering C3-71, also cal led "unsupervised pat tern recognition" and "classification", is an important tool in exploratory data analysis. Cluster validity C81 is concerned with the objective interpretat-ion

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 10% of the total text.

Page 1 of 34

THE STQBILITY OF R HIERARCHICRL CLUSTERING*

Stephen P. Smith and Richard Dubes

Department of Computcsr Science
Michigan State University

East Lansing, MI 48824

    Clustering algorithms have the annoying habit of finding clusters, even
when the data are generated randomly. Verifying that potential clusters are
indeed the real clusters in some objective sense is receiving more attention as
the number of new clustering algorithms and their application is growing. We
consider one aspect of this question of cluster Validity and study the
feasibility of a variation on a measure of cll~stering stability proposed in the
1 iterature C 1,21.

I. INTRODUCTION

    Exploratory data analysis is for situations when little prior information
is available about a set of data and one wants to nlooK" a+. the data and study

its "structure". Clustering C3-71, also cal led "unsupervised pat tern
recognition" and "classification", is an important tool in exploratory data
analysis. Cluster validity C81 is concerned with the objective interpretat-ion

............................

*Research supported by NSF grant ENG 76-11936 A01

[This page contains 1 picture or other non-text object]

Page 2 of 34

PAGE 2

of the results of clustering algorithms and tries to separate "truen structures
from artifacts of clustering algorithms. This paper is restricted to the
stability of clustering structures generated by hierarchical clustering
algorithms operating on rank-order proximity matrices. Most of the
quantitative results in cluster validity have been generated under these
restrict ions C9-141.

R proximity matrix is an N x N symmetric matrix in which each raw and

column represents a data item, or pattern to be clustered, and whore entries,
called proximities, express the degree of similarity (e.g., correlation) or
dissimilarity (e.g., Euclidean distance) between data items. Furthermore, the
original proximities are replaced by their rank orders, with no ties.

fi clustering method, in the form of a suitable algorithm, is applied to

the rank-order proximity matrix and a sequence of N-l nested partitions of the
N data items is formed. The two clustering methods treated in this paper,
single-link and complete-link C15,163, depend only on the ranK order of the
proximities, so the assumption of ordinal data is not restrictive.
dendrogram, or binary tree, records the successive merginq of clusters as the
algorithm proceeds from the disjoint clustering IN data items in N individual
clusters) to the conjoint clustering Call data items in a single cluster),
assuming an aglomerative algorithm is employed. The actual ranK order at which

[This page contains 1 picture or other non-text object]

Page 3 of 34

PRGE 3

the mergings occur is also recorded.

    By way of example, a ranK-order proximity matrix is shown in Fig. 1 in
which ranK 1 means the closest pair. This matrix was derived by selecting six...