Browse Prior Art Database

Improved Segmentation for Data Analysis

IP.com Disclosure Number: IPCOM000118879D
Original Publication Date: 1997-Aug-01
Included in the Prior Art Database: 2005-Apr-01

Publishing Venue

IBM

Related People

Sharman, R: AUTHOR

Abstract

Data relating to business activities, such as customer records, shop transactions, and financial trading, are typically complex multi-dimensional objects. They are difficult to understand and interpret and have caused the growth of a vigorous "data mining" activity to use such data effectively in business management. One of the key methods of analyzing such data is the method of "segmentation" into subsets of data to show underlying structures within the data. Segmentation methods demand substantial computational resources and often use such resources inefficiently.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 23% of the total text.

Improved Segmentation for Data Analysis

      Data relating to business activities, such as customer records,
shop transactions, and financial trading, are typically complex
multi-dimensional objects.  They are difficult to understand and
interpret and have caused the growth of a vigorous "data mining"
activity to use such data effectively in business management.  One of
the key methods of analyzing such data is the method of
"segmentation" into subsets of data to show underlying structures
within the data.  Segmentation methods demand substantial
computational resources and often use such resources inefficiently.

      The approach described here offers improved efficiency by
providing a method of segmentation using "relational analysis".

      The known approach to commercial data mining is to:  (1)
extract relevant data from a company's operational data base, and
clean and check the values, (2) segment the multi-variate data into
its constituent subset using an algorithm suitable for this purpose,
and (3) interpret the results by tabular, or visual, methods as best
fits the needs of the data.

      This approach is directed to improving step (2).  A well-known
method of segmentation is "relational analysis" as pioneered by the
ECAM group under Marcotochino and exemplified by the work of
Herscovici.  This method arranges the multi-variate data in a
two-dimensional table, of size N by M, where N=number of customers
and M=number of independent variables.  Typically, the rows of the
table are customer details, and the columns are characteristics such
as age, income, etc.  Special techniques are applied to map both
continuous and categorial data into a form which can be easily
managed.

      From the raw data table, a "preference table" of size N by N is
derived, showing how record i differs from record j in terms of the
number of differences of independent variables.  A method of scoring
the preference table is derived from the concepts of preference
voting first  described by Condorcet (1788).

      The preference table is reordered in both rows and columns to
achieve a table in which the Condorcet scoring is maximized.  This is
done by considering the equivalent contingency tables and maximizing
the difference between groups while minimizing the difference within
groups.  The search for the optimal re-ordering is bounded by
considerations about the contingency table error distributions.  The
number of possible table re-orderings is (N! * M!) which is an
impossibly large number to search for any practical value of N.
Also, the nature of  commercial data is such that no simple function
or heuristic will lead  to a guaranteed method of finding the optimal
solution.

      The improvement is achieved by using a "genetic algorithm" to
search for good re-orderings.  The method is as follows:

      Create an N by M table to hold the data.  This would be best in
memory, since a large number of...