Multivariate Statistical Data Reduction Method

IP.com Disclosure Number: IPCOM000104352D
Original Publication Date: 1993-Apr-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 4 page(s) / 114K

Publishing Venue

IBM

Related People

Ghosh, SP: AUTHOR [+2]

Abstract

A table-driven algorithm for statistical data reduction of multivariate data, and for decoding the reduced data, is claimed. The method is based on multivariate grid files with special treatment for outliers. The method works very well for highly repetitive multivariate data from manufacturing or socio-economic environments. Multivariate relations in the data are preserved in this type of data reduction.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      In many manufacturing, medical and socio-economic environments,
observations are recorded for multiple variates (parameters).  The
raw data are usually collected by microprocessors (e.g., PCs, ATs),
so large volumes of data are generated very rapidly.  These raw data
also have high multivariate repetitiveness.  Thus, it is important to
statistically reduce the raw data to save storage space while
preserving the validity of all multivariate statistical analyses
performed on the reduced data.  The usual multivariate frequency
distribution reduction used in statistical analysis is not suitable
for manufacturing defect analysis or medical diagnosis because the
identities of the individuals are lost in statistical frequency
distributions.  Thus, the structure of multivariate grid files [*] is
used to attack this problem.  The solution is obtained by combining
the structure of the multivariate grid file with the techniques of
multivariate frequency distribution and statistical quality control.
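
      The loss of identity described above can be illustrated with a
minimal sketch.  It is not the disclosed method; the record layout
(an identifier followed by a vector of numeric parameters), the
sample values, and the names `records`, `frequency` and
`ids_by_vector` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical raw records: (identifier, vector of numeric parameters).
# The values are invented purely for illustration.
records = [
    ("c001", (5.0, 1.2)),
    ("c002", (5.0, 1.2)),
    ("c003", (5.1, 1.2)),
    ("c004", (5.0, 1.2)),
    ("c005", (9.7, 0.1)),
]

# A plain multivariate frequency distribution maps each distinct
# vector to a count. The identifiers are discarded.
frequency = Counter(vector for _, vector in records)

# Grouping the identifiers under each distinct vector stores the
# vector once while still recording which individuals produced it.
ids_by_vector = defaultdict(list)
for record_id, vector in records:
    ids_by_vector[vector].append(record_id)
```

From `frequency` alone it is impossible to tell which components
produced the vector (5.0, 1.2), whereas `ids_by_vector` still can.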

      The structure of a raw data record generated by an individual
(e.g., a component, a patient, or a city) is assumed to be as
follows:  R(ID, P_1, P_2, ..., P_k), where ID is the identifier
attribute and may also contain other category attributes, and
P_1, P_2, ..., P_k are numeric attributes, also known as parameters.
All statistical analyses are performed on these parameters
P_1, P_2, ..., P_k, and the vectors of raw values generated by them
are highly repetitive in nature.  The values of the ID attribute are
usually distinct in each record.  In any practical database file, not
all vector values are repeated: a small number of them occur only
once, usually lie far from the rest, and are called outliers.
Outliers play a very important role in many practical analyses.  As
the raw value vectors associated with the outliers do not repeat,
this algorithm treats them as a separate class and performs no data
reduction on them.  The basic steps in the invention are:

STEP-I: Identify the statistical range of each of the parameters
P_1, P_2, ..., P_k within which the raw data values are repetitive,
and divide each range into class intervals.  The Statistical Quality
Control limits of each parameter can be used to define its
statistical range.

S...
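
      STEP-I above can be sketched as follows for a single parameter.
This is only an illustration under stated assumptions: the text does
not say how the control limits are computed or how the class
intervals are chosen, so the sketch assumes limits at the mean plus
or minus a multiple of the standard deviation (three by default) and
equal-width intervals.  The function names `control_limits` and
`class_interval` and the argument `n_intervals` are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(values, sigmas=3.0):
    """Return (lower, upper) as mean -/+ sigmas * standard deviation.

    Assumed rule: the source only says that quality control limits
    can define the statistical range, not how they are derived.
    """
    centre = mean(values)
    spread = pstdev(values)
    return centre - sigmas * spread, centre + sigmas * spread


def class_interval(value, lower, upper, n_intervals):
    """Return the 0-based class-interval index of value.

    Returns None when value lies outside [lower, upper], i.e.
    outside the statistical range of the parameter.
    """
    if not lower <= value <= upper:
        return None
    if upper == lower:
        return 0
    width = (upper - lower) / n_intervals  # equal-width intervals (assumed)
    # min() keeps value == upper inside the last interval.
    return min(int((value - lower) / width), n_intervals - 1)


# Usage with explicit, made-up limits of 4.0 and 6.0 and four intervals.
inside = class_interval(5.0, 4.0, 6.0, 4)
outside = class_interval(9.7, 4.0, 6.0, 4)
```

Here `inside` is the index of the interval containing 5.0, and
`outside` is None because 9.7 falls outside the assumed range.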