High Affinity Low Confidence Cluster To High Confidence Cluster Data Enhancement

IP.com Disclosure Number: IPCOM000226487D
Publication Date: 2013-Apr-08
Document File: 2 page(s) / 19K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a novel approach to optimizing the generation of higher quality data mining clusters.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Often, when working with high-dimensional data, cleansing routines analyze the data sets and drop impure or inaccurate data attributes. The cleansed data is then used to form clusters or classifications that give researchers an intuition about the population. This generalized intuition can be improved by manually updating records, acquiring new data, or reviewing the data by hand.

There is a need to improve the generation of clusters and classifications while keeping their density manageable.

The invention is a novel approach to optimizing the generation of higher quality data mining clusters. The approach includes methods to:


• Identify a cluster's membership and associated confidence


• Acquire the attributes of the dataset and the attributes of the cluster


• Analyze the low impact attributes for the cluster based on a dynamic threshold

• Generate a survey of the low impact attributes, including:
- Creating a question for each of the low impact attributes
- Creating an answer set from the unique values of the low impact attributes
- Transmitting the survey to the users of the associated data
- Receiving and updating the data set with the survey data.
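
The threshold and survey methods listed above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The disclosure says only that the threshold is dynamic, so the cutoff used here (mean minus k population standard deviations of per-attribute impact scores) is an assumption, as are the function names, the `user_id` field, and the question wording. Transmitting the survey is omitted.

```python
from statistics import mean, pstdev


def low_impact_attributes(impact_scores, k=1.0):
    """Return attributes whose impact score falls below a data-dependent cutoff.

    Assumption: cutoff = mean - k * population std dev of the scores.
    The disclosure does not define how the dynamic threshold is computed.
    """
    scores = list(impact_scores.values())
    cutoff = mean(scores) - k * pstdev(scores)
    return sorted(a for a, s in impact_scores.items() if s < cutoff)


def build_survey(records, attributes):
    """Create one question per attribute; answers are the unique observed values."""
    survey = []
    for attr in attributes:
        observed = {r[attr] for r in records if r.get(attr) is not None}
        survey.append({
            "attribute": attr,
            "question": f"What is your {attr}?",  # hypothetical wording
            "answers": sorted(observed, key=str),
        })
    return survey


def apply_responses(records, responses, id_field="user_id"):
    """Update records in place; responses maps a user id to {attribute: value}."""
    by_id = {r[id_field]: r for r in records}
    for uid, answers in responses.items():
        if uid in by_id:
            by_id[uid].update(answers)
    return records


# Hypothetical example data.
impact = {"age": 0.9, "region": 0.8, "hobby": 0.1, "income": 0.85}
people = [
    {"user_id": "u1", "hobby": "chess"},
    {"user_id": "u2", "hobby": None},
    {"user_id": "u3", "hobby": "golf"},
]
low = low_impact_attributes(impact)
survey = build_survey(people, low)
apply_responses(people, {"u2": {"hobby": "golf"}})
```

Deriving the cutoff from the score distribution, rather than fixing it, is one way to make the threshold adapt to each cluster. Any other data-dependent rule would fit the description equally well.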

The steps for implementing the invention in a preferred embodiment follow. The invention:


1. Automatically extracts the user identifiers when grabbing the low impact attributes. It may automatically send the generated survey to the associated user population. The invention can automatically update the data set or create a second model based on the new data set, and then implement a voting mechanism between the models. The invention can add questions to the survey for existing data to get a second level of confidence in the data. This may be randomly determined.

2. Cleanses high dimensional data sets prior to running a classifier or clusterer. During the cleansing operation, the invention extracts the low integrity attributes, data identifiers, and the corresponding social identifiers (e.g., email identification, social security number, etc.). The low integrity attributes are then moved to a secondary dataset.

3. Uses the cleansed data set to generate clusters. The invention uses the precision, confidence, or density of each cluster, which may be determined by a metric or algorithm (e.g., Simple K-Means or Nearest-Neighbor). As the clusters are used, the invention records the confidence, or satisfaction, in their use. Upon detecting a low confidence associated with the operations associated with the generated...
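
The cleansing operation described in step 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed method. The disclosure does not say how a low integrity attribute is detected, so the sketch assumes an attribute is low integrity when its share of missing values exceeds a hypothetical `max_missing_ratio`. The function name and the example field names are also hypothetical.

```python
def split_low_integrity(records, id_fields, max_missing_ratio=0.3):
    """Split records into a cleansed data set and a secondary data set.

    Assumption: an attribute is "low integrity" when the fraction of
    records where it is missing (None) exceeds max_missing_ratio.
    """
    if not records:
        return [], [], []
    attributes = {k for r in records for k in r} - set(id_fields)
    n = len(records)
    low = sorted(
        a for a in attributes
        if sum(1 for r in records if r.get(a) is None) / n > max_missing_ratio
    )
    # The cleansed data set keeps everything except the low integrity attributes.
    cleansed = [{k: v for k, v in r.items() if k not in low} for r in records]
    # The secondary data set keeps the identifiers plus the low integrity
    # attributes, so survey answers can later be matched back to records.
    secondary = [{k: r.get(k) for k in (*id_fields, *low)} for r in records]
    return cleansed, secondary, low


# Hypothetical example data.
rows = [
    {"record_id": 1, "email": "a@example.com", "age": 30, "hobby": None},
    {"record_id": 2, "email": "b@example.com", "age": 41, "hobby": "golf"},
    {"record_id": 3, "email": "c@example.com", "age": 25, "hobby": None},
]
cleansed, secondary, low_attrs = split_low_integrity(
    rows, id_fields=("record_id", "email")
)
```

Keeping the data and social identifiers in both data sets is what would let a later survey response be joined back to the original record.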