Automatically Creating And Enhancing Reference Data From A Standardization Process

IP.com Disclosure Number: IPCOM000239118D
Publication Date: 2014-Oct-13
Document File: 3 page(s) / 36K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to automate the creation and enhancement of reference data as part of a feedback loop during a standardization process.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Standardized, cleansed data of known recency is critical in a variety of domains across organizations. Within the name, address, or product domains, rule sets exist that are designed to expedite and normalize the application of standards across data sources. These rule sets are replete with reference data, which at present is difficult to maintain and augment in an automated, cost-effective manner.

Typically, a subject matter expert (SME) painstakingly creates reference data using manual inspection and judgment, at times aided by data profiling and column analysis. The techniques used by the SME are time-consuming, error-prone, and require significant expertise. Further issues arise from the unavailability of SMEs, the labor-intensive steps needed to identify and create reference tables, and the high cost of maintaining reference data.

The disclosed method automates the creation and enhancement of reference data as part of a feedback loop during a standardization process. According to embodiments of the present invention, responsive to identification of a new data source, the associated metadata details are discerned. Data profiling and overlap analysis are performed to suggest relevant domains within the data. The overlap analysis considers all available reference data to suggest one or more domains that may be present in the data. If an existing standardization rule set is present, it is also applied to a set of samples. Data is then extracted from those columns for which the overlap analysis and the output of standardization for the samples (as measured by the completely handled records) suggest a correlation with a specific domain. If the frequency of terms exceeds a minimal threshold (e.g., 5%), then the terms are suggested as candidates for inclusion in the reference data that corresponds to the class suggested by the value in the pattern. These may be automatically added to the reference data, or an approval of the inclusion can be triggered for review.
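
The overlap analysis and frequency threshold described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names, the upper-case normalization, the layout of the reference data as a dictionary of term sets, and the 50% overlap cutoff are all assumptions made for illustration. Only the 5% frequency threshold comes from the text.

```python
from collections import Counter


def normalize(values):
    """Upper-case and trim values, dropping empties (an assumed normalization)."""
    return [v.strip().upper() for v in values if v and v.strip()]


def overlap_ratio(column_values, reference_terms):
    """Fraction of non-empty column values already present in the reference terms."""
    values = normalize(column_values)
    if not values:
        return 0.0
    return sum(1 for v in values if v in reference_terms) / len(values)


def suggest_domain(column_values, reference_data, min_overlap=0.5):
    """Return the domain with the highest overlap, or None if none reaches min_overlap.

    min_overlap is a hypothetical cutoff; the disclosure does not state one.
    """
    scores = {d: overlap_ratio(column_values, terms) for d, terms in reference_data.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < min_overlap:
        return None
    return best


def candidate_terms(column_values, known_terms, min_frequency=0.05):
    """Unknown terms whose relative frequency in the column exceeds min_frequency."""
    values = normalize(column_values)
    if not values:
        return []
    counts = Counter(values)
    return sorted(
        term for term, n in counts.items()
        if term not in known_terms and n / len(values) > min_frequency
    )


if __name__ == "__main__":
    # Hypothetical reference data, stored upper-case, keyed by domain.
    reference_data = {"firstname": {"JOHN", "MARY"}, "city": {"PARIS", "ROME"}}
    column = ["John"] * 10 + ["Mary"] * 8 + ["Priya"] * 3 + ["Zed"]
    domain = suggest_domain(column, reference_data)
    print(domain)                                           # firstname
    print(candidate_terms(column, reference_data[domain]))  # ['PRIYA']
```

In this sample, "Zed" occurs in 1 of 22 values (about 4.5%) and falls below the 5% threshold, while "Priya" (3 of 22) is suggested. Whether a suggestion is added automatically or routed for approval is left to the caller.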

Responsive to the overlap analysis suggesting that one or more columns contain a specific domain or, more precisely, a specific class (in the QualityStage sense of class, such as firstname in the USNAME rule set) in a recognized pattern that exceeds a threshold, the method adds the data values to the reference data.
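
One way to picture the class-in-pattern idea is the following Python sketch. It does not use QualityStage or its pattern syntax; the whitespace tokenizer, the "?" marker for unclassified tokens, the tuple representation of patterns, and the rule that a record must contain exactly one unclassified token are all hypothetical simplifications.

```python
from collections import Counter

UNKNOWN = "?"  # hypothetical marker for a token found in no reference table


def classify(token, reference_data):
    """Return the first class whose reference terms contain the token, else UNKNOWN."""
    for cls, terms in reference_data.items():
        if token in terms:
            return cls
    return UNKNOWN


def infer_candidates(records, reference_data, recognized_patterns, min_frequency=0.05):
    """Suggest (token, class) pairs for tokens that complete a recognized pattern.

    A record votes for a pair when it has exactly one unclassified token and
    exactly one class fills that slot so that the whole record matches a
    recognized pattern. Pairs are kept when their share of all records
    exceeds min_frequency.
    """
    votes = Counter()
    for record in records:
        tokens = record.upper().split()
        classes = [classify(t, reference_data) for t in tokens]
        if classes.count(UNKNOWN) != 1:
            continue
        slot = classes.index(UNKNOWN)
        fits = {
            pattern[slot]
            for pattern in recognized_patterns
            if len(pattern) == len(classes)
            and all(c == p for i, (c, p) in enumerate(zip(classes, pattern)) if i != slot)
        }
        if len(fits) == 1:
            votes[(tokens[slot], fits.pop())] += 1
    total = len(records)
    return {pair: n for pair, n in votes.items() if total and n / total > min_frequency}


if __name__ == "__main__":
    # Hypothetical reference data and a single recognized name pattern.
    reference_data = {"firstname": {"JOHN", "MARY"}, "lastname": {"SMITH", "JONES"}}
    patterns = {("firstname", "lastname")}
    records = ["John Smith", "Mary Jones", "Priya Smith", "Priya Jones", "Mary Smith", "Xq Zz"]
    print(infer_candidates(records, reference_data, patterns))
    # {('PRIYA', 'firstname'): 2}
```

Here "Priya" appears twice in the firstname slot of an otherwise recognized pattern, so it is suggested for the firstname class. The record "Xq Zz" contributes nothing because neither token is known.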

The method herein disclosed includes the following features. First, the method can apply knowledge of patterns in which reference data appears in a given data set, and determine other data values that qualify as reference data. Second, the method incrementally builds reference data with suggestions for new data that can be added to the reference data...