
Synthetic Data Generator that simulates realistic data quality errors and entity size distribution

IP.com Disclosure Number: IPCOM000238636D
Publication Date: 2014-Sep-09
Document File: 4 page(s) / 110K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method for synthetic data generation that simulates realistic data quality errors and entity size distribution is disclosed.



Disclosed is a method for synthetic data generation that simulates realistic data quality errors and entity size distribution.

Master data in an organization are the data that reference the entities representing the organization's non-fungible assets, such as customers, employees, products, and equipment. Master data management (MDM) comprises the policies, procedures, and infrastructure needed to accurately capture, integrate, and manage master data. This is done by taking records from separate parts of the organization's operations, linking related records together, and defining that set of records as an entity. In this context, an entity typically refers to a person or an organization. The number of records linked together to define an entity is aptly called the entity size.
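As an illustration only (the disclosure does not prescribe any particular data model), linked records and the entity size derived from them could be represented as follows; the Record and Entity classes and their fields are hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Record:
    """One source record, e.g. captured by a single operational system."""
    source: str
    name: str
    address: str

@dataclass
class Entity:
    """A person or organization, defined as the set of records linked to it."""
    entity_id: int
    records: List[Record] = field(default_factory=list)

    @property
    def entity_size(self) -> int:
        # Entity size = number of records linked together for this entity.
        return len(self.records)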

Authentic individual records are hard to obtain for use in test cases, and even when they can be collected they provide only a small data set in comparison to real-world data. As a result, many have turned to producing artificial, or synthetic, data. Synthetic data should be made as realistic as possible. This extends to error mimicry as well as to the distribution of specific attributes such as names and locations. Multiple records for the same individual or entity can also occur, because records may come from different industries or companies.
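The excerpt does not spell out an error model; a minimal sketch of error mimicry, assuming simple character-level perturbations (substitution, deletion, transposition) applied at a configurable rate, might look like this. The function name and parameters are illustrative, not taken from the disclosure.

import random
import string

def inject_typo(value: str, error_rate: float = 0.1) -> str:
    """Return value with at most one character-level error, applied at error_rate."""
    if not value or random.random() > error_rate:
        return value
    i = random.randrange(len(value))
    kind = random.choice(["substitute", "delete", "transpose"])
    if kind == "substitute":
        return value[:i] + random.choice(string.ascii_lowercase) + value[i + 1:]
    if kind == "delete":
        return value[:i] + value[i + 1:]
    # transpose: swap with the following character (or the preceding one at the end)
    j = i + 1 if i + 1 < len(value) else i - 1
    if j < 0:  # single-character value: nothing to transpose with
        return value
    chars = list(value)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

Applying such a perturbation to a fraction of generated name and address fields yields records for the same entity that are related but not identical, rather than exact copies.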

How should synthetic data distribute entity sizes on an extremely large scale? It is very common for most individuals to have only one record, while only a few have 200 records or more. There have been various attempts to imitate this distribution, but none of them has produced a smooth, fluid record distribution, which can heavily affect testing of related systems. Existing methods have failed to capture this realistic characteristic.
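Purely as an illustration of the shape just described, and not as the disclosed method, entity sizes with this characteristic could be drawn from a truncated power-law (Zipf-like) distribution; the function name, max_size, and alpha below are assumptions.

import random

def sample_entity_size(max_size: int = 500, alpha: float = 2.0) -> int:
    """Draw an entity size from a truncated power-law (Zipf-like) distribution.

    Size 1 dominates; sizes near max_size are rare but still occur, matching
    the shape described above (many single-record entities, a long thin tail).
    """
    sizes = range(1, max_size + 1)
    weights = [k ** -alpha for k in sizes]
    return random.choices(sizes, weights=weights, k=1)[0]

With alpha around 2, the large majority of sampled sizes are 1 or 2, while sizes in the hundreds still appear rarely, giving the "many singletons, few very large entities" pattern.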

A typical way of generating synthetic data for MDM is to define the total number of records to be generated and the average, or the minimum and maximum, entity size. For example: generate 10,000 records with a minimum entity size of 1 and a maximum entity size of 500. With this method, the entity size distribution (entity sizes with their frequencies) usually follows a uniform or Gaussian distribution. Sometimes the records within the same entity are simply exact duplicates.
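For contrast, a minimal sketch of this typical approach, with uniformly distributed entity sizes and exact duplicates within each entity, could look like the following; naive_generate and its placeholder attributes are hypothetical names.

import random

def naive_generate(total_records: int = 10_000,
                   min_size: int = 1,
                   max_size: int = 500) -> list:
    """Generate entities with uniform entity sizes and duplicated records."""
    entities, generated, entity_id = [], 0, 0
    while generated < total_records:
        size = random.randint(min_size, max_size)      # uniform entity size
        size = min(size, total_records - generated)    # stay within the total
        base = {"entity_id": entity_id, "name": f"person_{entity_id}"}
        entities.append([dict(base) for _ in range(size)])  # exact duplicates
        generated += size
        entity_id += 1
    return entities

A generator of this shape produces a flat size histogram and identical records inside each entity, which is exactly the unrealistic behavior described above.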

When releasing software, it has been observed that the data used in the lab does not always reflect what actual customers have in the real world. For instance, not all customer information is perfect: customers may have made misspellings and other errors, and there may be a customer with 3,000 records. In these cases, the test data does not reflect a realistic model, and as a result the product can have performance, system, and accuracy problems, such as higher latency, crashes, and degraded services.

In order to overcome those problems, a syn...