
A method for the generation of large synthetic test datasets based on cluster models, and using database views as the generation mechanism
Disclosure Number: IPCOM000238143D
Publication Date: 2014-Aug-05
Document File: 3 page(s) / 43K

Publishing Venue

The Prior Art Database


This article describes a novel method for generating synthetic data in a database context. The distribution of values in the generated data conforms to the patterns described in a cluster model built on an original (training) dataset, and the data generator is implemented as a database view that generates each synthetic data record on the fly.

This is the abbreviated version, containing approximately 39% of the total text.



In testing business analytics products, we require test data that is representative of real-world data. However, customers are often unwilling to share their data because of privacy concerns, and transporting very large datasets also poses logistical problems. This invention comprises two aspects:

    (1) the use of a cluster model, built on the customer's original dataset, to drive the data generation process

    (2) embodiment of the data generation process in a database view definition, so that the data does not need to be stored and is generated when needed

    Cluster models are built from a training dataset by analytic software that looks for sub-populations in the original dataset which share common characteristics. For each cluster, such a model describes the characteristics of the sub-population of the data assigned to that cluster, and usually provides univariate statistics for each modelled field within the cluster's sub-population. The counts of training data records assigned to each cluster are also incorporated into the model.

    The cluster model therefore provides a statistical summary of the training data which is more detailed than simple univariate statistics that describe the entire dataset. The cluster model can be used to generate a synthetic dataset of any desired size, but with similar sub-populations. The synthetic data generation process is effectively the reverse of the training process which built the model from the training dataset, and recovers a synthetic dataset which has similarities to the training data.
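    The generation step described above can be sketched in Python. This is a minimal illustration rather than the disclosed implementation: the names FieldStats, Cluster and generate, the two-cluster model, and the choice of an independent normal distribution per field are all assumptions made for the example. A cluster is chosen with probability proportional to its training record count, and each field is then drawn from that cluster's univariate statistics.

```python
import random
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Univariate summary of one numeric field within one cluster."""
    mean: float
    std: float


@dataclass
class Cluster:
    """One cluster: its training record count and per-field statistics."""
    count: int
    fields: dict


def generate(clusters, n, seed=None):
    """Yield n synthetic records drawn from the cluster model."""
    rng = random.Random(seed)
    weights = [c.count for c in clusters]
    for _ in range(n):
        # Pick a cluster with probability proportional to its training count.
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
        # Draw each field independently from the chosen cluster's statistics.
        # (The normal distribution is an assumption of this sketch.)
        yield {name: rng.gauss(s.mean, s.std) for name, s in cluster.fields.items()}


# Hypothetical two-cluster model, for illustration only.
model = [
    Cluster(count=900, fields={"age": FieldStats(30.0, 5.0),
                               "income": FieldStats(40000.0, 8000.0)}),
    Cluster(count=100, fields={"age": FieldStats(60.0, 4.0),
                               "income": FieldStats(90000.0, 15000.0)}),
]

# Any number of records can be requested, independent of the training size.
records = list(generate(model, 5, seed=1))
```

    Because the model stores only counts and per-cluster statistics, the requested size n is unrelated to the size of the original training data.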

    In this approach, the similarity of the synthesised dataset to the original training dataset can be increased by increasing the number of clusters in the model. For example, if the number of clusters is set equal to the number of training records, each cluster will contain exactly one training record, and the synthesised data will match the training data exactly.
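    The limiting case can be checked with a small Python sketch. The three training records and the dictionary layout are hypothetical; the sketch assumes that a cluster holding a single record has that record as its centre and zero within-cluster spread, so every generated record is a copy of some training record (drawn with replacement).

```python
import random

# Hypothetical training data: (age, income) pairs.
training = [(25, 31000), (41, 52000), (67, 18000)]

# One cluster per training record: the centre is the record itself,
# the count is 1, and the within-cluster spread is zero.
clusters = [{"count": 1, "centre": rec, "std": (0.0, 0.0)} for rec in training]
weights = [cl["count"] for cl in clusters]

rng = random.Random(0)
synthetic = []
for _ in range(10):
    chosen = rng.choices(clusters, weights=weights, k=1)[0]
    # With zero spread, each draw collapses onto the cluster centre.
    synthetic.append(tuple(rng.gauss(m, s)
                           for m, s in zip(chosen["centre"], chosen["std"])))
```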

    Finally, we note that storing test data (in files or in a database table) may be expensive, especially when very large quantities of test data are required. For this reason we build the data synthesis algorithm as a database view definition. Applications obtain the test data by querying the view, but the data is not stored anywhere; it is generated on demand. The novelty of the invention is the use of a database view to generate the synthetic data on demand without incurring storage costs.
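    The view-based idea can be illustrated with SQLite through Python's standard sqlite3 module. This is a sketch under stated assumptions and not the DB2 view the disclosure describes: the view name synthetic_customers, the two hard-coded clusters, the fixed limit of 10000 rows, and the uniform jitter (used here in place of a proper per-field distribution) are all hypothetical. A recursive common table expression produces the rows, and random() supplies fresh values each time the view is queried.

```python
import sqlite3

# The view hard-codes a hypothetical two-cluster model:
#   cluster A: 900 of 1000 training records, age centred on 30
#   cluster B: 100 of 1000 training records, age centred on 60
VIEW_DDL = """
CREATE VIEW synthetic_customers AS
WITH RECURSIVE draws(id, pick, jitter) AS (
    SELECT 1,
           (random() & 2147483647) % 1000,
           (random() & 2147483647) % 11 - 5
    UNION ALL
    SELECT id + 1,
           (random() & 2147483647) % 1000,
           (random() & 2147483647) % 11 - 5
    FROM draws
    WHERE id < 10000
)
SELECT id,
       -- pick is uniform on 0..999, so 'A' is chosen about 90% of the time.
       CASE WHEN pick < 900 THEN 'A' ELSE 'B' END AS cluster_id,
       -- jitter is uniform on -5..5 (a simplification of the cluster statistics).
       CASE WHEN pick < 900 THEN 30 ELSE 60 END + jitter AS age
FROM draws
"""

conn = sqlite3.connect(":memory:")
conn.execute(VIEW_DDL)

# Each query regenerates the rows; no table holds the data.
first = conn.execute("SELECT * FROM synthetic_customers LIMIT 5").fetchall()
second = conn.execute("SELECT * FROM synthetic_customers LIMIT 5").fetchall()
```

    Only the view definition is stored in the database. Two successive queries will generally return different rows, which may or may not be desirable for repeatable tests.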

    The invention uses a process by which a cluster model, described by a file in the Predictive Model Markup Language (PMML), is converted into a database view definition written in the data definition language (DDL) of the database which will host the view. We outline how this will work for the DB2 database but the same methods can be applie...