System and Method for Generating Synthetic Dataset with Realistic Data Distribution for Structured Data

IP.com Disclosure Number: IPCOM000243740D
Publication Date: 2015-Oct-16
Document File: 9 page(s) / 164K

Publishing Venue

The IP.com Prior Art Database

Abstract

Comprehensive real data is often hard to come by, even though it is desired in many cases. Very few tools offer flexible data generation that reproduces the varying distributions of real data. This invention provides a system and method for generating a synthetic dataset with a realistic data distribution for structured data. It takes a sample dataset and a few samples of the required data files (e.g., XML) and performs a set of analyses to identify the key features and the correlations among them. A template-based approach is applied, together with a data generator that consumes the analysis results, to generate a realistic synthetic dataset. The key advantage of this invention is that it derives its base knowledge from real data, performing data structure analysis and feature analysis to identify variables. It identifies and handles the correlations between variables. In particular, it enables the generation of coded values with their real scope and probability. It also provides an open framework in which external plug-ins can generate the fake data, leveraging existing tools, which makes it extensible to the generation of complex random variables.
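The idea of generating coded values "with the real scope and probability" and inserting them into a template can be illustrated with a minimal Python sketch. This is not the disclosed implementation: the function names, the sample codes (A01, B02, C03), and the XML fragment are all hypothetical, and the sketch assumes the simplest possible analysis, namely counting how often each coded value occurs in a sample.

```python
import random
from collections import Counter
from string import Template


def learn_code_distribution(sample_codes):
    """Estimate the scope (distinct values) and probability of coded values.

    Hypothetical helper: returns the sorted distinct values and their
    relative frequencies in the sample.
    """
    counts = Counter(sample_codes)
    values = sorted(counts)
    total = sum(counts.values())
    weights = [counts[v] / total for v in values]
    return values, weights


def generate_records(template, values, weights, n, seed=None):
    """Fill the template n times with codes drawn from the learned distribution."""
    rng = random.Random(seed)
    codes = rng.choices(values, weights=weights, k=n)
    return [template.substitute(code=c) for c in codes]


# Hypothetical sample of coded values observed in real documents.
sample = ["A01", "B02", "B02", "C03", "B02", "A01"]
values, weights = learn_code_distribution(sample)

# Hypothetical template fragment; a real template would be derived
# from the sample data files.
template = Template('<problem code="$code"/>')
records = generate_records(template, values, weights, n=5, seed=0)
```

Because the helper accepts any hashable items, passing tuples such as (variable_1, variable_2) instead of single codes would sample the variables jointly, which is one simple way to keep a correlation observed in the sample.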

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 23% of the total text.

     Synthetic datasets are in demand for many purposes: testing systems, selecting a database system based on tests with realistic data storage, sharing and using datasets without exposing actual personal data to theft, and validating analysis algorithms on sample data. Ideally, a synthetic dataset is realistic, with plausible personal data, meaningful values, the real scope of coded values, and so on. Take, for example, a dataset needed in the healthcare domain: complex Clinical Document Architecture (CDA) documents in XML should include not only realistic personal information and meaningful measurement results, but also a large range of coded clinical terms for possible problem lists and medications, and a reasonable number of clinical events, such as the number of encounters. It is desirable that system testing, architecture decisions, and algorithm validation experiments be performed under realistic requirements or situations.

     However, very few tools can generate realistic synthetic datasets, let alone reproduce the real data distribution. Existing tools for generating synthetic datasets fall into three types. The first type populates database storage by writing repeated or randomly generated data to specific table columns. Such random data may be meaningless and not human-readable. For instance, the data generated for an address field may be strings of letters in random sequence. Tools of this kind are usually database oriented. The second type consists of emerging online web tools. They can provide or generate realistic subsets of data from existing real datasets or known rules, but the data is currently limited to types such as name, address, and ID card. The generated data can often be packaged in files and used for demos. The third type is rare: these tools provide a framework for customizing string and numeric value generation rules with pattern expressions, and they generate datasets targeted at databases, CSV files, or XML files. So far, no commercial or open source tool has been found that claims to generate datasets with the same data distribution and correlations as the real data.

     An academic paper [1] presents a language approach for generating synthetic data. The main purpose of this language is to allow the generation of data that conforms to exact characteristics, such as a normal or a Zipfian distribution (*note 1). Another approach, a graph-model approach presented in academic paper [2], is intended to generate realistic test data for On-Line Transaction Processing (OLTP), On-Line Analytical Processing (OLAP), and data streaming applications. Both generate datasets according to an existing database schema design. Neither addresses how to identify the key features (possibly correlated features) from real data.
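     For readers unfamiliar with the term, generating data that conforms to a Zipfian distribution can be sketched as follows. This is only an illustration in Python, not the language described in [1]; the function names and the exponent parameter s are assumptions made here. The sketch uses a finite set of items in which the item of rank k is drawn with probability proportional to 1/k^s.

```python
import random


def zipf_weights(n, s=1.0):
    """Normalized Zipfian probabilities for ranks 1..n with exponent s."""
    raw = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_zipf(items, k, s=1.0, seed=None):
    """Draw k items; items earlier in the list are treated as more frequent."""
    rng = random.Random(seed)
    return rng.choices(items, weights=zipf_weights(len(items), s), k=k)
```

     With s = 1, the first item is drawn twice as often as the second and three times as often as the third, which is the skew that distinguishes such data from uniformly random values.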

     This invention takes the...