Method and process to detect insertions, deletions and mutations over time in large scale datasets with no known natural keys

IP.com Disclosure Number: IPCOM000243029D
Publication Date: 2015-Sep-09
Document File: 2 page(s) / 56K

Publishing Venue

The IP.com Prior Art Database

Abstract

We solve the problem of detecting change in a dataset over time. We describe a method and process for detecting duplicates and edits over time in a large scale dataset where there are no unique keys inherent in the data. We describe how to perform a subtractive operation that yields a minimal number of potential edits and duplicates. We then describe a statistical mechanism to determine whether the remaining elements are edits, duplicates or new data.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 55% of the total text.

The solution disclosed combines a statistical model of the data with a user-defined acceptable risk of not detecting a duplicate. We utilise the cardinality of the properties of objects to locate duplicates to an acceptable probability. The uniqueness of this solution lies not in the development of de-anonymization techniques but in their application to detecting duplicates in large datasets with no known natural keys.

    We assume that two datasets have been created via some process. We assume that the properties of an object may have changed. We assume that the set of properties discovered may have changed between the first collection and second collection.

    Our goal is to determine which of the elements in the first set have been deleted, which in the second set have been added, and which elements in the first set have been modified to become an element in the second set.

Property Selection

    We start with property selection. Each element has one or more properties. We examine the properties found across the elements in the dataset. We do not require that all elements have the same properties or even the same number of properties. We select properties as follows:

1. Determine the number of elements in each dataset.
2. Do not consider any property that is a generated unique ID.
3. Consider only one of a set of co-dependent variables (e.g. temperature in C or F, but not both).
4. Determine the cardinality (number of unique values in the combined datasets) for each property.
5. Select any properties that do not have a cardinality equal to the number of objects in both sets combined.
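The selection steps above can be sketched in Python. This is a minimal illustration, not the disclosed implementation: the function name `select_properties`, the representation of elements as dictionaries, and the `generated_ids` and `codependent_groups` parameters are all assumptions made for the example, and property values are assumed to be hashable.

```python
from typing import Any, Dict, Iterable, List, Set

Element = Dict[str, Any]  # assumed representation: property name -> value


def select_properties(
    first: List[Element],
    second: List[Element],
    generated_ids: Iterable[str] = (),
    codependent_groups: Iterable[Iterable[str]] = (),
) -> List[str]:
    """Return the properties usable for matching (hypothetical helper)."""
    # Step 1: number of elements in each dataset, and in both combined.
    total = len(first) + len(second)

    # Step 2: ignore generated unique IDs (named by the caller here).
    excluded: Set[str] = set(generated_ids)

    # Step 3: keep only the first member of each co-dependent group.
    for group in codependent_groups:
        excluded.update(list(group)[1:])

    # Step 4: cardinality of each property over the combined datasets.
    # Elements need not share the same properties.
    values: Dict[str, Set[Any]] = {}
    for element in first + second:
        for name, value in element.items():
            if name not in excluded:
                values.setdefault(name, set()).add(value)

    # Step 5: drop properties whose cardinality equals the total object count.
    return sorted(name for name, seen in values.items() if len(seen) != total)


first = [
    {"id": 1, "serial": "a", "city": "Oslo", "temp_c": 10, "temp_f": 50},
    {"id": 2, "serial": "b", "city": "Rome", "temp_c": 20, "temp_f": 68},
]
second = [
    {"id": 3, "serial": "c", "city": "Oslo", "temp_c": 10, "temp_f": 50},
    {"id": 4, "serial": "d", "city": "Lima", "temp_c": 18, "temp_f": 64, "note": "new"},
]
selected = select_properties(
    first,
    second,
    generated_ids=["id"],
    codependent_groups=[["temp_c", "temp_f"]],
)
print(selected)
```

In this sample data, `serial` takes four distinct values across four objects, so step 5 discards it, while `city`, `note` and `temp_c` are retained.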

Problem Space Reduction

    The problem space can be reduced by removing any elements that have not changed. The process is broken down into 3 steps. Using the selected properties, remove any exact matches.

1. The remaining objects from dataset 1 may have been:
   1. Modified to be one of the objects from dataset 2; or
   2. Removed, and so absent from dataset 2.
2. After accounting for all objects in dataset 1, the unmatched objects in dataset 2 are new.
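The exact-match removal can be sketched as a multiset subtraction over the selected properties. This is an illustrative sketch under stated assumptions: the names `signature` and `remove_exact_matches` are hypothetical, elements are assumed to be dictionaries with hashable values, and the choice to let each object in one dataset cancel at most one identical object in the other is an assumption rather than something the text specifies.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple

Element = Dict[str, Any]  # assumed representation: property name -> value
_MISSING = "<missing>"    # hypothetical placeholder for an absent property


def signature(element: Element, properties: Sequence[str]) -> Tuple[Any, ...]:
    """Project an element onto the selected properties."""
    return tuple(element.get(name, _MISSING) for name in properties)


def remove_exact_matches(
    first: List[Element],
    second: List[Element],
    properties: Sequence[str],
) -> Tuple[List[Element], List[Element]]:
    """Drop elements that match exactly; return the leftovers of each dataset."""
    count_first = Counter(signature(e, properties) for e in first)
    count_second = Counter(signature(e, properties) for e in second)
    # Multiset intersection: how many copies of each signature are in both.
    common = count_first & count_second

    def strip(elements: List[Element]) -> List[Element]:
        budget = Counter(common)  # copy, so each dataset is stripped separately
        kept = []
        for element in elements:
            key = signature(element, properties)
            if budget[key] > 0:
                budget[key] -= 1  # cancelled by an exact match
            else:
                kept.append(element)
        return kept

    return strip(first), strip(second)


first = [
    {"city": "Oslo", "temp_c": 10},
    {"city": "Rome", "temp_c": 20},
    {"city": "Oslo", "temp_c": 10},
]
second = [
    {"city": "Oslo", "temp_c": 10},
    {"city": "Rome", "temp_c": 21},
    {"city": "Lima", "temp_c": 18},
]
left_first, left_second = remove_exact_matches(first, second, ["city", "temp_c"])
print(left_first)
print(left_second)
```

The leftovers from dataset 1 are the candidates for modification or deletion, and the leftovers from dataset 2 are the candidates for edited or new objects; deciding between those cases is left to the later statistical step.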

Find the Edits = Discover the Additions and Deletions

    The basic concept is to locate shared properties between elements in two views of the same data taken at different times. Any elements that are exact matches across the identifie...