Browse Prior Art Database

Method for the Detection and Estimation of Sparse Replication

IP.com Disclosure Number: IPCOM000049067D
Original Publication Date: 1982-May-01
Included in the Prior Art Database: 2005-Feb-09
Document File: 3 page(s) / 24K

Publishing Venue

IBM

Related People

Martin, GNN: AUTHOR

Abstract

This article concerns a method of estimating how many values in a file are not unique, and extracting the values. It uses much less storage than loading the entire value set, and much less time than sorting.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 60% of the total text.

Page 1 of 3

Method for the Detection and Estimation of Sparse Replication

This article concerns a method of estimating how many values in a file are not unique, and extracting the values. It uses much less storage than loading the entire value set, and much less time than sorting.

Let us suppose that we have a set of N records with a key attribute that we suspect may not be unique. We wish to extract all records for which the key is not unique.

If we allocate a bit map, initialized to zeros, of a size
N.c/Ln(2) bits, we may in one pass through the records extract a list of keys that contains all the replicated keys, and only a fraction f of the unduplicated keys, where f is: c f

1 0.28

2 0.098

3 0.038

4 0.0157

5 O.0066

6 0.0029

7 0.00127

The method is as follows: Read each record in turn, and hash the key onto c (independent) bits of the bit map. If all of those c bits are set to ones, then the key may be a duplicate. and is added to the file of possible duplicates. Otherwise, all c bits are set to ones. At the end of the pass, the file of possible duplicates will contain all but one copy of every replicated key, plus a fraction f of the remaining keys. To obtain an exact list of replicated keys. we load the entire approximate list into direct storage, together with a count field for each, and make a second pass through the records. Provided duplication is sparse, and the size of count+key in bits is much less than 1/f, then we are able to load the list of possible duplicates into the space that was occupied by the bit map.

For the calculati...