
Stochastic Identification of Duplicate Computer Files

IP.com Disclosure Number: IPCOM000019333D
Publication Date: 2003-Sep-11

Publishing Venue

The IP.com Prior Art Database

Abstract

This invention inserts a stochastic filtering procedure before any attempt to compare actual file contents, by calculating K-bit checksums for each of the candidate files and discarding files having unique checksums from further consideration as potential duplicate files. It performs the comparisons of actual file contents only between files having identical checksums, reducing the time required to confirm the identification of actual duplicates.


Stochastic Identification of Duplicate Computer Files

Proliferation of computer files often causes redundancy. That is, two or more files are identical. The problem is to identify the redundant files so that they may be removed.

The traditional means of identifying duplicate files in a collection has been to compare the contents of each pair of files for equivalence, with occasional optimizations in which, for example, the sizes of the files are compared first, because the files cannot be duplicates if their lengths differ. Because the number of pairs grows quadratically with the number of files, such techniques are too slow to be of practical use for large collections. As a result, attempts to discover and remove redundancy among computer files are infrequent, leading to proliferation.
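By way of illustration only, the traditional pairwise approach may be sketched in Python as follows; the function name find_duplicates_naive and the use of the standard filecmp module are choices of this sketch, not part of the disclosure:

    import filecmp
    from itertools import combinations

    def find_duplicates_naive(paths):
        # Compare the contents of every pair of files: N*(N-1)/2 comparisons.
        duplicates = []
        for a, b in combinations(paths, 2):
            # shallow=False forces a comparison of actual file contents.
            if filecmp.cmp(a, b, shallow=False):
                duplicates.append((a, b))
        return duplicates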

This invention inserts a stochastic filtering procedure before any attempt to compare actual file contents, by calculating K-bit checksums for each of the candidate files and discarding files having unique checksums from further consideration as potential duplicate files. It performs the comparisons of actual file contents only between files having identical checksums, reducing the time required to confirm the identification of actual duplicates.

This invention is expressed as an algorithm using mathematical notation, but it is intended to be implemented in software.

1.      Initial definitions.

1.1.  Define A as the set of files among which duplicates are to be sought.

1.2.  The symbol N is defined as the number of files in A (i.e., the cardinality of A).

1.3.  Then we denote the ith file in A as A[i], for 0 < i ≤ N.

2.      Filter by file size.

2.1.  Define the function S(f) as the size, in octets, of file f.

2.2.  For each i such that 0 < i ≤ N, calculate s = S(A[i]), and place A[i] in set B[s].

2.3.  For each set B[s] that contains fewer than 2 files, discard the set from further consideration; the files in B[s] do not duplicate any other files in A, because they have unique lengths.
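By way of illustration only, the size filter of step 2 may be sketched in Python as follows; representing the set A as a list of file paths and the name group_by_size are assumptions of this sketch:

    import os
    from collections import defaultdict

    def group_by_size(paths):
        # Step 2.2: place each file A[i] in the set B[s], where s = S(A[i]).
        B = defaultdict(list)
        for path in paths:
            B[os.path.getsize(path)].append(path)
        # Step 2.3: discard every B[s] containing fewer than 2 files.
        return {s: files for s, files in B.items() if len(files) >= 2}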

3.      Filter by checksum.

3.1.  Define the function T(f) as the K-bit checksum of the contents of file f. For purposes of this algorithm, T(f) is expected to take all of the contents of file f into account, to produce the same checksum value when applied to identical file contents, and to distribute its results uniformly over the 2^K possible values of the checksum.

3.2.  For each remaining B[s], …

3.2.1.     Denote the jth file in B[s] as B[s][j].

3.2.2.     For each file B[s][j] in B[s], calculate t = T(B[s][j]), and place B[s][j] in set C[t].

3.3.  For each set C[t] that contains fewer than 2 files, discard the set from further consideration; the files in C[t] do not duplicate any other files in A, because they have unique checksums.

3.4.  At this point, the probability that the contents of any 2 files in any given C[t] differ is 2^-K.
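The disclosure does not prescribe a particular checksum function. By way of illustration only, the following Python sketch uses the MD5 digest, one readily available function with K = 128, for which 2^-K is approximately 2.9 x 10^-39; the names checksum and group_by_checksum are assumptions of this sketch:

    import hashlib
    from collections import defaultdict

    def checksum(path):
        # Step 3.1: T(f), a K-bit digest over the entire contents of f.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()

    def group_by_checksum(size_groups):
        # Step 3.2: place each remaining file in the set C[t], where t = T(f).
        C = defaultdict(list)
        for files in size_groups.values():
            for path in files:
                C[checksum(path)].append(path)
        # Step 3.3: discard every C[t] containing fewer than 2 files.
        return {t: files for t, files in C.items() if len(files) >= 2}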

4.      Verify by file content comparison.

4.1.  If the certainty given by the above probability assertion is sufficient for a given application, then …

4.1.1.     Each set C[t] is deemed an equivalency set. That is, the files contained in C[t] are deemed to be duplicates of one another.
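By way of illustration only, the content verification of step 4 may be sketched in Python as follows; the name confirm_duplicates and the reuse of the standard filecmp module are assumptions of this sketch:

    import filecmp

    def confirm_duplicates(checksum_groups):
        # Step 4: partition each C[t] into sets of files whose actual
        # contents are identical, comparing byte for byte.
        confirmed = []
        for files in checksum_groups.values():
            remaining = list(files)
            while len(remaining) >= 2:
                head, rest = remaining[0], remaining[1:]
                same = [p for p in rest if filecmp.cmp(head, p, shallow=False)]
                if same:
                    confirmed.append([head] + same)
                # Files that matched head's checksum but not its contents
                # remain candidates for equivalence with one another.
                remaining = [p for p in rest if p not in same]
        return confirmed

The three sketches compose in the order of the algorithm:

    duplicate_sets = confirm_duplicates(group_by_checksum(group_by_size(paths)))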