Stochastic Identification of Duplicate Computer Files

IP.com Disclosure Number: IPCOM000033691D
Publication Date: 2004-Dec-23
Document File: 1 page(s) / 25K

Publishing Venue

The IP.com Prior Art Database

Abstract

This invention inserts a stochastic filtering procedure before any attempt to compare actual file contents, by calculating a K-bit checksum for each of the candidate files and discarding files having unique checksums from further consideration as potential duplicates. Comparison of actual file contents is then performed only between files having identical checksums, reducing the time required to confirm the identification of actual duplicates.


Stochastic Identification of Duplicate Computer Files

Computer files abound, and with their proliferation has come the problem of redundancy: a given collection often contains files that are exact duplicates of one another, and there is rarely any need to maintain more than one copy. The problem, then, is to identify the redundant files so that they may be dealt with efficiently.

The traditional means of identifying duplicate files in a collection has been to compare the contents of each pair of files for equivalence, with occasional optimizations in which, for example, the sizes of the files are compared first, because files cannot be duplicates if their lengths differ. Because the number of pairs grows quadratically with the size of the collection, such techniques yield their results too slowly to be of practical use for large collections. As a result, attempts to discover and remove redundancy among computer files are made infrequently, leading to their continued proliferation.
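The traditional pairwise approach, including the size pre-check described above, can be sketched as follows. This is an illustrative sketch, not the disclosure's own code; the function name and choice of Python's `filecmp` for the byte-for-byte comparison are assumptions.

```python
import filecmp
from itertools import combinations
from pathlib import Path

def find_duplicates_pairwise(paths):
    """Compare every pair of files; skip pairs whose sizes differ,
    since files of different lengths cannot be duplicates."""
    duplicates = []
    for a, b in combinations(paths, 2):
        if Path(a).stat().st_size != Path(b).stat().st_size:
            continue  # different lengths: cannot be duplicates
        if filecmp.cmp(a, b, shallow=False):  # byte-for-byte comparison
            duplicates.append((a, b))
    return duplicates
```

Note that every same-sized pair still requires a full content comparison, which is the cost the invention below avoids.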

This invention inserts a stochastic filtering procedure before any attempt to compare actual file contents, by calculating a K-bit checksum for each of the candidate files and discarding files having unique checksums from further consideration as potential duplicates. Because distinct files can occasionally share a checksum, comparison of actual file contents is then performed, but only between files having identical checksums, reducing the time required to confirm the identification of actual duplicates.
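The two-phase procedure above can be sketched as follows. The disclosure does not name a particular checksum function, so CRC-32 (masked to K bits) is used here purely as one possible choice; the function names are likewise illustrative.

```python
import filecmp
import zlib
from collections import defaultdict
from itertools import combinations

def find_duplicates_filtered(paths, k_bits=32):
    """Phase 1: bucket candidate files by a K-bit checksum, discarding
    files with unique checksums. Phase 2: confirm duplicates by comparing
    actual contents only within each bucket."""
    mask = (1 << k_bits) - 1
    buckets = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            checksum = zlib.crc32(f.read()) & mask  # K-bit checksum
        buckets[checksum].append(p)

    duplicates = []
    for group in buckets.values():
        if len(group) < 2:
            continue  # unique checksum: cannot be a duplicate
        # identical checksums are necessary but not sufficient,
        # so confirm with a byte-for-byte comparison
        for a, b in combinations(group, 2):
            if filecmp.cmp(a, b, shallow=False):
                duplicates.append((a, b))
    return duplicates
```

Only files whose checksums collide ever incur a content comparison, so for large collections with few duplicates the expensive phase operates on a small fraction of the candidate pairs.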