Method and Apparatus for Searching a Filesystem for Identical Files

IP.com Disclosure Number: IPCOM000022047D
Original Publication Date: 2004-Feb-20
Included in the Prior Art Database: 2004-Feb-20
Document File: 2 page(s) / 30K

Publishing Venue

IBM

Abstract

This article describes a method for finding identical files in a filesystem.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Searching for files on a computer system is a problem that nearly every computer user encounters daily. Numerous tools allow users to search by a file's name, date, size, type, or owner, or by some subset of its contents. Sometimes this search is very intensive, because the searching application accesses every object in the filesystem looking for matches. Other times the process is optimized by regularly caching the particulars of each file in a database and then searching only the database. Current search tools do not provide a method for searching a filesystem for exact matches of entire file contents, that is, duplicate files. Described herein is a mechanism that allows users to search a filesystem for exact matches of a particular file (binary or text).

Tools such as "md5sum" process strings of data and generate 128-bit checksums according to the MD5 hashing algorithm. Whenever a particular file or string (binary or text) is processed by md5sum, it returns the same 128-bit checksum. Conversely, it is highly unlikely that two even slightly different strings will hash to the same checksum. For these reasons, MD5 checksums are often used to verify that data or binary files have not been tampered with. Other hashing algorithms, such as SHA1, may also be used.
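
As an illustration of the checksum property described above, the following is a minimal sketch that computes a file's MD5 checksum with Python's standard hashlib module. The function name md5_of_file and the chunk size are assumptions made for this example; they do not come from the disclosure, which refers only to the md5sum tool.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the 128-bit MD5 checksum of a file as 32 hex characters.

    Hypothetical helper for illustration. The file is read in binary mode
    and in chunks, so text and binary files are handled identically and
    large files do not need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file twice on an unchanged file yields the same string, which is the property the search mechanism relies on.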

File searching tools, such as the GNU "locate" application, maintain databases of filenames and properties. When a user wishes to "locate" a file, the application simply searches the database, which is considerably more efficient than scouring an entire disk for matching files.
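
The catalog-then-search idea can be sketched as follows. This is a simplified illustration and not how GNU locate is implemented: the names build_catalog and search_catalog, the in-memory list, and the choice of recorded properties (size and modification time) are all assumptions made for the example.

```python
import os


def build_catalog(root):
    """Walk the tree under `root` once and record each file's particulars.

    Hypothetical stand-in for the periodic database update step.
    """
    catalog = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            catalog.append(
                {"path": path, "size": info.st_size, "mtime": info.st_mtime}
            )
    return catalog


def search_catalog(catalog, pattern):
    """Return cataloged paths whose filename contains `pattern`.

    Only the catalog is consulted; the disk is not touched.
    """
    return [
        entry["path"]
        for entry in catalog
        if pattern in os.path.basename(entry["path"])
    ]
```

The expensive traversal happens once in build_catalog, and each later query is a scan of the cached entries.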

These two existing tools can be used in tandem to allow users to search for exactly matching files. When the locate-like application periodically updates its database by cataloging all of the files, it would gather one additional piece of information: a 128-bit checksum for each file. When the user wishes to search for a file based on its content, the checksum of the input file would be...
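
A minimal sketch of this combined approach follows. It assumes that the lookup hashes the input file and returns every cataloged path that has the same checksum. The names file_checksum, build_checksum_index, and find_identical are hypothetical, the in-memory dictionary stands in for the locate-like database, and the optional byte-for-byte confirmation is an addition for this example, not something the text above states.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_checksum_index(root):
    """Catalog every file under `root`, keyed by its checksum.

    Hypothetical stand-in for the periodic database update that also
    gathers a checksum for each file.
    """
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                index[file_checksum(path)].append(path)
            except OSError:
                # Unreadable or vanished file; leave it out of the index.
                continue
    return index


def find_identical(index, query_path, confirm=False):
    """Return cataloged paths whose checksum matches that of `query_path`.

    With confirm=True (an assumption of this sketch), each candidate is
    also compared byte for byte to rule out a checksum collision.
    """
    candidates = list(index.get(file_checksum(query_path), []))
    if confirm:
        candidates = [
            path
            for path in candidates
            if filecmp.cmp(query_path, path, shallow=False)
        ]
    return candidates
```

If the query file itself lies under the indexed root, its own path appears in the result alongside any duplicates.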