Duplicate Elimination Sub-System for Content Management Systems based on MD5 Hash Checks
Original Publication Date: 2009-Jun-03
Included in the Prior Art Database: 2009-Jun-03
In a content management system, file storage is crucial. In practice, however, the same file is often stored multiple times needlessly. File duplication is therefore a real problem: it wastes resources, it burdens administrators who have to organize redundant resources, and it degrades performance for users who needlessly re-import documents that already exist in the system.
We illustrate our proposal with MD5 hashing, but other types of hashing can be used. The Duplicate Elimination Subsystem consists of several phases. The first is an import phase, in which the user imports a document. As the document is read in, we hash it based on several factors specific to the file, such as, but not limited to, file size, filename, file type, MIME type, timestamp, and last modified time, or on any attribute specific to the content management system's data model (namely, attribute values). The actual content can also be hashed. Given current technology, we think matching the content exactly would be difficult, especially with documents such as JPEGs, but we believe that in a few years, by the time this patent is finalized by the US patent office, such content classification technology will be more mature, as image analysis technology is currently emerging rapidly.
Nevertheless, the set of attributes to hash on is configurable by the administrator. Once this setting is changed, a subprocess is needed to rehash all documents inside the system, as we will detail below. Two documents that are the same need not have exactly the same attributes, however (last modified time, for example), so we let the administrator select which attributes to hash on, since the definition of "same" should depend on the particular content manager environment. For example, are a PDF and a Word document containing the same Form 1040 the same? We allow the administrator to consider any combination of these factors because documents that are the "same", or at least very similar, should have similar attributes (such as filename and file type).
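The attribute-based hashing described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name attribute_hash, the dictionary representation of a document, and the attribute names are hypothetical and not part of the proposal, and the length-prefixed field encoding is one possible choice among many.

```python
import hashlib

# Hypothetical default; in the proposal the administrator configures this set.
DEFAULT_ATTRIBUTES = ("filename", "filesize", "mimetype")


def attribute_hash(document, attributes=DEFAULT_ATTRIBUTES, include_content=False):
    """Return an MD5 hex digest over the selected attributes of a document.

    `document` is assumed to be a dict of attribute values (an illustrative
    stand-in for the content management system's data model).
    """
    md5 = hashlib.md5()
    # Sort the names so the digest does not depend on configuration order.
    for name in sorted(attributes):
        field = f"{name}={document.get(name, '')}".encode("utf-8")
        # Length-prefix each field so adjacent values cannot run together.
        md5.update(len(field).to_bytes(8, "big"))
        md5.update(field)
    if include_content:
        # Optionally hash the raw bytes of the document as well.
        md5.update(document.get("content", b""))
    return md5.hexdigest()
```

With last modified time left out of the configured set, two imports of the same file made on different days produce the same digest; adding it to the set makes them differ, which is why a change to the setting would require rehashing the stored documents.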
Once a document is imported and its hash value is calculated from the chosen set of attributes, the value is stored in a new table inside the library server. This table should keep track of the item ID as well as the hash value. We propose MD5 hashing because it produces a constant-length string of hex values (32 characters), which makes database utilization predictable based on the number of imported documents in the system. On subsequent adds, a hash value is calculated and a lookup is performed on this table. If there is a matching value, the document has very likely been stored before, and we can thus eliminate the file transfer and keep an internal mapping that points to the existing document in the system. Alternatively, we can prom...
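The hash table and the lookup on subsequent adds could be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database in place of the library server; the table name doc_hash, its column names, and the functions open_hash_table and add_document are hypothetical.

```python
import sqlite3


def open_hash_table():
    """Create the (hypothetical) item ID / hash value table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_hash (item_id TEXT PRIMARY KEY, hash CHAR(32) NOT NULL)"
    )
    # Index the hash column, since every add performs a lookup on it.
    conn.execute("CREATE INDEX idx_doc_hash ON doc_hash (hash)")
    return conn


def add_document(conn, item_id, hash_value):
    """Return (item_id_to_use, is_duplicate) for an incoming document."""
    row = conn.execute(
        "SELECT item_id FROM doc_hash WHERE hash = ? LIMIT 1", (hash_value,)
    ).fetchone()
    if row is not None:
        # Likely duplicate: skip the transfer and point at the existing item.
        return row[0], True
    conn.execute(
        "INSERT INTO doc_hash (item_id, hash) VALUES (?, ?)", (item_id, hash_value)
    )
    return item_id, False
```

A first add with a given hash stores a new row and reports no duplicate; a second add with the same hash returns the ID of the first item, so the caller can map the new import to the existing document instead of transferring the file again.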