
A system and method for global deduplication with multiple storage appliance

IP.com Disclosure Number: IPCOM000236964D
Publication Date: 2014-May-23

Publishing Venue

The IP.com Prior Art Database

Abstract

A system and method for global deduplication across multiple storage appliances calculates, for each backed-up client, a locality-sensitive hash of each data block (i.e., 8 consecutive segments) and uses that hash value as the base hash function for consistent hashing, which determines the storage appliance for each block. Similar blocks therefore tend to be stored on the same storage appliance, so deduplication within each appliance can eliminate the duplicate data.




With the rapid growth of data to be managed, an effective and efficient deduplication-enabled storage system is needed to fully exploit the dedup capability of storage appliances. Data from different production servers (backup clients) may duplicate each other to some degree. To minimize the amount of data stored on the backup server, duplication within a single client's data should be removed, and duplication across clients should be eliminated as well.

To resolve the problem above, the most naïve approach is to store all the data on one single storage appliance that supports either in-line or post-process deduplication. Such a method, however, is not scalable, and the workload may overwhelm that single appliance. More advanced global dedup techniques split the data across several storage appliances and conduct dedup on each of them. The solution proposed here is one embodiment of such global deduplication.

There are other related solutions to this problem. A similar approach [1] employs two controllers and distributes the workload between them. For each block, it calculates a feature value such as a MinHash and applies a MOD operation to determine the "bin" in which the block is stored. That method is not efficient enough, since the feature value cannot eliminate duplication between blocks with similar fingerprints (i.e., small Hamming distance). In addition, the MOD value must change whenever storage appliances are added or removed, which makes the scheme hard to extend; an example of this cost is sketched below. Another approach shards the dedup index across several servers. While this can achieve a dedup rate as good as that of a single server, read speed drops dramatically because consecutive data is very likely to be spread across different servers. [2] presents the idea of calculating a similarity hash at the chunk level (i.e., over consecutive segments) and then routing each chunk to the storage appliance where its segments are most likely to be duplicated. However, this solution requires maintaining a logical center for each storage appliance and keeping it updated, which is inefficient and impractical for real-world backup servers with heavy I/O.
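To make the extensibility point concrete, the following sketch (in Python) compares how many blocks are remapped when a fifth appliance is added under MOD-based routing versus consistent hashing. The appliance names, virtual-node count, and SHA-1-based placeholder hash are assumptions made for illustration and are not taken from [1].

import hashlib
import bisect

def block_hash(data: bytes) -> int:
    # Stable 64-bit hash of a byte string (placeholder feature value).
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def route_mod(block: bytes, num_appliances: int) -> int:
    # MOD routing: the divisor changes when an appliance is added,
    # so most blocks end up remapped.
    return block_hash(block) % num_appliances

class ConsistentHashRing:
    # Consistent hashing: only roughly 1/N of the blocks move when
    # an appliance joins or leaves the ring.
    def __init__(self, appliances, vnodes=64):
        self._ring = sorted(
            (block_hash(f"{app}#{v}".encode()), app)
            for app in appliances
            for v in range(vnodes)
        )

    def route(self, block: bytes) -> str:
        point = block_hash(block)
        idx = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[idx][1]

blocks = [f"block-{i}".encode() for i in range(10000)]

# Growing a four-appliance cluster to five appliances:
moved_mod = sum(route_mod(b, 4) != route_mod(b, 5) for b in blocks)
ring4 = ConsistentHashRing([f"appliance-{i}" for i in range(4)])
ring5 = ConsistentHashRing([f"appliance-{i}" for i in range(5)])
moved_ring = sum(ring4.route(b) != ring5.route(b) for b in blocks)

print(f"MOD routing remapped {moved_mod} of {len(blocks)} blocks")          # roughly 80%
print(f"Consistent hashing remapped {moved_ring} of {len(blocks)} blocks")  # roughly 20%

Under MOD routing, nearly all previously stored blocks would have to be migrated (or re-deduplicated) after the cluster grows, whereas consistent hashing confines the movement to the share of blocks taken over by the new appliance.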


[1] Tradeoffs in Scalable Data Routing for Deduplication Clusters (research paper) http://www.usenix.org/event/fast11/tech/full_papers/Dong.pdf

[2] Use of similarity hash to route data for improved deduplication in a storage server cluster (patent US8321648) http://www.google.com/patents/US8321648

We provide a system and method for global deduplication across multiple storage appliances. On each backed-up client, the system calculates a locality-sensitive hash of each data block (i.e., 8 consecutive segments) and uses that hash as the base hash function for consistent hashing, which determines the storage appliance for each block. In this way, similar blocks tend to be stored on the same storage appliance, so deduplication can be conducted within each appliance to eliminate the duplicate data....
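A minimal sketch of this client-side routing step is given below, assuming a SimHash-style locality-sensitive hash computed over the fingerprints of 8 consecutive fixed-size segments. The 4 KiB segment size, the choice of SimHash as the locality-sensitive hash, and the helper and appliance names are illustrative assumptions, not required parameters of the disclosed method.

import hashlib
import bisect

SEGMENT_SIZE = 4 * 1024          # assumed fixed segment size
SEGMENTS_PER_BLOCK = 8           # a block = 8 consecutive segments

def _h64(data: bytes) -> int:
    # 64-bit fingerprint of a byte string.
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def simhash(segments) -> int:
    # SimHash over the segment fingerprints: blocks that share most of
    # their segments produce identical or near-identical 64-bit values.
    counts = [0] * 64
    for seg in segments:
        fp = _h64(seg)
        for bit in range(64):
            counts[bit] += 1 if (fp >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

class ApplianceRing:
    # Consistent-hash ring positioned by the block's locality-sensitive hash.
    def __init__(self, appliances, vnodes=64):
        self._ring = sorted(
            (_h64(f"{app}#{v}".encode()), app)
            for app in appliances
            for v in range(vnodes)
        )

    def route(self, lsh_value: int) -> str:
        idx = bisect.bisect(self._ring, (lsh_value,)) % len(self._ring)
        return self._ring[idx][1]

def route_stream(data: bytes, ring: ApplianceRing):
    # Split a client's backup stream into blocks of 8 segments and choose
    # a target appliance per block from the block's SimHash value.
    block_bytes = SEGMENT_SIZE * SEGMENTS_PER_BLOCK
    for off in range(0, len(data), block_bytes):
        block = data[off:off + block_bytes]
        segments = [block[i:i + SEGMENT_SIZE]
                    for i in range(0, len(block), SEGMENT_SIZE)]
        yield off, ring.route(simhash(segments))

ring = ApplianceRing([f"appliance-{i}" for i in range(4)])
sample = bytes(range(256)) * 2048            # ~512 KiB of repetitive sample data
for offset, target in route_stream(sample, ring):
    print(offset, target)                    # similar blocks share a target

Because every client computes the same deterministic mapping, identical or near-identical blocks arriving from different clients land on the same appliance, where that appliance's local dedup engine can remove the duplicates.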