
High-speed Rolling Hash chunking for Content-Aware Data Deduplication

IP.com Disclosure Number: IPCOM000225908D
Original Publication Date: 2013-Mar-12
Included in the Prior Art Database: 2013-Mar-12
Document File: 2 page(s) / 36K

Publishing Venue

Linux Defenders

Related People

Moinak Ghosh: AUTHOR

Abstract

This document describes an approach to perform high-speed rolling-hash-based chunking using a sliding window that scans only a small proportion of the given dataset. Rolling hash computation is a critical but expensive part of content-aware data deduplication. The approach described here avoids the cost of computing rolling hashes over all of the data, thereby significantly improving the throughput of content-aware deduplication.




Described here is a sliding window chunking optimization mechanism that speeds up deduplication of data. Efficient sliding window chunking is a critical component of variable block-length, content-aware data deduplication, which typically results in better deduplication effectiveness. Most research to date has focused on tuning the rolling-hash-based fingerprint computation to produce a good distribution of chunk sizes around the desired average. In addition, different computation techniques have been devised to perform the rolling hash polynomial computations efficiently without requiring big-number arithmetic.
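
    As an illustration of the kind of sliding-window computation referred to above, the following is a minimal sketch of a Rabin-Karp style polynomial rolling hash that works entirely in word-sized (64-bit, modulo 2^64) arithmetic and therefore needs no big-number support. The window width and base constant are illustrative assumptions, not values taken from this disclosure.

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW_SIZE 16          /* illustrative sliding-window width */
    #define PRIME_BASE  153191ULL   /* illustrative polynomial base      */

    typedef struct {
        uint64_t hash;                  /* current rolling hash value               */
        uint64_t pow_n;                 /* PRIME_BASE^(WINDOW_SIZE-1), for removal  */
        unsigned char win[WINDOW_SIZE]; /* bytes currently inside the window        */
        size_t pos;                     /* next window slot to overwrite            */
    } rollhash_t;

    static void rollhash_init(rollhash_t *rh)
    {
        rh->hash = 0;
        rh->pow_n = 1;
        for (int i = 0; i < WINDOW_SIZE - 1; i++)
            rh->pow_n *= PRIME_BASE;    /* arithmetic wraps mod 2^64 */
        rh->pos = 0;
        for (int i = 0; i < WINDOW_SIZE; i++)
            rh->win[i] = 0;
    }

    /* Slide the window forward by one byte and return the updated hash. */
    static uint64_t rollhash_update(rollhash_t *rh, unsigned char in)
    {
        unsigned char out = rh->win[rh->pos];
        rh->win[rh->pos] = in;
        rh->pos = (rh->pos + 1) % WINDOW_SIZE;
        /* Remove the outgoing byte's contribution, then append the new byte. */
        rh->hash = (rh->hash - out * rh->pow_n) * PRIME_BASE + in;
        return rh->hash;
    }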

    The approach described here uses existing sliding-window rolling hash computation techniques along with predefined limits on chunk sizes. Preset minimum and maximum chunk sizes are defined: the minimum is kept somewhat less than the desired average, and the maximum can be significantly more. The sliding-window rolling hash computation is started from a point slightly before the minimum chunk size and is computed until the minimum chunk size is reached. After the minimum chunk size has been reached, every hash computation is accompanied by a fingerprint computation and a comparison against a predefined breakpoint to determine whether a chunk boundary should be marked. If the maximum chunk size byte position is reached and no breakpoint has been found, a chunk boundary is marked anyway.
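
    The chunking procedure just described can be sketched roughly as follows, reusing the rollhash_t helpers from the previous sketch. The minimum and maximum chunk sizes and the breakpoint mask are illustrative assumptions; a real implementation would tune them to the desired average chunk size.

    #define MIN_CHUNK   (4 * 1024)      /* illustrative minimum chunk size          */
    #define MAX_CHUNK   (64 * 1024)     /* illustrative maximum chunk size          */
    #define BREAK_MASK  0x1FFF          /* low 13 bits of the hash form the fingerprint */
    #define BREAK_VALUE 0x1FFF          /* fingerprint value that marks a boundary  */

    /*
     * Return the length of the next chunk starting at buf[0].
     * The rolling hash is only started WINDOW_SIZE bytes before the
     * minimum chunk size, so bytes below that point are never hashed.
     */
    static size_t next_chunk_len(const unsigned char *buf, size_t len)
    {
        if (len <= MIN_CHUNK)
            return len;                         /* trailing data: emit as one chunk */

        rollhash_t rh;
        rollhash_init(&rh);

        /* Warm up the window just below the minimum chunk size. */
        size_t i;
        for (i = MIN_CHUNK - WINDOW_SIZE; i < MIN_CHUNK; i++)
            rollhash_update(&rh, buf[i]);

        /* Past the minimum size, check the fingerprint at every byte. */
        size_t limit = len < MAX_CHUNK ? len : MAX_CHUNK;
        for (; i < limit; i++) {
            uint64_t h = rollhash_update(&rh, buf[i]);
            if ((h & BREAK_MASK) == BREAK_VALUE)
                return i + 1;                   /* content-defined boundary found */
        }
        return limit;                           /* forced boundary at MAX_CHUNK or end */
    }

    With this structure, the bytes of every chunk before the warm-up point (MIN_CHUNK minus the window width) are never hashed at all, so the fraction of data that passes through the rolling hash is roughly (average - minimum + window) / average. The closer the minimum is kept to the desired average, the larger the saving, which is the throughput advantage described next.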

    Using minimum and maximum chunk sizes to limit chunk-size variation around the average is a commonly known technique. The new idea in this approach, however, is to start scanning at a point near the minimum chunk size and thereby avoid scanning the whole of the given data. Thus a large performance and deduplication throughput advantage is gained. For this to be effective, a standard rolling hash based on n-gram Rabin hashing techniques can be employ...