Finding Matching Data Occurrences in Large Streams

IP.com Disclosure Number: IPCOM000032411D
Original Publication Date: 2004-Nov-03
Included in the Prior Art Database: 2004-Nov-03
Document File: 2 page(s) / 25K

Publishing Venue

IBM

Abstract

A matching algorithm is described which allows matching over areas far larger than memory. This can be used as a replacement or an enhancement of standard sliding-window compression schemes.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

One standard compression technique searches the last N bytes of data for a match to the current data and, if one is found, simply refers back to the previous occurrence. This search can be made fast with hashes: the rzip algorithm (based on the popular rsync program) computes a simple rolling hash over a window of the last 31 bytes and records that hash value as it progresses, keeping hashes for each position in a sliding window (100k to 900k). Looking for identical patterns then becomes very fast: find any positions in the window where the rolling hash value was the same as the current value and, if there are any, check whether the data actually matches.
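As an illustration only (this is not the real rzip code), the scheme above can be sketched in Python using the simplest possible rolling hash, a sum of the bytes in the window; the function name `find_matches`, the window length, and the history size are assumptions made for the sketch:

```python
from collections import defaultdict

def find_matches(data, window=31, history=100_000):
    """Sketch of hash-assisted matching: keep a rolling hash over the
    last `window` bytes, remember hash -> positions within a sliding
    history region, and verify candidate matches byte-for-byte."""
    seen = defaultdict(list)      # rolling-hash value -> earlier window starts
    matches = []
    h = 0
    for i, b in enumerate(data):
        h += b                    # byte entering the window
        if i >= window:
            h -= data[i - window] # byte leaving the window
        if i + 1 < window:
            continue              # window not yet full
        start = i + 1 - window    # start of the current window
        for prev in seen[h]:
            if prev < start - history:
                continue          # candidate fell out of the sliding window
            # hash collision is possible, so verify the match for real
            if data[prev:prev + window] == data[start:start + window]:
                matches.append((prev, start))
                break
        seen[h].append(start)
    return matches
```

Every pair returned has been verified against the stored data, which is exactly why this variant must keep the whole window in memory.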

In addition, two optimizations are used. Rather than checking at every single character, rzip by default only checks the hash when its bottom 4 bits are zero (i.e., at roughly one position in 16), which speeds compression significantly. It also only needs to record the hash at those positions, reducing the number of hashes recorded by the same factor of about 16.
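A toy illustration of the sampling effect (the helper name and the numbers are invented for this example): filtering 32-bit hash values on their low 4 bits keeps roughly one position in 16.

```python
import random

def sampled_positions(hashes, mask=0xF):
    """Keep only positions whose hash value has the masked low bits zero."""
    return [i for i, h in enumerate(hashes) if h & mask == 0]

random.seed(0)
hashes = [random.getrandbits(32) for _ in range(16_000)]
kept = sampled_positions(hashes)
# len(kept) is close to 16_000 / 16 = 1000
```

Because both the lookups and the recorded entries are filtered the same way, matching positions are still found: if two identical windows hash to the same value, that value either passes the filter at both positions or at neither.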

The second optimization is that when a match is found, rzip opportunistically tries to extend it forwards and backwards: far more than 31 characters might match, so it is worth trying.
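Opportunistic extension can be sketched as follows (the function name and parameters are assumptions for the sketch; `a` and `b` are the starts of a verified match of `length` bytes, with `a < b`):

```python
def extend_match(data, a, b, length):
    """Grow a verified match at positions a and b (a < b), currently
    `length` bytes long, as far backwards and forwards as the data agrees."""
    # extend backwards while the preceding bytes agree
    while a > 0 and b > 0 and data[a - 1] == data[b - 1]:
        a -= 1
        b -= 1
        length += 1
    # extend forwards while the following bytes agree
    while max(a, b) + length < len(data) and data[a + length] == data[b + length]:
        length += 1
    return a, b, length
```

Starting from a short verified seed, the final match is often much longer than the 31-byte hash window.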

One disadvantage of this approach is that verifying a candidate match requires the compressor to keep the previous window in memory. If we were to use a strong hash (such as SHA-1, or MD4 in our prototype), we could be confident that a matching hash implied a genuine match without going back to look, and we could keep a much longer window in the same amount of memory. This has two drawbacks. First, we cannot check whether the match was longer than the length we found, because we cannot go back and expand it opportunistically. Second, strong hashes do not "roll": to move the rolling hash of the last 31 bytes forward, rzip adds in the new byte and subtracts the one which falls out of the window, and no such cheap update exists for a cryptographic hash.
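The "rolling" property can be demonstrated with the simplest possible rolling hash, a plain sum of the bytes in the window (real rolling hashes are richer, but the update rule has the same shape): moving the window forward costs one addition and one subtraction, instead of rehashing all 31 bytes as a strong hash would require.

```python
WINDOW = 31

def window_hash(buf):
    """Hash of a window: here simply the sum of its bytes (sketch only)."""
    return sum(buf)

data = bytes(range(100))
h = window_hash(data[:WINDOW])
for i in range(WINDOW, len(data)):
    # roll: add the byte entering the window, subtract the byte leaving it
    h = h + data[i] - data[i - WINDOW]
    assert h == window_hash(data[i - WINDOW + 1 : i + 1])
```

A cryptographic hash such as MD4 or SHA-1 offers no analogous update: removing the oldest byte's contribution is not possible, so the whole window must be rehashed at every position.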

The lack of opportunistic expansion cannot be solved, but the lack of "rolling" can be mitigated, and this is the subject of this research. We use a weak rolling hash, identical to the rolling hash in...