Deduplicating data that has Data Integrity or Check data embedded in the data

IP.com Disclosure Number: IPCOM000241550D
Publication Date: 2015-May-09
Document File: 2 page(s) / 35K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for deduplicating data that has data integrity or check data embedded in the data.

This is the abbreviated version, containing approximately 51% of the total text.

Deduplication requires that pages of data be exact duplicates of one another. These pages can have different block sizes (e.g., 4K, 8K, 16K, etc.) and are compared either by hash-based signatures such as SHA-1 or SHA-256 or by bit-by-bit comparison. For hash-based comparison algorithms, a dictionary is kept that stores, for each page, its hash, its storage location on the persistent media, and the host addresses that reference it.
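The dictionary described above can be sketched as follows. This is only an illustration under stated assumptions: the names `PageEntry`, `DedupDictionary`, and `write` are hypothetical, SHA-256 is chosen arbitrarily from the hashes mentioned, and the storage location is a simple counter standing in for a real media address.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PageEntry:
    # Where the single stored copy of the page lives (assumed: a simple slot number).
    storage_location: int
    # Every host address that currently refers to this page.
    host_addresses: set = field(default_factory=set)


class DedupDictionary:
    """Hypothetical hash-keyed dictionary: digest -> PageEntry."""

    def __init__(self):
        self._entries = {}
        self._next_location = 0

    def write(self, host_address: int, page: bytes) -> bool:
        """Record a page write. Returns True if the page was deduplicated."""
        digest = hashlib.sha256(page).digest()
        entry = self._entries.get(digest)
        if entry is not None:
            # Duplicate: keep one stored copy, add another reference to it.
            entry.host_addresses.add(host_address)
            return True
        # New page: allocate a location and record the first reference.
        self._entries[digest] = PageEntry(self._next_location, {host_address})
        self._next_location += 1
        return False

    def stored_pages(self) -> int:
        return len(self._entries)
```

Writing the same 4K page to two host addresses would leave one stored page with two references, while a page with different contents gets its own entry.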

This works exceedingly well if data flows into the system and nothing is inserted within a page. However, T10 DIF checking, or any other form of check metadata that is added on a 512-byte or 4K-byte basis, may cause the comparison to fail and prevent deduplication from working, because identical data written to different addresses no longer produces identical pages. The Data Integrity Fields (DIF) that do this are the Reference Tag and Application Tag, although any field that either contains an address or combines an address with some other deterministic quantity can also do this. For example, the low-order 2 or 4 bytes of the Logical Block Address (LBA), or an LBA-seeded Longitudinal Redundancy Check (LRC) or Cyclic Redundancy Check (CRC) value, also make deduplication very unlikely.
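The effect can be shown with a small sketch. It assumes an 8-byte field appended to every 512-byte sector, laid out as a 2-byte guard CRC, a 2-byte Application Tag, and a 4-byte Reference Tag holding the low 32 bits of the sector's LBA. The function names `crc16` and `add_dif` are hypothetical, and the guard uses the CRC-16 polynomial 0x8BB7 commonly associated with T10 DIF.

```python
import hashlib
import struct

SECTOR = 512  # assumed protection interval in bytes


def crc16(data: bytes, poly: int = 0x8BB7) -> int:
    """Bitwise CRC-16 (initial value 0, no reflection), used here as the guard."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def add_dif(page: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append an 8-byte check field to each sector of the page."""
    out = bytearray()
    for offset in range(0, len(page), SECTOR):
        sector = page[offset:offset + SECTOR]
        ref_tag = (lba + offset // SECTOR) & 0xFFFFFFFF  # address-dependent
        out += sector + struct.pack(">HHI", crc16(sector), app_tag, ref_tag)
    return bytes(out)


page = bytes(range(256)) * 16            # one 4K page of host data
at_lba_0 = add_dif(page, lba=0)
at_lba_1000 = add_dif(page, lba=1000)

# Same host data, different addresses: the embedded Reference Tags differ,
# so the page hashes differ and a hash-based dictionary finds no duplicate.
print(hashlib.sha256(at_lba_0).digest() == hashlib.sha256(at_lba_1000).digest())
```

The guard value depends only on the sector contents, so it is identical in both copies; only the address-derived bytes keep the two pages from matching.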

The novel contribution is a method for deduplicating data that has data integrity or check data embedded in the data. As data enters the system, the system fills a buffer with only the data portion of the page; optionally, deterministic fields such as the CRC or LRC can be included as well. The method calculates the hash value of this buffer and searches the dictionary for that value to see whether the page is already stored elsewhere in the system. If the hashes match, then the system stores the new host address...
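A minimal sketch of the steps visible in this excerpt follows. It assumes the same 512-byte sector plus 8-byte check field layout (2-byte guard, 2-byte Application Tag, 4-byte Reference Tag). The names `data_portion`, `ingest`, and `protect` are hypothetical, the guard in `protect` is a stand-in built from `zlib.crc32` rather than a real T10 guard, and the handling of a dictionary miss is an assumption because the excerpt is cut off before describing it. How the stripped tags are kept for later reads is not covered here.

```python
import hashlib
import struct
import zlib

SECTOR = 512  # assumed data bytes per protection interval
DIF = 8       # assumed check bytes per interval: guard(2) + app tag(2) + ref tag(4)


def data_portion(protected: bytes, keep_guard: bool = False) -> bytes:
    """Fill a buffer with only the data bytes; optionally keep the deterministic guard."""
    buf = bytearray()
    stride = SECTOR + DIF
    for offset in range(0, len(protected), stride):
        block = protected[offset:offset + stride]
        buf += block[:SECTOR]
        if keep_guard:
            buf += block[SECTOR:SECTOR + 2]  # guard depends on the data only
    return bytes(buf)


def ingest(dictionary: dict, host_address: int, protected: bytes) -> bool:
    """Hash the data portion and look it up. Returns True on a dictionary hit."""
    digest = hashlib.sha256(data_portion(protected)).digest()
    refs = dictionary.get(digest)
    if refs is not None:
        refs.add(host_address)  # hit: store the new host address
        return True
    dictionary[digest] = {host_address}  # assumption: a miss creates a new entry
    return False


def protect(page: bytes, lba: int) -> bytes:
    """Hypothetical helper that builds protected input for the demonstration."""
    out = bytearray()
    for n, offset in enumerate(range(0, len(page), SECTOR)):
        sector = page[offset:offset + SECTOR]
        guard = zlib.crc32(sector) & 0xFFFF  # stand-in guard, not the T10 CRC
        out += sector + struct.pack(">HHI", guard, 0, (lba + n) & 0xFFFFFFFF)
    return bytes(out)


dictionary = {}
page = bytes(range(256)) * 16
first = ingest(dictionary, 0, protect(page, lba=0))
second = ingest(dictionary, 1000, protect(page, lba=1000))
print(first, second)
```

Because the address-dependent tags never reach the hash, the second write of the same host data at a different LBA is found in the dictionary.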