
Method to select file system deduplication mode for better resource optimization

IP.com Disclosure Number: IPCOM000240416D
Publication Date: 2015-Jan-29
Document File: 5 page(s) / 63K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system that can be integrated with a file system (Write Once Read Many), understands the nature of the stored data, its access methods, etc., and also estimates the nature of the incoming workload. Based on these results, the system automatically configures the appropriate deduplication type, block size, etc.



Method to select file system deduplication mode for better resource optimization
Data deduplication can be defined as a data compression technique that creates a "signature" for a block or file. If two signatures are equal, their corresponding blocks are considered equal. If multiple files share identical content, that content is recorded only once; all subsequent "copies" simply reference the first copy of the content. Files can have the same content even when file names, dates, permissions, and other metadata differ. Deduplication primarily comprises two elements:

1. The first element of data deduplication is the use of message digest hashes as a substitute for byte-by-byte comparisons of data. If the hashes are equal, the data is considered to be equal.

2. The second element of data deduplication is the grain of deduplication and the strategy for breaking large data sets (e.g. streams, files) into smaller chunks.

Data deduplication can generally operate at the file, block, or byte level, which defines the minimal data fragment that the system checks for redundancy.

File level deduplication:

File-level deduplication is the easiest to perform. It requires less processing power, since a single hash per file is relatively easy to generate.
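
As an illustration of this idea (not part of the original disclosure), the following Python sketch uses a whole-file SHA-256 digest as the file's signature and records the content only once per unique digest; the store layout, index structure, and function names are assumptions made for illustration.

```python
import hashlib
import os
import shutil

def file_signature(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup_store_file(path, store_dir, index):
    """Store a file once per unique signature; later copies only reference it."""
    sig = file_signature(path)
    if sig not in index:
        target = os.path.join(store_dir, sig)
        shutil.copyfile(path, target)  # first occurrence: actually record the content
        index[sig] = target
    return index[sig]                  # duplicates simply reference the stored copy
```

File names, dates, and permissions would be kept as separate metadata, so files with identical content but different metadata still map to the same stored copy.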

Block level deduplication:

Block deduplication operates at the block level: every file is split into a sequence of fixed- or variable-length blocks. If minor changes are made to a large file, the system stores only the changed fragments. On average, file deduplication allows disk space savings as high as 5:1, while block deduplication can reach ratios around 20:1. Block deduplication requires more processing power than file deduplication, since the number of identifiers that need to be processed increases greatly; correspondingly, the index for tracking the individual blocks also becomes much larger.
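
A hedged sketch of the fixed-length variant described above: each file is split into fixed-size blocks, each block is hashed, and only previously unseen blocks are stored, so a small change to a large file consumes space only for the changed blocks. The block size and data structures here are illustrative assumptions, not part of the disclosure.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block length, for illustration only

def dedup_blocks(path, block_store):
    """Split a file into fixed-size blocks and store only unseen blocks.

    Returns the file's "recipe": the ordered list of block signatures
    needed to reconstruct it from the shared block store.
    """
    recipe = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), b""):
            sig = hashlib.sha256(block).hexdigest()
            if sig not in block_store:   # only new or changed fragments consume space
                block_store[sig] = block
            recipe.append(sig)
    return recipe
```

Note that the index (the keys of the block store) grows with the number of unique blocks, which is why block deduplication needs more processing power and a much larger tracking index than file deduplication.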

Use case:

Consider a file system that supports both file- and block-based deduplication, but that can be configured in only one mode, either file deduplication or block deduplication, for consistency. Here are a few scenarios that can arise while it is configured for unstructured workloads:

- Assume a scenario where a production system is built on a very old file system that lacks deduplication capability, and the only way to add it is to run an external vendor utility on top of the file system. In this situation, the system administrator is unsure which deduplication mode to use in order to optimize current resources while also sustaining future workload changes.

- Assume the file system is configured with "file deduplication", and most of the content in the incoming unstructured data is "immutable" (because the file system is meant for unstructured data, it is hard to predict the future types of workloads in the initial planning phase). In thi...