Method to Improve Computation Performance in a Deduplication Enabled Globally Distributed Embedded Compute Infrastructure built Object Storage namespace

IP.com Disclosure Number: IPCOM000241148D
Publication Date: 2015-Mar-31
Document File: 6 page(s) / 149K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is an algorithm that can be integrated with the embedded compute engine within an object storage unit. It helps choose the objects required by the computation algorithm, and the selection of objects is based on the original content instead of deduplicated object locations.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 47% of the total text.

Traditionally, object storage (Figure-1) architecture is meant for storing enormous amounts of unstructured data and requires additional computation clients to perform analytics over data fetched from the storage units. With the evolution of embedded compute infrastructure built into object storage (Figure-2, Figure-3), the computation is instead off-loaded to the storage units: the compute algorithm is deployed directly to a storage unit, and the storage unit responds with the analytic results or processed outputs, rather than relying on a traditional client for computation.
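The "compute shipped to the data" idea above can be sketched as follows. This is a toy illustration, not a real storlet API: the class name, the invocation helper, and the stream interface are all assumptions made for this example.

```python
import io


class WordCountStorlet:
    """Toy analytics task deployed onto the storage unit: it runs next to
    the object's bytes and returns only a small result to the client."""

    def __call__(self, object_stream):
        # Read the object locally on the storage node; only the analytic
        # result (not the object data) travels back over the network.
        text = object_stream.read().decode("utf-8")
        return {"word_count": len(text.split())}


def run_on_storage_node(storlet, stored_bytes):
    """Stand-in for the storage unit invoking a deployed storlet against
    an object it holds locally."""
    return storlet(io.BytesIO(stored_bytes))
```

For instance, `run_on_storage_node(WordCountStorlet(), b"hello dedup world")` returns the small dictionary `{"word_count": 3}` instead of shipping the object to a compute client.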

Figure-1: Traditional object storage architecture


Figure-2: Embedded compute infrastructure built object storage architecture

Figure-3: Architecture representing storlet workflow

On the other side, deduplication can be defined as a technique that eliminates the storage of duplicate copies across the entire storage unit tied to a clustered file system: it compares chunks (or whole files) of data among multiple files and maps each duplicated location to a pointer to the original data.

With both an embedded compute engine and deduplication in object storage, consider a scenario in which an end user has deployed a computation algorithm (for example, PDF-to-TXT conversion or DICOM-image-to-DOC conversion) that requires reading/parsing an entire object. Now suppose an object (say, Object-X) required by the computation algorithm has been deduplicated against another object (say, Object-A) residing on the same node but on a different disk, or on a different track of the same disk. In this scenario, the read operation on Object-X's content consumes considerable time and hardware resources (CPU, RAM, cache, etc.), because Object-X is just a memory pointer to Object-A, whose data resides on another disk or on an inner (slower) disk track.
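The chunk-compare-and-point mechanism described above can be sketched in a few lines. This is a minimal fixed-size-chunk model for illustration; real deduplication engines use content-defined chunking and persistent indexes.

```python
import hashlib


def dedup_store(objects, chunk_size=4):
    """Fixed-size-chunk deduplication sketch: each unique chunk is stored
    once, keyed by its SHA-256 digest; duplicates become pointer entries."""
    store = {}    # digest -> chunk bytes (the "original" data)
    recipes = {}  # object name -> list of digests (pointers to chunks)
    for name, data in objects.items():
        digests = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


def read_object(name, store, recipes):
    """Reading a deduplicated object dereferences every pointer; on real
    hardware each dereference may mean a hop to another disk or track."""
    return b"".join(store[d] for d in recipes[name])
```

With two identical objects, the store keeps a single copy of each chunk, and reading the duplicate reconstructs it entirely through pointers — exactly the indirection that penalizes an embedded compute read.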

The current placement algorithms (consistent hashing, modified consistent hashing, and Controlled Replication Under Scalable Hashing) are not aware of deduplication, so they may assign or point to a deduplicated object that is itself a pointer to data residing on another disk or on inner (slower) tracks of the same disk. This read operation imposes a penalty on storlet performance, as it requires one more hop, or a complete fetch, of the original data situated at a different location.
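To see why such placement schemes cannot avoid this, a minimal consistent-hashing ring is sketched below. Placement depends only on the object's name, so the ring has no way to tell whether the replica it selects holds real data or merely a deduplication pointer. The class and parameter names are illustrative.

```python
import hashlib
from bisect import bisect_right


class HashRing:
    """Minimal consistent-hashing placement sketch: object names map to
    nodes via hash position on a ring of virtual nodes. Nothing here
    consults deduplication state, so a chosen location may be a pointer."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{node}-{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def place(self, obj_name):
        # First virtual node clockwise from the object's hash position.
        i = bisect_right(self.keys, self._h(obj_name)) % len(self.ring)
        return self.ring[i][1]
```

The placement is deterministic per name: `place("Object-X")` always returns the same node, regardless of whether Object-X's content there is original data or a dedup pointer to somewhere slower.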

Figure-4: Problem of choosing deduplicated data location


This literature proposes an algorithm, or method, that helps the storlet engine select the objects required by the computation algorithm, basing the selection on the original content instead of deduplicated object locations that contain dummy pointers to the original content. When a situation arises lik...
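The selection idea stated above — resolve a deduplicated object to its original content's location before running the computation — can be sketched as follows. The catalog layout and field names (`location`, `dedup_ref`) are assumptions made for illustration, not the disclosure's actual metadata format.

```python
def select_compute_location(obj_name, catalog):
    """Dedup-aware selection sketch: if the requested object is only a
    pointer, follow the pointer chain and schedule the computation at the
    original content's location instead of dereferencing at read time."""
    entry = catalog[obj_name]
    while entry.get("dedup_ref"):  # follow pointers back to the original
        entry = catalog[entry["dedup_ref"]]
    return entry["location"]
```

Under this sketch, a storlet asked to process Object-X (a pointer to Object-A) would be dispatched to Object-A's location directly, avoiding the extra hop that a dedup-unaware placement would incur.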