
Avoiding exhaustive search on the data stored on distributed file systems using map of areas and mapping functions

IP.com Disclosure Number: IPCOM000238682D
Publication Date: 2014-Sep-11

Publishing Venue

The IP.com Prior Art Database

Abstract

This article describes a mechanism for improving the speed of lookups on data stored on parallel file systems.


The goal of this work is to describe a mechanism for improving the speed of lookups on data stored on parallel file systems. Parallel file systems are a type of clustered file system that spreads data across multiple storage nodes, usually for redundancy or performance. A loaded file is split into a number of chunks, and each chunk is saved on a separate node. Distributed file systems are widely used in modern enterprises for processing huge amounts of data. A DFS is also at the core of Hadoop-oriented technologies.
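As a toy illustration of this chunking (not any real DFS API), the Java sketch below splits a file into fixed-size chunks and assigns each chunk to a node round-robin; the chunk size, node count, and class name are made up for the example.

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkPlacement {
        static final int CHUNK_SIZE = 64;   // bytes here; real systems use megabytes

        // Returns, for each chunk index, the node it would be stored on,
        // using simple round-robin placement across 'nodeCount' nodes.
        static List<Integer> place(byte[] file, int nodeCount) {
            List<Integer> nodeOfChunk = new ArrayList<>();
            int chunks = (file.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
            for (int i = 0; i < chunks; i++) {
                nodeOfChunk.add(i % nodeCount);   // chunk i -> node i mod N
            }
            return nodeOfChunk;
        }

        public static void main(String[] args) {
            byte[] file = new byte[200];          // a 200-byte "file"
            System.out.println(place(file, 3));   // [0, 1, 2, 0] : 4 chunks, 3 nodes
        }
    }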

One of the biggest challenges for "Big Data" applications is providing quick access to a particular piece of information. Lack of good data organization and lookup optimization may end in a full disk scan, which basically means that in order to find the piece of data requested by the user, the system needs to scan all the data it stores (for example, in a full-text search). That results in a significant performance penalty and a waste of resources.

There are methods that help mitigate this issue, including metadata search (over titles, abstracts, or selected sections of the original document) and indexing (especially when the data is well structured). But even then, for queries that retrieve medium to large amounts of data, the cost of using indexes (from a computational and storage perspective) becomes dominant.

The main idea of this work is to leverage the architecture of parallel file systems and build, for each chunk of data, a descriptor which helps to determine where specific data may be located. What we want to protect in this disclosure is a method of narrowing down the scope of a search for information when the data is stored on a distributed file system. While the data is being loaded, each block of information (called an area) is annotated with a description of what information it may contain. All area descriptors together form a map of areas. When a search for information is performed, the map is used to decide which blocks may contain the information the user is looking for. In this description, HDFS (Hadoop Distributed File System) is used as an example of a DFS. Implementations of distributed file systems vary, but the concept is always the same.
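To make the idea concrete, the following is a minimal sketch (in Java, since HDFS is the running example) of how an area descriptor and a map of areas could be realized. A Bloom filter is used here as one possible descriptor; the names AreaDescriptor, AreaMap, and areaId are illustrative assumptions, not part of any existing HDFS API.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Hypothetical per-area descriptor: a small Bloom filter over the
    // terms observed in one block (area) while the data is being loaded.
    class AreaDescriptor {
        private static final int BITS = 1 << 16;   // descriptor size per area
        private static final int HASHES = 3;       // hash functions per term
        private final BitSet bits = new BitSet(BITS);
        final String areaId;                        // e.g. an HDFS block id

        AreaDescriptor(String areaId) { this.areaId = areaId; }

        // Called for every term while the area is written during load.
        void add(String term) {
            for (int seed = 0; seed < HASHES; seed++) {
                bits.set(Math.floorMod((term + '#' + seed).hashCode(), BITS));
            }
        }

        // May report a false positive, but never a false negative, so no
        // area that actually holds the term is ever skipped.
        boolean mayContain(String term) {
            for (int seed = 0; seed < HASHES; seed++) {
                if (!bits.get(Math.floorMod((term + '#' + seed).hashCode(), BITS))) {
                    return false;
                }
            }
            return true;
        }
    }

    // The map of areas: all descriptors together. A query consults the
    // map first, and only the surviving areas are actually read from disk.
    class AreaMap {
        private final List<AreaDescriptor> areas = new ArrayList<>();

        void register(AreaDescriptor d) { areas.add(d); }

        List<String> candidateAreas(String term) {
            List<String> out = new ArrayList<>();
            for (AreaDescriptor d : areas) {
                if (d.mayContain(term)) {
                    out.add(d.areaId);   // every other area is pruned
                }
            }
            return out;
        }
    }

The asymmetry of the descriptor is the key design point: a false positive only costs an unnecessary scan of one area, while a false negative would silently lose results, so the descriptor must never produce one.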

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsib...
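Assuming the AreaDescriptor and AreaMap classes sketched above, the hypothetical flow below shows where the map fits into this architecture: descriptors are filled at load time as blocks are written to DataNodes, and at query time the map (which could naturally live alongside the NameNode's block metadata) prunes the set of blocks before any data is scanned. The block ids and terms are made up for illustration.

    // Hypothetical end-to-end flow against the sketch above.
    public class AreaMapDemo {
        public static void main(String[] args) {
            AreaMap map = new AreaMap();

            // Load phase: as each block is written to a DataNode, the
            // loader fills that block's descriptor and registers it.
            AreaDescriptor blk1 = new AreaDescriptor("blk_1");
            blk1.add("alpha");
            blk1.add("beta");
            map.register(blk1);

            AreaDescriptor blk2 = new AreaDescriptor("blk_2");
            blk2.add("gamma");
            map.register(blk2);

            // Query phase: the map says only blk_1 may contain "alpha",
            // so blk_2 is never read, avoiding the exhaustive scan.
            System.out.println(map.candidateAreas("alpha"));   // prints [blk_1]
        }
    }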