Method and System for Providing a Consolidated File Store for Distributed File Systems

IP.com Disclosure Number: IPCOM000236937D
Publication Date: 2014-May-22
Document File: 4 page(s) / 164K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system are disclosed for providing a consolidated file store for distributed file systems.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 35% of the total text.

Distributed file systems such as the Hadoop* Distributed File System (HDFS) were designed for very large data files and handle small files poorly. Generally, a file allocation table is maintained in the name node's memory for fast look-up, so the name node requires a large amount of memory to track metadata for large numbers of files, at roughly 150 bytes of metadata per file. In addition, even small files are allocated the default block size: space is calculated in blocks at allocation time, regardless of file size, and "one block * replication factor" must always be available to place a file in the cluster. Yet some software running on distributed file systems requires storage and retrieval of potentially small files. Zip files and similar archives are a poor workaround, as they are slow to access and require a new archive file to be generated with every update.
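A rough calculation illustrates the scale of the problem. Using the 150-bytes-per-file figure from the text, and assuming the common HDFS defaults of a 128 MiB block size and a replication factor of 3 (defaults not stated in the disclosure), the name-node memory and block reservation costs can be estimated as follows:

```java
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 100_000_000L;     // e.g., 100 million small files
        long bytesPerFile = 150L;      // metadata per file (figure from the text)
        long totalBytes = files * bytesPerFile;
        // ~14 GiB of name-node memory just to track metadata
        System.out.println("NameNode metadata: " + totalBytes + " bytes");

        // Space is reserved in blocks at allocation time, regardless of file size:
        long blockSize = 128L * 1024 * 1024; // assumed default HDFS block size
        int replication = 3;                 // assumed default replication factor
        System.out.println("Reserved per file at allocation: "
                + (blockSize * replication) / (1024 * 1024) + " MiB");
    }
}
```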

Disclosed is a method and system for providing a consolidated file store for distributed file systems. The consolidated file store is an ObjectStore, which provides a medium for quickly storing and retrieving multiple small to medium sized files in distributed file systems such as, but not limited to, HDFS. The ObjectStore also works with standard local file systems. By providing concurrent access support, as well as fast storage and retrieval, the ObjectStore can be used by applications that need to store and retrieve large numbers of small files on distributed file systems.

Additionally, the ObjectStore avoids Hadoop NameNode overload and out-of-memory errors by significantly reducing the number of files stored in HDFS.
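The disclosure does not specify the ObjectStore's programming interface. As a hedged sketch only, a minimal in-memory version of the access pattern it describes (store and retrieve content by a unique string identifier, iterate over item ids) might look like the following; all class and method names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical in-memory sketch of the ObjectStore access pattern; the
// disclosed system would back this with HDFS or a local file system.
public class ObjectStoreSketch {
    private final Map<String, byte[]> items = new HashMap<>();

    // Store or update an item's content under its unique string identifier.
    public void put(String id, byte[] content) { items.put(id, content); }

    // Retrieve an item's content, or null if the id is unknown.
    public byte[] get(String id) { return items.get(id); }

    // Iterate over all item identifiers in the store.
    public Iterator<String> itemIds() { return items.keySet().iterator(); }

    public static void main(String[] args) {
        ObjectStoreSketch store = new ObjectStoreSketch();
        store.put("doc-1", "hello".getBytes());
        System.out.println(new String(store.get("doc-1"))); // prints "hello"
    }
}
```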

Each file stored within an ObjectStore is referenced by a unique string identifier. Each file also maintains a modification time stamp and, optionally, a map of string attributes. Additionally, an optional parent identity (id) and a list of child identifiers allow for hierarchical relationships between files within the ObjectStore. The ObjectStore defines a single point of access for storing large numbers of small files on distributed or local file systems. Support for concurrent reads and updates from multiple threads or processes is provided, as is access from mapper and reducer nodes in Hadoop.
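The per-item metadata described above (unique id, modification time stamp, optional attribute map, optional parent id and child ids) could be represented as a plain data class. The field names below are illustrative, not taken from the disclosure:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a store item's metadata; names are hypothetical.
public class StoreItem {
    final String id;                     // unique string identifier
    long modificationTime;               // last-modified timestamp (millis)
    Map<String, String> attributes = new HashMap<>(); // optional string attributes
    String parentId;                     // optional parent id for hierarchy
    List<String> childIds = new ArrayList<>();        // child items in the store

    StoreItem(String id) {
        this.id = id;
        this.modificationTime = System.currentTimeMillis();
    }

    // Link a child item, recording the relationship on both sides.
    void addChild(StoreItem child) {
        child.parentId = this.id;
        this.childIds.add(child.id);
    }
}
```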

In addition, timestamps are maintained on each ObjectStore item to track access and modification times. Items can be accessed directly by id, and item-id iteration provides access to all items within the ObjectStore. New and updated store items are cached in memory until a user-configurable size threshold is reached, at which time the cache is flushed.
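The write-caching behavior described above (new and updated items held in memory until a user-configurable size threshold triggers a flush) could be sketched as follows. The flush destination and the exact threshold semantics are assumptions; the text states only that a size threshold triggers the flush:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the ObjectStore's write cache: pending items accumulate in memory
// and are flushed once their combined size crosses a configurable threshold.
// The flush destination (HDFS or a local file system) is elided here.
public class WriteCache {
    private final long thresholdBytes;
    private final Map<String, byte[]> pending = new LinkedHashMap<>();
    private long pendingBytes = 0;

    public WriteCache(long thresholdBytes) { this.thresholdBytes = thresholdBytes; }

    public void put(String id, byte[] content) {
        byte[] previous = pending.put(id, content);
        if (previous != null) pendingBytes -= previous.length; // updated item
        pendingBytes += content.length;
        if (pendingBytes >= thresholdBytes) flush();
    }

    // Placeholder flush: a real store would persist the pending items here.
    private void flush() {
        pending.clear();
        pendingBytes = 0;
    }

    public int pendingCount() { return pending.size(); }
}
```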