RELIABLE, SCALABLE, AND HIGH-PERFORMANCE DISTRIBUTED STORAGE: Distributed Metadata Management

IP.com Disclosure Number: IPCOM000234959D
Original Publication Date: 2014-Feb-19
Included in the Prior Art Database: 2014-Feb-19
Document File: 9 page(s) / 483K

Publishing Venue

Linux Defenders

Related People

Sage Weil: AUTHOR

Abstract

This disclosure describes a distributed metadata management architecture that provides excellent performance and scalability while seamlessly tolerating arbitrary node crashes. Ceph's metadata server (MDS) diverges from conventional metadata storage techniques, and in doing so facilitates adaptive file system and workload partitioning among servers, improved metadata availability, and failure recovery. Specifically, file system metadata updates are initially written to large, lazily trimmed per-MDS journals that absorb temporary and repetitive updates. File (inode) metadata is then embedded in the file system namespace and stored inside per-directory objects for efficient read access and metadata prefetching. Notably, the MDS defines the namespace hierarchy in terms of directory fragments, facilitating fine-grained load distribution even for large or busy directories, and implements a traffic control mechanism that disperses load generated by flash crowds (sudden concurrent access by thousands of client nodes) across multiple nodes in the MDS cluster.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 9% of the total text.


Keywords: metadata server, metadata, Ceph

Introduction:

Described are the design, implementation, and performance characteristics of a metadata server (MDS) such as one used by Ceph. The focus is on the implications of the unconventional approach to file (inode) storage and update journaling for metadata storage, dynamic workload distribution, and failure recovery. A variety of static file system snapshots and workload traces are analyzed to motivate design decisions and performance analysis, and performance is evaluated under a range of micro-benchmarks and workloads in both normal-use and failure scenarios. This approach addresses the performance requirements of a clustered metadata server that is capable of tolerating arbitrary node crashes. In contrast to previous work in this area, this architecture and implementation maintain metadata performance before, during, and after failure by simultaneously addressing the efficiency of metadata I/O, cluster adaptability, and failure recovery.
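The idea that a lazily trimmed journal can absorb temporary and repetitive updates can be illustrated with a small sketch. This is a toy model under stated assumptions, not Ceph's journal format or code: the class name Journal, its methods, and the dictionary standing in for per-directory objects are all hypothetical. Updates are appended cheaply to a log, and only the final state of each inode is written back when the log is trimmed, so repeated updates collapse into one write and a file that is created and removed before the trim never reaches the backing store.

```python
class Journal:
    """Toy per-server metadata journal (hypothetical; not Ceph's actual design)."""

    def __init__(self):
        self.entries = []        # append-only log of (inode, attrs or None)
        self.backing_store = {}  # stands in for per-directory objects
        self.store_writes = 0    # number of writes that reached the store

    def log_update(self, inode, attrs):
        """Record a metadata update; this is the only work done at update time."""
        self.entries.append((inode, dict(attrs)))

    def log_remove(self, inode):
        """Record the removal of an inode as a tombstone (None)."""
        self.entries.append((inode, None))

    def trim(self):
        """Write back only the last logged state of each inode, then clear the log."""
        latest = {}
        for inode, attrs in self.entries:
            latest[inode] = attrs  # later entries overwrite earlier ones
        for inode, attrs in latest.items():
            if attrs is None:
                # Removed inode: touch the store only if it was ever written there.
                if inode in self.backing_store:
                    del self.backing_store[inode]
                    self.store_writes += 1
            else:
                self.backing_store[inode] = attrs
                self.store_writes += 1
        self.entries.clear()


if __name__ == "__main__":
    journal = Journal()
    for size in range(100):            # repetitive updates to one inode
        journal.log_update(1, {"size": size})
    journal.log_update(2, {"size": 0})  # temporary file: created ...
    journal.log_remove(2)               # ... and removed before the trim
    journal.trim()
    print(journal.backing_store, journal.store_writes)
```

In this sketch, 102 journal entries result in a single write to the backing store. A real journal would also need durability, ordering, and replay after a crash, which the sketch omits.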

The method and system differ from those of conventional file systems in two key ways. First, the relatively small per-file inode metadata structures in our environment (due in part to our use of an object-based stora...