System and method for Hadoop Application Monitoring

IP.com Disclosure Number: IPCOM000230912D
Publication Date: 2013-Sep-18
Document File: 3 page(s) / 60K

Publishing Venue

The IP.com Prior Art Database

Abstract

The performance of a Hadoop application depends on the data being processed, the cluster resources, and the configuration parameters. Users have to specify suitable configuration parameters for an optimized run of MapReduce jobs on a given Hadoop cluster. No application-level performance analyzer and tuner is available for Hadoop. Existing tools such as Hadoop Metrics, Vaidya, Google performance kit, Cloudera Hadoop manager, and Ganglia give only cluster- or platform-level statistics, which are not sufficient for analyzing application-level utilization of resources. The aim of this article is to give a run-time view of Hadoop applications so that applications can be categorized as data intensive or computation intensive and, hence, optimized for future runs. Existing solutions put the responsibility of obtaining application-level metrics on the user, and these metrics may not be easy to calculate. Further, existing solutions consider only some of the run-time parameters. We propose to monitor the applications that users run so that users understand the effect of parameter settings in the application domain and can optimize application runs.

This is the abbreviated version, containing approximately 43% of the total text.

MapReduce provides a parallel programming model for large-scale data analysis. It has emerged as a viable data analysis solution for various application domains such as web indexing, social network analysis, and business intelligence. Hadoop provides the most notable implementation of MapReduce on commodity clusters and has proven to be a competitive alternative to big data analysis on traditional parallel database systems. However, users of the Hadoop framework often execute complex analytical queries that cannot be expressed as a single MapReduce job. Higher-level languages and tools such as Hive, Jaql, and Pig facilitate execution of such complex queries by converting them into a workflow of MapReduce jobs. We use the term applications to denote such complex queries that represent a workflow of jobs to be executed over the Hadoop framework. Monitoring such applications on Hadoop is a time-consuming and cumbersome activity, as it requires monitoring individual jobs and manually aggregating performance statistics for the overall workflow. We present, to our knowledge, the first application monitoring framework for Hadoop. The monitoring framework gathers job-level statistics for an application and aggregates them to return a set of application-level statistics to the client. The modular framework is capable of monitoring jobs on Hadoop next generation as well as earlier Hadoop releases. We create profiles for an application run, which can be used to perform cost-based optimization of future application runs.

Introduction: Hadoop has emerged as a major platform for performing large-scale data analysis. Users need to define appropriate map and reduce functions for their application in order to submit a job to the Hadoop cluster. The performance statistics are vital for future job executions, as they can be used to tune the configuration settings that affect the performance of a job on the available cluster resources. We present Hadoop Applications Monitoring (HAM), a modular framework that provides application-level performance statistics to the client. HAM collects and aggregates individual MapReduce job-level performance statistics for the workflow and computes counters for overall application performance on the available cluster resources. Specifically, HAM provides APIs to retrieve application counters similar to the JobCounters API of Hadoop [1]. HAM APIs are implemented for Hadoop NextGen [2] as well as older Hadoop versions. These APIs can be used for monitoring current as well as historical runtime statistics. Figure 1 shows the HAM architecture. We describe Hadoop applications and the HAM architecture in the next two sections.
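
The aggregation step described above can be illustrated with a minimal Python sketch. It is not HAM's implementation or API: the JobStats structure, the function name, and the sample values are hypothetical, and summing each counter across jobs is only an assumed aggregation rule (some counters might instead call for a maximum or an average). The counter names follow common Hadoop counters and are used here purely as labels.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobStats:
    """Counters reported for one MapReduce job (hypothetical structure)."""
    job_id: str
    counters: dict


def aggregate_application_counters(jobs):
    """Sum every counter over all jobs of one workflow (assumed rule)."""
    totals = Counter()
    for job in jobs:
        # Counter.update adds the values of matching keys.
        totals.update(job.counters)
    return dict(totals)


# Illustrative two-job workflow; the values are made up.
workflow = [
    JobStats("job_1", {"HDFS_BYTES_READ": 1000, "CPU_MILLISECONDS": 400}),
    JobStats("job_2", {"HDFS_BYTES_READ": 250, "CPU_MILLISECONDS": 900,
                       "SPILLED_RECORDS": 10}),
]
app_counters = aggregate_application_counters(workflow)
print(app_counters)
```

A counter that appears in only some jobs (SPILLED_RECORDS here) still shows up in the application-level result, which keeps the per-application view complete even when the jobs in a workflow report different counter sets.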

Hadoop applications: Hadoop is the most popular open-source implementation of the MapReduce framework. Recently, Hadoop has undergone a major architectural refactoring to address various issues like memory consumption, scalability to support clust...