A MECHANISM TO IMPROVE PERFORMANCE IN MAPREDUCE/HADOOP CLOUDS

IP.com Disclosure Number: IPCOM000214663D
Publication Date: 2012-Feb-01
Document File: 2 page(s) / 13K

Publishing Venue

The IP.com Prior Art Database

Related People

Debo Dutta: AUTHOR [+3]

Abstract

Hadoop/MapReduce is a popular compute primitive for large-scale data analysis. Typically, a Hadoop scheduler can use rack-level information to improve the scheduling of job nodes (mappers and reducers) based on their bandwidth requirements. For example, the bandwidth requirements may be determined by node communication patterns. If these tasks are scheduled on an Infrastructure as a Service (IaaS) cloud (e.g., a service provider cloud), however, the scheduler has no topology information. To supply it, the network can maintain a distributed topology server with detailed distance and traffic information for every virtual data center.

This is the abbreviated version, containing approximately 77% of the total text.

AUTHORS:

Debo Dutta

Rajendra Shinde

Subrata Banerjee

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

The network maintains a distributed topology server with detailed distance and traffic information for every virtual data center (i.e., properties between every virtual resource and virtual network element). This server could be mapped to a distributed graph database that runs on the data center aggregation switches. In one example, the distributed graph database could be implemented within the control plane (e.g., in the operating system of a switch). The network is configured to export a well-known application programming interface (API) that presents the topology to the Hadoop scheduler. As a result, the Hadoop scheduler will have complete information about the virtual link properties of nodes within a virtual d...
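
The following is a minimal sketch, not the disclosed implementation, of how a scheduler might consume such a topology API. Everything named here is an assumption made for illustration: the `TopologyServer` class (a single-process stand-in for the distributed graph database), the `LinkProps` fields, the cost formula `distance * (1 + traffic)`, and the `place_reducer` helper. The disclosure does not specify the API's shape or the placement policy.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkProps:
    distance: float  # assumed metric, e.g. hop count or latency
    traffic: float   # assumed current link utilization in [0, 1]


class TopologyServer:
    """Hypothetical in-memory stand-in for the distributed topology server."""

    def __init__(self):
        self._adj = {}  # node -> {neighbor: LinkProps}

    def add_link(self, a, b, distance, traffic=0.0):
        """Record an undirected virtual link and its properties."""
        props = LinkProps(distance, traffic)
        self._adj.setdefault(a, {})[b] = props
        self._adj.setdefault(b, {})[a] = props

    def path_cost(self, src, dst):
        """Cheapest path cost via Dijkstra; a link costs distance * (1 + traffic)."""
        best = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            cost, node = heapq.heappop(heap)
            if node == dst:
                return cost
            if cost > best.get(node, float("inf")):
                continue  # stale heap entry
            for nxt, props in self._adj.get(node, {}).items():
                new_cost = cost + props.distance * (1.0 + props.traffic)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(heap, (new_cost, nxt))
        return float("inf")  # dst is unreachable from src


def place_reducer(topology, mapper_nodes, candidate_nodes):
    """Pick the candidate node with the lowest total path cost to all mappers."""
    return min(
        candidate_nodes,
        key=lambda c: sum(topology.path_cost(m, c) for m in mapper_nodes),
    )


# Example: two virtual machines share switch sw1; a third sits behind a busier link.
topo = TopologyServer()
topo.add_link("vm1", "sw1", distance=1.0)
topo.add_link("vm2", "sw1", distance=1.0)
topo.add_link("sw1", "sw2", distance=1.0, traffic=0.5)
topo.add_link("vm3", "sw2", distance=1.0)
chosen = place_reducer(topo, mapper_nodes=["vm1"], candidate_nodes=["vm3", "vm2"])
```

In this toy topology the reducer lands on the node that shares a switch with the mapper, which is the kind of decision rack-level information enables on bare metal and that the exported topology is meant to enable inside an IaaS cloud.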