Method and Apparatus for Runtime Elastic Scheduling of Heterogeneous HPC Workloads

Disclosure Number: IPCOM000209088D
Publication Date: 2011-Jul-27
Document File: 5 page(s) / 49K

Publishing Venue

The Prior Art Database


Disclosed is a method to address the need for scheduling heterogeneous workloads (batch and dedicated jobs) with runtime elasticity in a High Performance Computing (HPC) environment. The method builds on an existing Dynamic Programming based Lookahead Optimizing Scheduler (LOS) to design Delayed-LOS and Hybrid-LOS, two novel scheduling algorithms. The approach further proposes elastic versions of these algorithms that incorporate runtime elasticity as well.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 32% of the total text.

The cloud computing model is emerging as the de facto mechanism for offering computing services. Not surprisingly, this new model is being embraced to improve the consumability of High Performance Computing (HPC) services. This disclosure studies the impact of demand elasticity, a key ingredient in the cloud service model, on resource scheduling. In particular, it investigates the limitations of today's HPC schedulers in handling demand elasticity and advocates the need for new techniques that are better suited for this emerging workload model.

From a historical perspective, cloud computing is not an entirely new concept in the HPC domain. Grid Computing, for instance, has attracted significant research interest over the last decade or so, much of which focused on fundamental problems in federated resource management [1]. At a high level, HPC systems have generally used a queuing model to schedule incoming jobs [2, 3, 4, 5, 6, 7, 8]. Most optimizations revolve around how an HPC system is packed and how the queue is managed to maximize system utilization while minimizing job wait times. Much of the complexity then arises when balancing a job's relative importance, its resource needs, and expected runtime against available system capacity and scheduling of future jobs, each with varying importance and resource needs. To some extent, elasticity in the cloud model also operates across similar dimensions: time and (resource) space. Basically, all users are expected to get what they want, when they want it, and pay for what they use.
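The queuing model described above can be illustrated with a minimal sketch. The `Job` and `schedule` names below are hypothetical (they do not appear in the disclosure), and the policy shown is a simple greedy packing pass over a priority-ordered queue, not the LOS algorithm itself; it only illustrates the tension between a job's relative importance, its resource request, and available capacity.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int     # requested resources
    runtime: int   # user-estimated runtime (minutes)
    priority: int  # relative importance; higher runs first

def schedule(queue, free_nodes):
    """Greedy packing over a priority-ordered queue.

    Jobs are considered in descending priority (submission order
    breaks ties, since the sort is stable). A job starts if its
    request fits in the remaining free nodes; otherwise it waits.
    """
    started, waiting = [], []
    for job in sorted(queue, key=lambda j: -j.priority):
        if job.nodes <= free_nodes:
            free_nodes -= job.nodes
            started.append(job.name)
        else:
            waiting.append(job.name)
    return started, waiting
```

Note that a lower-priority job (C below) can start ahead of a blocked larger job (B), which is exactly the kind of packing decision that makes batch wait times hard to predict.

```python
jobs = [Job("A", 8, 60, 2), Job("B", 16, 30, 1), Job("C", 4, 10, 1)]
started, waiting = schedule(jobs, 12)
# started == ["A", "C"], waiting == ["B"]
```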

There are two types of elasticity: submit-time elasticity and runtime elasticity. The former allows varying resource requirements to be specified at submission time. In contrast, runtime elasticity gives users the ability to change their resource requirements on-the-fly. Today's cloud resource demand model allows for both types of elasticity, whereas general HPC schedulers implement only submit-time elasticity. The challenge, then, is to find a method for HPC schedulers to best manage the underlying resources under the complete demand model, similar to what is being offered by mainstream clouds. Part of the difficulty is due to the aggressive system utilization levels that HPC systems target. It is not uncommon for an HPC system to exceed 80% utilization. In contrast, mainstream data centers often run at 15% utilization. Especially with the use of virtualization, cloud data centers have significant spare capacity to provide runtime elasticity. Even in the absence of both abundant spare capacity and virtualization, HPC schedulers can provide a certain degree of runtime elasticity. This can be done by decomposing the problem into the following subcomponents:
• Heterogeneous Workloads: Unpredictable wait times have long been recognized as a key issue in batch schedulers. For certain workloads, this unpredictability can b...
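The distinction between submit-time and runtime elasticity described above can be sketched as follows. The `ElasticJob` and `Cluster` names are assumptions for illustration, not the disclosure's mechanism: a job declares a resource range at submission (submit-time elasticity), and a running job may ask to grow or shrink, granted only when spare capacity permits (runtime elasticity).

```python
from dataclasses import dataclass

@dataclass
class ElasticJob:
    name: str
    min_nodes: int   # submit-time elasticity: a range, not a point
    max_nodes: int
    allocated: int = 0

class Cluster:
    def __init__(self, total_nodes):
        self.free = total_nodes

    def start(self, job):
        """Submit-time elasticity: start anywhere in [min, max] nodes."""
        if self.free < job.min_nodes:
            return False
        job.allocated = min(job.max_nodes, self.free)
        self.free -= job.allocated
        return True

    def resize(self, job, new_nodes):
        """Runtime elasticity: change a running job's allocation
        on-the-fly. A grow request is granted only if spare
        capacity covers it; a shrink always succeeds and returns
        nodes to the free pool."""
        delta = new_nodes - job.allocated
        if delta > self.free:
            return False
        self.free -= delta
        job.allocated = new_nodes
        return True
```

On a heavily utilized HPC system the free pool is small, so grow requests will often be denied, which is why the disclosure argues that runtime elasticity there requires new scheduling techniques rather than spare capacity alone.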