A Method and Implementation for Workload Synthesis using Application Projectors
Disclosure Number: IPCOM000184041D
Original Publication Date: 2009-Jun-09
Included in the Prior Art Database: 2009-Jun-09
Document File: 3 page(s) / 33K

Disclosed is a method for synthesizing a representative application program from the workload characteristics of one or more surrogates that have been found to minimize performance error versus the application in real systems. The workload characteristics of the surrogates are superimposed to create a synthetic workload that represents the mean workload characteristics found when running the surrogates sequentially on the target processor, and that therefore represents the performance-critical behavior of the original application. The characteristics of the surrogates are superimposed by weighting each characteristic based on the calculated performance of that characteristic for each surrogate. The resulting synthetic workload represents the application yet executes in a much shorter runtime, which permits its use in performance simulations, hardware model simulations, and hardware model acceleration environments.

This is the abbreviated version, containing approximately 35% of the total text.

A Method and Implementation for Workload Synthesis using Application Projectors

Workload synthesis is motivated by the need for an executable program binary that is short enough to complete on a cycle-accurate processor model in a reasonable amount of time, which is often not possible when a full binary is run. Many programs used for benchmarking and performance projections in the industry today (SPEC2006, TPC-C, etc.) have pathlengths of hundreds of billions to trillions of instructions or more. Given that a cycle-accurate processor model can run at less than 1K instructions per second, a 1 trillion instruction workload would take over 30 years to complete. Synthetic workloads are therefore necessary for cycle-accurate simulation. Executing real workloads on processor models is necessary to establish the performance of the processor before the expensive process of chip manufacture is undertaken.
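The 30-year figure follows directly from the two numbers quoted above. A minimal check, assuming a simulator rate of exactly 1K instructions per second (the text says "less than", so this is a lower bound on the runtime); the function name is illustrative only:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, used only for a rough estimate


def simulation_years(instructions: float, instructions_per_second: float) -> float:
    """Wall-clock years needed to simulate `instructions` at the given rate."""
    return instructions / instructions_per_second / SECONDS_PER_YEAR


# 1 trillion instructions on a model running at 1K instructions per second
years = simulation_years(1e12, 1e3)
print(f"{years:.1f} years")  # prints "31.7 years"
```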

In the past, hand-coded synthetic programs, such as DAXPY, STREAM, and LMbench, have been used to measure cycle-accurate processor model performance characteristics before manufacture. These codes are small microbenchmarks that do not represent the performance of real programs. Typically, the performance of applications like SPEC2006 and TPC-C cannot be determined until hardware is available in the lab.

The solution is to create small representative benchmarks automatically from the performance-critical characteristics of real programs, a process known as "benchmark synthesis." The characteristics include basic block instruction sequences and cache miss rates per basic block. This has been done in prior art for individual programs, such as gcc from the SPEC suite. To project the performance of gcc on a processor simulator, the program is executed on a machine, its performance characteristics are collected, synthetic C code is generated from the basic block execution characteristics, and the compiled synthetic binary is then executed on the slow cycle-accurate simulator. However, more complicated applications like AMBER, BLAST, CHARMM, and WRF are difficult to execute on systems, even to collect workload characteristics, and these applications are important to the performance projections of HPC applications on IBM* processors prior to manufacture.
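As a rough illustration of the kind of per-basic-block profile such a flow might carry, the sketch below records an instruction count, an execution count, and a cache miss rate for each block, then shrinks the execution counts to a target instruction budget while keeping the relative block frequencies. The class, field names, and scaling rule are assumptions made for illustration; the disclosure does not specify how the synthetic is shortened.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BasicBlock:
    # All field names are hypothetical; they mirror the characteristics named in the text.
    name: str
    instructions: int   # instructions in the block's sequence
    exec_count: int     # executions observed in the profiled run
    miss_rate: float    # cache miss rate for the block, carried over unchanged


def scale_profile(blocks: List[BasicBlock], target_instructions: int) -> List[BasicBlock]:
    """Shrink execution counts so the total pathlength is near the target."""
    total = sum(b.instructions * b.exec_count for b in blocks)
    factor = target_instructions / total
    return [
        BasicBlock(b.name, b.instructions, max(1, round(b.exec_count * factor)), b.miss_rate)
        for b in blocks
    ]


profile = [
    BasicBlock("bb0", instructions=10, exec_count=1_000_000, miss_rate=0.02),
    BasicBlock("bb1", instructions=5, exec_count=4_000_000, miss_rate=0.10),
]
small = scale_profile(profile, target_instructions=30_000)
```

Keeping every block at a count of at least one preserves rarely executed blocks in the synthetic, at the cost of slightly overshooting the budget.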

Prior art has used "performance projectors" to assess application performance when it is difficult to characterize the applications themselves. The projector process uses machine execution characteristics to find surrogate programs that accurately project the performance of the application on future systems. When the program execution characteristics of the surrogates are superimposed, the resulting workload accurately represents the application performance. However, prior art does not teach how to synthesize representative codes from the separate workload characteristics of the surrogates, which prevents the synthesis of workload...
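The superposition of surrogate characteristics described in the abstract amounts to a weighted mean per characteristic. The sketch below is one way to express that, assuming each surrogate supplies a dictionary of named characteristics and a single scalar weight; the characteristic names and the choice of weight are hypothetical, since the abbreviated text does not give the exact weighting formula.

```python
from typing import Dict


def superimpose(
    characteristics: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted mean of each characteristic across surrogates.

    `characteristics` maps surrogate name -> {characteristic name -> value}.
    `weights` maps surrogate name -> non-negative weight (an assumed input).
    """
    total = sum(weights[name] for name in characteristics)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    keys = sorted({k for values in characteristics.values() for k in values})
    return {
        k: sum(weights[name] * values.get(k, 0.0) for name, values in characteristics.items()) / total
        for k in keys
    }


# Two hypothetical surrogates; surrogate A counts three times as much as B.
surrogates = {
    "A": {"l1_miss_rate": 0.02, "branch_mispredict_rate": 0.05},
    "B": {"l1_miss_rate": 0.10, "branch_mispredict_rate": 0.01},
}
merged = superimpose(surrogates, {"A": 3.0, "B": 1.0})
```

With these inputs both merged rates come out near 0.04, because the heavier surrogate pulls each mean toward its own value.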