
CPU Core Performance Diagnostics for a HPC Cluster Environment
Disclosure Number: IPCOM000238431D
Publication Date: 2014-Aug-26
Document File: 4 page(s) / 54K

Publishing Venue

The Prior Art Database


The proposed solution presents a core-level CPI (cycles per instruction) analysis model to identify and isolate the processor or core that is performing poorly in a large cluster environment. First, the base CPI stall components are calculated globally for each core running the application. These base CPI stall components show which high-level component of the microprocessor core is creating the bottleneck. The CPI components are then associated with the FLOPS per core to identify the work done by each core in a specific period of time. Once the high-level issue is identified, the analysis narrows down to the individual component and identifies the cause of the performance degradation in the individual core. A key point of the proposed solution is that it does not require inserting any hooks for instrumentation: the analysis is done seamlessly while the deployed code runs in the production environment.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 47% of the total text.


CPU Core Performance Diagnostics for a HPC Cluster Environment

In high-performance computing (HPC), large systems with many processor cores and chips are common. When such systems are deployed for Message Passing Interface (MPI) and other highly parallel workloads, consistent performance across all computing and allied resources is essential for efficient operation. The multitude of processing units, memory cards, system interconnect devices, etc. makes it difficult to verify the functioning and the performance of each individual unit. Sub-standard performance of one processing unit (core) can slow down the entire application. Identifying such problem cores and other poorly functioning units, such as DIMMs and network interfaces, is difficult in a production environment. Here we propose a framework to calibrate and verify the performance of the various units making up a system.

There have been several instances where such faults occurred on production systems and teams spent several weeks diagnosing the root cause.

The following diagram gives a detailed picture of the various components of a simple cluster environment. It contains numerous multi-chip modules, memory DIMMs, and high-speed interconnects.



The above diagram describes the proposed framework. The framework consists of CPI calibrator components, etc.

The CPI calibrator component is described in detail below, taking the POWER7 processor as an example. The CPI component of the POWER7 system consists of the following three key events:

1. Effective cycles completed: PM_GRP_CMPL (group completed)
2. Completion stalls: PM_CMPLU_STALL (no groups completed; GCT not empty)
3. GCT empty cycles: PM_GCT_NOSLOT_CYC (no itags assigned)
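
As an illustration of how the three events above could be turned into per-instruction CPI components, here is a minimal Python sketch. It assumes, beyond what the text states, that each event's cycle count is divided by the number of completed instructions to obtain its share of the CPI. The names CpiEvents and cpi_breakdown are hypothetical and are not part of the disclosed framework.

```python
from dataclasses import dataclass


@dataclass
class CpiEvents:
    """Raw cycle counts for one core (field names are illustrative)."""
    grp_cmpl: int        # PM_GRP_CMPL: cycles in which a group completed
    cmplu_stall: int     # PM_CMPLU_STALL: completion stall cycles
    gct_noslot_cyc: int  # PM_GCT_NOSLOT_CYC: cycles with the GCT empty


def cpi_breakdown(events: CpiEvents, instructions_completed: int) -> dict:
    """Split CPI into three components (assumed: cycles / instructions)."""
    if instructions_completed <= 0:
        raise ValueError("instructions_completed must be positive")
    return {
        "completion": events.grp_cmpl / instructions_completed,
        "completion_stall": events.cmplu_stall / instructions_completed,
        "gct_empty": events.gct_noslot_cyc / instructions_completed,
    }
```

Under this assumption, the largest of the three values points to the high-level area of the core where most cycles are being spent.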

The diagnosing framework is a non-interruptive system that collects the above-mentioned events at each context switch of the process and sums them into a global table maintained by the diagnoser. Two primary data structures are maintained for each core: one for user-level counts and one for kernel-level counts. This makes it possible to distinguish bottlenecks arising from user-level activity from those arising from system-level activity.
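
The global table described above might be organized along the following lines. This is only a sketch in Python: the class GlobalCounterTable, its methods, and the idea of passing per-switch deltas as a dictionary are assumptions made for illustration, and the real diagnoser would read the counters from the hardware at context-switch time.

```python
from collections import defaultdict

# Event names taken from the text; the table layout is an assumption.
EVENTS = ("PM_GRP_CMPL", "PM_CMPLU_STALL", "PM_GCT_NOSLOT_CYC")


class GlobalCounterTable:
    """Per-core running totals, kept separately for user and kernel mode."""

    def __init__(self):
        self._table = defaultdict(
            lambda: {
                "user": dict.fromkeys(EVENTS, 0),
                "kernel": dict.fromkeys(EVENTS, 0),
            }
        )

    def on_context_switch(self, core_id, mode, deltas):
        """Add the counts observed since the previous context switch."""
        if mode not in ("user", "kernel"):
            raise ValueError("mode must be 'user' or 'kernel'")
        row = self._table[core_id][mode]
        for event, count in deltas.items():
            row[event] += count  # raises KeyError for an unknown event

    def totals(self, core_id, mode):
        """Return a copy of the accumulated counts for one core and mode."""
        return dict(self._table[core_id][mode])
```

Keeping the user and kernel rows apart is what allows a later step to tell whether a high stall count comes from the application or from system activity.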

Along with the basic events mentioned above, three primary events are counted by default:

1. Run cycles: PM_RUN_CYC (Run_cycles)
2. Instructions completed: PM_RUN_INST_CMPL (Run_Instructions)
3. Instructions dispatched

With these events, the basic CPIc (completed-instructions CPI) and CPId (dispatched-instructions CPI) are calculated, along with the FLOPS per core.
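
The text does not give the formulas, so the following Python sketch assumes the usual definitions: CPIc is run cycles divided by instructions completed, CPId is run cycles divided by instructions dispatched, and FLOPS per core is a floating-point operation count divided by elapsed time. The function core_metrics and the inputs flop_count and elapsed_seconds are hypothetical, since the source names neither a floating-point event nor a time source.

```python
def core_metrics(run_cycles, inst_completed, inst_dispatched,
                 flop_count, elapsed_seconds):
    """Compute CPIc, CPId and FLOPS for one core (assumed definitions)."""
    if min(inst_completed, inst_dispatched) <= 0 or elapsed_seconds <= 0:
        raise ValueError("instruction counts and elapsed time must be positive")
    return {
        "cpi_c": run_cycles / inst_completed,    # PM_RUN_CYC / PM_RUN_INST_CMPL
        "cpi_d": run_cycles / inst_dispatched,   # PM_RUN_CYC / dispatched count
        "flops": flop_count / elapsed_seconds,   # hypothetical FLOP counter
    }
```

Because more instructions are normally dispatched than completed, CPId would typically come out lower than CPIc for the same interval.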

When the CPI of the cores varies over a period of time, it is a clear indication of a performance loss. If the CPI of a particular core behaves differently while the same code is running on all the cores, the performance loss is likely to be arising from core-based stalls. Here the FLOPS rate will be going dow...
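
One way to picture the comparison across cores described above is the following Python sketch. The source does not say how a core with different behavior is detected, so the median baseline, the 10% tolerance, and the function name flag_slow_cores are all assumptions chosen for illustration.

```python
from statistics import median


def flag_slow_cores(cpi_by_core, tolerance=0.10):
    """Return core ids whose CPI exceeds the median CPI by more than tolerance.

    cpi_by_core maps a core id to its measured CPI while all cores run the
    same code. The median baseline and the tolerance are assumed values.
    """
    if not cpi_by_core:
        return []
    baseline = median(cpi_by_core.values())
    limit = baseline * (1 + tolerance)
    return sorted(core for core, cpi in cpi_by_core.items() if cpi > limit)
```

For example, with CPI values of 1.00, 1.02, 0.98 and 1.45 on cores 0 to 3, the median is 1.01 and only core 3 exceeds the 10% limit, so it would be the candidate for the component-level analysis.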