InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

A compiler optimization approach for cache-efficient global scalar data arrangement

IP.com Disclosure Number: IPCOM000246944D
Publication Date: 2016-Jul-18
Document File: 9 page(s) / 249K

Publishing Venue

The IP.com Prior Art Database


We address an approach to exploit D-cache efficiency by optimizing data arrangement. Our approach is an inter-procedural approach targets global data optimization. First, we map data to an affinity-graph and to evaluate affinity degree with static and run-time information. Then, a HEM(Heavy-Edge Matching)-based approach is leveraged to partition affinity graph. It finally groups data in “virtual cache lines” that adapts to cache structure.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 23% of the total text.

Page 01 of 9

A compiler optimization approach for cache-efficient global scalar data arrangement

1. Overview

The performance gap between processor and memory is continuously widening following Moore's Law. Cache has become very important to shorten this gap. Cache performs well for programs that exhibit sufficient locality in spatial and temporal. One measure of the benefits of cache is miss rate. Figure 1 shows the MPKI (Cache Misses Per Thousand Instructions) report of SPEC (Standard Performance Evaluation Corporation) benchmarks on Intel Pentium 4, Linux kernel 2.6.23 with GCC(GNU Compiler Collection) 4.2. [1]

(a) L1 MPKI measurements for SPEC CPU 2006 (b) L2 MPKI measurements for SPEC CPU 2006

(a) L1 MPKI measurements for SPEC CPU 2000 (b) L2 MPKI measurements for SPEC CPU 2000
Figure 1. MPKI measurement for SPEC benchmarks.

Modern general-purpose processors adopt set associative cache. A cache set is a group of cache blocks (cache lines). Memory address is first mapped onto


Page 02 of 9

a set, and then the block can be placed anywhere within that set (n-way). Table 1 shows Power 8 on-chip D-Cache hierarchies [2].

                  Table 1. Power 8 on-chip D-Cache hierarchies hierarchy Size (per core) set

128 B/line, 8-WAY

Reasons that cause misses in a set associative D-Cache can be categorized as below. [3]


The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even if you had an infinite sized cache.

If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.

If the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved if multiple blocks map to its set and accesses to the different blocks are intermingled.

On the one hand, hardware technique is usually implemented in processor's micro-architecture, however, limited in effect; on the other hand programs can be optimized by compiler, in the perspective of programming from software level, to make the program more adaptive for cache which our work mainly focus on.

Generally, local auto variables locate adjacently on stack and usually not very hard to trace in a single procedure. But for inter-procedural memory access, i.e. global scalar variables, it is hard to trace and analyze. Besides, scalar data don't show definite locality compare with array data. It is more difficult to analyze the locality by a compiler. Global variable is generally used in program, but there is rare study on this topic.

Optimization for data arrangement can ameliorate cache efficiency. Data arrangement is always hard to optimized, because there are two major challenges:

1. Global scalar variables live across different procedures;

2. Optimal arrangement of data in cache line is NP-complete [4].

Our goal is to explore spatial localit...