
Technique for Parallel Trace-Driven Cache Simulation

IP.com Disclosure Number: IPCOM000100993D
Original Publication Date: 1990-Jun-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 2 page(s) / 111K

Publishing Venue

IBM

Related People

Heidelberger, P: AUTHOR [+2]

Abstract

This article describes a technique for achieving almost N-fold speed-up by using N processors to evaluate caches from a trace of instructions.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.


      Cache evaluation studies have historically been done by
simulating the performance of a variety of caches on typical
workloads.  The workloads are supplied in the form of instruction
traces.  Many techniques have been used to reduce the effort of
performing such analyses (*), but those techniques have been developed
solely for implementation on serial machines.  The technique described
here demonstrates that parallel trace-driven analysis can be done
with near-perfect efficiency.

      As an aside, we mention that along one dimension,
parallelization is simple, but not statistically efficient. Practical
implementations of caches avoid a fully associative search through
the cache by dividing the address space into M independent address
spaces.  The caches are structured as M independent sets of entries.
An address search is carried out only in the set that could possibly
contain the target address.  This reduces the extent of the search by
a factor of M.

      Set associativity also creates an opportunity for parallel
simulation, as follows.  The instruction trace can be partitioned
into M independent sequential traces, one for each set, and each such
trace can be processed by an independent processor.  This allows the
number of processors N to be as large as M.
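The per-set partitioning described above can be sketched in a few lines. The line size and set count used here are illustrative parameters, not values taken from the article; the set index is derived from the cache-line number, one common indexing scheme.

```python
def split_by_set(trace, num_sets, line_size=64):
    """Partition a trace of byte addresses into one subtrace per set.

    Each address maps to exactly one set, so the M subtraces are
    independent and can be simulated on M separate processors.
    """
    subtraces = [[] for _ in range(num_sets)]
    for addr in trace:
        line = addr // line_size           # cache-line number
        subtraces[line % num_sets].append(addr)
    return subtraces
```

Because a reference can only hit or miss within its own set, simulating each subtrace separately yields exactly the same per-set miss counts as a single serial simulation.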

      Although the speed-up is available, the actual value of this
form of parallelism is limited because the simulation of M sets does
not reduce the variance in the estimated miss ratio by a factor of M.
The behavior of misses over time is highly correlated across sets, so
the extra work involved in the parallel simulation of distinct sets
does not contribute commensurately to the accuracy of the answers.

      Parallel Simulation

      For parallel simulation of a serial trace we assume that the
input trace to the parallel analysis contains just the collection of
sets that are to be simulated.  If the cache to be evaluated is fully
associative, the full trace is assumed to be the input trace.  We
also assume that the cache replacement policy is least recently used
(LRU) within a set for set-associative caches, or LRU over the whole
cache for fully associative caches.

      The input trace is partitioned into N sequential subtraces, the
first consisting of the first 1/Nth of the original trace, the next
consisting of the next 1/Nth, and so on.  If the trace length is not
a multiple of N, the subtraces should be as nearly equal in length as
possible.
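The contiguous near-equal split described above might look like the following sketch; the function name is illustrative only.

```python
def split_contiguous(trace, n):
    """Split a trace into n contiguous subtraces of near-equal length.

    When len(trace) is not a multiple of n, the first few subtraces
    receive one extra element so the lengths differ by at most one.
    """
    base, extra = divmod(len(trace), n)
    out, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        out.append(trace[start:start + size])
        start += size
    return out
```

Each of the N subtraces is then handed to its own processor, preserving the original reference order within each piece.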

      Each trace is analyzed on a separate processor.  Each processor
initializes its cache contents to an illegal tag value denoted here
as infinity.  Each processo...