Browse Prior Art Database

Technique for Parallel Trace-Driven Multiprocessor Cache Simulation With Write Broadcast

IP.com Disclosure Number: IPCOM000103280D
Original Publication Date: 1990-Aug-01
Included in the Prior Art Database: 2005-Mar-17
Document File: 7 page(s) / 308K

Publishing Venue

IBM

Related People

Heidelberger, P: AUTHOR [+2]

Abstract

Cache evaluation studies have historically been done by simulating the performance of a variety of caches on typical workloads. The workloads are supplied in the form of instruction traces. Many techniques have been used to reduce the effort in performing such analyses (1), but the techniques have been developed mainly for implementation on serial machines.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 19% of the total text.

Technique for Parallel Trace-Driven Multiprocessor Cache Simulation With Write Broadcast

       Cache evaluation studies have historically been done by
simulating the performance of a variety of caches on typical
workloads.  The workloads are supplied in the form of instruction
traces.  Many techniques have been used to reduce the effort in
performing such analyses (1), but the techniques have been developed
mainly for implementation on serial machines.

      Simulation is very costly to perform and is a candidate for
speedup on a parallel computer.  However, the fact that the cache
performance depends on the present state of the cache, and the state
has to be computed sequentially, suggests that the serialization in
the simulation precludes effective parallelism in the implementation.
A parallel simulation technique for evaluating the performance of
multiprocessor caches was described in (2).  The technique described
in (2) uses one processor per simulated cache, and requires
preprocessing of the input traces to produce a merged trace
consisting of local access plus global accesses (from other
processors).  Depending upon the cache architecture, processors may
have to synchronize with each other on global accesses.

      Here, we describe a technique for parallelizing the simulation
of a multiprocessor cache memory system from instruction traces.  The
technique is efficient in the simulation, and achieves speedup that
is independent of the number of processors in the system simulated.
In this approach, processors are assigned nonoverlapping time
intervals and each processor simulates the activity of all caches
within its time interval.  However, since the states of the caches at
the beginning of the time intervals are unknown, a resimulation phase
is required to produce correct results.  Thus, very little processor
synchronization is required.  For long traces the resimulation phase
will (probably) be much faster than the first pass simulation,
resulting in good speedup.  For a given number of processors, the
speedup should increase with the length of the input traces.  In
contrast, the speedup obtained in (2) is limited to (at most) the
number of caches, no matter how long the traces are.

      A Simple Approach: Practical implementations of caches avoid a
fully associative search through the cache by dividing caches into M
independent sets of entries.  An address search is carried out only
in the set that could possibly contain the target address.  This
reduces the extent of the search by a factor of M.

      Set associativity produces the possibility for parallel
simulation as follows.  Each cache's instruction trace should be
partitioned into M independent sequential traces, one for each set,
and the combined traces from each set can be treated by an
independent processor.  This allows the number of processors N to be
as large as M.

      Although the speedup is available, the actual value of...