Technique for Parallel Trace-Driven Multiprocessor Cache Simulation With Write Invalidate

IP.com Disclosure Number: IPCOM000101382D
Original Publication Date: 1990-Aug-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 5 page(s) / 201K

Publishing Venue

IBM

Related People

Heidelberger, P: AUTHOR [+2]

Abstract

We are interested in parallel, trace-driven simulation techniques for evaluating the performance of memory hierarchies. The technique described here can be used to evaluate parallel multiprocessors with caches that are maintained in a consistent state by invalidating lines held in remote caches. A different parallel simulation technique for evaluating the performance of multiprocessor caches was described in (*).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 30% of the total text.

Technique for Parallel Trace-Driven Multiprocessor Cache Simulation With Write Invalidate

       We are interested in parallel, trace-driven simulation
techniques for evaluating the performance of memory hierarchies.  The
technique described here can be used to evaluate parallel
multiprocessors with caches that are maintained in a consistent state
by invalidating lines held in remote caches.  A different parallel
simulation technique for evaluating the performance of multiprocessor
caches was described in (*).

      In our approach, processors are assigned nonoverlapping time
intervals and each processor simulates the activity of all caches
within its time interval.  However, since the states of the caches at
the beginning of the time intervals are unknown, a resimulation phase
is required to produce correct results.  Very little processor
synchronization is required, and for long traces the resimulation
phase will (probably) be much faster than the first-pass simulation,
resulting in good speedup.
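The two-phase idea above can be sketched on a much simpler model than the multiprocessor system treated here: a single, fully-associative LRU cache. Everything below (the names `lru_sim` and `time_parallel_misses`, the fall-back handling) is our own illustration, not taken from the disclosure. The first pass simulates every interval from an empty, i.e. unknown, state; the resimulation pass re-runs only each interval's "cold" prefix, since for LRU the cache contents stop depending on the start state once the interval has referenced a cache-full of distinct lines.

```python
from collections import OrderedDict

def lru_sim(trace, capacity, start=None):
    """Fully-associative LRU simulation over `trace` (line ids).
    Returns (miss_flags, final_state, warm_idx), where warm_idx is the
    position by which `capacity` distinct lines of the trace itself
    have been referenced; from that point on, cache contents and
    hit/miss outcomes no longer depend on the start state."""
    cache = OrderedDict() if start is None else OrderedDict(start)
    flags, seen, warm_idx = [], set(), None
    for line in trace:
        hit = line in cache
        flags.append(not hit)
        if hit:
            cache.move_to_end(line)              # mark most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)        # evict the LRU line
            cache[line] = line
        seen.add(line)
        if warm_idx is None and len(seen) >= capacity:
            warm_idx = len(flags)
    return flags, cache, warm_idx

def time_parallel_misses(trace, capacity, n_intervals):
    k = -(-len(trace) // n_intervals)            # ceiling division
    chunks = [trace[i:i + k] for i in range(0, len(trace), k)]
    # Phase 1 (parallelizable): each interval simulated independently
    # from an empty -- i.e. unknown -- initial cache state.
    pass1 = [lru_sim(c, capacity) for c in chunks]
    # Phase 2 (serial, but short): resimulate only the cold prefix of
    # each interval from its predecessor's now-known final state.
    total, state = 0, OrderedDict()
    for chunk, (flags, final, warm) in zip(chunks, pass1):
        if warm is None:                         # interval never warmed:
            flags2, state, _ = lru_sim(chunk, capacity, state)
            total += sum(flags2)                 # fall back to a full rerun
        else:
            cold_flags, _, _ = lru_sim(chunk[:warm], capacity, state)
            total += sum(cold_flags) + sum(flags[warm:])
            state = final                        # correct once warmed
    return total
```

The phase-1 loop over chunks is embarrassingly parallel, while phase 2 touches only each interval's cold prefix; for long traces that prefix is a small fraction of the interval, which is the source of the speedup claimed above.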

      The Cache System

      We assume that the multiprocessor system to be simulated
consists of N processors, each with its own cache; the caches are
attached to a global bus through which they share a global memory.
We assume a bus-based algorithm with write invalidates to maintain
cache coherency.  The algorithm works as follows:
1.   Each line is held in one of three states: exclusive (E),
nonexclusive (NE), or invalid (I).  Any valid line held by more than
one cache must be NE.
2.   On a miss to a line held in no cache, the line is brought in E
from the global memory.  This requires a bus transaction.
3.   If a cache misses to a line, but that line is held in another
cache, the line is loaded remotely from the other cache.  If the
access was a read, the line is held NE in all caches in which it
resides.  If the access was a write, the line is held E locally and
it is invalidated in all other caches in which it resides. This
requires a bus transaction.
4.   If a processor writes to a line it holds NE, the status is
changed to E locally, and the line is invalidated in all other caches
in which it resides.  This requires a bus transaction.
5.   Accesses to lines held E or reads from lines held NE are
satisfied locally and do not require bus transactions.
6.   Within a set, LRU replacement is used.

      The goal of the simulation is to obtain the following counts:
1.   MISS - the number of misses to global memory.
2.   HIT  - the number of accesses (read or write) to lines held
locally (whether E or NE).
3.   RHIT - the number of hits (read or write) to lines held in a
remote cache but not in the local cache.
4.   BUS  - the number of bus transactions.
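To make rules 1-6 and the four counts concrete, here is a minimal serial reference simulator; the class and parameter names are our own, and a line absent from a cache represents state I.

```python
from collections import OrderedDict

class WriteInvalidateSim:
    """Illustrative serial simulator of the E/NE/I write-invalidate
    protocol described above; names and structure are ours, not from
    the disclosure."""

    def __init__(self, n_caches, n_sets, assoc):
        self.n_sets, self.assoc = n_sets, assoc
        # caches[i][s]: OrderedDict mapping line -> "E" or "NE",
        # ordered least- to most-recently used (rule 6: LRU per set).
        self.caches = [[OrderedDict() for _ in range(n_sets)]
                       for _ in range(n_caches)]
        self.MISS = self.HIT = self.RHIT = self.BUS = 0

    def _remote_sets(self, cpu, line, s):
        return [c[s] for i, c in enumerate(self.caches)
                if i != cpu and line in c[s]]

    def _install(self, cache_set, line, state):
        if len(cache_set) >= self.assoc:
            cache_set.popitem(last=False)          # evict the LRU line
        cache_set[line] = state

    def access(self, cpu, line, is_write):
        s = line % self.n_sets
        local = self.caches[cpu][s]
        remotes = self._remote_sets(cpu, line, s)
        if line in local:
            local.move_to_end(line)                # rule 5: local hit
            self.HIT += 1
            if is_write and local[line] == "NE":   # rule 4: upgrade
                local[line] = "E"
                self.BUS += 1
                for r in remotes:                  # invalidate copies
                    del r[line]
        elif remotes:                              # rule 3: remote hit
            self.RHIT += 1
            self.BUS += 1
            if is_write:
                for r in remotes:                  # invalidate copies
                    del r[line]
                self._install(local, line, "E")
            else:
                for r in remotes:                  # shared on read
                    r[line] = "NE"
                self._install(local, line, "NE")
        else:                                      # rule 2: memory miss
            self.MISS += 1
            self.BUS += 1
            self._install(local, line, "E")
```

Note that a write hitting an NE line locally counts as a HIT yet still costs a bus transaction, since the remote copies must be invalidated (rule 4).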

      We assume that there is one input trace per cache to be
simulated.  Each input trace contains that processor's accesses (in
sequential order), each with a time stamp that is used to serialize
bus transactions.
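If each per-cache trace is already sorted by its time stamps, a single bus-ordered stream can be produced with a k-way merge; the record layout below is hypothetical, chosen only for illustration.

```python
import heapq

# Hypothetical per-processor trace records: (time_stamp, cpu, line,
# is_write), each trace already in time-stamp order.
trace0 = [(1, 0, 100, False), (4, 0, 101, True), (7, 0, 100, False)]
trace1 = [(2, 1, 100, False), (5, 1, 102, False)]

# heapq.merge lazily k-way-merges sorted inputs; tuples compare on
# their first field, so the output is serialized by time stamp.
bus_order = list(heapq.merge(trace0, trace1))
assert [t for t, _, _, _ in bus_order] == [1, 2, 4, 5, 7]
```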

      The Parallel Simulation Technique

      We assume that there are N processors pe...