Browse Prior Art Database

Parallel Simulation of Multiprocessor Caches

IP.com Disclosure Number: IPCOM000114149D
Original Publication Date: 1994-Nov-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 6 page(s) / 284K

Publishing Venue

IBM

Related People

Stone, HS: AUTHOR [+2]

Abstract

The design and performance analysis of memory hierarchies in uniprocessors is typically done by means of trace-driven techniques. The evaluation of multiprocessor memory hierarchies, in principle, can be done as well by means of trace-driven techniques. This discussion describes a trace-driven technique suitable for execution on a multiprocessor for evaluating a multiprocessor memory hierarchy.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 19% of the total text.

Parallel Simulation of Multiprocessor Caches

      The design and performance analysis of memory hierarchies in
uniprocessors is typically done by means of trace-driven techniques.
The evaluation of multiprocessor memory hierarchies, in principle,
can be done as well by means of trace-driven techniques.  This
discussion describes a trace-driven technique suitable for execution
on a multiprocessor for evaluating a multiprocessor memory hierarchy.

      (1,2,3) explored techniques for parallel trace-driven analysis
of multiprocessor systems, and provided the background for the
invention described here.  In their analysis, the implicit assumption
is made that the traces of each processor in the multiprocessor
system being simulated remain synchronized, reference by reference,
through the full trace.  In fact, such synchronization does not occur
in actual systems, since individual processors may be delayed
relative to others while waiting for items from memory when those
items are not available in cache.  The algorithm described below
takes such loss of synchronization into account, and produces more
accurate information for performance studies than does the work of
Heidelberger and Stone.  The algorithm also includes bounds on the
precision of the simulation, and provides for early termination when
the precision reaches acceptable levels.

      Our objective is to analyze the performance of a cache-based
multiprocessor when executing the M address traces.  The M traces are
partitioned into N segments for processing by N processors of a
simulation multiprocessor.  The ith simulation processor receives the
ith partition from all M traces.  Hence, the number of processors
used in the simulation analysis is independent of the number of
processors being simulated, as has been proposed by [1,2,3].

      Since each simulation processor sees all simulated processors
over a slice of time, the simulation of the cache coherence overhead
can be done within each processor and generally does not require
communication between simulation processors except at the endpoints
of the trace segments.

      The problem to be solved concerns how to account for trace
segments when they lose synchronization with respect to each other.
At any given phase of the algorithm, the relative execution times of
the first instruction of each trace segment are known with some
uncertainty.  The bounds on when the first instruction of each trace
segment is executed is stored with the trace segment.  These bounds
are recorded in a trace segment as variables named early trace time
and late trace time.

      The algorithm maintains the uncertainty bounds, and goes
through a sequence of phases of simulation.  Initially, the bounds
are unknown.  One round of simulation produces estimates of the
bounds and outputs enough information to resimulate the instructions
that require resimulation.  During a resimulation phase, each
processor resimulates o...