Browse Prior Art Database

Algorithm for Instruction Cache Arbitration Among Multiple Instruction Streams

IP.com Disclosure Number: IPCOM000118011D
Original Publication Date: 1996-Aug-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 4 page(s) / 169K

Publishing Venue

IBM

Related People

McGarity, R: AUTHOR [+3]

Abstract

In processor designs utilizing branch prediction, imperfect branch prediction accuracy leads to performance degradation. The severity of the performance degradation depends on the prediction accuracy and the penalty per misprediction. This paper presents an approach which focuses on reducing the penalty per misprediction to enhance processor performance.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 39% of the total text.

Algorithm for Instruction Cache Arbitration Among Multiple Instruction
Streams

      In processor designs utilizing branch prediction, imperfect
branch prediction accuracy leads to performance degradation.  The
severity of the performance degradation depends on the prediction
accuracy and the penalty per misprediction.  This paper presents an
approach which focuses on reducing the penalty per misprediction to
enhance processor performance.

      Reducing the misprediction penalty by using multiple
instruction streams has been widely discussed.  This technique
involves replicating machine resources in order to process both the
taken and not-taken paths of a conditional branch.  A significant
drawback to this  technique is the amount of hardware required to
effect the penalty reduction.  In particular, replicating the
instruction cache, instruction  decoders, and execution resources is
prohibitively expensive.  Less aggressive implementations of the
multiple instruction streams technique  restrict the resource
replication to just the instruction cache and perhaps the instruction
decode logic.

      An alternative approach to supporting multiple instruction
streams is to time multiplex use of existing resources rather than
replicating them for the alternate branch path(s).  For instance,
time multiplexed use of a single one-ported instruction cache can
support the fetching of multiple instruction streams into the
processor core.  While the time multiplexing approach offers a
reduced hardware cost in comparison to the hardware replication
approach, it does not provide all the performance benefits of that
approach.  In the obvious  implementation, when a conditional branch
is encountered the predicted  and alternate branch paths access the
instruction cache on every-other  cycle.  In this scheme both the
predicted and alternate paths are processed at roughly one-half of
the original machine's efficiency until  the branch is resolved.
This scheme has severe limitations, especially  if the prediction
mechanism is correct the vast majority of times.

      A variation on the instruction cache time multiplexing approach
involves using an algorithm to arbitrate among the fetch requests
from the predicted and one or more alternate paths.  To maximize
performance, the algorithm proposed selects between the predicted and
alternate path  fetch requests based on a number of factors including
pipeline interlocks  downstream from the fetcher, the number of
outstanding predicted branches  currently in the machine, the
presence of an instruction cache miss, and  the presence of an
instruction TLB miss.

      The algorithm detailed below describes when to dedicate one
cycle to fetch the alternate path of the oldest branch in the
processor instead of the predicted path of the newest branch.
Algorithm: fetch_alt_instruction_stream={  icache_miss_pending|
itlb_miss_pending| decode_stage_interlock_last_cycle|
(num_outs...