Method for Improving Processor Pipeline in Systems with Cache

IP.com Disclosure Number: IPCOM000112894D
Original Publication Date: 1994-Jun-01
Included in the Prior Art Database: 2005-Mar-27

Publishing Venue

IBM

Related People

Chuang, CM: AUTHOR [+2]

Abstract

Improved processor pipeline performance can be obtained by the combination of address generation on the decode cycle [1] with a double (or larger) decoding capability [3].

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 23% of the total text.

      In RISC-like architectures, it is possible to improve the
processor performance by optimizing the pipeline structure in many
different ways.  [1]  suggests doing address generation on the decode
cycle while [2,3]  suggest performance improvements by making use of
a double decoding scheme.  Each approach has its advantages and
disadvantages and represents a different cost/performance tradeoff.

      The double decode proposal of [2,3]  is shown in Fig. 1 for the
configuration with the best performance.  Path I includes all address
generation, in addition to rotate, multiply and divide ops.  Path II
includes all ALU ops.  In Fig. 1, the cache path has been added to
Path I after address generation; it is assumed that address
translation takes place in parallel with cache access using a late
select cache organization.  In [2,3], the potential performance was
considered but without the effects of the attached memory/cache
system.  Pipeline disruptions in a double decode system, due to the
cache/memory pipeline path, introduce additional degrading factors
which are not present in a single decode system.  Four such degrading
factors involving memory operations are as follows:

1.  the range of DEPENDENT Loads is EXTENDED

2.  a DEPENDENT STORE sequence is introduced which adds one
    additional cycle of delay when it occurs and complicates state
    saving for restart on a cache miss and subsequent page fault.

3.  OUT-OF-SEQUENCE operations can result when Load/Store operations
    which are paired to ALU ops cause a task switch.

4.  BRANCHES, in a system without a branch preprocessor, are
    particularly troublesome, and introduce serial delays to both
    sides of the double decode paths.
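
      To make factor 1 concrete, the extended range of dependent
loads can be sketched with a simple stall model.  The function names
and timing formula below are illustrative, not taken from the
disclosure; the model assumes best-case instruction packing and the
2-cycle cache of Fig. 2(a), and ignores the other degrading factors.

```python
def decode_cycles_apart(distance, decode_width):
    """Approximate decode cycles separating two instructions that are
    `distance` instruction slots apart, assuming best-case packing
    (a width-2 decoder handles two instructions per cycle)."""
    return (distance + decode_width - 1) // decode_width

def stall_cycles(distance, cache_latency=2, decode_width=1):
    """Cycles lost when an instruction `distance` slots after a Load
    uses the loaded register (simplified model): the use cannot decode
    until the cache pipeline has returned the data."""
    gap = decode_cycles_apart(distance, decode_width)
    return max(0, cache_latency - (gap - 1))
```

Under this model a single-decode machine loses 2 cycles when the use
immediately follows the Load and nothing once the use is 3 slots
away, while a double-decode machine still stalls at 3 slots and is
not stall-free until the use is 5 slots away -- the dependent-load
range is extended, as factor 1 states.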

      This disclosure mitigates these factors by reducing the range
of dependent loads, eliminating possible Out-of-Sequence operations,
and reducing the branch delay.

      A dependent load is a load of a register followed closely by an
instruction which uses the loaded value.  In a single decode system,
the instruction which uses the load should follow it in the
instruction sequence by at least as many cycles as the cache pipeline
takes to retrieve the data.  Otherwise processor cycles will be lost,
the exact number depending on where the instruction using the load
appears.  For example, assuming a 2-cycle cache as shown
schematically in Fig. 1 with timing as shown in Fig. 2(a), the
cycles lost for dependent loads in a Single-Decode system are as
shown in Fig. 3.  When the instruction using the load immediately
follows the Load instruction as in Fig. 3(a), 2 potential decode
cycles are lost until the cache access completes.  If the instruction
using the...