Practical Two-Cycle Forwarding Mechanism for Floating Point Units

IP.com Disclosure Number: IPCOM000116162D
Original Publication Date: 1995-Aug-01
Included in the Prior Art Database: 2005-Mar-30

Publishing Venue

IBM

Related People

Elliott, TA: AUTHOR [+2]

Abstract

In the early days of the POWER architecture, fast data-dependent forwarding in the floating point unit was understood to be an extremely critical performance issue. In designing the three-stage RISC System/6000* ("RS/6") floating point unit, a mechanism was developed to effectively bypass one stage on dependencies to both the B and C operands.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 14% of the total text.

      In the early days of the POWER architecture, fast
data-dependent forwarding in the floating point unit was understood
to be an extremely critical performance issue.  In designing the
three-stage RISC System/6000* ("RS/6") floating point unit, a
mechanism was developed to effectively bypass one stage on
dependencies to both the B and C operands.

      All POWER and PowerPC floating point units are built around the
'fused multiply add' instruction, or FMA, which has the following
form:
   Rt = (Ra * Rc) + Rb
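
      Since the rest of the discussion turns on which operand of an
FMA carries a dependency, the operand roles can be made concrete with
a minimal sketch.  The class and function names below are
illustrative assumptions, not part of the original disclosure; the
two instructions are the pair used in Example 1.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class FMA:
    """One fused multiply-add: rt = (ra * rc) + rb (hypothetical model)."""
    rt: str  # target register
    ra: str  # multiplicand (A operand)
    rc: str  # multiplier (C operand)
    rb: str  # addend (B operand)


def dependent_operands(producer: FMA, consumer: FMA) -> list[str]:
    """Return the operand slots of `consumer` that read `producer`'s target."""
    slots = {"A": consumer.ra, "B": consumer.rb, "C": consumer.rc}
    return [slot for slot, reg in slots.items() if reg == producer.rt]


fma0 = FMA(rt="R0", ra="R1", rc="R2", rb="R3")  # R0 = (R1 * R2) + R3
fma1 = FMA(rt="R4", ra="R5", rc="R6", rb="R0")  # R4 = (R5 * R6) + R0
print(dependent_operands(fma0, fma1))  # ['B']
```

      Here the second instruction depends on the first through its B
operand, which is one of the two dependency cases (B and C) that the
bypass mechanism targets.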

      In follow-on microprocessors (601, 603, 604, 620 and RISC
Single Chip (RSC)), the RS/6 data bypass mechanism has not been
pursued.  The key factor for all of these processors was cycle time.
The rigid data flow requirements of the RS/6 bypass mechanism put a
limit on the peak frequency of the design.  The trade-off for the
follow-on FPUs was an increased Cycles Per Instruction (CPI), for a
reduced cycle time.

      The goal was to develop a two-cycle bypass mechanism that
would not suffer the same frequency restrictions as the RS/6 bypass
mechanism.

BACKGROUND (Part 1): The importance of data-dependent bypassing.

      The following timing chart represents a normal 3-cycle FPU
without fast data bypassing, such as that of the 604.
  EXAMPLE 1:
    FMA0 R0 = (R1 * R2) + R3
    FMA1 R4 = (R5 * R6) + R0
    cycle        0     1     2     3     4     5     6     7
    ------------------------------------------------------------
    dispatch    FMA0  FMA1
    EXEC1             FMA0  FMA1  FMA1  FMA1
    EXEC2                   FMA0              FMA1
    Writeback                     FMA0              FMA1

      On cycle 3, the data from FMA0 is available in the writeback
stage.  Without being able to bypass any stages, the writeback data
must go through the latch boundary at the start of EXEC1.  This means
the EXEC1 stage does not have the B data for FMA1 until cycle 4.  The
above timing chart shows that a two-cycle gap has been created by the
data dependency.
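
      The timing in Example 1 can be reproduced with a small
scheduling model.  This is only a sketch under stated assumptions:
one instruction is dispatched per cycle, a dispatched instruction
waits until EXEC1 is free, and a result becomes readable by EXEC1 one
cycle after its writeback (the latch boundary described above).  The
function name and data layout are hypothetical.

```python
def schedule_no_bypass(instrs):
    """Schedule FMAs on a 3-stage FPU (EXEC1, EXEC2, Writeback), no bypass.

    instrs: list of (name, target, sources) tuples in program order.
    Returns {name: {"dispatch": c, "EXEC1": [cycles], "EXEC2": c,
                    "Writeback": c}}.
    """
    usable = {}     # register -> first cycle EXEC1 can read its new value
    exec1_free = 0  # first cycle in which EXEC1 is unoccupied
    timeline = {}
    for i, (name, target, sources) in enumerate(instrs):
        dispatch = i  # assumption: one dispatch per cycle, never stalled
        enter = max(dispatch + 1, exec1_free)
        # The instruction completes EXEC1 in the first cycle at which
        # every source register is readable.
        last_exec1 = max([enter] + [usable.get(r, 0) for r in sources])
        writeback = last_exec1 + 2
        usable[target] = writeback + 1  # result crosses the EXEC1 latch
        exec1_free = last_exec1 + 1
        timeline[name] = {
            "dispatch": dispatch,
            "EXEC1": list(range(enter, last_exec1 + 1)),
            "EXEC2": last_exec1 + 1,
            "Writeback": writeback,
        }
    return timeline


example1 = [
    ("FMA0", "R0", ("R1", "R2", "R3")),  # R0 = (R1 * R2) + R3
    ("FMA1", "R4", ("R5", "R6", "R0")),  # R4 = (R5 * R6) + R0
]
for name, row in schedule_no_bypass(example1).items():
    print(name, row)
```

      Under these assumptions FMA1 occupies EXEC1 during cycles 2, 3
and 4 and writes back on cycle 6, matching the chart: two of its
three EXEC1 cycles are spent waiting for R0.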

      Now, it is possible for compilers to schedule independent FPU
instructions between FMA0 and FMA1.  However, as the number of
required independent FPU instructions grows, the likelihood that the
compiler can find and reschedule such instructions decreases.  In the
above example, to fully utilize the floating point unit, two FPU
instructions independent of both FMA0 and FMA1 would need to be
placed between the original instructions to eliminate the gap.
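
      The scheduling burden on the compiler can be put as simple
arithmetic.  In the sketch below, dependent_latency is an assumed
parameter: the number of cycles from the producer's EXEC1 cycle to
the first cycle in which the consumer can complete EXEC1, which is 3
in Example 1 (cycle 1 to cycle 4).  The function name is
hypothetical.

```python
def stall_cycles(independent_between, dependent_latency=3):
    """Idle cycles remaining in front of a dependent FPU instruction.

    independent_between: number of independent FPU instructions the
        compiler placed between producer and consumer.
    dependent_latency: producer-EXEC1 to consumer-EXEC1 distance
        required by the pipeline (3 for the no-bypass Example 1).
    """
    return max(0, dependent_latency - 1 - independent_between)


for n in range(4):
    print(n, stall_cycles(n))
# 0 2
# 1 1
# 2 0
# 3 0
```

      With no intervening instructions the gap is two cycles, and it
closes only once two independent instructions have been found, as
stated above.  Any reduction in dependent_latency lowers the number
of independent instructions the compiler must find.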

      In the RS/6 design, the same two instructions would have the
following timing:
  cycle        0     1     2     3     4     5     6     7
  ------------------------------------------------------------
  dispatch    FMA0  FMA1
  EXEC1             FMA0  FM...