Variable Radix Division for Low Silicon Floating Point Units

IP.com Disclosure Number: IPCOM000114927D
Original Publication Date: 1995-Feb-01
Included in the Prior Art Database: 2005-Mar-30
Document File: 4 page(s) / 142K

Publishing Venue

IBM

Related People

Elliott, TA: AUTHOR [+4]

Abstract

The primary goal for low end floating point units is to maximize the performance in a minimum amount of usable space. Although the floating point divide instruction is not considered a 'big hitter' as far as performance is concerned, improving the divide performance is beneficial for two reasons: 1. There are some floating point benchmarks which do use the divide instruction. Speeding up the divide would benefit these benchmarks. 2. Most importantly, having a fast divide instruction gives users the perception of a high end floating point.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 50% of the total text.

      The primary goal for low end floating point units is to
maximize the performance in a minimum amount of usable space.
Although the floating point divide instruction is not considered a
'big hitter' as far as performance is concerned, improving the divide
performance is beneficial for two reasons:
  1.  There are some floating point benchmarks which do use the
      divide instruction.  Speeding up the divide would benefit these
      benchmarks.
  2.  Most importantly, having a fast divide instruction gives users
      the perception of a high end floating point.

      Fig. 1 shows a block diagram for a 2 pass, fused multiply-add
floating point unit.  The primary function performed by this unit is
'(A * C) + B', the floating point multiply-add instruction, or FMA.
Although nearly all instructions can be performed as a subset of a
single FMA, division is the exception.
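
The idea that most operations are subsets of a single FMA can be sketched as follows.  This is only an illustrative model, not the unit's datapath: the function names are hypothetical, exact integer arithmetic stands in for the hardware, and rounding is ignored.

```python
def fma(a, c, b):
    """Model of the unit's primary function, (A * C) + B."""
    return a * c + b


def fadd(a, b):
    """Addition as an FMA with C forced to 1 (assumed encoding)."""
    return fma(a, 1, b)


def fsub(a, b):
    """Subtraction as an FMA with C forced to 1 and B negated (assumed encoding)."""
    return fma(a, 1, -b)


def fmul(a, c):
    """Multiplication as an FMA with B forced to 0 (assumed encoding)."""
    return fma(a, c, 0)
```

Division has no such one-step mapping, since no fixed choice of A, C and B makes (A * C) + B equal to a quotient.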

      Earlier low end floating point units like the RSC, PowerPC 601
microprocessor and the PowerPC 603 microprocessor used a 2 bit
non-restoring division algorithm.  Non-restoring division (A/B)
iterates on the following equation:
         R(n+1) = R(n) + g*B
  where the initial R(n=0) is A and the guess 'g' can be either
positive or negative.  The alignment shifter is capable of shifting
and inverting the B operand anywhere relative to the CSA tree to
produce +-B or +-2B.  Remember that a shift of 1 bit position left
is the same as a multiplication by 2.  Therefore, if 'g' is
restricted to 0, +-1, or +-2, the alignment shifter is capable of
providing the 'g*B' in the above equation.  Fig. 1 shows that if
R(n) were placed on the feedback path into the CSA tree, it would be
possible to produce a new remainder every clock (i.e., with R(n) on
the feedback path and g*B output from the alignment shifter, the
output of the CSA tree would be R(n+1)).  By iterating with these
values, subsequent intermediate guesses could be produced which
would converge to the quotient with the desired precision.
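
The iteration above can be sketched in software as follows.  This is a minimal model under stated assumptions, not the hardware algorithm: the function name is hypothetical, exact fractions replace the CSA tree, B is rescaled by 1/4 each step to model its shifting relative to the remainder, and the guess is chosen by exact rounding rather than by a lookup table.

```python
from fractions import Fraction


def radix4_nonrestoring_divide(a, b, steps):
    """Approximate a/b with `steps` radix-4 digits drawn from {-2,...,+2}.

    Assumes b > 0 and |a| <= (8/3)*b, which holds for normalized
    significands in [1, 2).  Returns (quotient, remainder, digits).
    """
    a, b = Fraction(a), Fraction(b)
    if b <= 0 or 3 * abs(a) > 8 * b:
        raise ValueError("operands outside the range this sketch supports")

    remainder = a            # R(0) = A
    quotient = Fraction(0)
    weight = Fraction(1)     # 4**-n: position of B relative to the remainder
    digits = []
    for _ in range(steps):
        shifted_b = b * weight
        # Idealized guess: nearest integer to R(n)/B, clamped to +-2.
        # Real hardware would use a small lookup table on a few bits.
        digit = max(-2, min(2, round(remainder / shifted_b)))
        g = -digit           # sign convention so that R(n+1) = R(n) + g*B
        remainder = remainder + g * shifted_b
        quotient += digit * weight
        digits.append(digit)
        weight /= 4          # next step works two bit positions lower
    return quotient, remainder, digits
```

In this model a == quotient*b + remainder holds exactly after every step, and the quotient error after n steps is at most (2/3) * 4**-(n-1), so each iteration contributes two quotient bits.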

      A large PLA is required to create a convergent lookup table for
a radix 4 division without using a +-3 guess.  This table is as large
as or slightly larger than a radix 8 lookup with guesses up to +-7.
For a small lookup table, what is needed is a simple way of performing
the non-restoring division equation without any restrictions on the
guess 'g'.
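
Why a +-3 guess is awkward for a shift-and-invert datapath can be illustrated with a small sketch.  It assumes an integer B and a hypothetical helper that models one shift plus an optional negation; it illustrates the restriction only and does not represent the solution described in this disclosure.

```python
def single_shift_multiple(b, g):
    """Return g*b if it can be formed by one left shift and an optional
    negation of b; return None otherwise (hypothetical helper)."""
    if g == 0:
        return 0
    magnitude = abs(g)
    if magnitude & (magnitude - 1):
        # |g| is not a power of two, so one shift cannot produce it
        return None
    shifted = b << (magnitude.bit_length() - 1)
    return -shifted if g < 0 else shifted
```

For g = +-3 the helper returns None, because 3*B is 2*B + B and needs two shifted copies of B rather than one.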

      By examining how the earlier low end floating point units
accomplished the 2 bit non-restoring division, a solution to the +-3
problem can be derived.  The following discussion will reference
components from Fig. 1.  Assume an initial starting point with the
divide operands sitting in their latches above the multiply stage.
Before the divide iteration can start, the A operand must get onto
the feedback path (i.e., the initial R(n) = 'A').  To do this, C is
set to '1' and the ali...