# Eliminating the Overhead of Floating Point Load and Store Instructions by Decoding Two Instructions Per Cycle in the Floating Point Unit

IP.com Disclosure Number: IPCOM000049484D
Original Publication Date: 1982-Jun-01
Included in the Prior Art Database: 2005-Feb-09
Document File: 3 page(s) / 38K

IBM

## Related People

Agerwala, TKM: AUTHOR [+2]

## Abstract

A floating point arithmetic unit is described in which two instructions per cycle are decoded. The store and load operations are overlapped in floating point loops.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 54% of the total text.

In scientific and engineering applications, a significant amount of execution time is often spent in short floating point loops. For example, in one NASTRAN subprogram executing on an IBM 3033, approximately 80 percent of the time is spent in computations characterized by the following loops:
    S:  LD    0, Ai
        MDR   0, 2
        STD   0, Ci
        BXLE  (i → i+k; if i ≤ j go to S)

Assume that a machine has a sequential floating point unit with a 3-cycle multiply and a 2-cycle add. In general, a computation in the machine would proceed as |D|---|P|, where D, P, and the bar denote decode, putaway, and execute, respectively. Assume further that D and P are overlapped with the preceding and succeeding instructions. The timing of the above loop is shown in Fig. 1. The machine would execute the loop at 7 cycles/iteration, assuming a stream of instructions and operands is available.

The timing for the proposed design is shown in Fig. 2. The decode of the first load occurs in cycle 1, causing a transfer into floating register 0; the decode of the MDR (Multiply Double Register) in the same cycle causes this data to be staged to the multiplier as well. The decode of the AD (Add Double Word) in cycle 4 eliminates the putaway in cycle 5. The STD (Store Double Word) decode in cycle 4 sets up a transfer into the store data buffer (SDB) in cycle 7. The central point here is that the overlap of the overhead instructions cannot be achieved in a unit that decodes one instruction at a time. For example, if the AD and STD were not decoded together in cycle 4, a cycle would eventually be lost. By overlapping the LD (Load Double Word) and STD, the loop time is reduced from 7 cycles/iteration to 5. If the original loop accounted for 80 percent of the execution time, a 30 percent improvement in MIPS (million instructions per second) is obtained.
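The 30 percent figure can be checked with a short calculation. The sketch below is not part of the disclosure. It assumes that the time outside the loop is unchanged and that time inside the loop scales with cycles per iteration; the function and parameter names are illustrative.

```python
def overall_speedup(loop_fraction, old_cycles, new_cycles):
    """Whole-program speedup when only the loop portion gets faster.

    Assumption: the remaining (1 - loop_fraction) of the run time is unaffected.
    """
    # Normalized new run time: untouched part plus the shortened loop part.
    new_time = (1.0 - loop_fraction) + loop_fraction * (new_cycles / old_cycles)
    return 1.0 / new_time


# Values quoted in the text: 80 percent of time in the loop, 7 -> 5 cycles/iteration.
speedup = overall_speedup(loop_fraction=0.80, old_cycles=7, new_cycles=5)
print(f"speedup = {speedup:.3f}")  # speedup = 1.296
```

A speedup of about 1.296 corresponds to roughly 30 percent more instructions per second, in line with the stated improvement.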

The controls and data path have been carefully designed so that the overlap of loads and stores can be achieved even in very tight loops. Fig. 3 shows the case where the overall computation is to add 2 vectors and place the result in storage. A conventional design would take 4 cycles/iteration; the proposed approach would execute at 2 cycles/iteration.
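The cycle counts quoted for both loops can be reproduced with a simple tally. This is a rough sketch under stated assumptions, not the timing model from the figures. The 3-cycle multiply and 2-cycle add come from the text. One cycle each for LD and STD is inferred from the quoted totals. The first loop is taken to be LD, MDR, AD, STD, following the timing discussion, and loop control (BXLE) is assumed to cost no floating point cycles.

```python
# Assumed per-instruction cost in cycles (LD and STD values are inferred, see above).
LATENCY = {"LD": 1, "MDR": 3, "AD": 2, "STD": 1}

# Instructions the text treats as overhead that dual decode can hide.
OVERHEAD = {"LD", "STD"}


def sequential_cycles(loop):
    """Cycles per iteration when every instruction occupies the unit in turn."""
    return sum(LATENCY[op] for op in loop)


def dual_decode_cycles(loop):
    """Cycles per iteration when loads and stores are fully overlapped."""
    return sum(LATENCY[op] for op in loop if op not in OVERHEAD)


multiply_add = ["LD", "MDR", "AD", "STD"]  # first loop, as assumed above
vector_add = ["LD", "AD", "STD"]           # add two vectors, store the result

for name, loop in [("multiply-add", multiply_add), ("vector add", vector_add)]:
    print(name, sequential_cycles(loop), dual_decode_cycles(loop))
```

Under these assumptions the tally gives 7 and 5 cycles for the first loop and 4 and 2 for the vector add, matching the numbers in the text.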

The IBM System/360 Model 91 floating point unit (FLU) is an existing state-of-the-art unit which obtains high speed by overlapping independent iterations of a loop. With equivalent arithmetic (3-cycle multiply, 2-cycle add), the Model 91 FLU would execute the first loop at 4 cycles/iteration and the second at 3. The proposed design is simpler than the Model 91. Moreover, since the exec...