Browse Prior Art Database

Method for removing accumulator dependencies

IP.com Disclosure Number: IPCOM000008848D
Publication Date: 2002-Jul-17
Document File: 5 page(s) / 72K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for removing accumulator dependencies. Benefits include improved functionality.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 50% of the total text.

Method for removing accumulator dependencies

Disclosed is a method for removing accumulator dependencies. Benefits include improved functionality.

Background

              A conventional operation in vector and matrix math involves accumulation. An example is code that looks like: Ai += Bi * Ci.

              A machine with multiplier accumulator (MAC) latency of 3 cycles cannot execute the example code in less than 3 cycles per iteration due to the data dependency of each iteration on the previous one. A single register being used to accumulate the sum of the whole series causes the dependency.

              Conventionally, most instructions are broken into RISC-like UOPs that alleviate the dependency somewhat. Future processors are expected to be both wider and more power efficient. They are expected to pack as much functionality into a UOP as possible, to minimize the energy expenditure, which is proportional to the number of UOPs in the core.

Description

              The disclosed method is a hardware device that eliminates the data dependency caused by MAC latency. Two mechanisms address the dependency. Mechanism 1 splits code into independent sub-loops. Mechanism 2 splits instructions that perform accumulations into two separate operations.

Mechanism 1

              To enable the utilization of a wide-issue machine, the disclosed hardware device splits the code into independent sub-loops (see Figure 1). For example, assume that a requirement is to produce two summation members per cycle. The following code can be split into even and odd parts and then summed together:

loop

A2ni += B2ni * C2ni ;

A(2n+1)i += B(2n+1)i * C(2n+1)i // broke the dependency on Ai

Loop_end

A2ni = A2ni + A(2n+1)I

              For large values of I (the loop counter), a very wide processor executes the loop at Log2N + 1 cycles.

              The mechanism must detect a wide array of code styles that employ accumulation. A key characteristic to latch on is an addition operation in which one of the sources is also the destination. Due to 32-bit architecture, practically every add instruction looks like an accumulation because one of the sources is also the destination. A sequence of instructions must be analyzed to detect an accumulation and activate the disclosed mechanism.

              For example, an instruction window contains a sequence. Assume the registers are architectural, branches are not shown, and prediction is perfect.

loop

R1 += *(R2) // R1 is the accumulator a vector stored in memory and pointed to by R2

R2 += 4 // pointer is incremented

Loop_end

              This loop is split into two parallel dependency chains. Registers TN are physical registers. The value T1 contains the value in R1 before the loop, and T2 contains the values in R2.

T4 = 0      ß new uop, assumes software cleared R1 before the loop

T5 = T2 +...