Method for removing accumulator dependencies
Publication Date: 2002-Jul-17
Publishing Venue
The IP.com Prior Art Database
Abstract
Disclosed is a method for removing accumulator dependencies. Benefits include improved functionality.
The following operators can be used to better focus your queries.
( ) , AND, OR, NOT, W/#
? single char wildcard, not at start
* multi char wildcard, not at start
"..." literal
Examples:
(Cat? OR feline) AND NOT dog?
Cat? W/5 behavior
(Cat? OR feline) AND traits
Cat AND charact*
This guide provides a more detailed description of the syntax that is supported along with examples.
This search box also supports the look-up of an IP.com Digital Signature (also referred to as Fingerprint); enter the 72-, 48-, or 32-character code to retrieve details of the associated file or submission.
For a concept search, you can enter phrases, sentences, or full paragraphs in English. For example, copy and paste the abstract of a patent application or paragraphs from an article.
Concept search eliminates the need for complex Boolean syntax to inform retrieval. Our Semantic Gist engine uses advanced cognitive semantic analysis to extract the meaning of data. This reduces the chances of missing valuable information, that may result from traditional keyword searching.
The IP.com Prior Art Database
Disclosed is a method for removing accumulator dependencies. Benefits include improved functionality.
United States
English (United States)
Method for removing accumulator dependencies
Disclosed is a method for removing accumulator dependencies. Benefits include improved functionality.
Background
A conventional operation in vector and matrix math involves accumulation. An example is code that looks like: Ai += Bi * Ci.
A machine with multiplier accumulator (MAC) latency of 3 cycles cannot execute the example code in less than 3 cycles per iteration due to the data dependency of each iteration on the previous one. A single register being used to accumulate the sum of the whole series causes the dependency.
Conventionally, most instructions are broken into RISC-like UOPs that alleviate the dependency somewhat. Future processors are expected to be both wider and more power efficient. They are expected to pack as much functionality into a UOP as possible, to minimize the energy expenditure, which is proportional to the number of UOPs in the core.
Description
The disclosed method is a hardware device that eliminates the data dependency caused by MAC latency. Two mechanisms address the dependency. Mechanism 1 splits code into independent sub-loops. Mechanism 2 splits instructions that perform accumulations into two separate operations.
Mechanism 1
To enable the utilization of a wide-issue machine, the disclosed hardware device splits the code into independent sub-loops (see Figure 1). For example, assume that a requirement is to produce two summation members per cycle. The following code can be split into even and odd parts and then summed together:
loop
A2ni += B2ni * C2ni ;
A(2n+1)i += B(2n+1)i * C(2n+1)i // broke the dependency on Ai
Loop_end
A2ni = A2ni + A(2n+1)I
For large values of I (the loop counter), a very wide processor executes the loop at Log2N + 1 cycles.
The mechanism must detect a wide array of code styles that employ accumulation. A key characteristic to latch on is an addition operation in which one of the sources is also the destination. Due to 32-bit architecture, practically every add instruction looks like an accumulation because one of the sources is also the destination. A sequence of instructions must be analyzed to detect an accumulation and activate the disclosed mechanism.
For example, an instruction window contains a sequence. Assume the registers are architectural, branches are not shown, and prediction is perfect.
loop
R1 += *(R2) // R1 is the accumulator a vector stored in memory and pointed to by R2
R2 += 4 // pointer is incremented
Loop_end
This loop is split into two parallel dependency chains. Registers TN are physical registers. The value T1 contains the value in R1 before the loop, and T2 contains the values in R2.
T4 = 0 ß new uop, assumes software cleared R1 before the loop
T5 = T2 +...