
Branch-Processing Instruction Cache

IP.com Disclosure Number: IPCOM000061062D
Original Publication Date: 1986-Jun-01
Included in the Prior Art Database: 2005-Mar-09
Document File: 3 page(s) / 57K

Publishing Venue

IBM

Related People

Cocke, J: AUTHOR [+2]

Abstract

This article teaches a novel instruction-fetching mechanism for a computer architecture that processes branch instructions while fetching other instructions, thereby eliminating most of the so-called branch penalty. The figure depicts the organization of the mechanism. It consists of an instruction cache, a directory integral with the cache arrays, and associated branch-processing logic and dataflow, which assumes the existence of a fixed-point processing unit and a floating-point processing unit to which non-branch instructions are shipped. For purposes of illustration, the instruction cache is logically organized as a 2-way set-associative, 8KB-capacity, 64B-linesize cache. The instruction cache spans four physical arrays; each one is organized as 128 x (128 + directory bits + special function bits + parity).



Each array can be independently addressed. In this example, a maximum of four instructions is fetched simultaneously from the instruction cache, one from each array, beginning with the current value of the instruction counter (IC) and continuing to the end of the cache line containing the requested word. To save chip area and power, sense amps are shared among four neighboring columns of the cache rows, except for the columns that store the directory and special function bits.

Each line of the instruction cache spans two physical rows of the cache arrays; therefore, logic (T) is required to transform some of the low-order bits of the address generated by the IC to access the proper row. At the beginning of each cycle, the IC points to a particular instruction, i, in one of the four arrays. Logic (T) generates the actual array addresses to point to instructions i + 1, i + 2, and i + 3, provided i <= 12. If i > 12, fewer than four instructions can be read out, and the decode (D) logic is forced to ignore some instructions. Simultaneously with the array access, the directory entries are read out.
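The stated geometry (8KB capacity, 2-way set-associative, 64B lines) implies 64 congruence classes, and the "i <= 12" cutoff is consistent with 4-byte instructions, 16 per line. As a rough sketch of that addressing, assuming the 4-byte instruction size (it is not stated explicitly in the article):

```python
LINE_SIZE = 64                            # bytes per cache line (given)
NUM_SETS = 8 * 1024 // (LINE_SIZE * 2)    # 8KB, 2-way -> 64 congruence classes
INSN_SIZE = 4                             # assumed 4-byte instructions, 16 per line

def decompose(addr):
    """Split a byte address into (tag, congruence class, byte offset in line)."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, index, offset

def fetch_window(addr):
    """Instruction slots fetched in one cycle: up to four, never past line end."""
    i = (addr % LINE_SIZE) // INSN_SIZE          # slot 0..15 within the line
    count = min(4, LINE_SIZE // INSN_SIZE - i)   # fewer than four when i > 12
    return [i + k for k in range(count)]
```

For example, a fetch starting at slot 13 of a line returns only three slots, matching the case where the decode logic must ignore some instructions.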
Each array row contains half of the directory entry for each associativity set. The comparison is performed with the IC and segment register contents to resolve a hit or miss. In the event of a hit, the decode logic (D) is enabled. As usual, a miss causes a fetch from the next level of the memory hierarchy to replace the LRU (least recently used) set in the required congruence class.

Assume four instructions are read out and latched at the sense amps. They are scanned for a branch instruction. If there is more than one branch in the group of instructions, only the first is considered. Non-branch instructions are dispatched to the execution units, where they are queued. The IC is incremented by the number of instructions pulled out of the instruction cache, and the process repeats. The fixed- and floating-point execution units pull instructions from their queues and execute them as appropriate. Depending upon the degree of performance required and the average delay to resolve a branch, i...
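The scan-and-dispatch policy described above can be sketched as follows. This is one plausible reading, assuming dispatch stops at the first branch in the group; the ('branch' / 'fixed' / 'float') opcode classes are a stand-in encoding, not from the article:

```python
def dispatch_group(group):
    """Consume up to and including the first branch in a fetched group.

    `group` is a list of (opcode_class, payload) pairs, where opcode_class
    is 'branch', 'fixed', or 'float' (a stand-in encoding). Returns the
    instructions queued at each execution unit, the branch to be processed
    (if any), and how far the instruction counter (IC) advances.
    """
    fixed_q, float_q = [], []
    branch = None
    consumed = 0
    for insn in group:
        kind, _ = insn
        consumed += 1
        if kind == 'branch':
            branch = insn          # only the first branch in the group is considered
            break
        elif kind == 'fixed':
            fixed_q.append(insn)   # queued at the fixed-point unit
        else:
            float_q.append(insn)   # queued at the floating-point unit
    return fixed_q, float_q, branch, consumed
```

The IC is then incremented by `consumed`, and the branch (if present) is handed to the branch-processing logic while the queued instructions execute.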