Method and apparatus for managing out of order retired instruction stream with two level counters mechanism

IP.com Disclosure Number: IPCOM000125691D
Original Publication Date: 2005-Jun-13
Included in the Prior Art Database: 2005-Jun-13

Publishing Venue

IBM

Abstract

One of the key parts of the invention is the use of tagged instruction streams before, during, and after instruction processing. To exploit the parallelism of the execution units and the speculative branch unit, each instruction is labeled with a unique GID (Group Identification) number to assist the completion buffer and a TID (Target Identification) number to assist the dispatch unit. Both the GID and the TID are assigned after the pre-decoding stage inside the instruction fetch unit. The GID marks the relationship among the fetched instructions, and the TID identifies the target execution unit to which the instruction will be dispatched, so the dispatch unit gains an early timing advantage from the pre-decoding stage. The detailed TID and GID handling is illustrated in the following patent description, with comments on almost every line of code (including a 64-deep buffer to handle wrap-around).
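The tagging step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the disclosed implementation: it assumes one GID per instruction in fetch order, wrapping at 64 to match the 64-deep buffer mentioned in the abstract, and the opcode classes, the `TID_BY_CLASS` table, and the mnemonic-prefix pre-decoder are all hypothetical.

```python
from dataclasses import dataclass

GID_DEPTH = 64  # assumed to match the 64-deep buffer mentioned in the abstract

# Hypothetical mapping from instruction class to target execution unit (TID).
TID_BY_CLASS = {"int": 0, "fp": 1, "load_store": 2, "branch": 3}


@dataclass(frozen=True)
class TaggedInstruction:
    opcode: str
    gid: int  # position in fetch order, modulo GID_DEPTH (assumed meaning)
    tid: int  # execution unit selected at pre-decode


def predecode_class(opcode: str) -> str:
    """Toy pre-decoder: classify by mnemonic prefix (an assumed encoding)."""
    if opcode.startswith("f"):
        return "fp"
    if opcode.startswith(("ld", "st")):
        return "load_store"
    if opcode.startswith("b"):
        return "branch"
    return "int"


def tag_stream(opcodes, start_gid=0):
    """Attach a GID and a TID to each fetched instruction, in fetch order."""
    gid = start_gid % GID_DEPTH
    tagged = []
    for op in opcodes:
        tid = TID_BY_CLASS[predecode_class(op)]
        tagged.append(TaggedInstruction(op, gid, tid))
        gid = (gid + 1) % GID_DEPTH  # wrap around at the buffer depth
    return tagged
```

Because the TID is fixed at pre-decode time in this sketch, a dispatch stage could route each instruction by reading `tid` directly instead of decoding the opcode again, which is the early timing advantage the abstract refers to.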

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 20% of the total text.


In modern superscalar RISC CPU design, the biggest problem is maintaining the order of the instruction stream across multiple execution units:

[Diagram 1: Typical super scalar CPU architecture (block diagram of the VISA processor). Block labels recovered from the figure: Processor Core; Bus Interface Unit with wide Data Bus; Instruction fetcher; Rename buffer; Branch Processing Unit; Cond Reg; MSR; Mux / Distributer; Instruction Reservation; Register File with multiple input ports and multiple output; Timer; Clock Multiplier/P; Instruction Dispatch Unit With speculative out of order Dispatch IID d GID; Load Queue; Load/store Unit with Load b ff; Internal Bus; Address Translation Unit for virtual address management; I-Cache D-Cache Management; Reorder buffer; Scan chain and self-test vector generation co-processor; Level 2 on Chip Data and Instruction; Completion Unit entry reorder Buffer.]


When an instruction stream is dispatched out of order and retired out of order, there must be a mechanism to restore this double layer of 'disorder' to the original order, so that conflicts over resources, target registers, and memory are avoided and the results are correct.

A superscalar processor is designed to maximize IPC (instructions per cycle) through coordinated design of the compiler and the processor core. Inside both the vector execution unit and the branch resolution unit, reservation stations and rename buffers are used extensively to assist recovery from mispredicted speculative execution. The completion buffer is designed to further reorder the out-of-order dispatched instruction sequence so that instructions retire in order, according to their original program sequence.
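The in-order retirement role of the completion buffer can be illustrated with a small circular-buffer model. This is a minimal sketch under assumptions, not the two-level counter mechanism of the disclosure (which is not reproduced in this abbreviated text): the class name `CompletionBuffer`, its methods, and the use of the GID as a direct buffer index are hypothetical.

```python
GID_DEPTH = 64  # assumed buffer depth, taken from the abstract


class CompletionBuffer:
    """Circular buffer that retires instructions strictly in allocation order."""

    def __init__(self, depth=GID_DEPTH):
        self.depth = depth
        self.done = [False] * depth  # done[gid] is set when that entry finishes
        self.head = 0                # GID of the oldest entry not yet retired
        self.tail = 0                # GID the next allocation will receive
        self.in_flight = 0           # number of allocated, unretired entries

    def allocate(self):
        """Reserve the next GID in program order; fail if the buffer is full."""
        if self.in_flight == self.depth:
            raise RuntimeError("completion buffer full")
        gid = self.tail
        self.done[gid] = False
        self.tail = (self.tail + 1) % self.depth  # wrap around
        self.in_flight += 1
        return gid

    def complete(self, gid):
        """Mark an entry finished; execution units may call this in any order."""
        self.done[gid] = True

    def retire(self):
        """Retire finished entries from the head, stopping at the first gap."""
        retired = []
        while self.in_flight and self.done[self.head]:
            retired.append(self.head)
            self.done[self.head] = False
            self.head = (self.head + 1) % self.depth
            self.in_flight -= 1
        return retired
```

In this model an instruction that finishes early simply waits in the buffer until every older instruction has also finished, so the architectural state is updated in the original program order even though execution was not.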

The implementation is divided into blocks according to function. Another consideration is the boundaries of the processor's internal buses. With these considerations in mind, the processor microarchitecture is divided into the integer execution unit (IEU), floating point unit (FPU), bus interface unit (BIU), instruction fetch unit (IFU), floating point register file (FPRF), integer register file (IRF), and some other blocks, as shown in Diagram 1.

Conceptually, the processor is designed around the pipeline that starts at the read port of the vector register file, passes through the vector execution unit, and ends at the write port of the vector register file. The instruction flow side of the design is therefore handled mainly in the IFU, where fetch and pre-decode are performed.

The concurrency problem is further complicated by the floating point instruction stream, since the FPU has separate instruction queues and execution engines.

A typical floating point or integer execution unit block diagram is shown in Diagram 2.

[Diagram 2: Typical execution unit block diagram. Block labels recovered from the figure: Execution Unit; Completion Buffer Unit; Reservation Queue and Dispatch; Register Fil...]