Effective Floating Point Code Generation with SSE/SSE2 Instructions for Intel Architecture
Disclosure Number: IPCOM000010791D
Original Publication Date: 2003-Jan-22
Included in the Prior Art Database: 2003-Jan-22
Document File: 3 page(s) / 24K

Publishing Venue



This article describes a technique for generating efficient floating point code for the Intel IA-32 architecture when SSE/SSE2 instructions are available.

This is the abbreviated version, containing approximately 34% of the total text.

  Effective Floating Point Code Generation with SSE/SSE2 Instructions for Intel Architecture

Disclosed is a technique for generating efficient floating point code for the Intel IA-32 architecture when SSE/SSE2 instructions are available. IA-32 processors have long provided the x87 floating point registers and the corresponding instruction set. The xmm registers arrived with SSE on the Pentium III, and SSE2 on the Pentium 4 extended them to double-precision arithmetic. Having both register sets makes it challenging for a compiler to generate efficient floating point code, for the following reasons.

Floating point instructions have longer latency than integer instructions, so it is important to keep values in registers as much as possible. It is therefore desirable to use both the x87 and the xmm registers for high performance.

Some functions are available in the x87 instruction set but not in the SSE/SSE2 instruction set; examples are the transcendental functions, scale, modulo, and double-to-long conversion. Data must therefore be moved between the xmm and x87 registers whenever the operand of such a function is not already in an x87 register. It is desirable to minimize this data transfer for better performance.

The xmm registers form a flat register file, in contrast to the stack organization of the x87 registers, so they are well suited to parameter passing in procedure calls. It is desirable to decide how those registers are used well before each procedure call site, in order to avoid shuffling values among them.

Among the instructions available in both the x87 and the SSE/SSE2 instruction sets, some differ greatly in throughput. For example, truncating conversion from floating point to integer is considerably faster with SSE/SSE2 instructions than with x87 instructions. It is desirable to select the faster instruction sequence for these operations.
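
As a rough illustration of the x87-only functions and the throughput difference described above, the following Python sketch picks an instruction set for one operation and reports whether its operand would have to be moved between register files. The operation names, the two sets, and the function select_unit are hypothetical and chosen only for illustration. The rule is deliberately local; it shows the tension that a cost-based selection has to resolve, not the disclosed method itself.

```python
# Hypothetical operation names; the grouping follows the text above.
X87_ONLY = {"sin", "cos", "scale", "mod", "double_to_long"}
FASTER_ON_SSE = {"double_to_int_trunc"}


def select_unit(op, operand_in):
    """Return (unit, needs_transfer) for one operation.

    operand_in is "x87" or "xmm": the register file currently holding
    the operand. needs_transfer is True when the operand has to be
    moved to the other register file first.
    """
    if op in X87_ONLY:
        unit = "x87"          # no SSE/SSE2 equivalent exists
    elif op in FASTER_ON_SSE:
        unit = "xmm"          # both exist, SSE/SSE2 has better throughput
    else:
        unit = operand_in     # available on both: stay where the value is
    return unit, unit != operand_in


if __name__ == "__main__":
    for op, where in [("sin", "xmm"), ("add", "xmm"), ("double_to_int_trunc", "x87")]:
        print(op, where, select_unit(op, where))
```

Every True in the second field stands for a transfer between register files, which is the overhead that the text says should be minimized.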


This article proposes a new technique to generate efficient floating point code by satisfying the restrictions and requirements described above.

The technique defines a cost for each floating point register and uses it to indicate which register is preferable for the operands of each floating point operation. The cost expresses how much additional penalty is incurred before the current operation can be performed if a given right-hand-side (RHS) operand reaches the current instruction in the corresponding register. The technique consists of the following three basic steps.

1. From the characteristics of each floating point instruction, define a cost for each of its RHS floating point operands as the initial register cost information.

2. By solving a backward dataflow equation, set the register cost for the left-hand-side (LHS) operand of each floating point instruction by accumulating the register costs given in the first step.

3. At the code generation phase, when a new floating point register needs to be allocated, select the register whose cost is the smallest among the available registers.
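
The three steps can be sketched in Python under several simplifying assumptions that are not taken from the disclosure. Costs are kept per register class (x87 or xmm) rather than per individual register. The backward dataflow is restricted to a single straight-line block, and only direct uses of a value contribute to its cost. The penalty numbers, the operation names, and the helpers lhs_costs and choose_class are all hypothetical.

```python
X87, XMM = "x87", "xmm"
TRANSFER = 10  # assumed penalty for moving a value between register files

# Step 1: initial cost of an RHS operand arriving in each register class.
# Assumed table: "sin" exists only on x87, "cvtt" (truncating conversion
# to integer) is cheaper with SSE/SSE2, "add" is indifferent.
INITIAL_COST = {
    "sin":  {X87: 0, XMM: TRANSFER},
    "cvtt": {X87: 8, XMM: 0},
    "add":  {X87: 0, XMM: 0},
}


def lhs_costs(block):
    """Step 2: walk the block backward and accumulate, for the value
    defined by each instruction, the costs imposed by its later uses.

    block is a list of (dest, op, sources) tuples.
    Returns {instruction index: {register class: cost}}.
    """
    pending = {}  # value name -> cost accumulated from uses seen so far
    result = {}
    for idx in range(len(block) - 1, -1, -1):
        dest, op, srcs = block[idx]
        # All later uses of dest have been seen; this is its LHS cost.
        result[idx] = pending.pop(dest, {X87: 0, XMM: 0})
        # This instruction's operands are uses of earlier definitions.
        for src in srcs:
            acc = pending.setdefault(src, {X87: 0, XMM: 0})
            for reg_class, cost in INITIAL_COST[op].items():
                acc[reg_class] += cost
    return result


def choose_class(cost, available):
    """Step 3: pick the cheapest register class among those available."""
    return min(available, key=lambda reg_class: cost[reg_class])


block = [
    ("t", "add",  ["a", "b"]),  # t = a + b
    ("u", "sin",  ["t"]),       # u = sin(t), x87 only
    ("v", "add",  ["a", "u"]),  # v = a + u
    ("i", "cvtt", ["v"]),       # i = (int) v
]
costs = lhs_costs(block)

if __name__ == "__main__":
    for idx, (dest, op, srcs) in enumerate(block):
        print(dest, op, costs[idx], choose_class(costs[idx], [X87, XMM]))
```

In this example the value t is later consumed by an x87-only operation, so x87 is the cheaper class for it, while v feeds a truncating conversion and is cheaper in xmm. A fuller treatment would iterate the dataflow equations over a control flow graph and track individual registers.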

The detailed...