A builtin to improve compiler generated SIMD code

IP.com Disclosure Number: IPCOM000018651D
Original Publication Date: 2003-Jul-30
Included in the Prior Art Database: 2003-Jul-30
Document File: 2 page(s) / 52K

Publishing Venue

IBM

Abstract

Disclosed is a builtin to improve compiler-generated Single Instruction Multiple Data (SIMD) instructions. This is an enhancement to the Superword Level Parallelism (SLP) algorithm for generating SIMD instructions.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Larsen and Amarasinghe [*] document an algorithm, Superword Level Parallelism (SLP), for generating SIMD instructions within a basic block of a program. The algorithm begins with:

1. Find aligned loads/stores in a basic block and pair them up.

2. Follow use/def chains to find pairs of instructions that can be parallelized, using an estimated benefit function.
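
As a hypothetical illustration (not part of the disclosure), consider a small basic block where step 1 seeds pairs from adjacent aligned array accesses and step 2 extends them along the use/def chains:

    /* Hypothetical example, assuming suitably aligned arrays of doubles.
     * Step 1 pairs the adjacent loads of a[0]/a[1] and b[0]/b[1] and the
     * adjacent stores to c[0]/c[1]; step 2 follows the use/def chains
     * from the stores and pairs the two additions, so the whole block
     * can become a single two-wide SIMD add. */
    void add2(double *c, const double *a, const double *b)
    {
        c[0] = a[0] + b[0];
        c[1] = a[1] + b[1];
    }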

This works nicely for small basic blocks, but for large basic blocks involving many computations, following use/def chains to find pairs of instructions can lead to an explosion of possible pairs. The compiler might decide to pair up two instructions that seem to match well, but selecting this pair may prevent longer chains of parallel computations from being generated, because each instruction may belong to only one such pair.
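
A hypothetical sketch of the conflict (the statements and the heuristic's choice are illustrative, not from the disclosure):

    void f(const double *a, const double *b, const double *c,
           const double *d, double *u)
    {
        double t0 = a[0] + b[0];   /* s1 */
        double t1 = a[1] + b[1];   /* s2 */
        double t2 = c[0] + b[0];   /* s3 */
        u[0] = t0 * d[0];          /* s4 */
        u[1] = t1 * d[1];          /* s5 */
        u[2] = t2;
        /* s1 could be paired with s2 or with s3: all three are isomorphic
         * additions.  If the estimated-benefit function picks (s1, s3),
         * then s2 is left unpaired and the multiplies s4/s5 lose their
         * packed inputs t0/t1, so the longer parallel chain formed by
         * (s1, s2) followed by (s4, s5) is never generated. */
    }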

The problem is to correctly pair up instructions to extract maximal parallelization in the block. By the time the SLP algorithm runs, enough optimization has already been done that stores to local variables, and the loads from them, have been eliminated. Since step 1 seeds the pairing from aligned loads and stores, eliminating those accesses leaves the algorithm with fewer starting points, which hinders it from finding maximal parallelization in the block.
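
For example (a hypothetical sketch, assuming a scalar-replacement pass runs before SLP):

    void g(const double *a, const double *b, const double *d, double *c)
    {
        /* As written, the aligned accesses to tmp[] would seed SLP pairs: */
        double tmp[2];
        tmp[0] = a[0] * b[0];
        tmp[1] = a[1] * b[1];
        c[0] = tmp[0] + d[0];
        c[1] = tmp[1] + d[1];
        /* Once earlier passes promote tmp[] to registers, its stores and
         * loads disappear; only the stores to c[] remain as seeds, even
         * though the same parallelism is still present in the block. */
    }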

This invention allows a user to specify, at a given point in the program, that two or more computations should be grouped together and computed in parallel. These computations are used, along with the aligned loads/stores, as input to step 2 above.

A new builtin is added to the language: void __compute_parallel(T, T, ...); where T is a computation type (e.g., double) or a generic type...
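
The text is truncated at this point in the abbreviated version. A hypothetical use of the builtin might look like the following; the surrounding code and the grouping shown are assumptions, and only the __compute_parallel name comes from the disclosure:

    /* Hypothetical usage: the partial sums s0/s1 and the arrays are
     * illustrative; only __compute_parallel is from the disclosure. */
    double dot4(const double *a, const double *b)
    {
        double s0 = a[0] * b[0] + a[2] * b[2];
        double s1 = a[1] * b[1] + a[3] * b[3];
        /* Hint that s0 and s1 should be grouped and computed in parallel,
         * feeding them to step 2 alongside the aligned loads/stores. */
        __compute_parallel(s0, s1);
        return s0 + s1;
    }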