Code Optimization by Hints to the Compiler

IP.com Disclosure Number: IPCOM000118940D
Original Publication Date: 1997-Sep-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 4 page(s) / 142K

Publishing Venue

IBM

Related People

Ward, TJ: AUTHOR

Abstract

When executing programs written in high-level languages, modern CPUs normally execute fewer instructions per clock cycle than their hardware design allows. For example, a modern superscalar processor, such as the IBM 604*, has a number of characteristics which are not currently well exploited by programs written in high-level languages, such as C and C++.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 35% of the total text.

      When executing programs written in high-level languages, modern
CPUs normally execute fewer instructions per clock cycle than their
hardware design allows.  For example, a modern superscalar processor,
such as the IBM 604*, has a number of characteristics which are not
currently well exploited by programs written in high-level languages,
such as C and C++.

      The described approach modifies the compiler to enable it to
accept and act on hints given by the programmer or by a
program-generation tool and, thereby, improve the performance of code
running on these types of processors.

      Code written with these hints can still be compiled by
compilers which do not understand the hints.  In this case, the
generated code will derive no benefit but will suffer no penalty
either.
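
      The "no benefit, no penalty" property can be illustrated with a
minimal C sketch.  This abbreviated text does not show the hint syntax
of the described approach, so the sketch substitutes a present-day
analogue: the macro name HINT_UNLIKELY and the function sum_valid are
hypothetical, and the hint is mapped onto the GCC/Clang built-in
__builtin_expect.  A compiler that does not provide the built-in sees
only the bare condition.

```c
#include <assert.h>
#include <stddef.h>

/* HINT_UNLIKELY is a hypothetical name used only for this sketch; it is
   not the hint syntax of the disclosure.  GCC and Clang define __GNUC__
   and provide __builtin_expect; any other compiler gets the plain
   condition, so the program means the same thing with or without the hint. */
#if defined(__GNUC__)
#  define HINT_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#  define HINT_UNLIKELY(cond) (!!(cond))
#endif

/* Sum the non-negative entries of v.  Negative entries are assumed to be
   rare, which is what the hint tells the compiler. */
long sum_valid(const int *v, size_t n)
{
    long total = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (HINT_UNLIKELY(v[i] < 0))
            continue;               /* rare path: skip the entry */
        total += v[i];
    }
    return total;
}
```

      Calling sum_valid on the four values 3, 4, -1, 5 returns 12 whether
or not the compiler acts on the hint; only the layout of the generated
code may differ.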

      This technique can apply to all processors which support one or
more of the following characteristics:
  o  Virtual storage
  o  Instruction cache
  o  Ability to run more than one instruction per clock cycle
  o  Ability to overlap 'branch' and 'arithmetic' instructions

      When executing optimally, the IBM 604 will fetch four
instructions per clock cycle from its level 1 (on-chip) cache and
will feed them to the various execution units in the processor.  It
will keep this up continuously unless prevented; common reasons for
not being able to work at full speed are:
  o  Cache miss.  Level 1 cache is organized as 512 lines of 8
      words per line.  If an instruction address is not in cache,
      a cache line must be discarded and the correct line fetched
      from storage.  Assuming a 200 MHz clock (5 ns per cycle) and
      70 ns main memory, a cache miss causing a fetch from main
      memory will take 14 clock cycles to resolve, thereby forgoing
      the opportunity to run 56 instructions (14 cycles at 4
      instructions per cycle).  It also worsens the memory
      bottleneck in a symmetric multiprocessor, where up to 8 CPUs
      can be attached to the same memory.
  o  Address translate miss.  The virtual-to-real address
      translator is organized as 128 lines, each controlling a
      1024-instruction (4-kbyte) page.  If the translation for an
      instruction address is not held in the translator, the
      processor stalls while the translation is looked up.
      Typically, this costs more than a cache miss.
  o  Misaligned fetch.  In an 8-instruction cache line, the fetcher
      can fetch at offset 0, 2, or 4.  So, if the program counter is
      at offset 1, 3, 5, 6, or 7 from the start of the cache line,
      fewer than four of the fetched instructions are validly part
      of the program sequence; the fetch slots taken by the others
      are forgone.  This shows an advantage of sequential program
      flow.
  o  Branch dependencies.  Sometimes, the fetcher gets...
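
      The figures quoted for the cache miss and the misaligned fetch in
the list above can be reproduced with a short C sketch.  The function
names are hypothetical.  The sketch also assumes that the fetcher
starts at the highest permitted offset (0, 2, or 4) that does not
exceed the program counter; the text does not state this rule, but it
is consistent with the offsets the text lists.

```c
#include <assert.h>

#define FETCH_WIDTH       4u  /* instructions fetched per clock cycle   */
#define LINE_INSTRUCTIONS 8u  /* instructions per level 1 cache line    */
#define MAX_FETCH_START   4u  /* fetcher can start at offset 0, 2, or 4 */

/* Clock cycles needed to cover a memory latency of latency_ns at a
   clock of clock_mhz, rounded up to whole cycles. */
unsigned stall_cycles(unsigned clock_mhz, unsigned latency_ns)
{
    return (latency_ns * clock_mhz + 999u) / 1000u;
}

/* Instruction slots that go unused during a stall of the given length. */
unsigned lost_slots(unsigned cycles)
{
    return cycles * FETCH_WIDTH;
}

/* Number of fetched instructions that belong to the program sequence
   when the program counter sits at pc_offset within its cache line.
   Assumption: the fetch starts at the highest permitted offset that is
   not above pc_offset. */
unsigned useful_fetched(unsigned pc_offset)
{
    assert(pc_offset < LINE_INSTRUCTIONS);

    unsigned start = pc_offset & ~1u;   /* round down to an even offset */
    if (start > MAX_FETCH_START)
        start = MAX_FETCH_START;
    return start + FETCH_WIDTH - pc_offset;
}
```

      Under these assumptions, stall_cycles(200, 70) gives 14 and
lost_slots(14) gives 56, matching the cache-miss figures.  For the
fetch, useful_fetched returns 4 at offsets 0, 2, and 4, returns 3 at
offsets 1, 3, and 5, returns 2 at offset 6, and returns 1 at offset 7.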