Buffered Store Design to Approach Two-Ported Cache Performance with a One-Port Array

IP.com Disclosure Number: IPCOM000118880D
Original Publication Date: 1997-Aug-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 4 page(s) / 141K

Publishing Venue

IBM

Related People

Luick, DA: AUTHOR

Abstract

Disclosed is a method to hide store access to the cache, simplify bus scheduling, and permit a fast unidirectional bus for fetch and store.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 43% of the total text.

      Two-ported caches, merged instruction/data caches in
particular, require nearly twice the silicon area of a single-port
array design at constant performance.

      Generally, Reduced Instruction Set Computing (RISC) processors
generate enormous memory bandwidth requirements on an
instruction/data (second-level) cache, so that fetches alone drive
nearly 100% (about 95%) utilization of the cache array and cache data
bus.  Thus, it is desirable to accommodate stores to the cache in a
nearly transparent fashion that impacts neither fetch latency nor
fetch bandwidth, and does so without significant area cost to the
cache.

      As shown in the Figure, the proposed processor partition has
four cache Main Store Control Unit (MSCU) chips surrounding a
fixed-point processor chip and a floating-point unit.  In order to
have sufficient main-store bandwidth, a 32-byte data bus to main
store is required.  This naturally dictates a four-way partition, or
four-way interleave, of the data across the four cache arrays.  Thus,
the cache array on each of the four banks (or chips) has eight bytes,
or a double word, of data width with its own physical read/write
controls and address port.  If each of the four array banks is
allowed to be either ganged or unganged and separately controlled,
then a fetch or store of eight bytes or less, not crossing a
double-word boundary, need only access one of the four cache array
banks.
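The four-way, eight-byte interleave described above implies a simple address-to-bank mapping.  A minimal sketch, assuming the banks are selected by the two address bits just above the double-word offset (the disclosure does not give the exact bit assignment):

```python
def bank_of(addr):
    """Select one of the four 8-byte-wide cache banks.

    Bits [2:0] are the byte offset within a double word; the next
    two bits (an assumption) pick the bank, so consecutive double
    words rotate through the four banks.
    """
    return (addr >> 3) & 0x3

def crosses_dw_boundary(addr, length):
    """True if an access of `length` bytes spans two double words,
    and therefore cannot be satisfied by a single bank."""
    return (addr % 8) + length > 8
```

With this mapping, double words at byte addresses 0, 8, 16, and 24 land in banks 0 through 3, and an aligned access of eight bytes or less always stays within one bank.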

      Thus, multiple load and store accesses per cycle are possible
if the loads and stores each reference different cache banks.  Two
simultaneous array loads plus two simultaneous array stores could be
allowed if two load address generation units are implemented in the
processor.
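The bank-conflict condition that gates such multi-access cycles can be sketched as follows (a hypothetical scheduling check, using the same assumed bank-selection bits as above):

```python
def conflict_free(accesses):
    """Decide whether a set of pending accesses can all issue in
    one cycle.  `accesses` is a list of (addr, is_store) pairs;
    they may proceed together only if every access targets a
    distinct one of the four banks."""
    banks = [(addr >> 3) & 0x3 for addr, _ in accesses]
    return len(banks) == len(set(banks))
```

For example, one load and one store to double words 8 and 16 use banks 1 and 2 and can proceed together, whereas two accesses to addresses 0 and 32 both map to bank 0 and must serialize.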

      For cycle-time (critical-path), signal and Input/Output (I/O)
limitation, and complexity reasons, the processor implements only
one load/store address generation unit.  Because the four-bank cache
shown is a merged instruction/data cache, however, branch and
prefetch instructions are also fetched, often in 64- or 128-byte
lines, from the four-bank cache.  The combination of instruction and
data fetching drives the cache data bus close to 100% utilization
for extended bursts between instruction/data cache misses.  The four
cache array chips drive a common 32-byte cache data bus as a single
bus entity.

      Queueing theory reminds us that, from a bus standpoint and
given a fixed total data path, one mixed instruction/data bus of
32 bytes (one big server) will outperform two dedicated 16-byte
busses (two smaller servers: 16-byte instruction plus 16-byte data).
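This queueing argument can be illustrated with the standard M/M/1 mean-time-in-system formula, W = 1/(mu - lambda).  The service model and the numeric rates below are assumptions for illustration only; the disclosure gives no traffic figures.  Merging two busses doubles both the service rate and the offered load on the single queue:

```python
def mm1_time_in_system(lam, mu):
    """Mean time in system for an M/M/1 queue: W = 1/(mu - lam).
    Requires a stable queue (lam < mu)."""
    assert lam < mu, "queue is unstable"
    return 1.0 / (mu - lam)

# Hypothetical rates: each 16-byte bus serves at rate 1.0 and
# receives arrivals at rate 0.8 (80% utilization).
lam, mu = 0.8, 1.0
split = mm1_time_in_system(lam, mu)          # each dedicated bus
# One 32-byte bus: double the service rate, combined arrivals.
merged = mm1_time_in_system(2 * lam, 2 * mu)
```

Here `merged` comes out to exactly half of `split` (2.5 versus 5.0 time units), showing why the single wide shared bus wins at equal total data path.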

      Even though the fetch cache data bus is full, the
actual cache arrays of four-b...