Buffered Store Design to Approach Two-Ported Cache Performance with a One-Port Array
Original Publication Date: 1997-Aug-01
Included in the Prior Art Database: 2005-Apr-01
Disclosed is a method to hide store access to the cache, simplify bus scheduling, and permit a fast unidirectional bus for fetch and store.
Two-ported caches, merged instruction/data caches in particular, require nearly twice the silicon area of a single-port array design at constant performance.
Reduced Instruction Set Computing (RISC) processors generate enormous memory bandwidth requirements on a merged instruction/data second-level cache, so that fetches alone drive the cache array and cache data bus to nearly 100% (95%) utilization. Thus, it is desirable to accommodate stores to the cache in a nearly transparent fashion, one that impacts neither fetch latency nor fetch bandwidth and adds no significant area cost to the cache.
As shown in the Figure, the proposed processor partition has four cache/Main Store Control Unit (MSCU) chips surrounding a fixed-point processor chip and a floating-point unit. To provide sufficient main store bandwidth, a 32-byte data bus to main store is required. This naturally dictates a four-way partition, or four-way interleave, of the data across the four cache arrays. Thus, the cache array on each of the four banks (chips) is eight bytes, or one double word, wide, with its own physical read/write controls and address port. If each of the four array banks can be ganged or unganged and separately controlled, then a fetch or store of eight bytes or less that does not cross a double-word boundary need only access one of the four cache array banks.
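The bank selection implied by this interleave can be sketched as follows. This is an illustrative model, not the disclosure's logic: with four eight-byte banks, the low-order double-word address bits pick the bank, and an access of eight bytes or less that stays inside one double word touches exactly one bank. The function names and the assumption of a flat byte address are mine.

```c
#include <stdint.h>

#define DW_BYTES 8   /* double-word width of one cache bank */
#define N_BANKS  4   /* four-way interleave across the bank chips */

/* Bank holding the double word at this byte address
 * (hypothetical mapping consistent with the text above). */
static unsigned bank_of(uint32_t addr) {
    return (addr / DW_BYTES) % N_BANKS;
}

/* True if an access of 'len' bytes starting at 'addr' stays within
 * one double word, and therefore within a single bank. */
static int single_bank(uint32_t addr, uint32_t len) {
    return len <= DW_BYTES &&
           (addr / DW_BYTES) == ((addr + len - 1) / DW_BYTES);
}
```

For example, a four-byte store at address 0x1E crosses from one double word into the next, so it would occupy two banks rather than one.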
Thus, it is possible to perform multiple load and store accesses per cycle if the loads and stores all reference different cache banks. Two simultaneous array loads plus two simultaneous array stores could be allowed if two load/store address generation units were implemented in the processor.
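The rule that simultaneous accesses must land in distinct banks can be captured by a small conflict check. This is a hedged sketch under the same assumed four-way, eight-byte interleave; the function and parameter names are illustrative and not from the disclosure.

```c
#include <stdint.h>

/* Returns 1 if all n same-cycle accesses map to distinct banks and
 * may proceed together; 0 if any two collide on a bank. */
static int conflict_free(const uint32_t addr[], int n) {
    unsigned used = 0;                      /* bitmask of claimed banks */
    for (int i = 0; i < n; i++) {
        unsigned bank = (addr[i] / 8) % 4;  /* double-word interleave */
        if (used & (1u << bank))
            return 0;                       /* two accesses want this bank */
        used |= 1u << bank;
    }
    return 1;
}
```

Under this model, addresses 0x00, 0x08, 0x10, and 0x18 occupy banks 0 through 3 and could all be serviced in one cycle, while 0x00 and 0x20 both map to bank 0 and must serialize.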
For cycle-time (critical path), signal Input/Output (I/O) limitations, and complexity reasons, however, the processor implements only one load/store address generation unit. Because the four-bank cache
shown is a merged instruction/data cache, however, branch and
prefetch instructions are also fetched, often in 64- or 128-byte
lines, from the four-bank cache. The combination of instruction and
data fetching drives the cache data bus close to 100% utilization
for extended bursts between instruction/data cache misses. The
four cache array chips drive a common 32-byte cache data bus as a
single bus entity.
Queueing theory reminds us that, from a bus standpoint and given a fixed total data path, one mixed instruction/data bus of 32 bytes (one big server) will outperform two dedicated 16-byte busses (two smaller servers: a 16-byte instruction bus plus a 16-byte data bus). Even when the fetch cache data bus is full, however, the actual cache arrays of four-b...
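The "one big server" result invoked above can be illustrated with a back-of-the-envelope M/M/1 calculation. This framing is mine, not the disclosure's: model the single 32-byte bus as one server of rate 2*mu fed by the full arrival rate lambda, and the split design as two servers of rate mu, each fed by lambda/2.

```c
/* Mean time in system for an M/M/1 queue; valid only for lambda < mu.
 * Illustrative helper, not part of the disclosed design. */
static double mm1_response(double lambda, double mu) {
    return 1.0 / (mu - lambda);
}
```

With lambda = 1.5 and mu = 1.0, the combined bus gives mm1_response(1.5, 2.0) = 2.0 while each dedicated bus gives mm1_response(0.75, 1.0) = 4.0. In general the split design doubles mean response time under this model (1/(mu - lambda/2) = 2/(2*mu - lambda)), which is why the single wide bus is preferred.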