Caches that support QoS
Original Publication Date: 2004-Sep-07
Included in the Prior Art Database: 2004-Sep-07
Most caches today are set-associative (each set is built of X ways, typically X=4 or X=8 in L2 caches); this organization is designed to make good use of the cache capacity. The idea presented herein uses this existing mechanism to guarantee a minimum share of cache resources to groups of threads (or other cache clients).
Caches now serve a growing number of threads/clients at the same time. This can degrade performance through mutual interference. (The Pentium 4 with Hyper-Threading shows negative speed-up when running swim and art, for example.) The problem may still be underestimated, since SMT is only beginning to be implemented, but as SMT and SMP gain momentum, the problem will grow with them.
Currently, with only two threads, the problem is still small enough to be dismissed or solved by "brute force", that is, by using larger caches. Research has also suggested that the operating system should address it through scheduling. A hardware solution (as presented below) can produce better results, since it enables dynamic resource management, and can simplify the operating systems involved.
Most caches today are set-associative (each set is built of X ways, typically X=4 or X=8 in L2 caches); this organization is designed to make good use of the cache capacity. The idea presented herein uses this existing mechanism to guarantee a minimum share of cache resources to groups of threads (or other cache clients). The concept of Quality of Service (QoS) is currently foreign to the cache field (and usually not needed), although there is reason to believe that it will be required in the near future.
The basic idea is to dedicate specific ways of each set to groups of threads/clients while keeping the other ways shared. Each thread keeps permission to read or write any word in the cache (as is the case today). The new concept is that a thread is not allowed to replace data in ways dedicated to other thread/client groups. So, for example, a thread that consumes a lot of memory cannot take over 100% of the cache while other threads are starved.
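The permission rule can be sketched as a per-way ownership table. This is a minimal illustration under stated assumptions, not the disclosed hardware: the names `SHARED`, `way_owner`, `may_access`, and `may_replace`, and the particular 8-way split, are hypothetical and chosen only for the example.

```python
SHARED = None  # hypothetical marker: the way is not dedicated to any group

# Hypothetical 8-way set: ways 0-1 dedicated to group 0, ways 2-3 to group 1,
# ways 4-7 shared by all groups.
way_owner = [0, 0, 1, 1, SHARED, SHARED, SHARED, SHARED]

def may_access(group, way):
    """Reads and writes that hit are allowed in any way, as in today's caches."""
    return True

def may_replace(group, way):
    """A group may evict only from its own dedicated ways or from shared ways."""
    owner = way_owner[way]
    return owner is SHARED or owner == group

def replaceable_ways(group):
    """List the ways in which this group is allowed to replace data."""
    return [w for w in range(len(way_owner)) if may_replace(group, w)]
```

With this split, a group that owns no dedicated ways (group 2, say) can still replace data in the four shared ways, but can never evict data held in the ways reserved for groups 0 and 1.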
Note that the required hardware modifications are modest and should not affect cache size or timing:
1. Add a QoS level (= group number in the terminology used here) to read / write requests.
2. Change the replacement policy (today mostly LRU) to "an invalid way, otherwise the LRU way among the group's own ways and the shared ways". (This is just one example; other variants exist.)
Each memory request is now accompanied by a thread group number (QoS level). The division policy is decided outside the cache and can be changed dynamically.
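The modified replacement policy can be sketched in software as follows. This is a simplified model under stated assumptions, not the disclosed implementation: the class `QosCacheSet`, its method names, the timestamp-based LRU, and the choice to bypass the set when a group has no replaceable way are all hypothetical, and tags are assumed to be integers.

```python
class QosCacheSet:
    """One set of a set-associative cache with group-restricted replacement."""

    def __init__(self, way_owner):
        self.way_owner = list(way_owner)       # group id per way, None = shared
        self.tags = [None] * len(way_owner)    # None marks an invalid way
        self.last_used = [0] * len(way_owner)  # timestamps used for LRU ordering
        self.clock = 0

    def set_policy(self, way_owner):
        """The division policy comes from outside the cache and may change at run time."""
        self.way_owner = list(way_owner)

    def _touch(self, way):
        self.clock += 1
        self.last_used[way] = self.clock

    def access(self, tag, group):
        """Return (hit, way). A hit may occur in any way; a miss evicts only
        from ways the requesting group is allowed to replace."""
        for way, stored in enumerate(self.tags):
            if stored == tag:
                self._touch(way)
                return True, way
        allowed = [w for w, owner in enumerate(self.way_owner)
                   if owner is None or owner == group]
        if not allowed:
            return False, None  # assumption: the request bypasses this set
        invalid = [w for w in allowed if self.tags[w] is None]
        # "Invalid, otherwise LRU among the group's own and the shared ways"
        victim = invalid[0] if invalid else min(allowed, key=lambda w: self.last_used[w])
        self.tags[victim] = tag
        self._touch(victim)
        return False, victim


# Usage: a 4-way set with way 0 for group 0, way 1 for group 1, ways 2-3 shared.
s = QosCacheSet([0, 1, None, None])
first = s.access(1, group=1)          # group 1 loads tag 1 into its own way
for t in range(100, 110):             # group 0 streams through many lines
    s.access(t, group=0)
after_stream = s.access(1, group=1)   # tag 1 survived the streaming thread
```

In this sketch the streaming group can evict only from way 0 and the shared ways, so the line held in group 1's dedicated way is still present afterwards. Calling `set_policy` with a new ownership list changes which ways each group may replace without touching the stored data.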
Examples (All for 4 groups using an 8-...