Method for a filter mechanism for the reduction of snoop traffic and the efficient use of L2 caches in a CMP system

IP.com Disclosure Number: IPCOM000008011D
Publication Date: 2002-May-10
Document File: 7 page(s) / 60K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for a filter mechanism for reduction of snoop traffic and efficient use of L2 caches in a chip multiprocessor (CMP) system. Benefits include improved performance.

Background

              Symmetric multiprocessor (SMP) system performance is often limited by the available memory and bus (communication) bandwidth. Increasing cache sizes and migrating to faster buses is an option, but this solution usually increases die area and system cost.

              Chip multiprocessor (CMP) systems have emerged as a viable alternative to increasingly complex microarchitectural solutions, which do not deliver application performance that scales linearly with frequency.

              Conventionally, a read request from a processor that misses the level 1 (L1) and level 2 (L2) caches is presented to the internal bus for snooping and potential servicing by the other processors within the CMP. If the snoop responses from the other processors indicate that another processor holds the data in shared or modified form, the data is provided to the requesting processor. In SMP systems, when a clean snoop hit occurs, the data is typically provided by the chipset; in CMP systems, this optimization ensures that the transaction is not passed to the external bus. However, if the other processors’ snoop response is a miss, the request is sent to the external bus. The result is a compulsory transaction on the internal bus even though the data may not reside in any of the other processors’ caches. In addition, the transaction’s appearance on the external bus is delayed because an internal response is sought before the external transaction is initiated; the internal and external bus transactions are serialized. The conventional methodology saves traffic on the external bus but may waste bandwidth and power on the internal bus and lengthen the latency of reads destined for the external bus.
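
              The serialized flow described above can be summarized in a minimal, self-contained C sketch. The sketch is illustrative only: the function and type names are hypothetical, and real hardware implements this flow in cache-controller logic rather than software.

    #include <stdio.h>

    /* Possible snoop responses from the other on-die processors. */
    typedef enum { SNOOP_MISS, SNOOP_HIT_SHARED, SNOOP_HIT_MODIFIED } snoop_result_t;

    /* Stubbed peers: pretend no other on-die cache holds the line. */
    static snoop_result_t internal_bus_snoop(unsigned long addr)
    {
        (void)addr;
        return SNOOP_MISS;
    }

    static void transfer_line_from_peer(unsigned long addr)
    {
        printf("0x%lx serviced on-die by a peer cache\n", addr);
    }

    static void external_bus_read(unsigned long addr)
    {
        printf("0x%lx fetched over the external bus\n", addr);
    }

    /* A read request that missed both L1 and L2 of the requesting core. */
    static void handle_l2_miss(unsigned long addr)
    {
        /* Step 1: compulsory internal-bus snoop, issued even when no peer
           holds the line. */
        snoop_result_t r = internal_bus_snoop(addr);

        if (r == SNOOP_HIT_SHARED || r == SNOOP_HIT_MODIFIED) {
            /* A peer holds the line: service the request on-die, so the
               transaction never reaches the external bus. */
            transfer_line_from_peer(addr);
        } else {
            /* Step 2: only after the internal snoop misses is the request
               sent externally; the two transactions are serialized, which
               delays reads that were destined for the external bus anyway. */
            external_bus_read(addr);
        }
    }

    int main(void)
    {
        handle_l2_miss(0x1000UL);
        return 0;
    }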

              Another shortcoming of the conventional methodology is that a line requested by one processor but residing in another must be sent in its entirety to the requesting processor. This transfer is required for write requests, which result in a Read For Ownership (RFO). For read requests, the result is possibly an L1/L2 eviction in the requesting processor to make room for the requested cache line. The processor may use only a portion of the cache line, but it must store the entire line in its cache. The results are extra traffic on the bus and potential eviction of valuable data from the on-chip caches; the sketch after the following paragraph makes this overhead concrete.

              The following discussion assumes that cache lines are 32 bytes and consist of four 8-byte chunks corresponding to the bus data-path width of 8 bytes.
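
              To make this geometry concrete, the following minimal C sketch models the assumed layout; the type and helper names are illustrative rather than part of the disclosure. With this geometry, a conventional whole-line transfer moves all four chunks (32 bytes) even when the requester touches a single 8-byte chunk.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES       32
    #define CHUNK_BYTES       8                         /* bus data-path width */
    #define CHUNKS_PER_LINE  (LINE_BYTES / CHUNK_BYTES) /* = 4 */

    /* A 32-byte cache line viewed as four 8-byte chunks. */
    typedef struct {
        uint64_t chunk[CHUNKS_PER_LINE];
    } cache_line_t;

    /* Which chunk of its line does a byte address fall into? */
    static unsigned chunk_index(unsigned long addr)
    {
        return (unsigned)((addr % LINE_BYTES) / CHUNK_BYTES); /* 0..3 */
    }

    int main(void)
    {
        unsigned long addr = 0x1018UL; /* falls in chunk 3 of its line */
        printf("address 0x%lx lies in chunk %u; a whole-line transfer "
               "moves %d bytes for the %d actually touched\n",
               addr, chunk_index(addr), LINE_BYTES, CHUNK_BYTES);
        return 0;
    }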

General description

              The disclosed method is a CMP solution with multiple processors on a die for reducing memory-request latency. The method exploits the high bandwidth available on a chip in t...