Speculative Latency Measurement Method and Apparatus

IP.com Disclosure Number: IPCOM000010038D
Original Publication Date: 2002-Oct-10
Included in the Prior Art Database: 2002-Oct-10
Document File: 3 page(s) / 44K

Publishing Venue

IBM


Performance counters are commonly used within processors and memory controllers to measure how often a specific event occurs. To provide maximum flexibility, these performance counters have programmable event selection logic to select different events to be counted. These events may include the number of reads or writes started, the number of cache hits and misses, and the number of times an internal queue is full. These performance counters may also be programmed to count the number of cycles data is present on a system bus, or the number of cycles a pipeline is stalled.
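
As a rough illustration of this programmable event selection, the following C sketch models a counter whose event-select field is written by software and which increments whenever the selected event fires. The event names and the structure are illustrative assumptions only and do not correspond to any particular processor's counter registers.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical event identifiers -- names are illustrative, not taken
 * from the disclosure or any real hardware manual. */
enum event_id {
    EVT_READ_STARTED,
    EVT_WRITE_STARTED,
    EVT_CACHE_HIT,
    EVT_CACHE_MISS,
    EVT_QUEUE_FULL,
    EVT_BUS_DATA_CYCLE,
    EVT_PIPELINE_STALL
};

/* A programmable performance counter: software selects which event to
 * count, and the counter increments each time the event is asserted. */
struct perf_counter {
    enum event_id selected;   /* programmable event selection */
    uint64_t      count;
};

static void perf_counter_tick(struct perf_counter *pc, enum event_id fired)
{
    if (fired == pc->selected)
        pc->count++;
}

int main(void)
{
    struct perf_counter pc = { .selected = EVT_CACHE_MISS, .count = 0 };

    /* Simulated stream of events observed over a few cycles. */
    enum event_id trace[] = { EVT_CACHE_HIT, EVT_CACHE_MISS,
                              EVT_CACHE_HIT, EVT_CACHE_MISS };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        perf_counter_tick(&pc, trace[i]);

    printf("cache misses counted: %llu\n", (unsigned long long)pc.count);
    return 0;
}
```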

In a traditional computer memory hierarchy, the latency of a cache hit is fixed and known at design time. To improve the latency of consecutive reads, cache interfaces may be pipelined, taking commands from a read queue and sending them down the read pipeline. Because loads may be inserted into the read queue faster than they can be serviced, the average read latency may increase. This can result in two different cache latency figures for a system: the best-case latency, which may rarely be achieved, and the average-case latency. In addition, if a load causes a cache miss, its latency increases further, as the data must be retrieved from the next level of cache or from main memory.
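
The gap between the best-case and average-case figures can be illustrated with a small simulation. The sketch below assumes a fixed pipeline latency for a cache hit and a pipeline that accepts one read every two cycles, so loads arriving back to back accumulate queueing delay before they are serviced; the specific numbers are made up for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* All numbers are illustrative assumptions, not from the disclosure. */
    const int best_case = 10;     /* fixed pipeline latency of a cache hit */
    const int issue_interval = 2; /* pipeline accepts one read every 2 cycles */
    const int n_loads = 8;        /* burst of loads arriving one per cycle */

    long total = 0;
    int pipe_free = 0;            /* cycle at which the pipeline is next free */

    for (int i = 0; i < n_loads; i++) {
        int arrive = i;                            /* one load per cycle     */
        int start  = arrive > pipe_free ? arrive : pipe_free;
        int done   = start + best_case;
        total += done - arrive;                    /* observed load latency  */
        pipe_free = start + issue_interval;
    }

    printf("best-case latency: %d cycles\n", best_case);
    printf("average latency  : %.1f cycles\n", (double)total / n_loads);
    return 0;
}
```

In this toy model the first load sees the best-case latency, while later loads in the burst wait in the read queue, pulling the average above the best case.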

Computer system throughput is closely related to the average memory latency. When a processor in a computer system loads new data from memory, it must wait until that memory request has been satisfied before it can continue making forward progress. To minimize the effect of memory load times, processors and memory controllers integrate cache hierarchies to reduce load latency. Compiler optimizations may insert prefetches or other hints that cause the processor to issue a memory load request before the data is needed, so that the data is already present in the cache when the processor needs it.
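
For example, a compiler or programmer can insert a prefetch hint some distance ahead of the use of the data, as in the following C sketch using the GCC/Clang __builtin_prefetch builtin; the prefetch distance of 16 elements is an arbitrary illustrative choice.

```c
#include <stddef.h>

/* Sketch of a prefetch hint: request the data several iterations ahead of
 * its use so it is (ideally) already in the cache when the load executes. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}
```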

As computer systems expand to non-uniform memory access (NUMA) architectures, the latency of accessing memory increases when data must be retrieved from a remote node. Computer programs and operating systems must be tuned to operate efficiently on a NUMA computer system. This tuning may consist of minimizing the number of remote memory accesses the processor or memory controller must perform. It may also involve altering memory access patterns or data locations to reduce the amount of time a command spends in a system queue.

Since the full type of a transaction may be unknown until the transaction has completed, such as where in the memory hierarchy a read will be satisfied from, it is necessary to allow a latency measurement to be aborted. In addition, to minimize the effect of a transaction pattern always causing the latency of the same (potentially incorrect) transaction to be measured, transactions can be selected for measurement at random intervals.
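
One possible software model of this sampling-and-abort scheme is sketched below: a transaction is selected for timing at a random interval, timed on the speculation that it will turn out to be the transaction type of interest, and the measurement is discarded if the completed transaction is of a different type. All names and details here are illustrative assumptions, not the hardware design of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Type of a transaction, known only once the transaction completes. */
enum txn_type { TXN_LOCAL_HIT, TXN_LOCAL_MISS, TXN_REMOTE };

struct latency_monitor {
    enum txn_type target;        /* transaction type we want to measure   */
    int           countdown;     /* issues until the next random sample   */
    bool          armed;         /* currently timing a speculative sample */
    uint64_t      start_cycle;
    uint64_t      total_latency; /* sum of latencies of accepted samples  */
    uint64_t      samples;       /* number of accepted samples            */
};

/* Called when a transaction is issued: at a random interval, speculatively
 * start timing it, predicting it will be of the target type. */
void monitor_issue(struct latency_monitor *m, uint64_t now)
{
    if (!m->armed && --m->countdown <= 0) {
        m->armed = true;
        m->start_cycle = now;
        m->countdown = 1 + rand() % 64;   /* next random sampling interval */
    }
}

/* Called when the sampled transaction completes and its type is finally
 * known: keep the measurement if the speculation was correct, otherwise
 * abort it. */
void monitor_complete(struct latency_monitor *m, enum txn_type actual,
                      uint64_t now)
{
    if (!m->armed)
        return;
    m->armed = false;
    if (actual != m->target)
        return;                           /* misprediction: abort sample  */
    m->total_latency += now - m->start_cycle;
    m->samples++;
}
```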

This invention provides a performance counter mode that speculatively predicts the transaction will be of the correc...