Publication Date: 2010-Sep-20
The IP.com Prior Art Database
Sequential-program performance on multi-core processors is an important challenge. In the pre-multi-core era, Trace Processors showed significant performance potential for sequential programs but required relatively complex hardware for runtime instruction-trace construction and processing. Runtime-software-based approaches to Trace Processors that attempted to reduce the hardware complexity were hobbled by the single-core processors available at the time.
The emerging multi-core processor era has posed a major challenge to the tradition of hardware-driven improvements in single-thread performance, now requiring program parallelization for any performance improvement. Parallelization is not only a long-standing problem for both programmers and compilers, but is also simply not applicable to inherently sequential programs. Architects are considering alternatives such as helper or slave threads, speculative threads, and runtime optimizer threads to utilize the spare cores on a multi-core chip for speeding up single-thread execution. In this paper, we consider a processor architecture oriented toward single-thread performance that was proposed with considerable promise a decade ago, the Trace Processor [RJSS97, VM97], as it appears to be well suited to be adapted to the multi-core
Trace Processors consist of a set of relatively simple processing elements (PEs) fed short instruction traces by a somewhat complex hardware trace-constructor front end. Traces are short dynamic instruction sequences (say, 16 instructions) that execute in parallel on the PEs, which share a global register file and a cache hierarchy. Trace Processors could potentially map well onto current multi-core processors (Figure 1), which typically have multiple (2 to 8) in-order, single-issue threads in each processing core and multiple (2 to 16) cores on a single-chip processor. In terms of performance, each PE of a Trace Processor is roughly equivalent to a hardware thread in a multi-core. The threads on a single core share an L1 cache, and the different cores share a common L2 cache connected via a high-bandwidth crossbar interconnect. This cache hierarchy model would suit the PEs of a Trace Processor. Adding a global register file shared across the cores provides a platform for adapting Trace Processors to multi-cores. What remains is the important question of the potentially complex trace construction hardware, which could conceivably be replaced by a runtime software thread (e.g., [BDB00]) that constructs traces.
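To make the notion of a trace concrete, the following is a minimal sketch of how a software thread might cut a dynamic instruction stream into short traces. It is an illustration only, not the trace constructor of any cited design: the function name `build_traces`, the string representation of instructions, and the use of a length limit as the sole termination rule are all assumptions. A realistic constructor would likely also end traces at selected control-flow points.

```python
from typing import Iterable, Iterator, List

# Assumed limit, taken from the "say, 16 instructions" figure in the text.
MAX_TRACE_LEN = 16


def build_traces(dynamic_stream: Iterable[str],
                 max_len: int = MAX_TRACE_LEN) -> Iterator[List[str]]:
    """Split a dynamic instruction stream into consecutive traces.

    Hypothetical helper: each trace holds at most max_len instructions,
    in the order they were executed. The final trace may be shorter.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    trace: List[str] = []
    for instr in dynamic_stream:
        trace.append(instr)
        if len(trace) == max_len:
            yield trace          # trace is full: hand it to a PE
            trace = []
    if trace:
        yield trace              # flush the partial trace at end of stream


if __name__ == "__main__":
    # 40 placeholder instructions split into traces of 16, 16, and 8.
    stream = [f"i{n}" for n in range(40)]
    for number, t in enumerate(build_traces(stream)):
        print(number, len(t), t[0], t[-1])
```

Writing the constructor as a generator reflects the idea that traces are produced incrementally and can be dispatched to PEs as soon as each one is complete.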
A key question in adapting Trace Processors to multi-cores is whether the longer latencies for key Trace Processor operations in a multi-core setting are tolerable from a performance viewpoint. Important Trace Processor parameters include: trace construction time, which can take a few hundred cycles in software; trace cache access time, which can take several clocks if the trace cache is maintained in software; global register access time, which can take several clocks in a multi-core chip; and data cache access time. In this paper, we propose Trace-Core Processors (TcP), which employ meta threads instead of complex hardware to implement all the trace management features.
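One rough way to reason about whether these latencies are tolerable is to amortize the one-time construction cost over the number of times a trace is reused. The sketch below is a back-of-the-envelope model under stated assumptions, not an analysis from this paper: the record `TraceLatencies`, the function `amortized_fetch_cycles`, and every numeric value are hypothetical placeholders chosen only to match the orders of magnitude mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceLatencies:
    """Hypothetical per-operation latencies in cycles (placeholders only)."""
    trace_construction: int      # software construction of one trace
    trace_cache_access: int      # fetch of an already-built trace
    global_register_access: int  # access to the shared register file
    data_cache_access: int       # access to the data cache


def amortized_fetch_cycles(lat: TraceLatencies, executions: int) -> float:
    """Average front-end cycles per execution of one trace.

    Assumes the trace is constructed once and then fetched from the
    trace cache on every execution. This simple model is an assumption
    made for illustration.
    """
    if executions < 1:
        raise ValueError("executions must be at least 1")
    return lat.trace_construction / executions + lat.trace_cache_access


if __name__ == "__main__":
    # Placeholder values: "a few hundred cycles" and "several clocks".
    lat = TraceLatencies(trace_construction=300, trace_cache_access=4,
                         global_register_access=3, data_cache_access=2)
    for reuse in (1, 10, 100):
        print(reuse, amortized_fetch_cycles(lat, reuse))
```

Under this model, the construction latency matters less as a trace is reused more often, so per-execution cost approaches the trace cache access time for frequently executed traces.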
Baseline Multi-Core Processor
A goal of this paper is to enhance a typical current multi-core processor with novel complexity-effective hardware that s...