Minimizing lock contention by collective computation
Original Publication Date: 2009-Sep-22
Included in the Prior Art Database: 2009-Sep-22
Lock contention is one of the main obstacles to the performance and scalability of multi-threaded applications. This article proposes a mechanism for efficiently handling multiple lock requests in a single thread, leveraging existing vector units to perform the computations of several threads in parallel. Using this mechanism we can reduce lock contention and, in turn, improve the performance of servers that include SIMD units (e.g., IBM's VMX or VSX).
Lock contention is a well-known major performance bottleneck in multi-threaded applications [1,2]. In addition to serializing the multi-threaded application inside the locked region, performance penalties are paid while application threads wait to enter the critical section; the effective cost of the critical section thus scales dynamically with the number of contending threads. Lock contention may therefore severely limit performance scalability, and with it the benefit of the current trend towards multi-core architectures.
Proposed here is a novel technique for dealing with lock contention that takes advantage of threads frequently arriving at the lock at runtime. Our technique uses a reduction process in one thread to compute the equivalent of executing the critical section in all pending threads. The approach builds on existing compiler technology (SIMDization, reduction, parallel prefix sums, etc.); the thread serialization implied by the cumulative critical-section semantics is what allows these optimizations (namely parallel prefix, reduction, and SIMDization) to be applied.
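To make the parallel-prefix point concrete, here is a minimal sketch (class and method names are illustrative, not from this disclosure): if each pending thread's critical section adds a delta to a shared counter and returns the value it observed, an exclusive prefix sum over the batched deltas yields every thread's return value, and a single write suffices to update the counter.

```java
import java.util.Arrays;

public class PrefixCombine {
    // Exclusive prefix sum: observed[i] is the counter value thread i would
    // have seen had the batched critical sections executed one at a time.
    static long[] exclusiveScan(long base, long[] deltas) {
        long[] observed = new long[deltas.length];
        long running = base;
        for (int i = 0; i < deltas.length; i++) {
            observed[i] = running;   // per-thread result
            running += deltas[i];    // cumulative effect
        }
        return observed;             // 'running' is the one value to write back
    }

    public static void main(String[] args) {
        long[] deltas = {1, 1, 1, 1};  // four pending unit increments
        System.out.println(Arrays.toString(exclusiveScan(10, deltas)));
        // A data-parallel variant could use Arrays.parallelPrefix to compute
        // the inclusive sums that this loop computes serially.
    }
}
```

The serial loop stands in for the scan; the point is that one thread derives all pending threads' results from a single pass over the batched inputs.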
We describe our techniques in the context of Java, as it is a major development language for server applications. Nevertheless, the principles of the techniques are not limited to the Java environment, and can be applied, with proper adjustments, in the context of other programming languages and/or runtime software/hardware.
The first thread arriving at the lock of a synchronized method acquires it and processes the critical section. During this processing it inspects the queue of threads pending on the lock. If there are sufficiently many pending threads, the first thread uses SIMDization, reduction, or collective computation to perform the equivalent of the cumulative computation. For instance, if the critical section increments a counter by one, the first thread increments the counter by the number of involved threads.
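The counter case can be sketched in Java as follows (the CombiningCounter class and its queue discipline are our illustration, not part of this disclosure): each arriving thread publishes a request, and whichever thread holds the lock drains the queue and applies all pending increments with a single update.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a combining counter: the lock holder performs the cumulative
// computation on behalf of all threads queued behind it.
public class CombiningCounter {
    private long value = 0;
    private final ReentrantLock lock = new ReentrantLock();
    private final ConcurrentLinkedQueue<CountDownLatch> pending =
            new ConcurrentLinkedQueue<>();

    public void increment() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        pending.add(done);
        // Spin until our request has been applied, combining whenever we can
        // take the lock. A production version would park on the lock instead.
        while (done.getCount() > 0) {
            if (lock.tryLock()) {
                try {
                    List<CountDownLatch> batch = new ArrayList<>();
                    CountDownLatch req;
                    while ((req = pending.poll()) != null) batch.add(req);
                    value += batch.size();  // one write for the whole batch
                    for (CountDownLatch b : batch) b.countDown();
                } finally {
                    lock.unlock();
                }
            } else {
                done.await(1, TimeUnit.MILLISECONDS);
            }
        }
    }

    public long get() {
        lock.lock();
        try { return value; } finally { lock.unlock(); }
    }
}
```

Every queued request is polled exactly once by exactly one lock holder, so no increment is lost or applied twice, while the shared counter is written once per batch rather than once per thread.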
As another example, if the threads insert or remove items on a linked list, the first thread may perform the cumulative task with just a single list traversal on behalf of the remaining threads, making efficient use of the memory/cache. At the end of the reduction, the first thread releases the pending threads by delivering a response to each pending thread's continuation (as described below).
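The linked-list example above can be sketched as follows (the BatchSortedList class and its batchInsert method are illustrative names, not from this disclosure): the lock-holding thread sorts the batched keys of all pending threads and inserts them in one traversal of a sorted list, instead of restarting the traversal once per thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: a sorted singly linked list where one thread inserts the keys of
// all pending threads with a single traversal.
public class BatchSortedList {
    private static final class Node {
        final int key;
        Node next;
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    private Node head;

    // Insert every key in one pass. Sorting the batch first lets the
    // traversal resume where it left off instead of restarting per key.
    public void batchInsert(List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        Node prev = null, cur = head;
        for (int k : sorted) {
            while (cur != null && cur.key < k) { prev = cur; cur = cur.next; }
            Node n = new Node(k, cur);
            if (prev == null) head = n; else prev.next = n;
            prev = n;
        }
    }

    public List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) out.add(n.key);
        return out;
    }
}
```

A single traversal touches each node at most once regardless of the batch size, which is what makes the cumulative version cache-friendly compared to one full traversal per pending thread.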
The technique is based on the ability of one thread to observe and control the c...