Thread and process scheduling for minimal processing/communication latency
Publication Date: 2016-Sep-14
The IP.com Prior Art Database
This article presents a method for finding an optimal core-thread placement, i.e. one giving the lowest communication latency between threads.
For some applications the most crucial system parameter is latency. When a system consists of several threads that need to communicate, we can plan their placement on specific cores to improve the communication latency and throughput between them. One example is a cache shared by two cores: a thread running on one core, simply by using its data, "automatically" stashes that data into the cache memory, where it is available to a thread running on the other core that uses the same cache. This can decrease data access latency considerably, as no main-memory access cycle is required by the second thread: the data is hot and waiting in the cache. Since this proves to be important, or even crucial, in applications such as low-latency networking, there is a strong need for efficient mechanisms to achieve it.
Optimizing for two threads is simple: we can measure all reasonable combinations of thread placement. It becomes hard and complex when we go beyond two threads. To make it easier we decided to categorize threads into several groups:
- producers - a thread that mostly uses communication means to transfer data outward (for example a thread that does disk reading, i.e. gets data from hardware, and then sends it to another thread)
- consumers - a thread that mostly uses communication means to receive data from another thread (and then possibly puts it to hardware, like the last thread on network egress)
- processors - a balanced thread that both consumes and produces data; this is the type of thread that would, for example, do stream processing
- idle - a thread that communicates only minimally
Each of those thread types has a characteristic input and output behavior:

         Producer  Consumer  Processor  Idle
Input    low       high      high       low
Output   high      low       high       low
The table only shows general behavior, but in reality we know the throughput in bytes/s. The thresholds that decide how a thread is categorized are something to be tweaked. Please note that throughput maps almost directly to required latency: if a producer thread "emits" a word every second, then a communication latency of much less than 1 s is most likely not really needed.
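The categorization above can be sketched as a simple threshold test on measured throughput. This is a minimal illustration, not the article's implementation: the `THRESHOLD` value and the function name are assumptions, and as the text notes, the cut-off is something to be tuned per system.

```python
THRESHOLD = 1_000_000  # bytes/s; an assumed, tunable cut-off


def categorize(input_bps, output_bps, threshold=THRESHOLD):
    """Map a thread's measured input/output throughput (bytes/s)
    to one of the four categories from the table above."""
    high_in = input_bps >= threshold
    high_out = output_bps >= threshold
    if high_out and not high_in:
        return "producer"   # mostly sends data outward
    if high_in and not high_out:
        return "consumer"   # mostly receives data
    if high_in and high_out:
        return "processor"  # balanced: consumes and produces
    return "idle"           # minimal communication
```

For example, a thread reading 2 MB/s from another thread but writing almost nothing would land in the "consumer" bucket.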
The main idea is to schedule (by which we mean setting the affinity to a specific core) a process, lightweight process or thread (i.e. the OS entity that is schedulable) to run on the core that is best in terms of that entity's performance (latency) in communication with some other process/LWP/thread. This can be achieved by various means, such as:
- a priori knowledge of the hardware (cache architecture, hyper-threading, etc.)
- runtime experimentation - online tweaking of process/LWP/thread placement on various cores, using the measured latency as feedback
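The runtime-experimentation variant can be sketched as a loop that pins the entity to each candidate core, measures communication latency there, and keeps the best core. This is a simplified, Linux-only sketch under assumptions: `os.sched_setaffinity` is the real Linux call, but `best_core` and the `measure_latency` callback are hypothetical names standing in for whatever latency probe the system provides.

```python
import os


def pin_to_core(core, pid=0):
    """Restrict the calling process/thread (pid=0) to a single core.
    Linux-only: os.sched_setaffinity wraps sched_setaffinity(2)."""
    os.sched_setaffinity(pid, {core})


def best_core(cores, measure_latency, pin=pin_to_core):
    """Try each candidate core, measure communication latency while
    pinned there, and return (best core, lowest latency observed)."""
    best, best_lat = None, float("inf")
    for core in cores:
        pin(core)                 # move the thread to the candidate core
        lat = measure_latency()   # feedback: observed IPC latency
        if lat < best_lat:
            best, best_lat = core, lat
    return best, best_lat
```

In practice the measurement would be repeated and averaged per core, since a single latency sample is noisy.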
As for the kernel measuring inter-process communication latency, the measurement may be done as shown in figure 1: the kernel takes timestamps when processes enter kernel space with the calls that form the "ends" of an IPC exchange.
Then the latency of communication...
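The article places this timestamping inside the kernel; as a rough user-space analogue (an assumption, using Python threads and a queue rather than kernel IPC hooks), the same principle can be shown by timestamping at the two "ends" of a message transfer and taking the difference:

```python
import queue
import threading
import time


def one_way_latency():
    """Timestamp a message at the sending end and again at the receiving
    end; the difference of the two timestamps is the observed latency."""
    q = queue.Queue()

    def sender():
        q.put(time.monotonic_ns())  # timestamp taken at the sending "end"

    t = threading.Thread(target=sender)
    t.start()
    t_sent = q.get()                # blocks until the message arrives
    t_recv = time.monotonic_ns()    # timestamp taken at the receiving "end"
    t.join()
    return t_recv - t_sent          # latency in nanoseconds
```

A kernel-side implementation would take the timestamps at syscall entry instead, so the measurement also covers the kernel's own IPC path.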