Optimising hardware acceleration by moving offload candidates before use
Publication Date: 2015-Jul-13
The IP.com Prior Art Database
This article describes a method of pre-caching data on the GPU before it is needed by a kernel, by keeping track of where the data is last accessed before the GPU kernel call. After that last point of access, the data is transferred to the GPU in good time before the kernel executes.
Transferring data to and from devices is a relatively expensive operation that prevents accelerators such as Graphics Processing Units (GPUs) from reaching their maximum potential. For example, if we look at NVIDIA CUDA, an application developer will need to perform the following tasks if they wish to make use of a CUDA-capable GPU:
Specify the required GPU
Allocate memory on the GPU that will later be used - at pointer X
Transfer data to the GPU at pointer X from the CPU
Act upon the data using the GPU
Transfer data from the GPU from pointer X to the CPU
This disclosure aims to address the problem that currently a user has to perform the transfer immediately before operating on the data. In this example, we will be discussing a Java Virtual Machine based solution.
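Under the JVM framing, the five CUDA tasks listed above can be sketched end-to-end. This is a minimal mock, not the real CUDA runtime: the MockDevice class and its malloc, copyToDevice, runKernel and copyToHost methods are hypothetical stand-ins for cudaSetDevice, cudaMalloc, cudaMemcpy (host-to-device and device-to-host) and a kernel launch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class OffloadSteps {
    /** Hypothetical stand-in for a CUDA-capable device and its memory. */
    static class MockDevice {
        private final Map<Long, int[]> memory = new HashMap<>();
        private long nextPtr = 1;

        long malloc(int length) {                 // step 2: allocate at pointer X
            long ptr = nextPtr++;
            memory.put(ptr, new int[length]);
            return ptr;
        }
        void copyToDevice(long ptr, int[] host) { // step 3: host -> device
            System.arraycopy(host, 0, memory.get(ptr), 0, host.length);
        }
        void runKernel(long ptr) {                // step 4: act on the data
            Arrays.sort(memory.get(ptr));         // e.g. a sort kernel
        }
        void copyToHost(long ptr, int[] host) {   // step 5: device -> host
            System.arraycopy(memory.get(ptr), 0, host, 0, host.length);
        }
    }

    static int[] runPipeline(int[] data) {
        MockDevice gpu = new MockDevice();        // step 1: specify the device
        long ptrX = gpu.malloc(data.length);
        gpu.copyToDevice(ptrX, data);
        gpu.runKernel(ptrX);
        int[] result = new int[data.length];
        gpu.copyToHost(ptrX, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runPipeline(new int[]{3, 1, 2}))); // [1, 2, 3]
    }
}
```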
State of system memory (http://www.prace-ri.eu/IMG/distant/jpg/cpu_gpujpg-77d4b.jpg)
From the diagram it can be seen that there is a PCIe operation required to transfer data from memory (DRAM) to graphics card memory (GDRAM). This transfer can account for a significant portion of the time spent in a GPU call to run a kernel. To illustrate the performance cost, here are figures from some of our memory benchmarks:
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes)  Bandwidth (MB/s)
When transferring 100m integers to the GPU it can take around 0.1s, which can be a huge addition in some operations (for example, the Thrust radix sort will sort 100m ints in 0.1s - the transfer time doubles the cost of this operation).
The solution is to read ahead in the code execution path to determine whether a variable is being used on the GPU. From here, the mechanism described determines when it is best to move the data contained in the variable over to the GPU.
The advantage of this mechanism is that by the time a function call is made to use the hardware accelerator (e.g. myCUDACall(myIntArray)), the data is already on the GPU - as opposed to being transferred as and when immediately required.
In this case Maths.sortArray is a GPU accelerated function. While the long operation is occurring inside the for loop, the data in the array 'toSort' will be transferred to the device so that by the time Maths.sortArray(0,arr) is called the data is ready. We have not been able to find prior art that describes this mechanism.
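The overlap described above can be sketched as follows, assuming the compiler (or a JIT pass) has rewritten the user's code so the transfer starts right after the last CPU access to the array. The transfer is simulated with a sleep, and a CPU sort stands in for the GPU kernel; the run method and device-side names are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class PrefetchSketch {
    /** Simulated host-to-device copy; returns the "device-side" array. */
    static int[] transferToDevice(int[] host) {
        try { Thread.sleep(50); } catch (InterruptedException ignored) {}
        return host.clone();
    }

    static int[] run(int[] toSort) {
        // The last CPU access to toSort happened before this point, so the
        // inserted prefetch kicks off the copy in the background...
        CompletableFuture<int[]> onDevice =
                CompletableFuture.supplyAsync(() -> transferToDevice(toSort));

        // ...while the original long-running loop proceeds on the CPU.
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i;

        // By the time the accelerated call is reached the data is usually
        // already resident; join() only blocks if the copy is still in flight.
        int[] deviceArray = onDevice.join();
        Arrays.sort(deviceArray);   // stands in for the GPU sort kernel
        return deviceArray;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(new int[]{5, 3, 4}))); // [3, 4, 5]
    }
}
```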
This same mechanism could be repeated when copying the result back from the GPU to the CPU.
At the compile stage we identify the data being sent to the GPU through any of several means. An example is that the compiler could pick up an @GPU annotation tag on the called method and then trace the passed-in variables which are being copied to the GPU. Scrolling back through the code we find where the variable w...
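One way the detection step described above could look, sketched with runtime reflection rather than a compiler pass: an @GPU annotation marks offloaded methods, and a scanner finds them so their array parameters can be traced back to a prefetch point. The annotation name comes from the text; every other name here is illustrative.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class GpuScanner {
    /** Marker for methods whose arguments should be prefetched to the device. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface GPU {}

    /** Hypothetical user class mirroring the Maths.sortArray example. */
    static class Maths {
        @GPU
        static void sortArray(int offset, int[] arr) { /* offloaded */ }
        static void plainHelper() { /* stays on the CPU */ }
    }

    /** Returns the names of methods that would be offload candidates. */
    static List<String> findOffloadCandidates(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.isAnnotationPresent(GPU.class)) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findOffloadCandidates(Maths.class)); // [sortArray]
    }
}
```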