
Optimising hardware acceleration by moving offload candidates before use
Disclosure Number: IPCOM000242401D
Publication Date: 2015-Jul-13
Document File: 4 page(s) / 93K

Publishing Venue

The Prior Art Database


This article describes a method of pre-caching data to the GPU before it is needed by a kernel by keeping track of where this data is last accessed before the GPU kernel call. After the last point of access the data is transferred to the GPU in good time before the kernel execution.

This is the abbreviated version, containing approximately 56% of the total text.


Optimising hardware acceleration by moving offload candidates before use

Transferring data to and from devices is a relatively expensive operation that prevents accelerators such as Graphics Processing Units (GPUs) from reaching their maximum potential. For example, with NVIDIA CUDA, an application developer who wishes to make use of a CUDA-capable GPU needs to perform the following tasks:

Specify the required GPU

Allocate memory on the GPU that will later be used - at pointer X

Transfer data to the GPU at pointer X from the CPU

Act upon the data using the GPU

Transfer data from the GPU from pointer X to the CPU
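
The five tasks above can be sketched in Java against a simulated device. This is only an illustration of the sequence: SimulatedGpu and all of its methods are hypothetical names invented here, not part of CUDA or of any JVM API, and the "device memory" is just a map held on the host.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a GPU: device memory is modelled as a map from
// an integer "pointer" to an int array. This is not a real CUDA binding.
class SimulatedGpu {
    private final Map<Integer, int[]> deviceMemory = new HashMap<>();
    private int nextPointer = 1;

    // Task 2: allocate device memory and return its pointer X.
    int allocate(int length) {
        int pointer = nextPointer++;
        deviceMemory.put(pointer, new int[length]);
        return pointer;
    }

    // Task 3: copy host data to the device at pointer X.
    void copyToDevice(int pointer, int[] host) {
        System.arraycopy(host, 0, deviceMemory.get(pointer), 0, host.length);
    }

    // Task 4: act on the data on the device (here: sort it in place).
    void sortKernel(int pointer) {
        Arrays.sort(deviceMemory.get(pointer));
    }

    // Task 5: copy the result from pointer X back to the host.
    void copyToHost(int pointer, int[] host) {
        System.arraycopy(deviceMemory.get(pointer), 0, host, 0, host.length);
    }
}

class OffloadWorkflow {
    static int[] sortOnDevice(int[] data) {
        SimulatedGpu gpu = new SimulatedGpu();  // Task 1: specify the device
        int x = gpu.allocate(data.length);      // Task 2: allocate at pointer X
        gpu.copyToDevice(x, data);              // Task 3: host to device
        gpu.sortKernel(x);                      // Task 4: run the kernel
        int[] result = new int[data.length];
        gpu.copyToHost(x, result);              // Task 5: device to host
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortOnDevice(new int[] {3, 1, 2})));
    }
}
```

In this sequence the copy in task 3 sits directly before the kernel in task 4, so the kernel cannot start until the transfer has finished.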


    This disclosure aims to address the problem that currently a user has to perform the transfer immediately before operating on the data. In this example, we will be discussing a Java Virtual Machine based solution.

[Figure: State of system memory]

    From the diagram it can be seen that a PCIe operation is required to transfer data from memory (DRAM) to graphics card memory (GDRAM). This transfer can account for a significant portion of the time spent in a GPU call to run a kernel. To illustrate the performance cost, here are the figures from some of our memory benchmarks:

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers

Transfer Size (Bytes)    Bandwidth (MB/s)
1024                     175.2
2048                     393.8
4096                     793.7
8192                     1452.4
16384                    2507.9
32768                    3986.0
512000                   9135.4
1024000                  9555.3
37826560                 9687.3
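
As a rough reading aid for the benchmark figures above, the time implied by each row is simply size divided by bandwidth. The Java sketch below performs that arithmetic; it assumes that "MB/s" means 10^6 bytes per second, which the benchmark output does not state, and the class and method names are made up for this illustration.

```java
// Estimates the per-transfer time implied by each benchmark row:
// time = size / bandwidth.
// Assumption: "MB/s" means 10^6 bytes per second.
class TransferTime {
    static final long[] SIZES_BYTES =
        {1024, 2048, 4096, 8192, 16384, 32768, 512000, 1024000, 37826560};
    static final double[] BANDWIDTH_MB_PER_S =
        {175.2, 393.8, 793.7, 1452.4, 2507.9, 3986.0, 9135.4, 9555.3, 9687.3};

    // Returns the implied transfer time in microseconds.
    // bytes / (MB/s * 1e6 bytes per MB) is seconds; multiplying by 1e6 to get
    // microseconds cancels the first factor, leaving bytes / (MB/s).
    static double microseconds(long sizeBytes, double bandwidthMBperS) {
        return sizeBytes / bandwidthMBperS;
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES_BYTES.length; i++) {
            System.out.printf("%10d bytes  %8.1f MB/s  %10.2f us%n",
                SIZES_BYTES[i], BANDWIDTH_MB_PER_S[i],
                microseconds(SIZES_BYTES[i], BANDWIDTH_MB_PER_S[i]));
        }
    }
}
```

Under that assumption the transfers of 1024 to 8192 bytes each come out at roughly 5 to 6 microseconds, which suggests a fixed per-transfer cost that only large transfers amortise.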

Transferring 100 million integers to the GPU can take around 0.1 s, which is a large addition to some operations. For example, the Thrust radix sort will sort 100 million ints in 0.1 s, so the transfer time doubles the cost of this operation.

The solution is to read ahead in the code execution path to determine whether a variable is being used on the GPU. From here, the mechanism described determines when it is best to move the data contained in the variable over to the GPU.
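
One way to picture the read-ahead step is as a scan over a straight-line list of statements: find the call that uses a variable on the GPU, then walk backwards to the last statement that touches that variable. The Java sketch below is a toy model only. Statement, PrefetchPlanner and the gpuCall flag are hypothetical names, and a real compiler or JVM would also have to handle branches, loops and aliasing, which this ignores.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Toy model of one statement on a straight-line code path. It records which
// variables the statement touches and whether it is a call to a
// GPU-accelerated function. All names are hypothetical.
class Statement {
    final String text;
    final List<String> variables;
    final boolean gpuCall;

    Statement(String text, List<String> variables, boolean gpuCall) {
        this.text = text;
        this.variables = variables;
        this.gpuCall = gpuCall;
    }
}

class PrefetchPlanner {
    // Finds the first GPU call that uses `variable` and returns the index of
    // the last earlier statement that touches it; the transfer can be issued
    // right after that statement. Returns -1 if no earlier statement touches
    // the variable, and an empty result if no GPU call uses it at all.
    static OptionalInt transferAfter(List<Statement> path, String variable) {
        for (int call = 0; call < path.size(); call++) {
            Statement s = path.get(call);
            if (s.gpuCall && s.variables.contains(variable)) {
                for (int i = call - 1; i >= 0; i--) {
                    if (path.get(i).variables.contains(variable)) {
                        return OptionalInt.of(i);
                    }
                }
                return OptionalInt.of(-1);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        List<Statement> path = Arrays.asList(
            new Statement("int[] toSort = load();", Arrays.asList("toSort"), false),
            new Statement("fill(toSort);", Arrays.asList("toSort"), false),
            new Statement("longOperation(other);", Arrays.asList("other"), false),
            new Statement("Maths.sortArray(toSort);", Arrays.asList("toSort"), true));
        System.out.println(transferAfter(path, "toSort"));
    }
}
```

In the example path the transfer of toSort can start after statement 1, leaving statement 2 to run on the CPU while the data moves.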

    The advantage of this mechanism is that by the time a function call is made to use the hardware accelerator (e.g. myCUDACall(myIntArray)), the data is already on the GPU, rather than being transferred at the moment it is required.

    In this case Maths.sortArray is a GPU accelerated function. While the long operation is occurring inside the for loop, the data in the array 'toSort' will be transferred to the device so that by the time Maths.sortArray(0,arr) is called the data is ready. We have not been able to find prior art that describes this mechanism.
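
The overlap described above can be imitated in plain Java by starting the copy on another thread once the array has had its last CPU access, running the long operation, and only then making the accelerated call. This is a minimal sketch under stated assumptions: copyToDevice, sortOnDevice and longOperation are hypothetical stand-ins that run on the CPU, and CompletableFuture is used here only to model an asynchronous transfer, not as the mechanism the disclosure itself uses.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

class PrefetchSketch {
    // Hypothetical stand-in for the host-to-device copy. A real runtime would
    // issue an asynchronous PCIe transfer here instead of cloning the array.
    static int[] copyToDevice(int[] host) {
        return host.clone();
    }

    // Hypothetical stand-in for the GPU-accelerated sort.
    static int[] sortOnDevice(int[] deviceData) {
        int[] sorted = deviceData.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // Stand-in for long-running CPU work that does not touch toSort.
    static long longOperation() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    static int[] run(int[] toSort) {
        // The last CPU access to toSort has happened, so start the copy now.
        CompletableFuture<int[]> onDevice =
            CompletableFuture.supplyAsync(() -> copyToDevice(toSort));

        // This work overlaps with the copy running on another thread.
        longOperation();

        // join() blocks only if the copy has not finished yet.
        return sortOnDevice(onDevice.join());
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(new int[] {4, 2, 7, 1})));
    }
}
```

The benefit depends on the long operation taking at least as long as the copy; if it is shorter, the accelerated call still waits for the remainder of the transfer.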



    This same mechanism could be repeated when copying the result back from the GPU to the CPU.

    At the compile stage we identify the data being sent to the GPU through any of several means. An example is that the compiler could pick up an @GPU annotation tag on the called method and then trace the passed in variables which are being copied to the GPU. Scrolling back through the code we find where the variable w...