
Method for Data Transfer Optimization for a GPU in a JIT compiler

IP.com Disclosure Number: IPCOM000250407D
Publication Date: 2017-Jul-11
Document File: 4 page(s) / 28K

Publishing Venue

The IP.com Prior Art Database

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 32% of the total text.

Data transfer time is a performance bottleneck in leveraging general-purpose graphics processing units (GPUs). In the common scenario of repeatedly invoking general-purpose computing on graphics processing units (GPGPU) to perform some work, data transfer activity repeats between the host system and the GPGPU. The system performs data transfer over the Peripheral Component Interconnect Express (PCIe) bus, which has significantly lower bandwidth and higher latency than other interconnects in the system (e.g., the interconnect to the caches or to system Random Access Memory (RAM)). The following scenario is a typical example:

For each ( … )
    Copy_input_data_to_GPU()
    GPU_Kernel_execution()
    Copy_result_to_host()
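To make the per-iteration cost concrete, the pattern above can be sketched with hypothetical Python stubs standing in for the GPU runtime calls (the function names and the transfer counter are illustrative assumptions, not part of the disclosure):

```python
transfers = 0  # counts host<->GPU copies over the (simulated) PCIe bus

def copy_input_data_to_gpu(data):
    """Stub for the host-to-device copy; increments the transfer counter."""
    global transfers
    transfers += 1
    return list(data)  # pretend "device" buffer

def gpu_kernel_execution(dev_data):
    """Stand-in kernel: doubles each element on the 'device'."""
    return [x * 2 for x in dev_data]

def copy_result_to_host(dev_result):
    """Stub for the device-to-host copy; increments the transfer counter."""
    global transfers
    transfers += 1
    return list(dev_result)

iterations = 1000
data = [1, 2, 3]
for _ in range(iterations):
    dev = copy_input_data_to_gpu(data)
    dev_out = gpu_kernel_execution(dev)
    result = copy_result_to_host(dev_out)

print(transfers)  # 2 * iterations = 2000 copies across the bus
```

Every iteration pays for two bus crossings, so the transfer count grows linearly with the iteration count even when the input data never changes.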

Here, data transfers occur at each loop iteration: both the copy to the GPU and the copy out of the GPU. One key approach to reducing and minimizing the impact of data transfer on performance is to eliminate unneeded data transfers. This transforms the example above to:

Copy_input_data_to_GPU()
For each ( … )
    GPU_Kernel_execution()
Copy_result_to_host()

This is the fundamental insight of the GPU data transfer optimization. However, realizing it is more complex when the GPU kernels and data transfers are generated at runtime by a Just-In-Time (JIT) compiler. For example, if there are branching conditions within the loop, the GPU kernel might not be invoked, and the data transfers would be wasted. In addition, the host commonly requires, at various times, the data that is used or modified by the GPU; for example:

For each ( … )
    Generate_input_data()
    Copy_input_data_to_GPU()
    GPU_Kernel_execution()
    Copy_result_to_host()
    Perform_some_additional_operations_on_data()
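A minimal sketch of this host-dependent pattern (again with hypothetical stubs; names are illustrative, not from the disclosure) shows why the per-iteration copy-out cannot simply be hoisted:

```python
transfers = 0  # counts host<->GPU copies over the (simulated) PCIe bus

def copy_input_data_to_gpu(data):
    global transfers
    transfers += 1
    return list(data)

def gpu_kernel_execution(dev):
    return [x + 1 for x in dev]  # stand-in kernel

def copy_result_to_host(dev):
    global transfers
    transfers += 1
    return list(dev)

host_total = 0
for i in range(10):
    data = [i]                        # Generate_input_data()
    dev = copy_input_data_to_gpu(data)
    dev_out = gpu_kernel_execution(dev)
    result = copy_result_to_host(dev_out)
    host_total += result[0]           # Perform_some_additional_operations_on_data()

# The host reads `result` inside the loop, so eliminating the per-iteration
# copy-out would change host_total; both transfers remain in every iteration.
print(transfers)  # 20: two copies per iteration
```

Because the host consumes each iteration's result, a JIT compiler can only remove these transfers if its analysis proves the host-side use is absent or can be deferred.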

Here, at each loop iteration, the result data must go back to the host for additional processing; therefore, the data transfer optimization may not be possible. Further, under some conditions, the GPU needs several different variables, where the data transfer of some variables can be eliminated while that of others cannot. The JIT compiler must recognize and handle such situations. Optimizing data transfer in such scenarios, while maintaining program correctness, requires deep analysis techniques and a comprehensive runtime system. Finally, to realize this optimization in a real production system, the JIT compiler must deal with other details, such as handling non-optimizable situations, managing memory (allocation/deallocation), and providing reliability, availability, and serviceability (RAS) capabilities. Several academic publications present information about the optimization of GPU data transfer activity [1, 2]. These existing methods do not utilize any runtime/profiling information to enhance performance. Such an approach can be highly conservative because it cannot apply the optimization in many cases where there is a concern that performance will be degraded; for example:

For each ( …. ) Gener...