To get the best performance from your CUDA or OpenCL code, it helps to keep a few optimization tips in mind.
Note: By “accelerator” we refer to GPUs, APUs, co-processors, FPGAs, and any other device capable of running CUDA or OpenCL.
- Vectorized Code: Accelerators perform best with vectorized code because the computations map naturally onto the hardware's arithmetic cores. ArrayFire functions are inherently vectorized, so if you are using ArrayFire, you are writing vectorized code (see the vectorization sketch after this list).
- Memory Transfers: Avoid excessive memory transfers. Every copy or cast to and from the accelerator moves data between CPU memory and accelerator memory, and those transfers are expensive. ArrayFire makes many automatic optimizations to minimize these transfers, moving data only when it is actually needed (see the transfer sketch below).
- Serial vs Parallel Operations: CPUs are built for serial work; accelerators are built for parallel work. For small or inherently serial operations, the CPU will likely be faster; for large, parallel operations, the accelerator will likely win. A good rule of thumb is to stay on the CPU when your data has only a few hundred elements and to move to the accelerator once you are processing more than roughly 10,000 elements. With ArrayFire, you control which segments of code run on which device by choosing where you create array objects (see the transfer sketch below).
- Loops: Loops typically imply serial processing. With CUDA or OpenCL, however, it is possible to run all iterations simultaneously as long as there are no data dependencies between them. ArrayFire makes this very easy with its gfor construct (see the gfor sketch below).
- Lazy Execution: With CUDA and OpenCL, it is important to construct kernels that contain the right amount of computation: not so much that timeouts occur, and not so little that throughput drops. ArrayFire employs a lazy-execution design that automatically constructs well-sized kernels from your algorithms. Lazy execution also means that ArrayFire does not launch GPU kernels until the results are requested, either for display or for a subsequent CPU-based computation. If you need to force an ArrayFire computation, the sync and eval functions are available (see the eval/sync sketch below).
- Write Good Timing Code: Badly written timing code often makes the accelerator look artificially slow, for example by including one-time setup costs or memory transfers in the timed region. ArrayFire comes with a convenient timing function to ensure proper benchmarks (see the timing sketch below).
- Regular Access Patterns: When subscripting, keep in mind that accelerator memory controllers are not as versatile as those on the CPU. Best performance is achieved when your subscript access patterns are regular and uniform. For example, A([1 4 2 3 5 1 2 ...]) would be slow, while A([1 2 3 4 5 6 ...]) would be faster. ArrayFire makes subscripting easy, and because ArrayFire arrays are column-major it is faster to access columns (A(span,i)) than rows (A(i,span)) (see the indexing sketch below).
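
To make the vectorization point concrete, here is a minimal sketch using ArrayFire's C++ API (the array size and the sin/scale computation are arbitrary placeholders): the same element-wise update written as an explicit CPU loop and as a single vectorized ArrayFire expression.

```cpp
#include <arrayfire.h>
#include <cmath>
#include <vector>

int main() {
    const int n = 1000000;

    // Serial version: one element at a time on the CPU.
    std::vector<float> h(n, 1.0f);
    for (int i = 0; i < n; ++i)
        h[i] = std::sin(h[i]) + 0.5f * h[i];

    // Vectorized version: one expression over the whole array, which
    // ArrayFire maps onto the accelerator's arithmetic cores.
    af::array A = af::randu(n);
    af::array B = af::sin(A) + 0.5f * A;
    B.eval();

    return 0;
}
```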
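
For the memory-transfer and CPU-vs-accelerator points, this sketch (again ArrayFire C++, with made-up sizes) shows where the copies actually happen: constructing an af::array from host data is the upload, everything computed on af::array objects stays on the accelerator, and host() is the one explicit download at the end.

```cpp
#include <arrayfire.h>
#include <vector>

int main() {
    const int n = 100000;
    std::vector<float> h_in(n, 2.0f);

    // One explicit host -> accelerator transfer; from here on,
    // the data lives in accelerator memory.
    af::array A(n, h_in.data());

    // These operations run entirely on the accelerator with no
    // intermediate copies back to the CPU.
    af::array B = af::sqrt(A * A + 1.0f);

    // One explicit accelerator -> host transfer, only when the
    // final result is actually needed on the CPU.
    std::vector<float> h_out(n);
    B.host(h_out.data());

    return 0;
}
```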
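
The gfor sketch below shows a data-independent loop over columns; the dimensions and the per-column operation are arbitrary. An ordinary for-loop would issue the column updates one at a time, while gfor evaluates all of them in a single parallel pass.

```cpp
#include <arrayfire.h>
using namespace af;

int main() {
    array A = randu(512, 100);
    array B = constant(0, 512, 100);

    // No iteration depends on another, so all 100 column updates
    // can be evaluated simultaneously on the accelerator.
    gfor (seq i, 100) {
        B(span, i) = 2.0f * A(span, i);
    }

    return 0;
}
```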
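
For lazy execution, the sketch below uses eval() to force the pending operations on an array to be compiled and launched, and sync() to block until the device has finished all queued work; the expression being evaluated is just a placeholder.

```cpp
#include <arrayfire.h>

int main() {
    af::array A = af::randu(2048, 2048);

    // ArrayFire records these element-wise operations but may not have
    // executed anything yet; they can be fused into a single kernel.
    af::array B = af::sin(A) * af::cos(A) + 1.0f;

    B.eval();   // force the kernel for B to be generated and launched
    af::sync(); // block until all queued device work has completed

    return 0;
}
```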
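
For benchmarking, ArrayFire's C++ API provides timeit(), which is intended to take care of the repetition and synchronization needed for a stable measurement. A sketch with an arbitrary matrix-multiply workload:

```cpp
#include <arrayfire.h>
#include <cstdio>

// The work to benchmark; the matrix size is arbitrary.
static void bench() {
    af::array A = af::randu(1024, 1024);
    af::array B = af::matmul(A, A);
    B.eval(); // make sure the work is launched inside the timed region
}

int main() {
    double seconds = af::timeit(bench);
    printf("matmul: %g seconds\n", seconds);
    return 0;
}
```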
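
Finally, the indexing sketch: because ArrayFire arrays are column-major, a column slice touches contiguous memory while a row slice strides across the whole matrix (the matrix size and indices are arbitrary).

```cpp
#include <arrayfire.h>
using namespace af;

int main() {
    array A = randu(4096, 4096);

    array col = A(span, 7); // contiguous, regular access: fast
    array row = A(7, span); // strided access across columns: slower

    return 0;
}
```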
What else? What are good general tips for writing CUDA & OpenCL code?