Using zero-copy buffers on integrated GPUs

Brian Kloppenborg | C/C++, OpenCL

One of the most powerful aspects of parallel programming on integrated GPUs is the ability to take advantage of shared memory and caches. The best example of this is sharing data between the CPU and GPU via zero-copy buffers. This technique lets your program avoid the O(N) cost of copying data to and from the GPU, which is particularly useful for applications that deal with real-time data streams, such as video processing.

Creating zero-copy buffers is not difficult, but there is one caveat when it comes to using them. First, let's consider how to create zero-copy buffers:

Method 1: OpenCL allocation of zero-copy buffers

The first method of allocating zero-copy buffers is quite simple: let OpenCL do it for you. To create a buffer in this fashion, simply pass the CL_MEM_ALLOC_HOST_PTR flag to the cl::Buffer constructor (or clCreateBuffer if you are using the C API):

// Instruct OpenCL to allocate host memory
::size_t size = N * sizeof(float);
cl::Buffer d_results(context, CL_MEM_ALLOC_HOST_PTR, size);

// ... launch kernels, etc.

Method 2: Programmer-allocated aligned memory

In situations where you are streaming data or doing DMA between devices, it may be necessary to use aligned memory that you allocate yourself. In this situation, Intel OpenCL devices require that the memory be aligned to a 4096-byte boundary and that its size be a multiple of 64 bytes (a cache line). At present, AMD APUs follow this same alignment convention. Aligned allocation requires calls to OS-specific functions. After the memory is allocated, you can create a buffer from it using the CL_MEM_USE_HOST_PTR flag:

::size_t size = N * sizeof(float); // size must be a multiple of 64 bytes
::size_t alignment = 4096;
// Allocate aligned memory on the host. The particular function is OS-dependent:
//  memalign(size_t alignment, size_t size) on Linux (declared in <malloc.h>), release memory with free(...)
//  _aligned_malloc(size_t size, size_t alignment) on Windows, release with _aligned_free(...)
//  posix_memalign(void ** ptr, size_t alignment, size_t size) on OSX, release with free(...)
float * h_results = (float*) memalign(alignment, size);

// Create an OpenCL buffer that uses the host pointer as its storage
cl::Buffer d_results(context, CL_MEM_USE_HOST_PTR, size, h_results);
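The size and alignment rules above are easy to get wrong when N is arbitrary, so it can help to wrap them in small helpers. The following is a minimal sketch, not part of the original post: the names kZeroCopyAlignment, kCacheLine, round_up_to_cache_line, allocate_zero_copy_host and is_zero_copy_candidate are hypothetical, the constants simply restate the 4096-byte and 64-byte figures quoted above, and the allocation uses posix_memalign, so it assumes a POSIX system (on Windows, _aligned_malloc would take its place).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed constants, restating the figures from the text above.
constexpr std::size_t kZeroCopyAlignment = 4096; // required pointer alignment in bytes
constexpr std::size_t kCacheLine = 64;           // buffer size must be a multiple of this

// Round a byte count up to the next multiple of the cache-line size.
constexpr std::size_t round_up_to_cache_line(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Hypothetical helper: allocate host memory that is 4096-byte aligned and whose
// size is padded to a multiple of 64 bytes. POSIX only; release with free().
// Returns nullptr if the allocation fails.
inline void* allocate_zero_copy_host(std::size_t bytes) {
    assert(bytes > 0);
    void* ptr = nullptr;
    if (posix_memalign(&ptr, kZeroCopyAlignment, round_up_to_cache_line(bytes)) != 0) {
        return nullptr;
    }
    return ptr;
}

// Check whether a pointer/size pair satisfies both constraints described above.
inline bool is_zero_copy_candidate(const void* ptr, std::size_t bytes) {
    return ptr != nullptr
        && reinterpret_cast<std::uintptr_t>(ptr) % kZeroCopyAlignment == 0
        && bytes % kCacheLine == 0;
}
```

With helpers like these, the padded size from round_up_to_cache_line(N * sizeof(float)) is the value you would pass both to the allocator and to the cl::Buffer constructor, so that the buffer size itself stays a multiple of 64 bytes.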

Accessing zero-copy memory buffers

Once your buffers are created, how do you access the data contained in them? Given method 2, you might think your code could directly access the elements of “h_results”, but this is not guaranteed to work. Perhaps the most misunderstood aspect of zero-copy buffers in OpenCL relates to the state of the buffer after it has been used in a kernel invocation. It is often thought that the memory in the buffer will be updated as soon as the kernel completes; however, the OpenCL specification does not guarantee this behavior. Instead, it is up to the programmer to ensure that the buffer is correctly updated before accessing the data. To do so, simply use the OpenCL map/unmap buffer commands:

// Get exclusive access to the buffer
float * temp = (float*) queue.enqueueMapBuffer(d_results, CL_TRUE, CL_MAP_READ, 0, size);

// ... use the mapped buffer

// Release the buffer back to OpenCL's control
queue.enqueueUnmapMemObject(d_results, temp);

Note that you must unmap the buffer after use to return control of the memory to OpenCL.
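Because forgetting the unmap call is an easy mistake, one option is to tie the map/unmap pair to a scope. The sketch below is an illustration rather than anything from the OpenCL bindings: ScopedMap is a hypothetical class, and it is templated on the queue and buffer types so that it compiles without the OpenCL headers. It is intended to be used with cl::CommandQueue and cl::Buffer, and it relies only on the enqueueMapBuffer and enqueueUnmapMemObject calls shown above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical RAII guard: maps a buffer on construction, unmaps it on destruction.
// Queue and Buffer are expected to be cl::CommandQueue and cl::Buffer (assumption).
template <typename T, typename Queue, typename Buffer>
class ScopedMap {
public:
    // `blocking` and `flags` are forwarded unchanged, e.g. CL_TRUE and CL_MAP_READ.
    // The mapping starts at offset 0 and covers `bytes` bytes.
    template <typename Blocking, typename Flags>
    ScopedMap(Queue& queue, Buffer& buffer, Blocking blocking, Flags flags, std::size_t bytes)
        : queue_(queue),
          buffer_(buffer),
          ptr_(static_cast<T*>(queue.enqueueMapBuffer(buffer, blocking, flags, 0, bytes))) {}

    // Hand the memory back to OpenCL when the guard goes out of scope.
    ~ScopedMap() {
        if (ptr_ != nullptr) {
            queue_.enqueueUnmapMemObject(buffer_, ptr_);
        }
    }

    // Copying would lead to a double unmap, so it is disabled.
    ScopedMap(const ScopedMap&) = delete;
    ScopedMap& operator=(const ScopedMap&) = delete;

    // Pointer to the mapped region; valid only while the guard is alive.
    T* data() const { return ptr_; }

private:
    Queue& queue_;
    Buffer& buffer_;
    T* ptr_;
};
```

Under those assumptions, the snippet above would become a block containing ScopedMap<float, cl::CommandQueue, cl::Buffer> results(queue, d_results, CL_TRUE, CL_MAP_READ, size); followed by reads through results.data(), with the unmap enqueued automatically at the closing brace.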

Further reading

Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics

AMD OpenCL Optimization Guide: Mapping and zero-copy buffers

 
