ArrayFire-OpenGL Interop using CUDA

A lot of ArrayFire users have been interested in the usage of ArrayFire in partnership with OpenGL for graphics computation. In the long run, we do plan to expand further on the interoperablilty and make it easier through ArrayFire. For now, we have developed a small example to expand on the usage of the CUDA-OpenGL interop API to assist in the interop operations between ArrayFire and OpenGL.

Some of the advantage of direct ArrayFire-OpenGL interop are:

Faster data transfers: Since the OpenGL buffers as well as ArrayFire data reside on the GPU, we can use a direct device to device copy rather than using the CPU as an intermediate and the relatively slow PCIe interface.
Offscreen rendering: It is commonly known that rendering to screen is a slow process. Using the API shown in this example, we can render directly of off screen framebuffers and retrieve the data in ArrayFire.
Transform feedback: This example also shows the usage of Transform feedback buffers in OpenGL. Transform feedback buffers allow the user to retrieve data from in-between the rendering pipeline, for example getting back the transformed vertices after the operations in vertex shaders.

Since a lot of data transfers to CPU are eliminated, we can do a lot more processing faster.

Let us now discuss the example. This example is relatively simple and renders a single triangle.

Code Walkthrough

The code for this example can be found at our github page.

The code relies upon the following libraries:

ArrayFire 2.0
CUDA-OpenGL Interop (CUDA 5.5)
GLFW 3
GLEW (1.x, latest version recommended)

Let us walk through the example step by step. Please note that any steps that do not contribute to the interop and are just common OpenGL code will not be covered in this blog.

Initialization

int device = argc > 1? atoi(argv[1]):0;
af::deviceset(device);
af::info();

int width = 400, height = 300, channels = 4;
int num_triangles = 1, num_vertices = 3;

array af_indices = array(3, indices);
array af_vertices = array(3, num_vertices, vertices);	//3 floats per vertex, 3 vertices
array af_colors = array(3, num_vertices, colors);		//3 floats per color, 3 colors
array af_image = constant(0.0, height, width, channels);
array pkd_image = constant(0.0, channels, width, height);
array af_project = constant(-1.0, 3, num_vertices);

This section initializes ArrayFire and create arrays for the basic data structure. Since ArrayFire stores data as column major, we decided that the arrays will have each vertex as one column, ie. if you had an array of 10 vertices with 3 coordinates each, then your array would have dimensions 3×10.

Most of the array names are self explanatory. pkd_image is used to store the packed image data returned by OpenGL through the framebuffer. af_image is used to store the returned image in the ArrayFire format (unpacked). af_project stores the result of the projection returned after the vertex shader stage (using transform feedback). The size of project is same as the vertex buffer.

bool onscreen = true;			//Change to false for offscreen renderering
init_glfw(width, height, channels, onscreen);

The onscreen flag is a useful one. When it is set to true, the render happens on the GLFW window. Although not a part of the code, it is possible to retrieve the rendered image back into ArrayFire. However, turning this flag to false results in offscreen rendering. This means rendering is done directly to a software framebuffer rather than to the screen, hence it is faster. It also allows the rendered image to be retrieved back into ArrayFire. If you wish to display this image then the ArrayFire image() function can be used (will be discussed a little later).

init_glfw is used to initialize the GLFW and GLEW context.

Initialize Buffers

GLuint vertex_b = 0;
cudaGraphicsResource_t vertex_cuda;
create_buffer(vertex_b, GL_ARRAY_BUFFER, &vertex_cuda, 3 * num_vertices * sizeof(float), GL_DYNAMIC_DRAW);

These lines are an abstraction to create the OpenGL buffers and bind them to their respective CUDA Graphics Resource object.

void create_buffer(GLuint& buffer, GLenum buffer_target, cudaGraphicsResource_t* cuda_resource, const unsigned size, GLenum buffer_usage, const void* data = NULL)
{
    glGenBuffers(1, &buffer);
    glBindBuffer(buffer_target, buffer);
    glBufferData(buffer_target, size, data, buffer_usage);
    CUDA(cudaGraphicsGLRegisterBuffer(cuda_resource, buffer, cudaGraphicsRegisterFlagsNone));
    glBindBuffer(buffer_target, 0);
}

This section of code shows the creation and registering of the OpenGL buffer with CUDA. The buffer generation and bindings are done exactly as in regular OpenGL. The cudaGraphicsGLRegisterBuffer registers the buffer with CUDA. If using a texture or renderbuffer or the like, then use cudaGraphicsGLRegisterImage.

// Render Buffer
GLuint render_b = 0;
cudaGraphicsResource_t render_cuda;
create_buffer(render_b, GL_RGBA32F, width, height, &render_cuda);

// Depth Buffer - for off screen rendering
GLuint depth_b = 0;
create_buffer(depth_b, GL_DEPTH_COMPONENT, width, height);

In the code above, the framebuffer and depthbuffer are initialized which can now be used for off screen rendering.

Framebuffer, Transform Feedback and Shader Loading

//Required for framebuffer copy
GLuint frame_b = 0;
bind_framebuffer(render_b, depth_b, frame_b);

// Initialize shaders
init_program("shader.vert", "shader.frag");

// Initialize transform feedback
init_projection(transform_b, project_b);

This section of the code is used to initialize depth and framebuffers for off screen rendering. It also loads and compiles the shaders and initializes the transform feedback buffers.

Render Loop

while(!glfwWindowShouldClose(window))

GLFW allows the user to control to render loop. So we use a simple while loop with a GLFW terminating condition.

Copy Data to OpenGL Buffers

float* d_vertices = af_vertices.device<float>();
copy_from_device_pointer(vertex_cuda,
			 d_vertices,
			 GL_ARRAY_BUFFER,
			 3 * num_vertices * sizeof(float));

In the render loop, we first copy the vertex and color data to their respective OpenGL buffers. In the copy_from_device_pointer function, we have:

template<typename T>
void copy_from_device_pointer(cudaGraphicsResource_t cuda_resource,
                              T& d_ptr,
                              GLuint buffer_target,
                              const unsigned size)
{
    CUDA(cudaGraphicsMapResources(1, &cuda_resource));
    bool is_mapped = true;
    if (buffer_target == GL_RENDERBUFFER) {
        cudaArray* array_ptr = NULL;
        CUDA(cudaGraphicsSubResourceGetMappedArray(&array_ptr, cuda_resource, 0, 0));
        CUDA(cudaMemcpyToArray(array_ptr, 0, 0, d_ptr, size, cudaMemcpyDeviceToDevice));
    } else {
        T* opengl_ptr = NULL;
        CUDA(cudaGraphicsResourceGetMappedPointer((void**)&opengl_ptr, (size_t*)&size, cuda_resource));
        CUDA(cudaMemcpy(opengl_ptr, d_ptr, size, cudaMemcpyDeviceToDevice));
    }
    unmap_resource(cuda_resource, is_mapped);
}

In this function, we first “map” the cuda resource of the buffer using the cudaGraphicsMapResources function, ie. enable the buffer for access by CUDA. This allows CUDA to copy memory to and from the buffer. Once completed, we must unmap the resources.

Other Memory Copying Notes

If using a renderbuffer or texture, the memory needs to be copied using CUDA textures of cudaArrays and then using the cudaMemcpy(To/From)Array function. If using other buffers like GL_ARRAY_BUFFER, we can simply get the pointer using the cudaGraphicsResourceGetMappedPointer function and then copy it using the cudaMemcpyDeviceToDevice flag.

Copying from OpenGL buffers to CUDA is also done in the same way. See the copy_to_device_pointer() function in the code.

In the code that follows the memory copies, we call the render function which renders the data to screen or framebuffer. It uses relatively well know OpenGL calls and hence I will not be covering it in the blog. Any questions can be asked on the ArrayFire forums. Once the rendering is complete, we copy the transform feedback memory and the framebuffer memory back to ArrayFire.

On exit, delete the buffers and exit.

Compile and Run

Linux

On Linux, the Makefile is provided. In the Makefile, please update the paths to ArrayFire (AF_PATH), CUDA and GLFW (GLFW_LIB_PATH).

As is, the makefile assumes that CUDA, GLFW and GLEW is installed in the default locations.

To compile and run:

$make
$./cugl_interop

Windows

On Windows, you can create a CUDA project in Visual Studio. Then add the locations of the include and library directories of ArrayFire, GLFW and GLEW in the project properties. The code is OS-independent and should run without any modifications.

Output

The output, in case of onscreen rendering should be a rotating triangle. A sample screenshot is shown below.

Here are some useful resources to look up:

Please feel free to ask any questions about this interop example on the ArrayFire forums. We are working on a larger, more extensible interop code that we will release upon request.

Known issues: The ArrayFire graphics API is known to not work as intended when using external OpenGL context. It is advised that visualizations be limited. If they still do not work, the saveimage function can be a fallback, albeit a last resort fallback.

Github Link: https://github.com/arrayfire/ArrayFire-OpenGL-Interop