Quest for the Smallest OpenCL Program

I have heard many complaints about the verbosity of the OpenCL API, and they are not unwarranted.

The verbosity is due to the low-level nature of OpenCL. The API is written in C, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages.

A typical OpenCL program must:

Query for the platform
Get the device IDs from the platform
Create a context from a set of device IDs
Create a command queue from the context
Create buffer objects for your data
Transfer the data to the buffer
Create and build a program from source
Extract the kernels
Launch the kernels
Transfer the data to the host

Wow, that was longer than I thought it would be. As you can imagine, this can be a daunting task for anyone who is not familiar with the OpenCL API.

For kicks and giggles, I decided to create the smallest OpenCL program with a focus on simplicity and readability. I am going to make use of the excellent C++ API from the Khronos website. Technically, the smallest program would only transfer data to and from the device, but I wanted to make something less trivial. Here are my requirements for the program:

Transfer data from host to device
Add two vectors and store the result in a third
Return the data to the host
Print the results

Here is my attempt:

#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using namespace cl;
using namespace std;

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_DEFAULT);
    static const unsigned elements = 1000;
    vector<float> data(elements, 5);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
    add(EnqueueArgs(elements), a, b, c);

    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));

    std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}

Not bad, eh? The biggest savings were realized by making use of the default platform, context, and command queue objects in the C++ API. I also took advantage of a few C++11 features, including raw string literals and auto. This code was tested on OS X using the clang++ compiler. It should compile with Visual Studio 2013 and GCC with little or no changes.

Let’s dive into the code. A default context can be created by passing CL_DEVICE_TYPE_DEFAULT to the cl::Context constructor. This allows you to call several OpenCL functions without specifying a context. In many cases managing a context explicitly is not necessary and only adds clutter to the code.

I am also taking advantage of the special Buffer constructors in the C++ API which take iterators as inputs. This allows you to allocate the correct amount of memory and transfer the data with one call.

I especially like make_kernel in the C++ API. It creates a functor (an object that overloads the function call operator). Let’s take a look at how it is used.

auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
add(EnqueueArgs(elements), a, b, c);

make_kernel is a class template whose constructor takes the program object and the name of the kernel as arguments. The template parameters are the types of the kernel's parameters. The resulting functor takes an EnqueueArgs object followed by three Buffer objects. The EnqueueArgs object is used to set the launch configuration of the kernel. In this case I took advantage of the default command queue, but a specific queue can also be set using the EnqueueArgs object.

The cl::copy function can be used to transfer data to and from the device. In this case, it transfers the data from the c buffer into the result vector.

The C++ OpenCL API provides numerous abstractions over its C counterpart. It is also easy to mix the C and C++ interfaces in the same program. I have yet to find any functionality that is missing from the C++ API. I would encourage everyone to check out the C++ Wrapper API for OpenCL.

Comments 8

hughperkins
February 16, 2015 at 10:07 am

This looks awesome! By the way, how well will this work on nVidia devices? I guess we would need to use the 1.1 wrapper version for now?

hughperkins
February 16, 2015 at 4:04 pm
Update: for use against nvidia, can we link with 1.2 wrapper, and simply undefine opencl 1.2 before we include it? like:
```
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#include "CL/cl.h"
#undef CL_VERSION_1_2
#define __CL_ENABLE_EXCEPTIONS
#include "cl-1.2.hpp"
```
YJ Yang
February 26, 2015 at 11:14 pm

Nice work, neat use of newer C++ features. Appreciate it.

Hao
August 10, 2015 at 12:18 pm

The link “C++ API” is OpenCL 1.1, which is incorrect. The code should include “cl.hpp” from OpenCL 1.2.

Umar (Post Author)
August 10, 2015 at 3:03 pm

Ahh right. Some of the features I am using are part of the OpenCL 1.2 C++ API. I have updated the link.

It should be noted that the 1.2 version of the C++ API can be used with the older versions of OpenCL. So this can be used with devices that only support OpenCL 1.1 (NVIDIA).
  
