Quest for the Smallest OpenCL Program

Umar Arshad | C/C++, OpenCL

I have heard many complaints about the verbosity of the OpenCL API. These complaints are not unwarranted.

The verbosity is due to the low-level nature of OpenCL. The API is written in C, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages.

A typical OpenCL program must:

  1. Query for the platform
  2. Get the device IDs from the platform
  3. Create a context from a set of device IDs
  4. Create a command queue from the context
  5. Create buffer objects for your data
  6. Transfer the data to the buffer
  7. Create and build a program from source
  8. Extract the kernels
  9. Launch the kernels
  10. Transfer the data to the host

Wow, that was longer than I thought it would be. As you can imagine, this can be a daunting task for anyone who is not familiar with the OpenCL API.

For kicks and giggles, I decided to create the smallest OpenCL program with a focus on simplicity and readability. I am going to make use of the excellent C++ API from the Khronos website. Technically, the smallest program would only transfer data to and from the device, but I wanted to make something less trivial. Here are my requirements for the program:

  • Transfer data from host to device
  • Add two vectors and store the result in a third
  • Return the data to the host
  • Print the results

Here is my attempt:

#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using namespace cl;
using namespace std;

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_DEFAULT);
    static const unsigned elements = 1000;
    vector<float> data(elements, 5);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
    add(EnqueueArgs(elements), a, b, c);

    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));

    std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}

Not bad, eh? The biggest savings were realized by making use of the default platform, context, and command queue objects in the C++ API. I also took advantage of a few C++11 features, including raw string literals and auto. This code was tested on OS X using the clang++ compiler. It should compile with Visual Studio 2013 and GCC with little or no changes.
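To illustrate the raw string literal used for the kernel source, here is a small sketch in plain C++11 that needs no OpenCL headers. The helper names (add_kernel_source, escaped_example, raw_example) are hypothetical and exist only for this illustration; the "d" delimiter is arbitrary and simply matches the one in the listing above.

```cpp
#include <string>

// Hypothetical helper: the kernel source from the article, kept readable
// because a raw string literal preserves newlines and needs no escapes.
inline std::string add_kernel_source() {
    return R"d(
        kernel void add(global const float * restrict const a,
                        global const float * restrict const b,
                        global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d";
}

// The same two-line text written with an ordinary literal (escapes needed)...
inline std::string escaped_example() { return "line one\n\"quoted\""; }

// ...and with a raw literal (newline and quotes written as-is).
inline std::string raw_example() { return R"d(line one
"quoted")d"; }
```

Both example functions return the same string; the raw form just avoids the escape sequences, which is what keeps the embedded kernel easy to read.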

Let’s dive into the code. A default context can be created by passing CL_DEVICE_TYPE_DEFAULT to the cl::Context constructor. This allows you to call several OpenCL functions without specifying a context. In many cases, managing a context explicitly is not necessary and only adds clutter to the code.

I am also taking advantage of the special Buffer constructors in the C++ API which take iterators as inputs. This allows you to allocate the correct amount of memory and transfer the data with one call.
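As a rough illustration of what "the correct amount of memory" means for an iterator pair, the sketch below computes the byte count spanned by a range. The helper byte_size is hypothetical and not part of cl.hpp; it assumes the wrapper sizes the buffer as the element count times the size of the element type, which matches the elements * sizeof(float) passed explicitly for buffer c.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper (not part of cl.hpp): number of bytes covered by the
// range [first, last), i.e. element count times the size of one element.
template <typename Iterator>
std::size_t byte_size(Iterator first, Iterator last) {
    typedef typename std::iterator_traits<Iterator>::value_type value_type;
    return static_cast<std::size_t>(std::distance(first, last)) * sizeof(value_type);
}
```

For the 1000-element vector of floats in the listing, byte_size(begin(data), end(data)) gives 1000 * sizeof(float), the same size that buffer c is created with.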

I especially like make_kernel in the C++ API. It creates a functor (an object that overloads the function call operator, operator()). Let’s take a look at how it is used.

auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
add(EnqueueArgs(elements), a, b, c);

make_kernel is a template that takes the program object and the name of the kernel as arguments. The template parameters are the types of the kernel's input parameters. The result is a functor that takes an EnqueueArgs object as well as three Buffer objects. The EnqueueArgs object can be used to set the launch configuration of the kernel. In this case I took advantage of the default command queue, but a queue can also be set using the EnqueueArgs object.
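For readers who have not met functors before, here is a CPU-only sketch of the same call shape. It is not how cl.hpp implements make_kernel; LaunchSize and AddFunctor are hypothetical stand-ins for EnqueueArgs and the returned functor, and the loop plays the role of the work-items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for EnqueueArgs: carries only the global work size.
struct LaunchSize {
    std::size_t global;
    explicit LaunchSize(std::size_t g) : global(g) {}
};

// Hypothetical functor with the same call shape as the article's `add`:
// launch configuration first, then the three data arguments.
struct AddFunctor {
    void operator()(const LaunchSize& launch,
                    const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c) const {
        // Every index in [0, global) must be valid for all three vectors.
        assert(a.size() >= launch.global);
        assert(b.size() >= launch.global);
        assert(c.size() >= launch.global);
        for (std::size_t idx = 0; idx < launch.global; ++idx) {
            c[idx] = a[idx] + b[idx];  // same body as the OpenCL kernel
        }
    }
};
```

Calling it looks just like the OpenCL version: construct AddFunctor add; and then call add(LaunchSize(n), a, b, c);. The object is invoked with function call syntax because it defines operator().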

The cl::copy function can be used to transfer data to and from the device. In this case, it copies the contents of the c buffer into the result vector.
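Once the data is back on the host, a quick sanity check is cheap to add. The sketch below is an optional extra rather than part of the original program; all_equal is a hypothetical helper. Since both inputs are filled with 5, every element of the result should be exactly 10.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: true when every element of `values` equals `expected`.
// Exact comparison is fine here because 5.0f + 5.0f is exactly 10.0f.
// Note that an empty vector trivially passes the check.
inline bool all_equal(const std::vector<float>& values, float expected) {
    return std::all_of(values.begin(), values.end(),
                       [expected](float v) { return v == expected; });
}
```

In the program above, this would be called as all_equal(result, 10.0f) right after the cl::copy call.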

The C++ OpenCL API provides numerous abstractions over its C counterpart. It is also easy to mix the C and C++ interfaces in the same program. I have yet to find functionality that is missing from the C++ API. I would encourage everyone to check out the C++ Wrapper API for OpenCL.

Comments 8

  3. This looks awesome! By the way, how well will this work on nVidia devices? I guess we would need to use the 1.1 wrapper version for now?

  4. Update: for use against NVIDIA, can we link with the 1.2 wrapper and simply undefine OpenCL 1.2 before we include it? Like:

    
    #define CL_USE_DEPRECATED_OPENCL_1_1_APIS
    #include "CL/cl.h"
    #undef CL_VERSION_1_2
    #define __CL_ENABLE_EXCEPTIONS
    #include "cl-1.2.hpp"
    
    
    1. Post Author

      Ahh right. Some of the features I am using are part of the OpenCL 1.2 C++ API. I have updated the link.

      It should be noted that the 1.2 version of the C++ API can be used with older versions of OpenCL. So this can be used with devices that only support OpenCL 1.1 (such as NVIDIA's).
