Templating and Caching OpenCL Kernels

Pradeep Garigipati, ArrayFire

About a month ago, one of my colleagues did a post on how to author the most concise OpenCL program using the C++ API provided by Khronos. In today’s post, we shall further modify that example to achieve the following two goals.

  1. Enable the kernel to work with different primitive data types out of the box
  2. Ensure that the kernels compile only once at run time per data type

Let’s dive into the details now.

We can template OpenCL kernels by passing a build option -D T="typename" to the kernel compilation step. To pass such an option, we need a construct that gives us a string literal representing the corresponding type. Let us declare a struct with a static method getName and add template specializations for the types we want our kernel to handle. For our example, let's add specializations for int, float and unsigned int. The entire code snippet, with the struct declaration and template specializations, should look like the following:

template<typename T> struct CLTypes {
	static const char * getName() { return "Unsupported"; }
};

template<> struct CLTypes<int> {
	static const char * getName() { return "int"; }
};

template<> struct CLTypes<float> {
	static const char * getName() { return "float"; }
};

template<> struct CLTypes<unsigned> {
	static const char * getName() { return "unsigned"; }
};

Our next step is to abstract the OpenCL-related code out into a function, addVectors. Given the following function signature for addVectors,

template<typename T>
void addVectors(vector<T> &out, vector<T> &in1, vector<T> &in2, std::size_t nElems);

the program’s main body will look like the following:

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_GPU);

    static const unsigned elements = 32;
    vector<int> data(elements, 5);
    vector<int> result(elements);

    addVectors(result, data, data, elements);

    vector<float> data2(elements, 5);
    vector<float> result2(elements);

    addVectors(result2, data2, data2, elements);

    // We shouldn't see any compile messages for the calls below.
    for (int i = 0; i < 100; ++i) {
        addVectors(result, data, data, elements);
        addVectors(result2, data2, data2, elements);
    }

    return 0;
}

Now, we are ready to work on the code that compiles our kernel source and enqueues it on the device. We shall use the C++11 feature std::call_once to ensure single run time kernel compilation per data type. The body of the function addVectors will look like the following:

template<typename T>
void addVectors(vector<T> &out, vector<T> &in1, vector<T> &in2, std::size_t nElems)
{
    static std::once_flag	compileFlag;

    static cl::Program		addProg;
    static cl::Kernel		addkernel;

    std::call_once(compileFlag, []() {
        std::string kern =
            "__kernel void add(global const T * const a, "
            "                  global const T * const b, "
            "                  global T * restrict const c) { "
            "    unsigned idx = get_global_id(0); "
            "    c[idx] = a[idx] + b[idx]; "
            "}";

        std::ostringstream options;
        options << "-D T=" << CLTypes<T>::getName();

        addProg = cl::Program(kern, false);
        addProg.build(options.str().c_str());
        addkernel = cl::Kernel(addProg, "add");

        std::cout << "vector addition kernel compiled for type: "
                  << CLTypes<T>::getName() << std::endl;
    });

    Buffer a(begin(in1), end(in1), true);
    Buffer b(begin(in2), end(in2), true);
    Buffer c(CL_MEM_READ_WRITE, nElems * sizeof(T));

    auto addOp = make_kernel<Buffer, Buffer, Buffer>(addkernel);

    addOp(EnqueueArgs(nElems), a, b, c);

    cl::copy(c, begin(out), end(out));
}

That is all that needs to be done to use the same kernel source with different data types while compiling it only once at run time per data type. The complete code sample is available here. This sample is merely an example of how you can parameterize kernel source with respect to types; more complicated algorithms would require further modifications.
