Custom Kernels with ArrayFire

Pavan Yalamanchili

As extensive as ArrayFire is, there are still cases where you need to work with custom CUDA or OpenCL kernels. For example, you may want to integrate ArrayFire into an existing code base for productivity, or you may want to keep the old implementation around for testing purposes. In this post we are going to talk about how to integrate your custom kernels into ArrayFire in a seamless fashion.

In and Out of ArrayFire

First let us look at the following code and then break it down bit by bit.

#include <arrayfire.h>
#include <cstdio>

// Defined separately for the CUDA and OpenCL versions (see below)
void launch_simple_kernel(float *d_y, const float *d_x, const int num);

int main()
{
    const int num = 10000; // number of elements in this example

    af::array x = af::randu(num, 1);

    af::array y = af::array(num, 1);

    float *d_x = x.device<float>();
    float *d_y = y.device<float>();

    af::sync();

    launch_simple_kernel(d_y, d_x, num);
    x.unlock();
    y.unlock();

    float err = af::sum<float>(af::abs(y - 1));
    printf("Error: %f\n", err);
    return 0;
}

Most kernels require an input. In this case we create a uniform random array, x. If you already have an input array, you can use it instead of x.

// Generate input data
af::array x = af::randu(num, 1);

The memory required for the output is allocated before the kernel launch.

// Create empty array with required dimensions
af::array y = af::array(num, 1);

In this example, the output is the same size as the input. Note that the output data type is not specified. In such cases, ArrayFire assumes the data type is single precision floating point (af::f32). If necessary, the data type can be specified as the last argument of the array(..) constructor.
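For instance, a double precision output array could be requested as follows (illustrative only; the rest of this example keeps the default f32):

// Create an empty double precision array
af::array y = af::array(num, 1, f64);

Once you have the input and output arrays, you will need to extract the device pointers using the array::device() method in the following manner.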

// Get device pointers
float *d_x = x.device<float>();
float *d_y = y.device<float>();

Before launching your custom kernel, it is best to make sure that all ArrayFire computations have finished. This can be done by calling af::sync().¹

// Finish the tasks ArrayFire was doing
af::sync();

// Launch kernel to do the following operations
// y = sin(x)^2 + cos(x)^2
launch_simple_kernel(d_y, d_x, num);

The function launch_simple_kernel handles the launching of your custom kernel. We will look at how to do this in CUDA and OpenCL later in the post. Once you have finished your computations, you have to hand control of the memory objects back to ArrayFire.

// Tell ArrayFire it controls the pointers again
x.unlock();
y.unlock();

This is a crucial step: until you call unlock(), ArrayFire believes the user is still in control of the pointers, which means it will not perform garbage collection on these objects, resulting in memory leaks. You can now proceed with the rest of the program. In our particular example, we are just performing an error check and exiting.

// Check for errors
// sin(x)^2 + cos(x)^2 == 1
// The following should print 0
float err = af::sum<float>(af::abs(y - 1));
printf("Error: %f\n", err);

One of the benefits of ArrayFire is that it enables you to use the same code base for both CUDA and OpenCL devices. In the same vein, this particular example can be extended with either a CUDA kernel or an OpenCL kernel. The externally defined launch_simple_kernel will be the only part of your code base that changes when moving from CUDA to OpenCL (or vice versa!).
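One way to make that switch painless (the file layout and names here are only an illustrative suggestion, not something ArrayFire prescribes) is to declare the launcher once in a shared header and provide the CUDA and OpenCL definitions in separate source files:

// simple_kernel.h -- included by main.cpp and by both backend implementations
#ifndef SIMPLE_KERNEL_H
#define SIMPLE_KERNEL_H

// Defined in simple_kernel.cu (CUDA) or simple_kernel_ocl.cpp (OpenCL)
void launch_simple_kernel(float *d_y, const float *d_x, const int num);

#endif

The main program stays identical; only the source file that provides launch_simple_kernel changes at build time.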

Launching a CUDA kernel

Integrating a CUDA kernel into your ArrayFire code base is a fairly straightforward process. You need to set the launch configuration parameters, launch the kernel and wait for the computations to finish. This is shown below.

__global__
static void simple_kernel(float *d_y,
                          const float *d_x,
                          const int num)
{
    const int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < num) {
        float x = d_x[id];
        float sin_x = sin(x);
        float cos_x = cos(x);
        d_y[id] = (sin_x * sin_x) + (cos_x * cos_x);
    }
}

void inline launch_simple_kernel(float *d_y,
                                 const float *d_x,
                                 const int num)
{
    // Set launch configuration
    const int threads = 256;
    const int blocks = (num / threads) + ((num % threads) ? 1 : 0);
    simple_kernel<<<blocks, threads>>>(d_y, d_x, num);
    // Synchronize and check for error
    cudaDeviceSynchronize();
}
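The comment above mentions checking for errors, but the snippet stops at the synchronize call. One possible way to expand those last two lines (an addition for illustration, not part of the original example) is:

// Synchronize and check for both launch and execution errors
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess) err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
}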

Launching an OpenCL kernel

If you are integrating an OpenCL kernel into your ArrayFire code base, launching the kernel is a bit more involved. Since ArrayFire uses its own context internally, you need to get that context from one of its memory objects. Once you have access to the same context ArrayFire is using, the rest of the process is exactly the same as launching a kernel in a standalone OpenCL application.
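ArrayFire does not hand you its context directly; the get_context helper used below obtains it by querying the cl_mem that ArrayFire allocated. A minimal sketch of such a helper (the name matches the helper used below, but this particular implementation and its lack of error checking are assumptions for illustration):

#include <CL/cl.h>

// Query the OpenCL context that owns a given memory object
static cl_context get_context(cl_mem buffer)
{
    cl_context context = NULL;
    clGetMemObjectInfo(buffer, CL_MEM_CONTEXT,
                       sizeof(cl_context), &context, NULL);
    return context;
}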

void inline launch_simple_kernel(float *d_y,
                                 const float *d_x,
                                 const int num)
{
    // Read the OpenCL kernel as a string
    std::string simple_kernel_str = get_kernel_string("simple.cl");

    // Get OpenCL context from memory buffer and create a Queue
    cl_context context = get_context((cl_mem)d_x);
    cl_command_queue queue = create_queue(context);

    // Build the OpenCL program and get the kernel
    cl_program program = build_program(context, simple_kernel_str);
    cl_kernel   kernel = create_kernel(program, "simple_kernel");
    
    cl_int err = CL_SUCCESS;
    int arg = 0;

    // Set input parameters for the kernel
    err |= clSetKernelArg(kernel, arg++, sizeof(cl_mem), &d_y);
    err |= clSetKernelArg(kernel, arg++, sizeof(cl_mem), &d_x);
    err |= clSetKernelArg(kernel, arg++, sizeof(int   ), &num);

    // Set launch configuration
    size_t local  = 256;
    size_t global = local * (num / local + ((num % local) ? 1 : 0));
    // Launch OpenCL kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);
    return;
}

The code snippet above uses a few helper functions (get_kernel_string, get_context, create_queue, build_program, and create_kernel); get_context was sketched above, and the others wrap standard OpenCL boilerplate and are provided with the full example code. The OpenCL kernel itself is fairly simple, as seen below.

__kernel
void simple_kernel(__global float *d_y,
                   __global const float *d_x,
                   const int num)
{
    const int id = get_global_id(0);

    if (id < num) {
        float x = d_x[id];
        float sin_x = sin(x);
        float cos_x = cos(x);
        d_y[id] = (sin_x * sin_x) + (cos_x * cos_x);
    }
}

You can find the complete code samples accompanying this post.

¹ af::sync() is not needed if you are not using streams in CUDA; it ensures you are not unintentionally executing operations out of order. af::sync() is always required when launching OpenCL kernels.
