Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Therefore, it is not uncommon that CUDA code on NVIDIA GPUs will run faster than OpenCL. A very good way to understand what is behind the curtains is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels.
PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than a single thread assembly code, except that now we are thinking in massive parallel execution.
Retrieving the PTX file from a CUDA kernel is a pretty straightforward process. First, we need a CUDA kernel, we will use the following code for this article, a simple vector addition kernel.
__global__ void add_vectors(
    const float* a,
    const float* b,
    float *c,
    const int n)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx <= n) return;
    c[idx] = a[idx] + b[idx];
}
For simplicity assume that this function is saved to  add_vectors.cu. Now we simply call the NVCC compiler with the --ptx argument.
nvcc --ptx add_vectors.cu
The compiler output is a text file named add_vectors.ptx.
Retrieving the PTX file for an OpenCL file is somewhat more challenging. Those of you familiar with OpenCL are probably not surprised. Retrieving the PTX code requires host-side code and the OpenCL file which is only extracted at runtime. This means that the host-side code is necessary for compilation of the OpenCL function whereas this was not necessary for the CUDA code.
The OpenCL kernel will be very similar to the CUDA kernel and can be saved with any name, here we will use add_vectors.cl, just be aware of the file name to pass the correct one to the host-side code. Below the vector addition kernel adapted to OpenCL is provided.
__kernel void add_vectors(
    __global const int* a,
    __global const int* b,
    __global       int* c,
    int n)
{
    int idx = get_global_id(0);
    if (idx <= n) return;
    c[idx] = a[idx] + b[idx];
}
Finally, we need the host-side code. A minimalist version (without actually executing the OpenCL kernel) follows.
#include #include #ifdef __APPLE__ #include#else #include #endif int main(int argc, char* argv[]) { // Loads add_vectors.cl FILE* fp; fp = fopen("add_vectors.cl", "r"); if (!fp) { fprintf(stderr, "Error loading kernel.\n"); exit(1); } fseek(fp, 0, SEEK_END); size_t kernel_sz = ftell(fp); rewind(fp); char* kernel_str = (char*)malloc(kernel_sz); fread(kernel_str, 1, kernel_sz, fp); fclose(fp); // Query platforms and devices cl_platform_id platform; cl_device_id device; cl_uint num_devices, num_platforms; cl_int err = clGetPlatformIDs(1, &platform, &num_platforms); err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, &num_devices); // Create OpenCL context cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err); // Create OpenCL command queue cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &err); // Create OpenCL program for add_vectors.cl cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernel_str, (const size_t *)&kernel_sz, &err); // Build OpenCL program err = clBuildProgram(program, 1, &device, NULL, NULL, NULL); // Create OpenCL kernel cl_kernel kernel = clCreateKernel(program, "add_vectors", &err); // Query binary (PTX file) size size_t bin_sz; err = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_sz, NULL); // Read binary (PTX file) to memory buffer unsigned char *bin = (unsigned char *)malloc(bin_sz); err = clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &bin, NULL); // Save PTX to add_vectors_ocl.ptx fp = fopen("add_vectors_ocl.ptx", "wb"); fwrite(bin, sizeof(char), bin_sz, fp); fclose(fp); free(bin); // Release OpenCL resources clFlush(command_queue); clFinish(command_queue); clReleaseKernel(kernel); clReleaseProgram(program); clReleaseCommandQueue(command_queue); clReleaseContext(context); return 0; } 
To compile the code above with GCC, it is necessary only to pass the OpenCL include files to the compiler and the OpenCL library to the linker. The code should also compile under Visual Studio with little or no changes.
gcc -Wall -g add_vectors.c -o add_vectors -I/usr/local/cuda/include -lOpenCL
Finally, the code can be executed and the PTX will be saved to the file add_vectors_ocl.ptx. Note that for this simple example the PTX will only be extracted if the device 0 is an NVIDIA device, otherwise, it will not be a PTX file.
If you think the host-side code is too long and complicated, don't worry, check the Quest for the Smallest OpenCL Program. One of our engineers wrote a blog that uses the C++ API of OpenCL and he explains how to write simple and short OpenCL code. With few changes, it is possible to use the short C++ code to extract the PTX file from OpenCL kernels.
The complete code along with Makefile and instructions to build it can be found here.
We plan on following up this blog and discuss in more depth analyzing PTX for both CUDA and OpenCL kernels, altogether with a more general host-side code. We will introduce some simple but interesting methods that we use that help us understand and improve code performance. So stay in touch!

Comments 2
Pingback: Generating PTX files from OpenCL code
Pingback: Demystifying PTX Code | ArrayFire