Generating PTX files from OpenCL code

Peter Entschev CUDA, OpenCL 2 Comments

Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Therefore, it is not uncommon that CUDA code on NVIDIA GPUs will run faster than OpenCL. A very good way to understand what is behind the curtains is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels.

PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than a single thread assembly code, except that now we are thinking in massive parallel execution.

Retrieving the PTX file from a CUDA kernel is a pretty straightforward process. First, we need a CUDA kernel, we will use the following code for this article, a simple vector addition kernel.

__global__ void add_vectors(
    const float* a,
    const float* b,
    float *c,
    const int n)
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx <= n) return;

    c[idx] = a[idx] + b[idx];

For simplicity assume that this function is saved to Now we simply call the NVCC compiler with the --ptx argument.

nvcc --ptx

The compiler output is a text file named add_vectors.ptx.

Retrieving the PTX file for an OpenCL file is somewhat more challenging. Those of you familiar with OpenCL are probably not surprised. Retrieving the PTX code requires host-side code and the OpenCL file which is only extracted at runtime. This means that the host-side code is necessary for compilation of the OpenCL function whereas this was not necessary for the CUDA code.

The OpenCL kernel will be very similar to the CUDA kernel and can be saved with any name, here we will use, just be aware of the file name to pass the correct one to the host-side code. Below the vector addition kernel adapted to OpenCL is provided.

__kernel void add_vectors(
    __global const int* a,
    __global const int* b,
    __global       int* c,
    int n)
    int idx = get_global_id(0);

    if (idx <= n) return;

    c[idx] = a[idx] + b[idx];

Finally, we need the host-side code. A minimalist version (without actually executing the OpenCL kernel) follows.


#ifdef __APPLE__

int main(int argc, char* argv[])
    // Loads
    FILE* fp;
    fp = fopen("", "r");
    if (!fp) {
        fprintf(stderr, "Error loading kernel.\n");

    fseek(fp, 0, SEEK_END);
    size_t kernel_sz = ftell(fp);

    char* kernel_str = (char*)malloc(kernel_sz);
    fread(kernel_str, 1, kernel_sz, fp);

    // Query platforms and devices
    cl_platform_id platform;
    cl_device_id device;
    cl_uint num_devices, num_platforms;
    cl_int err = clGetPlatformIDs(1, &platform, &num_platforms);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1,
                         &device, &num_devices);

    // Create OpenCL context
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    // Create OpenCL command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &err);

    // Create OpenCL program for
    cl_program program = clCreateProgramWithSource(context, 1,
            (const char **)&kernel_str, (const size_t *)&kernel_sz, &err);

    // Build OpenCL program
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Create OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "add_vectors", &err);

    // Query binary (PTX file) size
    size_t bin_sz;
    err = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_sz, NULL);

    // Read binary (PTX file) to memory buffer
    unsigned char *bin = (unsigned char *)malloc(bin_sz);
    err = clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &bin, NULL);

    // Save PTX to add_vectors_ocl.ptx
    fp = fopen("add_vectors_ocl.ptx", "wb");
    fwrite(bin, sizeof(char), bin_sz, fp);

    // Release OpenCL resources
    return 0;

To compile the code above with GCC, it is necessary only to pass the OpenCL include files to the compiler and the OpenCL library to the linker. The code should also compile under Visual Studio with little or no changes.

gcc -Wall -g add_vectors.c -o add_vectors -I/usr/local/cuda/include -lOpenCL

Finally, the code can be executed and the PTX will be saved to the file add_vectors_ocl.ptx. Note that for this simple example the PTX will only be extracted if the device 0 is an NVIDIA device, otherwise, it will not be a PTX file.

If you think the host-side code is too long and complicated, don't worry, check the Quest for the Smallest OpenCL Program. One of our engineers wrote a blog that uses the C++ API of OpenCL and he explains how to write simple and short OpenCL code. With few changes, it is possible to use the short C++ code to extract the PTX file from OpenCL kernels.

The complete code along with Makefile and instructions to build it can be found here.

We plan on following up this blog and discuss in more depth analyzing PTX for both CUDA and OpenCL kernels, altogether with a more general host-side code. We will introduce some simple but interesting methods that we use that help us understand and improve code performance. So stay in touch!

Comments 2

  1. Pingback: Generating PTX files from OpenCL code

  2. Pingback: Demystifying PTX Code | ArrayFire

Leave a Reply

Your email address will not be published. Required fields are marked *