Generating PTX files from OpenCL code

Posted by Peter in CUDA, OpenCL

Here at ArrayFire, we develop code that works efficiently on both CUDA and OpenCL platforms. In doing so, we find it is not uncommon for CUDA code on NVIDIA GPUs to run faster than the equivalent OpenCL code. A good way to see what is happening behind the curtain is to generate the PTX for both versions and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels.

PTX stands for Parallel Thread eXecution, a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not much more complicated than single-threaded assembly code, except that we now have to think in terms of massively parallel execution.

Retrieving the PTX file from a CUDA kernel is a pretty straightforward process. First, we need a CUDA kernel. For this article we will use a simple vector addition kernel.

For simplicity, assume that this function is saved to a CUDA source file. Now we simply call the NVCC compiler with the --ptx argument.

The compiler output is a text file named add_vectors.ptx.

Retrieving the PTX for an OpenCL kernel is somewhat more challenging, which will probably not surprise those of you familiar with OpenCL. OpenCL kernels are compiled at runtime, so the PTX can only be extracted at runtime as well: we need both the OpenCL kernel file and host-side code that builds it. In other words, host-side code is necessary to compile the OpenCL function, whereas it was not necessary for the CUDA code.

The OpenCL kernel is very similar to the CUDA kernel: it is the same vector addition, adapted to OpenCL. It can be saved under any name; just make sure to pass the correct file name to the host-side code.
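The host-side code has to load the kernel text from that file before handing it to OpenCL. A minimal sketch of such a loader in plain C is shown here; the function name `read_kernel_source` is a hypothetical helper introduced for illustration, not part of the OpenCL API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a freshly
 * allocated, NUL-terminated buffer. Returns NULL on failure; the caller
 * frees the result. If `out_len` is non-NULL it receives the byte count. */
char *read_kernel_source(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    /* NUL-terminate so the text can be passed as a C string. */
    buf[got] = '\0';
    if (out_len != NULL)
        *out_len = got;
    return buf;
}
```

Returning NULL for a missing file makes a wrong file name easy to detect before any OpenCL call is made.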

Finally, we need the host-side code. A minimalist version (without actually executing the OpenCL kernel) follows.

To compile the code above with GCC, you only need to pass the OpenCL include path to the compiler and the OpenCL library to the linker. The code should also compile under Visual Studio with little or no change.

Finally, the code can be executed and the PTX will be saved to the file add_vectors_ocl.ptx. Note that in this simple example the PTX will only be extracted if device 0 is an NVIDIA device; otherwise, the saved file will not contain PTX.
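One quick way to check what was actually saved is to look for the `.version` and `.target` directives that every PTX module contains. The sketch below is only a heuristic, and `looks_like_ptx` is a hypothetical helper name introduced for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `needle` occurs within the first `len` bytes of `buf`.
 * Works on raw bytes, so it is safe for non-text binaries as well. */
static int contains(const char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || len < n)
        return 0;
    for (size_t i = 0; i + n <= len; ++i)
        if (memcmp(buf + i, needle, n) == 0)
            return 1;
    return 0;
}

/* Heuristic check: a PTX module carries both a `.version` and a
 * `.target` directive, so a buffer missing either is not PTX. */
int looks_like_ptx(const char *buf, size_t len)
{
    return contains(buf, len, ".version") && contains(buf, len, ".target");
}
```

A binary produced for a non-NVIDIA device would normally fail this check, which makes it a cheap sanity test before comparing the file against the CUDA-generated PTX.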

If you think the host-side code is too long and complicated, don't worry: check out the Quest for the Smallest OpenCL Program. In that blog post, one of our engineers uses the C++ API of OpenCL and explains how to write simple and short OpenCL code. With a few changes, the short C++ code can be used to extract the PTX file from OpenCL kernels.

The complete code along with Makefile and instructions to build it can be found here.

We plan to follow up on this post with a more in-depth discussion of analyzing PTX for both CUDA and OpenCL kernels, together with more general host-side code. We will introduce some simple but interesting methods that help us understand and improve code performance. So stay in touch!