Demystifying PTX Code

Peter Entschev

In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address how a PTX file looks and, more importantly, how to understand all those complicated instructions in a PTX file.
I will use the same vector addition kernel from the previous post (the complete code can be found here).
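
For reference, the kernel in question is a plain vector addition; a minimal sketch of what it might look like in OpenCL C is shown below (the argument names are illustrative and may differ from the original code):

    __kernel void add_vectors(__global const int *a,
                              __global const int *b,
                              __global int *c,
                              int n)
    {
        int i = get_global_id(0);   // global thread ID
        if (i < n)                  // bounds check
            c[i] = a[i] + b[i];     // element-wise addition
    }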

For this post, I will focus on the PTX file generated from the OpenCL kernel. In a future post I will discuss the differences between the PTX files generated from OpenCL and CUDA code. Let’s start by looking at the complete PTX code:

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sun May 18 04:44:51 2014 (1400399091)
// Driver 331.79
//

.version 3.0
.target sm_21, texmode_independent
.address_size 32


.entry add_vectors(
    .param .u32 .ptr .global .align 4 add_vectors_param_0,
    .param .u32 .ptr .global .align 4 add_vectors_param_1,
    .param .u32 .ptr .global .align 4 add_vectors_param_2,
    .param .u32 add_vectors_param_3
)
{
    .reg .pred      %p<2>;
    .reg .s32       %r<21>;


    ld.param.u32    %r9, [add_vectors_param_3];
    mov.u32         %r5, %envreg3;
    mov.u32         %r6, %ntid.x;
    mov.u32         %r7, %ctaid.x;
    mov.u32         %r8, %tid.x;
    add.s32         %r10, %r8, %r5;
    mad.lo.s32      %r4, %r7, %r6, %r10;
    setp.lt.s32     %p1, %r4, %r9;
    @%p1 bra        BB0_2;

    ret;

BB0_2:
    shl.b32         %r11, %r4, 2;
    ld.param.u32    %r18, [add_vectors_param_0];
    add.s32         %r12, %r18, %r11;
    ld.param.u32    %r19, [add_vectors_param_1];
    add.s32         %r13, %r19, %r11;
    ld.global.u32   %r14, [%r13];
    ld.global.u32   %r15, [%r12];
    add.s32         %r16, %r14, %r15;
    ld.param.u32    %r20, [add_vectors_param_2];
    add.s32         %r17, %r20, %r11;
    st.global.u32   [%r17], %r16;
    ret;
}

The file starts with a header showing some compiler information in comments, followed by three lines containing 1) the PTX ISA version, 2) the target architecture (compute capability), and 3) the address size that will be used.

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sun May 18 04:44:51 2014 (1400399091)
// Driver 331.79
//

.version 3.0
.target sm_21, texmode_independent
.address_size 32

Immediately after the header, the entry point of the kernel is found, starting with the .entry directive followed by the kernel name.

.entry add_vectors(
...
)

The parameters for this kernel are three pointers to 32-bit integer arrays in global memory (notice that there is no distinction between constant and non-constant pointers and memory!), and one 32-bit integer argument (the length of the input and output vectors).

Each parameter of a function starts with the .param directive, followed by its data type, .u32 in our case. In this example, the compiler transformed all the function parameters into 32-bit unsigned integers (for extended information on data types, refer to the PTX ISA documentation). The pointer arguments are followed by a .ptr directive, the address space where they are located (.global for global space), and finally the .align directive, which specifies the data alignment (in this case .align 4 represents 4-byte alignment).

    .param .u32 .ptr .global .align 4 add_vectors_param_0,
    .param .u32 .ptr .global .align 4 add_vectors_param_1,
    .param .u32 .ptr .global .align 4 add_vectors_param_2,
    .param .u32 add_vectors_param_3

Immediately after the function prototype, the code block begins. The first things in the block are the register definitions: they always start with the .reg directive, followed by the data type and the register name (note that references to registers are always preceded by the % sign). Registers defined with the .pred directive are used for conditional assignments, such as in the case of branch instructions. Since PTX is an intermediate language, the registers defined here are virtual and do not necessarily have a one-to-one mapping to hardware registers. A set of virtual registers can be defined by including the number N of virtual registers inside < >, with them being addressable as %r0, %r1, ..., %rN-1, where %r is the name given to the set of registers.

    .reg .pred      %p<2>;
    .reg .s32       %r<21>;

Following the register definitions, we see the ld.param directive, which copies a function parameter into a register. Because the second argument is an address, it must be inside brackets, as in [add_vectors_param_3]. Copying the parameter to a register is necessary, as most PTX instructions cannot handle function parameters directly.

    ld.param.u32    %r9, [add_vectors_param_3];

Next, the values of some special registers are copied into the registers we defined earlier; they are:

  • envreg3: a read-only, driver-defined special register;
  • ntid.x: the number of threads per CTA (Cooperative Thread Array) in the x dimension, equivalent to get_local_size(0);
  • ctaid.x: the identifier of the CTA within the grid, equivalent to get_group_id(0);
  • tid.x: the thread identifier in the x dimension of the CTA, equivalent to get_local_id(0).

    mov.u32         %r5, %envreg3;
    mov.u32         %r6, %ntid.x;
    mov.u32         %r7, %ctaid.x;
    mov.u32         %r8, %tid.x;

After copying these registers into locally defined registers, the real fun starts. The values of envreg3 and tid.x are added and stored in register %r10. Since envreg3 is not directly part of our code, but instead a driver-specific value, I will not go into more detail about it (note that envreg does not appear in the PTX of the equivalent CUDA kernel).

    add.s32         %r10, %r8, %r5;

With the mad instruction (short for multiply-add), the global thread ID (get_global_id(0)) is calculated. This is done by multiplying the second and third arguments (%r7 and %r6, holding the values of ctaid.x and ntid.x); the lower 32 bits of the product (identified by the .lo suffix of the mad instruction) are then added to the fourth argument (%r10, which holds tid.x plus the envreg3 offset) and stored in the first argument (%r4).

    mad.lo.s32      %r4, %r7, %r6, %r10;
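
Ignoring the driver-supplied envreg3 offset, this is exactly the usual expression for the global thread ID, which could be written by hand inside an OpenCL kernel as:

    // roughly what the mad.lo.s32 above computes (plus the envreg3 offset)
    int i = get_group_id(0) * get_local_size(0) + get_local_id(0);   // == get_global_id(0)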

The first conditional instruction is then reached (the only one in this example), which consists of the setp instruction, followed by the @ guard and the bra instruction. The setp instruction sets a predicate (the first argument) by comparing the second argument (the global thread ID in this example) to the third argument (the number of elements of the input/output vectors) using the condition stated after setp, in this case lt (less than). The @ guard then tests the predicate: if it is true, the bra instruction branches to target BB0_2 (the target must be a label or a register holding the address of a label) and execution continues from there; if it is false, execution simply continues without branching. In the case of the condition being false, our kernel simply returns by executing the ret instruction that comes immediately after.

    setp.lt.s32     %p1, %r4, %r9;
    @%p1 bra        BB0_2;

    ret;
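
In other words, the compiler turned the kernel's bounds check into an early return: the guard (presumably an if (i < n) around the body) becomes a branch to BB0_2 when the condition holds and a fall-through ret when it does not. A rough OpenCL C equivalent of this structure:

    // assuming the original guard was if (i < n) { ... }
    if (i >= n)
        return;          // the fall-through ret, taken when the predicate is false
    // ... body continues at BB0_2 ...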

The real data processing is done by the set of instructions under label BB0_2, which calculate data pointers, load data from global memory, perform the calculation and store the result back to global memory. Since we are dealing with 32-bit (4-byte) data, the first instruction, shl, takes a 32-bit untyped register and shifts its bits left by 2 positions (equivalent to an integer multiplication by 4); this in fact calculates the byte offset of the element that the global thread ID refers to. The result in %r11 is then added to the global memory pointers (after they are loaded into registers with the ld.param instruction), resulting in the addresses where the data to be processed by this thread lies in memory.

BB0_2:
    shl.b32         %r11, %r4, 2;
    ld.param.u32    %r18, [add_vectors_param_0];
    add.s32         %r12, %r18, %r11;
    ld.param.u32    %r19, [add_vectors_param_1];
    add.s32         %r13, %r19, %r11;
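
In C terms, these address computations are nothing more than manual pointer arithmetic; using the illustrative argument names a and b from the sketch above, they correspond roughly to:

    // i << 2 is the byte offset of element i (sizeof(int) == 4);
    // adding it to a base pointer is the same as taking &a[i] and &b[i]
    size_t offset = (size_t)i << 2;
    __global const int *pa = (__global const int *)((__global const char *)a + offset);   // &a[i]
    __global const int *pb = (__global const int *)((__global const char *)b + offset);   // &b[i]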

Data is loaded from global memory with the ld.global instruction; its second argument is a register that holds a memory address and must be within brackets, just as we have seen before with ld.param. The data read from memory is stored into the register passed as the first argument. After the resulting value is calculated by adding the two values read from memory, and the memory address where the result will be stored is computed, the result is written to global memory with the complementary instruction to ld.global, that is, st.global. In the store case, the first argument (the one inside brackets) is the address. Finally, the return instruction ret is executed and the processing is finished.

    ld.global.u32   %r14, [%r13];
    ld.global.u32   %r15, [%r12];
    add.s32         %r16, %r14, %r15;
    ld.param.u32    %r20, [add_vectors_param_2];
    add.s32         %r17, %r20, %r11;
    st.global.u32   [%r17], %r16;
    ret;
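
Putting it all together, the whole BB0_2 block is simply the body of the guarded region of the kernel sketched earlier:

    c[i] = a[i] + b[i];   // two ld.global, one add.s32, one st.global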

While PTX code might not be as straightforward to understand as the original C/C++ code, in my opinion it is still easier to read than pure assembly for most architectures, as we do not have to deal with lots of interrupts, for example. Understanding PTX code can significantly help you understand what the compiler is actually doing with your C/C++ code, and how the program will flow through the hardware. In some cases, this can be an important tool to put your GPU to full use.
