Computer algorithms tend to be friendliest towards data sizes that are powers of two, and GPU compute algorithms work particularly well with data sizes that are multiples of 32. In most real-world situations, however, data is rarely so conveniently sized. In today’s post, we’ll be looking at one such scenario related to GPU compute. Specifically, we’ll provide some tips on how to optimize a matrix transpose algorithm for a GPU.
Let’s start with the transpose kernel available from NVIDIA’s Parallel Forall blog. It is already optimized for coalesced memory access and to avoid shared memory bank conflicts, but it only works on matrices whose dimensions are multiples of 32.
template<typename T>
__global__
void transpose32(T * out, const T * in, unsigned dim0, unsigned dim1)
{
    // Tile staged in shared memory; the +1 padding avoids bank conflicts.
    __shared__ T shrdMem[TILE_DIM][TILE_DIM+1];

    unsigned lx = threadIdx.x;
    unsigned ly = threadIdx.y;

    // Global read coordinates.
    unsigned gx = lx + blockDim.x * blockIdx.x;
    unsigned gy = ly + TILE_DIM * blockIdx.y;

#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        shrdMem[ly + repeat][lx] = in[gy_ * dim0 + gx];
    }

    __syncthreads();

    // Global write coordinates: block indices are swapped for the transpose.
    gx = lx + blockDim.x * blockIdx.y;
    gy = ly + TILE_DIM * blockIdx.x;

#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        out[gy_ * dim1 + gx] = shrdMem[lx][ly + repeat];
    }
}
A minor modification has been made to the kernel from the NVIDIA website: the matrix element data type is templated, so the same kernel can be reused for other element types. The variable TILE_DIM should be declared as a static constant and initialized to 32.
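For reference, here is a minimal sketch of how these constants might be defined. TILE_DIM comes from the post; THREADS_Y is a name we introduce here to match the 32x8 launch configuration used in the results later.

// Tile edge in elements: each thread block transposes one TILE_DIM x TILE_DIM
// tile of the matrix.
static const unsigned TILE_DIM  = 32;

// Threads per block along y (name introduced here for illustration). With a
// 32x8 block, each thread copies TILE_DIM / THREADS_Y = 4 rows of its tile.
static const unsigned THREADS_Y = 8;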
Now we make the kernel applicable to matrices of any size by adding index checks. This requires the following changes.
template<typename T>
__global__
void transposeNoSC(T * out, const T * in, unsigned dim0, unsigned dim1)
{
    /*
     * same code as transpose32 kernel up to the first loop
     */
#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        if (gx < dim0 && gy_ < dim1)
            shrdMem[ly + repeat][lx] = in[gy_ * dim0 + gx];
    }

    __syncthreads();

    gx = lx + blockDim.x * blockIdx.y;
    gy = ly + TILE_DIM * blockIdx.x;

#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        if (gx < dim1 && gy_ < dim0)
            out[gy_ * dim1 + gx] = shrdMem[lx][ly + repeat];
    }
}

The above kernel checks the read and write indices for validity.
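As a quick illustration, here is a minimal host-side sketch (not from the original post) of how such a kernel might be launched for arbitrary dimensions. The divup helper and the d_in/d_out pointer names are ours; note that edge blocks can now extend past the matrix, which is exactly why the index checks above are needed.

// Ceiling division helper (introduced here) used to cover the matrix with
// whole tiles even when dim0 or dim1 is not a multiple of TILE_DIM.
static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

void launchTransposeNoSC(float * d_out, const float * d_in,
                         unsigned dim0, unsigned dim1)
{
    // 32x8 threads per block; each block still handles a full 32x32 tile.
    dim3 threads(TILE_DIM, THREADS_Y);
    dim3 blocks(divup(dim0, TILE_DIM), divup(dim1, TILE_DIM));
    transposeNoSC<float><<<blocks, threads>>>(d_out, d_in, dim0, dim1);
}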
Now let’s add back special handling for the case where the matrix dimensions are multiples of 32.

template<typename T, bool is32Multiple>
__global__
void transposeSC(T * out, const T * in, unsigned dim0, unsigned dim1)
{
    /*
     * same code as transpose32 kernel up to the first loop
     */
#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        if (is32Multiple || (gx < dim0 && gy_ < dim1))
            shrdMem[ly + repeat][lx] = in[gy_ * dim0 + gx];
    }

    __syncthreads();

    gx = lx + blockDim.x * blockIdx.y;
    gy = ly + TILE_DIM * blockIdx.x;

#pragma unroll
    for (unsigned repeat = 0; repeat < TILE_DIM; repeat += blockDim.y) {
        unsigned gy_ = gy + repeat;
        if (is32Multiple || (gx < dim1 && gy_ < dim0))
            out[gy_ * dim1 + gx] = shrdMem[lx][ly + repeat];
    }
}

Notice the additional boolean template parameter is32Multiple, which is used to short-circuit the if condition inside the two loops. Wherever transposeSC is instantiated with this parameter set to true, the compiled code essentially eliminates the if condition. The kernel can be optimized further so that index checks are avoided entirely for all blocks except the last one along each dimension. Keep in mind that adding more short-circuit template parameters increases compilation time correspondingly.
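As an illustration, the specialization might be selected on the host as in the sketch below, reusing the divup, THREADS_Y, d_in, and d_out names assumed in the earlier launch sketch.

void launchTransposeSC(float * d_out, const float * d_in,
                       unsigned dim0, unsigned dim1)
{
    dim3 threads(TILE_DIM, THREADS_Y);
    dim3 blocks(divup(dim0, TILE_DIM), divup(dim1, TILE_DIM));

    // Both specializations are compiled; the branch below only picks one, and
    // inside the chosen kernel the bounds checks are resolved at compile time.
    if (dim0 % TILE_DIM == 0 && dim1 % TILE_DIM == 0)
        transposeSC<float, true ><<<blocks, threads>>>(d_out, d_in, dim0, dim1);
    else
        transposeSC<float, false><<<blocks, threads>>>(d_out, d_in, dim0, dim1);
}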
Similar modifications can be made to an OpenCL kernel by passing kernel compile options to clBuildProgram, while ensuring that the kernels are compiled only once. We recently published a post on how to ensure OpenCL kernels are compiled only once in applications; you can find the article here.
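Here is a minimal sketch of what passing that specialization to clBuildProgram could look like. The program and device handles and the IS32MULTIPLE macro name are assumptions for illustration, not from the original post.

#include <CL/cl.h>

// Compile one variant of the transpose program, choosing the specialization
// through a preprocessor definition instead of a C++ template parameter.
// 'program' and 'device' are assumed to already exist; IS32MULTIPLE is a
// macro name introduced here that the OpenCL kernel would check.
cl_int buildTranspose(cl_program program, cl_device_id device, int is32Multiple)
{
    const char * options = is32Multiple ? "-D IS32MULTIPLE=1"
                                        : "-D IS32MULTIPLE=0";
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}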
Here are some bandwidth analysis results for the three kernels above on a 1024x1024 matrix. For these results, we used a launch configuration of 32x8 threads per block, where each block processed a 32x32 sub-matrix of the input matrix.
Matrix Dimensions | transpose32 Bandwidth | transposeNoSC Bandwidth | transposeSC Bandwidth
1024x1024         | 85.18 GB/s            | 82.21 GB/s              | 84.41 GB/s

Keep an eye out for future posts along this theme, in which we'll discuss tips to optimize CUDA/OpenCL kernels for different algorithms.