Last week we posted a video recording from NVIDIA’s GTC09 conference. In the video, I walked through median filtering, presenting the vanilla implementation and then walking through progressive CUDA optimizations. A comment on that post suggested trying some other compiler flags, and it sparked a new series of experiments.
In the original video, we started with a vanilla CPU implementation of 3×3 median filtering. We then ported this to the GPU for some immediate gains, and then began a string of optimizations to see how far we could push performance: switching to texture memory, switching to shared memory, changing the internal sorting of pixels, etc. The conclusion: pay attention to the resource usage reported by nvcc (registers, lmem, smem, cmem).
Times are in milliseconds (not speedups). I’m running this on a GeForce 8600M GT (in a MacBook Pro laptop). The source code is below if you’d like to try it yourself.
- The original baseline CPU and GPU algorithms, both using bubble sort. The GPU version uses vanilla global memory accesses, and the GPU already beats the CPU through brute-force parallelism.
cpu = 0.0894 gmem = 0.0144
- Put in the --maxrregcount=9 flag you suggested. The kernel goes from using 11 registers and 36 bytes of lmem to 9 registers and 40 bytes of lmem. Dropping from 11 to 9 registers means more blocks can be resident at once, so we get a slight speedup at the cost of only 4 extra bytes of lmem (4 bytes == one extra global read).
gmem = 0.0144 gmem_r9 = 0.0115
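For reference, both the register cap and the per-kernel resource report come from real nvcc options; the filenames below are placeholders, not from the Jacket build.

```shell
# Cap the register allocator at 9 registers per thread, and have ptxas
# print each kernel's resource usage (registers, lmem, smem, cmem).
# medfilt.cu / medfilt.o are placeholder names for illustration.
nvcc --maxrregcount=9 -Xptxas -v -c medfilt.cu -o medfilt.o
```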
- Put the exchange algorithm into both CPU and GPU — no more bubble sort. Still use vanilla memory accesses. We now see a dramatic win for the GPU, but I have no explanation for why the CPU takes longer now. Any ideas?
cpu_exch = 0.1215 gmem_exch = 0.0048
- For the GPU, keep the original exchange-sort version that uses shared memory for its accesses. This is the best time of all the algorithms: fast memory accesses, efficient sorting, fewest registers.
exch = 0.0013
Here they all are together:
cpu = 0.0894 gmem = 0.0144 gmem_r9 = 0.0115 cpu_exch = 0.1215 gmem_exch = 0.0048 exch = 0.0013
Anyone know why the CPU takes longer when using exchange sort? I am checking the return values to make sure all the algorithms are still functionally correct.
The source code for all of this is included in Jacket v1.3 (currently in the testing phase). But you can also download it here to try for yourself. This experiment is contained in ‘run_13.m’. Note, the timing makes use of gsync/geval, which will roll out in v1.3. To run this with v1.2.2 or earlier, just use GFORCE instead (see the comment at the top of run_13.m).