AccelerEyes is focused not only on providing the easiest-to-use GPU programming platform for CUDA-capable GPUs by leveraging the MATLAB® language; our engineering organization is also always looking for ways to improve performance across the Jacket platform. A case in point is some recent work on matrix multiplication, specifically SGEMM (Single-precision General Matrix Multiply), or MTIMES. The Jacket 1.3 release relied on CUBLAS for matrix multiplication, and given the importance of matrix multiplication to so many of our customers, we decided to find out whether we could improve the function's performance.
Update: The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken. Have a look at the MTIMES Benchmarks wiki for up-to-date performance results, including Fermi (Tesla C2050) benchmarks!
The following chart illustrates the improvements we were able to make with our custom GEMM implementation in the 1.4 Release Candidate. We timed SGEMM at various square matrix sizes, comparing the Jacket 1.3 release against the Jacket 1.4 Release Candidate.
The graph above shows C=A*B timings for every NxN matrix from 10×10 to 4000×4000. Each data point is the average of 2 MTIMES calls at that matrix size. Notice that Jacket 1.3 follows exactly two curves: the lower curve for certain multiples of 16 and 32, and the upper curve for everything else. The lower blue curve is attained by not using textures, since the data aligns and coalesces nicely in memory, and by applying other optimizations for those data sizes. By switching to texture memory exclusively, Jacket 1.4 sacrifices a little performance in certain cases, but along with other optimizations it delivers consistently faster run-times, enhanced GFOR performance, and native support for mixed matrix types. In addition to texture memory, careful shared-memory and register usage brings significant performance gains as well.
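For illustration, the timing methodology above (average of 2 MTIMES calls per matrix size) can be sketched in plain Python with NumPy standing in for Jacket's GPU MTIMES. This is our own minimal stand-in, not Jacket code; with Jacket one would also force GPU completion (e.g. via a synchronization call) before stopping the timer.

```python
import time
import numpy as np

def time_mtimes(n, reps=2):
    """Average wall-clock seconds for C = A*B on n-by-n single-precision
    matrices, averaged over `reps` calls (matching the post's methodology).
    NumPy's matmul is a CPU stand-in for Jacket's GPU MTIMES."""
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    total = 0.0
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B  # C = A*B, i.e. MTIMES
        total += time.perf_counter() - t0
    return total / reps
```

Sweeping `n` over the sizes of interest and recording `time_mtimes(n)` reproduces the shape of the timing curves.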
The graph above represents C=A*B GFLOPS for every NxN matrix from 10×10 to 4000×4000.
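The GFLOPS figures follow from the standard operation count for C=A*B on NxN matrices: each of the N² output elements takes N multiplies and N-1 adds, conventionally rounded to 2N³ floating-point operations total. A minimal sketch of the conversion (the helper name is ours, not part of Jacket):

```python
def gemm_gflops(n, seconds):
    """Convert a timed n-by-n matrix multiply into GFLOPS.

    C = A*B performs roughly 2*n^3 floating-point operations
    (one multiply and one add per inner-product term).
    """
    return (2.0 * n**3) / seconds / 1e9
```

For example, a 4000×4000 SGEMM completing in 0.4 s corresponds to 2·4000³/0.4/10⁹ = 320 GFLOPS.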
Miscellaneous information regarding the Jacket installation and hardware used for testing:
>> ginfo
AccelerEyes Jacket v1.4.0rc2 (build 4971)
CUDA driver: 195.36.15, CUDA toolkit 3.0
Memory: 0 CPU-used, 23 GPU-used, 4034 GPU-free (in MB)
License Type: Designated Computer License
Features: jacket sdk mgl4 dla
Multi-GPU: Licensed for 4 GPUs
Detected CUDA-capable GPUs:
GPU0 Tesla C1060, 1265 MHz, 4095 MB VRAM, Compute 1.3 (single,double) (in use)
GPU1 GeForce 8400 GS, 896 MHz, 511 MB VRAM, Compute 1.1 (single)
$ cat /proc/cpuinfo
vendor_id  : GenuineIntel
model name : Pentium(R) Dual-Core CPU E5200 @ 2.50GHz
cpu MHz    : 2499.934
cache size : 2048 KB
cpu cores  : 2
…
* Volkov, V., and Demmel, J. W., Benchmarking GPUs to tune dense linear algebra, SC08.
* SGEMM code from Vasily Volkov.
* SGEMM code from Lung-Sheng Chien.
* See the MTIMES benchmarks wiki page for more info!