SGEMM, MTIMES & CUBLAS performance on the GPU


AccelerEyes is focused not only on providing the easiest-to-use GPU programming platform for CUDA-capable GPUs by leveraging the MATLAB® language; our engineering organization is also always looking for ways to improve performance across the Jacket platform. A case in point is some recent work on matrix multiplication, specifically SGEMM (Single-precision General Matrix Multiply), or MTIMES. The Jacket 1.3 release relied on CUBLAS for matrix multiplication, and given how important matrix multiplication is to so many of our customers, we decided to find out whether we could improve the function's performance.
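
For reference, the CUBLAS path looks roughly like the sketch below, written against the legacy CUBLAS API (cublas.h) that shipped with the CUDA 3.0 toolkit listed further down. The buffer names and size are illustrative only, not Jacket's internal code.

/* Minimal legacy-CUBLAS SGEMM sketch: C = A*B for N x N single-precision
 * matrices. CUBLAS stores matrices column-major. */
#include <cublas.h>
#include <stdio.h>

int main(void)
{
    const int N = 1024;
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(N * N, sizeof(float), (void**)&dA);
    cublasAlloc(N * N, sizeof(float), (void**)&dB);
    cublasAlloc(N * N, sizeof(float), (void**)&dC);
    /* ... fill dA and dB, e.g. with cublasSetMatrix ... */

    /* C = 1.0*A*B + 0.0*C, no transposes */
    cublasSgemm('N', 'N', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "SGEMM failed\n");

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}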

Update: The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken. Have a look at the MTIMES Benchmarks wiki for up-to-date performance results, including Fermi (Tesla C2050) benchmarks!

The following chart illustrates the improvements we were able to make with our custom GEMM implementation in the 1.4 Release Candidate. We timed SGEMM at various square matrix sizes, comparing the Jacket 1.3 release against the Jacket 1.4 Release Candidate.

The graph above represents C=A*B timings for every N×N matrix from 10×10 to 4000×4000. Each data point is an average of 2 calls of MTIMES for each matrix size. Notice that Jacket 1.3 follows exactly 2 curves: the lower curve for certain multiples of 16 and 32, and the upper curve for everything else. The lower blue curve is attained by not using textures, since for those sizes the data aligns and coalesces nicely in memory, and by applying other size-specific optimizations. By switching to texture memory exclusively, Jacket 1.4 gives up a little performance in certain cases, but together with other optimizations it delivers consistently fast run-times, enhanced GFOR performance, and native support for mixed matrix types. In addition to texture memory, careful shared memory and register usage brings significant performance enhancements as well.
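
The shared-memory technique mentioned above looks, in its textbook form, like the 16×16 tiling sketch below. This is a generic illustration, not Jacket's actual kernel, and it assumes row-major storage with N a multiple of the tile size.

#define TILE 16

/* Each thread block stages one TILE x TILE tile of A and of B in shared
 * memory, and each thread accumulates its C element in a register. */
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                       /* register accumulator */

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                    /* tiles fully loaded */

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    /* done reading the tiles */
    }
    C[row * N + col] = acc;
}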

The above graph overlays a MATLAB R2009b 32-bit (CPU) benchmark of MTIMES.

The graph above represents C=A*B GFLOPS for every N×N matrix from 10×10 to 4000×4000.

The GFLOPS formula used: GFLOPS = 2N³ / (t × 10⁹), where t is the run time in seconds and 2N³ is the floating-point operation count of an N×N matrix multiply.
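
As an illustration, a timing of this kind can be taken with CUDA events; the helper below is a sketch that slots into the CUBLAS example above (the actual benchmark harness is not shown in this post).

/* Time C = A*B on the GPU and return GFLOPS; dA, dB, dC are device
 * buffers of N*N floats, as allocated in the sketch above (compile
 * with nvcc, which pulls in the CUDA runtime). */
double sgemm_gflops(const float *dA, const float *dB, float *dC, int N)
{
    cudaEvent_t start, stop;
    float ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* warm-up call, excluded from the measurement */
    cublasSgemm('N', 'N', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);

    cudaEventRecord(start, 0);
    for (int i = 0; i < 2; ++i)             /* average of 2 calls, as above */
        cublasSgemm('N', 'N', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop); /* elapsed time in milliseconds */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    double t = (ms / 1000.0) / 2.0;         /* seconds per call */
    return 2.0 * (double)N * N * N / (t * 1e9);
}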

Miscellaneous information regarding the Jacket installation and hardware used for testing:

>> ginfo
AccelerEyes Jacket v1.4.0rc2 (build 4971)
CUDA driver: 195.36.15, CUDA toolkit 3.0
Memory: 0 CPU-used, 23 GPU-used, 4034 GPU-free (in MB)
License Type: Designated Computer
License Features: jacket sdk mgl4 dla
Multi-GPU: Licensed for 4 GPUs
Detected CUDA-capable GPUs:
GPU0 Tesla C1060, 1265 MHz, 4095 MB VRAM, Compute 1.3 (single,double) (in use)
GPU1 GeForce 8400 GS, 896 MHz, 511 MB VRAM, Compute 1.1 (single)
$ cat /proc/cpuinfo
vendor_id      : GenuineIntel
model name    : Pentium(R) Dual-Core  CPU  E5200  @ 2.50GHz
cpu MHz        : 2499.934
cache size     : 2048 KB
cpu cores      : 2
…

Update
Below is an example of the SGEMM performance on a Tesla C2050 (Fermi) card, Jacket 1.3 vs. Jacket 1.4 Final Release.

References
* Volkov, V., and Demmel, J. W., “Benchmarking GPUs to tune dense linear algebra,” SC08.
* SGEMM code from Vasily Volkov.
* SGEMM code from Lung-Sheng Chien.
* See the MTIMES benchmarks wiki page for more info!

Comments (5)

  1. Interesting, some great work here.
    Just curious: will there be an option to use the old routine if desired? (as sometimes one is flexible in choice of dimension).
    Actually — shouldn’t it be relatively easy to check dimension automatically and pick either the new or the old routine, depending on whether the dimension is an appropriate multiple of 32?
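
    In code, the suggestion amounts to a dispatch on the dimension, along these lines (both routine names are hypothetical):

    /* Hypothetical sketch of the dispatch being suggested: use the
     * 1.3-style fast path when the dimension is a multiple of 32. */
    void mtimes_dispatch(const float *A, const float *B, float *C, int N)
    {
        if (N % 32 == 0)
            sgemm_fast_path(A, B, C, N);   /* hypothetical 1.3-style routine */
        else
            sgemm_general(A, B, C, N);     /* hypothetical general routine */
    }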

    1. (Post author)

      Thanks for the interest.

      The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken.
      See the MTIMES Benchmarks wiki for an up-to-date listing of Jacket’s performance.
      Dynamically adjusting for optimal performance automatically is the next area of focus for Jacket’s MTIMES, scheduled to be included in the next release.

      ~Chris

  2. Great work!
    So this optimization relies on “texture memory”. If I remember correctly, the use of texture memory is deprecated on Fermi. Does this apply here? How does Fermi perform in these benchmarks?

    1. (Post author)

      Thanks!

      From the NVIDIA CUDA Programming guide:
      “On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well).”

      Basically, in compute 1.x cards, global memory is not cached, while texture memory is. In compute 2.x cards, such as Fermi, global memory is now cached. Compute 2.x cards can still use texture memory, but the advantage of it over global memory isn’t as significant compared to compute 1.x cards.
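
      For reference, the compute 1.x pattern being described looks roughly like the sketch below, which routes loads through a bound 1D texture so they hit the texture cache; the names are illustrative.

      /* File-scope texture reference (pre-Fermi-era CUDA C idiom). */
      texture<float, 1, cudaReadModeElementType> texA;

      __global__ void copy_through_texture(float *out, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              out[i] = tex1Dfetch(texA, i);  /* cached fetch, even on 1.x */
      }

      void launch(float *d_out, const float *d_in, int n)
      {
          cudaBindTexture(NULL, texA, d_in, n * sizeof(float));
          copy_through_texture<<<(n + 255) / 256, 256>>>(d_out, n);
          cudaUnbindTexture(texA);
      }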

      In addition to texture memory, careful shared memory and register usage can bring significant performance enhancements as well.

      Have a look at the MTIMES Benchmarks wiki for up-to-date performance results, including Fermi (Tesla C2050) benchmarks.

      ~Chris
