Benchmarking Tesla K20

In this blog post, we compare NVIDIA’s latest high-end offering, the Tesla K series (PDF), with the previous generation. In particular, we compare the Tesla K20c with the Tesla C2070/C2075.

This blog post follows a similar post we did last year about benchmarking the GTX680. We look at a similar set of functions (and a few more) to see what benefits the newer line brings.

All of the benchmarks were done using double precision. In all of the graphs, higher trendlines are better.

Matrix Multiplication

In house at AccelerEyes, we use matrix multiplication as the gold standard for testing the maximum performance of every new GPU we get. The K20c peaks at about 1 TFLOPS in double precision. This is about 85% of its maximum theoretical performance of 1.2 TFLOPS ^. The C2070 reaches about 310 GFLOPS, a mere 60% of its maximum theoretical performance of 515 GFLOPS. In effect, the K20c is about 3 times faster than the C2070.
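As a rough sketch of how figures like these are usually derived (this is not the benchmark code we ran), the snippet below assumes the conventional count of 2 * N^3 floating point operations for a dense N x N matrix multiply. The function names and the example timing are hypothetical; the efficiency and speedup lines use only the numbers quoted above.

```python
def matmul_gflops(n, seconds):
    """GFLOPS for an n x n matrix multiply, assuming 2 * n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak that was actually reached."""
    return measured_gflops / peak_gflops


# Hypothetical timing, for illustration only: a 4096 x 4096 multiply in 0.14 s.
example_gflops = matmul_gflops(4096, 0.14)

# Figures quoted in the text (GFLOPS).
k20_measured, k20_peak = 1000.0, 1200.0
c2070_measured, c2070_peak = 310.0, 515.0

k20_eff = efficiency(k20_measured, k20_peak)
c2070_eff = efficiency(c2070_measured, c2070_peak)
speedup = k20_measured / c2070_measured

print(f"K20c efficiency:  {k20_eff:.0%}")
print(f"C2070 efficiency: {c2070_eff:.0%}")
print(f"Speedup:          {speedup:.1f}x")
```

Reporting efficiency next to raw GFLOPS makes it easier to see that the newer card is not only faster in absolute terms but also gets closer to its own theoretical peak.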

Linear Algebra

Moving on from matrix multiply to other linear algebra functions, we see that the K20 still beats the C2070 by about 2.5x. You may also notice that LU decomposition is slower on the K20 at smaller sizes. We are unsure of the actual reasons, but here is all the information we have. The GPUs were in different machines with different motherboards and possibly different PCIe speeds. This is relevant because small chunks of data are transferred to the host for faster computation. At smaller sizes, the data transfers and the computation on the CPU become the bottleneck. Hence it may be unfair to compare GPUs across different systems at these sizes.

Fast Fourier Transform

In this particular case, the K20c is about 1.6x faster than the C2070. The absolute numbers are not relevant here because the formula used to calculate GFLOPS for FFTs (5 * N * log(N)) is only approximate, and the constant factor may be higher or lower than 5 depending on the algorithm used.
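To illustrate why the relative speedup is more meaningful than the absolute numbers, here is a minimal sketch of the GFLOPS estimate. It assumes the logarithm is taken base 2, which is the usual convention for this formula; the function name and the timings are hypothetical and were chosen only to mirror the 1.6x ratio mentioned above.

```python
import math


def fft_gflops(n, seconds, constant=5.0):
    """Approximate GFLOPS for an n-point FFT: constant * n * log2(n) / time."""
    return constant * n * math.log2(n) / seconds / 1e9


n = 2 ** 20
t_old, t_new = 1.6e-3, 1.0e-3  # hypothetical timings in seconds

# The constant cancels when two GPUs are compared at the same size,
# so the ratio is the same whether the factor is 5 or something else.
ratio_with_5 = fft_gflops(n, t_new, 5.0) / fft_gflops(n, t_old, 5.0)
ratio_with_4 = fft_gflops(n, t_new, 4.0) / fft_gflops(n, t_old, 4.0)

print(f"ratio with constant 5: {ratio_with_5:.2f}")
print(f"ratio with constant 4: {ratio_with_4:.2f}")
```

Whatever constant is chosen, the ratio reduces to the ratio of the two timings, which is why only the comparison between the two cards should be read off the graph.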

Sort

Although the K20 starts off a little slower, its performance keeps increasing at a steeper rate than that of the C2070. The older Tesla seems to be approaching its limit at 165 million keys per second, whereas the K20 is still going strong at the end of the test, coming in at about 500 million keys per second. This is about 3x faster than the C2070.
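For readers unfamiliar with the metric, the sketch below shows how a keys-per-second figure can be computed. It times Python's built-in sort on the CPU purely to illustrate the calculation; it is not the GPU sort we benchmarked, and the function name, input size, and use of random floats are assumptions. The last lines use only the two throughput figures quoted above.

```python
import random
import time


def keys_per_second(num_keys):
    """Sort num_keys random doubles and return the throughput in keys/s."""
    keys = [random.random() for _ in range(num_keys)]
    start = time.perf_counter()
    keys.sort()
    elapsed = time.perf_counter() - start
    return num_keys / elapsed


throughput = keys_per_second(200_000)
print(f"CPU sort (illustration only): {throughput / 1e6:.1f} million keys/s")

# Throughput figures quoted in the text, in keys per second.
c2070_rate = 165e6
k20_rate = 500e6
sort_speedup = k20_rate / c2070_rate
print(f"K20 vs C2070: {sort_speedup:.1f}x")
```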

All of the above benchmarks were run using ArrayFire 1.9, which uses CUDA 5.0.

^Ref: http://www.nvidia.com/object/tesla-servers.html