Explaining FP64 performance on GPUs

Introduction

GPUs are really good at doing math. Their Achilles heel is 64-bit double precision math. GPUs, at least consumer-grade ones, are not built for high-performance FP64. This is because they are targeted at gamers and game developers, who do not really care about high-precision compute. So vendors like NVIDIA and AMD do not cram FP64 compute cores into their consumer GPUs.

For example, on a GTX 780 Ti, FP64 performance is 1/24 of FP32. This means that in the ideal case, running the same code with only the float types changed to double types would make the single precision run time about 1/24th of the double precision run time (time(FP32) = time(FP64)/24).
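
The relation above can be written as a one-line estimate. This is a minimal sketch, not a measurement: the function name and the 10 ms kernel time are made up for illustration, and the estimate only holds for ideal, compute-bound code.

```python
def estimated_fp64_time(fp32_time: float, fp32_per_fp64: float) -> float:
    """Ideal compute-bound FP64 run time, given the FP32 run time.

    fp32_per_fp64 is the N in a "1:N" ratio (24 for the GTX 780 Ti).
    """
    if fp32_time < 0 or fp32_per_fp64 <= 0:
        raise ValueError("time must be non-negative and ratio positive")
    return fp32_time * fp32_per_fp64


# Hypothetical kernel that takes 10 ms in single precision on a 1:24 card.
fp32_ms = 10.0
fp64_ms = estimated_fp64_time(fp32_ms, 24)
print(f"FP32: {fp32_ms} ms, ideal FP64 estimate: {fp64_ms} ms")
```

Real code rarely hits this bound exactly, since memory traffic and integer work do not slow down by the same factor.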

Keep in mind that for compute-bound algorithms, such as GEMM and FFT, the theoretical best case for FP64 performance is 1:2 FP32, simply because FP64 computes with double the number of bits. If the algorithm is memory bound, such as matrix transpose, then most GPUs will attain the 1:2 performance, because the limit is moving twice as many bytes rather than the number of FP64 units. The numbers we discuss below will all be compute-bound performance numbers.
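
One way to see the difference between the two cases is a simple roofline-style bound: run time is at least the larger of the compute time and the memory transfer time. The sketch below is illustrative only. It assumes a 1:24 card with a 5.1 TFLOPS FP32 peak and 336 GB/s of bandwidth (a 384-bit bus at 7 Gbps per pin), an arbitrary matrix size, and ideal memory traffic; none of it is measured.

```python
def roofline_seconds(flops: float, bytes_moved: float,
                     peak_flops: float, bandwidth: float) -> float:
    """Lower bound on run time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


# Assumed GTX 780 Ti-like figures (illustrative, not measured).
PEAK_FP32 = 5.1e12           # FLOP/s
PEAK_FP64 = PEAK_FP32 / 24   # 1:24 ratio
BANDWIDTH = 336e9            # bytes/s: 384-bit bus * 7 Gbps / 8

n = 4096  # square matrix dimension (arbitrary choice)


def gemm_time(elem_bytes: int, peak: float) -> float:
    # 2*n^3 FLOPs; ideal traffic reads A and B once and writes C once.
    return roofline_seconds(2 * n**3, 3 * n * n * elem_bytes, peak, BANDWIDTH)


def transpose_time(elem_bytes: int, peak: float) -> float:
    # No arithmetic; every element is read once and written once.
    return roofline_seconds(0, 2 * n * n * elem_bytes, peak, BANDWIDTH)


gemm_ratio = gemm_time(8, PEAK_FP64) / gemm_time(4, PEAK_FP32)
transpose_ratio = transpose_time(8, PEAK_FP64) / transpose_time(4, PEAK_FP32)
print(f"GEMM FP64/FP32 time ratio: {gemm_ratio:.1f}")
print(f"Transpose FP64/FP32 time ratio: {transpose_ratio:.1f}")
```

Under these assumptions GEMM slows down by the full FP64:FP32 ratio of the card, while the transpose only slows down by the factor of two that comes from the element size.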

How double precision performs really depends on the architecture of the GPU.

Comparing NVIDIA GPUs

Let's take three almost identical cards: the GTX 780 Ti, the GTX Titan Black and the Tesla K40c. All are Kepler GK110-based GPUs with the same number of SMX units and cores (15 SMX, 2880 cores) and the same bus width (384-bit). The 780 Ti and Titan Black even have nearly the same base clock speeds (~880MHz; the K40c is 745MHz) and identical memory clock speeds (7GHz; the K40c is 6GHz).

Single Precision Performance
With respect to single precision performance, all three are in the same ballpark. While the 780 Ti and Titan Black sit just around 5.1 TFlops, owing to their similar clock speeds, the Tesla K40c comes in at 4.3 TFlops. You could give the Tesla a pass because it has a lower clock speed.
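
These figures follow from the core counts and clocks quoted above. As a rough sketch, assuming each core can retire one fused multiply-add (counted as 2 FLOPs) per cycle and using the approximate base clocks from this post rather than boost clocks:

```python
def peak_fp32_tflops(cores: int, clock_mhz: float) -> float:
    """Theoretical peak, assuming one FMA (2 FLOPs) per core per cycle."""
    return cores * 2 * clock_mhz * 1e6 / 1e12


# Core counts and approximate base clocks as quoted in this post.
cards = {
    "GTX 780 Ti / GTX Titan Black": (2880, 880.0),
    "Tesla K40c": (2880, 745.0),
}

for name, (cores, mhz) in cards.items():
    print(f"{name}: {peak_fp32_tflops(cores, mhz):.2f} TFLOPS")
# GTX 780 Ti / GTX Titan Black: 5.07 TFLOPS
# Tesla K40c: 4.29 TFLOPS
```

The small gap to the quoted 5.1 TFlops comes from rounding the base clock to ~880MHz.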

The three cards vary significantly in only three categories.

1. Memory
This doesn’t affect performance very much (at least when the data fits in memory on all three GPUs). Judging by market prices of other GPUs that differ only in memory size, the price difference it accounts for should not be that big either.

2. Price
At launch, the GTX 780 Ti was priced at $699, $999 for the Titan Black and an estimated $5500 for the K40c. Well, that escalated quickly.
(Price Sources: GTX 780 Ti, GTX Titan Black, K40)

3. Double Precision Performance
So if the 780 Ti and the Titan Black are practically the same in every respect, why is there a $300 difference in their price at launch (discounting memory size difference)?

The answer is in the double precision capabilities. The 780 Ti is physically locked at 1:24 FP32, whereas the Titan Black has an ace up its sleeve.

For the Titan Black, the magic happens in the driver. The Titan Black’s driver gives the user an option to choose the double precision performance between 1:3 and 1:24 FP32 (by switching the GPU to TCC mode). When the double precision performance is set to 1:24 FP32, which is the same as the 780 Ti, the single precision performance of the Titan Black and 780 Ti are identical. But when the user sets the double precision performance to 1:3 FP32, the single precision performance is compromised to boost double precision performance and make it equal to the K40c. In other words, you can choose the performance of the Titan Black to match either the 780 Ti or the K40c based on your preference.

The K40c has a double precision performance of 1:3 without compromising the single precision performance. This is because the K40 is given a special double precision unit for every 3 single precision cores (white paper). It combines the best of both worlds. NVIDIA also states that the Tesla GPUs go through a much more rigorous quality assurance (QA) process, which results in fewer failures, and that they have additional features such as ECC memory. Hence the large price difference.
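
Putting the ratios and launch prices from this post together gives a back-of-the-envelope view of what each card offers for double precision. The sketch below only rearranges numbers already quoted here (the K40c price is an estimate, and the Titan Black in 1:3 mode is taken to match the K40c's FP64 rate, as described above). These are theoretical peaks, not benchmark results.

```python
def fp64_peak_tflops(fp32_tflops: float, fp32_per_fp64: float) -> float:
    """Theoretical FP64 peak from the FP32 peak and the N of a 1:N ratio."""
    return fp32_tflops / fp32_per_fp64


k40c_fp64 = fp64_peak_tflops(4.3, 3)

# (name, theoretical FP64 TFLOPS, launch price in USD), all from this post.
cards = [
    ("GTX 780 Ti", fp64_peak_tflops(5.1, 24), 699),
    ("GTX Titan Black, 1:24 mode", fp64_peak_tflops(5.1, 24), 999),
    ("GTX Titan Black, 1:3 mode", k40c_fp64, 999),
    ("Tesla K40c", k40c_fp64, 5500),
]

for name, tflops, price in cards:
    print(f"{name}: {tflops:.2f} FP64 TFLOPS, "
          f"about ${price / tflops:.0f} per FP64 TFLOPS")
```

Price per TFLOPS ignores ECC memory, memory size and validation, which are a large part of what the Tesla price pays for.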

Summary of NVIDIA GPUs

NVIDIA’s GTX series cards are known for their great FP32 performance but are very poor in FP64 performance. The FP64 rate generally ranges between 1:24 (Kepler) and 1:32 (Maxwell) of FP32. The exceptions to this are the GTX Titan cards, which blur the lines between the consumer GTX series and the professional Tesla/Quadro cards.

The Kepler architecture Quadro and Tesla series cards provide full double precision performance at 1:3 FP32. However, with the Quadro M6000, NVIDIA has decided to provide only minimal FP64 performance, giving it only 1:32 of FP32 capability and touting the M6000 as the best graphics card rather than the best graphics+compute card like the Quadro K6000.

AMD GPUs

AMD GPUs perform fairly well for FP64 compared to FP32. Most AMD cards (including the consumer/gaming series) will give between 1:3 and 1:8 FP32 performance for FP64. The AMD Tahiti architecture GPUs tested in the benchmarks here do not suffer from the same FP64 problems as NVIDIA’s GTX series and give 1:4 performance. Newer Hawaii architecture consumer-grade GPUs are expected to provide 1:8 performance.

The FirePro W9100, W8100 and S9150 will give you an incredible FP64 1:2 FP32 performance.

Overall, AMD GPUs hold a reputation for good double precision performance ratios compared to their NVIDIA counterparts.

Benchmarks

Here are some benchmarks comparing FP32 and FP64 GEMM.
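
GEMM throughput is usually reported in GFLOPS, counting 2*m*n*k floating point operations for an (m x k) by (k x n) product. The sketch below shows that conversion with a small timing harness. It is not the ArrayFire benchmark: `naive_gemm` is a pure-Python placeholder that runs on the CPU in double precision, and in practice you would time an FP32 and an FP64 GEMM call from your BLAS library of choice instead.

```python
import time


def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) by (k x n) product: 2*m*n*k FLOPs."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


def best_time(fn, repeats: int = 3) -> float:
    """Smallest wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def naive_gemm(a, b):
    """Placeholder GEMM on lists of lists; swap in a real BLAS call."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


size = 48
a = [[float(i + j) for j in range(size)] for i in range(size)]
b = [[float(i - j) for j in range(size)] for i in range(size)]
elapsed = best_time(lambda: naive_gemm(a, b))
print(f"{size}x{size} GEMM: {gemm_gflops(size, size, size, elapsed):.4f} GFLOPS")
```

Running the same harness once per precision and dividing the two GFLOPS figures gives the measured FP64:FP32 ratio for a device.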

Conclusion

There is no perfect solution to the problem of choosing between FP32 and FP64 when it comes to GPUs. The best users can do is decide whether they need the accuracy of double precision compute and how much compute they need, and then go for a GPU that maximizes their productivity. Some applications, such as physics modelling and simulation or high-accuracy financial computations, call for double precision accuracy at high performance and require capable FP64 cards. Applications such as image and video processing, signal processing and statistics may not require such high precision and can get away with high FP32 performance only.
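
To get a feel for whether a workload needs double precision, it can help to look at how rounding error accumulates in each format. The sketch below is one illustrative case, not a general test: it emulates FP32 on the CPU by rounding through Python's `struct` module and adds 0.1 one million times in both precisions.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, fp32: bool) -> float:
    """Add `step` to a running sum `count` times in the chosen precision."""
    if fp32:
        step = to_fp32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if fp32:
            total = to_fp32(total)  # emulate keeping the sum in FP32
    return total


count = 1_000_000
expected = 100_000.0  # mathematically exact value of 0.1 * 1,000,000
fp32_error = abs(accumulate(0.1, count, fp32=True) - expected)
fp64_error = abs(accumulate(0.1, count, fp32=False) - expected)
print(f"FP32 absolute error: {fp32_error:.6g}")
print(f"FP64 absolute error: {fp64_error:.6g}")
```

Long running sums like this are where FP32 tends to drift. Workloads dominated by short, well-scaled computations are usually much less sensitive.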

Disclaimer: We receive GPUs from both NVIDIA and AMD. However, this blog does not promote or advocate any specific vendor.

Sources:
Wikipedia: NVIDIA GeForce 700 Series, NVIDIA Tesla Series, NVIDIA Quadro Series, AMD FirePro Series, AMD GPUs
NVIDIA Website: NVIDIA GTX 780 Ti, NVIDIA GTX Titan Black, NVIDIA Tesla K40
Anandtech: NVIDIA GTX 780 Ti, NVIDIA GTX Titan Black, NVIDIA Tesla K40

Comments 12

Royi
June 23, 2015 at 7:34 am

Hi,
Great information.

At the bottom line, if you don’t want to spend a lot of money and get good FP64 performance, AMD is your choice.

By the way, could you show a graph which compares both platforms on OpenCL?
CUDA is highly optimized for one architecture, it would be more logical to test them both using the same language.

1. Shehzan Mohammed
  June 23, 2015 at 3:12 pm
  
  Thanks for the feedback.
  We will be doing upcoming blogs on benchmarking that will show the comparison in much more detail and for many more algorithms.
  If you wish to check the comparison yourself, you can simply use the benchmarking/blas.cpp example in ArrayFire. It runs a similar benchmark for GEMM.
  That is what we use to generate these as well.
  
  1. Royi
    June 23, 2015 at 5:09 pm
    
    Hi,
    Yet unlike you, I don’t have access to all of those.
    
    I wish you always included Intel Iris Pro (With 128 MB Embedded L4 Ram), AMD R290 and nVidia 780.
    
    All tested in OpenCL.
    Now it seems you have tendency for nVidia (Comparing their state of the art against previous cards of AMD and no Intel).
    
    1. Pavan Yalamanchili
      June 23, 2015 at 5:36 pm
      
      Royi
      
      There are many factors why we did not do this.
      
      1. The point of this post is NOT to compare AMD vs Intel or NVIDIA. We wanted to show realistically how single and double precision performs on off the shelf GPUs and server GPUs.
      
      2. The OpenCL BLAS library we were using for the benchmark (clBLAS) is highly tuned for AMD GPUs only. Comparing the performance of clBLAS on NVIDIA and Intel GPUs will not provide the right picture.
      
      3. We are using the hardware that was given to us by our partners. To reiterate, we were not trying to pit one hardware against the other; we were trying to do something else entirely.
      
      There is a reason we did not put the different GPUs on the same graph. It is so people don’t make the assumption that we are comparing NVIDIA to AMD. We were trying to compare 32 bit vs 64 bit floating point performance on individual architectures. You are unnecessarily reading between the lines and accusing us of something that we are not doing.
      
      1. Royi
        June 23, 2015 at 6:03 pm
        
        No accusation, on the contrary, I said the result might suggest that.
        
        I want to use ArrayFire and it means I need a certain hardware.
        Hence the results of your benchmarks and my performance expectations are the key point to decide on hardware.
        
        So even if you don’t want this role, unless someone else will make this data available, you’re the source.
        
        By the way, It would be nice if you could create a collaboration with Anandtech and include some program based on your library as part of their suite to review GPU’s.
        That would be amazing.
      2. Pavan Yalamanchili
        June 23, 2015 at 6:09 pm
        
        We will be updating our benchmarks page in a couple of weeks showing the performance of core algorithms on individual hardware. We will be showing the peak capabilities of each device and how much of peak we are achieving.
        
        We as a library strive to achieve the best performance out of all compute devices available. If you want to have a more informal chat about the hardware performance with the arrayfire developers, please head over to this chat room here: https://gitter.im/arrayfire/arrayfire.
Michał Janiszewski
June 24, 2015 at 1:43 pm

“running the same code by only changing float types to double types,
would yield the double precision run time to be about 1/24th of the
single precision time”

shouldn’t it be the other way round? “Single in 1/24th time of double” or “double in 24x time of single”

1. Shehzan Mohammed
  June 24, 2015 at 6:59 pm
  
  Thanks for pointing this out. I’ll correct it.
  
Gurga
January 5, 2016 at 11:56 pm

Quick note:

“It combines the best of both worlds. NVIDIA also states that the Tesla GPUs go through a much more rigorous Q&A process which guarantees lesser failures and also has additional features such as ECC memory. Hence the large price difference.”

What you are basically saying here is that NVIDIA has a rigorous question and answer process with their video cards to guarantee quality. I suppose if you wanted to REALLY personify video cards you could call running a video card a “Q&A,” but I’m almost 10000000% certain you meant to say QA, which means Quality Assurance, and is the most commonly used term to refer to… well… assuring quality in a product before shipping it 😛

1. Dan
  July 9, 2016 at 1:09 am
  
  you took three paragraphs to discuss “Q&A”. This hardly qualifies as a “Quick note” lols 😉
  
XXXeeqXXX
March 30, 2016 at 3:56 am

the last Conclusion, FP32 or 64 not much affected to game

Don Karam
July 7, 2016 at 9:56 pm

What’s the ratio with single and double precision performance on CPUs?