GPUs are really good at doing math. Their Achilles heel is 64-bit double precision math. GPUs, at least consumer grade ones, are not built for high performance FP64. This is because they are targeted at gamers and game developers, who do not really care about high precision compute. So vendors like NVIDIA and AMD do not cram FP64 compute cores into their GPUs.
For example, on a GTX 780 Ti, the FP64 performance is 1/24 of FP32. This means that in the ideal case, running the same code with only the float types changed to double types would make the single precision run time about 1/24th of the double precision run time (time(FP32) = time(FP64)/24).
Keep in mind that for compute-bound algorithms, such as GEMM and FFT, the theoretical best case for FP64 performance is 1:2 FP32, simply because it involves computing with double the number of bits as FP32. If the algorithm is memory bound, such as matrix transpose, then most GPUs will attain the 1:2 performance, because the run time is set by moving twice as many bytes rather than by the number of FP64 units. The numbers we discuss below are all compute-bound performance numbers.
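To make the compute-bound versus memory-bound distinction concrete, here is a rough roofline-style sketch in Python. It is only a model, not a measurement: the function names are made up for illustration, the peak throughput (about 5.1 TFlops FP32), the 1:24 ratio and the memory bandwidth (384-bit bus at an effective 7 GHz, so roughly 336 GB/s) are 780 Ti-like figures, and the data movement counts assume every matrix element is read or written exactly once.

```python
# Rough roofline model: run time is the larger of compute time and memory time.
# All figures are assumptions for illustration (780 Ti-like card, FP64 at 1:24).
PEAK_FP32_FLOPS = 5.1e12          # ~5.1 TFlops single precision
FP64_RATIO = 24                   # FP64 runs at 1/24 of FP32
BANDWIDTH_BYTES = 384 / 8 * 7e9   # 384-bit bus, 7 GHz effective: 336 GB/s


def roofline_time(flops, values_moved, bytes_per_value, peak_flops):
    """Estimated run time in seconds for one kernel."""
    compute_time = flops / peak_flops
    memory_time = values_moved * bytes_per_value / BANDWIDTH_BYTES
    return max(compute_time, memory_time)


def fp64_slowdown(flops, values_moved):
    """How many times slower the FP64 version is than the FP32 version."""
    t32 = roofline_time(flops, values_moved, 4, PEAK_FP32_FLOPS)
    t64 = roofline_time(flops, values_moved, 8, PEAK_FP32_FLOPS / FP64_RATIO)
    return t64 / t32


n = 4096
# Matrix transpose: no arithmetic, read n*n values and write n*n values.
transpose_slowdown = fp64_slowdown(flops=0, values_moved=2 * n * n)
# GEMM: 2*n^3 flops, ideally touching three n*n matrices once each.
gemm_slowdown = fp64_slowdown(flops=2 * n**3, values_moved=3 * n * n)

print(f"transpose: FP64 is about {transpose_slowdown:.0f}x slower")
print(f"GEMM:      FP64 is about {gemm_slowdown:.0f}x slower")
```

Under these assumptions the transpose only pays for moving twice as many bytes, while GEMM pays the full FP64 penalty of the card.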
How double precision performs really depends on the architecture of the GPU.
Comparing NVIDIA GPUs
Let's take three almost identical cards: the GTX 780 Ti, the GTX Titan Black and the Tesla K40c. All are Kepler GK110 based GPUs, with the same number of SMX and cores (15 SMX, 2880 cores) and the same bus width (384-bit). The 780 Ti and Titan Black even have nearly the same base clock speeds (~880MHz; K40c is 745MHz) and identical memory clock speeds (7GHz; K40c is 6GHz).
Single Precision Performance
With respect to single precision performance, all three are in the same ballpark. While the 780 Ti and Titan Black sit at around 5.1 TFlops, owing to their similar clock speeds, the Tesla K40c comes in at 4.3 TFlops. You could give the Tesla a pass because it has a lower clock speed.
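These single precision figures can be roughly reproduced from the core counts and base clocks quoted above. The sketch below assumes each core retires one fused multiply-add (counted as 2 floating point operations) per clock, which is the usual way peak numbers are quoted; the function name is just for illustration.

```python
def peak_fp32_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak in TFlops, assuming one FMA (2 flops) per core per clock."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


# 780 Ti and Titan Black: 2880 cores at roughly 880 MHz
gtx_peak = peak_fp32_tflops(2880, 880)
# Tesla K40c: 2880 cores at 745 MHz
k40c_peak = peak_fp32_tflops(2880, 745)

print(f"780 Ti / Titan Black: {gtx_peak:.2f} TFlops")  # about 5.07
print(f"Tesla K40c:           {k40c_peak:.2f} TFlops")  # about 4.29
```

The gap between the two results comes entirely from the clock speed, since the core counts are identical.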
The three cards vary significantly in only three categories.
The difference in memory size doesn't affect performance very much (at least when the problem sizes fit on all three GPUs). Judging by the market prices of other GPUs with similar memory size variations, it should not account for much of the price difference either.
3. Double Precision Performance
So if the 780 Ti and the Titan Black are practically the same in every respect, why is there a $300 difference in their price at launch (discounting memory size difference)?
The answer is in the double precision capabilities. The 780 Ti is physically locked at 1:24 FP32, whereas the Titan Black has an ace up its sleeve.
For the Titan Black, the magic happens in the driver. The Titan Black's driver gives the user an option to choose the double precision performance between 1:3 and 1:24 FP32 (by switching the GPU to TCC mode). When the double precision performance is set to 1:24 FP32, which is the same as the 780 Ti, the single precision performance of the Titan Black and the 780 Ti is identical. But when the user sets the double precision performance to 1:3 FP32, the single precision performance is compromised to boost double precision performance and make it equal to the K40c. In other words, you can choose the performance of the Titan Black to match either the 780 Ti or the K40c based on your preference.
The K40c has a double precision performance of 1:3 without compromising the single precision performance. This is because the K40 is given a special double precision unit for every 3 single precision cores (white paper). It combines the best of both worlds. NVIDIA also states that the Tesla GPUs go through a much more rigorous QA process, which guarantees fewer failures, and that they have additional features such as ECC memory. Hence the large price difference.
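To put the ratios into absolute terms, the short sketch below divides the approximate single precision figures from this post by each card's FP64 ratio. The dictionary and function names are only for illustration, and the results are theoretical peaks, not benchmark results. The Titan Black is left out of the 1:3 case because its single precision throughput changes in that mode.

```python
# (approximate peak FP32 TFlops, FP64 ratio denominator) as quoted in this post
CARDS = {
    "GTX 780 Ti": (5.1, 24),
    "GTX Titan Black (1:24 mode)": (5.1, 24),
    "Tesla K40c": (4.3, 3),
}


def peak_fp64_tflops(fp32_tflops, ratio_denominator):
    """Theoretical FP64 peak for a card whose FP64 rate is 1:ratio_denominator of FP32."""
    return fp32_tflops / ratio_denominator


fp64_peaks = {name: peak_fp64_tflops(*spec) for name, spec in CARDS.items()}

for name, tflops in fp64_peaks.items():
    print(f"{name}: about {tflops:.2f} TFlops FP64")
```

On these numbers the K40c ends up with several times the theoretical FP64 throughput of the 780 Ti, despite its lower single precision peak.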
Summary of NVIDIA GPUs
NVIDIA's GTX series cards are known for great FP32 performance but very poor FP64 performance. FP64 performance generally ranges between 1:24 (Kepler) and 1:32 (Maxwell) of FP32. The exceptions are the GTX Titan cards, which blur the lines between the consumer GTX series and the professional Tesla/Quadro cards.
The Kepler architecture Quadro and Tesla series cards provide full double precision performance at 1:3 FP32. However, with the Quadro M6000, NVIDIA has decided to provide only minimal FP64 performance, giving it only 1:32 of its FP32 capability and touting the M6000 as the best graphics card rather than the best graphics+compute card like the Quadro K6000.
AMD GPUs perform fairly well for FP64 compared to FP32. Most AMD cards (including the consumer/gaming series) will give between 1:3 and 1:8 FP32 performance for FP64. The AMD Tahiti architecture GPUs tested in the benchmarks here do not suffer from the same FP64 problems as NVIDIA's GTX series and give 1:4 performance. Newer Hawaii architecture consumer grade GPUs are expected to provide 1:8 performance.
The FirePro W9100, W8100 and S9150 will give you an incredible FP64 performance of 1:2 FP32.
Overall, AMD GPUs hold a reputation for good double precision performance ratios compared to their NVIDIA counterparts.
Here are some benchmarks comparing FP32 and FP64 GEMM.
There is no perfect solution to the problem of choosing between FP32 and FP64 when it comes to GPUs. The best users can do is decide whether they need the accuracy of double precision compute and how much compute they need, and then go for the GPU that maximizes their productivity. Some applications, such as physics modelling and simulation or high accuracy financial computations, call for double precision accuracy at high performance and need capable FP64 cards. Applications such as image and video processing, signal processing and statistics may not require such high precision and can get away with high FP32 performance alone.
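As a small illustration of why some workloads care about the extra bits, the sketch below adds 0.1 a million times, once with every intermediate result rounded to single precision and once in double precision. It runs on the CPU in plain Python and emulates FP32 by round-tripping values through the struct module; the helper names are made up for this example, and a naive running sum is a deliberately unfriendly case, not a statement about any particular application.

```python
import struct


def to_float32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single):
    """Naively add `step` to a running total `count` times.

    With single=True the step and every partial sum are rounded to FP32,
    which emulates doing the whole accumulation in single precision.
    """
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)
    return total


COUNT = 1_000_000
EXACT = 100_000.0  # 0.1 * 1,000,000 in exact arithmetic

err32 = abs(accumulate(0.1, COUNT, single=True) - EXACT)
err64 = abs(accumulate(0.1, COUNT, single=False) - EXACT)

print(f"FP32 running sum is off by {err32:.6g}")
print(f"FP64 running sum is off by {err64:.6g}")
```

The single precision error grows large enough to be visible in the leading digits, while the double precision error stays tiny. Whether that matters depends entirely on the application, which is the trade-off described above.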
Disclaimer: We receive GPUs from both NVIDIA and AMD. However, this blog does not promote or advocate any specific vendor.
Wikipedia: NVIDIA GeForce 700 Series, NVIDIA Tesla Series, NVIDIA Quadro Series, AMD FirePro Series, AMD GPUs
NVIDIA Website: NVIDIA GTX 780 Ti, NVIDIA GTX Titan Black, NVIDIA Tesla K40
Anandtech: NVIDIA GTX 780 Ti, NVIDIA GTX Titan Black, NVIDIA Tesla K40