How much speedup can you get with CUDA or OpenCL?

Scott | ArrayFire, Benchmarks, CUDA, OpenCL

Every day, developers ask us to predict how much speedup they can get with CUDA or OpenCL. Rather than gaze mysteriously into a crystal ball, we ask them questions that explore the factors relevant to acceleration.

Note: we’ll use the term accelerator to include GPUs, Xeon Phi coprocessors, APUs, FPGAs, and any other CUDA or OpenCL device. The principles discussed below apply equally to all of these accelerators.

The following are some of the important factors to consider when estimating the potential speedup from an accelerator:

  • Hardware:  The more advanced the accelerator hardware, the greater the speedup (e.g. the NVIDIA Kepler K20 outperforms cards from the previous NVIDIA Fermi generation).
  • Data Sizes:  In general, accelerators outperform CPUs by a wider margin as data sizes increase. Accelerators are fast only because they exploit data parallelism. If there is not much data (e.g. only a few hundred elements in a vector), there is little work to spread across the device and the gain in performance will be small. If there is a lot of data (e.g. more than 10,000 elements, such as a matrix larger than 100×100), the accelerator can process many elements in parallel. For an example of how speedups increase with data size, see the ArrayFire benchmarks.
  • Application:  Speedup figures vary from application to application, because some operations map onto an accelerator better than others. In general, the more parallelizable the computation, the better the accelerator performance. And the fewer trips the data makes across the bus between host and accelerator memory, the better. For examples of real application speedups, see the ArrayFire case studies.
  • CPU:  When you compare CPU and accelerator performance, the CPU's speed matters. The CPU also still plays an important role in heterogeneous computing throughput. For example, a faster CPU helps ArrayFire run faster, because ArrayFire does a lot of work on the CPU (e.g. JIT compilation) to keep the accelerator fully utilized.
  • Code Quality:  Better-written CPU code generally leads to better CPU + accelerator code.
  • Choice of Software Tools:  Your choice of software tools affects the speedup you can reach. For instance, well-tuned libraries often deliver more performance for less development effort than compiler-based approaches to accelerator programming.
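
To see how the data size, application, and bus-traffic factors above interact, here is a minimal back-of-the-envelope sketch in Python. It is an Amdahl's-law style estimate, not a measurement and not part of ArrayFire. The function name `estimate_speedup` and every default value (per-element costs, bus bandwidth, fixed overhead) are illustrative assumptions that you would replace with numbers measured on your own hardware.

```python
def estimate_speedup(n_elements, parallel_fraction=0.95,
                     cpu_ns_per_element=50.0, accel_ns_per_element=1.0,
                     bytes_per_element=4, bus_gb_per_s=6.0,
                     bus_trips=2, fixed_overhead_us=10.0):
    """Rough CPU-vs-accelerator speedup estimate.

    All defaults are made-up placeholder values, not benchmark results.
    """
    # Time for the CPU to do all of the work by itself.
    cpu_time = n_elements * cpu_ns_per_element * 1e-9

    # The non-parallelizable part still runs at CPU speed.
    serial_time = (1.0 - parallel_fraction) * cpu_time

    # The parallelizable part runs on the accelerator.
    accel_compute = parallel_fraction * n_elements * accel_ns_per_element * 1e-9

    # Every trip across the bus moves the whole data set once.
    transfer = bus_trips * n_elements * bytes_per_element / (bus_gb_per_s * 1e9)

    # Fixed cost per run (e.g. kernel launch), independent of data size.
    overhead = fixed_overhead_us * 1e-6

    accel_time = serial_time + accel_compute + transfer + overhead
    return cpu_time / accel_time


# Sweep the data size to see how the estimate changes.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,d} elements: estimated speedup {estimate_speedup(n):.2f}x")
```

With these placeholder defaults, the estimate is below 1x for a 100-element vector because the fixed overhead dominates. It then rises with data size and levels off below the Amdahl's-law ceiling of 1 / (1 - parallel_fraction). Raising `bus_trips` or lowering `parallel_fraction` pulls the estimate down, which matches the points about bus traffic and parallelizability.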

What factors do you consider when predicting speedups?

