In case you missed it, we recently held an ArrayFire Webinar, focused on exploring the tradeoffs of OpenCL vs CUDA.
This webinar is part of an ongoing series of webinars held each month to present new GPU software topics as well as programming techniques with Jacket and ArrayFire.
For those of you who missed it, we provide a recap here. Lots of questions were fielded by our team, so it’s a must-watch. We hope to see you at the next one!
Download the slides. Here is a transcript of the content portion of the webinar:
AccelerEyes is pleased to present today’s ArrayFire webinar looking at OpenCL and CUDA Trade-offs and Comparisons. Every day, we interact with many programmers in various stages of GPU development projects. In making GPU project decisions, there is a lot of information to absorb from a variety of sources. The intent of this webinar is to condense this information into the key points that matter, to help you make GPU computing software decisions.
Over the last 5 years, as we’ve collected information from our GPU computing customers, we’ve determined that there are 5 core GPU software features that programmers seek in a technology. These include:
- Performance: This is the core motivation for GPU computing and relates to quality. “How good will my code end up?” is the question here. “How fast can I push it?”
- Scalability: This is related to labor costs and quality of results. “If I develop my code on a workstation, can I launch it on a cluster without major headaches?” is the question here.
- Portability: This is related to flexibility and costs, both labor costs and fixed hardware costs. It can also lead to a question of quality and freedom to move code to better or newer hardware that emerges.
- Community: This is a broad term meant to encompass other terms like support, longevity, commitment, etc. It is important that the technology platform has a lot of users and momentum. It is important that investments made today persist and pay dividends down the road.
- Programmability: A sense of how much labor and effort are required to get good performance. This includes a sense of the software platform’s robustness to bugs, availability of functionality, and time to get from project start to successful project end.
We’ve arranged this webinar to address these five GPU Software Features for the two major GPU computing platforms, CUDA and OpenCL. For each of these features, we will share some of our thoughts on how CUDA and OpenCL compare. Then we will conclude the discussion of each feature with some comments on how the feature relates to ArrayFire (those slides have an ArrayFire logo in the upper right hand corner of the slide).
The first feature is Performance. Both CUDA and OpenCL are fast, and on GPU devices they are much faster than the CPU for data-parallel codes, with 10X speedups commonly seen on data-parallel problems.
Both CUDA and OpenCL can fully utilize the hardware. They are both entirely sufficient to extract all the performance available in whatever hardware device you use. But we italicized the word “can” here, because the devil is really in the details. Performance depends upon a slew of variables, including hardware type, algorithm type, and code quality. It is nearly impossible to guess how much speedup you can extract from a piece of code. In our experience, nearly all science, engineering, and financial codes can get great benefit out of GPU hardware, but the big question is how difficult it is to transform your algorithm to realize those benefits. We’ll discuss programmability later on.
We will present ArrayFire performance results in CUDA and OpenCL at the end of the webinar.
The second feature we’ll discuss is Scalability. Scalability can mean many things, so we’ve broken this discussion down into 3 kinds of scaling.
From Laptops to Single GPU machines, both CUDA and OpenCL codes scale without any code change. This is a very common use case. We see nearly half of GPU computing users leverage a laptop at some stage of GPU development and later move the code to a different “performance” hardware setup, like a workstation or cluster. Both CUDA and OpenCL make life easy for this use case.
From a Single GPU Machine to a Multi-GPU Machine, both CUDA and OpenCL require user managed code for low-level synchronization of communication between the multiple GPUs. This is a headache, but with patience, it is manageable in both CUDA and OpenCL.
From a Multi-GPU machine to a Cluster, neither CUDA nor OpenCL really offers much assistance. Rather, programmers tend to write their own MPI code that handles all the cluster communication and then use CUDA or OpenCL directly in each node.
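To give a feel for the user-managed work described above for the multi-GPU case, here is a minimal sketch in plain Python. It is not CUDA or OpenCL code: threads stand in for GPUs, and `split_evenly`, `run_on_device`, and `multi_device_square` are hypothetical names invented for this illustration. The point is only that the programmer, not the platform, partitions the data, launches work per device, and places the synchronization point.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data, num_devices):
    """Partition data into num_devices contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), num_devices)
    chunks, start = [], 0
    for device_id in range(num_devices):
        # The first `extra` devices each take one additional element.
        size = base + (1 if device_id < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def run_on_device(device_id, chunk):
    """Stand-in for a kernel launch on one device: square every element."""
    return [x * x for x in chunk]


def multi_device_square(data, num_devices):
    """Split the work, run each chunk on its own (simulated) device, then merge."""
    chunks = split_evenly(data, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(run_on_device, i, c) for i, c in enumerate(chunks)]
        # Explicit synchronization point: block until every device has finished.
        partials = [f.result() for f in futures]
    # Reassemble the per-device results in their original order.
    return [y for part in partials for y in part]
```

In a real multi-GPU program each of these steps would also involve per-device memory allocation and host-to-device transfers, which is where much of the headache comes from.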
With respect to Scalability, there are some other interesting developments of note. The first is that there is new technology in CUDA called GPUDirect that is aimed at reducing memory transfer overheads when communicating between multiple GPUs. It has optimizations to reduce overhead by allowing peer-to-peer memory transfers between GPUs on the same PCI Express bus. It also has optimizations to reduce the overhead of moving data from GPU memory to a network interface card. This is certainly an interesting development, but it is too new for us to say if it offers enough benefit to be an important technology.
The second interesting development is in mobile GPU computing. OpenCL has quickly become the most pervasive way to do GPU computing on mobile devices, including smartphones and tablets. Companies like ARM, Imagination Technologies, Freescale, Qualcomm, Samsung, and others are all enabling their mobile GPUs to run OpenCL codes. There are more mobile devices sold each year than there are PCs, so this is a huge community that is beginning to put its support behind OpenCL. At AccelerEyes, we have done several GPU consulting projects on mobile GPUs and are believers that there is big benefit to accelerating apps, especially computer vision and video processing apps, directly on the phone or tablet.
In scaling from Laptops to Single GPU machines, ArrayFire’s just-in-time compiler automatically makes optimizations for the GPU type, without any code change. In this sense, both the CUDA and OpenCL versions of ArrayFire enjoy scalability here.
From a single GPU machine to a multi-GPU machine, ArrayFire has a big advantage. The ArrayFire deviceset() function makes multi-GPU computing super simple. No need to mess with synchronization issues. ArrayFire automatically manages memory and queues up all GPUs in your system with a full workflow, ensuring good resource utilization.
From multi-GPU machines to clusters, ArrayFire is the same as CUDA and OpenCL. There is not really any added benefit, and users write and manage their own MPI code.
The third feature is Portability. This is perhaps the most recognizable difference between CUDA and OpenCL. CUDA only runs on NVIDIA GPUs, while OpenCL is the open industry standard and runs on AMD, Intel, NVIDIA, and other hardware devices.
With respect to CUDA, there was a recent announcement at NVIDIA’s GPU Technology Conference in Asia that said CUDA would become more open, and the press carried it as saying that CUDA would become open source. This is definitely a step that GPU programmers are happy to see. But it remains to be seen what this actually means. There are two comments that we’ll make on this announcement:
- From what we can tell, parts of the CUDA compiler will be open sourced to a limited number of groups. These groups will likely try to build compiler adaptations that enable CUDA code to run on other devices, if you use their compilers. From the announcement, it appears that the CUDA libraries, like CUBLAS and CUFFT, will not be open sourced. This is a critical distinction, because as we’ll show later on, libraries are key.
- Creating a compiler that can automatically generate tuned code for various hardware devices with very different architectures is extremely difficult. These kinds of projects remain popular in hardcore academic research, but have yet to mature into actual widespread utility. Ocelot is an example research project in this category.
Also, with respect to portability, CUDA does not provide CPU fallback. Currently, developers using CUDA typically put if-statements in their code that distinguish between the presence or absence of a GPU device at runtime. In contrast, with OpenCL, CPU fallback is supported and makes code maintenance much easier.
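The if-statement pattern mentioned above might look roughly like the following sketch. It is plain Python rather than CUDA, and every name in it (`gpu_device_count`, `saxpy_gpu`, `saxpy_cpu`, `saxpy`) is hypothetical. The probe is hard-coded to report no devices, where a real program would ask the GPU runtime.

```python
def gpu_device_count():
    """Hypothetical probe. A real program would query the GPU runtime here;
    this sketch always reports zero devices."""
    return 0


def saxpy_cpu(a, x, y):
    """CPU implementation of a*x + y, kept alongside the GPU one."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_gpu(a, x, y):
    """Placeholder for the GPU implementation (not provided in this sketch)."""
    raise RuntimeError("no GPU path in this sketch")


def saxpy(a, x, y, device_count=None):
    """Dispatch at runtime depending on whether a GPU is present."""
    if device_count is None:
        device_count = gpu_device_count()
    if device_count > 0:
        return saxpy_gpu(a, x, y)
    # Manual CPU fallback: the developer maintains this second code path.
    return saxpy_cpu(a, x, y)
```

The maintenance cost is visible even at this scale: every routine needs two implementations plus a dispatch branch, and both paths have to be kept in agreement.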
ArrayFire is fully portable. The same ArrayFire code runs on CUDA or OpenCL. The only difference is the version of the ArrayFire library that you link against in your code.
The main caveat here is that today, the OpenCL version of ArrayFire only supports a subset of the functionality available in the CUDA version. This is due to the fact that our CUDA code base has been around much longer. It is also due to the fact that there is less of an OpenCL software ecosystem, as we’ll discuss next.
The fourth feature is Community. This is the feature that encompasses support, longevity, commitment, etc. As those things are hard to measure, we put together a proxy. It is interesting to look at the number of forum topics on NVIDIA’s CUDA forums, at nearly 27,000, and on AMD’s OpenCL forums, at 4,000. Also, the neutral third-party site Stack Overflow has tags for CUDA and OpenCL, with the number of questions tagged CUDA being over 3X the number tagged OpenCL. As you would expect, there are many more people doing CUDA programming today due to the great investment NVIDIA has put into building the ecosystem for GPU computing.
With respect to AccelerEyes, we have over 1,400 GPU topics on our forums, which is the largest community of GPU programmers supported by any software company. The next largest is the veteran HPC company, PGI, which has 485 topics on their GPU forums.
Community also has to do with ecosystem and other tools. We will cover a discussion of those as they relate to libraries in a moment.
The fifth and final feature is Programmability. Both CUDA and OpenCL are low-level. It is time consuming to do GPU kernel development in either of those platforms. The bulk of that time is often spent in redesigning algorithms to exploit data-parallelism.
This is why the entire GPU computing market has lately shifted a major focus towards programmability.
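As a small illustration of what redesigning an algorithm for data-parallelism means, consider a running (prefix) sum. The sketch below is plain Python, not GPU code, and the function names are made up for this example. The serial version carries a dependency from each element to the next. The second version uses the well-known Hillis-Steele scan formulation, in which every element update within a step is independent and could, in principle, be assigned to its own GPU thread.

```python
def serial_scan(values):
    """Inclusive prefix sum, written the natural serial way."""
    out, running = [], 0
    for v in values:
        running += v  # each iteration depends on the previous one
        out.append(running)
    return out


def data_parallel_scan(values):
    """Inclusive prefix sum in Hillis-Steele form.

    Takes about log2(n) steps. Within one step, every element is computed
    only from the previous step's array, so all updates are independent.
    """
    current = list(values)
    offset = 1
    while offset < len(current):
        current = [
            current[i] + current[i - offset] if i >= offset else current[i]
            for i in range(len(current))
        ]
        offset *= 2
    return current
```

Both functions return the same result, but the second has a different structure and does more total additions. That kind of restructuring is the part that takes time.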
To understand the landscape, let’s look at the simple two-by-two below, where we have Faster vs Slower technologies on the y-axis, and time-consuming vs easy-to-use technologies on the x-axis. As a baseline, you can consider SSE or AVX instructions on the CPU as something that is time consuming to write and won’t end up giving you the data-parallel performance that you can expect out of a GPU.
Writing GPU kernels in CUDA or OpenCL leads to much faster code, but is likewise very time-consuming to develop.
In the opposite corner, compiler directives have recently become popular. The claim of these is that you can sprinkle a few pragmas into your code and that the compiler will figure out how to get the code to run well on the GPU. While in some simple cases you might get a little benefit, there is simply no compiler today that is capable of automatically generating good, fast code for GPUs from standard serial CPU code. Compilers simply can’t figure out how to morph serial algorithms into data-parallel algorithms.
That is why libraries are so key to GPU computing. In a library, you get access to a set of functions that have already been hand-optimized and tuned to exploit data-parallelism. Libraries include within them the benefits of speed that come from writing kernels. But they are also written with ease-of-use in mind and require only a level of intrusion similar to that of compiler directives. This is why ArrayFire has been and continues to be so successful.
Libraries really make all the difference in GPU computing. To compare and contrast CUDA versus OpenCL, it is important to look at the comparison of libraries available.
Raw math libraries available in CUDA include CUBLAS, CUFFT, CULA, and Magma. These are pretty much complete, providing the majority of all routines necessary for raw dense linear algebra and FFT. CUDA also has CUSPARSE, which is a good start for sparse linear algebra routines, but still needs to mature. CUDA libraries only run on NVIDIA GPUs. NVIDIA does not provide libraries for OpenCL.
Raw math libraries in AMD’s OpenCL have matured a lot recently. With clAmdBlas and clAmdFft, you get most of the important BLAS routines and radix 2, 3, and 5 FFT routines, which cover the most common cases. There is no LAPACK function support and there are no sparse data support libraries.
But a very important point is that AMD’s libraries run not only on AMD devices, but on all OpenCL-compliant devices, including NVIDIA GPUs.
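One practical consequence of having radix 2, 3, and 5 FFT routines is that transform lengths need to factor into those primes. The sketch below assumes that "radix 2, 3, and 5" means supported lengths have the form 2^a · 3^b · 5^c; that reading, and both helper names, are assumptions for illustration and not part of any library API.

```python
def is_radix_235_length(n):
    """Return True if n factors entirely into 2s, 3s, and 5s."""
    if n < 1:
        return False
    for radix in (2, 3, 5):
        while n % radix == 0:
            n //= radix
    # Anything left over is a prime factor other than 2, 3, or 5.
    return n == 1


def next_radix_235_length(n):
    """Smallest length >= n that factors into 2s, 3s, and 5s.

    Useful if you choose to zero-pad a signal up to a supported size.
    """
    candidate = max(n, 1)
    while not is_radix_235_length(candidate):
        candidate += 1
    return candidate
```

For example, lengths such as 1024, 1000, and 360 pass the check, while a length of 7 would need padding to 8.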
Due to these developments at AMD, AccelerEyes is proud to have recently supported OpenCL in both our Jacket product (which applies to MATLAB® code) and our ArrayFire product (which applies to C/C++, Fortran, and Python).
Our OpenCL support is new and not nearly as mature as our support of CUDA. But our initial OpenCL support is better than our initial CUDA support was when we first launched our CUDA products. And we expect OpenCL to continue to mature rapidly in the near future.
This concludes the slide portion of the presentation. In what follows, we will spend some time showing benchmarks of OpenCL versus CUDA in ArrayFire code, with a particular focus on the raw math libraries we just discussed.
Check out the video below for the benchmarks and Q&A session. Enjoy!