Parallel Software Development Trends for Dummies

John Melonakos | Computing Trends

Last month, I posted two articles describing computing trends and why heterogeneous computing will be a significant force in computing for the next decade. Today, I continue that series with an article describing the biggest challenge to continued increases in computing performance – parallel software development.

Biggest Challenge

As I described previously, using an accelerator requires software changes: regular x86 compilers cannot target accelerators on their own. How much change is required depends on the availability of, and your reliance on, software tools that increase performance and productivity.

There are four possible approaches to taking advantage of accelerators in heterogeneous computing environments: do-it-yourself, use compilers, use libraries, or use accelerated applications (if you’re lucky).


Do-It-Yourself

The do-it-yourself approach means spending an inordinate amount of software development time writing code that leverages all the parallel attributes of your heterogeneous computer. Hardware vendors such as AMD, Intel, and NVIDIA provide low-level tools for developing parallel code that runs on their heterogeneous hardware.

NVIDIA is the leader with the CUDA platform. CUDA was the first widely adopted platform for high-performance general-purpose GPU code. Developers of CUDA code write GPU kernels, manage different levels of GPU memory, make tradeoffs for data transfers between the host and the device, and optimize many other aspects of the parallel system.

OpenCL is the industry’s open standard for writing data-parallel code on heterogeneous computers. AMD and Intel both promote OpenCL as a primary approach to programming their parallel computing hardware. OpenCL requires a similar level of low-level understanding and competence to write efficient parallel software.

Both CUDA and OpenCL primarily target the data parallelism of devices, so additional work is needed to use the multiple cores available on the CPUs in the system. For instance, OpenMP and MPI enable the use of multiple CPU cores and compute nodes in a heterogeneous system.
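To give a feel for the bookkeeping involved even on the CPU side, here is a minimal sketch that splits a SAXPY loop (y = a*x + y) across threads by hand. It uses plain C++ `std::thread` rather than OpenMP, MPI, CUDA, or OpenCL, and the function name `saxpy_threads` and the chunking scheme are my own illustrative choices, not anything from a vendor toolkit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hand-partitioned SAXPY (y = a*x + y) across CPU threads.
// Illustrative only: the name and the chunking scheme are assumptions.
void saxpy_threads(float a, const std::vector<float>& x, std::vector<float>& y,
                   unsigned num_threads) {
    assert(x.size() == y.size());
    assert(num_threads > 0);
    const std::size_t n = x.size();
    // Ceiling division so that every element lands in exactly one chunk.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no elements left to hand out
        // Each worker updates its own disjoint slice [begin, end) of y.
        workers.emplace_back([a, &x, &y, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                y[i] = a * x[i] + y[i];
            }
        });
    }
    // Wait for all workers before returning control to the caller.
    for (std::thread& w : workers) w.join();
}
```

Calling `saxpy_threads(2.0f, x, y, 4)` updates `y` in place using up to four threads (some toolchains need `-pthread` when building). Even this toy version has to pick chunk sizes, handle leftover elements, and join threads; GPU code adds device memory management and host-device transfers on top of that.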

The do-it-yourself approach is very costly in developer muscle and expertise, and it does not adapt well to ongoing hardware updates. For instance, a parallel algorithm tuned for NVIDIA’s Fermi GPUs may not be the best choice for Kepler GPUs. These difficulties are what prompted AMD to say that most programmers will not use CUDA or OpenCL. We agree. Most programmers will be smarter and will use the more productive approaches described below.

The best attribute of the do-it-yourself approach is that it’s always available in case other more productive approaches fail.

Use Compilers

Compiler developers have made serious research attempts to shift the brunt of parallel software development from manpower to compiler power. Unfortunately, these attempts have not been very successful, and automatically finding the parallelism in code remains an unsolved research problem.

There are some use cases where simple, straightforward loops can be vectorized or parallelized by compilers and executed efficiently on parallel hardware without much human intervention. But those are not commonplace.
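The difference between a loop a compiler can handle and one it cannot often comes down to whether iterations depend on each other. Here is a small C++ sketch of both cases; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: element i depends only on a[i] and b[i],
// so a compiler is free to vectorize or parallelize this loop.
std::vector<float> add_elementwise(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Loop-carried dependence: iteration i reads the value written by
// iteration i-1, so the iterations cannot simply run side by side as written.
std::vector<float> running_sum(const std::vector<float>& a) {
    std::vector<float> s(a);
    for (std::size_t i = 1; i < s.size(); ++i) {
        s[i] = s[i - 1] + s[i];
    }
    return s;
}
```

A running sum can still be computed in parallel, but only by switching to a different algorithm (a parallel scan), which is a restructuring that a human usually has to supply.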

Compilers can also do better when software developers add compiler-targeted directives (or software changes) that hint at which things should be parallelized. However, many people have noticed that by the time you add enough compiler hints to get good performance, you often would have been better off taking the do-it-yourself approach in the first place.
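OpenMP is one widely used example of such directives. The sketch below, with a hypothetical function name `dot`, marks a dot-product loop as parallel and tells the compiler that `sum` is a reduction variable; it assumes an OpenMP-enabled build.

```cpp
#include <cassert>
#include <vector>

// Dot product with an OpenMP hint.
// Assumes an OpenMP-enabled build (for example -fopenmp with GCC or Clang);
// without it the pragma is ignored and the loop simply runs serially.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double sum = 0.0;
    // The reduction clause gives each thread a private partial sum
    // and combines the partial sums when the loop finishes.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

One directive is enough here because the loop is simple. Real code tends to need further hints about scheduling, data sharing, and memory placement, which is where the hint count starts to approach do-it-yourself effort.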

Compilers will continue to improve, but researchers have spent decades trying to solve the problem of automatically detecting parallelism in code and have failed. In fact, deciding whether arbitrary code can safely run in parallel is undecidable in general, so no compiler can detect parallelism automatically in every case.

The best attribute of compilers is that they work on simple arithmetic in loops; so if you have a very simple use case with simple operations, compilers are worth consideration.

Use Libraries

Libraries benefit from the great performance of the do-it-yourself approach, as well as the ease of use of the compiler approach. Software libraries are written by parallel computing experts who are focused on squeezing the last bit of performance out of one narrow function at a time.

Today, heterogeneous software libraries are built on top of the CUDA or OpenCL platform, and selection of the appropriate library depends upon the choice of platform. There are more CUDA libraries available today than OpenCL libraries, largely because NVIDIA has done a marvelous job of building a parallel computing ecosystem. However, with the recent emergence of Intel Xeon Phi and the growth of OpenCL’s utility in mobile computing, OpenCL libraries are becoming increasingly prevalent.

Using a library requires trust in the library’s developers. Common qualifying questions regarding libraries include: Is the library fast? Is the library stable? Is the library fully supported? Will the library be updated quickly as new hardware updates occur? Is there a community built around the library?

Libraries that have strong answers to those questions offer great benefits for software developers. Libraries that do not can be a waste of time, and developers would do well not to keep pushing on broken software.

The best attribute of libraries is the ability to leverage expertly written parallel software while avoiding the time-sink of the do-it-yourself approach.
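The productivity argument is easy to see even without a GPU. In the sketch below the C++ standard library stands in for "a library written by someone else"; no CUDA or OpenCL library is assumed, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Do-it-yourself: a simple insertion sort, easy to write but O(n^2).
std::vector<int> sort_by_hand(std::vector<int> v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        const int key = v[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right to make room for key.
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
    assert(std::is_sorted(v.begin(), v.end()));  // sanity check on our own code
    return v;
}

// Library: one call to a routine that specialists have already tuned.
std::vector<int> sort_with_library(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

Both functions return the same result, but the library version is one line, and it gets faster whenever the library maintainers improve it. Accelerated libraries offer the same trade at a larger scale.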

Use Accelerated Applications

As heterogeneous computing becomes more prevalent, many popular applications are already accelerator-enabled. For instance, Adobe has products that are CUDA- and OpenCL-accelerated for faster video transcoding. We are working with MathWorks on parallel tools. Ansys has GPU-accelerated products. And there are many more.

Users of those products do not have to do any heavy lifting to get faster code. They simply benefit from the work done by the application developers, often through simple checkboxes or settings that indicate a preference for accelerated computing.

The Next Decade: Challenges for Parallel Software

The next decade will be defined by how the industry responds to the challenges of developing parallel software for heterogeneous computers, from high-performance computers down to mobile devices. Poor choices will lead to outcomes like that of the Roadrunner supercomputer (and the IBM Cell processor), which is being decommissioned after only a few years of use (see note below). The industry’s goal is to write software that leverages the best parallel computing hardware, adapts well to the rapid pace of hardware updates, and minimizes developer muscle.

What challenges have you noticed about developing parallel software? How do you think the next decade of software development will be defined?


  • The IBM Cell processor was very popular when we started AccelerEyes in 2007. I’m glad we never bet on that technology and stuck with GPUs, or we would have wasted a lot of precious resources. Some other library companies were not so lucky and were hurt by the downfall of the Cell processor.

Posts in this series:

  1. CPU Processing Trends for Dummies
  2. Heterogeneous Computing Trends for Dummies
  3. Parallel Software Development Trends for Dummies
