Benchmarking parallel vector libraries

By Pavan Yalamanchili

There are many open source libraries that implement parallel versions of the algorithms in the C++ standard template library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open.

In this post we are going to compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover three commonly used vector algorithms (reduce, scan and transform) across three different architectures.

The following setup was used for the benchmarks. The code to reproduce them is linked at the bottom of the post. The hardware used for the benchmarks is listed below:

  • NVIDIA Tesla K20
  • AMD FirePro S10000
  • Intel Xeon E5-2560v2

Background

ArrayFire

ArrayFire provides high performance primitives across a wide spectrum of domains. Objects of the base class, array, can support up to 4 dimensions. It enables users to take advantage of parallelism at multiple levels by writing vectorized code.

ArrayFire supports NVIDIA GPUs using CUDA and OpenCL, AMD GPUs using OpenCL and Intel / AMD CPUs in serial and parallel (using OpenCL) modes. For this post, ArrayFire 3.0 has been used. You can download ArrayFire from the following locations.

BoostCompute

Boost Compute is a C++ template library for parallel computing based on OpenCL. Boost Compute provides accelerated versions of the algorithms present in the standard template library. It also provides an alternative C++ wrapper over common OpenCL C routines.

Boost Compute supports NVIDIA GPUs, AMD GPUs and Intel / AMD CPUs using OpenCL. You can download BoostCompute from the following locations.

ArrayFire uses BoostCompute for sorting, set operations and finding regions in an image in our OpenCL backend.

HSA-Bolt

Bolt is a C++ template library optimized for heterogeneous computing. It supports routines built on top of OpenCL, C++ AMP, and TBB. It provides high performance implementations of common algorithms from the Standard Template Library.

You can download the source to HSA-Bolt from the following locations.

Thrust

Thrust is a parallel algorithms library providing a high level C++ interface on top of CUDA, OpenMP and TBB. It resembles the C++ standard template library and provides accelerated versions of commonly used algorithms.

Thrust supports NVIDIA GPUs using CUDA and Intel / AMD CPUs using OpenMP and TBB. You can download Thrust from the following locations.

ArrayFire uses Thrust for sorting, set operations and finding regions in an image in our CUDA backend.

Intel TBB

Threading Building Blocks (TBB) is a C++ template library developed by Intel for multi-core processors. It provides platform independent data structures and algorithms that decrease the complexity of working with multiple threads.

TBB supports multi-core processors (including the Xeon Phi). You can download TBB from the following locations.

Reductions

The reduction algorithm collapses a vector into a single element by successively combining two input elements into a single output element. The most commonly used operations are addition, minimum and maximum.

Reduction is one of the most commonly used primitives in scientific and engineering applications, most notably in the fields of statistics and machine learning.



AMD FirePro S10000

  • For sizes less than 10 million elements, ArrayFire is the fastest, followed by HSA-Bolt and BoostCompute
  • For sizes approaching 100 million elements, HSA-Bolt is 10% faster than ArrayFire, and 15% faster than BoostCompute

NVIDIA Tesla K20

  • ArrayFire consistently performs better than other libraries for sizes less than 10 million elements
  • ArrayFire OpenCL performs better than ArrayFire CUDA for sizes greater than 1 million elements.
  • BoostCompute consistently performs better than Thrust
  • For sizes approaching 100 million elements, ArrayFire OpenCL is fastest followed by BoostCompute, Thrust and ArrayFire CUDA

Intel Xeon E5-2560v2

  • Intel TBB is significantly faster than ArrayFire and BoostCompute for sizes less than 1 million elements
  • ArrayFire is about 20% to 30% slower for sizes greater than 1 million elements
  • BoostCompute is about 40% to 70% slower for sizes greater than 1 million elements

Scan

The scan algorithm computes a running (prefix) result over a vector. It is useful when performing conditional select operations, and wherever cumulative distributions are needed.



AMD FirePro S10000

  • ArrayFire consistently performs better than other libraries
  • HSA-Bolt consistently performs better than BoostCompute

NVIDIA Tesla K20

  • ArrayFire performs better than other libraries for sizes less than 1 million elements
  • ArrayFire OpenCL performs better than ArrayFire CUDA for sizes greater than 100000 elements
  • Thrust consistently performs better than BoostCompute
  • For sizes greater than 1 million elements, Thrust is 16% faster than ArrayFire OpenCL, 22% faster than ArrayFire CUDA

Intel Xeon E5-2560v2

  • Intel TBB is significantly faster than ArrayFire and BoostCompute for all sizes

Transform

The transform operation performs the same operation independently on each element of the input array. For this benchmark, the commonly used saxpy function has been used.



AMD FirePro S10000

  • ArrayFire consistently performs better than other libraries
  • BoostCompute consistently performs better than HSA-Bolt

NVIDIA Tesla K20

  • For sizes less than 1 million elements, Thrust is the fastest, followed by ArrayFire CUDA, ArrayFire OpenCL and BoostCompute
  • For sizes greater than 1 million elements, BoostCompute is fastest followed by ArrayFire OpenCL, Thrust and ArrayFire CUDA

Intel Xeon E5-2560v2

  • For sizes less than 1 million elements, TBB is significantly faster than BoostCompute and ArrayFire
  • For sizes larger than 1 million elements, all three libraries perform within 5% of each other

Conclusion

On AMD and NVIDIA GPUs, ArrayFire is the fastest library for vectors on the order of 10 million elements. At smaller sizes, ArrayFire's memory manager amortizes the memory allocation and deallocation time, making it significantly faster than the alternatives. While improvements can be made overall, one particular area for improvement is the scan algorithm at large sizes.

On Intel Xeons, TBB is significantly faster than OpenCL implementations. Improvements can and should be made in ArrayFire to extract the best performance out of multicore CPUs. Our roadmap for the next quarter includes a native parallel implementation for our CPU backend. This should bridge the performance gap seen in these benchmarks.

Each library performed better than the others at some point in the benchmarks. Considering that all the libraries are open source, these results should provide a motivation for these libraries to learn from each other.
