ArrayFire v3.9.0 Release

We are pleased to announce a new release of the ArrayFire library, v3.9.0. This release makes it easier than ever to target new devices without sacrificing performance. This post describes four of the new features:

New oneAPI backend
Multi-device support
Broadcasting support
Asynchronous reductions

oneAPI Backend

This is the first release since v3.0 to introduce a new backend. The new backend is built with the oneAPI specification on top of the SYCL language.

oneAPI is an open specification providing a full framework for high-performance computing applications without vendor lock-in. While this has been possible with OpenCL, the oneAPI specification also includes libraries such as BLAS and FFT, which significantly reduces the burden on developers to maintain math functions and improves the performance of these common operations.

Here is a chart comparing the OpenCL backend to the oneAPI backend when running ArrayFire’s af::matmul.

The oneAPI backend takes advantage of the SYCL accessor interface and out-of-order queues. This allows the SYCL runtime, and thus ArrayFire, to overlap the execution of multiple kernels on the device. For certain operations, this can increase utilization on larger devices.

The ArrayFire oneAPI backend fully supports BLAS, FFT, reductions, linear algebra, and all arithmetic functions. Like the other backends, it uses ArrayFire’s runtime JIT and memory manager. It also supports sparse functionality.

As we continue to add new features, we will add support for computer vision and filtering algorithms. Future releases will also target CUDA devices when using the oneAPI backend.

CUDA support through the CUDA backend is unaffected by the addition of the new oneAPI backend.

Improved multi-device support

ArrayFire has long supported multiple devices on a system. You could select which device you wanted to target using the af::setDevice function. With this release, we are greatly improving the usability of this feature.

With ArrayFire v3.9, you can directly access any array from any device. Here is an example of this in action:

    af::setDevice(0);
    af::array a = af::randu(10, 10);

    af::setDevice(1);
    af::array b = af::randu(10, 10);

    af::array c = af::matmul(a, b);
    af_print(c);

In the above code, you perform random number generation independently on each GPU. When the af::matmul operation is called, the a array is transferred onto the second device before the operation is performed.

This works in all backends. On OpenCL, however, the current iteration of this functionality requires the devices to share a context. We will enable support for devices across contexts in the next patch release.

Asynchronous reductions

Reductions found in commonly used functions, such as sum, min, max, any, all, and count, are asynchronous and no longer blocking. This enables reductions to execute without stalling the host thread.

Previously, reductions were performed by the templated version of the sum function. This function had to be synchronous because it needed to wait for the device to finish processing before returning a host-side value, as shown below:

  array arr = randu(10, 10);
  float value = sum<float>(arr);

Now, you can pass an af::array as the template parameter and the value will be returned asynchronously. This allows you to defer the synchronization until later so you can perform multiple reductions without stalling the host thread.

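  vector<array> arrays;
  for(int i = 0; i < 10; i++) {
    arrays.push_back(randu(10, 10));
  }

  // Queue one reduction per input array; none of these calls block the host.
  vector<array> values;
  for(int i = 0; i < 10; i++) {
    values.push_back(sum<af::array>(arrays[i]));
  }
  // Wait once for all queued reductions to finish.
  af::sync();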
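
To make the difference between the two styles concrete, here is a minimal host-only sketch in standard C++. It is not ArrayFire code and does not reflect how ArrayFire is implemented: ModelQueue, sum_blocking, and sum_deferred are hypothetical names invented for this illustration. The model simply runs all queued work when sync() is called, whereas a real device starts executing in the background right away. Only the number of host-side waits is being modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical host-only model of a device queue; not part of ArrayFire.
class ModelQueue {
public:
    // Record work to run later, the way a device queue accepts kernels.
    void enqueue(std::function<void()> task) {
        pending_.push_back(std::move(task));
    }

    // Run everything queued so far; stands in for a host-side wait.
    void sync() {
        for (auto& task : pending_) task();
        pending_.clear();
        ++sync_count_;
    }

    // Number of times the host has waited on this queue.
    int sync_count() const { return sync_count_; }

private:
    std::vector<std::function<void()>> pending_;
    int sync_count_ = 0;
};

// Blocking style: the caller wants a host float, so every call must wait.
float sum_blocking(ModelQueue& q, const std::vector<float>& data) {
    float result = 0.0f;
    q.enqueue([&] {
        result = std::accumulate(data.begin(), data.end(), 0.0f);
    });
    q.sync();
    return result;
}

// Deferred style: return a handle whose value is only valid after sync().
// The input vector must stay alive until the queue has been synced.
std::shared_ptr<float> sum_deferred(ModelQueue& q,
                                    const std::vector<float>& data) {
    auto result = std::make_shared<float>(0.0f);
    q.enqueue([result, &data] {
        *result = std::accumulate(data.begin(), data.end(), 0.0f);
    });
    return result;
}
```

In this model, reducing N inputs with sum_blocking waits N times, while N calls to sum_deferred followed by a single sync() wait once. That is the saving the asynchronous reductions are meant to provide.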

Automatic broadcast support

Finally, we added support for automatic broadcasting in many element-wise operations. This condenses code considerably by removing many of the tile operations required in previous versions. With this update, you can replace code that looks like this:

  array a = randu(10, 10);
  array b = randu(10);
  array c = tile(b, 1, 10) + a;

with code that looks like this:

  array a = randu(10, 10);
  array b = randu(10);
  array c = b + a;

ArrayFire will automatically tile the b array. We will make another blog post to dive deeper into this feature.
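
For readers who want to see what the broadcast computes element by element, here is a small sketch in plain standard C++. It is not ArrayFire code: broadcast_add is a hypothetical helper written for this illustration, and it assumes column-major storage, which is the layout ArrayFire arrays use. It adds a column vector of length rows to every column of a rows x cols matrix, which is the same result the explicit tile(b, 1, 10) + a produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ArrayFire: adds column vector b
// (length rows) to every column of a column-major rows x cols matrix a.
std::vector<float> broadcast_add(const std::vector<float>& a,
                                 std::size_t rows, std::size_t cols,
                                 const std::vector<float>& b) {
    assert(a.size() == rows * cols);
    assert(b.size() == rows);

    std::vector<float> c(a.size());
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            // b is indexed by row only, so the same values are reused for
            // every column j. No tiled copy of b is ever materialized.
            c[i + j * rows] = a[i + j * rows] + b[i];
        }
    }
    return c;
}
```

The point of the sketch is that the smaller operand is re-read along the missing dimension instead of being copied into a full-size temporary first.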

Improvements

This release comprises 502 commits with 64k additions and 15k deletions from 11 contributors. We continue improving our library’s performance and stability with each release.

We thank you and all of our users for your continued support.

You can find the latest binaries at https://arrayfire.com/downloads.
