We are pleased to announce a new release of the ArrayFire library, v3.9.0. This release makes it easier than ever to target new devices without sacrificing performance. This post describes four of its new features:
- New oneAPI backend
- Multi-device support
- Asynchronous reductions
- Broadcasting support
oneAPI Backend
This is the first release since v3.0 to introduce a new backend. The new backend is built with the oneAPI specification on top of the SYCL language.
oneAPI is an open specification providing a full framework for high-performance computing applications without vendor lock-in. While this has been possible with OpenCL, the oneAPI specification also includes libraries such as BLAS and FFT, which significantly reduces the burden on developers of maintaining math functions and improves the performance of these common operations.
Here is a chart comparing the OpenCL backend to the oneAPI backend when running ArrayFire’s af::matmul.

The oneAPI backend takes advantage of the SYCL accessor interface and out-of-order queues. This allows the SYCL runtime, and thus ArrayFire, to overlap the execution of multiple kernels on the device, increasing utilization for certain operations on larger devices.
The ArrayFire oneAPI backend fully supports BLAS, FFT, reductions, linear algebra, and all arithmetic functions. Like the other backends, it uses ArrayFire’s runtime JIT and memory manager. It also supports sparse functionality.
As we continue to add new features, we will add support for computer vision and filtering algorithms. Future releases will also target CUDA devices when using the oneAPI backend.
CUDA support with the CUDA backend remains unaffected by adding the new oneAPI backend.
Improved multi-device support
ArrayFire has long supported multiple devices on a system. You could select which device you wanted to target using the af::setDevice function. With this release, we are greatly improving the usability of this feature.
With ArrayFire v3.9, you can directly access any array from any device. Here is an example of this in action:
af::setDevice(0);
af::array a = af::randu(10, 10);
af::setDevice(1);
af::array b = af::randu(10, 10);
af::array c = af::matmul(a, b);
af_print(c);
In the above code, you perform random number generation independently on each GPU. When the af::matmul operation is called, the a array is transferred onto the second device before the operation is performed.
This works in all backends, but in the current iteration of this functionality, OpenCL devices need to share a context. We will enable transfers between devices in different contexts in the next patch release.
Asynchronous reductions
Reductions found in commonly used functions, such as sum, min, max, any, all, and count, are asynchronous and no longer blocking. This enables reductions to execute without stalling the host thread.
Previously, reductions were performed by the templated version of the sum function. This function had to be synchronous because it needed to wait for the device to finish processing before returning a host-side value, as shown below:
array arr = randu(10, 10);
float value = sum<float>(arr);
Now, you can pass an af::array as the template parameter and the value will be returned asynchronously. This allows you to defer the synchronization until later so you can perform multiple reductions without stalling the host thread.
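vector<array> arrays;
for (int i = 0; i < 10; i++) {
    arrays.push_back(randu(10, 10));
}
vector<array> values;
for (int i = 0; i < 10; i++) {
    // Queue each reduction; the result is returned as an af::array
    values.push_back(sum<af::array>(arrays[i]));
}
// Block once, after all reductions have been queued
af::sync();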
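The host-side shape of this pattern (launch every reduction first, block only once at the end) can be illustrated without a GPU. The following is a minimal sketch using only the C++ standard library, with std::async standing in for the device queue. The names sumAsync, sumAll, and sumAllExample are hypothetical and are not part of the ArrayFire API; the sketch models the control flow only, not how ArrayFire schedules kernels.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device-side reduction: sums a buffer on
// another thread and returns a handle to the pending result.
std::future<float> sumAsync(const std::vector<float>& data) {
    return std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0.0f);
    });
}

// Launch every reduction first, then collect the results afterwards.
// The caller must keep `buffers` alive until this function returns.
std::vector<float> sumAll(const std::vector<std::vector<float>>& buffers) {
    std::vector<std::future<float>> pending;
    pending.reserve(buffers.size());
    for (const auto& buf : buffers) {
        pending.push_back(sumAsync(buf));  // does not wait for the sum
    }

    std::vector<float> results;
    results.reserve(pending.size());
    for (auto& f : pending) {
        results.push_back(f.get());  // blocking happens only here
    }
    return results;
}

// Usage: three small buffers reduced concurrently.
inline void sumAllExample() {
    const std::vector<std::vector<float>> buffers = {{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f}, {}};
    const std::vector<float> sums = sumAll(buffers);
    assert(sums.size() == 3);
    assert(sums[0] == 6.0f);
}
```

In this analogy, collecting the futures plays the role that af::sync() plays above: the host thread stalls once, after all the work has been queued, rather than once per reduction.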
Automatic broadcast support
Finally, we added support for automatic broadcasting in many element-wise operations. This condenses code considerably by removing many of the tile operations that previous versions required. With this update, you can replace code that looks like this:
array a = randu(10, 10);
array b = randu(10);
array c = tile(b, 1, 10) + a;
with code that looks like this:
array a = randu(10, 10);
array b = randu(10);
array c = b + a;
ArrayFire will automatically tile the b array. We will make another blog post to dive deeper into this feature.
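To make the shape arithmetic concrete, here is a minimal sketch of a broadcast shape check written in plain C++. It assumes the common rule that, for each dimension, the two sizes must either match or one of them must be 1; this is an assumption for illustration and may not cover every case ArrayFire handles. Shape, broadcastShape, and broadcastExample are hypothetical names, not ArrayFire API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical helper type: up to four dimensions, similar in spirit to af::dim4.
using Shape = std::array<std::size_t, 4>;

// Returns the shape of the broadcast result, or std::nullopt when the shapes
// are incompatible. Assumed rule: per dimension, the sizes must be equal or
// one of them must be 1 (the size-1 side is stretched to match).
std::optional<Shape> broadcastShape(const Shape& lhs, const Shape& rhs) {
    Shape out{};
    for (std::size_t d = 0; d < out.size(); ++d) {
        if (lhs[d] == rhs[d]) {
            out[d] = lhs[d];
        } else if (lhs[d] == 1) {
            out[d] = rhs[d];
        } else if (rhs[d] == 1) {
            out[d] = lhs[d];
        } else {
            return std::nullopt;  // e.g. 3 vs 10 cannot be broadcast
        }
    }
    return out;
}

// Usage: the shapes from the example above, where b is [10 1 1 1]
// and a is [10 10 1 1].
inline void broadcastExample() {
    const auto shape = broadcastShape(Shape{10, 1, 1, 1}, Shape{10, 10, 1, 1});
    assert(shape.has_value());
    assert((*shape == Shape{10, 10, 1, 1}));
}
```

Under this rule, b is stretched along its second dimension to match a, which is the effect tile(b, 1, 10) produced explicitly in the older code.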
Improvements
This release comprises 502 commits with 64k additions and 15k deletions from 11 contributors. We continue improving our library’s performance and stability with each release.
We thank you and all of our users for your continued support.
You can find the latest binaries at https://arrayfire.com/downloads.