ArrayFire v3.5 Official Release

Discover the powerful enhancements in ArrayFire v3.5, including thread-safety and a new Canny edge detector, designed to elevate your parallel computing experience. Unlock more features that can transform your applications today!

Umar Arshad

Jul 10, 2017

Today we are pleased to announce the release of ArrayFire v3.5, our open source library of parallel computing functions supporting CUDA, OpenCL, and CPU devices. This new version of ArrayFire improves features and performance for applications in machine learning, computer vision, signal processing, statistics, finance, and more.

This release focuses on thread-safety, support for simple sparse-dense arithmetic operations, a Canny edge detector function, and a genetic algorithm example. A complete list of ArrayFire v3.5 updates and new features can be found in the product Release Notes.

Thread Safety

ArrayFire is now thread-safe, so it can be used with multi-threaded programming models. This is not intended to improve performance, since most of the parallelism happens on the device, but it does make it easier to use multiple devices. Previously, you had to use a for loop to get this functionality. Normally this isn't a problem, but it becomes troublesome when you have to work with host-side transfers.

Let's suppose we want to perform a matrix multiply on multiple GPUs. Because most ArrayFire functions are asynchronous, this can be done in a for loop.

for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    array a = randu(1000, 1000);
    array b = constant(3, 1000, 1000);
    array c = matmul(a, b);
}

This works well most of the time, but it becomes an issue when you want to interact with host-side code in your application.

For example, this code attempts to perform a matrix multiply and then transfer the result back to the CPU to do additional work. The computation is executed on a different device in each iteration, but the devices will not run concurrently: the data transfer to the c_data array is a blocking operation, so work for the next device is not queued until the current device has finished.

for(int i = 0; i < nDevices; i++) {
    setDevice(i);
    array a = randu(1000, 1000);
    array b = constant(3, 1000, 1000);
    array c = matmul(a, b);

    // Blocks until the previous operation has completed
    vector<float> c_data(c.elements());
    c.host(c_data.data());

    // Perform some host-side work

    array data(1000, 1000, c_data.data());
    // more work
}

For this to work correctly, you would have had to split the loop so that the matmul operation is queued on every device first, and then the data is transferred to the CPU in a separate loop. If more work needs to be performed, you need a third loop to iterate through the devices again.

vector<array> a(nDevices);
vector<array> b(nDevices);
vector<array> c(nDevices);
// Queue the matrix multiply on every device first
for(int i = 0; i < nDevices; i++) {
    setDevice(i);
    a[i] = randu(1000, 1000);
    b[i] = constant(3, 1000, 1000);
    c[i] = matmul(a[i], b[i]);
}

// Then transfer each result to the host
vector<vector<float>> c_data(nDevices);
for(int i = 0; i < nDevices; i++) {
    setDevice(i);
    c_data[i] = vector<float>(c[i].elements());
    c[i].host(c_data[i].data());
}

// Perform some host-side work

for(int i = 0; i < nDevices; i++) {
    setDevice(i);
    array data(1000, 1000, c_data[i].data());
    // more work
}

This performs the operations on each of the GPUs in the machine as expected, but it is troublesome to program like this. With the thread-safety updates in 3.5, you can now execute the operations for each GPU in a separate thread.

vector<thread> threads;
for(int i = 0; i < nDevices; i++) {
  threads.emplace_back([=]() {
    setDevice(i);
    array a = randu(1000, 1000);
    array b = constant(3, 1000, 1000);
    array c = matmul(a, b);

    // Blocks until the previous operation has completed
    vector<float> c_data(c.elements());
    c.host(c_data.data());

    // Perform some host-side work

    array data(1000, 1000, c_data.data());
    // more work
  });
}

// Wait for every device thread to finish before the threads are destroyed
for(auto& t : threads) {
  t.join();
}

This is much easier to implement using threads than with the for loop approach.
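If the main thread needs the host-side results afterwards, one option is to give each thread its own output slot and join all threads before reading. The following is a minimal sketch that runs without ArrayFire or a GPU: device_work is a hypothetical stand-in for the per-device ArrayFire calls (setDevice, matmul, host), and run_on_all_devices is an illustrative name, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-device work (setDevice, matmul, host
// transfer). It only fills a buffer so the sketch runs without a GPU.
std::vector<float> device_work(int device, std::size_t n) {
    return std::vector<float>(n, static_cast<float>(device));
}

// Launch one host thread per device and collect each device's host-side result.
std::vector<std::vector<float>> run_on_all_devices(int nDevices, std::size_t n) {
    assert(nDevices >= 0);
    std::vector<std::vector<float>> results(nDevices);
    std::vector<std::thread> threads;
    threads.reserve(nDevices);
    for (int i = 0; i < nDevices; i++) {
        // Each thread writes only to results[i], so no locking is needed.
        threads.emplace_back([i, n, &results]() {
            results[i] = device_work(i, n);
        });
    }
    // A std::thread must be joined (or detached) before it is destroyed.
    for (auto& t : threads) {
        t.join();
    }
    return results;
}
```

Calling run_on_all_devices(nDevices, 1000 * 1000) would return one buffer per device once every thread has finished. In a real application the body of device_work would hold the ArrayFire code from the example above.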

Canny edge detector

We have added the Canny edge detection algorithm to ArrayFire. Canny is a multi-stage algorithm that can handle a wide variety of edges. The ArrayFire implementation can also calculate the thresholds automatically using Otsu's method.
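For readers unfamiliar with Otsu's method, here is a minimal plain C++ sketch of the idea: pick the threshold that maximizes the between-class variance of an intensity histogram. It is an illustration under simplifying assumptions (8-bit values, a single threshold) and is not ArrayFire's implementation; otsu_threshold is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the 8-bit intensity t that maximizes the between-class variance
// when values are split into the classes {<= t} and {> t} (Otsu's method).
int otsu_threshold(const std::vector<std::uint8_t>& pixels) {
    assert(!pixels.empty());

    // Histogram of the 256 possible intensities.
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sum_all = 0.0;
    for (int i = 0; i < 256; i++) sum_all += i * hist[i];

    double w0 = 0.0;    // number of values in the lower class
    double sum0 = 0.0;  // sum of intensities in the lower class
    double best_var = -1.0;
    int best_t = 0;
    for (int t = 0; t < 256; t++) {
        w0 += hist[t];
        sum0 += t * hist[t];
        if (w0 == 0.0) continue;  // lower class still empty
        const double w1 = total - w0;
        if (w1 == 0.0) break;     // upper class empty from here on
        const double mu0 = sum0 / w0;
        const double mu1 = (sum_all - sum0) / w1;
        const double var_between = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
        if (var_between > best_var) {
            best_var = var_between;
            best_t = t;
        }
    }
    return best_t;
}
```

For a clearly bimodal input, the returned threshold falls between the two clusters. In an edge detector, a threshold of this kind would typically be computed on gradient magnitudes rather than on raw pixel intensities.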

Sparse-Dense arithmetic

ArrayFire 3.4 introduced initial support for sparse arrays. With 3.5 we are expanding sparse array support to allow for sparse-dense arithmetic.
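To make the term concrete, the sketch below shows element-wise sparse-dense arithmetic on a matrix stored in CSR (compressed sparse row) form. It is a plain C++ illustration, not ArrayFire code: CsrMatrix, mul_sparse_dense, and add_sparse_dense are hypothetical names, and the dense operand is assumed to be stored row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR (compressed sparse row) matrix; names are illustrative only.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> values;         // non-zero values, row by row
    std::vector<std::size_t> col_idx;  // column index of each non-zero
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into values
};

// Element-wise sparse * dense. Zero entries stay zero, so the result keeps
// the sparsity pattern of the sparse operand.
CsrMatrix mul_sparse_dense(const CsrMatrix& s, const std::vector<float>& d) {
    assert(d.size() == s.rows * s.cols);
    CsrMatrix out = s;
    for (std::size_t r = 0; r < s.rows; r++) {
        for (std::size_t k = s.row_ptr[r]; k < s.row_ptr[r + 1]; k++) {
            out.values[k] = s.values[k] * d[r * s.cols + s.col_idx[k]];
        }
    }
    return out;
}

// Element-wise sparse + dense. The result is dense in general.
std::vector<float> add_sparse_dense(const CsrMatrix& s, const std::vector<float>& d) {
    assert(d.size() == s.rows * s.cols);
    std::vector<float> out = d;
    for (std::size_t r = 0; r < s.rows; r++) {
        for (std::size_t k = s.row_ptr[r]; k < s.row_ptr[r + 1]; k++) {
            out[r * s.cols + s.col_idx[k]] += s.values[k];
        }
    }
    return out;
}
```

Both functions touch only the stored non-zeros of the sparse operand, which is the main benefit of mixing the two representations instead of converting the sparse matrix to dense first.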

Support for CLBlast

With ArrayFire 3.5 we are introducing support for a new OpenCL BLAS implementation called CLBlast, developed by Cedric Nugteren. It has improved performance on NVIDIA hardware and features an autotuner to support devices that have not been explicitly optimized. You can read more about CLBlast at the GitHub repository here.

Improved CUDA JIT engine

The CUDA JIT engine has been refactored to use the NVRTC library. Previously, the JIT functionality was implemented using the NVVM library, which made it difficult to employ more advanced control flow. With NVRTC we can create more advanced JIT functions to improve performance.

Future Releases

As always we are working on improving the performance of all of our functions. This has been another exciting update to ArrayFire and we have great plans for the next release. Stay tuned!

Download

ArrayFire v3.5 can be downloaded from these locations:

Community

ArrayFire is continually improving through the addition of new functions and features. We welcome your feedback:

General discussion forums on the ArrayFire Google Group
Live discussion chat on the ArrayFire Gitter
Issue reports on the ArrayFire GitHub

As you find success with ArrayFire, we invite you to contribute a post to this blog to share with the broader community. Contact us to contribute to this blog.

Dedicated Support and Coding Services

ArrayFire offers dedicated support packages for ArrayFire users.

ArrayFire serves many clients through consulting and coding services, algorithm development, porting code, and training courses for developers. Schedule a free 15-minute technical consultation to learn more about our consulting and coding services.