Today we are pleased to announce the release of ArrayFire v3.5, our open source
library of parallel computing functions supporting CUDA, OpenCL, and CPU
devices. This new version of ArrayFire improves features and performance for
applications in machine learning, computer vision, signal processing,
statistics, finance, and more.
This release focuses on thread safety, support for simple sparse-dense
arithmetic operations, a Canny edge detector function, and a genetic algorithm
example. A complete list of ArrayFire v3.5 updates and new features can be found
in the product Release Notes.
Thread Safety
ArrayFire can now be used safely from multiple threads. This is not intended to
improve performance, since most of the parallelism happens on the device, but it
does make it much easier to work with multiple devices.
Previously, you had to use a for loop over devices to get this functionality.
Normally this isn't a problem, but it becomes troublesome once host-side
transfers are involved.
Let's suppose we want to perform a matrix multiply on multiple GPUs. Because
most ArrayFire functions are asynchronous, this can be done in a for loop:
for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    array a = randu(1000, 1000);
    array b = constant(3, 1000, 1000);
    array c = matmul(a, b);
}
This works well most of the time, but it becomes an issue when you want to
interact with host-side code in your application.
For example, the following code attempts to perform a matrix multiply and then
transfer the result back to the CPU to do additional work. The ArrayFire code
executes the computation on different devices, but the iterations will not be
queued concurrently because the data transfer into the c_data vector is a
blocking operation.
for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    array a = randu(1000, 1000);
    array b = constant(3, 1000, 1000);
    array c = matmul(a, b);

    // Blocks until the previous operation has completed
    vector<float> c_data(c.elements());
    c.host(c_data.data());

    // Perform some host side work
    array data(1000, 1000, c_data.data());
    // more work
}
To keep all devices busy, you would have had to split the loop so that the
matmul operations are queued first, and then transfer the data to the CPU in a
separate loop. If more work needs to be performed, you need a third loop to
iterate through the devices again.
vector<array> a(nDevices);
vector<array> b(nDevices);
vector<array> c(nDevices);

for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    a[i] = randu(1000, 1000);
    b[i] = constant(3, 1000, 1000);
    c[i] = matmul(a[i], b[i]);
}

vector<vector<float>> c_data(nDevices);
for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    c_data[i] = vector<float>(c[i].elements());
    c[i].host(c_data[i].data());
}

// Perform some host side work
for (int i = 0; i < nDevices; i++) {
    setDevice(i);
    array data(1000, 1000, c_data[i].data());
    // more work
}
This performs the operations on each of the GPUs on the machine as expected,
but it is troublesome to program this way. With the thread-safety updates in
3.5, you can now run the operations for each GPU in a separate thread.
vector<thread> threads;
for (int i = 0; i < nDevices; i++) {
    threads.emplace_back([=]() {
        setDevice(i);
        array a = randu(1000, 1000);
        array b = constant(3, 1000, 1000);
        array c = matmul(a, b);

        // Blocks until the previous operation has completed
        vector<float> c_data(c.elements());
        c.host(c_data.data());

        // Perform some host side work
        array data(1000, 1000, c_data.data());
        // more work
    });
}

// Wait for all device threads to finish
for (auto &t : threads) t.join();
This is much easier to implement using threads than with the multiple-loop approach.
Canny edge detector
We have added the Canny edge detection algorithm to ArrayFire. Canny is a
multi-stage algorithm that can handle a wide variety of edges. The ArrayFire
implementation can automatically calculate the thresholds using Otsu's method.
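As a minimal sketch of how the new function can be called (the image path is a
placeholder, and the signature shown follows the 3.5 documentation with the
Otsu-based automatic threshold option):

#include <arrayfire.h>
using namespace af;

int main() {
    // Load a grayscale image; "input.jpg" is a placeholder path
    array img = loadImage("input.jpg", false);

    // Canny with the low/high thresholds computed automatically via Otsu's method
    array edges = canny(img, AF_CANNY_THRESHOLD_AUTO_OTSU);

    // The result is a binary edge map; convert to float before saving
    saveImage("edges.png", edges.as(f32));
    return 0;
}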
Sparse-Dense arithmetic
ArrayFire 3.4 introduced initial support for sparse arrays. With 3.5 we are
expanding sparse array support to allow sparse-dense arithmetic operations.
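As a rough illustration, the sketch below converts a mostly-zero matrix to
sparse (CSR) storage and mixes it with a dense array. The exact result types of
the mixed element-wise operations are an assumption here; consult the release
notes for the supported combinations.

#include <arrayfire.h>
using namespace af;

int main() {
    // A dense matrix and a mostly-zero matrix converted to CSR sparse storage
    array dense = randu(1000, 1000);
    array mostlyZero = (randu(1000, 1000) > 0.95).as(f32) * randu(1000, 1000);
    array sp = sparse(mostlyZero);

    // New in 3.5: element-wise arithmetic mixing sparse and dense operands
    // (assumed here to come back as a dense result)
    array sum = sp + dense;

    // Sparse-dense matrix multiply was already available in 3.4
    array prod = matmul(sp, dense);

    af_print(sum(seq(3), seq(3)));
    return 0;
}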
Support for CLBlast
With ArrayFire 3.5 we are introducing support for CLBlast, a new OpenCL BLAS
implementation developed by Cedric Nugteren. It has improved performance on
NVIDIA hardware and features an autotuner to support devices that have not been
explicitly optimized for. You can read more about CLBlast at the GitHub
repository here.
Improved CUDA JIT engine
The CUDA JIT engine has been refactored to use the NVRTC library. Previously,
the JIT functionality was implemented using the NVVM library, which made it
difficult to employ more advanced control flow. With NVRTC we can create more
advanced JIT functions to improve performance.
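For context, the JIT engine is what lazily fuses chains of element-wise
operations into a single kernel. The small sketch below shows the kind of
expression that benefits, with eval() forcing the fused kernel to be generated
and executed; the NVRTC refactor is internal and does not change this
user-facing behavior.

#include <arrayfire.h>
using namespace af;

int main() {
    array a = randu(10000);
    array b = randu(10000);

    // These element-wise operations are recorded lazily; the JIT engine
    // fuses them into a single kernel rather than launching one per operation
    array c = a * b + sin(a) - 0.5f;

    // eval() forces the fused kernel to be compiled and run
    c.eval();

    af_print(c(seq(5)));
    return 0;
}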
Future Releases
As always we are working on improving the performance of all of our functions. This has been another exciting update to ArrayFire and we have great plans for the next release. Stay tuned!
Download
ArrayFire v3.5 can be downloaded from these locations:
Community
ArrayFire is continually improving through the addition of new functions and features. We welcome your feedback:
- General discussion forums on the ArrayFire Google Group
- Live discussion chat on the ArrayFire Gitter
- Issue reports on the ArrayFire GitHub
As you find success with ArrayFire, we invite you to contribute a post to this blog to share with the broader community. Email sales@arrayfire.com to contribute to this blog.
Dedicated Support and Coding Services
ArrayFire offers dedicated support packages for ArrayFire users.
ArrayFire serves many clients through consulting and coding services, algorithm development, porting code, and training courses for developers. Contact us at sales@arrayfire.com or schedule a free technical consultation to learn more about our consulting and coding services.
Comments 1
It looks better and better each cycle.
Can you show Non Local Means implementation using ArrayFire (Not a built in function but how would you implement it using other blocks)?
Thank You.