Today we are pleased to announce the release of ArrayFire v3.5, our open source library of parallel computing functions supporting CUDA, OpenCL, and CPU devices. This new version of ArrayFire improves features and performance for applications in machine learning, computer vision, signal processing, statistics, finance, and more. This release focuses on thread-safety, support for simple sparse-dense arithmetic operations, canny edge detector function, and a genetic algorithm example. A complete list of ArrayFire v3.5 updates and new features are found in the product Release Notes. Thread Safety ArrayFire now supports threading programming models. This is not intended to improve the performance since most of the parallelism is happening on the device, but it does allow you to use multiple devices in ...

## ArrayFire - CUDA Interoperability

Although ArrayFire is quite extensive, there remain many cases in which you may want to write custom kernels in CUDA or OpenCL. For example, you may wish to add ArrayFire to an existing code base to increase your productivity, or you may need to supplement ArrayFire's functionality with your own custom implementation of specific algorithms. Today's "Learning ArrayFire from scratch", blog post discusses how you can interface ArrayFire and CUDA.

## Benchmarking parallel vector libraries

There are many open source libraries that implement parallel versions of the algorithms in the C++ standard template libraries. Inevitably we get asked questions about how ArrayFire compares to the other libraries out in the open. In this post we are going to compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks include the following commonly used vector algorithms across 3 different architectures. Reductions Scan Transform The following setup has been used for the benchmarking purposes. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is listed below: NVIDIA Tesla K20 AMD FirePro S10000 Intel Xeon E5-2560v2 Background ArrayFire ArrayFire provides high ...

## ArrayFire v3.0 is here!

Today we are pleased to announce the release of ArrayFire v3.0. This new version features major changes to ArrayFire’s visualization library, a new CPU backend, and dense linear algebra for OpenCL devices. It also includes improvements across the board for ArrayFire’s OpenCL backend. A complete list ArrayFire v3.0 updates and new features can be found in the product Release Notes. With over 8 years of continuous development, the open source ArrayFire library is the top CUDA and OpenCL software library. ArrayFire supports CUDA-capable GPUs, OpenCL devices, and other accelerators. With its easy-to-use API, this hardware-neutral software library is designed for maximum speed without the hassle of writing time-consuming CUDA and OpenCL device code. With ArrayFire’s library functions, developers can maximize ...

## GTC 2015 ArrayFire Recordings

Missed visiting ArrayFire at GTC this year? We've got you covered! You can now check out the recordings of all our GTC 2015 talks and tutorials at your own convenience. Learn about accelerating your code from the best in the business. Talks Real-Time and High Resolution Feature Tracking and Object Recognition Peter Andreas Entschev This session will cover real-time feature tracking and object recognition in high resolution videos using GPUs and productive software libraries including ArrayFire. Feature tracking and object recognition are computer vision problems that have challenged researchers for decades. Over the last 15 years, numerous approaches were proposed to solve these problems, some of the most important being SIFT, SURF and ORB. Traditionally, these approaches are so computationally ...

## Machine Learning with ArrayFire: Linear Classifiers

Linear classifiers perform classification based on the linear combinition of the component features. Some examples of Linear Classifiers include: Naive Bayes Classifier, Linear Discriminant Analysis, Logistic Regression and Perceptrons. ArrayFire's easy to use API enables users to write such classifiers from scratch fairly easily. In this post, we show how you can map mathematical equations to ArrayFire code and implement them from scratch. Naive Bayes Classifier Perceptron Naive Bayes Classifier Naive bayes classifier is a probabilistic classifier that assumes all the features in a feature vector are independent of each other. This assumption simplifies the bayes rule to a simple multiplication of probabilities as show below. First we start with the simple Baye's rule. Where: Since p(x) is the same ...

## Conway's Game of Life using ArrayFire

Conway's Game of Life is a popular zero player cellular automaton devised by the John Horton Conway in 1970. The game makes for a fun evolution as the player sets the initial condition and then observes the evolution of the game. Each cell has 2 states: live or dead. There are 4 simple rules that determine this: Any live cell with fewer than two live neighbours dies, as if caused by under-population. Any live cell with two or three live neighbours lives on to the next generation. Any live cell with more than three live neighbours dies, as if by overcrowding. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction. From a programmer's ...

## Triangle Counting in Graphs on the GPU (Part 2)

A while back I wrote a blog on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways for finding triangles in a graph. Our approach is based off the intersection of two adjacency lists and finding the common elements in both those lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally ...

## Demystifying PTX Code

In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address the issue of how a PTX file look, and more importantly, how to understand all those complicated instructions in a PTX files. In this post I will use the same vector addition kernel from the the previous post previous post (the complete code can be found here). For this post, I will focus on OpenCL PTX file. In a future post I will discuss the differences between PTX files of OpenCL and CUDA code. Let's start by looking at the complete PTX code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
// // Generated by NVIDIA NVVM Compiler // Compiler built on Sun May 18 04:44:51 2014 (1400399091) // Driver 331.79 // .version 3.0 .target sm_21, texmode_independent .address_size 32 .entry add_vectors( .param .u32 .ptr .global .align 4 add_vectors_param_0, .param .u32 .ptr .global .align 4 add_vectors_param_1, .param .u32 .ptr .global .align 4 add_vectors_param_2, .param .u32 add_vectors_param_3 ) { .reg .pred %p<2>; .reg .s32 %r<21>; ld.param.u32 %r9, [add_vectors_param_3]; mov.u32 %r5, %envreg3; mov.u32 %r6, %ntid.x; mov.u32 %r7, %ctaid.x; mov.u32 %r8, %tid.x; add.s32 %r10, %r8, %r5; mad.lo.s32 %r4, %r7, %r6, %r10; setp.lt.s32 %p1, %r4, %r9; @%p1 bra BB0_2; ret; BB0_2: shl.b32 %r11, %r4, 2; ld.param.u32 %r18, [add_vectors_param_0]; add.s32 %r12, %r18, %r11; ld.param.u32 %r19, [add_vectors_param_1]; add.s32 %r13, %r19, %r11; ld.global.u32 %r14, [%r13]; ld.global.u32 %r15, [%r12]; add.s32 %r16, %r14, %r15; ld.param.u32 %r20, [add_vectors_param_2]; add.s32 %r17, %r20, %r11; st.global.u32 [%r17], %r16; ret; } |

The file starts with a header showing some compiler information in comments, followed ...

## Generating PTX files from OpenCL code

Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Therefore, it is not uncommon that CUDA code on NVIDIA GPUs will run faster than OpenCL. A very good way to understand what is behind the curtains is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels. PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than a single thread assembly code, except that now we are thinking in massive parallel execution. Retrieving the PTX ...