Performance Improvements to JIT in ArrayFire v3.4

Pavan Announcements, ArrayFire, Benchmarks

ArrayFire uses Just In Time (JIT) compilation to combine many lightweight functions into a single kernel launch. Along with our easy-to-use API, this allows users not only to prototype their algorithms quickly, but also to get the best out of the underlying hardware. This feature has been a favorite among our users in the domains of finance and scientific simulation. That said, ArrayFire v3.3 and earlier had a few limitations. Namely: (1) multiple outputs with inter-dependent variables generated multiple kernels, and (2) the number of operations per kernel was fairly limited by default. In the latest release of ArrayFire, we addressed these issues to get some pretty impressive numbers. In the rest of the post, we demonstrate the performance improvements using our BlackScholes ...

Benchmarking parallel vector libraries

Pavan ArrayFire, Benchmarks, C/C++, CUDA

There are many open source libraries that implement parallel versions of the algorithms in the C++ Standard Template Library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open. In this post, we compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover the following commonly used vector algorithms across 3 different architectures: reductions, scan, and transform. The following setup was used for the benchmarks. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is listed below: NVIDIA Tesla K20, AMD FirePro S10000, Intel Xeon E5-2560v2.

Background

ArrayFire

ArrayFire provides high ...
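For readers unfamiliar with the three algorithm families named in this excerpt, the sketch below shows their sequential C++ standard library counterparts. It is only an illustration of what a reduction, a scan, and a transform compute; the function names are hypothetical and this is not the benchmark code linked from the post.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Reduction: collapse a vector to a single value (here, a sum).
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Inclusive scan (prefix sum): out[i] = v[0] + ... + v[i].
std::vector<int> inclusive_scan_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Transform: apply a function to each element independently (here, squaring).
std::vector<int> transform_square(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::transform(v.begin(), v.end(), out.begin(),
                   [](int x) { return x * x; });
    return out;
}
```

The parallel libraries compared in the post offer equivalents of these operations that run on GPUs or multi-core CPUs.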

Performance of ArrayFire JIT Code Generation

Oded ArrayFire, Benchmarks, Case Studies, Infrastructure 3 Comments

The ArrayFire library offers JIT (Just In Time) compilation for standard arithmetic operations. This includes trigonometric functions, comparisons, and element-wise operations. At run time, ArrayFire aggregates these function calls using an Abstract Syntax Tree (AST) data structure: whenever a JIT-supported function is encountered, it is added to the AST of the given variable instance. The AST of a variable is computed when one of the following conditions is met: the programmer explicitly requests evaluation using the eval() member function, or the variable is required for the computation of a different variable by a function that is not JIT-supported. When the above occurs and the variable needs to be evaluated, the functions and variables in the AST ...
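The deferred-evaluation idea described in this excerpt can be sketched with a toy class. This is a simplified model under stated assumptions, not ArrayFire's implementation: `LazyArray` and `lazy_sin` are hypothetical names, the "tree" is stored as nested closures rather than an AST, and no code is generated or compiled. It only shows that element-wise operations can be recorded first and computed later in a single pass when `eval()` is called.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy model of deferred element-wise evaluation (hypothetical, not ArrayFire).
class LazyArray {
public:
    // Leaf node: wraps concrete data.
    explicit LazyArray(std::vector<double> data) : size_(data.size()) {
        auto buf = std::make_shared<std::vector<double>>(std::move(data));
        elem_ = [buf](std::size_t i) { return (*buf)[i]; };
    }

    // Element-wise operations only extend the expression; nothing is computed.
    friend LazyArray operator+(const LazyArray& a, const LazyArray& b) {
        assert(a.size_ == b.size_);
        auto fa = a.elem_, fb = b.elem_;
        return LazyArray(a.size_,
                         [fa, fb](std::size_t i) { return fa(i) + fb(i); });
    }
    friend LazyArray operator*(const LazyArray& a, const LazyArray& b) {
        assert(a.size_ == b.size_);
        auto fa = a.elem_, fb = b.elem_;
        return LazyArray(a.size_,
                         [fa, fb](std::size_t i) { return fa(i) * fb(i); });
    }
    friend LazyArray lazy_sin(const LazyArray& a) {
        auto fa = a.elem_;
        return LazyArray(a.size_,
                         [fa](std::size_t i) { return std::sin(fa(i)); });
    }

    // Explicit evaluation: one fused loop over the whole recorded expression.
    std::vector<double> eval() const {
        std::vector<double> out(size_);
        for (std::size_t i = 0; i < size_; ++i) out[i] = elem_(i);
        return out;
    }

private:
    LazyArray(std::size_t n, std::function<double(std::size_t)> f)
        : size_(n), elem_(std::move(f)) {}

    std::size_t size_;
    std::function<double(std::size_t)> elem_;
};
```

With this sketch, `(a * a + b).eval()` walks the data once instead of materializing a temporary for `a * a`, which is the benefit the excerpt attributes to combining operations before evaluation.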

ArrayFire Benchmarks: AMD Kaveri vs Intel Haswell Part-1

Pradeep ArrayFire, Benchmarks 10 Comments

We have had queries in the past requesting benchmarks on the integrated GPUs of Intel and AMD processors. This post is a modest attempt to answer those questions. In this post, we focus on the GPU benchmarks of AMD A10-7850K APU & Intel i7-4790K HD 4600 for the following ArrayFire functions: Bilateral Filter, Erosion/Dilation, 2D Convolution, 2D Fast Fourier Transform, Median Filter, Resize, Rotate, Scan 1D Array, Reduction of 1D Array, Sort, Matrix Transpose.

Remarks

For most of the benchmarks, the Intel system was outperformed by the AMD APU. We believe that we will be able to get more performance from the Intel system by modifying the kernels to use vector operations, which will increase resource utilization. Keep an eye ...

Triangle Counting in Graphs on the GPU (Part 2)

Oded Benchmarks, CUDA 2 Comments

A while back I wrote a blog post on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover the internals of the algorithm and the CUDA implementation in more detail. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways of finding triangles in a graph. Our approach is based on intersecting two adjacency lists and finding the elements common to both lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally ...
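The adjacency-list intersection approach mentioned in this excerpt can be sketched sequentially as follows. This is a minimal CPU illustration, not the CUDA implementation from the post; the function names are hypothetical, and the sketch assumes an undirected graph with no self-loops and no duplicate edges.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Count the elements common to two sorted lists with a two-pointer merge.
std::size_t count_common(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = 0, j = 0, common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++common;
            ++i;
            ++j;
        }
    }
    return common;
}

// Count triangles in an undirected simple graph with vertices 0..n-1.
std::size_t count_triangles(int n,
                            const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(static_cast<std::size_t>(n));
    for (const auto& e : edges) {
        adj[e.first].push_back(e.second);
        adj[e.second].push_back(e.first);
    }
    for (auto& list : adj) std::sort(list.begin(), list.end());

    // A common neighbor w of the endpoints of edge (u, v) closes a triangle.
    std::size_t total = 0;
    for (const auto& e : edges) {
        total += count_common(adj[e.first], adj[e.second]);
    }
    return total / 3;  // each triangle is found once for each of its 3 edges
}
```

Each edge's intersection is independent of the others, which is what makes this formulation a natural fit for a parallel implementation.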

Computer Vision in ArrayFire - Part 2: Feature Description and Matching

Peter ArrayFire, Benchmarks, Computer Vision 2 Comments

In Part 1 of this series, we talked about upcoming feature detection algorithms in the ArrayFire library. In this post, we showcase some preliminary results of the feature description and matching functions that are under development in the ArrayFire library. Feature description is done using the ORB feature descriptor[1]. The descriptors are matched against a database of features using Hamming distance as the metric. The results we show in this post use the same hardware and software as the previous post: Intel Sandy Bridge Xeon processor with 32 cores (for the baseline OpenCV CPU implementation); NVIDIA Tesla K20C (for the OpenCV and ArrayFire CUDA implementations); ArrayFire development version; OpenCV version 2.4.9.

Feature Description and Matching Benchmarks

In Part 1 we showed that ...
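Matching binary descriptors by Hamming distance, as described in this excerpt, can be sketched as a brute-force nearest-neighbor search. This is a sequential illustration only, not the ArrayFire or OpenCV implementation; the names `Descriptor`, `hamming`, and `nearest` are hypothetical, and the sketch assumes 256-bit descriptors stored as eight 32-bit words.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed layout: a 256-bit binary descriptor as eight 32-bit words.
using Descriptor = std::array<std::uint32_t, 8>;

// Hamming distance: the number of bit positions in which a and b differ.
unsigned hamming(const Descriptor& a, const Descriptor& b) {
    unsigned distance = 0;
    for (std::size_t w = 0; w < a.size(); ++w) {
        // XOR leaves a 1 wherever the bits differ; count those 1s.
        distance += static_cast<unsigned>(std::bitset<32>(a[w] ^ b[w]).count());
    }
    return distance;
}

// Index of the database descriptor closest to the query.
// The database must be non-empty; ties go to the lowest index.
std::size_t nearest(const Descriptor& query,
                    const std::vector<Descriptor>& database) {
    std::size_t best = 0;
    unsigned best_distance = std::numeric_limits<unsigned>::max();
    for (std::size_t i = 0; i < database.size(); ++i) {
        const unsigned d = hamming(query, database[i]);
        if (d < best_distance) {
            best_distance = d;
            best = i;
        }
    }
    return best;
}
```

Hamming distance suits binary descriptors because it reduces to an XOR and a bit count per word, both of which are cheap on CPUs and GPUs.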

Image Processing Benchmarks on NVIDIA Jetson TK1

Pradeep ArrayFire, Benchmarks, CUDA 6 Comments

In this post, we look at benchmarks of the following ArrayFire image processing functions on an ARM device: Erosion/Dilation, Median filter, Resize, Histogram, Bilateral filter, Convolution. We pitted the brand new compute capability 3.2 GPU on the NVIDIA Jetson TK1 against a mobile NVIDIA GPU. The closest match in our mobile card deck to the GPU on the Jetson board (from here on referred to as TK1) is an NVIDIA GT 650M. The GPU device properties that have a critical effect on function performance are listed below.

Property Name              Jetson TK1 GK20A   GT 650M
Compute                    3.2                3.0
Number of multiprocessors  1                  2
Cores                      192                384
Base clock rate            852 MHz            950 MHz
Total global memory        1746 ...

ArrayFire + Scorpii Demo by CreativeC

Scott ArrayFire, Benchmarks, Case Studies, CUDA, Events

CreativeC makes awesome compute + visualization systems. We got to see the demo in live action at the GPU Technology Conference last month. Tim Thomas was kind enough to let us film the demo showing how ArrayFire can be used to drive a multi-node, 9-GPU system in a physics application. Check out the video below. If you are interested in high-throughput compute coupled with high-pixel visualizations, we recommend you talk with the folks at CreativeC. They are always pushing the envelope on what can be done with GPU computing and GPU visualizations. Also, if you have cool demos showing ArrayFire in action, let us know. We'd love to film your work and make it available on this blog!

ArrayFire Examples (Part 2 of 8) - Benchmarks

ArrayFire ArrayFire, Benchmarks, CUDA

This is the second in a series of posts looking at our current ArrayFire examples. The code can be compiled and run from arrayfire/examples/ when you download and install the ArrayFire library. Today we will discuss the examples found in the benchmarks/ directory. In these examples, my machine has the following configuration:

Blas

This example shows a simple benchmarking process using ArrayFire's matrix multiply routine. For more information on Blas, click here. The quantity measured in this example is GFLOPS (giga floating-point operations per second). I got the following results using the example code on my machine:

Just for fun, I wrote a small program to perform a simple matrix multiply on the CPU using a triple for-loop (a pretty standard, but unoptimized, solution). This ...
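A triple for-loop multiply of the kind described here, together with the GFLOPS figure it would be measured by, might look like the sketch below. This is not the author's program; the function names are hypothetical, matrices are assumed to be square and stored row-major, and the operation count uses the conventional 2n^3 estimate for an n x n multiply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices, using a plain triple for-loop.
std::vector<double> matmul_naive(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}

// GFLOPS for one n x n multiply that took `seconds` (must be positive),
// counting one multiply and one add per inner-loop iteration: 2 * n^3.
double gflops(std::size_t n, double seconds) {
    const double nd = static_cast<double>(n);
    return 2.0 * nd * nd * nd / seconds / 1e9;
}
```

To use it, time a call to `matmul_naive` with a monotonic clock such as `std::chrono::steady_clock` and pass the elapsed seconds to `gflops`; pick `n` large enough that the elapsed time is well above the timer resolution.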

Benchmarking Tesla K20

Pavan ArrayFire, Benchmarks, CUDA 1 Comment

In this blog post, we compare NVIDIA's latest high-end offering, the Tesla K series (PDF), with their previous offering. In particular, we compare the Tesla K20C with the Tesla C2070/2075. This post follows a similar post about benchmarking the GTX680 that we did last year. We take a look at a similar set of functions (and a little bit more) to see what benefits the newer line brings. All of the benchmarks were done using double precision. In all of the graphs, higher trendlines are better.

Matrix Multiplication

In house at AccelerEyes, we use matrix multiplication as the gold standard for testing the maximum performance of all new GPUs we end up with. The K20C reaches a peak at ...