CUDA Computing on Google Colab with ArrayFire

John Melonakos | ArrayFire, Benchmarks, Computing Trends, CUDA, Open Source, Python

For the first time in our 14-year existence, we can offer our community a way to run ArrayFire programs for free within minutes. Before today, users had to download and install the library on their own systems, which can be a hassle if you just want to play around with some code and benchmarks. Today, we’re excited to announce that ArrayFire is available on Google Colab, the free GPU computing cloud service from Google. Colaboratory, or “Colab” for short, allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing. You can jump right in and start playing with this new tool: Click Here to …

Performance Improvements to JIT in ArrayFire v3.4

Pavan Yalamanchili | Announcements, ArrayFire, Benchmarks

ArrayFire uses Just In Time (JIT) compilation to combine many lightweight functions into a single kernel launch. This, along with our easy-to-use API, allows users not only to prototype their algorithms quickly, but also to get the best out of the underlying hardware. This feature has been a favorite among our users in the domains of finance and scientific simulation. That said, ArrayFire v3.3 and earlier had two limitations: multiple outputs with interdependent variables generated multiple kernels, and the number of operations per kernel was fairly limited by default. In the latest release of ArrayFire, we addressed these issues to get some pretty impressive numbers. In the rest of the post, we demonstrate the performance improvements using our BlackScholes …

Benchmarking parallel vector libraries

Pavan Yalamanchili | ArrayFire, Benchmarks, C/C++, CUDA

There are many open source libraries that implement parallel versions of the algorithms in the C++ Standard Template Library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open. In this post we compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover the following commonly used vector algorithms across 3 different architectures: reductions, scan, and transform. The following setup has been used for benchmarking. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is listed below: NVIDIA Tesla K20; AMD FirePro S10000; Intel Xeon E5-2560v2. Background: ArrayFire. ArrayFire provides high …
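To make the three benchmarked algorithm families concrete, here is a minimal serial sketch in plain Python. It is not taken from the post or from any of the libraries compared; the helper names `vector_reduce`, `inclusive_scan`, and `transform` are hypothetical and only illustrate what each operation computes.

```python
from functools import reduce
from itertools import accumulate
import operator


def vector_reduce(values, op=operator.add, init=0):
    """Reduction: fold every element into a single value."""
    return reduce(op, values, init)


def inclusive_scan(values, op=operator.add):
    """Scan: element i holds op applied over values[0..i]."""
    return list(accumulate(values, op))


def transform(values, fn):
    """Transform: apply fn independently to every element."""
    return [fn(v) for v in values]


data = [3, 1, 4, 1, 5]
total = vector_reduce(data)
prefix = inclusive_scan(data)
squares = transform(data, lambda v: v * v)
```

The parallel libraries in the comparison implement the same semantics, but split the work across many threads; the serial versions above are only a reference for the expected results.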

Performance of ArrayFire JIT Code Generation

Oded Green | Benchmarks, Case Studies | 3 Comments

The ArrayFire library offers JIT (Just In Time) compilation for standard arithmetic operations. This includes trigonometric functions, comparisons, and element-wise operations. At run time, ArrayFire aggregates these function calls using an Abstract Syntax Tree (AST) data structure, such that whenever a JIT-supported function is “met,” it is added to the AST for a given variable instance. The AST of the variable is computed if one of the following conditions is met: When the above occurs, and the variable needs to be evaluated, the functions and variables in the AST data structure are used to create a single kernel (“function call”). This is done by creating a customized kernel on-the-fly that is made up of all the functions in the AST – the …
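The idea of deferring element-wise operations into a tree and evaluating them in one pass can be sketched in plain Python. This is a toy illustration under stated assumptions, not ArrayFire's implementation: the `Lazy` class, its method names, and the `OPS` table are all hypothetical, and the "single kernel" is simulated by one loop over the elements.

```python
import math
import operator

# Hypothetical table of supported element-wise operations.
OPS = {"add": operator.add, "mul": operator.mul, "sin": math.sin}


class Lazy:
    """A node in a toy expression tree; nothing is computed until eval()."""

    def __init__(self, op, args=(), data=None):
        self.op, self.args, self.data = op, tuple(args), data

    @staticmethod
    def array(values):
        return Lazy("leaf", data=list(values))

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Lazy) else Lazy("const", data=x)

    def __add__(self, other):
        return Lazy("add", (self, Lazy._wrap(other)))

    def __mul__(self, other):
        return Lazy("mul", (self, Lazy._wrap(other)))

    def sin(self):
        return Lazy("sin", (self,))

    def _size(self):
        # Length comes from the first array leaf; constants have no length.
        if self.op == "leaf":
            return len(self.data)
        sizes = [s for s in (a._size() for a in self.args) if s is not None]
        return sizes[0] if sizes else None

    def _at(self, i):
        # Evaluate the whole tree for element i.
        if self.op == "leaf":
            return self.data[i]
        if self.op == "const":
            return self.data
        return OPS[self.op](*(a._at(i) for a in self.args))

    def eval(self):
        # One pass over the elements stands in for one fused kernel launch.
        return [self._at(i) for i in range(self._size())]


x = Lazy.array([0.0, 1.0, 2.0])
y = Lazy.array([10.0, 20.0, 30.0])
expr = (x * 2.0 + y).sin()  # builds a tree only; no arithmetic happens here
result = expr.eval()        # the multiply, add, and sin run in a single pass
```

The design point is that `x * 2.0 + y` never materializes an intermediate array: each element goes through all three operations before the next element is touched, which is the saving a fused kernel aims for.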

ArrayFire Benchmarks: AMD Kaveri vs Intel Haswell Part-1

Pradeep Garigipati | ArrayFire, Benchmarks | 10 Comments

We have had queries in the past requesting benchmarks on integrated GPUs of Intel and AMD processors. This post is a modest attempt to answer those questions. In this post, we focus on the GPU benchmarks of AMD A10-7850K APU and Intel i7-4790K HD 4600 for the following ArrayFire functions: Bilateral Filter, Erosion/Dilation, 2D Convolution, 2D Fast Fourier Transform, Median Filter, Resize, Rotate, Scan of 1D Array, Reduction of 1D Array, Sort, Matrix Transpose. Remarks: For most of the benchmarks the Intel system was outperformed by the AMD APU. We believe that we will be able to get more performance from the Intel system by modifying the kernels to use vector operations, which will increase resource utilization. Keep an eye …

Triangle Counting in Graphs on the GPU (Part 2)

Oded Green | Benchmarks, CUDA | 2 Comments

A while back I wrote a blog post on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways of finding triangles in a graph. Our approach is based on intersecting two adjacency lists and finding the elements common to both lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally …
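The adjacency-list intersection approach described above can be sketched serially in Python. This is an illustrative sketch, not the post's CUDA implementation: it assumes an undirected graph stored as a dict of sorted neighbor lists, and the function names `count_common` and `count_triangles` are hypothetical.

```python
def count_common(a, b):
    """Count elements present in both sorted lists using two pointers."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles(adj):
    """adj maps each vertex to the sorted list of its neighbors (undirected)."""
    total = 0
    for u, neighbors in adj.items():
        for v in neighbors:
            if u < v:  # visit each undirected edge once
                total += count_common(neighbors, adj[v])
    # Each triangle is found once per edge, so it is counted three times.
    return total // 3


# Edges: 0-1, 0-2, 1-2, 1-3, 2-3 (two triangles: {0,1,2} and {1,2,3}).
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
triangles = count_triangles(graph)
```

Every edge's intersection is independent of the others, which is what makes this formulation a natural fit for a GPU: the outer loops can be distributed across threads.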

Computer Vision in ArrayFire – Part 2: Feature Description and Matching

Peter Entschev | ArrayFire, Benchmarks, Computer Vision | 2 Comments

In Part 1 of this series, we talked about upcoming feature detection algorithms in the ArrayFire library. In this post we showcase some preliminary results of the feature description and matching functionality under development in the ArrayFire library. Feature description is done using the ORB feature descriptor[1]. The descriptors are matched against a database of features using Hamming distance as the metric. The results we show in this post use the same hardware and software used in the previous post: Intel Sandy Bridge Xeon processor with 32 cores (for baseline OpenCV CPU implementation); NVIDIA Tesla K20C (for OpenCV and ArrayFire CUDA implementations); ArrayFire development version; OpenCV version 2.4.9. Feature Description and Matching Benchmarks: In Part 1 we showed that …
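Matching binary descriptors by Hamming distance can be sketched as a brute-force nearest-neighbor search. The sketch below is an assumption-laden illustration, not ArrayFire's matcher: descriptors are modeled as Python integers (toy 8-bit values here, whereas real ORB descriptors are much longer bit strings), and `hamming` and `match` are hypothetical helper names.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary descriptors differ."""
    return bin(a ^ b).count("1")


def match(queries, database):
    """For each query, return (index of nearest database entry, distance)."""
    matches = []
    for q in queries:
        best = min(range(len(database)), key=lambda k: hamming(q, database[k]))
        matches.append((best, hamming(q, database[best])))
    return matches


# Toy 8-bit descriptors, for illustration only.
database = [0b10110010, 0b01001101, 0b11110000]
queries = [0b10110011, 0b11100000]
result = match(queries, database)
```

Because the distance is an XOR followed by a bit count, each query-database pair is cheap and independent, so the search parallelizes well on a GPU.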

Image Processing Benchmarks on NVIDIA Jetson TK1

Pradeep Garigipati | ArrayFire, Benchmarks, CUDA | 7 Comments

In this post we will be looking at benchmarks of the following ArrayFire image processing functions on an ARM device: Erosion/Dilation, Median filter, Resize, Histogram, Bilateral filter, Convolution. We pitted the brand new compute 3.2 GPU on the NVIDIA Jetson TK1 against a mobile NVIDIA GPU. The closest match in our mobile card deck to the GPU on the Jetson board (from here on referred to as TK1) is an NVIDIA GT 650M. The GPU device properties that have a critical effect on function performance are listed below.

Property                     Jetson TK1 GK20A    GT 650M
Compute                      3.2                 3.0
Number of multiprocessors    1                   2
Cores                        192                 384
Base clock rate              852 MHz             950 MHz
Total global memory          1746 …

ArrayFire + Scorpii Demo by CreativeC

Scott | ArrayFire, Benchmarks, Case Studies, CUDA, Events

CreativeC makes awesome compute + visualization systems. We got to see the demo live at the GPU Technology Conference last month. Tim Thomas was kind enough to let us film the demo showing how ArrayFire can be used to drive a multi-node, 9-GPU system in a physics application. Check out the video below. If you are interested in high-throughput compute coupled with high-pixel visualizations, we recommend you talk with the folks at CreativeC. They are always pushing the envelope on what can be done with GPU computing and GPU visualizations. Also, if you have cool demos showing ArrayFire in action, let us know. We’d love to film your work and make it available on this blog! …

ArrayFire Examples (Part 2 of 8) – Benchmarks

ArrayFire | ArrayFire, Benchmarks, CUDA

This is the second in a series of posts looking at our current ArrayFire examples. The code can be compiled and run from arrayfire/examples/ when you download and install the ArrayFire library. Today we will discuss the examples found in the benchmarks/ directory. In these examples, my machine has the following configuration:

    ArrayFire v1.9 (build XXXXXXX) by AccelerEyes (64-bit Linux)
    License: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    CUDA toolkit 5.0, driver 304.54
    GPU0 Quadro 6000, 6144 MB, Compute 2.0 (single,double)
    Display Device: GPU0 Quadro 6000
    Memory Usage: 5549 MB free (6144 MB total)…

Blas

This example shows a simple benchmarking process using ArrayFire’s matrix multiply routine. For more information on Blas, click here. The quantity measured in this example is GFLOPS (giga floating-point operations per second). I got the following results using …
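As a rough illustration of how a GFLOPS figure for matrix multiplication is usually derived, the sketch below times a naive pure-Python multiply and converts the elapsed time to a rate. It assumes the common convention of counting 2·n³ floating-point operations for an n×n by n×n product; the excerpt does not show the formula the ArrayFire example uses, and `matmul` and `matmul_gflops` are hypothetical helper names.

```python
import time


def matmul(a, b):
    """Naive dense matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matmul_gflops(n, seconds):
    """Rate for an n x n multiply, assuming 2 * n**3 floating-point operations."""
    return 2.0 * n**3 / seconds / 1e9


n = 32
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

rate = matmul_gflops(n, elapsed)
print(f"{n}x{n} multiply: {elapsed:.6f} s, {rate:.6f} GFLOPS")
```

A real benchmark would repeat the multiply several times and, on a GPU, synchronize the device before stopping the timer so that asynchronous kernel launches are fully accounted for.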