Introduction Zero Copy access has been a part of the CUDA Toolkit for a long time (~2009). However, very few applications used the capability because, over time, GPU memory has become reasonably large. The applications that did use zero copy access were mainly ones with extremely large memory requirements, such as database processing. Zero Copy is a way to map host memory and access it directly over PCIe without doing an explicit memory transfer. It allows CUDA kernels to directly access host memory. Instead of reading data from global memory (limit ~200 GB/s), data is read over PCIe and is limited by PCIe bandwidth (up to 16 GB/s). Hence there was no real performance advantage for most applications. However, with the …
Triangle Counting in Graphs on the GPU (Part 1)
Triangle counting is a building block for many social analytics, especially for the widely used clustering coefficient analytic. Clustering coefficients are used for finding key players in a network based on their local connectivity, which is expressed in terms of the number of triangles that they belong to. There are many ways of finding the triangles a vertex may belong to. One of the most popular approaches is to intersect two adjacency lists. If a pair of vertices (u,v) has any common neighbors, these will be found in the intersection process. While the above may sound simple, implementing an efficient intersection on the GPU is not straightforward; it requires smart partitioning of the work to make full use …
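The adjacency-list intersection described above can be sketched on the CPU as follows. This is only a minimal sequential illustration in C++, not the GPU implementation the post goes on to describe; the names `Graph`, `countCommonNeighbors` and `countTriangles` are made up for this sketch, and it assumes an undirected graph whose adjacency lists are sorted and free of self-loops and duplicates.

```cpp
#include <cstddef>
#include <vector>

// Assumed representation: g[u] is the sorted list of neighbors of vertex u.
using Graph = std::vector<std::vector<int>>;

// Two-pointer intersection of two sorted adjacency lists.
// Every common entry is a vertex w that closes a triangle (u, v, w).
std::size_t countCommonNeighbors(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    std::size_t i = 0, j = 0, common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++common;
            ++i;
            ++j;
        }
    }
    return common;
}

// Visits each undirected edge once (u < v) and intersects the two lists.
// Each triangle is found once per edge, i.e. three times, hence the division.
std::size_t countTriangles(const Graph& g) {
    std::size_t total = 0;
    for (std::size_t u = 0; u < g.size(); ++u) {
        for (int v : g[u]) {
            if (static_cast<std::size_t>(v) > u) {
                total += countCommonNeighbors(g[u], g[v]);
            }
        }
    }
    return total / 3;
}
```

For a complete graph on four vertices, every one of the six edges has two common neighbors, so the sketch reports four triangles.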
Templating and Caching OpenCL Kernels
About a month ago, one of my colleagues did a post on how to author the most concise OpenCL program using the C++ API provided by Khronos. In today's post, we shall further modify that example to achieve the following two goals.
* Enable the kernel to work with different integral data types out of the box
* Ensure that the kernels compile only once at run time per data type
Let's dive into the details now. We can template the OpenCL kernels by passing a build option -D T="typename" to the kernel compilation step. To pass such options, we need a construct that can give us a string literal that represents the corresponding integral type. Let us declare a struct …
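One possible shape for such a construct is sketched below. It is not the struct from the original post (the excerpt is cut off before it appears); `TypeName`, `FakeProgram`, `compileFor`, `programFor` and `compileCount` are hypothetical names, and the compilation step is a stand-in that only records the build option rather than calling the OpenCL API. The sketch shows two ideas: a trait that maps a C++ type to an OpenCL type name for the `-D T=` option, and a function-local static that runs the "compile" step once per data type.

```cpp
#include <string>

// Hypothetical trait: maps a C++ integral type to its OpenCL C type name.
template <typename T> struct TypeName;
template <> struct TypeName<int>      { static const char* get() { return "int"; } };
template <> struct TypeName<unsigned> { static const char* get() { return "uint"; } };
template <> struct TypeName<short>    { static const char* get() { return "short"; } };

// Stand-in for a compiled program. A real version would hold the
// program object returned by the OpenCL C++ bindings instead.
struct FakeProgram {
    std::string options;
};

// Counts how many times the stand-in compile step has run.
inline int& compileCount() {
    static int count = 0;
    return count;
}

// Stand-in compile step: only assembles the build option string.
template <typename T>
FakeProgram compileFor() {
    ++compileCount();
    return FakeProgram{std::string("-D T=") + TypeName<T>::get()};
}

// Each instantiation owns one static program, so the compile step
// runs at most once per data type no matter how often this is called.
template <typename T>
const FakeProgram& programFor() {
    static const FakeProgram program = compileFor<T>();
    return program;
}
```

Calling `programFor<int>()` repeatedly reuses the same object, while the first call to `programFor<unsigned>()` triggers a separate compile with `-D T=uint`.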
Accelerating Java using ArrayFire, CUDA and OpenCL
We have previously mentioned the ability to use ArrayFire through Java. In this post, we are going to show how you can get the best performance inside Java using ArrayFire for CUDA and OpenCL.
Code
Here is sample code to perform Monte Carlo estimation of Pi.
    import java.util.Random;

    // Native Java Code
    public static double hostCalcPi(int size) {
        Random rand = new Random();
        int count = 0;
        for (int i = 0; i < size; i++) {
            float x = rand.nextFloat();
            float y = rand.nextFloat();
            boolean lt1 = (x * x + y * y) < 1;
            if (lt1) count++;
        }
        return 4.0 * ((double)(count)) / size;
    }
The same code can be written using ArrayFire in the following ...
The Benefits of Array Element-Wise Operations
In this blog we will review the benefit of using element-wise operations for your computations. Element-wise operations are operations that are applied to every element in an array and allow the user to avoid coding loops and nested loops for rudimentary operations. In a simple example of an element-wise operation, we use both the addition (+) and multiplication (*) operations: array A = randu(1024, 1024), B = randu(1024, 1024); array C=A*A+B*B; An element-wise operator that is applied to a single element is a unary operator. An operator that works on two elements is a binary operator. ArrayFire has implemented a large number of element-wise operations that are applied to the elements of an array. These operators can help reduce the programming overhead for an application designer as: Performance – …
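For comparison, the explicit loop that the one-line expression C=A*A+B*B stands in for might look like the sketch below. It uses plain C++ with flat `std::vector<float>` storage rather than ArrayFire arrays, and `sumOfSquares` is a name invented for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hand-written equivalent of the element-wise expression C = A*A + B*B
// for two arrays stored as flat vectors of the same length.
std::vector<float> sumOfSquares(const std::vector<float>& a,
                                const std::vector<float>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("arrays must have the same size");
    }
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] * a[i] + b[i] * b[i];
    }
    return c;
}
```

With element-wise operators, the size check, the allocation and the loop are all handled by the library, which is the programming overhead the post refers to.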
Cross Compile to Windows From Linux
Why did I not know about this? It's like I just discovered the screwdriver! On Debian and variants (from tinc's Windows cross-compilation page),
    sudo apt-get install mingw-w64
    # C
    i686-w64-mingw32-gcc hello.c -o hello32.exe      # 32-bit
    x86_64-w64-mingw32-gcc hello.c -o hello64.exe    # 64-bit
    # C++
    i686-w64-mingw32-g++ hello.cc -o hello32.exe     # 32-bit
    x86_64-w64-mingw32-g++ hello.cc -o hello64.exe   # 64-bit
Granted, this isn't a silver bullet, but rather a quick way to get a Windows build of platform-independent code that you might already have running in Linux. I've found that this approach makes it easy to get binaries out the door in a hurry when it's hard to get a project building with Visual Studio or even on the Windows platform itself (due …
Image editing using ArrayFire: Part 3
Today, we will be doing the third post in our series Image editing using ArrayFire. References to old posts are available below.
* Part 1
* Part 2
In this post, we will be looking at the following operations.
* Image Histogram
* Simple Binary Threshold
* Otsu Threshold
* Iterative Threshold
* Adaptive Binary Threshold
* Emboss Filter
Today's post will be mostly dominated by the different types of threshold operations we can achieve using ArrayFire.
Image Histogram
We have a built-in function in ArrayFire that creates a histogram. The input image was converted to grayscale before histogram calculation, as our histogram implementation works for vectors and 2D matrices only. In case you need a histogram for all three channels of a color image, you can …
Image editing using ArrayFire: Part 2
A couple of weeks back, we did a post on a few image editing functions using the ArrayFire library. Today, we shall be doing the second post in the series Image Editing using ArrayFire. We will be looking at the following operations today.
* Image distortion
* Noise addition
* Noise reduction
* Edge filters
* Boundary extraction
* Difference of Gaussians
Code and sample inputs/outputs corresponding to each operation are described below.
Image distortion
We will be looking at spread and pick filters in this section. Both of these filters are fundamentally the same: they replace each pixel in the original image with one of its neighboring pixels. How the neighbor is chosen is essentially the difference between spread and pick. Both of these functions use …
Computer Vision in ArrayFire – Part 2: Feature Description and Matching
In Part 1 of this series, we talked about upcoming feature detection algorithms in the ArrayFire library. In this post we showcase some of the preliminary results of feature description and matching that are under development in the ArrayFire library. Feature description is done using the ORB feature descriptor[1]. The descriptors are matched against a database of features using Hamming distance as the metric. The results we show in this blog use the same hardware and software used in the previous blog:
* Intel Sandy Bridge Xeon processor with 32 cores (for baseline OpenCV CPU implementation)
* NVIDIA Tesla K20C (for OpenCV and ArrayFire CUDA implementations)
* ArrayFire development version
* OpenCV version 2.4.9
Feature Description and Matching Benchmarks
In Part 1 we showed that …
ArrayFire: Write once, Run anywhere
One of ArrayFire’s biggest features is the ability for code to be written just once and run on a plethora of devices. In this post, we show the outputs of af::info() from various devices available to us.
Desktop Processors
AMD GPU/CPU (OpenCL)
    ArrayFire v2.1 (OpenCL, 64-bit Linux, build 4b9115c)
    License: Standalone (/home/pavan/.arrayfire.lic)
    Addons: MGL4, DLA, SLA
    Platform: AMD Accelerated Parallel Processing, Driver: 1526.3 (VM)
    [0]: Tahiti, 2864 MB, OpenCL Version: 1.2
    1 : AMD FX(tm)-8350 Eight-Core Processor, 7953 MB, OpenCL Version: 1.2
    Compute Device: [0]
AMD APU (OpenCL)
    ArrayFire v2.1 (OpenCL, 64-bit Linux, build 586ef59)
    License: Standalone (/home/arrayfire/.arrayfire.lic)
    Addons: MGL4, DLA, SLA
    Platform: AMD Accelerated Parallel Processing, Driver: 1445.5 (VM)
    [0]: Spectre, 624 MB, OpenCL Version: 1.2
    1 : AMD …