## Triangle Counting in Graphs on the GPU (Part 2)

A while back I wrote a blog on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways for finding triangles in a graph. Our approach is based off the intersection of two adjacency lists and finding the common elements in both those lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally ...

## Demystifying PTX Code

In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address the issue of how a PTX file look, and more importantly, how to understand all those complicated instructions in a PTX files. In this post I will use the same vector addition kernel from the the previous post previous post (the complete code can be found here). For this post, I will focus on OpenCL PTX file. In a future post I will discuss the differences between PTX files of OpenCL and CUDA code. Let's start by looking at the complete PTX code: // // Generated by NVIDIA NVVM Compiler // Compiler built on Sun May 18 ...

## Generating PTX files from OpenCL code

Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Therefore, it is not uncommon that CUDA code on NVIDIA GPUs will run faster than OpenCL. A very good way to understand what is behind the curtains is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels. PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than a single thread assembly code, except that now we are thinking in massive parallel execution. Retrieving the PTX ...

## Image editing using ArrayFire: Part 3

Today, we will be doing the third post in our series Image editing using ArrayFire. References to old posts are available below. * Part 1 * Part 2 In this post, we will be looking at the following operations. Image Histogram Simple Binary Theshold Otsu Threshold Iterative Threshold Adaptive Binary Threshold Emboss Filter Today's post will be mostly dominated by different types of threshold operations we can achieve using ArrayFire. Image Histogram We have a built-in function in ArrayFire that creates a histogram. The input image was converted to gray scale before histogram calculation as our histogram implementation works for vector and 2D matrices only. In case, you need histogram for all three channels of a color image, you can ...

## ArrayFire Capability Update - July 2014

Oded 1 Comment

In response to user requests for additional ArrayFire capabilities, we have decided to extend the library to have CPU fall back when OpenCL drivers for CPUs are not available. This means that ArrayFire code will be portable to both devices that have OpenCL setup and devices without it. This is done through the creation of additional backends. This will allow ArrayFire users to write their code once and have it run on multiple systems. We currently support the following systems and architectures: NVIDIA GPUs (Tesla, Fermi, and Kepler) AMD's GPUs, CPUs and APUs Intel's CPUs, GPUs and Xeon Phi Co-Processor Mobile and Embedded devices As part of this update process we are also looking at extending ArrayFire capabilities to low power systems such ...

## Image Processing Benchmarks on NVIDIA Jetson TK1

In this post we will be looking at benchmarks of the following ArrayFire image processing functions on an ARM device. Erosion/Dilation Median filter Resize Histogram Bilateral filter Convolution We pitted the brand new compute 3.2 GPU on NVIDIA Jetson TK1 against a mobile NVIDIA GPU. The closest match to the GPU (from here on referred as TK1) on the Jetson board we have in our mobile card deck is a NVIDIA GT 650M. The GPU device properties that have critical effect on the function performance are listed below. Property Name / Device Name Jetson TK1 GK20A GT 650M Compute 3.2 3.0 Number of multiprocessors 1 2 Cores 192 384 Base clock rate 852 MHz 950 MHz Total global memory 1746 ...

## Custom Kernels with ArrayFire

As extensive as ArrayFire is, there are a few cases where you are still working with custom CUDA or OpenCL kernels. For example, you may want to integrate ArrayFire into an existing code base for productivity or you may want to keep it around the old implementation for testing purposes. In this post we are going to talk about how to integrate your custom kernels into ArrayFire in a seamless fashion. In and Out of ArrayFire First let us look at the following code and then break it down bit by bit. int main() { af::array x = af::randu(num, 1); af::array y = af::array(num, 1); float *d_x = x.device(); float *d_y = y.device(); af::sync(); launch_simple_kernel(d_y, d_x, num); x.unlock(); y.unlock(); float err = ...

## Open Source Initiatives from ArrayFire

At ArrayFire we like to use a lot of Free/Open Source software. We use various Linux distributions, Jenkins, Gitlab, gcc, emacs, vim and numerous other FOSS tools on a daily basis. We also love the idea of developing software collaboratively and openly. Last year we started working with AMD on CL Math Libraries. Internally we've had numerous discussions about contributing to the GPGPU community. However, it's neither simple nor straightforward to take a closed software Open Source. Earlier this year, we decided to take the first step and Open Source all of the ArrayFire library's  tertiary projects. This includes all of our ArrayFire library's language wrappers, examples, and source code used for our blog posts. All of our projects are hosted at our ...