Conway’s Game of Life using ArrayFire

Shehzan Mohammed | ArrayFire, CUDA, Image Processing, Open Source, OpenGL | 4 Comments

Conway’s Game of Life is a popular zero-player cellular automaton devised by John Horton Conway in 1970. It is fun to watch: the player sets the initial condition and then simply observes how the game evolves. Each cell has 2 states: live or dead. Four simple rules determine a cell’s state in the next generation: (1) Any live cell with fewer than two live neighbours dies, as if caused by under-population. (2) Any live cell with two or three live neighbours lives on to the next generation. (3) Any live cell with more than three live neighbours dies, as if by overcrowding. (4) Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction. From a programmer’s …
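The four rules above can be sketched in a few lines of plain C++. This is a serial CPU illustration only, not the ArrayFire version the post goes on to describe. The function name `lifeStep`, the row-major layout, and the choice to treat cells outside the grid as dead are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): advances a rows x cols
// grid by one generation. Cells are 1 (live) or 0 (dead), stored row-major.
// Assumption: cells outside the grid count as dead (no wrap-around).
std::vector<int> lifeStep(const std::vector<int>& grid,
                          std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    std::vector<int> next(grid.size(), 0);
    const std::ptrdiff_t nRows = static_cast<std::ptrdiff_t>(rows);
    const std::ptrdiff_t nCols = static_cast<std::ptrdiff_t>(cols);

    for (std::ptrdiff_t r = 0; r < nRows; ++r) {
        for (std::ptrdiff_t c = 0; c < nCols; ++c) {
            // Count live cells among the (up to) eight neighbours.
            int neighbours = 0;
            for (std::ptrdiff_t dr = -1; dr <= 1; ++dr) {
                for (std::ptrdiff_t dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const std::ptrdiff_t rr = r + dr;
                    const std::ptrdiff_t cc = c + dc;
                    if (rr < 0 || cc < 0 || rr >= nRows || cc >= nCols) continue;
                    neighbours += grid[rr * nCols + cc];
                }
            }
            const bool alive = grid[r * nCols + c] != 0;
            // Rules 1-3: a live cell survives only with 2 or 3 neighbours.
            // Rule 4: a dead cell becomes live with exactly 3 neighbours.
            next[r * nCols + c] =
                (neighbours == 3 || (alive && neighbours == 2)) ? 1 : 0;
        }
    }
    return next;
}
```

Calling `lifeStep` on a 3x3 grid holding a vertical line of three live cells (the "blinker" pattern) yields a horizontal line, and a second call restores the vertical one.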

Triangle Counting in Graphs on the GPU (Part 3)

Oded Green | ArrayFire | 1 Comment

In this post, I will wrap up the work that we completed on triangle counting in graphs on the GPU using CUDA. The two previous posts are Part 1 and Part 2. The first part introduces the significance of the problem and the second part explains the algorithms that we used in our solution. This work was presented in finer detail at the Workshop on Irregular Applications: Architectures and Algorithms, which took place as part of Supercomputing 2014. The full report can be found here. In the previous posts, we discussed how the performance of triangle counting depends on the algorithm and the CUDA kernel. Our implementation gives the data scientist control over several important parameters: Number of …

ArrayFire is Now Open Source

Scott | Announcements, ArrayFire, Open Source | 7 Comments

Yes, you read that right! ArrayFire is open source—it’s all there and it’s all free. This is big, and you and the rest of the parallel computing community are going to love it! You can download our pre-compiled binary installers which are optimized for a wide variety of systems or you can get a copy of the ArrayFire source code from our GitHub page. ArrayFire is being released under the BSD 3-Clause License, which will enable unencumbered deployment and portability of ArrayFire for commercial use. So go check it out! We welcome your feedback and look forward to your future contributions to ArrayFire. The move to open source isn’t our only news—we’ve also made ArrayFire better than ever. Check out our recent …

New Features in ArrayFire

Pavan Yalamanchili | ArrayFire | 2 Comments

We have previously talked about upcoming computer vision algorithms in the next version of ArrayFire. Today we are going to discuss some of the bigger changes and additions to ArrayFire. New CPU backend: In addition to the CUDA and OpenCL backends, you can now run ArrayFire natively on any CPU. This is another step we’ve taken in our efforts to make ArrayFire truly portable. The biggest benefits of the new CPU backend include: Hardware and software neutrality: You can now build and ship applications without worrying about the hardware and drivers present on end users’ machines. You can also port your applications easily to embedded and mobile platforms where CUDA and OpenCL may not be available. Heterogeneous computing: It is now easier …

ArrayFire at SC14

Aaron Taylor | ArrayFire, Events | 3 Comments

HPC matters. That’s the tagline for SC14, and here at ArrayFire we completely agree. We’ve exhibited at SC for the past few years, and we’re excited to once again be a part of this excellent conference! It’s a great place for soaking up HPC knowledge, getting inspired, and connecting with the brightest minds in the industry. Here’s a quick run-down of where we’ll be. Visit our booth. We’re at booth #2725. We’ll have beautiful demos running and our engineers will be available for questions. Ask your questions, meet the team, or just bounce some ideas around. Maybe, just maybe, you’ll get a sneak peek at our most ambitious project yet… Try our in-booth tutorials. Want to learn how to use ArrayFire to accelerate …

CUDA Optimization tips for Matrix Transpose in real world applications

Pradeep Garigipati | ArrayFire | 1 Comment

Computer algorithms are extra friendly towards data sizes that are powers of two. GPU compute algorithms work particularly well with data sizes that are multiples of 32. In most real-world situations, however, data is rarely so conveniently sized. In today’s post, we’ll be looking at one such scenario related to GPU compute. Specifically, we’ll provide you with some tips on how to optimize a matrix transpose algorithm for a GPU. Let’s start with the transpose kernel available from NVIDIA’s Parallel Forall blog. It’s been optimized to avoid bank conflicts as well, but only works on matrices with dimensions that are multiples of 32. template <typename T> __global__ void transpose32(T * out, const T * in, unsigned dim0, unsigned dim1) { __shared__ T shrdMem[TILE_DIM][TILE_DIM+1]; …
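To make the sizing problem concrete, here is a CPU-side sketch in plain C++ of a tile-by-tile transpose that clamps each tile to the matrix bounds, so dimensions need not be multiples of 32. It is an analogy only: it is not the CUDA kernel from the post, and clamping edge tiles is an assumed strategy rather than the specific tip the post gives. The name `transposeTiled` is hypothetical; `TILE_DIM` mirrors the constant in the excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE_DIM = 32;  // same tile size as in the excerpt

// Hypothetical helper: transposes a row-major rows x cols matrix tile by tile.
// Edge tiles are clamped to the matrix bounds, so rows and cols may be any size.
template <typename T>
std::vector<T> transposeTiled(const std::vector<T>& in,
                              std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<T> out(in.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += TILE_DIM) {
        for (std::size_t c0 = 0; c0 < cols; c0 += TILE_DIM) {
            // Partial tiles at the right and bottom edges stop early.
            const std::size_t rEnd = std::min(r0 + TILE_DIM, rows);
            const std::size_t cEnd = std::min(c0 + TILE_DIM, cols);
            for (std::size_t r = r0; r < rEnd; ++r) {
                for (std::size_t c = c0; c < cEnd; ++c) {
                    out[c * rows + r] = in[r * cols + c];
                }
            }
        }
    }
    return out;
}
```

In a GPU kernel the equivalent of the `std::min` clamp is typically a per-thread bounds check before each load and store, which is where the cost of irregular sizes shows up.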

Zero Copy on Tegra K1

Shehzan Mohammed | ArrayFire | 8 Comments

Introduction: Zero Copy access has been a part of the CUDA Toolkit for a long time (since around 2009). However, very few applications used the capability because, over time, GPU memory has become reasonably large. The applications that did use zero copy access were mainly ones with extremely large memory requirements, such as database processing. Zero Copy is a way to map host memory and access it directly over PCIe without doing an explicit memory transfer. It allows CUDA kernels to directly access host memory. Instead of reading data from global memory (limit ~200GB/s), data would be read over PCIe and be limited by PCIe bandwidth (up to 16GB/s). Hence there was no real performance advantage for most applications. However, with the …

Triangle Counting in Graphs on the GPU (Part 1)

Oded Green | ArrayFire | 3 Comments

Triangle counting is a building block for many social analytics, especially for the widely used clustering coefficient analytic. Clustering coefficients are used for finding key players in a network based on their local connectivity, which is expressed in terms of the number of triangles that they belong to. There are many ways of finding the triangles a vertex may belong to. One of the most popular approaches is to intersect two adjacency lists. If a pair of vertices (u,v) have any common neighbors, these will be found in the intersection process. While the above may sound simple, implementing an efficient intersection on the GPU is not straightforward; it requires smart partitioning of the work to make full use …
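The intersection step can be illustrated with a serial C++ sketch. It assumes each adjacency list is sorted in ascending order and contains no duplicates, and it deliberately ignores the GPU work partitioning that the post is actually about. The function name `countCommonNeighbours` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the neighbours shared by vertices u and v,
// given their adjacency lists. Assumption: both lists are sorted ascending
// and contain no duplicates. Each common neighbour w closes a triangle (u, v, w).
std::size_t countCommonNeighbours(const std::vector<int>& adjU,
                                  const std::vector<int>& adjV) {
    assert(std::is_sorted(adjU.begin(), adjU.end()));
    assert(std::is_sorted(adjV.begin(), adjV.end()));
    std::size_t i = 0, j = 0, count = 0;
    // Merge-style walk: advance whichever list currently has the smaller id.
    while (i < adjU.size() && j < adjV.size()) {
        if (adjU[i] < adjV[j]) {
            ++i;
        } else if (adjU[i] > adjV[j]) {
            ++j;
        } else {
            ++count;  // same vertex id appears in both lists
            ++i;
            ++j;
        }
    }
    return count;
}
```

For example, the lists {1, 3, 5, 7} and {2, 3, 4, 7, 9} share the neighbours 3 and 7, so the edge (u,v) takes part in two triangles. The walk is inherently sequential, which hints at why splitting it across many GPU threads takes care.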

Templating and Caching OpenCL Kernels

Pradeep Garigipati | ArrayFire | 2 Comments

About a month ago, one of my colleagues did a post on how to author the most concise OpenCL program using the C++ API provided by Khronos. In today’s post, we shall further modify that example to achieve the following two goals: (1) enable the kernel to work with different integral data types out of the box, and (2) ensure that the kernels compile only once at run time per data type. Let’s dive into the details now. We can template the OpenCL kernels by passing a build option -D T="typename" to the kernel compilation step. To pass such options, we would need a construct that can give us a string literal that represents the corresponding integral type. Let us declare a struct …
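A minimal sketch of both goals in plain C++ follows. It makes no OpenCL calls, so it runs anywhere: `FakeProgram` is a stand-in for a compiled program object, and `TypeName`, `buildOptions`, `getProgram`, and `compileCount` are hypothetical names that need not match the struct the post declares. The option is built without quotes around the type name, which is one possible form.

```cpp
#include <string>

// Hypothetical trait: maps a C++ integral type to its OpenCL C type name.
template <typename T> struct TypeName;
template <> struct TypeName<int>      { static const char* get() { return "int"; } };
template <> struct TypeName<unsigned> { static const char* get() { return "uint"; } };
template <> struct TypeName<short>    { static const char* get() { return "short"; } };

// Builds the "-D T=<type>" option that would be passed to kernel compilation.
template <typename T>
std::string buildOptions() {
    return std::string("-D T=") + TypeName<T>::get();
}

// Stand-in for a compiled OpenCL program (assumption: no real OpenCL here).
struct FakeProgram {
    std::string options;
};

// Counts how many times a "compilation" happened, for demonstration only.
inline int& compileCount() {
    static int n = 0;
    return n;
}

template <typename T>
FakeProgram compileFor() {
    ++compileCount();  // a real version would build the program here
    return FakeProgram{buildOptions<T>()};
}

// A function-local static is initialised once per template instantiation,
// so each data type triggers at most one compilation.
template <typename T>
const FakeProgram& getProgram() {
    static const FakeProgram program = compileFor<T>();
    return program;
}
```

Calling `getProgram<int>()` repeatedly compiles once; the first call to `getProgram<unsigned>()` compiles a second, separate program. A function-local static keeps the cache implicit; an explicit map keyed by type name would be an alternative if programs had to be released in a controlled order.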

Accelerating Java using ArrayFire, CUDA and OpenCL

Pavan Yalamanchili | ArrayFire, Java | 3 Comments

We have previously mentioned the ability to use ArrayFire through Java. In this post, we are going to show how you can get the best performance inside Java using ArrayFire for CUDA and OpenCL. Code: Here is sample code to perform a Monte Carlo estimation of Pi. import java.util.Random; // Native Java Code public static double hostCalcPi(int size) { Random rand = new Random(); int count = 0; for (int i = 0; i < size; i++) { float x = rand.nextFloat(); float y = rand.nextFloat(); boolean lt1 = (x * x + y * y) < 1; if (lt1) count++; } return 4.0 * ((double)(count)) / size; } The same code can be written using ArrayFire in the following ...