Conway’s Game of Life is a popular zero player cellular automaton devised by the John Horton Conway in 1970. The game makes for a fun evolution as the player sets the initial condition and then observes the evolution of the game. Each cell has 2 states: live or dead. There are 4 simple rules that determine this: Any live cell with fewer than two live neighbours dies, as if caused by under-population. Any live cell with two or three live neighbours lives on to the next generation. Any live cell with more than three live neighbours dies, as if by overcrowding. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction. From a programmer’s …
Triangle Counting in Graphs on the GPU (Part 3)
In this blog I will finalize the work that we completed on triangle counting in graphs on the GPU using CUDA. The two previous blogs can be found: Part 1 and Part 2. The first part introduces the significance of the problem and the second part explains the algorithms that we used in our solution. This work was presented in finer detail in the Workshop on Irregular Applications: Architectures and Algorithms which took place as part of Supercomputing 2014. The full report can be found here. In previous blogs, we discussed that the performance of the triangle counting is dependent on the algorithm and the CUDA kernel. Our implementation gives the data scientist control over several important parameters: Number of …
ArrayFire Joint Webinar with NVIDIA
Ever wish you could spread out the insights from GPU Technology Conference over the course of a year? NVIDIA’s got you covered with their GTC Express webinar series, a weekly webinar event connecting you with the top developers in the parallel processing industry. Next week, the GTC Express pulls into the ArrayFire station. Shehzan Mohammed, ArrayFire Senior Software Developer, will demonstrate how the recently open sourced ArrayFire GPU library enables high performance on NVIDIA GPUs for scientists, engineers, and researchers, in a wide range of applications including defense and intelligence, life sciences, oil and gas, finance, manufacturing, media, and others. The webinar is scheduled for Wednesday, December 10, from 12:00 pm to 1:00 pm EST. Register for your spot, and feel free to …
ArrayFire is Now Open Source
Yes, you read that right! ArrayFire is open source—it’s all there and it’s all free. This is big, and you and the rest of the parallel computing community are going to love it! You can download our pre-compiled binary installers which are optimized for a wide variety of systems or you can get a copy of the ArrayFire source code from our GitHub page. ArrayFire is being released under the BSD 3-Clause License, which will enable unencumbered deployment and portability of ArrayFire for commercial use. So go check it out! We welcome your feedback and look forward to your future contributions to ArrayFire. The move to open source isn’t our only news—we’ve also made ArrayFire better than ever. Check out our recent …
New Features in ArrayFire
We have previously talked about upcoming computer vision algorithms in the next version of ArrayFire. Today we are going to discuss some of the bigger changes and additions to ArrayFire. New CPU backend In addition to CUDA and OpenCL backends, you can now run ArrayFire natively on any CPU. This is another step we’ve taken in our efforts to make ArrayFire truly portable. The biggest benefits the new CPU backend include: Hardware and Software neutrality: You can now build and ship applications without worrying about the hardware and drivers preset on end users’ machines. You can also port your applications easily to embedded and mobile platforms where CUDA and OpenCL may not be available. Heterogeneous Computing: It is now easier …
ArrayFire at SC14
HPC matters. That’s the tagline for SC 14, and here at ArrayFire we’re in complete agreement with them. We’ve exhibited at SC for the past few years, and we’re excited to once again be a part of this excellent conference! It’s a great place for soaking up HPC knowledge, getting inspired, and connecting with the brightest minds in the industry. Here’s a quick run-down of where we’ll be. Visit our booth. We’re booth #2725. We’ll have beautiful demos running and our engineers will be available for questions. Ask your questions, meet the team, or just bounce some ideas. Maybe—just maybe—you’ll get a sneak peek at our most ambitious project yet… Try our in-booth tutorials. Want to learn how to use ArrayFire to accelerate …
Triangle Counting in Graphs on the GPU (Part 2)
A while back I wrote a blog on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways for finding triangles in a graph. Our approach is based off the intersection of two adjacency lists and finding the common elements in both those lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally …
CUDA Optimization tips for Matrix Transpose in real world applications
Computer algorithms are extra friendly towards data sizes that are powers of two. GPU compute algorithms work particularly well with data sizes that are multiples of 32. In most real-world situations, however, data is rarely so conveniently sized. In today’s post, we’ll be looking at one such scenario related to GPU compute. Specifically, we’ll provide you with some tips on how to optimize matrix transpose algorithm for a GPU. Let’s start with the transpose kernel available from NVIDIA’s Parallel Forall blog. It’s been optimized to avoid bank conflicts as well, but only works on matrices with dimensions that are multiples of 32. template __global__ void transpose32(T * out, const T * in, unsigned dim0, unsigned dim1) { __shared__ T shrdMem[TILE_DIM][TILE_DIM+1]; …
Zero Copy on Tegra K1
Introduction Zero Copy access has been a part of the CUDA Toolkit for a long time (~2009). However, there were very few applications using the capability because, with time, GPU memory has become reasonably large. The only applications using zero copy access were mainly ones with extremely large memory requirements, such as database processing. Zero Copy is a way to map host memory and access it directly over PCIe without doing an explicit memory transfer. It allows CUDA kernels to directly access host memory. Instead of reading data from global memory (limit ~200GB/s), data would be read over PCIe and be limited by PCIe bandwidth (upto 16GB/s). Hence there was no real performance advantage for most applications. However, with the …
Demystifying PTX Code
In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address the issue of how a PTX file look, and more importantly, how to understand all those complicated instructions in a PTX files. In this post I will use the same vector addition kernel from the the previous post previous post (the complete code can be found here). For this post, I will focus on OpenCL PTX file. In a future post I will discuss the differences between PTX files of OpenCL and CUDA code. Let’s start by looking at the complete PTX code: // // Generated by NVIDIA NVVM Compiler // Compiler built on Sun May 18 …
