Intel OpenCL performance: 3rd generation hardware

Brian Kloppenborg ArrayFire, OpenCL 1 Comment

Introduction With Intel CPUs making up nearly 80% of the CPU market and 66% of computers using integrated graphics one can easily argue that integrated graphics devices represent one of the greatest markets for GPU-accelerated computing. Here at ArrayFire, we have long recognized the potential of these devices and offer built-in support for Intel CPUs, GPUs, and AMD APUs in the OpenCL backend of our ArrayFire GPU computing library. Yet one common theme for debate in the office has been how the hardware performs on different operating systems with different drivers across hardware revisions. To answer these questions (and, perhaps, to win some intra-office geek cred) I decided to write a series of blog posts about Intel’s GPU OpenCL performance. In this first installment I will compare the performance …

GTC 2015 ArrayFire Recordings

Aaron Taylor ArrayFire, Computer Vision, CUDA

Missed visiting ArrayFire at GTC this year? We’ve got you covered! You can now check out the recordings of all our GTC 2015 talks and tutorials at your own convenience. Learn about accelerating your code from the best in the business. Talks Real-Time and High Resolution Feature Tracking and Object Recognition Peter Andreas Entschev This session will cover real-time feature tracking and object recognition in high resolution videos using GPUs and productive software libraries including ArrayFire. Feature tracking and object recognition are computer vision problems that have challenged researchers for decades. Over the last 15 years, numerous approaches were proposed to solve these problems, some of the most important being SIFT, SURF and ORB. Traditionally, these approaches are so computationally …

Using zero-copy buffers on integrated GPUs

Brian Kloppenborg C/C++, OpenCL 1 Comment

One of the most powerful aspects of parallel program on integrated GPUs is taking advantage of shared memory and caches. The best example of this is sharing common data between the CPU and GPU via. zero-copy buffers. This technique permits your program to avoid the O(N) cost of copying data to/from the GPU. This feature is particularly useful for applications that deal with real-time data streams, like video processing.

Machine Learning with ArrayFire: Linear Classifiers

Pavan Yalamanchili ArrayFire, CUDA, OpenCL Leave a Comment

Linear classifiers perform classification based on the linear combinition of the component features. Some examples of Linear Classifiers include: Naive Bayes Classifier, Linear Discriminant Analysis, Logistic Regression and Perceptrons. ArrayFire’s easy to use API enables users to write such classifiers from scratch fairly easily. In this post, we show how you can map mathematical equations to ArrayFire code and implement them from scratch. Naive Bayes Classifier Perceptron Naive Bayes Classifier Naive bayes classifier is a probabilistic classifier that assumes all the features in a feature vector are independent of each other. This assumption simplifies the bayes rule to a simple multiplication of probabilities as show below. First we start with the simple Baye’s rule. $$ p(C_k | x) = \frac{p(C_k)}{p(x)} …

OpenCL on Intel HD / Iris graphics on Linux

Brian Kloppenborg OpenCL 19 Comments

Under Windows and Mac the Intel GPU drivers include OpenCL support; however, on Linux OpenCL on Intel GPUs is implemented through an open source project called Beignet (pronnounced like “ben-yay”, a type of French pastry akin to a what we would call a “fritter” in English). Below I have written a step-by-step guide on how you can get Beignet running on an Ubuntu 14.10 system which has an Intel 3rd, 4th, or 5th generation Intel processor. Instructions for other variants of Linux will be similar, except for the commands to install the prerequisite packages. There are several little caveats which need to be discussed up front. Foremost, the Beignet project supports the following hardware: There are also a few noteworthy …

Getting Started with the Intel Xeon Phi on Ubuntu 14.04/Linux Kernel 3.13.0

Peter Entschev Hardware & Infrastructure, OpenCL 25 Comments

You may already know that the Intel MPSS (Manycore Platform Software Stack) officially only supports the RedHat and SUSE Linux distros. Using an enterprise distro might be very interesting if your company is running a large server environment or short on specialized people and you rely on the distro official support for more complicated tasks. Not all companies use enterprise Linux distributions. Ubuntu has a large share of the Linux distro market (if not the largest). A while back, I needed to setup a machine running Ubuntu 14.04 and MPSS 3.4.x and could not find any documentation running the newest versions of Ubuntu/Linux Kernel/MPSS. In this blog, I will try to document how to get the Intel Xeon Phi running …

Conway’s Game of Life using ArrayFire

Shehzan Mohammed ArrayFire, CUDA, Image Processing, Open Source, OpenGL 4 Comments

Conway’s Game of Life is a popular zero player cellular automaton devised by the John Horton Conway in 1970. The game makes for a fun evolution as the player sets the initial condition and then observes the evolution of the game. Each cell has 2 states: live or dead. There are 4 simple rules that determine this: Any live cell with fewer than two live neighbours dies, as if caused by under-population. Any live cell with two or three live neighbours lives on to the next generation. Any live cell with more than three live neighbours dies, as if by overcrowding. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction. From a programmer’s …

Triangle Counting in Graphs on the GPU (Part 2)

Oded Green Benchmarks, CUDA 2 Comments

A while back I wrote a blog on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before I take a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways for finding triangles in a graph. Our approach is based off the intersection of two adjacency lists and finding the common elements in both those lists. Two additional approaches would simply be to compare all the possible node-triplets, either in the graph or via matrix multiplication of the incidence array. The latter of these two approaches can be computationally …

Demystifying PTX Code

Peter Entschev C/C++, CUDA, OpenCL 3 Comments

In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address the issue of how a PTX file look, and more importantly, how to understand all those complicated instructions in a PTX files. In this post I will use the same vector addition kernel from the the previous post previous post (the complete code can be found here). For this post, I will focus on OpenCL PTX file. In a future post I will discuss the differences between PTX files of OpenCL and CUDA code. Let’s start by looking at the complete PTX code: // // Generated by NVIDIA NVVM Compiler // Compiler built on Sun May 18 …

Generating PTX files from OpenCL code

Peter Entschev CUDA, OpenCL 2 Comments

Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Therefore, it is not uncommon that CUDA code on NVIDIA GPUs will run faster than OpenCL. A very good way to understand what is behind the curtains is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels. PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than a single thread assembly code, except that now we are thinking in massive parallel execution. Retrieving the PTX …