Feature detection on Xilinx FPGAs using OpenCL

Brian Kloppenborg | ArrayFire

Today at SuperComputing 2015, ArrayFire demonstrated its first FPGA-accelerated application running on a Xilinx FPGA using OpenCL. In May 2015, ArrayFire became a Xilinx Alliance member, which gave us early access to the Xilinx SDAccel Development Environment with initial support for OpenCL. We helped Xilinx test and improve its OpenCL implementation and make it more accessible to first-time users. Over the last few months, we implemented several programs for the FPGA and learned a lot about optimizing OpenCL kernels to take advantage of pipelining, wide memory buses, and the various forms of configurable local memory offered on Xilinx hardware. Our first fully ported application is from computer vision: the FAST feature extractor. FAST finds corners in images. In the image below, the left side shows a frame from …
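For readers unfamiliar with FAST, here is a minimal Python sketch of the segment test at its core: a pixel counts as a corner when enough contiguous pixels on a 16-pixel ring around it are all brighter or all darker than the center by some threshold. This is an illustration only, not ArrayFire's OpenCL kernel or the FPGA port; the function names, the threshold of 20, and the arc length of 9 (the common FAST-9 variant) are assumptions chosen for the example.

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in clockwise order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as a ring."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:  # walking the ring twice catches wrap-around runs
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def is_fast_corner(image, x, y, threshold=20, arc_length=9):
    """Segment test for the pixel at (x, y).

    `image` is a list of rows of grayscale values. The caller must keep
    (x, y) at least 3 pixels away from every border. The default threshold
    and arc_length are illustrative assumptions (FAST-9), not values taken
    from the post.
    """
    center = image[y][x]
    ring = [image[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [value > center + threshold for value in ring]
    darker = [value < center - threshold for value in ring]
    return (longest_circular_run(brighter) >= arc_length
            or longest_circular_run(darker) >= arc_length)
```

Every pixel runs the same small, branch-light test independently of its neighbors, which is one reason the algorithm maps well onto pipelined hardware.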

Intel OpenCL performance: 3rd generation hardware

Brian Kloppenborg | ArrayFire, OpenCL

Introduction: With Intel CPUs making up nearly 80% of the CPU market and 66% of computers using integrated graphics, one can easily argue that integrated graphics devices represent one of the largest markets for GPU-accelerated computing. Here at ArrayFire, we have long recognized the potential of these devices and offer built-in support for Intel CPUs, GPUs, and AMD APUs in the OpenCL backend of our ArrayFire GPU computing library. Yet one common theme for debate in the office has been how the hardware performs on different operating systems, with different drivers, and across hardware revisions. To answer these questions (and, perhaps, to win some intra-office geek cred), I decided to write a series of blog posts about Intel’s GPU OpenCL performance. In this first installment I will compare the performance …

OpenCL on Intel HD / Iris graphics on Linux

Brian Kloppenborg | OpenCL

On Windows and Mac, the Intel GPU drivers include OpenCL support; on Linux, however, OpenCL on Intel GPUs is implemented through an open source project called Beignet (pronounced like “ben-yay”, a type of French pastry akin to what we would call a “fritter” in English). Below I have written a step-by-step guide on how you can get Beignet running on an Ubuntu 14.10 system which has a 3rd, 4th, or 5th generation Intel processor. Instructions for other variants of Linux will be similar, except for the commands to install the prerequisite packages. There are several little caveats which need to be discussed up front. Foremost, the Beignet project supports the following hardware: There are also a few noteworthy …

Templating and Caching OpenCL Kernels

Pradeep Garigipati | ArrayFire

About a month ago, one of my colleagues wrote a post on how to author the most concise OpenCL program using the C++ API provided by Khronos. In today’s post, we shall further modify that example to achieve the following two goals: (1) enable the kernel to work with different integral data types out of the box, and (2) ensure that the kernels compile only once at run time per data type. Let’s dive into the details now. We can template the OpenCL kernels by passing a build option -D T="typename" to the kernel compilation step. To pass such options, we would need a construct that can give us a string literal that represents the corresponding integral type. Let us declare a struct …
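The two goals in this excerpt can be sketched in a few lines of Python. This is not the post's C++ code: the type table, `KernelCache`, and `fake_compile` are hypothetical names invented for illustration, and `fake_compile` merely stands in for a real OpenCL program build so that the example runs without an OpenCL device. Only the `-D T=<typename>` build option comes from the post.

```python
# Hypothetical table mapping host-side type labels to OpenCL C type names.
OPENCL_TYPE_NAMES = {
    "int32": "int",
    "uint32": "uint",
    "int64": "long",
    "uint64": "ulong",
}

# The kernel is written once against the macro T; the build option fills it in.
KERNEL_SOURCE = """
__kernel void add(__global const T* a, __global const T* b, __global T* out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""


def build_options(type_label):
    """Return the build option that defines T for the given data type."""
    return "-D T=" + OPENCL_TYPE_NAMES[type_label]


class KernelCache:
    """Compiles the kernel at most once per data type and reuses the result."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # stand-in for the real OpenCL build step
        self._cache = {}

    def get(self, type_label):
        if type_label not in self._cache:
            options = build_options(type_label)
            self._cache[type_label] = self._compile(KERNEL_SOURCE, options)
        return self._cache[type_label]


# Stand-in compiler (an assumption for this sketch): it only records the
# options it was called with and returns a placeholder object.
compile_log = []


def fake_compile(source, options):
    compile_log.append(options)
    return ("compiled", options)
```

Calling `KernelCache(fake_compile).get("int32")` twice on the same cache triggers one compile; asking for `"uint32"` afterwards triggers a second one with `-D T=uint`.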

Accelerating Java using ArrayFire, CUDA and OpenCL

Pavan Yalamanchili | ArrayFire, Java

We have previously mentioned the ability to use ArrayFire through Java. In this post, we are going to show how you can get the best performance inside Java using ArrayFire for CUDA and OpenCL. Code: Here is sample code that performs a Monte Carlo estimation of Pi.

    import java.util.Random;

    // Native Java code
    public static double hostCalcPi(int size) {
        Random rand = new Random();
        int count = 0;
        for (int i = 0; i < size; i++) {
            float x = rand.nextFloat();
            float y = rand.nextFloat();
            // Count samples that fall inside the unit quarter circle
            boolean lt1 = (x * x + y * y) < 1;
            if (lt1) count++;
        }
        return 4.0 * ((double)(count)) / size;
    }

The same code can be written using ArrayFire in the following ...

Generating PTX files from OpenCL code

Peter Entschev | CUDA, OpenCL

Here at ArrayFire, we develop code that works efficiently on both CUDA and OpenCL platforms, and it is not uncommon for CUDA code on NVIDIA GPUs to run faster than the equivalent OpenCL code. A very good way to understand what happens behind the scenes is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels. PTX stands for Parallel Thread eXecution, which is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than single-threaded assembly code, except that now we are thinking in terms of massively parallel execution. Retrieving the PTX …

OpenCL SPIR 2.0

Pavan Yalamanchili | OpenCL

At SIGGRAPH 2014, the Khronos Group announced, among other things, the SPIR 2.0 provisional specification. This release of SPIR (Standard Portable Intermediate Representation) follows the release of the OpenCL 2.0 spec last year. In this post, we would like to offer our take on SPIR 2.0 and what it means to OpenCL developers. Feature Parity With OpenCL 2.0: SPIR 1.2 had feature parity with OpenCL 1.2. With 2.0, SPIR has feature parity with OpenCL 2.0. Here are a few new features that we find interesting. Generic Address Space: “Where functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions …

Quest for the Smallest OpenCL Program

Umar Arshad | C/C++, OpenCL

I have heard many complaints about the verbosity of the OpenCL API. These complaints are not unwarranted. The verbosity is due to the low-level nature of OpenCL. It is written in the C programming language, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages. A typical OpenCL program must: query for the platform; get the device IDs from the platform; create a context from a set of device IDs; create a command queue from the context; create buffer objects for your data; transfer the data to the buffer; create and build a program from source; extract the kernels; launch the kernels; transfer the data to the host …

OpenCL on Mobile Devices

Pavan Yalamanchili | Android, OpenCL

While Google has openly displayed its opposition to OpenCL, many hardware manufacturers seem to be putting their weight behind it. Qualcomm, ARM, Imagination and Vivante support OpenCL on their GPUs. Android phone manufacturers – Samsung, HTC, Sony and Amazon – ship OpenCL drivers and libraries on their latest generation of devices. Considering the prevalence of OpenCL on shipped devices, it is likely most RenderScript implementations have an OpenCL backend. To consolidate a list of Android devices that support OpenCL, we created a publicly accessible Google document, seen below. If you have an Android phone that is not listed, we’d appreciate it if you contributed to the list. To test if OpenCL is supported on your phone, you can use OpenCL Info …