About a month ago, one of my colleagues did a post on how to author the most concise OpenCL program using the C++ API provided by Khronos. In today's post, we shall further modify that example to achieve the following two goals.
* Enable the kernel to work with different integral data types out of the box
* Ensure that the kernels compile only once at run time per data type
Let's dive into the details now. We can template the OpenCL kernels by passing a build option -D T="typename" to the kernel compilation step. To pass such options, we would need a construct that can give us a string literal that represents the corresponding integral type. Let us declare a struct …
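As a rough sketch of the kind of construct the excerpt describes (the names below are hypothetical, not the post's actual code), a trait struct can map each supported integral type to the string that is spliced into the -D T=... build option:

    #include <string>

    // Hypothetical sketch: map an integral type to its OpenCL type name,
    // then form the build option handed to the kernel compilation step.
    template<typename T> struct TypeName;   // left undefined for unsupported types
    template<> struct TypeName<int>          { static const char* str() { return "int";  } };
    template<> struct TypeName<unsigned int> { static const char* str() { return "uint"; } };
    template<> struct TypeName<long long>    { static const char* str() { return "long"; } };

    template<typename T>
    std::string buildOptions() { return std::string("-D T=") + TypeName<T>::str(); }

With something along these lines, the same kernel source can be rebuilt per type by passing buildOptions<T>() to the program build call, and a simple cache keyed on the type string keeps each variant to a single compilation.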
Accelerating Java using ArrayFire, CUDA and OpenCL
We have previously mentioned the ability to use ArrayFire through Java. In this post, we are going to show how you can get the best performance inside Java using ArrayFire for CUDA and OpenCL.

Code

Here is a sample code to perform Monte Carlo estimation of Pi.

    import java.util.Random;
    // Native Java Code
    public static double hostCalcPi(int size) {
        Random rand = new Random();
        int count = 0;
        for (int i = 0; i < size; i++) {
            float x = rand.nextFloat();
            float y = rand.nextFloat();
            boolean lt1 = (x * x + y * y) < 1;
            if (lt1) count++;
        }
        return 4.0 * ((double)(count)) / size;
    }

The same code can be written using ArrayFire in the following ...
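The excerpt cuts off before the ArrayFire version; purely as a hedged illustration of the idea (shown here with the ArrayFire C++ API rather than the Java wrapper the post is about), the same estimate can be expressed without any explicit loop roughly like this:

    #include <arrayfire.h>

    // Sketch only: vectorized Monte Carlo estimate of Pi with ArrayFire.
    double deviceCalcPi(int size) {
        af::array x = af::randu(size, f32);          // uniform points in [0, 1)
        af::array y = af::randu(size, f32);
        af::array inside = (x * x + y * y) < 1.0f;   // boolean mask of hits
        return 4.0 * af::sum<float>(inside) / size;  // fraction of hits times 4
    }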
Generating PTX files from OpenCL code
Here at ArrayFire, we develop code that will work efficiently on both CUDA and OpenCL platforms. Even so, it is not uncommon for CUDA code on NVIDIA GPUs to run faster than its OpenCL counterpart. A very good way to understand what is going on behind the curtain is to generate the PTX file for both cases and compare them. In this post, we show how to generate PTX for both CUDA and OpenCL kernels. PTX stands for Parallel Thread eXecution; it is a low-level virtual machine and instruction set architecture (ISA). For those familiar with assembly language, the PTX instruction set is not really more complicated than single-threaded assembly code, except that now we are thinking in terms of massively parallel execution. Retrieving the PTX …
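As a hedged sketch of one way this can be done (single-device case, hypothetical helper name), the "binary" returned by clGetProgramInfo for a program built on NVIDIA's OpenCL driver is PTX text; on the CUDA side, nvcc --ptx kernel.cu produces the equivalent file directly:

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    // Sketch: dump the PTX of an OpenCL program built for a single NVIDIA device.
    void dumpPtx(cl_program program) {
        size_t size = 0;
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

        std::vector<unsigned char> ptx(size);
        unsigned char* ptrs[] = { ptx.data() };
        clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptrs), ptrs, NULL);

        std::fwrite(ptx.data(), 1, size, stdout);   // PTX is plain text on NVIDIA
    }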
The Benefits of Array Element-Wise Operations
In this blog post, we will review the benefits of using element-wise operations for your computations. Element-wise operations are operations that are applied to every element in an array, allowing the user to avoid coding loops and nested loops for rudimentary operations. In a simple example of an element-wise operation, we use both the addition (+) and multiply (*) operations:

    array A = randu(1024, 1024), B = randu(1024, 1024);
    array C = A*A + B*B;

An element-wise operator that is applied to a single element is a unary operator. An operator that works on two elements is a binary operator. ArrayFire has implemented a large number of element-wise operations that are applied to the elements of an array. These operators can help reduce the programming overhead for an application designer, as: Performance – …
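For contrast, here is a hypothetical sketch (not from the post) of what the same C = A*A + B*B computation looks like with explicit nested loops over raw buffers:

    #include <vector>

    // Loop version over plain buffers, for comparison with the one-line
    // element-wise form above.
    void squareSum(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, int rows, int cols) {
        for (int j = 0; j < cols; ++j) {
            for (int i = 0; i < rows; ++i) {
                int idx = j * rows + i;
                C[idx] = A[idx] * A[idx] + B[idx] * B[idx];
            }
        }
    }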
OpenCL SPIR 2.0
At SIGGRAPH 2014, The Khronos Group announced, among other things, the SPIR 2.0 provisional specification. This release of SPIR (Standard Portable Intermediate Representation) follows the release of the OpenCL 2.0 spec last year. In this post, we would like to offer our take on SPIR 2.0 and what it means to OpenCL developers.

Feature Parity With OpenCL 2.0

SPIR 1.2 had feature parity with OpenCL 1.2. With 2.0, SPIR has feature parity with OpenCL 2.0. Here are a few new features that we find interesting.

Generic Address Space

"Where functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions …
Cross Compile to Windows From Linux
Why did I not know about this? It's like I just discovered the screwdriver! On Debian and variants (from tinc's windows cross-compilation page):

    sudo apt-get install mingw-w64

    # C
    i686-w64-mingw32-gcc hello.c -o hello32.exe    # 32-bit
    x86_64-w64-mingw32-gcc hello.c -o hello64.exe  # 64-bit

    # C++
    i686-w64-mingw32-g++ hello.cc -o hello32.exe   # 32-bit
    x86_64-w64-mingw32-g++ hello.cc -o hello64.exe # 64-bit

Granted, this isn't a silver bullet, but rather a quick way to get a Windows build of platform-independent code that you might already have running in Linux. I've found that this approach makes it easy to get binaries out the door in a hurry when it's hard to get a project building with Visual Studio or even on the Windows platform itself (due …
Image editing using ArrayFire: Part 3
Today, we will be doing the third post in our series Image editing using ArrayFire. References to the old posts are available below.
* Part 1
* Part 2
In this post, we will be looking at the following operations.
* Image Histogram
* Simple Binary Threshold
* Otsu Threshold
* Iterative Threshold
* Adaptive Binary Threshold
* Emboss Filter
Today's post will be mostly dominated by the different types of threshold operations we can achieve using ArrayFire.

Image Histogram

We have a built-in function in ArrayFire that creates a histogram. The input image was converted to gray scale before histogram calculation, as our histogram implementation works for vectors and 2D matrices only. In case you need a histogram for all three channels of a color image, you can …
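A rough sketch of the histogram step described here, using the ArrayFire C++ API (the file name and bin parameters are assumptions, not the post's code):

    #include <arrayfire.h>

    int main() {
        // Sketch: grayscale conversion followed by a 256-bin histogram.
        af::array img  = af::loadImage("input.jpg", true);   // load as a color image
        af::array gray = af::rgb2gray(img);                   // histogram expects vector/2D input
        af::array hist = af::histogram(gray, 256, 0, 255);    // 256 bins over [0, 255]
        return 0;
    }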
Domain Expertise Vs. Compilers
Every so often people come up to us and ask, “Aren’t compilers and compiler directives good enough for HPC applications?” or “Won’t a compiler accomplish that for us?” While compilers have made massive progress in the last two decades, they are still nowhere near the point of putting us and many other HPC programmers out of business. Compilers are still a “one-size-fits-all” solution that needs to be able to deal with any and all input, whereas HPC programmers can be thought of as providing a designer-fitted solution. Application expertise brings a lot to the table that compilers cannot compete with:
* Our past experiences have helped us optimize applications that have irregular memory access patterns. While some applications, such as matrix applications, have regular and simple …
Quest for the Smallest OpenCL Program
I have heard many complaints about the verbosity of the OpenCL API. This claim is not unwarranted. The verbosity is due to the low-level nature of OpenCL. It is written in the C programming language, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages. A typical OpenCL program must:
* Query for the platform
* Get the device IDs from the platform
* Create a context from a set of device IDs
* Create a command queue from the context
* Create buffer objects for your data
* Transfer the data to the buffer
* Create and build a program from source
* Extract the kernels
* Launch the kernels
* Transfer the data to the host …
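For reference, here is a minimal sketch of those steps using the Khronos C++ wrapper (the trivial "square" kernel and the buffer size are assumptions for illustration, not the post's final program):

    #include <CL/cl.hpp>
    #include <vector>

    int main() {
        // Platform, devices, context, queue
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        std::vector<cl::Device> devices;
        platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue(context, devices[0]);

        // Host data and device buffer
        std::vector<float> h(1024, 2.0f);
        cl::Buffer d(context, CL_MEM_READ_WRITE, h.size() * sizeof(float));
        queue.enqueueWriteBuffer(d, CL_TRUE, 0, h.size() * sizeof(float), h.data());

        // Build the program, extract and launch the kernel
        const char* src = "__kernel void square(__global float* a)"
                          "{ size_t i = get_global_id(0); a[i] *= a[i]; }";
        cl::Program program(context, src, true);
        cl::Kernel kernel(program, "square");
        kernel.setArg(0, d);
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(h.size()));

        // Transfer the results back to the host
        queue.enqueueReadBuffer(d, CL_TRUE, 0, h.size() * sizeof(float), h.data());
        return 0;
    }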
Image editing using ArrayFire: Part 2
A couple of weeks back, we did a post on a few image editing functions using the ArrayFire library. Today, we shall be doing the second post in the series Image Editing using ArrayFire. We will be looking at the following operations today.
* Image distortion
* Noise addition
* Noise reduction
* Edge filters
* Boundary extraction
* Difference of Gaussians
Code and sample input/outputs corresponding to each operation are described below.

Image distortion

We will be looking at the spread and pick filters in this section. Both of these filters are fundamentally the same: they replace each pixel in the original image with one of its neighboring pixels. How the neighbor is chosen is essentially the difference between spread and pick. Both of these functions use …
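As a hedged aside (not the post's code), one of the operations listed above, difference of Gaussians, can be sketched with ArrayFire roughly as follows; the kernel sizes and sigmas here are chosen arbitrarily:

    #include <arrayfire.h>

    // Sketch: difference of Gaussians on a grayscale image.
    af::array dog(const af::array& gray) {
        af::array g1 = af::gaussianKernel(5, 5, 1.0, 1.0);   // narrow Gaussian
        af::array g2 = af::gaussianKernel(9, 9, 2.0, 2.0);   // wider Gaussian
        return af::convolve2(gray, g1) - af::convolve2(gray, g2);
    }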