There are many open source libraries that implement parallel versions of the algorithms in the C++ Standard Template Library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open. In this post we compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover the following commonly used vector algorithms across 3 different architectures:
* Reductions
* Scan
* Transform
The following setup was used for benchmarking. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is listed below:
* NVIDIA Tesla K20
* AMD FirePro S10000
* Intel Xeon E5-2560v2
Background
ArrayFire
ArrayFire provides high ...
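For readers unfamiliar with the three operations named above, here is a minimal serial sketch in standard C++. It is not taken from the benchmark code and uses none of the compared libraries; the function names `reduce_sum`, `inclusive_scan_sum`, and `transform_square` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Reduction: collapse a vector to a single value (here, a sum).
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Scan: inclusive prefix sum, out[i] = v[0] + ... + v[i].
std::vector<int> inclusive_scan_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Transform: apply a function to every element independently (here, squaring).
std::vector<int> transform_square(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::transform(v.begin(), v.end(), out.begin(),
                   [](int x) { return x * x; });
    return out;
}
```

Parallel libraries expose the same three patterns; what differs is how the work is split across threads or devices.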
A few weeks ago we added some computer vision functionality to our open source ArrayFire GPU computing library. Specifically, we implemented the FAST feature extractor, BRIEF feature point descriptor, ORB multi-resolution scale invariant feature extractor, and a Hamming distance function. When combined, these functions enable you to find features in videos (or images) and track them between successive frames.
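To make the matching step concrete, the sketch below shows how a Hamming distance can compare two binary descriptors (BRIEF and ORB descriptors are bit strings) and pick the closest candidate. This is plain C++, not the ArrayFire API; the packed 32-bit word layout and the names `Descriptor`, `hamming_distance`, and `nearest_descriptor` are assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A binary descriptor stored as packed 32-bit words (hypothetical layout).
using Descriptor = std::vector<std::uint32_t>;

// Hamming distance: the number of bit positions where two descriptors differ.
std::size_t hamming_distance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    std::size_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // XOR leaves a 1 wherever the bits differ; count() tallies them.
        dist += std::bitset<32>(a[i] ^ b[i]).count();
    }
    return dist;
}

// Index of the candidate with the smallest Hamming distance to the query.
// Ties resolve to the earliest candidate; candidates must be non-empty.
std::size_t nearest_descriptor(const Descriptor& query,
                               const std::vector<Descriptor>& candidates) {
    assert(!candidates.empty());
    std::size_t best = 0;
    std::size_t best_dist = hamming_distance(query, candidates[0]);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const std::size_t d = hamming_distance(query, candidates[i]);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

Tracking a feature between frames then amounts to running this nearest-descriptor search from each feature in one frame against the features of the next.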
One of the most powerful aspects of parallel programming on integrated GPUs is taking advantage of shared memory and caches. The best example of this is sharing common data between the CPU and GPU via zero-copy buffers. This technique permits your program to avoid the O(N) cost of copying data to/from the GPU. This feature is particularly useful for applications that deal with real-time data streams, like video processing.
A C++ interpreter compiled for ARM running on an x86 EC2 instance after following the given instructions. As C++ becomes more mature and high-level, an interpreted workflow might lead to mainstream C++ productivity in addition to development! Development through a C++ interpreter (Cling) as opposed to a standard compiler is an amazing leap in productivity and a window into the newest features of C++. This post tells you how to get your own bleeding-edge C++ interpreter built right on top of the development version of LLVM. We give you a repeatable procedure via Amazon EC2. With our prescribed steps in place, you can always have an up-to-date development version of Cling. This allows quick testing and investigation of LLVM's newest ...
In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address what a PTX file looks like and, more importantly, how to understand all those complicated instructions in a PTX file. I will use the same vector addition kernel from the previous post (the complete code can be found here). For this post, I will focus on the OpenCL PTX file. In a future post I will discuss the differences between the PTX files of OpenCL and CUDA code. Let's start by looking at the complete PTX code:
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sun May 18 04:44:51 2014 (1400399091)
// Driver 331.79
.target sm_21, texmode_independent
.param .u32 .ptr .global .align 4 add_vectors_param_0,
.param .u32 .ptr .global .align 4 add_vectors_param_1,
.param .u32 .ptr .global .align 4 add_vectors_param_2,
.param .u32 add_vectors_param_3
.reg .pred %p<2>;
.reg .s32 %r<21>;
ld.param.u32 %r9, [add_vectors_param_3];
mov.u32 %r5, %envreg3;
mov.u32 %r6, %ntid.x;
mov.u32 %r7, %ctaid.x;
mov.u32 %r8, %tid.x;
add.s32 %r10, %r8, %r5;
mad.lo.s32 %r4, %r7, %r6, %r10;
setp.lt.s32 %p1, %r4, %r9;
@%p1 bra BB0_2;
shl.b32 %r11, %r4, 2;
ld.param.u32 %r18, [add_vectors_param_0];
add.s32 %r12, %r18, %r11;
ld.param.u32 %r19, [add_vectors_param_1];
add.s32 %r13, %r19, %r11;
ld.global.u32 %r14, [%r13];
ld.global.u32 %r15, [%r12];
add.s32 %r16, %r14, %r15;
ld.param.u32 %r20, [add_vectors_param_2];
add.s32 %r17, %r20, %r11;
st.global.u32 [%r17], %r16;
The file starts with a header showing some compiler information in comments, followed ...
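The index arithmetic in the listing can be mirrored in ordinary C++ to make it easier to follow. This is only a sketch of what the individual instructions compute, not the original kernel source; the function names are hypothetical, and treating `%envreg3` as a per-launch work-item offset is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors "add.s32 %r10, %r8, %r5" followed by
// "mad.lo.s32 %r4, %r7, %r6, %r10":
// index = ctaid.x * ntid.x + (tid.x + envreg3).
// envreg3 is assumed to hold a work-item offset (an assumption).
std::int32_t global_index(std::int32_t ctaid_x, std::int32_t ntid_x,
                          std::int32_t tid_x, std::int32_t envreg3) {
    return ctaid_x * ntid_x + (tid_x + envreg3);
}

// Mirrors "shl.b32 %r11, %r4, 2": shifting left by 2 multiplies the
// element index by 4, the size in bytes of one 32-bit element.
std::uint32_t byte_offset(std::int32_t index) {
    assert(index >= 0);
    return static_cast<std::uint32_t>(index) << 2;
}

// Mirrors "add.s32 %r12, %r18, %r11": 32-bit base pointer plus byte offset.
std::uint32_t element_address(std::uint32_t base, std::int32_t index) {
    return base + byte_offset(index);
}
```

For example, thread 5 of block 2 with 128 threads per block and a zero offset gets index 261, so its elements sit 1044 bytes past each buffer's base address.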
Why did I not know about this? It's like I just discovered the screwdriver! On Debian and variants (from tinc's Windows cross-compilation page),
sudo apt-get install mingw-w64
i686-w64-mingw32-gcc hello.c -o hello32.exe # 32-bit
x86_64-w64-mingw32-gcc hello.c -o hello64.exe # 64-bit
i686-w64-mingw32-g++ hello.cc -o hello32.exe # 32-bit
x86_64-w64-mingw32-g++ hello.cc -o hello64.exe # 64-bit
Granted, this isn't a silver bullet, but rather a quick way to get a Windows build of platform-independent code that you might already have running on Linux. I've found that this approach makes it easy to get binaries out the door in a hurry when it's hard to get a project building with Visual Studio, or even on the Windows platform itself (due to, say, a complex build system). Below, I cover some quick solutions to the most common caveats I've run into.
Caveats
MinGW GCC vs GCC: Not as Smart with Templates
Whatever ...
Today, we continue with the third post in our series, Image editing using ArrayFire. Links to the earlier posts are available below.
* Part 1
* Part 2
In this post, we will look at the following operations:
* Image Histogram
* Simple Binary Threshold
* Otsu Threshold
* Iterative Threshold
* Adaptive Binary Threshold
* Emboss Filter
Today's post is mostly dominated by the different types of threshold operations we can achieve using ArrayFire.
Image Histogram
We have a built-in function in ArrayFire that creates a histogram. The input image was converted to grayscale before the histogram calculation because our histogram implementation works for vectors and 2D matrices only. In case you need a histogram for all three channels of a color image, you can ...
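As a rough illustration of the first two operations in the list, the sketch below computes a histogram and a simple binary threshold on an 8-bit grayscale image in plain C++. It does not use ArrayFire or its built-in histogram function; the flat pixel vector and the names `histogram` and `binary_threshold` are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// 256-bin histogram of an 8-bit grayscale image stored as a flat pixel vector.
std::array<std::size_t, 256> histogram(const std::vector<std::uint8_t>& gray) {
    std::array<std::size_t, 256> bins{};  // value-initialized to zero
    for (std::uint8_t p : gray) {
        ++bins[p];
    }
    return bins;
}

// Simple binary threshold: pixels strictly above `thresh` become 255,
// all others become 0.
std::vector<std::uint8_t> binary_threshold(const std::vector<std::uint8_t>& gray,
                                           std::uint8_t thresh) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        out[i] = gray[i] > thresh ? 255 : 0;
    }
    return out;
}
```

The other threshold variants in the list differ mainly in how the threshold value is chosen, for instance from the histogram itself.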
I have heard many complaints about the verbosity of the OpenCL API. This claim is not unwarranted. The verbosity is due to the low-level nature of OpenCL. It is written in the C programming language, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages. A typical OpenCL program must:
* Query for the platform
* Get the device IDs from the platform
* Create a context from a set of device IDs
* Create a command queue from the context
* Create buffer objects for your data
* Transfer the data to the buffer
* Create and build a program from source
* Extract the kernels
* Launch the kernels
* Transfer the data to ...
In response to user requests for additional ArrayFire capabilities, we have decided to extend the library with a CPU fallback for when OpenCL drivers for CPUs are not available. This means that ArrayFire code will be portable both to devices that have OpenCL set up and to devices without it. This is done through the creation of additional backends, which will allow ArrayFire users to write their code once and have it run on multiple systems. We currently support the following systems and architectures:
* NVIDIA GPUs (Tesla, Fermi, and Kepler)
* AMD's GPUs, CPUs and APUs
* Intel's CPUs, GPUs and Xeon Phi Co-Processor
* Mobile and Embedded devices
As part of this update process we are also looking at extending ArrayFire capabilities to low power systems such ...
Programmers and data scientists want to take advantage of fast, parallel computational devices. Writing vectorized code is becoming a necessity to get the best performance out of the current generation of parallel hardware and scientific computing software. However, writing vectorized code may not be immediately intuitive. There are many ways to vectorize a given code segment, and each method has its own benefits and drawbacks. Hence, writing vectorized code involves analyzing the pros and cons of the available methods and choosing the right one to solve your problem. In this post, I present various ways to vectorize your code using ArrayFire. ArrayFire is chosen because of my familiarity with the software. The same methods can be easily used in numpy, octave, ...
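To show what "vectorized" means at the source level, here is a small sketch that writes the same computation twice: once as an element-by-element loop and once as a whole-array expression. It uses `std::valarray` from the C++ standard library rather than ArrayFire, and the function names `axpy_loop` and `axpy_vectorized` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Loop version: computes a * x[i] + y[i] one element at a time.
std::valarray<double> axpy_loop(double a, const std::valarray<double>& x,
                                const std::valarray<double>& y) {
    assert(x.size() == y.size());
    std::valarray<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}

// Vectorized version: the same computation as one whole-array expression.
// The library is free to decide how to evaluate it across all elements.
std::valarray<double> axpy_vectorized(double a, const std::valarray<double>& x,
                                      const std::valarray<double>& y) {
    assert(x.size() == y.size());
    std::valarray<double> out = a * x + y;
    return out;
}
```

The second form states what to compute over the entire array and leaves the iteration order to the library, which is the property array libraries rely on when they dispatch work to parallel hardware.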