After Albert Einstein’s proposal of the General Theory of Relativity, our outlook on how gravity works changed significantly, opening the possibility of mystifying objects such as black holes. Black holes are massive compact objects resulting from unhalted gravitational collapse. Their gravity affects the surrounding spacetime to such an extreme degree that any object near them, including light, has an unavoidable future path of “falling” towards the black hole. While the theory of general relativity was shown to be highly accurate by experiments testing the bending of light seen during an eclipse or the gravitational time dilation experienced by objects closer to massive bodies, there had not been a direct observation of a black hole until …
ArrayFire v3.9.0 Release
We are pleased to announce a new release of the ArrayFire library, v3.9.0. This release makes it easier than ever to target new devices without sacrificing performance. This post describes four of its new features, including: oneAPI Backend This is the first release since v3.0 to introduce a new backend. The new backend is built with the oneAPI specification on top of the SYCL language. oneAPI is an open specification providing a full framework for high-performance computing applications without vendor lock-in. While this has been possible with OpenCL, the oneAPI specification includes libraries like BLAS and FFT, significantly reducing the burden on developers to maintain math functions and increasing the performance of these common operations. Here is a …
CUDA Computing on Google Colab with ArrayFire
For the first time in our 14-year existence, we are now able to provide our community with the ability to run ArrayFire programs for free within minutes. Before today, users had to download and install the library on their own systems, which can be a hassle if you just want to play around with some code and benchmarks. Today, we’re excited to announce that ArrayFire is available on Google Colab, the free GPU computing cloud service from Google. Colaboratory, or “Colab” for short, allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing. You can jump right in and start playing with this new tool: Click Here to …
Bringing Together the GPU Computing Ecosystem for Python
To date, we have not done a lot for the Python ecosystem. A few months ago, we decided it was time to change that. As NVIDIA said in this post, the current slate of GPU tools available to Python developers is scattered. With some attention to community building, perhaps we can build something better — together. NVIDIA spoke about its plans to help clean up the ecosystem. We’re on board with that mentality and have two ways we propose to contribute: We’re working on a survey paper that assesses the state of the ecosystem. What technical computing things can you do with each package? What benchmarks result from the packages on real Python user code? What plans does each group have …
Cycling through SYCL
We recently gave an overview of recent history in the technical computing hardware market. In it, we mentioned the energy at Intel right now. The weight of Intel is behind the SYCL standard through its new software approach, oneAPI. SYCL is a cross-platform API that targets heterogeneous hardware, similar to OpenCL and CUDA. The SYCL standard was first introduced by Codeplay and is now managed by the Khronos Group. It allows single-source compilation in C++ to target multiple devices on a system, rather than using C++ for the host and domain-specific kernel languages for the device. Furthermore, SYCL is fully C++17 standards-compliant. You don’t have any extensions to the language that would prevent any standards-compliant …
Benchmarking parallel vector libraries
There are many open source libraries that implement parallel versions of the algorithms in the C++ Standard Template Library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open. In this post we compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover the following commonly used vector algorithms across 3 different architectures: Reductions, Scan, Transform. The following setup has been used for benchmarking. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is listed below: NVIDIA Tesla K20, AMD FirePro S10000, Intel Xeon E5-2560v2. Background ArrayFire ArrayFire provides high …
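For readers unfamiliar with the three algorithm families named in this excerpt, here is a minimal sequential sketch using only the C++ standard library. It is not the benchmark code from the post and does not use the API of ArrayFire or any of the compared libraries; the function names `reduce_sum`, `prefix_sum` and `saxpy` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Reduction: collapse a vector to a single value (here, a sum).
float reduce_sum(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Scan: inclusive prefix sum, out[i] = v[0] + ... + v[i].
std::vector<float> prefix_sum(const std::vector<float>& v) {
    std::vector<float> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Transform: element-wise operation, out[i] = a * x[i] + y[i].
// (Illustrative choice of operation; assumes x and y have equal length.)
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return out;
}
```

Parallel libraries expose the same three patterns; what differs is where the data lives and how the work is distributed across threads or devices.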
Feature detection and tracking using ArrayFire
A few weeks ago we added some computer vision functionality to our open source ArrayFire GPU computing library. Specifically, we implemented the FAST feature extractor, BRIEF feature point descriptor, ORB multi-resolution scale invariant feature extractor, and a Hamming distance function. When combined, these functions enable you to find features in videos (or images) and track them between successive frames.
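As a rough illustration of why a Hamming distance function pairs with binary descriptors such as BRIEF and ORB, the sketch below matches a query descriptor against candidates by counting differing bits. This is plain standard C++, not ArrayFire's API; the `Descriptor` layout (bits packed into 32-bit words) and the names `hamming_distance` and `nearest` are assumptions made for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor type: a binary descriptor packed into 32-bit words.
using Descriptor = std::vector<std::uint32_t>;

// Number of differing bits between two equal-length descriptors.
unsigned hamming_distance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    unsigned dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // XOR leaves a 1 wherever the bits differ; count those bits.
        dist += static_cast<unsigned>(std::bitset<32>(a[i] ^ b[i]).count());
    }
    return dist;
}

// Brute-force match: index of the candidate closest to the query.
// Returns candidates.size() when there are no candidates.
std::size_t nearest(const Descriptor& query,
                    const std::vector<Descriptor>& candidates) {
    std::size_t best = candidates.size();
    unsigned best_dist = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const unsigned d = hamming_distance(query, candidates[i]);
        if (best == candidates.size() || d < best_dist) {
            best = i;
            best_dist = d;
        }
    }
    return best;
}
```

Tracking between frames then amounts to running such a match for each feature in one frame against the features of the next; a GPU implementation would evaluate many of these comparisons in parallel.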
Using zero-copy buffers on integrated GPUs
One of the most powerful aspects of parallel programming on integrated GPUs is taking advantage of shared memory and caches. The best example of this is sharing common data between the CPU and GPU via zero-copy buffers. This technique permits your program to avoid the O(N) cost of copying data to/from the GPU. This feature is particularly useful for applications that deal with real-time data streams, like video processing.
Building a C++ Interpreter (Cling) for x86 and ARM
A C++ interpreter compiled for ARM running on an x86 EC2 instance after following the given instructions. As C++ becomes more mature and high-level, an interpreted workflow might lead to mainstream C++ productivity in addition to development! Development through a C++ interpreter (Cling) as opposed to a standard compiler is an amazing leap in productivity and a window into the newest features of C++. This post tells you how to get your own bleeding-edge C++ interpreter built right on top of the development version of LLVM. We give you a repeatable procedure via Amazon EC2. With our prescribed steps in place, you can always have an up-to-date development version of Cling. This allows quick testing and investigation of LLVM’s newest …
Demystifying PTX Code
In my recent post, I showed how to generate PTX files from both CUDA and OpenCL kernels. In this post I will address what a PTX file looks like and, more importantly, how to understand all those complicated instructions in a PTX file. I will use the same vector addition kernel from the previous post (the complete code can be found here). For this post, I will focus on the OpenCL PTX file. In a future post I will discuss the differences between PTX files of OpenCL and CUDA code. Let’s start by looking at the complete PTX code: // // Generated by NVIDIA NVVM Compiler // Compiler built on Sun May 18 …