Contributors: A big benefit to open source software

Brian Kloppenborg | ArrayFire, Open Source | 4 Comments

Making the decision to open source your software is not an easy process. Indeed, here at ArrayFire our choice to release ArrayFire under the open source, commercially friendly, BSD 3-Clause License came only after many hours of consideration and philosophical discussion (e.g. see our CEO’s blog on these topics). Thus far this decision has proven to be strictly beneficial to our company.
The impact of third-party contributions
Although ArrayFire is primarily developed by our engineers, there are several contributions from other developers. Therefore we feel particularly compelled to elucidate how these contributions have improved the ArrayFire ecosystem.
Packaging for Linux and OSX
One of the best parts of open source distribution is that your code can be packaged and distributed for …

Benchmarking parallel vector libraries

Pavan Yalamanchili | ArrayFire, Benchmarks, C/C++, CUDA | Leave a Comment

There are many open source libraries that implement parallel versions of the algorithms in the C++ Standard Template Library. Inevitably, we get asked how ArrayFire compares to the other libraries out in the open. In this post we compare the performance of ArrayFire to that of BoostCompute, HSA-Bolt, Intel TBB and Thrust. The benchmarks cover the following commonly used vector algorithms across 3 different architectures: reductions, scan, and transform. The following setup was used for benchmarking. The code to reproduce the benchmarks is linked at the bottom of the post. The hardware used for the benchmarks is: NVIDIA Tesla K20, AMD FirePro S10000, and Intel Xeon E5-2560v2.
Background
ArrayFire
ArrayFire provides high …
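For readers unfamiliar with the three algorithm families named in the benchmarking excerpt above, the following is a minimal sequential sketch in standard C++. It only illustrates what a reduction, a scan, and a transform compute. It is not the benchmark code from the post, and the function names (`reduce_sum`, `inclusive_scan_sum`, `transform_square`) are hypothetical helpers chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Reduction: collapse a whole range into a single value (here, a sum).
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Inclusive scan (prefix sum): out[i] = v[0] + v[1] + ... + v[i].
std::vector<int> inclusive_scan_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Transform: apply a function to every element independently (here, squaring).
std::vector<int> transform_square(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::transform(v.begin(), v.end(), out.begin(),
                   [](int x) { return x * x; });
    return out;
}
```

For the input `{1, 2, 3, 4}`, the reduction yields 10, the inclusive scan yields `{1, 3, 6, 10}`, and the transform yields `{1, 4, 9, 16}`. Parallel libraries such as the ones compared in the post expose the same operations but distribute the work across many threads or devices.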

ArrayFire v3.0 is here!

Aaron Taylor | Announcements, ArrayFire, CUDA, Open Source, OpenCL | 5 Comments

Today we are pleased to announce the release of ArrayFire v3.0. This new version features major changes to ArrayFire’s visualization library, a new CPU backend, and dense linear algebra for OpenCL devices. It also includes improvements across the board for ArrayFire’s OpenCL backend. A complete list of ArrayFire v3.0 updates and new features can be found in the product Release Notes. With over 8 years of continuous development, the open source ArrayFire library is the top CUDA and OpenCL software library. ArrayFire supports CUDA-capable GPUs, OpenCL devices, and other accelerators. With its easy-to-use API, this hardware-neutral software library is designed for maximum speed without the hassle of writing time-consuming CUDA and OpenCL device code. With ArrayFire’s library functions, developers can maximize …

Using zero-copy buffers on integrated GPUs

Brian Kloppenborg | C/C++, OpenCL | 1 Comment

One of the most powerful aspects of parallel programming on integrated GPUs is taking advantage of shared memory and caches. The best example of this is sharing common data between the CPU and GPU via zero-copy buffers. This technique permits your program to avoid the O(N) cost of copying data to/from the GPU. This feature is particularly useful for applications that deal with real-time data streams, like video processing.

The ArrayFire Blog’s Best of 2014

Aaron Taylor | ArrayFire | 1 Comment

The year 2014 was a big one for us! Before we get too far into 2015, we thought we’d share the most popular posts of 2014. So, without further ado, we give you the TOP TEN ARRAYFIRE BLOG POSTS OF 2014:
1. Getting Started with OpenCL on Android: In which we review how to do image processing on a camera feed on Android devices using OpenCL.
2. Image Processing Benchmarks on NVIDIA Jetson TK1: In which we look at benchmarks of the following ArrayFire image processing functions on an ARM device: erosion/dilation, median filter, resize, histogram, bilateral filter, and convolution.
3. OpenCL on Mobile Devices: In which we share a consolidated list of OpenCL supported Android devices.
4. Quest for the Smallest OpenCL Program: In …

Getting Started with the Intel Xeon Phi on Ubuntu 14.04/Linux Kernel 3.13.0

Peter Entschev | Hardware & Infrastructure, OpenCL | 25 Comments

You may already know that the Intel MPSS (Manycore Platform Software Stack) officially supports only the RedHat and SUSE Linux distros. Using an enterprise distro might be very interesting if your company is running a large server environment, or is short on specialized people and relies on the distro’s official support for more complicated tasks. Not all companies use enterprise Linux distributions. Ubuntu has a large share of the Linux distro market (if not the largest). A while back, I needed to set up a machine running Ubuntu 14.04 and MPSS 3.4.x and could not find any documentation on running the newest versions of Ubuntu/Linux Kernel/MPSS. In this blog, I will try to document how to get the Intel Xeon Phi running …

Consulting

CUDA Consulting
Hire us to accelerate your NVIDIA® CUDA™ code, train your team, or even develop new code for your organization! We’ve already helped thousands of organizations speed up their code. We’ve accelerated algorithms, sped up data analysis, and solved challenging problems of all types in the past, and we’re confident that we can accelerate your code, too. Start by setting up a free consultation call to discuss your CUDA project with an ArrayFire expert today!
Schedule a CUDA Consultation
“Solve your problems with help from our experts, leveraging the mindshare of our full team to quickly turn around high-quality code, documentation, and benchmarks.”
ArrayFire has 14 years of proven experience
Embedded Consulting
NVIDIA® Jetson™ is the world’s leading platform for AI at the edge. Jetson modules …

Training

CUDA and OpenCL Training
We provide high-quality 2- or 4-day CUDA™ and OpenCL training courses. Since we specialize solely in CUDA and OpenCL work, we can uniquely immerse students in GPU and heterogeneous computing. Students of our courses walk away proficient at programming CUDA or OpenCL, receive the latest industry knowledge and techniques for GPU computing, and learn the tricks to maximize performance from heterogeneous computing devices. For groups, we either travel to your location, host in our Atlanta office, or train remotely via video conference, tailoring our instruction to meet your application-specific needs. For individuals, we offer 2-day training quarterly. We recommend that attendees have a working knowledge of C/C++ for a fruitful learning experience.
Talk to Us to Register a …

New Features in ArrayFire

Pavan Yalamanchili | ArrayFire | 2 Comments

We have previously talked about upcoming computer vision algorithms in the next version of ArrayFire. Today we are going to discuss some of the bigger changes and additions to ArrayFire.
New CPU backend
In addition to CUDA and OpenCL backends, you can now run ArrayFire natively on any CPU. This is another step we’ve taken in our efforts to make ArrayFire truly portable. The biggest benefits of the new CPU backend include:
Hardware and Software neutrality: You can now build and ship applications without worrying about the hardware and drivers present on end users’ machines. You can also port your applications easily to embedded and mobile platforms where CUDA and OpenCL may not be available.
Heterogeneous Computing: It is now easier …

CUDA Optimization tips for Matrix Transpose in real world applications

Pradeep Garigipati | ArrayFire | 1 Comment

Computer algorithms are extra friendly towards data sizes that are powers of two. GPU compute algorithms work particularly well with data sizes that are multiples of 32. In most real-world situations, however, data is rarely so conveniently sized. In today’s post, we’ll be looking at one such scenario related to GPU compute. Specifically, we’ll provide you with some tips on how to optimize a matrix transpose algorithm for a GPU. Let’s start with the transpose kernel available from NVIDIA’s Parallel Forall blog. It has been optimized to avoid bank conflicts, but it only works on matrices with dimensions that are multiples of 32.
template<typename T>
__global__ void transpose32(T * out, const T * in, unsigned dim0, unsigned dim1)
{
    __shared__ T shrdMem[TILE_DIM][TILE_DIM+1];
    …
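To make the "multiples of 32" restriction in the excerpt above concrete, here is a minimal host-side sketch in plain C++ of a tiled transpose that also handles dimensions that are not multiples of the tile width. It is not the CUDA kernel from the post and says nothing about how the post solves the problem. The function name `transpose_tiled` is hypothetical, and the storage layout (`dim0` as the fastest-varying dimension) is an assumption made for this illustration. The only idea it demonstrates is clamping each tile to the matrix bounds so that partial tiles at the edges are processed safely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tile width; 32 mirrors the tile size mentioned in the excerpt.
constexpr std::size_t TILE_DIM = 32;

// Transposes a matrix with dim0 x dim1 elements.
// Assumed layout (illustration only): element (i, j) of the input lives at
// in[i + j * dim0], and element (j, i) of the output at out[j + i * dim1].
template <typename T>
std::vector<T> transpose_tiled(const std::vector<T>& in,
                               std::size_t dim0, std::size_t dim1) {
    assert(in.size() == dim0 * dim1);
    std::vector<T> out(in.size());

    // Walk the matrix one TILE_DIM x TILE_DIM tile at a time.
    for (std::size_t j0 = 0; j0 < dim1; j0 += TILE_DIM) {
        for (std::size_t i0 = 0; i0 < dim0; i0 += TILE_DIM) {
            // Clamp the tile so that partial tiles on the right and bottom
            // edges never read or write outside the matrix.
            const std::size_t i1 = std::min(i0 + TILE_DIM, dim0);
            const std::size_t j1 = std::min(j0 + TILE_DIM, dim1);

            for (std::size_t j = j0; j < j1; ++j) {
                for (std::size_t i = i0; i < i1; ++i) {
                    out[j + i * dim1] = in[i + j * dim0];
                }
            }
        }
    }
    return out;
}
```

Calling `transpose_tiled(in, 35, 33)` processes four tiles, three of which are partial. In a GPU kernel the analogous safeguard is usually a bounds check on each thread's global indices before it loads or stores, which costs some performance compared with the unguarded multiple-of-32 case.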