If you want faster code, you’ve come to the right place.


Download ArrayFire
Schedule a consultation


Faster code

ArrayFire is a fast, hardware-neutral software library for GPU computing with an easy-to-use API. Its array-based function set makes GPU programming simple. A few lines of code in ArrayFire can replace dozens of lines of raw GPU code, saving you valuable time and lowering development costs.


Download ArrayFire







ArrayFire for CUDA and OpenCL

ArrayFire supports CUDA-capable NVIDIA GPUs and most OpenCL devices, including AMD GPUs/APUs and Intel Xeon Phi co-processors. It also supports mobile OpenCL devices from ARM, Qualcomm, and others.

ArrayFire is a high-performance software library designed for maximum productivity and speed without the hassle of writing difficult low-level device code. Each of ArrayFire’s functions has been hand-tuned by our CUDA and OpenCL experts.


Mobile GPU Computing

Mobile GPUs offer accelerated performance. If you would like to use ArrayFire in a mobile application or if you would like to hire our mobile acceleration services, email us at sales@arrayfire.com and include a description of your application.

Our clients have achieved speed boosts of 5X to 30X for convolutions and other compute functions on mobile devices. Mobile GPU acceleration enables:

  • Real-time video processing
  • Better augmented reality
  • Better computational photography
  • Faster data processing

All mobile devices have a GPU which can deliver greater compute throughput than the CPU. Why not use the GPU to make your app faster?



Documentation

ArrayFire for C, C++, and Fortran

Open the hood—take a look at the beautiful nitty-gritty.

 CUDA & OpenCL C/C++

Functions

Hundreds of functions, hand-tuned and optimized by our engineers.

 ArrayFire Functions

User Forums

Every post gets a response from ArrayFire engineers.

 User Forums




Benchmarks

All benchmarks were performed on an NVIDIA Tesla C2050 with an Intel i7-950 processor. The benchmarks compare ArrayFire with several popular CPU-based acceleration libraries, including Intel Math Kernel Library, Intel Integrated Performance Primitives, Eigen, and Armadillo.



ArrayFire Free


Functions
Basic Operations
 Arithmetic
 Signal Processing
 Image Processing
 Statistics
 Visualization

Download ArrayFire
Hardware
 CUDA
 OpenCL
Languages
 C/C++
 Fortran
Advanced
  GFOR Loop
Platforms
 Windows
 Linux
 Mac
Data Types
 All types
Support
  Forums


ArrayFire Developer

$499 / year

$149 / year (Academic)
  • Dedicated license
  • Starts at 2 GPUs or devices
  • Linear algebra
  • Sparse matrices
  • Phone support
  • Premium support & consulting
Purchase

ArrayFire Enterprise

$2,999 / year

$1,049 / year (Academic)
  • Dedicated license
  • Starts at 2 GPUs or devices
  • Linear algebra
  • Sparse matrices
  • Phone support
  • Premium support & consulting
Purchase

Case Studies

Learn how ArrayFire has worked in real code, including applications in academia, finance, government, life sciences, manufacturing, media, and oil & gas.


Academia

Authors: Yuan Gao, Yin Sun, Chun Hui Zhou, Xin Su, Xi Bin Xu, Shi Dong Zhou, Tsinghua University
Speedup: 3X

Fast simulations are a driving force in many research projects, but long simulation times can slow those projects down. This article uses the 3GPP LTE system simulation work by Yuan Gao et al. (Tsinghua University, Beijing) to demonstrate how ArrayFire software can significantly improve simulator performance and shorten validation times in simulation projects.

Authors: Nabor Reyna and Wotao Yin from Rice University
Speedup: 5X

This work deals with reconstruction of signals using partial Fourier matrices (RecPF). The major computational components of the algorithm involve shrinkage and FFTs. ArrayFire software is employed to accelerate this compute-heavy code.
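
The case study does not say which shrinkage variant RecPF uses. As a minimal sketch, assuming the common soft-thresholding form shrink(x, t) = sign(x) · max(|x| − t, 0), the step could look like this in plain Python. The function name `shrink` and the sample values are illustrative, not taken from the authors' code.

```python
def shrink(values, threshold):
    """Soft-threshold each value: pull it toward zero by `threshold`, clamping at zero."""
    result = []
    for v in values:
        magnitude = max(abs(v) - threshold, 0.0)
        if magnitude == 0.0:
            result.append(0.0)          # small values are zeroed out
        elif v > 0:
            result.append(magnitude)    # positive values shrink downward
        else:
            result.append(-magnitude)   # negative values shrink upward
    return result


shrunk = shrink([3.0, -0.5, 1.0, -2.5], 1.0)
print(shrunk)
```

Each element is processed independently of the others, which is the kind of element-wise work that maps well onto a GPU.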

Authors: Indian Institute of Technology, Roorkee
Speedup: 35X

Power flow studies are one of the most important aspects of power system planning and operation. The power flow reveals the sinusoidal steady-state characteristics of the entire system (voltages, real and reactive power generated and absorbed, and line losses), elucidating the voltage magnitudes and angles at each bus, the generation of each generating unit, and real and reactive power losses in the system. All this is necessary to ensure the security, economy, and control of electrical energy distribution. Learn how ArrayFire software can deliver large performance improvements over CPU-based solutions.

Authors: A. Capozzoli, C. Curcio, A. Liseno at University of Naples Federico II
Speedup: 24X

Antenna array design involves repeated simulation to tune the many parameters involved, and waiting around for simulations to finish is no fun. Offloading the optimization problem onto the GPU cuts that time down significantly. In their recent paper, Capozzoli, Curcio, and Liseno of University of Naples Federico II demonstrated how a simple modification to their echo generator array simulation took advantage of the GPU to bring immediate speedups.

Authors: Patrick Kano and Moysey Brio at Acunum Algorithms and Simulations
Speedup: 3.8X

The numerical inversion of the Laplace transform is a long-standing problem due to its implicit ill-posedness. Patrick Kano and Moysey Brio of Acunum Algorithms and Simulations, with their experience in computational methods and algorithm development, found a solution that not only works, but is very fast.

Authors: Kuldeep Yadav, Ankush Mittal, M. A. Ansar, and Avi Srivastava, College of Engineering, Roorkee, India
Speedup: 8X

Compressed sensing is critical in areas such as medical image reconstruction, image acquisition, and sensor networks. A compressed sensing algorithm built on a Basis Pursuit algorithm shows over 8X speedup when run on an NVIDIA GPU.

Authors: D. H. Johnson, S. Narayan, C. A. Flask, and D. L. Wilson, Case Western Reserve University
Speedup: 11.6X

Case Western Reserve University researchers turned to GPUs running ArrayFire software to develop a fast and robust version of the “Iterative Decomposition of water and fat with an Echo Asymmetry and Least-squares” (IDEAL) reconstruction algorithm. This algorithm relies on many image processing routines for reconstruction, and was shown to achieve very high speedups.

Finance

Authors: Koch Supply & Trading
Speedup: 51.8X

Andrew Shin, Market Risk Manager of Koch Supply & Trading, achieves significant performance increases on option pricing algorithms using ArrayFire software to accelerate his code with GPUs. Andrew says, “My buddy and I are, at best, novice programmers and we couldn’t imagine having to figure out how to code all this in CUDA.” But he found ArrayFire software to be straightforward. With these results, he says he can see ArrayFire software and GPUs populating Koch’s mark-to-futures cube, which contains its assets, simulations, and simulated asset prices.

Authors: Automated Trader
Speedup: 37.5X

The Q1 2012 issue of Automated Trader contains an excellent Mashup piece reviewing software for algorithmic trading. The article provides a wonderful glimpse into the 1-2 month adventure of Andy Webb, Automated Trader’s founder, and the Wrecking Crew as they built a fast trading platform from several technologies. The full trading platform they built was quite extensive. The part that caught our eye was the core computational component of the pipeline. That component involved permuting 1,000 potential pairs with cointegration tests for 350 time windows on each potential pair.

The world of quantitative finance is all about getting accurate results really, really fast. ArrayFire is working with one of the largest banks in Spain to maximize their output using GPUs.

Government

Authors: NASA and UAA in Anchorage
Speedup: 5X

The main thrust of this research is improving Mars rover image compression via GPUs and genetic algorithms. With ArrayFire software and GPUs, the researchers were able to achieve 5X speedups on the larger data sizes. The algorithm works by pairing neighboring pixels with a random one and then adjusting the random pixel based on whether it incrementally improves the original image. Babb described the algorithm as an embarrassingly parallel process, ideally suited to GPU acceleration. He estimates he has been able to achieve a 20 to 30 percent error reduction in subjects like fingerprints and satellite imagery.

Authors: Gary Rubin and Earl Sager – System Planning Corporation
Speedup: ~45X

Radar imaging is computationally intensive. As a result, many imaging algorithms apply FFT-based approximations. While efficient, these algorithms sacrifice data fidelity for speed. Other algorithms better preserve information, but are often too slow for many applications. At System Planning Corporation (SPC), we have implemented a SAR/ISAR imaging routine based on the Backprojection algorithm. Using ArrayFire software, we have demonstrated speedups of roughly 45X for large datasets.

Authors: David Berger and Gary Rubin – System Planning Corporation
Speedup: 5-10X

System Planning Corporation (SPC) uses ArrayFire software to accelerate radar processing algorithms. The system processes raw data from marine navigation radars using a variety of thresholding techniques to extract real targets from clutter. This involves highly data-parallel processing in which each radar pulse is subjected to the same computations; very few operations occur across multiple pulses. Using ArrayFire software, SPC has achieved 10X speed improvements relative to a Core i7-920 CPU and 5X improvements relative to a real-time DSP implementation.

Authors: Nolan Davis and Daniel Redig, SAIC
Speedup: 3.5X

Nolan Davis and Daniel Redig at SAIC recently presented work on Hybrid GPU/Multicore Solutions for Large Linear Algebra Problems where they developed a novel algorithm for LU decomposition, one of the most important routines in linear algebra. They presented a Hybrid CPU/GPU computing approach, where problems too large to fit in GPU memory can also be solved faster than using only the CPU.

Authors: BAE Systems
Speedup: 17X

Geolocation is the identification of the real-world geographic location of a target of interest. In this application, the system receives the signal with an array of several antennas and computes the direction of arrival of the radio energy by measuring the time difference of arrival (or the phase difference) at the different antennas.
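
To make the time-difference idea concrete, here is a minimal sketch for the simplest case: two antennas a known distance apart and a far-field (plane-wave) signal, where sin(θ) = c · Δt / d and θ is measured from broadside. This two-antenna geometry, the function name, and the numbers are assumptions chosen for illustration; the actual system uses an array of several antennas and is not described in that detail here.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def direction_of_arrival(delta_t, spacing):
    """Return the arrival angle in degrees from broadside.

    delta_t: time difference of arrival between the two antennas, in seconds.
    spacing: distance between the two antennas, in metres.
    """
    ratio = SPEED_OF_LIGHT * delta_t / spacing
    if abs(ratio) > 1.0:
        raise ValueError("delay is too large for this antenna spacing")
    return math.degrees(math.asin(ratio))


# Simulate the delay produced by a source 30 degrees off broadside,
# then recover the angle from that delay.
spacing = 10.0
delta_t = spacing * math.sin(math.radians(30.0)) / SPEED_OF_LIGHT
angle = direction_of_arrival(delta_t, spacing)
print(round(angle, 6))
```

With many antenna pairs and many received signals, the same small calculation repeats independently for each pair, which is where GPU parallelism helps.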

Authors: University of Minnesota, Boise State, Saint Scholastica, and NCAR
Speedup: 3-5X

Natural disasters like tsunamis commonly strike with little warning. Most people underrate tsunamis as major hazards. People sometimes wrongly believe that they occur infrequently and only along distant coasts. Tsunamis are usually caused by earthquakes. Seismic signals can give some margin of warning, since tsunami waves travel at 1/30 the speed of seismic waves. Still, there is little time between the creation of the tsunami and its impact, making fast processing critical to producing effective warning systems. ArrayFire software was used to run an RBF simulation on the GPU with a time to solution not achievable with the alternatives.

Life Sciences

Authors: University of Quebec
Speedup: 43X

Computerized approaches to studying the human genome are challenged by the exploding amount of data, which doubles roughly every 6 months. To deal with these burgeoning datasets, demand for faster processing continues to grow. This work focuses on predicting genes using frequency analysis with FFTs and with an equivalent technique known as Goertzel’s algorithm. The paper proposes tools that geneticists and molecular biologists can use to predict or identify new genes using existing complementary strategies. The criteria for these tools are speed, reliability, accuracy, and ease of use, so that they require little training.
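
For readers unfamiliar with Goertzel’s algorithm, the sketch below shows the textbook form: it computes the power of a single DFT bin with a short recurrence instead of a full FFT. The period-3 test signal is an assumption for illustration (a period-3 component is a commonly used indicator of coding regions); the case study does not spell out the exact procedure, and the function names here are hypothetical.

```python
import cmath
import math


def goertzel_power(samples, k):
    """Power |X[k]|^2 of DFT bin k, computed with the Goertzel recurrence."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def dft_power(samples, k):
    """Reference value: the same bin computed directly from the DFT definition."""
    n = len(samples)
    total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples))
    return abs(total) ** 2


# A binary indicator sequence that repeats every 3 positions (60 samples).
signal = [1.0, 0.0, 0.0] * 20
period3_bin = len(signal) // 3      # bin 20 corresponds to period 3
power = goertzel_power(signal, period3_bin)
reference = dft_power(signal, period3_bin)
print(power, reference)
```

When only one or a few frequency bins matter, this recurrence does far less work than a full transform, and many windows can be evaluated in parallel.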

Authors: Laboratory for Spectral Diagnosis at Northeastern University
Speedup: 100X+

One element of the hyperspectral image analysis workflow that requires more than a traditional desktop workstation or personal computer is Hierarchical Cluster analysis (HCA). HCA requires a large amount of data space and substantial computation time (~11 hours) for typical datasets using a single processor personal computer. Rather than following the traditional approach of moving to a lower level programming language like C or C++ and complex parallel programming paradigms such as OpenMP or the Message Passing Interface (MPI), the lab utilized graphics processing units, or GPUs, and the ArrayFire software platform. The solution allowed the lab to dramatically increase the performance of the analysis while substantially decreasing the amount of calendar time to reach the desired results.

Authors: CDC Research and Development Team
Speedup: ~20X

This case study provides a look at biological research regarding coordinated mutations of the Hepatitis C Virus (HCV). ArrayFire provided collaborative R&D resources and greatly improved the speed of this HCV research with the use of parallelization, reducing the computing time from 40 days to less than 1 day. Most importantly, the conclusion of the case study illustrates that the relative price-performance of personal supercomputers that leverage GPUs and ArrayFire software provides a compelling solution versus other architectures and approaches.

Authors: Georgia Institute of Technology
Speedup: 3.5X

The Georgia Tech team explores the value of ArrayFire software and GPUs for fMRI workflows within the popular SPM – Statistical Parametric Mapping software widely used in neuroscience research.

Authors: Jaideep Singh, Ipseeta Aruni, R. Balasubramanian – IIT – Roorkee, India
Speedup: 38X

This study presents the acceleration of a Haar wavelet-based image compression algorithm for medical imaging on the Graphics Processing Unit (GPU) using ArrayFire software. Due to bandwidth and image size constraints of medical imaging systems, image compression plays a vital role in reducing the bit rate of transmission or storage. Wavelet-based image compression provides the most promising approach for high quality image compression.
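
As a rough illustration of the idea, the sketch below applies one level of a Haar transform to a single row of pixels, using the simple averaging form (pairwise averages and half-differences) instead of the orthonormal 1/√2 scaling, and then discards small detail coefficients. It assumes an even-length row; the function names, threshold, and pixel values are illustrative and are not taken from the study.

```python
def haar_forward(row):
    """One Haar level: pairwise averages and half-differences of an even-length row."""
    averages = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    details = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return averages, details


def haar_inverse(averages, details):
    """Rebuild the row from averages and half-differences."""
    row = []
    for a, d in zip(averages, details):
        row.extend([a + d, a - d])
    return row


def drop_small_details(details, threshold):
    """Lossy step: zero every detail coefficient whose magnitude is at most `threshold`."""
    return [d if abs(d) > threshold else 0.0 for d in details]


pixels = [100, 102, 50, 90, 200, 200, 10, 12]
averages, details = haar_forward(pixels)
exact = haar_inverse(averages, details)            # lossless round trip
kept = drop_small_details(details, 2.0)            # only the sharp edge survives
approximate = haar_inverse(averages, kept)
print(approximate)
```

Zeroed coefficients cost almost nothing to store, and every pixel pair is transformed independently, so the transform parallelizes naturally across an image.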

Authors: Spencer Technologies
Speedup: 12X

Spencer describes how ArrayFire software facilitates the development of fast algorithms enabling observation of brain displacement across depth with sampling density that far surpasses previous benchmarks.

Authors: Leibniz Institute of Plant Genetics and Crop Plant Research
Speedup: 20-35X

Multidimensional scaling (MDS) is a general computing technique to turn a distance matrix into a set of reconstructed points with pair-wise relationships approximating the original distances by points located in a usually low-dimensional space. ArrayFire software is used to enhance execution of the HiT-MDS procedure and delivers considerable performance improvement.

Authors: Georgia Institute of Technology
Speedup: 70X

In this work, the researchers simulate the delivery of a novel nanoparticle chemotherapy drug to cancerous tissue. Simulation allows scientists to predict experimental outcomes and thus reduce the cost of development and time to clinical relevance. The simulation model includes blood vessels, tumor cells, and healthy cells and an engine to calculate the spatial distributions of both drug and oxygen. ArrayFire software is used to speed up the diffusion calculations for the drug and oxygen within the tissue.

Authors: University of Manchester and Nofima Mat, Norway
Speedup: Hours of runtime reduction

The authors present an iterative algorithm that applies full Mie scattering theory and avoids noise accumulation by integrating a curve-fitting step. ArrayFire software and NVIDIA GPUs are used to reduce the time added by the curve-fitting step.

Manufacturing

Authors: Andrew Ng, Stanford University
Speedup: Ability to process many images in parallel

Stanford researchers in Andrew Ng’s group used GPUs and ArrayFire software to speed up their work on Feature Learning Architectures. They decided to use ArrayFire software for this study because of the need to quickly evaluate many architectures on thousands of images. ArrayFire software taps into the immense computing power of GPUs and speeds up research utilizing many images.

Authors: Drs. Capozzoli, Curcio, di Vico, and Liseno, Universita di Napoli Federico II
Speedup: 10X

To investigate changes in forest biomass, scientists use microwave tomography to image the vegetation. At the smallest scale, individual plants can be imaged to investigate branching and growth, but even synthetic aperture radar can reveal large-scale changes in regional ecology.

Authors: Quoc Le, Will Zou, Serena Yeung, Andrew Ng, Stanford University
Speedup: 4.4X

In a paper at CVPR 2011, entitled “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis”, the authors explain how their unsupervised feature learning algorithm competes with other algorithms that are hand-crafted or use learned features. For their training purposes, they used a multi-layered stacked convolutional ISA (independent subspace analysis) network. An ISA is used for learning features from image patches without supervision.

Media and Computer Vision

Authors: OpenCV Blogger
Speedup: ~10X

The OpenCV library is the de facto standard for computer vision and image processing research projects. OpenCV includes several hundred computer vision algorithms aimed at real-time vision applications. This case study shows how to use both libraries together. There is a simple example application that demonstrates using OpenCV for webcam access and ArrayFire for some basic processing routines and displaying results.

Authors: Google and Stanford
Speedup: 10-20X

Video content analysis is the basis for categorizing videos and enabling search by content. Growing interest in using sparse-coding methods to extract motion features in video in support of video content analysis led to the application of ArrayFire software to improve performance by substantially accelerating the solution of the L1-regularized least-squares optimization problem.
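
The case study does not name the solver used. One standard way to solve an L1-regularized least-squares problem, minimizing 0.5 · ||Ax − b||² + λ · ||x||₁, is iterative soft-thresholding (ISTA). The sketch below uses it on a tiny hand-checkable problem. The function names, the step size, and the data are assumptions for illustration only.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t, clamping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(a, b, lam, step, iterations):
    """Minimize 0.5*||a x - b||^2 + lam*||x||_1 by iterative soft-thresholding.

    `step` should be at most 1 / (largest eigenvalue of a^T a).
    """
    rows, cols = len(a), len(a[0])
    x = [0.0] * cols
    for _ in range(iterations):
        # Residual a x - b, then gradient a^T (a x - b) of the smooth term.
        residual = [sum(a[i][j] * x[j] for j in range(cols)) - b[i]
                    for i in range(rows)]
        gradient = [sum(a[i][j] * residual[i] for i in range(rows))
                    for j in range(cols)]
        # Gradient step followed by shrinkage, which produces sparsity.
        x = [soft_threshold(x[j] - step * gradient[j], step * lam)
             for j in range(cols)]
    return x


a = [[2.0, 0.0],
     [0.0, 1.0]]
b = [4.0, 0.3]
solution = ista(a, b, lam=1.0, step=0.25, iterations=50)
print(solution)
```

For this diagonal example the minimizer can be checked by hand: the first coordinate settles at 1.75 and the second is driven exactly to zero. The matrix-vector products dominate the cost at realistic sizes, and those are the parts a GPU accelerates.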

Authors: Vidhur Vohra – Georgia Tech
Speedup: 15X

Did you ever wonder how the music visualizer in your media player works? Watching it pulsate in synchrony with the beats of the song is almost as entertaining as listening to the song itself! Researchers have been attempting to detect beats in audio signals for many years, and there are many techniques available, from the simplest (and least accurate) to more complicated algorithms that are highly accurate. All algorithms, though, perform some form of signal processing and frequency analysis, applications highly suited to GPU Computing.

Authors: Stanford Artificial Intelligence Laboratory
Speedup: Improved accuracy

Researchers at SAIL (Stanford Artificial Intelligence Laboratory) have done it again. They have successfully used ArrayFire software to speed up the training part of deep learning algorithms. In their paper titled “On Optimization Methods for Deep Learning”, they experiment with some of the well-known training algorithms and demonstrate their scalability across parallel architectures (GPUs as well as multi-machine networks). The algorithms include SGD (stochastic gradient descent), L-BFGS (limited-memory BFGS, used for solving non-linear problems), and CG (conjugate gradient).

Authors: Andrew Ng, Stanford University
Speedup: Ability to process many images in parallel

Stanford researchers in Andrew Ng’s group used ArrayFire software to speed up their work on Feature Learning Architectures. They decided to use ArrayFire software for this study because of the need to quickly evaluate many architectures on thousands of images. ArrayFire software taps into the immense computing power of GPUs and speeds up research utilizing many images.

Authors: Nitesh Pandey, Damien Kelly, Bryan Hennelly and Thomas Naughton from the National University of Ireland, Maynooth
Speedup: 17X

Digital holography is a powerful imaging technique with many new applications like true 3D display. It allows the capture of both amplitude and phase information of the light reflected off the surface of 3D objects. Researchers at the National University of Ireland, Maynooth are developing techniques based on digital holography for 3D display applications.
Reconstruction of large digital holograms can be computationally intensive to generate on CPUs, but GPUs running ArrayFire software offer amazing possibilities.

Oil and Gas

Authors: Boise State, University of Colorado, University of Minnesota
Speedup: 2.5-4.5X

The authors introduce a GPU implementation of a three-dimensional mantle convection model at a high Rayleigh number to the solid earth geophysics community. They outline code development time, compare performance of CPUs versus GPUs, and deliver powerful visualizations.

Authors: Kevin R. Tubbs and Frank T-C. Tsai at Louisiana State University
Speedup: >20X

A lattice Boltzmann method (LBM) for solving the shallow water equations and the advection-dispersion equation is developed and implemented on graphics processing unit (GPU)-based architectures. The proposed LBM is implemented on an NVIDIA Computing Processor in a single GPU workstation. GPU computing is performed using ArrayFire software. Mass transport with velocity-dependent dispersion in shallow water flow is simulated by combining the MRT-LBM model and the TRT-LBM model. The GPU parallel performance increases as the grid size increases. The results indicate the promise of the GPU-accelerated LBM for modeling mass transport phenomena in shallow water flows.

Authors: Louisiana State University
Speedup: >10X

A lattice Boltzmann method (LBM) for three-dimensional shallow water flow fields coupled to mass transport is developed on high performance computing (HPC) environments. The LBM is an attractive method for solving the multilayered shallow water equations because the extension to multiple layers is straightforward and keeps the simplicity and advantages of the LBM. The work covers mass transport in shallow water flows and LBM performance on central processing unit (CPU)-based and graphics processing unit (GPU)-based HPC environments.

Hire our acceleration gang

Hire us to accelerate your code. Or to train your pros. Or even to write new code for your organization!

We’ve already helped thousands of organizations speed up their code. We’ve accelerated algorithms, data sets, and challenging problems of all types in the past, and we’re confident that we can accelerate your code too.


Schedule a consultation





Starter Package ($500)*

Hire us for one day. We set you up with one engineer dedicated for a full day, as well as support from senior and test engineers throughout the day. Here’s how it works:

  • Initial phone call
  • You send us your code or algorithm
  • We analyze your code for potential GPU benefit
  • We accelerate as much of the code as is possible within the 1-day effort, applying our extensive GPU software libraries
  • We produce a final report, along with code
The final report outlines the best approach to follow for full GPU-acceleration of your code.

 

Buy the Starter Package

 

*Limit one (1) per customer.


Custom Coding Work

Why Choose Us?
  • Deep experience, already servicing thousands of customers
  • Quality work, with leading GPU software
  • Efficient delivery, with ArrayFire’s code base we can deliver results fast
  • Proven, we make customers happy through faster code
Our Strengths
  • Excellence in GPU code
  • Well-written, efficient C, C++, Fortran, Java, or .NET
  • Application porting and integration
  • Tuning software to peak hardware performance
  • Scalability of algorithms to handle enterprise-level programs
Schedule a consultation


Training

Attendees will receive the latest industry knowledge and techniques for GPU computing in CUDA and OpenCL.

ArrayFire offers up to four days of specialized GPU training in CUDA and OpenCL programming. We provide customized on-site training courses for customers that have 3 or more attendees from their organization. We can travel to your location and provide a CUDA or OpenCL training course tailored to meet your application-specific needs. We also offer individual trainings at our Atlanta office. Please contact us to learn more about our training offerings.

We recommend that attendees have a working knowledge of C/C++ in order to gain the most from the training courses. For more information about the content of our training lectures and hands-on practicums, please see the following training course syllabus.



Included in Course

  • Instruction by an excellent and interesting expert, with hands-on exercises
  • Use of a laptop with CUDA and OpenCL capable GPUs and CPUs
  • Choice of Linux or Windows operating system
  • Printed manual of lecture material
  • Electronic copy of programming exercises


Training Course Syllabus

Day 1, Introduction
Lectures:
GPU Computing Overview
The Programming Model
Basic Dataset Mapping Techniques
Libraries, ArrayFire
Profiling Tools

Practice:
A Simple Kernel
Equivalent ArrayFire Example
Using Libraries
Monte Carlo Pi Estimation
Timing and ArrayFire
Debugging Code

Day 2, Optimization
Lectures:
Architecture: Grids, Blocks, and Threads
Memory Model: Global, Shared, and Constant Memory
Advanced Mapping Techniques
Streams: Asynchronous Launches and Concurrent Execution
ArrayFire: Lazy Evaluation and Code Vectorization

Practice:
Matrix Transpose
Optimization Using Shared Memory
Median Filter
Optimization Using Constant Memory
Stream Example
ArrayFire Example: Nearest Neighbor Algorithm

Day 3, Multi-GPU
Lectures:
Multi-GPU Use Cases
Multi-GPUs: Contexts
Existing Libraries
Scaling Across Multiple GPUs

Practice:
Out of Core Problems: Matrix Multiply
Task Level Parallelism: Optimization
ArrayFire Multi-GPU

Day 4, Algorithm Problems

Lectures and Practice:
Reductions
Scan Algorithms
Sort
Convolution
Customer-Specific Problem
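
To give a flavor of the Day 1 "Monte Carlo Pi Estimation" exercise, here is a minimal CPU-only sketch in plain Python. It is not the course material: the function name, sample count, and seed are illustrative assumptions. Random points are drawn in the unit square, and the fraction that lands inside the quarter circle approximates π/4.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)       # fixed seed keeps the run repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


estimate = estimate_pi(200_000)
print(estimate)
```

Every sample is independent, so on a GPU the loop becomes one array-wide comparison followed by a reduction, which ties this exercise to the Day 4 reduction topic.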

