Giddy for GTC – We’re Taking it to the Next Level

ArrayFire | Announcements, ArrayFire, Events

GTC is quickly approaching and AccelerEyes is giddy with excitement! This year we are taking things to the next level as a Silver Sponsor at GTC 2013, which means you’ll be seeing a lot more of us throughout the conference! Schedule a Meeting with Us: Do you want to meet with us personally? Schedule a time to sit down with AccelerEyes engineers and account representatives using our online scheduler. Visit our Booth: If you’re attending GTC, be sure to visit us at booth #204 to see some great demos or to chat with anyone in our Software Shop for CUDA & OpenCL. Come see how ArrayFire complements other GPU development efforts, including raw CUDA/OpenCL development, OpenACC, and other GPU libraries. Register …

GTC 2013 Tutorial – CUDA Accelerated Image Processing Libraries

John Melonakos | ArrayFire, CUDA, Events

The 2013 GPU Technology Conference is just two weeks away. We’re super excited. We’re spending a lot of time preparing for our tutorial on CUDA Accelerated Image Processing Libraries, and we think it will be well worth your while to attend. This is an 80-minute session all about CUDA image processing, presented by James Malcolm, an AccelerEyes co-founder and lead engineer. You will walk away from the tutorial much better prepared to build fast computer vision and image processing code. The session abstract is as follows: Image processing has consistently proven to benefit greatly from GPU acceleration. A number of libraries available from NVIDIA and AccelerEyes make image processing development efficient and lead to big speedups. Using these libraries can often significantly shorten …

Benchmarking Tesla K20

Pavan Yalamanchili | ArrayFire, Benchmarks, CUDA | 1 Comment

In this blog post, we compare NVIDIA’s latest high-end offering, the Tesla K series (PDF), with their previous offering. In particular, we compare the Tesla K20C with the Tesla C2070/2075. This blog post follows a similar post about benchmarking the GTX680 that we did last year. We take a look at a similar set of functions (and a little bit more) to see what benefits the newer line brings. All of the benchmarks were done using double precision. In all of the graphs, higher trendlines are better. Matrix Multiplication: In house at AccelerEyes, we use matrix multiplication as the gold standard for testing the maximum performance of all new GPUs we end up with. The K20c reaches a peak at …
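As a reader's aid for the matrix multiplication benchmark mentioned in the excerpt above, here is a minimal sketch of how such throughput numbers are conventionally derived: an n x n product is counted as 2n^3 floating-point operations, and that count is divided by the measured time. This is not the benchmark code from the post. The function names (`matmul_flops`, `gflops`, `naive_matmul`, `time_matmul`) are hypothetical, and the pure-Python CPU loop only stands in for a real GPU routine so that the example runs anywhere. Python floats are double precision, matching the precision used in the post.

```python
import time


def matmul_flops(n: int) -> int:
    """Conventional operation count for an n x n by n x n product.

    Each of the n*n output elements needs n multiplies and n adds,
    so the count is 2 * n**3.
    """
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Convert a measured runtime into GFLOP/s for an n x n product."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return matmul_flops(n) / seconds / 1e9


def naive_matmul(a, b):
    """Reference product on lists of lists (CPU only, slow).

    Hypothetical stand-in for the GPU routine being benchmarked.
    """
    inner = len(b)
    cols = list(zip(*b))  # columns of b, so each output is a dot product
    return [[sum(row[k] * col[k] for k in range(inner)) for col in cols]
            for row in a]


def time_matmul(n: int, repeats: int = 3) -> float:
    """Best-of-`repeats` wall-clock time for one n x n product."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        naive_matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 64
    elapsed = time_matmul(n)
    print(f"n={n}: {elapsed:.4f} s, {gflops(n, elapsed):.4f} GFLOP/s")
```

Taking the best of several repeats reduces the influence of one-off timing noise. When timing a real GPU library, the device would also need to be synchronized before the timer is stopped, which this CPU-only sketch does not have to deal with.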

7 Tips for CUDA & OpenCL Programming and How ArrayFire Helps

ArrayFire | ArrayFire, CUDA, OpenCL

To get the best performance from your CUDA or OpenCL code, it helps to keep a few optimization tips in mind. Note: By “accelerator” we refer to GPUs, APUs, co-processors, FPGAs, and any devices capable of running CUDA or OpenCL. Vectorized Code: Accelerators perform best with vectorized code because the computations map naturally onto the arithmetic cores of the hardware. ArrayFire functions are inherently vectorized, so if you are using ArrayFire, you are writing vectorized code. Memory Transfers: Avoid excessive memory transfers. Each casting operation to and from the accelerator moves data back and forth between CPU memory and accelerator memory. ArrayFire makes many automatic optimizations to minimize these memory transfers by only transferring data when …

How much speedup can you get with CUDA or OpenCL?

Scott | ArrayFire, Benchmarks, CUDA, OpenCL

Every day, developers ask us to predict how much speedup they can get with CUDA or OpenCL. Rather than gaze mysteriously into a crystal ball, we ask the developers questions to explore pertinent acceleration factors. Note, we’ll use the term accelerator to include GPUs, Xeon Phi coprocessors, APUs, FPGAs, and any other CUDA or OpenCL device. The principles we discuss below are equally applicable to all of these accelerators. The following are some of the important factors that must be considered when estimating the potential for accelerated speedups: Hardware: The more advanced the accelerator hardware, the greater the speedup you get (e.g. the NVIDIA Kepler K20 outperforms the previous NVIDIA Fermi C2090 generation). Data Sizes: In general, accelerators will outperform CPUs to …

Getting Started with ArrayFire – a 30-minute Jump Start

ArrayFire | ArrayFire, C/C++, CUDA, OpenCL | 1 Comment

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts. ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax of ArrayFire. He also provided an overview of existing efforts in GPU software and compared them to the extensive capabilities of ArrayFire. For example, the same application that takes 26 lines to write in Thrust can be coded up in just 3 lines in ArrayFire! ArrayFire has supported …

CUDA GPUs Boost Mars Research

ArrayFire | Case Studies, CUDA

With the recent news release from NASA about the Mars Curiosity rover, and as a continuation of our previous post “Powering Mars Research”, Brendan Babb is here again to provide us with an exciting look into Jacket’s role in Mars research from the Curiosity rover. Brendan Babb and colleague Frank Moore, at the University of Alaska in Anchorage, work with NASA’s Jet Propulsion Lab to improve image quality and image compression of the Mars Rover images. Here is what Brendan had to tell us about the use of Jacket in his GPU computing challenges… Brendan Babb: I was thrilled to watch the new Mars Rover Curiosity’s successful landing with my visiting nieces and nephews. The new rover will take pictures, …

No Free Lunch for GPU Compiler Directives

John Melonakos | ArrayFire, C/C++, CUDA, Fortran | 3 Comments

Last week, Steve Scott at NVIDIA put up a viral post entitled, “No Free Lunch for Intel MIC (or GPU’s).” It was a great read and a big hit in technical computing circles. The central point of Scott’s piece was that there are no magic compilers. GPUs don’t have them, and neither will MIC. No compiler will be able to automatically recompile existing code and get great performance from MIC or GPUs. Rather, it takes a good amount of elbow grease to write high-performance code. We totally agree. The problem Scott addresses is real. Despite marketing spin to the contrary, developing code for GPUs requires work. However, we don’t agree with Scott’s conclusion that compiler directives are a good solution. You can’t fight …

CUDA over Remote Desktop now available for Tesla GPUs

John Melonakos | Announcements, CUDA | 5 Comments

Update: Jacket over Remote Desktop is now available for Quadro devices too! Read this post. Jacket over Remote Connections is also documented extensively on the AccelerEyes Wiki. Over the past several years, many Jacket programmers have requested support for Remote Desktop in Windows. We are pleased to report that recent NVIDIA drivers now enable Jacket to run over Remote Desktop, for some system configurations. Specifically, the requirements to make this work include: Windows Vista, Windows 7, Windows HPC Server 2008, or Windows HPC Server 2008 R2; the latest NVIDIA driver (as required by Jacket); a Tesla GPU; and TCC mode enabled on at least one (Tesla) GPU. To enable TCC, the Tesla cannot be connected to a display. This means you need to …

Beam Propagation Methods – Jacket is 3.5X faster than the CPU and 2X faster than PCT

John Melonakos | Benchmarks, Case Studies, CUDA | 2 Comments

A couple of weeks ago, a GPU-enabled code appeared on MATLAB Central entitled, “A CUDA accelerated Beam Propagation Method [BPM] Solver using the Parallel Computing Toolbox.” In this post, we share a video that compares Jacket with PCT at GPU computing by analyzing performance on this Beam Propagation Method code. To reproduce these results, download the source code here: CUDA_BPM_NOV_04_2010_AccelerEyes. These benchmarks were run on an NVIDIA Tesla C2070 GPU versus a quad-core Intel CPU. MATLAB + PCT R2010B were used for the PCT-GPU experiments. MATLAB + Jacket 1.6 (prerelease) were used for the Jacket-GPU experiments. Take Home Message: Due to Jacket’s extensive library of GPU functions and its optimized GPU runtime, it performs 3.5X faster than …