Image Processing Benchmarks on NVIDIA Jetson TK1

Pradeep Garigipati ArrayFire, Benchmarks, CUDA 7 Comments

In this post we look at benchmarks of the following ArrayFire image processing functions on an ARM device:

- Erosion/Dilation
- Median filter
- Resize
- Histogram
- Bilateral filter
- Convolution

We pitted the brand new compute 3.2 GPU on the NVIDIA Jetson TK1 against a mobile NVIDIA GPU. The closest match in our mobile card deck to the GPU on the Jetson board (from here on referred to as TK1) is an NVIDIA GT 650M. The GPU device properties that have a critical effect on function performance are listed below.

Property Name / Device Name    Jetson TK1 GK20A    GT 650M
Compute                        3.2                 3.0
Number of multiprocessors      1                   2
Cores                          192                 384
Base clock rate                852 MHz             950 MHz
Total global memory            1746 ...

Benchmarking the new Kepler (GTX 680)

Pavan Benchmarks, CUDA 13 Comments

NVIDIA has launched their next generation GPU based on their Kepler architecture. They followed it up with a rather quick update to their CUDA toolkit. Considering that we have access to 3 generations of their GTX cards (480, 580 and 680), we thought we would showcase how the performance has changed over the generations.

Matrix multiplication: It can be seen that the GTX 680 comfortably breaches the 1 teraflop mark for single precision, while the GTX 580 barely scratches it. However, the performance seems to peak around 2048 x 2048 and then falls off to match the performance of the GTX 580 at larger sizes. The high end Tesla C2070 finishes last for single precision behind the third placed ...
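For readers who want to relate a measured run time to a figure such as "1 teraflop", here is a minimal sketch in Python. It assumes the common convention of counting 2n^3 floating-point operations for an n x n matrix product; whether the original benchmark used this convention is an assumption, and the function names are hypothetical helpers, not part of any library.

```python
def gemm_flops(n):
    """Operation count for an n x n by n x n product (2*n^3 convention, assumed)."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Achieved rate in GFLOPS for one n x n multiply that took `seconds`."""
    return gemm_flops(n) / seconds / 1e9


def seconds_for_rate(n, target_gflops):
    """Time one n x n multiply may take to reach `target_gflops`."""
    return gemm_flops(n) / (target_gflops * 1e9)


if __name__ == "__main__":
    n = 2048
    budget = seconds_for_rate(n, 1000.0)  # 1000 GFLOPS = 1 teraflop
    print(f"n={n}: {gemm_flops(n)} flops, {budget * 1e3:.2f} ms budget for 1 teraflop")
```

Under this convention, a 2048 x 2048 single precision multiply has to complete in roughly 17.2 ms to reach the 1 teraflop mark.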

Jacket on Lenovo Systems

Scott Announcements, Benchmarks, Jacket 1 Comment

Lenovo and AccelerEyes have a joint solution for optimizing M code on Lenovo workstations. The joint HPC solution pairs high Intel Xeon CPU performance for daily productivity with unprecedented NVIDIA graphics (GPU) performance for parallel computing with Jacket. Jacket’s comprehensive benchmark suite, when run on Lenovo ThinkStation systems, shows substantial speedups for a wide variety of computationally intensive applications. Jacket is the world’s fastest and broadest GPU software accelerating the M-language commonly found in MATLAB®. Thousands of customers around the world have used Jacket to accelerate their MATLAB code. Lenovo ThinkStation systems are ideally suited for running real-world high-performance applications using Jacket. While the high-end CPUs are ideal for daily productivity tasks, Jacket and the Quadro GPUs perform HPC ...

A better way to time Jacket code

ArrayFire Benchmarks 1 Comment

Whether you are a new Jacket programmer or a GPU maestro, you are bound to speed-test Jacket at some point. There are many factors to keep in mind while benchmarking Jacket code - a simple tic-func()-toc won't do. For example, this is some typical benchmarking code:

    % warm up
    x = rand(n,'single');
    x = grand(n,'single');
    geval(x);

    % CPU timing
    tic
    for r = 1:reps
      x = rand(n,'single');
    end
    cpu_time = toc;

    % GPU timing
    gsync, tic
    for r = 1:reps
      x = grand(n,'single');
      geval(x);
    end
    gsync, gpu_time = toc

With Jacket 1.7, this entire code chunk is now replaced by two lines:

    cpu_time = timeit(@() rand(n,'single'));
    gpu_time = timeit(@() grand(n,'single'));
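The warm-up, repeat, and average pattern is not specific to Jacket. Below is a minimal sketch of the same idea in Python, offered as an analogue rather than as Jacket code. It is CPU-only, so it has no counterpart to gsync; with a GPU library, a synchronization call would be needed before each clock read. The names `time_call` and `workload` are hypothetical and chosen only for illustration.

```python
import time


def time_call(fn, reps=10, warmup=2):
    """Return the mean wall-clock seconds per call of fn()."""
    # Warm-up runs are executed but not timed, so one-time costs
    # (caches, lazy initialization) do not skew the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / reps


def workload(n=200):
    """Stand-in computation: sum of squares below n."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    per_call = time_call(workload, reps=100)
    print(f"{per_call:.3e} s per call")
```

Dividing by the repetition count reports a per-call time, which makes runs with different `reps` values directly comparable.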

Hybrid GPU & Multicore Processing for LU Decomposition

Scott Benchmarks, Case Studies, CUDA Leave a Comment

One of the hot areas in supercomputing is hybrid compute: balancing the computational load between one or more CPUs and GPUs. Along these lines, Nolan Davis and Daniel Redig at SAIC recently presented work on Hybrid GPU/Multicore Solutions for Large Linear Algebra Problems, where they developed a novel algorithm for LU decomposition, one of the most important routines in linear algebra. Here's a snapshot view of their setup:

System Specs:

GPU
- Nvidia® Tesla™ 2050
- 448 processing cores
- 3 GB dedicated memory

Multicore Host
- 24 cores
- 64 GB system memory
- Red Hat® Enterprise Linux 5
- Two AMD Opteron™ 6172 12-core processors

Host-to-GPU Communications
- PCIE 2.0
- 16 channels at 500 MB/sec/lane
- Theoretical peak bandwidth of 8 GB/sec

Their initial results are very promising. For ...

Beam Propagation Methods - Jacket is 3.5X faster than the CPU and 2X faster than PCT

John Melonakos Benchmarks, Case Studies, CUDA 2 Comments

A couple weeks ago, a GPU-enabled code appeared on MATLAB Central entitled, "A CUDA accelerated Beam Propagation Method [BPM] Solver using the Parallel Computing Toolbox."  In this post, we share a video which showcases how Jacket is much better than PCT at GPU computing, by analyzing performance on this Beam Propagation Method code. To reproduce these results, download the source code here:  CUDA_BPM_NOV_04_2010_AccelerEyes These benchmarks were run on an NVIDIA Tesla C2070 GPU versus a quad-core Intel CPU.  MATLAB + PCT R2010B were used for the PCT-GPU experiments.  MATLAB + Jacket 1.6 (prerelease) were used for the Jacket-GPU experiments. Take Home Message Due to Jacket's extensive library of GPU functions and its optimized GPU runtime, it performs 3.5X faster than ...

A Jacket built for Speed

ArrayFire Benchmarks, CUDA 1 Comment

Just a few months ago, Jacket 1.4 was released sporting an improved MTIMES routine that brought about radical improvements to Jacket's matrix multiplication. The quest for performance never ends, though. Now, in the release of Jacket 1.5, MTIMES is even faster than before for SGEMM routines. Check out the MTIMES Benchmarks wiki for more information. If you are attending GTC, you may want to attend this session also!

Torben’s Corner - A GPU Computing Gem for Jacket Programmers!

John Melonakos Benchmarks Leave a Comment

In January, we introduced you to Torben’s Corner, a resource wiki created and maintained by Jacket programming guru Torben Larsen at Aalborg University in Denmark. Many Jacket programmers have gained valuable insights from Torben’s Corner, including GPU performance charts, coding guidelines, and special tricks. Since January, many wonderful additions have been made to Torben’s Corner. We think you will find value not only in this new information but in the entire resource. Here is a quick summary of the most recent additions with links to the information:

Benchmarking Update: Torben’s Corner maintains a long list of benchmarks specifically detailing speedups of Jacket relative to standard MATLAB. This became an enormous task due to the sheer number of functions supported by Jacket ...

NVIDIA Fermi with CUDA and OpenCL

ArrayFire Benchmarks, CUDA, OpenCL 1 Comment

In December of 2008, we did a blog post answering questions from customers and prospects about the use of OpenCL for Jacket. If you have not reviewed that blog post to gain some insight into our progress, you can access it here. Some things have changed since that original post. For example, NVIDIA now provides an OpenCL driver, toolkit, programming guide, and SDK examples. Given the new tools available and the new Fermi hardware, we ran some tests on the Tesla C2050 to compare OpenCL performance to CUDA performance. The Tesla C2050 is an amazing beast of a card, providing up to 512 Gigaflops of double precision arithmetic (at peak). Before we present the benchmarks, we should comment on ...

Torben's Corner

Gallagher Pryor Announcements Leave a Comment

We work very closely with our customers, and we appreciate the feedback we receive and value the insight provided. One Jacket programmer has started to post fantastic content on the Jacket Documentation Wiki under Torben's Corner. This content is maintained by Torben Larsen's team at AAU and focuses primarily on outlining performance observations between GPUs and CPUs. This information is not only of great value to our technical team but also valuable to the entire Jacket community. Thanks, Torben, for this great resource!