Jacket on Lenovo Systems

Scott | Announcements, Benchmarks

Lenovo and AccelerEyes have a joint solution for optimizing M code on Lenovo workstations. It pairs high Intel Xeon CPU performance for daily productivity with unprecedented NVIDIA graphics processing unit (GPU) performance for parallel computing with Jacket. When Jacket's comprehensive benchmark suite is run on Lenovo ThinkStation systems, it shows large speedups across a wide variety of computationally intensive applications. Jacket is the world's fastest and broadest GPU software for accelerating the M language commonly used in MATLAB®. Thousands of customers around the world have used Jacket to accelerate their MATLAB code. Lenovo ThinkStation systems are ideally suited to running real-world high-performance applications with Jacket. While the high-end CPUs are ideal for daily productivity tasks, Jacket and the Quadro GPUs perform HPC …

Filtering Benchmarks – OpenCV GPU vs LibJacket

ArrayFire | Benchmarks, CUDA

OpenCV is one of the most popular computer vision toolkits, and over the last year its developers have been integrating more GPU processing into the core. One of the most common image processing tasks is convolution. Since LibJacket and OpenCV both support it, one of my coworkers rolled up his sleeves and benchmarked three implementations from the latest versions of both libraries: OpenCV on the CPU, OpenCV on the GPU, and LibJacket. Jump over to his personal website for the full benchmark results and source code. His graphs show that the GPU implementations from OpenCV and LibJacket both easily outperform OpenCV's default CPU version, but notice that LibJacket pushes performance even further and dominates OpenCV's GPU implementation, especially when using separable filters. We've worked really hard the last few years to …
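LibJacket's lead was largest with separable filters. Separable filtering is cheaper in any implementation: when a 2-D kernel is the outer product of a column kernel and a row kernel, filtering with it equals a vertical 1-D pass followed by a horizontal 1-D pass, so a k×k kernel costs about 2k instead of k² multiply-adds per pixel. The pure-Python sketch below checks that equivalence on a small zero-padded image. The function names are hypothetical (they are not OpenCV or LibJacket calls), filtering is computed as correlation, and the sketch says nothing about how either library's GPU kernels are actually written.

```python
def outer(col, row):
    """Rank-1 2-D kernel built from a column kernel and a row kernel."""
    return [[c * r for r in row] for c in col]


def filter2d(img, kernel):
    """Direct 2-D filtering (correlation) with zero padding.
    Costs len(kernel) * len(kernel[0]) multiply-adds per output pixel."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2  # anchor the kernel at its center
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - oy, x + j - ox
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside the image
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out


def separable_filter2d(img, col, row):
    """Same result as filter2d(img, outer(col, row)), computed as a vertical
    1-D pass then a horizontal 1-D pass: len(col) + len(row) multiply-adds
    per pixel instead of len(col) * len(row)."""
    h, w = len(img), len(img[0])
    oy, ox = len(col) // 2, len(row) // 2
    # Vertical pass with the column kernel.
    tmp = [[sum(c * img[y + i - oy][x]
                for i, c in enumerate(col) if 0 <= y + i - oy < h)
            for x in range(w)] for y in range(h)]
    # Horizontal pass with the row kernel over the intermediate result.
    return [[sum(r * tmp[y][x + j - ox]
                 for j, r in enumerate(row) if 0 <= x + j - ox < w)
             for x in range(w)] for y in range(h)]
```

For example, with a Sobel-style kernel, `filter2d(img, outer([1, 0, -1], [1, 2, 1]))` and `separable_filter2d(img, [1, 0, -1], [1, 2, 1])` produce the same image; the separable form does 6 instead of 9 multiply-adds per pixel, and the gap widens as the kernel grows.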

Jacket Demo – CPU vs GPU runtimes on MATLAB® code

John Melonakos | Benchmarks, CUDA

The new Jacket Demo is a convenient way to explore the differences between CPU-only computing and GPU-accelerated computing. It automatically launches two MATLAB® sessions, one running on the CPU only and the other running on the GPU with Jacket. This side-by-side demo shows the computational speed of each processor along with a visual depiction of the algorithm's progress. A variety of demos are provided. The Jacket Demo is included in every Jacket installation (it is in the examples directory and can be launched from the Start Menu on Windows). Check out this video of the Jacket Demo in action on an i7 CPU with a Tesla C2050 GPU. Enjoy!

Fast Computer Vision with OpenCV and ArrayFire

John Melonakos | ArrayFire, Benchmarks, Case Studies, CUDA

Update: While the post below discusses LibJacket (no longer a product), you can do the same thing with the newer, but different, ArrayFire library. Moving from LibJacket to ArrayFire brings improved performance benchmarks and a simpler API. Mcclanahoochie just posted some code and instructions for pairing OpenCV with LibJacket to get accelerated computer vision. You can also do really fast image processing on webcam video feeds. Really cool stuff. Computer vision is a hot field, with applications emerging in defense, radiology, games, automotive, and other consumer markets. Computer vision algorithms like these are also going mobile. For instance, we have started to build LibJacket for Mobile applications, which runs on Tegra, PowerVR, and other mobile …

High Performance Compressive Sensing

ArrayFire | Benchmarks, Case Studies

A few weeks ago, we published a blog entry that demonstrated Jacket's ability to speed up "compressive sensing", a technology with wide applications in areas such as image processing, reconstruction, and spectroscopy. Here, we discuss the work of Nabor Reyna Jr. and Wotao Yin from Rice University, who used Jacket to speed up compressive sensing reconstruction algorithms. This work deals with reconstruction of signals from partial Fourier data (RecPF). The major computational components of the algorithm are shrinkage and FFTs. Jacket is employed to accelerate this compute-heavy code, and the resulting version (gRecPF) was about 5x faster! To reduce the cost of generating the random matrices used in the above method, a second method (RecPC) that …
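Shrinkage is simple element-wise work, which is one reason it maps well to a GPU: every entry is updated independently of the others. The sketch below shows the standard scalar soft-thresholding (shrinkage) operator and the isotropic two-dimensional variant commonly used with total-variation regularization. Whether RecPF uses exactly these forms is an assumption here, and the function names `shrink` and `shrink2` are hypothetical.

```python
import math


def shrink(values, t):
    """Scalar soft-thresholding: sign(v) * max(|v| - t, 0) for each entry."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in values]


def shrink2(gx, gy, t):
    """Isotropic 2-D shrinkage: each vector (gx[i], gy[i]) keeps its direction
    and has its length reduced by t, clamped at zero."""
    out_x, out_y = [], []
    for a, b in zip(gx, gy):
        norm = math.hypot(a, b)
        scale = max(norm - t, 0.0) / norm if norm > 0 else 0.0
        out_x.append(scale * a)
        out_y.append(scale * b)
    return out_x, out_y
```

Small entries are zeroed and large ones are pulled toward zero by `t`; that is what promotes sparse solutions. Because no entry depends on another, the loop bodies can run fully in parallel.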

Chan-Vese Active Contours on the GPU

ArrayFire | Benchmarks, Case Studies

Active contours are mathematical models that enable detection of objects within images, and they are used extensively in computer vision as self-adapting frameworks for delineating and tracking objects. To demonstrate Jacket's cross-platform versatility, we implemented the Chan-Vese contour tracking app on Android. The video can be viewed here. Today, however, we'd like to use a MATLAB implementation of active contours as an example of how to take a large project and, with minimal changes, speed it up with Jacket. We'll dangle the proverbial carrot first: the GPU Chan-Vese implementation contains only three kinds of changes overall, and the computational code is exactly the same for both the CPU and GPU versions. Plus, take a look at the speedups below! How …

A better way to time Jacket code

ArrayFire | Benchmarks

Whether you are a new Jacket programmer or a GPU maestro, you are bound to speed-test Jacket at some point. There are many factors to keep in mind when benchmarking Jacket code, so a simple tic-func()-toc won't do. For example, this is some typical benchmarking code:

```matlab
% Warm up both code paths so one-time setup costs are not timed
x = rand(n, 'single');
x = grand(n, 'single');
geval(x);                      % force Jacket to evaluate the lazily built expression

% CPU timing
tic
for r = 1:reps
    x = rand(n, 'single');
end
cpu_time = toc / reps;         % average time per call

% GPU timing
gsync, tic                     % wait for pending GPU work before starting the clock
for r = 1:reps
    x = grand(n, 'single');
    geval(x);                  % launch the computation
end
gsync, gpu_time = toc / reps;  % wait for the GPU to finish before stopping the clock
```

With Jacket 1.7, this entire chunk can be replaced by two lines:

```matlab
cpu_time = timeit(@() rand(n, 'single'));
gpu_time = timeit(@() grand(n, 'single'));
```
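The benchmarking rules behind those loops (warm up once, synchronize before reading the clock, repeat, and report a per-call figure) are not specific to MATLAB. Here they are as a small Python sketch. This is not Jacket's `timeit`: the name `time_call` and the `sync` hook, a stand-in for something like `gsync` that blocks until queued device work finishes, are hypothetical, and reporting the median of several trials is a design choice made here for robustness against outliers.

```python
import statistics
import time


def time_call(f, reps=10, trials=5, sync=None):
    """Median per-call wall-clock time of f(), in seconds.

    sync: optional callable that blocks until asynchronous work (for example,
    queued GPU kernels) has finished; hypothetical stand-in for gsync."""
    wait = sync or (lambda: None)
    f()      # warm-up: pay one-time costs (compilation, allocation) untimed
    wait()
    samples = []
    for _ in range(trials):
        wait()                          # start timing from an idle device
        start = time.perf_counter()
        for _ in range(reps):
            f()
        wait()                          # count work that was only queued
        samples.append((time.perf_counter() - start) / reps)
    return statistics.median(samples)
```

Usage mirrors the two-line Jacket version, e.g. `cpu_time = time_call(lambda: [random.random() for _ in range(n * n)])` after `import random`. Without the final `wait()`, an asynchronous device would appear almost infinitely fast, because the clock would stop before the queued work ran.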

Improved Fat/Water Reconstruction Algorithm with Jacket

Scott | Benchmarks, Case Studies, CUDA

Case Western Reserve University researchers turned to GPUs running Jacket to develop a fast and robust Iterative Decomposition of water and fat with Echo Asymmetry and Least-squares (IDEAL) reconstruction algorithm. The complete article can be found here. The authors report that "GPU usage is critical for the future of high resolution, small animal and human imaging" and that Jacket "enables GPU computations in MATLAB." Their research was performed on a desktop system with 32 GB RAM, dual Intel Xeon X5450 3.0 GHz processors, an NVIDIA Quadro FX5800 (4 GB RAM, 240 cores, 400 MHz clock), and 64-bit MATLAB R2009a. Jacket v1.1, an older version, was used to produce these results. Reconstruction tests with different-sized images were performed to evaluate computation times …

Hybrid GPU & Multicore Processing for LU Decomposition

Scott | Benchmarks, Case Studies, CUDA

One of the hot areas in supercomputing is hybrid computing: balancing the computational load between one or more CPUs and GPUs. Along these lines, Nolan Davis and Daniel Redig at SAIC recently presented work on Hybrid GPU/Multicore Solutions for Large Linear Algebra Problems, in which they developed a novel algorithm for LU decomposition, one of the most important routines in linear algebra. Here's a snapshot of their setup:

System specs:
GPU: Nvidia® Tesla™ 2050, 448 processing cores, 3 GB dedicated memory
Multicore host: two AMD Opteron™ 6172 12-core processors (24 cores), 64 GB system memory, Red Hat® Enterprise Linux 5
Host-to-GPU communications: PCIe 2.0, 16 lanes at 500 MB/s per lane, for a theoretical peak bandwidth of 8 GB/s

Their initial results are very promising. For …
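For readers less familiar with the routine itself, here is a minimal, single-threaded LU factorization with partial pivoting in pure Python. It is the textbook algorithm, not the SAIC hybrid GPU/multicore algorithm, whose details the excerpt does not give, and the name `lu_decompose` is hypothetical. The elimination loop that updates the trailing submatrix accounts for most of the roughly (2/3)n³ floating-point operations, which makes it the natural candidate for offloading in a hybrid design.

```python
def lu_decompose(a):
    """LU factorization with partial pivoting (Doolittle form).

    Returns (perm, L, U) such that row i of L @ U equals row perm[i] of a,
    with L unit lower-triangular and U upper-triangular."""
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy, reduced to U in place
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: use the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        # Eliminate below the pivot: the O(n^3) trailing-submatrix update.
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]
            l[i][k] = m
            u[i][k] = 0.0
            for j in range(k + 1, n):
                u[i][j] -= m * u[k][j]
    return perm, l, u
```

Once `A` is factored, each linear solve reduces to two cheap triangular substitutions, which is why LU sits underneath so many larger linear algebra workloads.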

Unraveling Speedups: Two Important Questions

John Melonakos | Benchmarks, CUDA

One Jacket programmer recently emailed us the following:

"Our chief scientist asked me a question that I'd like to pass on to you. I think I know the answer, but you guys can be much more definitive than I can. He recently read about people achieving ~10x speedups by converting parts of their code to MEX files. He was wondering how much of the observed speedup is due to MEX and how much is due to CUDA and the GPU."

Two Questions You Should Ask Yourself

When contemplating an effort to optimize a piece of code, it is important to separate the effort into two questions. Both need to be addressed to improve performance: How well-written is …
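One way to make the chief scientist's question measurable is to time three versions of the same computation: the original M code, the same algorithm rewritten as compiled code running on the CPU, and that rewrite running on the GPU. The end-to-end speedup then splits exactly into two multiplicative factors, because T_orig / T_gpu = (T_orig / T_cpu_compiled) × (T_cpu_compiled / T_gpu). The sketch below is a hypothetical helper, and the timings in the usage note are illustrative, not measured; it does not reproduce the post's own two questions, which are truncated here.

```python
def decompose_speedup(t_original, t_compiled_cpu, t_gpu):
    """Split an end-to-end speedup into two multiplicative factors.

    t_original:     runtime of the original M code
    t_compiled_cpu: runtime of the rewritten (e.g. MEX) code on the CPU only
    t_gpu:          runtime of the rewritten code on the GPU
    Returns (rewrite_factor, gpu_factor, total), where
    total == rewrite_factor * gpu_factor."""
    rewrite_factor = t_original / t_compiled_cpu  # gain from rewriting the code
    gpu_factor = t_compiled_cpu / t_gpu           # gain from moving to the GPU
    return rewrite_factor, gpu_factor, t_original / t_gpu
```

With hypothetical timings of 10 s, 4 s, and 1 s, `decompose_speedup(10.0, 4.0, 1.0)` gives a 2.5x gain from the rewrite and a 4x gain from the GPU, multiplying to the 10x observed overall. Without the middle measurement, the two contributions cannot be separated.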