History of the Modern GPU Series

John Melonakos Computing Trends

Graham Singer over at TechSpot posted a series of articles a few weeks ago covering the history of the modern GPU. The series is well written and in-depth; for GPU aficionados, it is a nice read. There are 4 parts to the series:
Part 1: (1976 - 1995) The Early Days of 3D Consumer Graphics
Part 2: (1995 - 1999) 3Dfx Voodoo: The Game-changer
Part 3: (2000 - 2006) The Nvidia vs. ATI Era Begins
Part 4: (2006 - 2013) The Modern GPU: Stream processing units a.k.a. GPGPU
Enjoy!

Jacket v2.3 Now Available

John Melonakos Announcements, CUDA, Jacket 1 Comment

We are pleased to announce the new release of Jacket v2.3.  This new version of Jacket brings even greater performance improvements through GPU computing for MATLAB® codes.  (Click here to download v2.3) With v2.3, new support has been added for CUDA 5.0.  This newer version of CUDA enables computation on the latest Kepler K20 GPUs of the NVIDIA Tesla product line. This morning we received an email from a Jacket user who said, "V2.3 + CUDA 5 = wow. Just upgraded and re-ran one of the routines that previously took just under 4 minutes - now less than 2 minutes!" This is a must-have release for all Jacket users.  The performance improvements are generally felt across the board.  Existing Jacket ...

Image Processing with ArrayFire and OpenCV on the GPU

John Melonakos ArrayFire, C/C++, Case Studies, CUDA

ArrayFire is a great way to supplement OpenCV for faster processing on the GPU. Mcclanahoochie recently posted an interactive demo showing the use of OpenCV with ArrayFire for computing Local Contrast Enhancement on the GPU from webcam video. Mcclanahoochie also shows, with a code snippet, how easy it is to convert OpenCV Mat images into ArrayFire GPU array images. All the source code is available on Google Code, linked to from his website. Simply download ArrayFire and OpenCV and try it out for yourself!
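One detail any Mat-to-array conversion has to handle is memory layout: OpenCV stores a Mat row by row (row-major), while ArrayFire stores arrays column by column (column-major). The sketch below is not Mcclanahoochie's code and calls neither library. It only illustrates the reordering for a single-channel image in plain Python, and the function name and sample image are hypothetical.

```python
def row_major_to_column_major(pixels, rows, cols):
    """Reorder a flat row-major pixel buffer into column-major order."""
    if len(pixels) != rows * cols:
        raise ValueError("buffer length does not match rows * cols")
    # Walk the image column by column, reading each element
    # from its position in the row-major buffer.
    return [pixels[r * cols + c] for c in range(cols) for r in range(rows)]


# Hypothetical 2x3 grayscale image, stored row by row.
image = [1, 2, 3,
         4, 5, 6]
print(row_major_to_column_major(image, rows=2, cols=3))  # [1, 4, 2, 5, 3, 6]
```

In practice the same effect is usually achieved by transposing the image on one side of the transfer rather than reordering element by element.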

GPU Computing with Jacket in Automated Trader

John Melonakos Benchmarks, Case Studies, Jacket

The Q1 2012 issue of Automated Trader contains an excellent "Mashup!" piece reviewing software for algorithmic trading. The article provides a wonderful glimpse into the 1-2 month adventure of Andy Webb, Automated Trader's founder, and the Wrecking Crew as they built a fast trading platform from several technologies. We heartily recommend that those of you in financial computing subscribe to get the full story and access to ongoing developments from these Automated Trader thought leaders! The full trading platform they built was quite extensive. The part that caught our eye was the core computational component of the pipeline. That component involved permuting 1,000 potential pairs with cointegration tests for 350 time windows on each potential pair. The single-core MATLAB® version took 70 minutes ...

Getting More out of GPU Computing with LIBJACKET v1.0

John Melonakos Announcements, CUDA

LIBJACKET v1.0 is here! It is the Matrix Companion to CUDA, providing a high-productivity performance layer for GPU computing. Download now to start a free 15-day trial. It integrates seamlessly with any CUDA code, but can also be used to avoid writing complicated GPU kernels yourself via its matrix interface. Soak up its features here. We're celebrating this launch by offering two big promotions, one for existing Jacket programmers and one for the broader GPU computing community:
Existing Jacket customers get 50% off libJacket.
Buy a Tesla, Get a Free libJacket subscription.
Learn more about these offers. Here are some other links of interest to this launch:
Tour
Documentation
Function benchmarks
Press release
Over the years, we've been thrilled to ...

Hybrid GPU & Multicore Processing for LU Decomposition

Scott Benchmarks, Case Studies, CUDA

One of the hot areas in supercomputing is hybrid compute: balancing the computational load between one or more CPUs and GPUs. Along these lines, Nolan Davis and Daniel Redig at SAIC recently presented work on Hybrid GPU/Multicore Solutions for Large Linear Algebra Problems, where they developed a novel algorithm for LU decomposition, one of the most important routines in linear algebra. Here's a snapshot view of their setup:

System Specs:
GPU: Nvidia® Tesla™ 2050; 448 processing cores; 3 GB dedicated memory
Multicore Host: 24 cores; 64 GB system memory; Red Hat® Enterprise Linux 5; two AMD Opteron™ 6172 12-core processors
Host-to-GPU Communications: PCIE 2.0; 16 channels at 500 MB/sec/lane; theoretical peak bandwidth of 8 GB/sec

Their initial results are very promising. For ...
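For readers unfamiliar with the routine, LU decomposition factors a square matrix A into a lower-triangular L and an upper-triangular U, usually with row swaps recorded in a permutation so that PA = LU. The sketch below is a plain, single-threaded textbook version with partial pivoting, written in Python for readability. It is not the hybrid GPU/multicore algorithm from the SAIC work, and the function name and sample matrix are hypothetical.

```python
def lu_decompose(a):
    """Return (perm, l, u) such that row perm[i] of A equals row i of L*U."""
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy that becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda r: abs(u[r][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        # Eliminate column k from every row below the pivot.
        for i in range(k + 1, n):
            factor = u[i][k] / u[k][k]
            l[i][k] = factor
            for j in range(k, n):
                u[i][j] -= factor * u[k][j]
    return perm, l, u


# Hypothetical 3x3 example.
perm, l, u = lu_decompose([[2, 1, 1], [4, 3, 3], [8, 7, 9]])
print(perm)
```

The triple loop is where the O(n^3) cost lives, which is why the trailing-matrix update is the natural piece to split between CPU cores and a GPU in hybrid schemes.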

Unraveling Speedups: Two Important Questions

John Melonakos Benchmarks, CUDA 1 Comment

One Jacket programmer recently emailed the following to us:

"Our chief scientist asked me a question that I'd like to pass on to you. I think I know the answer, but you guys can be much more definitive than I can. He recently read about people achieving ~10x speedups by converting parts of their code to MEX files. He was wondering how much of the observed speedup is due to that MEX and how much is due to CUDA and the GPU."

Two Questions You Should Ask Yourself

When contemplating an effort to optimize a piece of code, it is important to unravel the effort into two separate questions. Both need to be addressed to improve performance: How well-written is ...
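One way to make the emailed question concrete is to time the same routine three ways: the original code, an optimized CPU rewrite (such as a MEX file), and a GPU version. The overall speedup then factors into a code-quality part and a hardware part. The sketch below is one reading of the question rather than the post's own analysis, and the timings are hypothetical placeholders, not measurements.

```python
def split_speedup(t_original, t_optimized_cpu, t_gpu):
    """Split an overall speedup into a CPU-rewrite part and a GPU part."""
    cpu_gain = t_original / t_optimized_cpu   # gain from better-written CPU code
    gpu_gain = t_optimized_cpu / t_gpu        # further gain from moving to the GPU
    total = t_original / t_gpu                # equals cpu_gain * gpu_gain
    return cpu_gain, gpu_gain, total


# Hypothetical timings in seconds.
cpu_gain, gpu_gain, total = split_speedup(100.0, 10.0, 2.0)
print(cpu_gain, gpu_gain, total)  # 10.0 5.0 50.0
```

Comparing the GPU version only against the original code would report the total figure and hide how much of it a CPU rewrite alone could have delivered.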

Stanford GPU Benchmarks: Jacket vs PCT/GPU

John Melonakos Benchmarks, Case Studies, CUDA

Researchers in the Pervasive Parallelism Laboratory at Stanford University recently published work describing a novel framework for parallel computing with a paper entitled, "A Domain-Specific Approach to Heterogeneous Parallelism." As part of their research, they compared Jacket to the GPU support in the Parallel Computing Toolbox™. The results clearly show that Jacket's optimizations make a big difference in performance. In this blog post, we highlight 4 algorithms included in their research:

NAME | DESCRIPTION | INPUT
Gaussian Discriminant Analysis (GDA) | Generative learning algorithm for modeling the probability distribution of a set of data as a multivariate Gaussian | 1,200x1,024 Matrix
Restricted Boltzmann Machine (RBM) | Stochastic recurrent neural network, without connections between hidden units | 2,000 Hidden Units, 2,000 Dimensions
Support Vector Machine (SVM) | Optimal ...

LIBJACKET on Amazon EC2 GPU Cloud Instances

Pavan Benchmarks, CUDA 1 Comment

Amazon recently added GPUs to their Elastic Compute Cloud. We decided to throw LIBJACKET into this GPU cloud to see how it would fare. The $2/hr pay-on-demand pricing is a great option for many Jacket programmers. This post is full of screenshots detailing the steps we took to get going with GPU computing in Amazon's cloud:
Sign up with Amazon EC2
Launch a GPU instance
Log in to the instance using ssh
Set up the environment
Download, build, and test LIBJACKET!
Everything in this post applies equally well to running Jacket for MATLAB® on EC2. Simply install MATLAB + Jacket in your Amazon GPU instance and start working over ssh.

GPU accelerated lattice Boltzmann model for shallow water flow and mass transport

John Melonakos Benchmarks, Case Studies, CUDA 3 Comments

Dr. Kevin Tubbs and Professor Tsai at Louisiana State University recently published an interesting paper using GPUs and Jacket to accelerate lattice Boltzmann models for shallow water flow and mass transport. More details about this work are provided in the full success story page on the website. Jacket makes GPU programming easy. "Very little recoding was needed to promote the LBM code to run on the GPU," say the authors at one point in their paper. In this blog post, we share the highlights of this work. Using these methods, the authors are able to simulate shallow water flow and mass transport. For instance, check out these videos of a dam break: The authors completed this work with a relatively older ...