Stanford GPU Benchmarks: Jacket vs PCT/GPU

John Melonakos | Benchmarks, Case Studies, CUDA

Researchers in the Pervasive Parallelism Laboratory at Stanford University recently published a paper entitled “A Domain-Specific Approach to Heterogeneous Parallelism,” which describes a novel framework for parallel computing. As part of their research, they compared Jacket to the GPU support in the Parallel Computing Toolbox™. The results clearly show that Jacket’s optimizations make a big difference in performance. In this blog post, we highlight four algorithms included in their research:

NAME | DESCRIPTION | INPUT
Gaussian Discriminant Analysis (GDA) | Generative learning algorithm for modeling the probability distribution of a set of data as a multivariate Gaussian | 1,200×1,024 matrix
Restricted Boltzmann Machine (RBM) | Stochastic recurrent neural network, without connections between hidden units | 2,000 hidden units, 2,000 dimensions
Support Vector Machine (SVM) | Optimal …
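To make the Gaussian Discriminant Analysis entry above more concrete, here is a minimal sketch of the idea in plain Python. It is a deliberately simplified one-dimensional, two-class version with a shared variance, not the benchmark code used in the Stanford paper; the function names `fit_gda_1d` and `predict_gda_1d` and the sample data are hypothetical.

```python
import math


def fit_gda_1d(xs, ys):
    """Fit a 1-D, two-class GDA model with a shared variance.

    Returns (phi, mu0, mu1, sigma2): the class-1 prior, the two class
    means, and the pooled variance (maximum-likelihood estimates).
    """
    m = len(xs)
    n1 = sum(ys)
    n0 = m - n1
    phi = n1 / m
    mu0 = sum(x for x, y in zip(xs, ys) if y == 0) / n0
    mu1 = sum(x for x, y in zip(xs, ys) if y == 1) / n1
    sigma2 = sum((x - (mu1 if y == 1 else mu0)) ** 2
                 for x, y in zip(xs, ys)) / m
    return phi, mu0, mu1, sigma2


def predict_gda_1d(x, params):
    """Pick the class with the larger log p(x | y) + log p(y)."""
    phi, mu0, mu1, sigma2 = params
    # The Gaussian normalising constant is shared, so it cancels out.
    score0 = -((x - mu0) ** 2) / (2 * sigma2) + math.log(1 - phi)
    score1 = -((x - mu1) ** 2) / (2 * sigma2) + math.log(phi)
    return 1 if score1 > score0 else 0


# Hypothetical toy data: class 0 clusters near 2, class 1 near 8.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
params = fit_gda_1d(xs, ys)
labels = [predict_gda_1d(x, params) for x in (2.5, 7.0)]
```

The multivariate case replaces the scalar means and variance with mean vectors and a covariance matrix, which is where the matrix-heavy work that benefits from a GPU comes from.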

LIBJACKET on Amazon EC2 GPU Cloud Instances

Pavan Yalamanchili | Benchmarks, CUDA | 1 Comment

Amazon recently added GPUs to their Elastic Compute Cloud (EC2). We decided to throw LIBJACKET into this GPU cloud to see how it would fare. The $2/hr pay-on-demand pricing is a great option for many Jacket programmers. This post is full of screenshots detailing the steps we took to get going with GPU computing in Amazon’s cloud:

1. Sign up with Amazon EC2
2. Launch a GPU instance
3. Log in to the instance using ssh
4. Set up the environment
5. Download, build, and test LIBJACKET!

Everything in this post applies equally well to running Jacket for MATLAB® on EC2. Simply install MATLAB + Jacket in your Amazon GPU instance and start working over ssh.

GPU accelerated lattice Boltzmann model for shallow water flow and mass transport

John Melonakos | Benchmarks, Case Studies, CUDA | 3 Comments

Dr. Kevin Tubbs and Professor Tsai at Louisiana State University recently published an interesting paper using GPUs and Jacket to accelerate lattice Boltzmann models (LBM) for shallow water flow and mass transport. More details about this work are provided in the full success story page on the website. Jacket makes GPU programming easy. “Very little recoding was needed to promote the LBM code to run on the GPU,” say the authors at one point in their paper. In this blog post, we share the highlights of this work. Using these methods, the authors are able to simulate shallow water flow and mass transport. For instance, check out these videos of a dam break: The authors completed this work with a relatively older …

Beam Propagation Methods – Jacket is 3.5X faster than the CPU and 2X faster than PCT

John Melonakos | Benchmarks, Case Studies, CUDA | 2 Comments

A couple of weeks ago, a GPU-enabled code appeared on MATLAB Central entitled “A CUDA accelerated Beam Propagation Method [BPM] Solver using the Parallel Computing Toolbox.” In this post, we share a video comparing Jacket and PCT at GPU computing by analyzing performance on this Beam Propagation Method code. To reproduce these results, download the source code here: CUDA_BPM_NOV_04_2010_AccelerEyes

These benchmarks were run on an NVIDIA Tesla C2070 GPU versus a quad-core Intel CPU. MATLAB + PCT R2010b were used for the PCT-GPU experiments. MATLAB + Jacket 1.6 (prerelease) were used for the Jacket-GPU experiments.

Take Home Message

Due to Jacket’s extensive library of GPU functions and its optimized GPU runtime, it performs 3.5X faster than …

A Jacket built for Speed

ArrayFire | Benchmarks, CUDA | 1 Comment

Just a few months ago, Jacket 1.4 was released sporting an improved MTIMES routine that brought about radical improvements to Jacket’s matrix multiplication. The quest for performance never ends, though. Now, in the release of Jacket 1.5, MTIMES is even faster than before for SGEMM routines. Check out the MTIMES Benchmarks wiki for more information. If you are attending GTC, you may want to attend this session also!

Jacket for MATLAB on HP Z Workstation series

John Melonakos | Benchmarks

AccelerEyes has had access to a couple of Z series workstations from Hewlett-Packard for Jacket testing. Our goal is to make GPU computing more economical and accessible to technical computing users by working with leading computer OEMs. According to most analysts that follow the workstation market, HP is a leader in workstations. Support for HP’s Z Workstations enables users who previously lacked the budget and programming knowledge to tap the power of GPU computing to solve their growing scientific and engineering problems. AccelerEyes has certified and completed performance testing on both the entry-level HP Z200 Workstation and the high-performance HP Z800 Workstation. The results from these tests can be reviewed at http://www.accelereyes.com/partners/hp. The HP Z Series …

Torben’s Corner – A GPU Computing Gem for Jacket Programmers!

John Melonakos | Benchmarks

In January, we introduced you to Torben’s Corner – a resource wiki created and maintained by Jacket programming guru Torben Larsen at Aalborg University in Denmark. Many Jacket programmers have gained valuable insights from Torben’s Corner, including GPU performance charts, coding guidelines, and special tricks. Since January, many wonderful additions have been made to Torben’s Corner. We think you will find value in not only this new information but the entire resource. Here is a quick summary of the most recent additions with links to the information:

Benchmarking Update

Torben’s Corner maintains a long list of benchmarks specifically detailing speedups of Jacket relative to standard MATLAB. This became an enormous task due to the sheer number of functions supported by Jacket …

Tesla C2050 versus C1060 on Real MATLAB Applications

John Melonakos | Benchmarks | 7 Comments

Following our recent Jacket v1.4 Fermi architecture release, many of you requested data comparing the new NVIDIA Fermi-based Tesla C2050 versus the older Tesla C1060. Over the years, AccelerEyes has developed an extensive suite of benchmark MATLAB applications, which are included in every Jacket installation. Using this suite of tests, we compared performance of the C2050 vs C1060 and are pleased to report the results here. We hope this information will be useful to Jacket programmers. All tests were run on the same standard workstation with Jacket 1.4. The only thing that changed was the actual GPU board. In every case the C2050 beat the C1060. Double-precision examples on the Fermi-based board outperformed the older board by 50% in every …
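As a rough illustration of how a board-to-board comparison like this can be summarized, the sketch below turns two wall-clock timings into a speedup factor and a percentage improvement. The timings are made-up placeholders, not measurements from this post, and reading "outperformed by 50%" as a 1.5X speedup is an assumption about the phrasing.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Speedup factor of the candidate over the baseline (e.g. 1.5 means 1.5X)."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def percent_faster(baseline_seconds, candidate_seconds):
    """Same comparison expressed as a percentage improvement."""
    return (speedup(baseline_seconds, candidate_seconds) - 1.0) * 100.0


# Hypothetical timings for one benchmark on the older and the newer board.
older_board_seconds = 3.0
newer_board_seconds = 2.0

factor = speedup(older_board_seconds, newer_board_seconds)
percent = percent_faster(older_board_seconds, newer_board_seconds)
print(f"{factor:.2f}X speedup, {percent:.0f}% faster")
```

Keeping both forms side by side avoids the common confusion between "50% faster" and "takes 50% of the time", which would be a 2X speedup.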

SGEMM, MTIMES & CUBLAS performance on the GPU

ArrayFire | Benchmarks, CUDA | 5 Comments

AccelerEyes is focused not only on providing the easiest-to-use GPU programming platform for CUDA-capable GPUs by leveraging the MATLAB® language; our engineering organization is also always looking for ways to improve performance in all areas of the Jacket platform. A case in point is some recent work on matrix multiplication, specifically single-precision general matrix multiply (SGEMM), or MTIMES. The Jacket 1.3 release was based on CUBLAS for matrix multiplication and, given the importance of matrix multiplication to so many of our customers, we decided to find out if we could improve the performance of the function.

Update: The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken. Have …
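For readers unfamiliar with how matrix-multiplication benchmarks are usually reported, the sketch below shows the standard bookkeeping: an M×K by K×N multiply is counted as 2·M·N·K floating-point operations (one multiply and one add per inner-product term), and throughput is that count divided by elapsed time. This is a plain-Python illustration on the CPU with hypothetical helper names; it does not call Jacket or CUBLAS and says nothing about their actual performance.

```python
import time


def gemm_flops(m, n, k):
    """Operation count conventionally used for an (m x k) times (k x n) multiply."""
    return 2 * m * n * k


def matmul(a, b):
    """Naive triple-loop matrix multiply on lists of lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def measure_gflops(m, n, k):
    """Time one multiply of all-ones matrices and report GFLOPS."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return gemm_flops(m, n, k) / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gflops(64, 64, 64):.4f} GFLOPS (naive Python, CPU)")
```

A real benchmark would repeat the timing several times, discard warm-up runs, and, on a GPU, synchronize the device before stopping the clock.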

NVIDIA Fermi with CUDA and OpenCL

ArrayFire | Benchmarks, CUDA, OpenCL | 1 Comment

In December of 2008, we published a blog post answering questions from customers and prospects about the use of OpenCL for Jacket. If you have not reviewed that blog post to gain some insight into our progress, you can access it here – http://blog.accelereyes.com/blog/2008/12/30/opencl/. Some things have changed since that original post. For example, NVIDIA now provides an OpenCL driver, toolkit, programming guide, and SDK examples. Given the new tools available and the new Fermi hardware, we ran some tests on the Tesla C2050 to compare OpenCL performance to CUDA performance. The Tesla C2050 is an amazing beast of a card, providing up to 512 Gigaflops of double precision arithmetic (at peak). Before we present the benchmarks, we should comment on …
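Peak figures like the one quoted above are normally derived from the hardware's core count and clock rate rather than measured. The sketch below shows that arithmetic. The inputs (448 CUDA cores, a 1.15 GHz processor clock, two floating-point operations per core per cycle via fused multiply-add, and double precision at half the single-precision rate) are commonly quoted Tesla C2050 specifications assumed here for illustration; they are not taken from this post.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle):
    """Theoretical peak throughput in GFLOPS."""
    return cores * clock_ghz * flops_per_core_per_cycle


# Assumed C2050 figures (not from the post): 448 cores at 1.15 GHz,
# with a fused multiply-add counted as 2 operations per cycle.
ASSUMED_CORES = 448
ASSUMED_CLOCK_GHZ = 1.15
ASSUMED_FLOPS_PER_CYCLE = 2

single_precision_peak = peak_gflops(
    ASSUMED_CORES, ASSUMED_CLOCK_GHZ, ASSUMED_FLOPS_PER_CYCLE)

# Assumption: double precision runs at half the single-precision rate.
double_precision_peak = single_precision_peak / 2

print(f"single precision: ~{single_precision_peak:.1f} GFLOPS")
print(f"double precision: ~{double_precision_peak:.1f} GFLOPS")
```

Under these assumptions the double-precision estimate comes out at roughly 515 GFLOPS, in the same range as the peak figure quoted above. Sustained performance on real kernels is typically well below any such theoretical peak.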