Dr. Kevin Tubbs and Professor Tsai at Louisiana State University recently published an interesting paper using GPUs and Jacket to accelerate lattice Boltzmann models for shallow water flow and mass transport. More details about this work are provided on the full success story page on the website. Jacket makes GPU programming easy. “Very little recoding was needed to promote the LBM code to run on the GPU,” say the authors in their paper. In this blog post, we share the highlights of this work. Using these methods, the authors are able to simulate shallow water flow and mass transport. For instance, check out these videos of a dam break: The authors completed this work with a relatively older …
Computer Vision Demos at SC’10 with 8-GPU Colfax CXT8000
We just returned from SC’10, the biggest supercomputing show of the year. At the show, we demoed Jacket driving computer vision demos on an 8-GPU Colfax CXT8000 system… pure eye candy! We had CPU and GPU versions of the demos running on 8 different monitors, each attached to one of the 8 Tesla C2050 GPUs in the system. Input data for the various demos was sourced from 3 webcams and 2 Blu-ray video inputs. Check out the demo details below: Demo 1 Sobel edge detection with image dilation and interpolation overlaid on Blu-ray video in real time. Demo 2 Feature detection on a 4-level pyramid of 640×480 real-time webcam video. Demo 3 Gradient descent feature tracking, a stripped-down version of KLT, tracking …
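For readers unfamiliar with the Sobel operator named in Demo 1, here is a minimal CPU-only sketch in plain Python. It is not the Jacket demo code: the function name `sobel_magnitude`, the nested-list image format, and the choice to leave border pixels at zero are all assumptions made for illustration.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (GX) and vertical (GY) gradients.
GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2D list of numbers.

    Border pixels are left at 0.0 (an assumption; other border
    policies such as clamping or mirroring are equally valid).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = 0
            gy = 0
            # Accumulate both gradients over the 3x3 neighbourhood.
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += GX[ky][kx] * pixel
                    gy += GY[ky][kx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 255, 255] for _ in range(4)]
    for row in sobel_magnitude(step):
        print(row)
```

Every output pixel depends only on its own 3×3 neighbourhood, which is why this kind of filter maps well onto a GPU.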
Beam Propagation Methods – Jacket is 3.5X faster than the CPU and 2X faster than PCT
A couple of weeks ago, a GPU-enabled code appeared on MATLAB Central entitled “A CUDA accelerated Beam Propagation Method [BPM] Solver using the Parallel Computing Toolbox.” In this post, we share a video which showcases how Jacket outperforms PCT at GPU computing, by analyzing performance on this Beam Propagation Method code. To reproduce these results, download the source code here: CUDA_BPM_NOV_04_2010_AccelerEyes. These benchmarks were run on an NVIDIA Tesla C2070 GPU versus a quad-core Intel CPU. MATLAB + PCT R2010b were used for the PCT-GPU experiments. MATLAB + Jacket 1.6 (prerelease) were used for the Jacket-GPU experiments. Take Home Message: Due to Jacket’s extensive library of GPU functions and its optimized GPU runtime, it performs 3.5X faster than …
Speeding up critical code
With Jacket 1.5, we released a big new feature: GCOMPILE. This allows you to convert critical sections of your MATLAB code directly into GPU kernels to further increase speed. We introduced the prototype in an earlier post and have been working with several beta users over the past month to get it ready. In this post, we’ll give some more details and start to look at the speedups you can quickly and easily achieve. You can find more information about it on the wiki. One of the best features of GCOMPILE is that you can now use IF statements, WHILE loops, and FOR loops in your code. Make sure to check out the wiki pages about these and the other …
GPU Giddy – Excitement Building for GTC
GTC is coming up… The GPU Technology Conference (GTC) starts later this month and is sure to generate a new level of excitement and energy around GPU computing. The conference includes over 250 technology sessions presented by industry, government, and academic technology leaders. AccelerEyes is pleased to be well represented at this year’s conference by our technical leadership and a number of our customers. If you plan to attend the conference, be sure to include the sessions outlined below on your agenda. In addition to being well represented, we are also flattered to see that others in the market have recognized that GPU Computing with MATLAB delivers clear productivity gains and that the performance improvements made possible by GPUs are …
Tesla C2050 versus C1060 on Real MATLAB Applications
Following our recent Jacket v1.4 Fermi architecture release, many of you requested data comparing the new NVIDIA Fermi-based Tesla C2050 versus the older Tesla C1060. Over the years, AccelerEyes has developed an extensive suite of benchmark MATLAB applications, which are included in every Jacket installation. Using this suite of tests, we compared performance of the C2050 vs C1060 and are pleased to report the results here. We hope this information will be useful to Jacket programmers. All tests were run on the same standard workstation with Jacket 1.4. The only thing that changed was the actual GPU board. In every case the C2050 beat the C1060. Double-precision examples on the Fermi-based board outperformed the older board by 50% in every …
NVIDIA Fermi with CUDA and OpenCL
In December of 2008, we did a blog post answering questions from customers and prospects about the use of OpenCL for Jacket. If you have not reviewed that blog post to gain some insight into our progress, you can access it here – http://blog.accelereyes.com/blog/2008/12/30/opencl/. Some things have changed since that original post. For example, NVIDIA now provides an OpenCL driver, toolkit, programming guide, and SDK examples. Given the new tools available and the new Fermi hardware, we ran some tests on the Tesla C2050 to compare OpenCL performance to CUDA performance. The Tesla C2050 is an amazing beast of a card, providing up to 512 Gigaflops of double precision arithmetic (at peak). Before we present the benchmarks, we should comment on …
Jacket and GPUs show promise in Neuroscience with fMRI and SPM
If you are interested in neuroscience and neuroimaging, you have probably heard of a software package called SPM, or Statistical Parametric Mapping, developed by a group at University College London. Well, a group at Georgia Tech has been doing some work with Jacket and CUDA on SPM and has produced some promising initial results. Being able to speed up the image analysis of functional MRI can benefit the medical community in a big way. AccelerEyes has been supporting these efforts at Georgia Tech, and with the permission of the authors we have produced an initial look at their work. Enjoy. http://www.accelereyes.com/resources/spm-fmri
Median Filtering: CUDA tips and tricks
Last week we posted a video recording from NVIDIA’s GTC09 conference. In the video, I walked through median filtering, presenting the vanilla implementation and then walking through progressive CUDA optimizations. A comment on that post suggested trying some other compiler flags, and it sparked a new series of experiments. In the original video, we started with a vanilla CPU implementation of 3×3 median filtering. We then ported this to the GPU to realize some immediate gains, and then began a string of optimizations to see how far we could drive up performance: switching to texture memory, switching to shared memory, switching the internal sorting of pixels, etc. The conclusion: pay attention to the resource usage reported by nvcc (registers, …
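To make the starting point concrete, the following is a minimal sketch of what a vanilla CPU 3×3 median filter can look like, written in plain Python. It is not the code from the video: the function name `median_filter_3x3`, the nested-list image format, and the clamp-to-edge border handling are assumptions made for illustration.

```python
def median_filter_3x3(image):
    """Return a 3x3 median-filtered copy of a 2D list of numbers.

    Border pixels are handled by clamping neighbour coordinates to
    the image edge (an assumption; the original post does not say
    which border policy was used).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = []
            # Gather the 9 neighbours, clamping at the borders.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    window.append(image[yy][xx])
            # The median of 9 values is the 5th smallest.
            window.sort()
            out[y][x] = window[4]
    return out


if __name__ == "__main__":
    # A single bright outlier in a flat region is removed by the filter.
    noisy = [
        [10, 10, 10],
        [10, 255, 10],
        [10, 10, 10],
    ]
    for row in median_filter_3x3(noisy):
        print(row)
```

The per-pixel sort of nine values is the "internal sorting of pixels" step that the optimizations mentioned above target. On a GPU, each output pixel would typically be computed by its own thread.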
A case study in CUDA optimization
Jimi Malcolm, VP of Engineering and Co-founder of AccelerEyes, takes about 15 minutes to share strategies for optimizing CUDA code. Watch the video below to find out what goes into planning CUDA development for maximum performance. Jimi uses Median Filtering for this case study.