We are pleased to announce Jacket 1.4, with support for the latest NVIDIA graphics processing units based on the Fermi architecture (Tesla 20-series and GeForce GTX 4xx-series). NVIDIA’s release of the Fermi architecture brings with it 448 computational cores, increased IEEE-754 floating-point arithmetic precision, error-correcting memory for reliable computation, and enhanced memory caching mechanisms. Highlights for Jacket 1.4 are as follows:
- Added support for the NVIDIA Fermi architecture (GTX400 and Tesla C2000 series)
- Jacket DLA support for Fermi
- Dramatically improved the performance of Jacket’s JIT (Just-In-Time) compilation technology
- Operations involving random scalar constants do not incur a recompile
- Removed dependencies on MINGW and NVCC
- Logical indexing now supported for SUBSREF and SUBSASGN, e.g. B = A(A > x)
- MTIMES supports …
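The logical indexing highlight can be illustrated with a minimal sketch in plain MATLAB. The data values and the threshold `x` are hypothetical, and no Jacket-specific calls are used; the assumption is only that Jacket 1.4 accepts the same expressions on GPU arrays, as the highlight states.

```matlab
% Plain MATLAB sketch of logical indexing (no Jacket-specific calls).
A = [4 -1 7; 0 9 -3];   % example data (hypothetical values)
x = 2;                  % threshold (hypothetical)

% SUBSREF with a logical mask: select the elements of A greater than x.
B = A(A > x);           % column vector, elements taken in column-major order

% SUBSASGN with a logical mask: overwrite the selected elements.
C = A;
C(C > x) = 0;           % every element greater than x becomes 0
```

Both forms avoid an explicit loop over the elements, which is what makes them a natural fit for vectorized execution.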
SGEMM, MTIMES & CUBLAS performance on the GPU
AccelerEyes is focused not only on providing the easiest-to-use GPU programming platform for CUDA-capable GPUs by leveraging the MATLAB® language; our engineering organization is also always looking for ways to improve performance in all areas of the Jacket platform. A case in point is some recent work on matrix multiplication, specifically SGEMM (Single-precision General Matrix Multiply), or MTIMES. The Jacket 1.3 release was based on CUBLAS for matrix multiplication, and given the importance of matrix multiplication to so many of our customers, we decided to find out whether we could improve the performance of this function. Update: The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken. Have …
Power Flow with Jacket & MATLAB on the GPU!
Learn how Jacket, GPUs, and MATLAB can deliver order-of-magnitude performance improvements over CPU-based solutions for power flow studies. AccelerEyes, in collaboration with the Indian Institute of Technology in Roorkee, has developed this case study to illustrate the ability to study power flow models on graphics processing units using Jacket and MATLAB. Implementation on the GPU is 35 times faster than CPU alternatives. http://www.accelereyes.com/resources/powerflow
Crushing MATLAB Loop Runtimes with BSXFUN
Among the slowest blocks of code that inflate runtimes in MATLAB are for/while loops. In this blog post, I’m going to talk about a little-known way of crushing MATLAB loop runtimes for many commonplace use cases by utilizing one of the most underrated and unknown functions in MATLAB’s repertoire: bsxfun. Using this function, one can break seemingly iterative code into clean, vectorized snippets that beat the socks off even MATLAB’s JIT engine. Better still, Jacket fully supports bsxfun, meaning that if you thought a vectorized loop was fast, you haven’t seen anything yet. Also, in the end, a loop represented using bsxfun is just good programming practice. As we’ll see, the technique I’m going to describe is …
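As a minimal sketch of the idea, the snippet below centers the columns of a matrix first with a loop and then with bsxfun. The matrix and variable names are hypothetical and are not taken from the full post; the code is plain MATLAB and does not use any Jacket-specific calls.

```matlab
X  = magic(4);          % 4x4 example matrix (hypothetical data)
mu = mean(X, 1);        % 1x4 row vector of column means

% Loop version: subtract each column's mean one column at a time.
Y_loop = zeros(size(X));
for j = 1:size(X, 2)
    Y_loop(:, j) = X(:, j) - mu(j);
end

% bsxfun version: mu is virtually expanded along its singleton (row)
% dimension to match X, so the subtraction is one vectorized call.
Y_bsx = bsxfun(@minus, X, mu);
```

The two results are identical; the bsxfun form simply states the whole operation at once instead of column by column.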
Jacket with MATLAB for Optics and DSP
Over the last month I have heard many Jacket customers talk about their use of the Jacket platform for MATLAB to solve optics problems. NASA and the University of Rochester are two that come to mind immediately. We found some recent work showing how Jacket can be used to solve an optical flow problem using the Horn and Schunck method and thought it might be useful to share. In addition, last week Seth Benton, a blogger for dsprelated.com, shared his experience of working with Jacket. After about a week of getting up to speed and running some examples, his experience is worth sharing if you have not already seen it.
Accelerate Computer Vision Data Access Patterns with Jacket & GPUs
For computer vision, we’ve found that efficient implementations require a new data access pattern that MATLAB does not currently support. MATLAB and the M language are great for linear algebra, where blocks of matrices are the typical access pattern, but not for computer vision, where algorithms typically operate on patches of imagery. For instance, to pull out patches of imagery in M, one must write a doubly nested for loop,

    A = rand(100,100);
    for xs = -W:W
        for ys = -W:W
            patch(xs+W+1, ys+W+1) = A(xs+1+x, ys+1+y);
        end
    end

…with guards for boundary conditions, etc. It gets even more complicated with non-square patches. On top of that, these implementations don’t translate to the GPU’s memory hierarchy at all and are thus …
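To make the "guards for boundary conditions" concrete, here is a minimal sketch of the nested-loop patch extraction with an explicit bounds check. It rests on assumptions that are not in the excerpt: the patch center (x, y) is a 1-based pixel index (so the extra +1 offset in the loop above is dropped), pixels outside the image are left as zero, and the image and center values are hypothetical.

```matlab
A = reshape(1:25, 5, 5);   % small example image (hypothetical data)
W = 1;                     % half-width, so the patch is (2W+1)-by-(2W+1)
x = 1; y = 3;              % patch center, 1-based (hypothetical values)

patch = zeros(2*W + 1);    % out-of-bounds pixels stay zero (assumed padding)
for xs = -W:W
    for ys = -W:W
        r = x + xs;        % row index into A
        c = y + ys;        % column index into A
        % Guard: copy the pixel only if (r, c) lies inside the image.
        if r >= 1 && r <= size(A, 1) && c >= 1 && c <= size(A, 2)
            patch(xs + W + 1, ys + W + 1) = A(r, c);
        end
    end
end
```

With the center on the top edge of the image, the first row of the patch falls outside A and stays zero. Every pixel costs a branch and a scalar copy, which is the overhead the excerpt is describing.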
Jacket in a GPU Cloud
Wow! Jacket is now running MATLAB on a GPU cloud server from Penguin Computing! We were setting up demos today at SuperComputing 2009 and just got things set up inside Penguin Computing’s booth. Jacket is now running as compiled MATLAB code on Penguin Computing’s POD (Penguin on Demand) cloud service! The OpenGL visualizations are running without a hitch through VGL, so for everyone on the forums, this seems like another effective method of running Jacket remotely, at least on Linux. Does anyone know if VGL runs under Windows? Penguin Computing was using it quite effectively; their setup was very slick!
Developer SDK Upgrade
In Jacket v1.1, an optional Developer SDK Upgrade is available. This upgrade lets you integrate custom CUDA code for use with MATLAB. With a few simple jkt functions (which mimic standard MEX API functions), you can integrate custom CUDA kernels into Jacket. This task is as simple as replacing the main function in your program with jktFunction, which is used in place of mexFunction for integration of CUDA code into MATLAB and Jacket. This serves as an entry point to Jacket’s runtime. Within a jktFunction, you have access to several jkt API functions to do tasks such as getting input from MATLAB, allocating device memory, calling the CUDA kernels, and casting the kernel’s output to …
Commentary on Jacket v1.1
I’m pleased to announce the release of Jacket v1.1! This release represents a major milestone in Jacket’s development and a great boost in functionality for Jacket customers. The major feature of this release is the inclusion of new GPU datatypes, most notably double-precision. We are very pleased with the performance we’ve seen for double-precision computations. At the time of this writing, the NVIDIA Tesla T10 series is the newest GPU on the market and NVIDIA’s first in what will become a great line of double-precision enabled GPUs. Even on this first double-precision generation card, we are seeing ~20x speedups for a lot of our examples and test cases. Of course, GPUs still give higher speedups when comparing single-precision GPU to …
Data-parallelism vs Task-parallelism
In order to understand how Jacket works, it is important to understand the difference between data parallelism and task parallelism. There are many ways to define this, but simply put and in our context: Task parallelism is the simultaneous execution on multiple cores of many different functions across the same or different datasets. Data parallelism (aka SIMD) is the simultaneous execution on multiple cores of the same function across the elements of a dataset. Jacket focuses on exploiting data parallelism or SIMD computations. The vectorized MATLAB language is especially conducive to good SIMD operations (more so than a non-vectorized language such as C/C++). And if you’re going to need a vectorized notation to achieve SIMD computation, why not choose the …
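The two definitions can be contrasted with a small sketch. It only illustrates the shape of each pattern: plain MATLAB evaluates both parts sequentially here, and the dataset and the three example tasks are hypothetical.

```matlab
x = 1:8;                                % example dataset (hypothetical)

% Data-parallel pattern: the SAME function applied to every element.
% Each output element depends only on the matching input element, so
% all eight could be computed simultaneously on separate cores.
y = 2 .* x + 1;

% Task-parallel pattern: DIFFERENT functions that are independent of
% one another and could each run on its own core.
tasks   = {@sum, @max, @(v) numel(v)};  % three unrelated tasks
results = cellfun(@(f) f(x), tasks);    % evaluated one after another here
```

The first pattern is the one a vectorized notation expresses directly, since a single expression already describes the work for every element.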