Just a few months ago, Jacket 1.4 was released sporting an improved MTIMES routine that dramatically accelerated Jacket’s matrix multiplication. The quest for performance never ends, though. Now, with the release of Jacket 1.5, MTIMES is even faster than before for SGEMM routines. Check out the MTIMES Benchmarks wiki for more information. If you are attending GTC, you may want to attend this session as well!
Jacket for MATLAB now available for NVIDIA Fermi!
We are pleased to announce Jacket 1.4, with support for the latest NVIDIA graphics processing units based on the Fermi architecture (Tesla 20-series and GeForce GTX 4xx-series). NVIDIA’s release of the Fermi architecture brings with it 448 computational cores, increased IEEE-754 floating-point arithmetic precision, error-correcting memory for reliable computation, and enhanced memory caching mechanisms. Highlights for Jacket 1.4 are as follows:
- Added support for the NVIDIA Fermi architecture (GTX400 and Tesla C2000 series)
- Jacket DLA support for Fermi
- Dramatically improved the performance of Jacket’s JIT (Just-In-Time) compilation technology
- Operations involving random scalar constants do not incur a recompile
- Removed dependencies on MINGW and NVCC
- Logical indexing now supported for SUBSREF and SUBSASGN, e.g. B = A(A > x) (see the sketch below)
- MTIMES supports …
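Here is a minimal sketch of the new logical indexing on GPU data. The grand call (Jacket’s GPU random-matrix generator) and the exact details are illustrative, not taken from the release notes:

    A = grand(5);        % 5x5 random matrix resident on the GPU
    x = 0.5;
    B = A(A > x);        % SUBSREF with a logical mask: keep elements above x
    A(A > x) = 0;        % SUBSASGN with a logical mask: zero those elements out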
SGEMM, MTIMES & CUBLAS performance on the GPU
AccelerEyes is focused not only on providing the easiest-to-use GPU programming platform for CUDA-capable GPUs by leveraging the MATLAB® language; our engineering organization is also always looking for ways to improve performance across the Jacket platform. A case in point is some recent work on matrix multiplication, specifically SGEMM (Single-precision General Matrix Multiply), or MTIMES. The Jacket 1.3 release relied on CUBLAS for matrix multiplication, and given the importance of matrix multiplication to so many of our customers, we decided to find out whether we could improve the performance of the function. Update: The new MTIMES routine in Jacket 1.4 has improved significantly since these benchmarks of the Release Candidate were taken. Have …
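For readers who want to reproduce this kind of measurement, here is a rough timing sketch for SGEMM via MTIMES on the GPU. It assumes Jacket’s gsingle type and a gsync-style synchronization call; treat both, and the matrix size, as illustrative rather than the harness we actually used:

    n = 2048;
    A = gsingle(rand(n));        % single-precision matrices on the GPU
    B = gsingle(rand(n));
    gsync;                       % let pending GPU work drain before timing
    tic;
    C = A * B;                   % MTIMES dispatches to the GPU SGEMM
    gsync;                       % wait for the multiply to finish
    t = toc;
    gflops = 2 * n^3 / t / 1e9   % standard GEMM operation count: 2*n^3 flops

The 2*n^3 figure counts one multiply and one add per inner-product term, which is the convention used in most GEMM benchmarks.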
NVIDIA Fermi with CUDA and OpenCL
In December of 2008, we did a blog post answering questions from customers and prospects about the use of OpenCL for Jacket. If you have not reviewed that blog post to gain some insight into our progress, you can access it here – http://blog.accelereyes.com/blog/2008/12/30/opencl/. Some things have changed since that original post. For example, NVIDIA now provides an OpenCL driver, toolkit, programming guide, and SDK examples. Given the new tools available and the new Fermi hardware, we ran some tests on the Tesla C2050 to compare OpenCL performance to CUDA performance. The Tesla C2050 is an amazing beast of a card, providing up to 512 gigaflops of double-precision arithmetic (at peak). Before we present the benchmarks, we should comment on …
Median Filtering: CUDA tips and tricks
Last week we posted a video recording from NVIDIA’s GTC09 conference. In the video, I walked through median filtering, presenting the vanilla implementation and then walking through progressive CUDA optimizations. A comment on that post suggested trying some other compiler flags, and it sparked a new series of experiments. In the original video, we started with a vanilla CPU implementation of 3×3 median filtering (sketched below for reference). We then ported this to the GPU to realize some immediate gains, and then began a string of optimizations to see how far we could drive up performance: switching to texture memory, switching to shared memory, switching the internal sorting of pixels, etc. The conclusion: pay attention to the resource usage reported by nvcc (registers, …
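For reference, here is what the vanilla baseline looks like as a MATLAB function. The implementation in the video is CPU C code, so this is only an illustration of the algorithm itself:

    function out = median3x3(img)
    % Vanilla 3x3 median filter: replace each interior pixel with the
    % median of its 3x3 neighborhood. Border pixels are left untouched.
    [h, w] = size(img);
    out = img;
    for r = 2:h-1
        for c = 2:w-1
            win = img(r-1:r+1, c-1:c+1);   % 3x3 neighborhood
            out(r, c) = median(win(:));    % median of the 9 pixels
        end
    end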
A case study in CUDA optimization
Jimi Malcolm, VP of Engineering and Co-founder of AccelerEyes, takes about 15 minutes to share CUDA optimization strategies for maximizing the performance of CUDA code. Watch the video below to find out what needs to go into strategizing CUDA development to maximize performance. Jimi uses median filtering for this case study.
Using Parallel For Loops (parfor) with MATLAB® and Jacket
MATLAB® parallel for loops (parfor) allow the body of a for loop to be executed across multiple workers simultaneously, but with some significant restrictions. With Jacket MGL, Jacket can be used within parfor loops, subject to the same restrictions (see the sketch below). However, it is important to note that Jacket MGL does not currently support co-distributed arrays.

Problem Size

Problem size might be the single most important consideration in parallelization using the Parallel Computing Toolbox (PCT) and Jacket MGL. When data is used by a worker in the MATLAB pool, it must be copied from MATLAB to the worker, and must be copied back when the computation is complete. Additionally, when GPU data is used, it must then be copied by the worker …
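As a concrete illustration of Jacket inside parfor, here is a hedged sketch. It assumes an era-appropriate matlabpool, Jacket’s grand generator, and that results are gathered back to the CPU (e.g. via double) before leaving the worker, since GPU arrays cannot cross the worker boundary; the names are illustrative:

    matlabpool open 4                     % start a pool of PCT workers
    results = zeros(1, 8);
    parfor i = 1:8
        A = grand(2000);                  % 2000x2000 random matrix on the GPU
        B = A * A;                        % GPU matrix multiply on this worker
        results(i) = double(sum(B(:)));   % gather the scalar back to the CPU
    end
    matlabpool close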
Streaming data to the GPU
Learn how to stream data directly to the GPU using the Jacket SDK.
Developer SDK Upgrade
In Jacket v1.1, an optional Developer SDK Upgrade is available. This upgrade lets you integrate custom CUDA code for use with MATLAB. With a few simple jkt functions (which mimic standard MEX API functions), you can integrate custom CUDA kernels into Jacket. The task is as simple as replacing the main function in your program with jktFunction, which is used in place of mexFunction to integrate CUDA code into MATLAB and Jacket, and serves as an entry point into Jacket’s runtime. Within a jktFunction, you have access to several jkt API functions for tasks such as getting input from MATLAB, allocating device memory, calling the CUDA kernels, and casting the kernel’s output to …
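From the MATLAB side, a compiled jktFunction is invoked just like a MEX function. In this sketch, mycustomop is a hypothetical name for your compiled CUDA kernel wrapper, and gsingle is Jacket’s single-precision GPU type:

    A = gsingle(rand(1000));   % Jacket GPU array handed straight to the kernel
    B = mycustomop(A);         % hypothetical jktFunction, called like any MEX file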