Benchmarking the new Kepler (GTX 680)

Pavan Yalamanchili · Benchmarks, CUDA · 13 Comments

NVIDIA has launched its next-generation GPU, based on the Kepler architecture, and followed it up with a rather quick update to the CUDA toolkit. Since we have access to three generations of GTX cards (the 480, 580, and 680), we thought we would showcase how performance has changed over the generations. Matrix multiplication: the GTX 680 comfortably breaches the 1 teraflop mark for single precision, while the GTX 580 barely scratches it. However, performance appears to peak around 2048 x 2048 and then falls off to match the GTX 580 at larger sizes. The high-end Tesla C2070 finishes last for single precision, behind the third-placed …
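As a point of reference, below is a minimal sketch of how such a matrix-multiply benchmark can be written with ArrayFire's C++ API. The harness (sizes, timing loop, GFLOPS formula) is our own illustration, not the exact code behind the numbers above.

    #include <arrayfire.h>
    #include <cstdio>

    int main() {
        for (int n = 512; n <= 4096; n *= 2) {
            af::array A = af::randu(n, n);   // single precision by default
            af::array B = af::randu(n, n);
            af::matmul(A, B).eval();         // warm-up run
            af::sync();

            af::timer t = af::timer::start();
            af::array C = af::matmul(A, B);
            C.eval();
            af::sync();                      // wait for the GPU to finish
            double s = af::timer::stop(t);

            // a matrix multiply does roughly 2*n^3 floating-point operations
            printf("%5d x %-5d : %8.1f GFLOPS\n", n, n,
                   2.0 * n * n * n / s / 1e9);
        }
        return 0;
    }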

ArrayFire for Defense and Intelligence Applications

ArrayFire · C/C++, Case Studies, CUDA, Events, Fortran · Leave a Comment

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library and its applications to defense and intelligence work. Defense projects often have hard deadlines and definite speed targets, and ArrayFire is a fast and easy-to-use choice for these applications. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of Jacket and ArrayFire, while interacting with AccelerEyes GPU computing experts. John Melonakos, our CEO, introduced ArrayFire and talked about some exciting recent customer successes in the field of defense. He then walked through the mechanics of compiling and running code on a machine with 2 Quadro 6000 GPUs and shared further customer success stories. …

No Free Lunch for GPU Compiler Directives

John Melonakos · ArrayFire, C/C++, CUDA, Fortran · 3 Comments

Last week, Steve Scott at NVIDIA put up a viral post entitled "No Free Lunch for Intel MIC (or GPU's)." It was a great read and a big hit in technical computing circles. The centerpiece of Scott's argument is that there are no magic compilers: GPUs don't have them, and neither will MIC. No compiler will be able to automatically recompile existing code and get great performance from MIC or GPUs. Rather, it takes a good amount of elbow grease to write high-performance code. We totally agree. The problem Scott addresses is real. Despite marketing spin to the contrary, developing code for GPUs requires work. However, we don't agree with Scott's conclusion that compiler directives are a good solution. You can't fight …
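For readers unfamiliar with the approach being debated: compiler directives annotate ordinary loops and leave the parallelization strategy to the compiler. A minimal OpenACC-style sketch (our own illustration, not code from Scott's post or from ArrayFire):

    // The pragma asks the compiler to offload the loop to an accelerator;
    // data movement and scheduling are left to the compiler's judgment,
    // which is exactly where the performance-tuning burden reappears.
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }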

Jacket v2.1 Now Available

Scott · Announcements, CUDA · 2 Comments

Optimization Library, Sparse Functionality, Graphics Library Improvements, CUDA 4.1 Enhancements, and much more… AccelerEyes announces the release of Jacket v2.1, adding GPU computing capabilities for use with MATLAB®. Jacket v2.1 delivers even more speed through a host of new improvements that maximize GPU device performance and utilization. Notable new features include an Optimization Library and additional functions in our Graphics Library. With Jacket v2.1, we have also extended support for sparse matrix subscripting and improved host-to-device and device-to-host data transfer speeds for complex data. In addition, we have included various GFOR enhancements. Jacket v2.1 also incorporates NVIDIA CUDA 4.1 enhancements for improved functionality and performance (requires the latest drivers). Jacket is the premier GPU software plugin for MATLAB®, better than alternative …
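For those new to GFOR, the feature mentioned above: it runs every iteration of a data-parallel loop simultaneously on the GPU. The sketch below shows the concept using the gfor construct from ArrayFire's C++ API, an analogue of Jacket's MATLAB gfor/gend (our own illustration, not Jacket v2.1 code itself):

    #include <arrayfire.h>

    int main() {
        const int n = 1000;
        af::array A = af::randu(3, 3, n);        // n independent 3x3 matrices
        af::array B = af::constant(0, 3, 3, n);

        // all n iterations execute at once on the GPU instead of serially
        gfor (af::seq i, n) {
            B(af::span, af::span, i) = 2 * A(af::span, af::span, i) + 1;
        }
        B.eval();
        return 0;
    }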

12,288 CUDA Cores in One Computer

John Melonakos · Announcements, CUDA · 3 Comments

Kepler is here. And it's fantastic! The news came out today that the first Kepler GPU, the GeForce GTX 680, has been launched. A single GPU has 1,536 CUDA cores. This means that those high-end workstations with 8 PCIe slots will be able to pack 12,288 CUDA cores into a single computer. That's some serious computational power. Current high-end Fermi cards have 512 cores, so the new Kepler architecture boasts 3X the number of compute cores. Normally we focus on the higher-end Tesla products because those more aptly fit the needs of our science, engineering, and financial computing readers. But we are excited nonetheless by this GeForce GPU. It is a major step forward in GPU technology. And this GeForce card portends …

ArrayFire for Financial Computing Applications

ArrayFire · Case Studies, Events · Leave a Comment

In case you missed it, we recently held a webinar on how to accelerate financial computing applications using Jacket. The performance advantages that Jacket and GPUs bring to financial computing algorithms represent the best way to accelerate MATLAB® code. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of Jacket and ArrayFire, while interacting with AccelerEyes GPU computing experts. Scott Blakeslee, our Director of Business Development, introduced Jacket and talked about some exciting recent customer successes in the field of financial computing. Gallagher Pryor, CTO of AccelerEyes, then demoed some financial code speedups on one of our office machines. The major takeaway from the webinar video was that Jacket is …

ArrayFire Pro : Features and Scalability

ArrayFire · ArrayFire, C/C++, CUDA, Fortran · Leave a Comment

ArrayFire is a fast GPU library that offloads compute-intensive tasks onto many-core GPUs, reducing application runtime and accelerating applications many times over. ArrayFire is built on top of the NVIDIA CUDA software stack, currently the best and most stable software development kit available for GPU computing. ArrayFire comes with a huge set of functions spanning domains such as image processing, signal processing, financial modeling, and applications requiring graphics support. ArrayFire uses an array-based notation (supporting N-dimensional arrays) and allows sub-referencing and assignment into these multi-dimensional arrays. The following code snippet shows how you can index into array objects:

    // Generate a 3×3 array of random numbers on the GPU
    array A = randu(3,3);
    array a1 …
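Since the published excerpt truncates the snippet, here is a short self-contained sketch of the same sub-referencing and assignment in ArrayFire's C++ API (the variable names and selections below are our own, not the continuation of the original snippet):

    #include <arrayfire.h>
    using namespace af;

    int main() {
        array A = randu(3, 3);                 // 3x3 random array on the GPU

        array row0 = A(0, span);               // first row
        array col2 = A(span, 2);               // last column
        array blk  = A(seq(0, 1), seq(0, 1));  // top-left 2x2 block

        A(span, 0) = 0;                        // assign into the first column
        af_print(A);
        return 0;
    }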

GPU Computing with Jacket in Automated Trader

John Melonakos · Benchmarks, Case Studies · Leave a Comment

The Q1 2012 issue of Automated Trader contains an excellent "Mashup!" piece reviewing software for algorithmic trading. The article provides a wonderful glimpse into the 1-2 month adventure of Andy Webb, Automated Trader's Founder, and the Wrecking Crew as they built a fast trading platform from several technologies. We heartily recommend that those of you in financial computing subscribe to get the full story and access to ongoing developments from these Automated Trader thought leaders! The full trading platform they built was quite extensive. The part that caught our eye was the core computational component of the pipeline: permuting 1,000 potential pairs and running cointegration tests over 350 time windows on each pair, 350,000 tests in all. The single-core MATLAB® version took 70 minutes …

Jacket Continues to Crush the Clone

John Melonakos · ArrayFire · 6 Comments

This morning, I woke up to find the following comment in the MATLAB® Newsgroup: Over two years ago, MathWorks® started to build a clone of Jacket, which you now know as the GPU computing support in the Parallel Computing Toolbox™. At the time, there were many naysayers suggesting that Jacket would somehow be eclipsed by the clone. Made sense, right? Wrong! Here we are two years later, and the clone is still a poor imitation. There are several technical reasons for this, but if you are serious about getting great performance from your GPU, Jacket is the better option. Look at all the real customers that are getting big benefits. Here are some other recent benchmarks from the Walking …

CUDA and OpenCL Benchmarks – Keeneland Workshop Day 1

John Melonakos · Benchmarks, CUDA, Events, OpenCL · 3 Comments

Today was Day 1 of the Keeneland Workshop. Many great talks were given, across a broad range of GPU computing topics. With last week's ArrayFire Webinar fresh in mind, it was interesting to see similar conclusions drawn in a presentation by Kyle Spafford of Oak Ridge National Laboratory. Kyle independently ran a number of benchmarks over a period of time, which show how quickly OpenCL has matured and where it still has room for improvement. The slide below comes from Kyle's presentation. For numbers >1, CUDA is faster; for numbers <1, OpenCL is faster. Performance in most cases is close to equivalent. Just as we showed in the ArrayFire Webinar, OpenCL performance is quite comparable with CUDA performance. The Achilles heel …