Hybrid GPU & Multicore Processing for LU Decomposition

One of the hot areas in supercomputing is hybrid compute: balancing the computational load between one or more CPUs and GPUs. Along these lines, Nolan Davis and Daniel Redig at SAIC recently presented work on "Hybrid GPU/Multicore Solutions for Large Linear Algebra Problems," where they developed a novel algorithm for LU decomposition, one of the most important routines in linear algebra.

Here’s a snapshot view of their setup:

System Specs:
GPU	Nvidia® Tesla™ 2050	448 processing cores; 3 GB dedicated memory
Multicore Host	Two AMD Opteron™ 6172 12-core processors	24 cores; 64 GB system memory; Red Hat® Enterprise Linux 5
Host-to-GPU Communications	PCIe 2.0	16 lanes at 500 MB/sec/lane; theoretical peak bandwidth of 8 GB/sec
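
To get a feel for what that PCIe figure means in practice, here is a rough back-of-the-envelope sketch in Python. It assumes a dense square matrix and the theoretical peak rate from the table; the N=10,000 size is a hypothetical example, not one of the presenters' test cases, and real transfers will run slower than peak.

```python
# Rough estimate of host-to-GPU transfer time at the theoretical PCIe peak.
# Assumptions (not from the presentation): dense square matrix, decimal
# megabytes, and a sustained rate equal to the theoretical peak.

LANES = 16
BYTES_PER_SEC_PER_LANE = 500e6  # 500 MB/sec/lane, from the spec table
PCIE_PEAK_BYTES_PER_SEC = LANES * BYTES_PER_SEC_PER_LANE  # 8 GB/sec


def matrix_bytes(n: int, bytes_per_element: int) -> int:
    """Storage for a dense n x n matrix."""
    return n * n * bytes_per_element


def transfer_seconds(n: int, bytes_per_element: int) -> float:
    """Best-case time to move the whole matrix across the bus once."""
    return matrix_bytes(n, bytes_per_element) / PCIE_PEAK_BYTES_PER_SEC


if __name__ == "__main__":
    n = 10_000  # hypothetical matrix size
    for label, width in [("single", 4), ("double", 8)]:
        megabytes = matrix_bytes(n, width) / 1e6
        seconds = transfer_seconds(n, width)
        print(f"{label} precision: {megabytes:.0f} MB, {seconds:.3f} s at peak")
```

Even at best-case rates the bus is far slower than on-card memory, which is why it pays to keep data on the GPU for as long as possible.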

Their initial results are very promising. For single-precision matrices of size N up to around 11,000, the GPU running Jacket computed the decomposition about 3.5x faster than the 24-core host running MATLAB. The GPU with Jacket reached a performance peak of 287 Gflop/sec at N=8,100.

For double precision, the GPU-native LU decomposition performance peaks around 110 Gflop/sec at N=6,400. This is still about 3.5x faster than the same multi-threaded computation performed on the 24-core host. The system delivers an LU decomposition of an N=6,400 double-precision matrix in 1.6 seconds, and N=8,100 in 3.7 seconds.
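
As a quick sanity check on the double-precision numbers, the sketch below converts the reported run times into throughput. It assumes the standard leading-order operation count for LU decomposition, (2/3)N^3 flops; the presenters may have counted operations differently, so treat the result as an approximation.

```python
# Convert reported LU run times into approximate Gflop/sec.
# Assumption: flop count is the textbook leading-order term (2/3) * N^3.


def lu_flops(n: int) -> float:
    """Leading-order flop count for LU decomposition of an n x n matrix."""
    return (2.0 / 3.0) * n ** 3


def implied_gflops(n: int, seconds: float) -> float:
    """Throughput implied by factoring an n x n matrix in the given time."""
    return lu_flops(n) / seconds / 1e9


if __name__ == "__main__":
    # (matrix size, reported seconds) pairs from the double-precision results
    for n, seconds in [(6400, 1.6), (8100, 3.7)]:
        print(f"N={n}: about {implied_gflops(n, seconds):.0f} Gflop/sec")
```

Under that assumption, the N=6,400 timing works out to roughly 109 Gflop/sec, in line with the reported peak, and the N=8,100 timing to roughly 96 Gflop/sec, consistent with performance tapering off past the peak.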

When problem sizes exceed GPU memory, the researchers show that a hybrid GPU + CPU approach still outperforms dropping the GPU altogether. For LU decomposition, they break the larger problem into smaller, manageable sub-problems that fit into GPU memory and can be solved one at a time.
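
The presentation summary doesn't spell out the algorithm, so the following is only an illustration of the general idea using a textbook blocked (right-looking) LU decomposition, not the presenters' method. It is a minimal pure-Python sketch that runs entirely on the CPU and skips pivoting for brevity (production LU routines use partial pivoting). The function names and the `block` parameter are hypothetical; the comments mark the block-sized pieces that a hybrid scheme could hand to the GPU one at a time.

```python
# Blocked LU without pivoting: the matrix is processed in block-sized pieces,
# each of which could be shipped to a device with limited memory.
# All names here are illustrative assumptions, not the presenters' code.


def blocked_lu(a, block):
    """Factor a in place into L (unit lower) and U (upper), stored together."""
    n = len(a)
    for k in range(0, n, block):
        e = min(k + block, n)
        # 1. Factor the diagonal block A[k:e, k:e] (one small sub-problem).
        for p in range(k, e):
            for i in range(p + 1, e):
                a[i][p] /= a[p][p]
                for j in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # 2. Row panel: solve L11 * U12 = A12 by forward substitution.
        for j in range(e, n):
            for p in range(k, e):
                for i in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # 3. Column panel: solve L21 * U11 = A21.
        for i in range(e, n):
            for p in range(k, e):
                a[i][p] /= a[p][p]
                for j in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # 4. Trailing update A22 -= L21 * U12 (the matrix-multiply-heavy step).
        for i in range(e, n):
            for j in range(e, n):
                a[i][j] -= sum(a[i][p] * a[p][j] for p in range(k, e))
    return a


def multiply_lu(lu):
    """Rebuild L * U from the packed factors, for checking the result."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]  # unit diagonal of L
                total += l_ip * lu[p][j]
            out[i][j] = total
    return out


def make_test_matrix(n):
    """Diagonally dominant matrix, so skipping pivoting is safe here."""
    return [[(n if i == j else 0.0) + 1.0 / (1 + abs(i - j)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    original = make_test_matrix(7)
    factors = blocked_lu([row[:] for row in original], block=3)
    rebuilt = multiply_lu(factors)
    error = max(abs(rebuilt[i][j] - original[i][j])
                for i in range(7) for j in range(7))
    print(f"max reconstruction error: {error:.2e}")
```

The appeal of this layout is that most of the arithmetic lands in step 4, a matrix multiply on blocks whose size can be chosen to fit the available GPU memory.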

Most of SAIC's research efforts are pretty quiet, so it was cool to see these guys publicly describe some of their work. We’ll keep an eye out for more. Special thanks to Nolan Davis and Daniel Redig for sharing their numbers.
