ArrayFire Examples (Part 6 of 8) - Multiple GPUs

Unlock the full potential of your hardware with ArrayFire's multi-GPU management: discover how to harness multiple GPUs for lightning-fast computations in our latest post!

John Melonakos

Jun 12, 2013

6 min read

This is the sixth in a series of posts looking at our current ArrayFire examples. The code can be compiled and run from arrayfire/examples/ when you download and install the ArrayFire library. Today we will discuss the examples found in the multi_gpu/ directory.

In these examples, my machine has the following configuration:

ArrayFire v1.9.1 (build XXXXXXX) by AccelerEyes (64-bit Linux)
License: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CUDA toolkit 5.0, driver 319.17
GPU0 Tesla K20c, 5120 MB, Compute 3.5  (current)
GPU1 Tesla C2075, 6144 MB, Compute 2.0
GPU2 Tesla C1060, 4096 MB, Compute 1.3
Memory Usage: 4935 MB free (5120 MB total)

*Ordered from fastest to slowest, the GPUs in my machine are: K20c, C2075, C1060.

ArrayFire can manage multiple GPUs. This capability is useful for benchmarking several GPUs and for switching between them at run time.

Let's take a look at the following examples.

1. Fast Fourier Transform - fft.cpp

This example measures, for each device in your machine, the time taken to compute the FFT of every column of a matrix of random complex single-precision values. In other words, it benchmarks Fast Fourier Transform performance across your GPUs. Conveniently, ArrayFire already provides an FFT function, so we don't need to know how the FFT works internally.

    try {
        info();
        int total_device = devicecount();

As you have seen several times in previous examples, "info()" prints diagnostic information about your driver, runtime, memory, and devices; it is also part of ArrayFire's multi-GPU management functions. The output of "info()" for my machine is shown at the top of this page. Since my machine is equipped with three GPUs, "devicecount()" returns 3, which is stored in "total_device".

static void bench()
{
    for (int i = 0; i < ndevice; ++i) {
        deviceset(i);              // switch to device
        array x = randu(n,n,c32);  // generate random complex input
        array y = fft(x);          // analyze signal of each column
    }
}

After determining the number of devices in your machine, the example loops over that many devices. For each one, it generates a matrix of random complex single-precision values and computes the FFT of that matrix. "deviceset(...)" selects which GPU to use by its index. Once a device is set, "deviceget()" returns the index of the device currently in use. Our example benchmarks the FFT in three cases: the first device, the first two devices, and then all three devices.

2048x2048 random number gen and complex FFT on 1 devices...  101.1 gflops
2048x2048 random number gen and complex FFT on 2 devices...  404.7 gflops
2048x2048 random number gen and complex FFT on 3 devices...  303.4 gflops

If you're paying attention, you'll realize that the results above are counter-intuitive. Why does running on more GPUs result in less performance? The reason is that the configuration mixes different GPUs, and each of those GPUs must complete the same amount of work before the benchmark finishes.

The execution configuration of each iteration is as follows: a C1060 for the first benchmark; a C2075 and a K20 for the second benchmark; and a C1060, a C2075, and a K20 for the third benchmark. Because the C1060 is the slowest GPU in my machine, the first result was 101.1 gflops, the slowest as expected. In the second benchmark, the C2075 and the K20, the fastest GPU in my machine, achieved the best result with 404.7 gflops. The third benchmark also used the K20; however, the C1060 became the bottleneck in this case, so it produced the second-best result with 303.4 gflops.
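The bottleneck argument can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every device receives the same amount of work and the benchmark ends only when the slowest device finishes. Under that assumption the aggregate rate is the number of devices times the slowest device's rate, which agrees with the figures above (303.4 / 3 is roughly 101.1). The function name `aggregate_gflops` is hypothetical and is not part of ArrayFire or of the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Aggregate throughput under an equal-work model (an assumption, not
// ArrayFire code): each device processes the same amount of work W, and the
// run ends when the slowest device is done.
//   wall time  = W / min(rate)
//   total work = N * W
//   aggregate  = N * min(rate)
double aggregate_gflops(const std::vector<double>& per_device_gflops) {
    assert(!per_device_gflops.empty());
    const double slowest = *std::min_element(per_device_gflops.begin(),
                                             per_device_gflops.end());
    return static_cast<double>(per_device_gflops.size()) * slowest;
}
```

For instance, `aggregate_gflops({101.1, 202.35, 500.0})` evaluates to about 303.3 regardless of the third value. Here 202.35 is simply 404.7 / 2 as implied by this model, and 500.0 is a placeholder for the K20, not a measurement.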

One last thing to note in this example is the TIMEIT function.

            double time_s = timeit(bench);

ArrayFire provides TIMER functions to measure elapsed time in many situations, but a single timed run is not always accurate: on its first run the GPU needs some warm-up time for initialization and for loading libraries. TIMEIT addresses this by performing a few dummy runs to prepare the GPU before timing the actual function, and it returns a more robust timing result. This makes it a really useful function for anyone who frequently benchmarks GPU performance.
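The warm-up idea can be illustrated without any GPU library. The following is a minimal sketch of the general technique, not ArrayFire's implementation of TIMEIT: the helper name `time_with_warmup` and its default run counts are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run fn a few times untimed so that one-time costs
// (initialization, library loading) are paid up front, then return the
// average number of seconds per timed call.
double time_with_warmup(const std::function<void()>& fn,
                        int warmup_runs = 3, int timed_runs = 10) {
    assert(warmup_runs >= 0 && timed_runs > 0);
    for (int i = 0; i < warmup_runs; ++i) {
        fn();  // untimed warm-up pass
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) {
        fn();
    }
    // Note: for asynchronous GPU work, a device synchronization would be
    // needed here before reading the clock; this sketch times host code only.
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

Averaging over several timed runs is one simple design choice; taking the median or the minimum of the runs is a common alternative when occasional outliers are a concern.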

Thus, ArrayFire assists users with more flexibility in device use and also provides an accurate FFT benchmark result.

2. General Matrix-Vector Product - gemv.cpp

This example implements the general matrix-vector product, one of the fundamental operations in BLAS (Basic Linear Algebra Subprograms). In brief, the example does the following:

1. It computes a multi-GPU matrix-vector multiply: y = A*x.
2. The system matrix A is distributed across the available devices.
3. Each iteration pushes x to the devices, multiplies it against the matrix A, and pulls the result y back to the host.

(In these steps, "device" refers to the GPU and "host" refers to the CPU.)

 69         float *A = new float[n*n], *X = new float[n], *Y = new float[n];
 70         ones(A, n*n);
 71         ones(X, n*1);
 72
 73         // Distribute A partitions across devices
 74         array *AMatrix = new array[ngpu];
 75         for (int idx = 0; idx < ngpu; idx++) {
 76             deviceset(idx);
 77             AMatrix[idx] = array(n, n/ngpu, A + (n*n/ngpu * idx), afHost);
 78         }
 79         delete[] A; // done with A, keep X to push each iteration

Lines 69-71 declare the float arrays A (of size n x n), X, and Y in host (CPU) memory and fill A and X with ones. AMatrix, an array of arrays created in device (GPU) memory, then holds the pieces of A distributed from host memory, one piece per device.

To do this, the code first selects the target device with "deviceset(...)" and then calls the array(dim0, dim1, ptr, af_source) constructor, which creates a matrix on that device from data copied out of the corresponding region of host memory. Each device receives an n x (n/ngpu) block of A.

        // Multiply each piece of A*x
        for (int idx = 0; idx < ngpu; idx++) {
            deviceset(idx);
            YVector[idx] = matmul(AMatrix[idx], array(n/ngpu, X + idx*(n/ngpu), afHost));
        }

Once "AMatrix" is in place, each iteration temporarily pushes the matching slice of "X" into each device's memory, this time using the array(dim0, ptr, af_source) constructor, so that it can be multiplied against the data already on that device. MATMUL then multiplies each piece of "AMatrix" by its slice of "X" and stores the product in the "YVector" array.

        // Copy partial results back to host (implicitly blocks until done)
        for (int idx = 0; idx < ngpu; idx++) {
            deviceset(idx);
            YVector_host[idx] = YVector[idx].host<float>();
        }

Lastly, the results in "YVector" are pulled back to host memory by assigning them to "YVector_host", which holds float buffers in host memory. In this particular example, these steps are repeated for 1000 iterations so that the timing is measured accurately.
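The partitioning scheme in the steps above can be sketched on the CPU alone. The following is a minimal illustration, assuming A is stored column-major (as ArrayFire stores arrays) and that n is divisible by the number of parts. The function name `partitioned_gemv` is hypothetical, and the final summation of the partial results is my assumption about how they combine into y; the excerpts above stop at copying the partial results back to the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only sketch (not ArrayFire code): y = A*x where A is n x n,
// column-major, and split into `parts` slabs of n/parts columns each.
// Each slab plays the role of one device's AMatrix[idx].
std::vector<float> partitioned_gemv(const std::vector<float>& A,
                                    const std::vector<float>& x,
                                    std::size_t n, std::size_t parts) {
    assert(parts > 0 && n % parts == 0);
    assert(A.size() == n * n && x.size() == n);
    const std::size_t cols = n / parts;  // columns per slab
    std::vector<float> y(n, 0.0f);
    for (std::size_t p = 0; p < parts; ++p) {
        // Same offsets as the example: A + (n*n/ngpu * idx), X + idx*(n/ngpu)
        const float* slab = A.data() + n * cols * p;
        const float* xp = x.data() + cols * p;
        std::vector<float> partial(n, 0.0f);  // one slab's partial result
        for (std::size_t j = 0; j < cols; ++j) {
            for (std::size_t i = 0; i < n; ++i) {
                partial[i] += slab[i + j * n] * xp[j];
            }
        }
        // Assumed combine step: partial results add up to the full y.
        for (std::size_t i = 0; i < n; ++i) {
            y[i] += partial[i];
        }
    }
    return y;
}
```

With A and x filled with ones, as in the example, every entry of the combined result equals n. Each slab's partial vector has full length n because a column slab touches every row of A.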

        // do matrix-vector multiply using all GPUs
        af::sync();
        timer::start();
        multi_Sgemv(iterations, ngpu, AMatrix, n, X, n, Y);
        af::sync();

ArrayFire performs all of these steps quickly: in less than 0.01 seconds on average, as the output below shows.

size(A)=[21000,21000]  (1683 mb)
Benchmarking........
Average time for 1000 iterations : 0.00997175 seconds

ArrayFire provides a great deal of flexibility in device control. If you want to manage multiple devices, download the ArrayFire library and bring that flexibility to your GPU programming today!


Posts in this series: