ArrayFire Examples (Part 2 of 8) – Benchmarks

This is the second in a series of posts looking at our current ArrayFire examples. The code can be compiled and run from arrayfire/examples/ when you download and install the ArrayFire library. Today we will discuss the examples found in the benchmarks/ directory.

I ran these examples on a machine with the following configuration:

ArrayFire v1.9 (build XXXXXXX) by AccelerEyes (64-bit Linux)
License: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CUDA toolkit 5.0, driver 304.54
GPU0 Quadro 6000, 6144 MB, Compute 2.0 (single,double)
Display Device: GPU0 Quadro 6000
Memory Usage: 5549 MB free (6144 MB total)...

Blas

This example shows a simple benchmarking process using ArrayFire’s matrix multiply routine, the core operation of BLAS (Basic Linear Algebra Subprograms). The example reports throughput in Gflops, that is, billions of floating-point operations per second. I got the following results using the example code on my machine:

GPU0 Quadro 6000, 6144 MB, Compute 2.0 (single,double)
Display Device: GPU0 Quadro 6000
Memory Usage: 5543 MB free (6144 MB total)
128 x 128: 124 Gflops
256 x 256: 283 Gflops
384 x 384: 478 Gflops
512 x 512: 483 Gflops
640 x 640: 583 Gflops
768 x 768: 628 Gflops
896 x 896: 646 Gflops
1024 x 1024: 647 Gflops
1152 x 1152: 648 Gflops
1280 x 1280: 658 Gflops
1408 x 1408: 659 Gflops
1536 x 1536: 668 Gflops
1664 x 1664: 673 Gflops
1792 x 1792: 674 Gflops
1920 x 1920: 677 Gflops
2048 x 2048: 676 Gflops

Just for fun, I wrote a small program to perform a simple matrix multiply on the CPU using a triple-nested for loop (a pretty standard, but non-optimized solution). For n x n matrices, this basic algorithm has a running time of O(n^3). Here is what I got measuring the same benchmark using my simple algorithm:

128 x 128: 1 Mflops
256 x 256: 1 Mflops
...

As you can see, I got bored fast waiting for my CPU to do what my GPU did in a matter of seconds! I would have ended up spending a long time waiting for this program to finish running on the CPU. Also notice that I measured the CPU computation in Mflops rather than Gflops, a unit 1,000 times smaller. Of course, a real linear algebra library would do much better than my simple routine, but ArrayFire goes a step beyond smart algorithms by taking full advantage of the GPU and its parallelism.
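The CPU program itself is not shown in this post. The following is a minimal sketch of what such a triple-nested loop might look like, not the code that produced the numbers above. The names naive_matmul and time_naive_matmul are hypothetical, and the row-major layout and constant fill values are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Naive multiply of two n x n row-major matrices: c = a * b.
// Three nested loops over i, j, k give O(n^3) work.
std::vector<float> naive_matmul(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Time a single n x n multiply and return the elapsed seconds.
double time_naive_matmul(std::size_t n) {
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f);
    const auto start = std::chrono::steady_clock::now();
    const std::vector<float> c = naive_matmul(a, b, n);
    const auto stop = std::chrono::steady_clock::now();
    volatile float sink = c[0];  // read the result so the work is not optimized away
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Dividing an operation count for the multiply by the value returned from time_naive_matmul(n) gives a rate that can be set beside the GPU numbers, though a single untuned run like this is only a rough indication.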

Pi

Pi is a fun little example we use to make comparisons between regular programming and GPU programming. We use a Monte-Carlo simulation to estimate Pi: the idea is to take a bunch of random points in the unit square of the Cartesian plane, and then compute the ratio of points that lie within the unit circle to the total number of points in the square. The part of the unit circle inside the square is a quarter circle with area Pi/4, so multiplying that ratio by 4 gives an estimate of Pi. Generating millions of random points and using the Pythagorean theorem to calculate each point's distance from the origin is fairly simple math. It can be done in a for loop that counts the number of times the distance is less than 1 (so the point lies in the unit circle) and compares that count to the total number of points generated:

    int count = 0;
    for (int i = 0; i < samples; ++i) {
        float x = float(rand()) / RAND_MAX;
        float y = float(rand()) / RAND_MAX;
        if (sqrt(x*x + y*y) < 1)
            count++;
    }
    return (4.0 * count / samples);

The ArrayFire version is, of course, much simpler:

    array x = randu(samples,f32);
    array y = randu(samples,f32);
    return (4.0 * sum<float>(sqrt(x*x + y*y) < 1) / samples);

The speed difference is huge. Here are some of the results I get on my machine:

gpu: 0.00635 seconds to estimate pi = 3.1417060
cpu: 1.89067 seconds to estimate pi = 3.1418986
speedup: 297.7433

gpu: 0.00639 seconds to estimate pi = 3.1421003
cpu: 1.84324 seconds to estimate pi = 3.1417162
speedup: 288.457

gpu: 0.00635 seconds to estimate pi = 3.1417060
cpu: 1.83939 seconds to estimate pi = 3.1420715
speedup: 289.6677
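For readers who want to try the CPU side without installing anything, the loop shown earlier can be wrapped into a standalone function. This is a sketch under a few assumptions, not the example's own code: it swaps rand() for std::mt19937 with an explicit seed so runs are repeatable, and it compares the squared distance against 1 to avoid the sqrt call, which gives the same count. The name estimate_pi is hypothetical.

```cpp
#include <cstddef>
#include <random>

// Monte-Carlo estimate of Pi: draw points uniformly in the unit square and
// count the fraction that falls inside the quarter unit circle.
double estimate_pi(std::size_t samples, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const float x = unit(gen);
        const float y = unit(gen);
        // x*x + y*y < 1 is equivalent to sqrt(x*x + y*y) < 1 for non-negative values.
        if (x * x + y * y < 1.0f)
            ++inside;
    }
    // The quarter circle has area Pi/4, so scale the ratio by 4.
    return 4.0 * inside / samples;
}
```

With a million samples the estimate typically agrees with Pi to two or three decimal places; the error shrinks only with the square root of the sample count, which is why the benchmark uses so many points.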

Pi Cuda

The reason this code example exists is to show how CUDA programming and the ArrayFire library can be combined. Of course, we just saw that ArrayFire can calculate Pi just fine all by itself, but in some cases you will have more complex CUDA routines and need more control. You can still handle pieces of your program with ArrayFire constructs.

For example, although we are using raw CUDA device arrays instead of ArrayFire arrays, we can still take advantage of the ArrayFire library to populate them with random numbers:

    float *d_x = NULL, *d_y = NULL;
    unsigned bytes = sizeof(float) * n;
    CUDA(cudaMalloc(&d_x, bytes));
    CUDA(cudaMalloc(&d_y, bytes));
    AF(af_randu_S(d_x, n));
    AF(af_randu_S(d_y, n));

… and take the sum of the elements of a vector:

    float h_result;
    AF(af_sum_vector_S(&h_result, n, d_inside));

One last insight that we can gain from this example is just how much simpler ArrayFire makes GPU programming. Compare this example to the regular Pi example above, and you can see that ArrayFire abstracts away a lot of initialization and boilerplate code. Setting up devices, allocating and managing memory, and handling errors all take many lines of code that you don’t have to worry about when you are using the ArrayFire library.

So go for it, download the ArrayFire library and see how fast your computer can go!
