This is the second in a series of posts looking at our current ArrayFire examples. The code can be compiled and run from arrayfire/examples/
when you download and install the ArrayFire library. Today we will discuss the examples found in the benchmarks/
directory.
In these examples, my machine has the following configuration:
ArrayFire v1.9 (build XXXXXXX) by AccelerEyes (64-bit Linux)
License: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CUDA toolkit 5.0, driver 304.54
GPU0 Quadro 6000, 6144 MB, Compute 2.0 (single,double)
Display Device: GPU0 Quadro 6000
Memory Usage: 5549 MB free (6144 MB total)
...
This example shows a simple benchmarking process using ArrayFire's matrix multiply routine. For more information on BLAS, click here. The metric measured in this example is throughput in gigaflops (GFLOPS, billions of floating-point operations per second). I got the following results using the example code on my machine:
GPU0 Quadro 6000, 6144 MB, Compute 2.0 (single,double)
Display Device: GPU0 Quadro 6000
Memory Usage: 5543 MB free (6144 MB total)
 128 x  128:  124 Gflops
 256 x  256:  283 Gflops
 384 x  384:  478 Gflops
 512 x  512:  483 Gflops
 640 x  640:  583 Gflops
 768 x  768:  628 Gflops
 896 x  896:  646 Gflops
1024 x 1024:  647 Gflops
1152 x 1152:  648 Gflops
1280 x 1280:  658 Gflops
1408 x 1408:  659 Gflops
1536 x 1536:  668 Gflops
1664 x 1664:  673 Gflops
1792 x 1792:  674 Gflops
1920 x 1920:  677 Gflops
2048 x 2048:  676 Gflops
Just for fun, I wrote a small program that performs a simple matrix multiply on the CPU with a triple for loop (a standard, but unoptimized, solution). This basic algorithm has a run time of O(n^3). Here is what I got measuring the same benchmark with my simple routine:
128 x 128: 1 Mflops
256 x 256: 1 Mflops
...
As you can see, I got bored fast waiting for my CPU to do what my GPU did in a matter of seconds! I would have ended up waiting a very long time for this program to finish on the CPU. Also notice that the CPU computation is measured in megaflops rather than gigaflops. Of course, a real linear algebra library would do much better than my simple routine, but ArrayFire goes a step beyond smart algorithms by taking full advantage of the GPU and its parallelism.
Pi is a fun little example we use to compare regular programming with GPU programming. We use a Monte Carlo simulation to estimate pi: generate a large number of random points in the unit square of the Cartesian plane, then compute the ratio of points that fall inside the unit circle to the total number of points in the square. Since the quarter circle covers pi/4 of the unit square's area, that ratio approaches pi/4. Generating millions of random points and applying the Pythagorean theorem to compute each hypotenuse is simple math, and can be done in a for loop that counts how many hypotenuses have length less than 1 (and therefore lie inside the unit circle) and compares that count to the total number of points generated:
    int count = 0;
    for (int i = 0; i < samples; ++i) {
        float x = float(rand()) / RAND_MAX;
        float y = float(rand()) / RAND_MAX;
        if (sqrt(x*x + y*y) < 1)
            count++;
    }
    return (4.0 * count / samples);
The ArrayFire version is, of course, much simpler:
    array x = randu(samples, f32);
    array y = randu(samples, f32);
    return (4.0 * sum<float>(sqrt(x*x + y*y) < 1) / samples);
The speed difference is huge; here are some of the results I get on my machine:
gpu: 0.00635 seconds to estimate pi = 3.1417060
cpu: 1.89067 seconds to estimate pi = 3.1418986
speedup: 297.7433
gpu: 0.00639 seconds to estimate pi = 3.1421003
cpu: 1.84324 seconds to estimate pi = 3.1417162
speedup: 288.457
gpu: 0.00635 seconds to estimate pi = 3.1417060
cpu: 1.83939 seconds to estimate pi = 3.1420715
speedup: 289.6677
This code example exists to show how CUDA programming and the ArrayFire library work together. We just saw that ArrayFire can estimate pi just fine all by itself, but in some cases you will have more complex CUDA routines and need more control. Even then, you can still accelerate pieces of your program with ArrayFire constructs.
For example, although we are using raw CUDA device buffers instead of ArrayFire arrays, we can still use the ArrayFire library to populate these vectors with random numbers:
    float *d_x = NULL, *d_y = NULL;
    unsigned bytes = sizeof(float) * n;
    CUDA(cudaMalloc(&d_x, bytes));
    CUDA(cudaMalloc(&d_y, bytes));
    AF(af_randu_S(d_x, n));
    AF(af_randu_S(d_y, n));
… and take the sum of the elements of a vector:
    float h_result;
    AF(af_sum_vector_S(&h_result, n, d_inside));
One last insight we can gain from this example is just how much simpler ArrayFire makes GPU programming. Compare this example to the regular pi example above, and you can see that ArrayFire abstracts away a lot of initialization and boilerplate code. Setting up devices, allocating and managing memory, and handling errors take many lines of code that you don't have to worry about when you use the ArrayFire library.
So go for it, download the ArrayFire library and see how fast your computer can go!
—
Posts in this series:
- ArrayFire Examples (Part 1 of 8) – Getting Started
- ArrayFire Examples (Part 2 of 8) – Benchmarks
- ArrayFire Examples (Part 3 of 8) – Financial
- ArrayFire Examples (Part 4 of 8) – Image Processing
- ArrayFire Examples (Part 5 of 8) – Machine Learning
- ArrayFire Examples (Part 6 of 8) – Multiple GPU
- ArrayFire Examples (Part 7 of 8) – PDE