Intel OpenCL performance: 3rd generation hardware

Brian Kloppenborg | ArrayFire, OpenCL

Introduction

With Intel CPUs making up nearly 80% of the CPU market and 66% of computers using integrated graphics, one can easily argue that integrated graphics devices represent one of the greatest markets for GPU-accelerated computing. Here at ArrayFire, we have long recognized the potential of these devices and offer built-in support for Intel CPUs, GPUs, and AMD APUs in the OpenCL backend of our ArrayFire GPU computing library. Yet one common theme for debate in the office has been how the hardware performs on different operating systems, with different drivers, and across hardware revisions.

To answer these questions (and, perhaps, to win some intra-office geek cred) I decided to write a series of blog posts about Intel’s GPU OpenCL performance. In this first installment I will compare the performance of OpenCL on a 3rd generation Intel processor on Linux and Windows.

Monte Carlo Pi estimation

To evaluate the performance of Intel’s hardware I took a textbook problem: computing Pi using a Monte Carlo method. This is a classic example of a memory-bound application, where the time spent reading data from and writing data to memory dominates the run time. Thus, throwing N more compute cores at the problem may not yield a linear speed-up unless careful attention is paid to memory access patterns and cache usage.

The Monte Carlo process for computing pi is fairly simple, consisting of three primary steps:

  1. Generate a series of random (x,y) pairs from within the unit square
  2. Determine which of these samples reside within the unit circle
  3. Compute the ratio of the samples that reside within the circle to the total number of samples in the square, then multiply by four (because the unit square covers only one quadrant of the circle).

There is a great visualization of this process, and a demonstration of how the number of samples impacts the Pi estimate, on Wikipedia.
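The factor of four follows from a ratio of areas. As a brief sketch (the symbols N, N_in, and the estimator notation are introduced here for illustration and do not appear in the original post):

```latex
\[
  P\bigl(x^2 + y^2 < 1\bigr)
    = \frac{\text{area of quarter circle}}{\text{area of unit square}}
    = \frac{\pi/4}{1}
  \quad\Longrightarrow\quad
  \hat{\pi} = 4\,\frac{N_{\text{in}}}{N},
  \qquad
  \sigma_{\hat{\pi}} \approx 4\sqrt{\frac{(\pi/4)\,(1 - \pi/4)}{N}}
\]
```

Treating each sample as an independent trial, N = 2×10^7 samples gives a statistical uncertainty of roughly 3.7×10^-4 in the estimate, before any floating-point rounding effects.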

If you were to implement this in C/C++, your code might look like this:

...

int count = 0;
float x, y;
vector<float> randomNums(2*samples);

// init random numbers within the unit square.
// Pack as (x,y) pairs
for(int i = 0; i < 2*samples; i++)	{
	randomNums[i] = float(rand()) / RAND_MAX;
}

// count how many samples reside in the circle
for(int i = 0; i < samples; i++)
{
	x = randomNums[2*i];
	y = randomNums[2*i + 1];

	if(x*x + y*y < 1.0)
		count++;
}

// calculate the estimate and print the result
float pi_estimate = 4 * float(count) / samples;

...
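For readers who want something they can compile, here is a minimal self-contained sketch of the same loop. It is not the code used for the benchmarks: the function name estimate_pi, the seed parameter, and the use of std::mt19937 in place of rand() are assumptions made here for reproducibility.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Estimate pi from `samples` random points in the unit square.
// `estimate_pi` and `seed` are hypothetical names, not from the original post.
double estimate_pi(std::size_t samples, unsigned seed = 42) {
    assert(samples > 0);  // avoid dividing by zero below

    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    // Pack the random numbers as interleaved (x, y) pairs, as in the listing above.
    std::vector<float> randomNums(2 * samples);
    for (float& value : randomNums) {
        value = dist(gen);
    }

    // Count how many samples reside in the quarter circle.
    std::size_t count = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const float x = randomNums[2 * i];
        const float y = randomNums[2 * i + 1];
        if (x * x + y * y < 1.0f) {
            ++count;
        }
    }

    // Ratio of hits to total samples, scaled by four.
    return 4.0 * static_cast<double>(count) / static_cast<double>(samples);
}
```

Calling estimate_pi(1000000) should return a value close to 3.14; the exact digits depend on the seed and on the standard library's distribution implementation.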

Benchmark methods

We implemented five different versions of the Pi problem, with small changes in the implementation that test various aspects of the hardware and software. These five implementations are:

  1. Single core CPU – A direct implementation of the code listing above
  2. A naive port of the CPU code to the GPU with no optimizations. Region testing is performed on the GPU and the result is summed on the CPU
  3. An effort to reduce the workload of the CPU by partially summing the region testing results on the GPU
  4. A classic GPU optimization using coalesced memory load/stores
  5. And a final integrated-device specific optimization that uses zero-copy memory buffers

For these tests we used a few-year-old ASUS N56V laptop we had in the office, which features an Intel i7-3610QM CPU with an HD Graphics 4000 GT2 GPU. This particular GPU has 16 execution units which can host up to 128 concurrent threads. We installed Windows 7 (Intel OpenCL GPU driver 10.18.10.395; Intel OpenCL CPU drivers 5.1.0.25) and Ubuntu 14.10 (GPU driver Beignet 1.1 from git acfc2a2 and CPU driver “Build 92”). In future blog posts we will cover 4th and 5th generation hardware (including Iris Pro).

Results and discussion

On Linux, the single-core CPU implementation took 31.583 ms to execute on a sample size of 20 million points. Interestingly, the same code on Windows took over 66 ms to execute on the same sample size (in release mode, with /O2 and high power settings selected). This curious result isn’t particularly important, as the OpenCL performance on the CPU does not appear to be degraded between operating systems.

Below we plot the speed up of our OpenCL kernels relative to the performance of the Linux single core CPU result above for both the i7-3610QM CPU and HD Graphics 4000 GPU.

We see that the naive port of the CPU code directly to OpenCL is slower than the pure CPU implementation by a factor of four. This result is not terribly surprising once you consider that this poorly conceived implementation incurs a full O(N) cost for transferring the data and summing the region test results for the 20 million sample points.

The OCL reduction kernel offloads some of the CPU’s work by partially summing the region detection results on the GPU. This tweak increased the performance of the kernel by between 4x (for OpenCL on the GPU) and 10x (for OpenCL on the CPU). Curiously, the CPU kernels on Ubuntu and Windows perform very well, achieving speed-ups of about 3x over the single-core CPU implementation. This is still below the theoretical 4x speed improvement one would expect on this quad-core CPU, mainly due to thread management overhead and poor memory access patterns.
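The idea behind the partial summation can be sketched on the host in plain C++. This is only an emulation of the two-stage structure, not the OpenCL kernel used in the benchmark; the function names partial_counts and total_count and the group_size parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1 (the part that would run on the device): each group counts the hits
// in its own contiguous slice of interleaved (x, y) pairs and writes a single
// partial count, so only one value per group has to travel back to the host.
std::vector<unsigned> partial_counts(const std::vector<float>& xy,
                                     std::size_t group_size) {
    assert(group_size > 0 && xy.size() % 2 == 0);
    const std::size_t samples = xy.size() / 2;
    const std::size_t groups = (samples + group_size - 1) / group_size;

    std::vector<unsigned> partial(groups, 0);
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t begin = g * group_size;
        const std::size_t end = std::min(begin + group_size, samples);
        for (std::size_t i = begin; i < end; ++i) {
            const float x = xy[2 * i];
            const float y = xy[2 * i + 1];
            if (x * x + y * y < 1.0f) {
                ++partial[g];
            }
        }
    }
    return partial;
}

// Stage 2 (host): sum the short list of partial counts.
unsigned long long total_count(const std::vector<unsigned>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0ULL);
}
```

With 20 million samples and an assumed group size of about 20,000, the host would sum roughly 1,000 partial counts instead of 20 million region-test results.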

Lastly, we will discuss the coalesced memory and zero-copy kernels. The optimization here involves changing the memory access pattern from using strided memory access:

x = randomNums[2*i];
y = randomNums[2*i + 1];

to using OpenCL’s float2 data type to access sequential elements, thus improving utilization of the memory bus. This technique improved throughput by 2.2x on Ubuntu and 3.25x on Windows. This makes the GPU about twice as fast as our initial CPU implementation. We see little to no acceleration in using zero copy buffers for this problem because the problem size for the final CPU sum is quite small (on the order of 1000 elements).
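As a rough host-side illustration of the layout change, the sketch below repacks the interleaved array into two-component elements so that each sample is read as one 8-byte element rather than two separate 4-byte reads. Float2 here is a hypothetical stand-in for OpenCL's float2, and pack_pairs and count_inside are names invented for this example; it does not reproduce the benchmark kernel or its performance behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for OpenCL's float2 vector type.
struct Float2 {
    float x;
    float y;
};
static_assert(sizeof(Float2) == 2 * sizeof(float), "Float2 must be tightly packed");

// Repack interleaved floats (x0, y0, x1, y1, ...) into one element per sample.
std::vector<Float2> pack_pairs(const std::vector<float>& xy) {
    assert(xy.size() % 2 == 0);
    std::vector<Float2> points(xy.size() / 2);
    for (std::size_t i = 0; i < points.size(); ++i) {
        points[i] = Float2{xy[2 * i], xy[2 * i + 1]};
    }
    return points;
}

// Region test: each iteration reads one sequential Float2 element
// instead of indexing randomNums[2*i] and randomNums[2*i + 1] separately.
std::size_t count_inside(const std::vector<Float2>& points) {
    std::size_t count = 0;
    for (const Float2& p : points) {
        if (p.x * p.x + p.y * p.y < 1.0f) {
            ++count;
        }
    }
    return count;
}
```

In an OpenCL kernel the equivalent change would be to declare the input buffer as float2 elements, so that adjacent work-items read adjacent 8-byte values.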

Conclusion and a preview of the next post

For the Monte Carlo Pi problem, it appears Ubuntu and Beignet achieve slightly better performance than the current Windows Intel HD Graphics OpenCL drivers; however, given the limited number of trials performed (and lack of statistical error bars) we may conservatively state that the two OpenCL implementations perform similarly on the same hardware.

We found that using OpenCL on the GPU achieves better acceleration than OpenCL on the CPU in general. In one of our kernels, OpenCL on the i7-3610QM CPU was slightly faster than on the GPU, but this is probably due to automatic vectorization in the CPU OpenCL backend for this particular kernel.

In our next post we will compare the performance of OpenCL on 3rd, 4th, and 5th generation Intel processors using both memory and compute-bound kernels.

Comments 1

  1. In comparison to pure CUDA ArrayFire is quite overcomplicated to install. I spent two days on trying to install AF on Ubuntu Desktop 16.04 and it failed. Theoretically it is user-friendly. Practically I would be truthfully disappointed like quite rich, bought-by-myself edition of Matlab with comparison to GNU Octave empowered by GNU Parallel prototyping model of programming. To sum up: I do prototype on Octave and rarely, efficiently implement sth. on CUDA C++11.
    Post Scriptum: I am only a free amateur, ask for paid advice an expert.
    Post Post Scriptum: cheap optimal GPU: GTX770; expensive optimal GPU: GTX1080ti; cheap optimal server: Dell 2950 II; expensive optimal server: Dell R720xd; Dell R820. Optimal by the mean of GFLOPs/USD; GFLOPs/W; Quality/price.
