Performance Improvements to JIT in ArrayFire v3.4

Pavan | Announcements, ArrayFire, Benchmarks

ArrayFire uses Just-In-Time (JIT) compilation to combine many lightweight functions into a single kernel launch. This, along with our easy-to-use API, allows users not only to prototype their algorithms quickly, but also to get the best out of the underlying hardware.
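To make the idea of combining lightweight functions concrete, here is a minimal CPU-side sketch in plain C++. It is only an analogy, not ArrayFire's implementation: each loop stands in for one kernel launch, and the function names `scaled_exp_unfused` and `scaled_exp_fused` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: every elementwise operation is its own pass over memory and
// writes a temporary buffer, much like launching one kernel per operation.
std::vector<float> scaled_exp_unfused(const std::vector<float>& x, float a, float b) {
    std::vector<float> t1(x.size()), t2(x.size()), out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) t1[i] = a * x[i];          // "kernel" 1
    for (std::size_t i = 0; i < x.size(); ++i) t2[i] = t1[i] + b;         // "kernel" 2
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::exp(t2[i]);  // "kernel" 3
    return out;
}

// Fused: the same arithmetic in a single pass with no temporaries,
// which is the effect a JIT aims for when it merges operations.
std::vector<float> scaled_exp_fused(const std::vector<float>& x, float a, float b) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::exp(a * x[i] + b);
    return out;
}
```

Both functions return the same values; the fused version simply reads the input once and skips the two intermediate buffers.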

This feature has been a favorite among our users in the domains of finance and scientific simulation.

That said, ArrayFire v3.3 and earlier had a few limitations. Namely:

  • Multiple outputs with inter-dependent variables were generating multiple kernels.
  • The number of operations per kernel was fairly limited by default.

In the latest release of ArrayFire, we addressed these issues to get some pretty impressive numbers.

In the rest of the post, we demonstrate the performance improvements using our BlackScholes example. For this particular example, the CUDA backend improved by ~8x and the OpenCL backend by ~5x.

Performance Comparison between v3.4 and v3.3

BlackScholes is a commonly used mathematical model in financial markets.

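It takes 5 inputs (the underlying stock price, strike price, risk-free interest rate, volatility, and time to maturity) and generates 2 outputs (the call and put option prices).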

The code to implement this model in ArrayFire can be seen below.
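For readers who want the arithmetic spelled out, the following is a scalar reference sketch of the Black-Scholes formulas in standard C++. It is not the ArrayFire listing from the example; the names `OptionPrices`, `norm_cdf`, and `black_scholes_scalar` are hypothetical, and the parameter letters (S, X, R, V, T) are the conventional ones, assumed here rather than taken from the post.

```cpp
#include <cmath>

// Hypothetical container for the two outputs of the model.
struct OptionPrices {
    double call;
    double put;
};

// Standard normal cumulative distribution function, written with erfc.
double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// S: underlying price, X: strike price, R: risk-free rate,
// V: volatility, T: time to maturity (in years).
OptionPrices black_scholes_scalar(double S, double X, double R, double V, double T) {
    const double sqrt_t = std::sqrt(T);
    const double d1 = (std::log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrt_t);
    const double d2 = d1 - V * sqrt_t;
    const double discounted_strike = X * std::exp(-R * T);

    OptionPrices prices;
    prices.call = S * norm_cdf(d1) - discounted_strike * norm_cdf(d2);
    prices.put  = discounted_strike * norm_cdf(-d2) - S * norm_cdf(-d1);
    return prices;
}
```

With S = X = 100, R = 0.05, V = 0.2 and T = 1, this gives a call price of roughly 10.45 and a put price of roughly 5.57. Note that both outputs reuse d1 and d2, which is why generating them in a single kernel matters.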

The black_scholes function contains a long chain of operations and generates two outputs. In older versions of ArrayFire, the Just-In-Time compiler split this single function call into multiple kernels.

We solved this in the latest release by:

  • Enabling af::eval to hint ArrayFire's JIT compiler to combine kernels generating multiple outputs.
  • Changing the heuristics to increase the number of operations per kernel.
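The benefit of evaluating inter-dependent outputs together can be sketched in plain C++. This is a CPU analogy under stated assumptions, not ArrayFire code: each loop stands in for one kernel, and `eval_separately` and `eval_together` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Each output is produced by its own pass, so the shared subexpression
// sqrt(x[i]) is computed twice and the input is read twice.
void eval_separately(const std::vector<double>& x,
                     std::vector<double>& sum, std::vector<double>& diff) {
    sum.resize(x.size());
    diff.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) sum[i] = x[i] + std::sqrt(x[i]);   // "kernel" 1
    for (std::size_t i = 0; i < x.size(); ++i) diff[i] = x[i] - std::sqrt(x[i]);  // "kernel" 2
}

// Both outputs come from one pass; the shared subexpression is computed once.
void eval_together(const std::vector<double>& x,
                   std::vector<double>& sum, std::vector<double>& diff) {
    sum.resize(x.size());
    diff.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double root = std::sqrt(x[i]);  // shared intermediate
        sum[i] = x[i] + root;
        diff[i] = x[i] - root;
    }
}
```

The two functions produce identical results. The second shape is what a single kernel with multiple outputs looks like, and it becomes more valuable as the shared chain of operations grows longer.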

These changes resulted in dramatic performance improvements seen below. The following graph compares the number of call and put options being calculated per second on v3.3 and v3.4.

As we can see, the performance has improved significantly. The absolute time taken to calculate 2 million options is shown below.

Note:

  • The additional gains on the CUDA backend come from a fix for a separate issue in the CUDA JIT.
  • Only 24 threads of the 32 available were used for the Xeon CPU.

Performance Comparison to Native CUDA

To see how close we can get to the peak performance, we compared the performance of ArrayFire and CUDA on the new GTX 1080.

To do this, we built ArrayFire 3.4 using the CUDA 8 RC and compared its performance with that of the BlackScholes example provided in the CUDA Samples.

The results of this benchmark can be seen below.

The results indicate that ArrayFire v3.4 performs identically to the native kernel at larger simulation sizes, but a constant overhead hurts performance at smaller sizes. This is one of the issues we want to address in future releases.

Note: The black_scholes function had to be changed slightly to be similar to the sample provided by CUDA.

Coming Soon

While the new improvements to the JIT engine in ArrayFire can seamlessly generate kernels that perform close to handwritten native kernels on GPUs, there is room for improvement at smaller sizes. To see what lies ahead for ArrayFire JIT, please check out this GitHub issue.

Download

ArrayFire v3.4 can be downloaded from these locations:

Community

ArrayFire is continually improving through the addition of new JIT enhancements. We welcome your feedback:

Finally, as you find success with ArrayFire, we invite you to contribute a post to this blog to share with the broader community. Email scott@arrayfire.com to contribute to this blog.

 
