Performance Improvements to JIT in ArrayFire v3.4

ArrayFire uses Just-In-Time (JIT) compilation to combine many lightweight functions into a single kernel launch. Together with our easy-to-use API, this lets users quickly prototype their algorithms and still get the best out of the underlying hardware.

This feature has been a favorite among our users in the domains of finance and scientific simulation.

That said, ArrayFire v3.3 and earlier had a few limitations. Namely:

Multiple outputs with inter-dependent variables were generating multiple kernels.
The number of operations per kernel was fairly limited by default.

In the latest release of ArrayFire, we addressed these issues to get some pretty impressive numbers.

In the rest of this post, we demonstrate the performance improvements using our BlackScholes example. For this particular example, the improvement was ~8x for the CUDA backend and ~5x for the OpenCL backend.

Performance Comparison between v3.4 and v3.3

BlackScholes is a commonly used mathematical model in financial markets.

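It takes 5 inputs (stock price, strike price, risk-free rate of interest, volatility, and time to maturity) and generates 2 outputs (the call and put option prices).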

The code to implement this model in ArrayFire can be seen below.

array cnd(array x)
{
    // constants
    const float a1 =  0.254829592;
    const float a2 = -0.284496736;
    const float a3 =  1.421413741;
    const float a4 = -1.453152027;
    const float a5 =  1.061405429;
    const float p  =  0.3275911;
    const float sqrt2 = sqrt(2.0);

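    // Remember which elements are non-negative (1 where x >= 0, 0 otherwise)
    array nonNeg = (x >= 0);

    x = abs(x) / sqrt2;

    // A & S formula 7.1.26: erf(x) ~= 1 - (a1*t + ... + a5*t^5) * exp(-x*x),
    // so the normal CDF for non-negative arguments is 1 - 0.5 * (...) * exp(-x*x)
    array t = 1.0f / (1.0f + p*x);
    array y = 1.0f - 0.5f * (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*exp(-x*x);

    return nonNeg * y + !nonNeg * (1 - y); // equivalent of (x >= 0) ? y : (1 - y);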
}

static void black_scholes(array& C, array& P,
                          const array& S, const array& X,
                          const array& R, const array& V,
                          const array& T)
{
    // Inputs:
    // S: Stock price
    // X: Strike price
    // R: Risk free rate of interest
    // V: Volatility
    // T: Time to maturity

    // Outputs:
    // C: Call option
    // P: Put option

    array d1 = log(S / X);
    d1 = d1 + (R + (V*V)*0.5) * T;
    d1 = d1 / (V*sqrt(T));

    array d2 = d1 - (V*sqrt(T));

    array cnd_d1 = cnd(d1);
    array cnd_d2 = cnd(d2);

    C = S * cnd_d1  - (X * exp((-R)*T) * cnd_d2);
    P = X * exp((-R)*T) * (1 - cnd_d2) - (S * (1 - cnd_d1));
}
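
One way to validate the output of the vectorized code above is to compare it against a scalar reference. The sketch below is plain C++ with no ArrayFire dependency. It assumes the standard closed-form Black-Scholes expressions and uses std::erfc for the normal CDF instead of the polynomial approximation. The names norm_cdf, OptionPrice, and black_scholes_ref are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF via the complementary error function.
// Hypothetical helper, not part of ArrayFire.
inline double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

struct OptionPrice {
    double call;
    double put;
};

// Scalar counterpart of black_scholes() for a single option.
inline OptionPrice black_scholes_ref(double S, double X, double R, double V, double T) {
    // log() and the division by V*sqrt(T) need strictly positive inputs
    assert(S > 0 && X > 0 && V > 0 && T > 0);

    const double sqrtT = std::sqrt(T);
    const double d1 = (std::log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT);
    const double d2 = d1 - V * sqrtT;
    const double discounted_strike = X * std::exp(-R * T);

    OptionPrice out;
    out.call = S * norm_cdf(d1) - discounted_strike * norm_cdf(d2);
    out.put  = discounted_strike * (1.0 - norm_cdf(d2)) - S * (1.0 - norm_cdf(d1));
    return out;
}
```

Put-call parity, C - P = S - X * exp(-R * T), gives a quick consistency check that applies to either implementation.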

The black_scholes function contains a long chain of operations and generates two outputs. In older versions of ArrayFire, the JIT compiler split this single function call into multiple kernels.

We solved this in the latest release by:

Enabling af::eval to hint ArrayFire’s JIT compiler to combine kernels generating multiple outputs.
Changing the heuristics to increase the number of operations per kernel.

black_scholes(C, P, S, X, R, V, T);
eval(C, P); // Only change required for v3.4
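
To see why producing both outputs from a single kernel can help, consider the plain C++ analogy below. It is not how ArrayFire's JIT is implemented. It only illustrates the idea, using ordinary loops in place of kernels and a made-up pair of outputs that share an intermediate value. All function names here are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Two separate "kernels": each output makes its own pass over the inputs
// and recomputes the shared term exp(-r * t).
inline void separate_passes(const std::vector<float>& r, const std::vector<float>& t,
                            std::vector<float>& a, std::vector<float>& b) {
    a.resize(r.size());
    b.resize(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) a[i] = 1.0f + std::exp(-r[i] * t[i]);
    for (std::size_t i = 0; i < r.size(); ++i) b[i] = 1.0f - std::exp(-r[i] * t[i]);
}

// One fused "kernel": a single pass reads the inputs once, computes the
// shared term once, and writes both outputs.
inline void fused_pass(const std::vector<float>& r, const std::vector<float>& t,
                       std::vector<float>& a, std::vector<float>& b) {
    a.resize(r.size());
    b.resize(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        const float e = std::exp(-r[i] * t[i]);
        a[i] = 1.0f + e;
        b[i] = 1.0f - e;
    }
}
```

Both versions produce the same values. The fused one halves the input reads and the exp evaluations, which is the kind of saving that matters when C and P share d1, d2, and the cnd results.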

These changes resulted in dramatic performance improvements seen below. The following graph compares the number of call and put options being calculated per second on v3.3 and v3.4.

As we can see, the performance has improved significantly. The absolute time taken to calculate 2 million options is shown below.

Note:

The additional gains made by CUDA are because of a separate issue involving CUDA JIT that was fixed.
Only 24 threads of the 32 available were used for the Xeon CPU.

Performance Comparison to Native CUDA

To see how close we can get to the peak performance, we compared the performance of ArrayFire and CUDA on the new GTX 1080.

To do this we built ArrayFire 3.4 using the CUDA 8 RC and compared its performance with that of the example provided in CUDA Samples.

The results of this benchmark can be seen below.

The results indicate that ArrayFire v3.4 performs identically to the native kernel at larger simulation sizes, but a constant overhead hurts performance at smaller sizes. This is one of the issues we want to address in future releases.

Note: The black_scholes function had to be changed slightly to be similar to the sample provided by CUDA.

Coming Soon

While the new improvements to the JIT engine in ArrayFire can seamlessly generate kernels that perform close to handwritten native kernels on GPUs, there is room for improvement at smaller sizes. To see what lies ahead for ArrayFire JIT, please check out this GitHub issue.

Download

ArrayFire v3.4 can be downloaded from these locations:

Community

ArrayFire is continually improving through the addition of new JIT enhancements. We welcome your feedback:

General discussion forums on the ArrayFire Google Group
Live discussion chat on the ArrayFire Gitter
Issue reports on the ArrayFire GitHub

Finally, as you find success with ArrayFire, we invite you to contribute a post to this blog to share with the broader community. Email scott@arrayfire.com to contribute to this blog.
