How to write vectorized code

Programmers and Data Scientists want to take advantage of fast and parallel computational devices. Writing vectorized code is becoming a necessity to get the best performance out of the current generation parallel hardware and scientific computing software. However, writing vectorized code may not be intuitive immediately. There are many ways you can vectorize a given code segment. Each method has its own benefits and drawbacks. Hence, writing vectorized code involves analyzing the pros and cons of the available methods and choosing the right one to solve your problem.

In this post, I present various ways to vectorize your code using ArrayFire. ArrayFire is chosen because of my familiarity with the software. The same methods can be easily used in numpy, octave, julia or other scientific computing software.

The problem

The problem we will be looking at is the calculation of distances from every point in a set A to every point in set B. For example, if A has 200 points and B has 300 points, the distance matrix will be of size 200 x 300. Each point has coordinates in 3 dimensional space. We can generate the necessary data in the following manner.

A = randu(3, 200);
B = randu(3, 300);

Naive implementation

The naive method would be to have three loops. The outer most loop iterating through points in A. The second loop iterating through points in B. The inner most loop iterating through the three dimensions. You can find difference in coordinates between the two points and find the distance between the two points.

This can be implemented in the following manner. The code is shown for reference. This is extremely slow on GPUs

// CODE SHOWN FOR REFERENCE. DO NOT USE. IT IS EXTREMELY SLOW.
static array dist_naive(array a, array b)
{
    array dist_mat = constant(0, a.dims(1), b.dims(1));

    // Iterate through columns a
    for (int ii = 0; ii < a.dims(1); ii++) {

        // Iterate through columns of b
        for (int jj = 0; jj < b.dims(1); jj++) {

            // Get the sum of absolute differences
            for (int kk = 0; kk < a.dims(0); kk++) {
                dist_mat(ii, jj) += abs(a(kk, ii) - b(kk, jj));
            }
        }
    }

    return dist_mat;
}

Vectorizing along one dimension

Keeping the outer two loops the same, you can improve the performance by treating the set of three coordinates as a single vector. You can perform the difference between these vectors in parallel. A reduction operation after getting the absolute differences will give us the distance without the need of a third loop.

static array dist_vec(array a, array b)
{
    array dist_mat = constant(0, a.dims(1), b.dims(1));

    // Iterate through columns a
    for (int ii = 0; ii < a.dims(1); ii++) {
        array avec = a(span, ii);

        // Iterate through columns of b
        for (int jj = 0; jj < b.dims(1); jj++) {
            array bvec = b(span, jj);

            // get SAD using sum on the vector
            dist_mat(ii, jj) = sum(abs(avec - bvec));
        }
    }

    return dist_mat;
}

Using GFOR

GFOR is a parallel version of the for loop in ArrayFire. Since GFOR's cannot be nested, we can replace one of the for loops using GFOR. This can be done in the following manners.

static array dist_gfor1(array a, array b)
{
    array dist_mat = constant(0, a.dims(1), b.dims(1));

    // GFOR along columns of a
    gfor (array ii, a.dims(1)) {
        array avec = a(span, ii);

        // Itere through columns of b
        for (int jj = 0; jj < b.dims(1); jj++) {
            array bvec = b(span, jj);

            // get SAD using sum on the vector
            dist_mat(ii, jj) = sum(abs(avec - bvec));
        }
    }

    return dist_mat;
}

static array dist_gfor2(array a, array b)
{
    array dist_mat = constant(0, a.dims(1), b.dims(1));

    // GFOR along columns of b
    gfor (array jj, b.dims(1)) {
        array bvec = b(span, jj);

        // Iterate through columns of A
        for (int ii = 0; ii < a.dims(1); ii++) {
            array avec = a(span, ii);

            // get SAD using sum on the vector
            dist_mat(ii, jj) = sum(abs(avec - bvec));
        }
    }

    return dist_mat;
}

Vectorizing along two dimensions

We can get rid of another FOR loop to have 2-D vectorized code. You can get a single point from set A and replicate it to create a matrix that is the same size as B. You can now find distance of a single point in A from all points in B. The code can be seen below.

static array dist_tile1(array a, array b)
{
    // int feat_len = a.dims(0); // Same as b.dims(0);
    int alen = a.dims(1);
    int blen = b.dims(1);

    array dist_mat = constant(0, alen, blen);

    // Iterate through columns of b
    for (int jj = 0; jj < blen; jj++) {

        // Get the column vector of b
        // shape of bvec is (feat_len, 1)
        array bvec = b(span, jj);

        // Tile avec to be same size as a
        // shape of bvec_tiled is (feat_len, alen)
        array bvec_tiled = tile(bvec, 1, alen);

        // Get the sum of absolute differences
        array sad = sum(abs(bvec_tiled - a));

        // sad is row vector, dist_mat needs column vector
        // transpose sad and fill in dist_mat
        dist_mat(span, jj) = sad.T();
    }

    return dist_mat;
}

Vectorizing along three dimensions

The previous method can be further extended. This would involve getting the differences between every point in A and every point in B in parallel. This can be achieved by modifying and tiling A, B to get matrices of size [3, Alen, Blen].

For A, we need to tile it for each point present in B. This generates a 3D array of the desired dimensions. For B, we need to first modify it such that each point in B is represented in the 3rd dimension rather than the second. This is followed up by a tiling operation in the second dimension to generate a 3D array of the desired dimensions.

A simple illustration of what tiling does to the data can be seen below.

Original
A [2 4] =
       10.0000    11.0000    12.0000    13.0000
       14.0000    15.0000    16.0000    17.0000

B [2 3] =
       20.0000    21.0000    22.0000
       23.0000    24.0000    25.0000

Modified
A_mod [2 4] =
       10.0000    11.0000    12.0000    13.0000
       14.0000    15.0000    16.0000    17.0000

B_mod [2 1 3] =
       20.0000
       23.0000

       21.0000
       24.0000

       22.0000
       25.0000

Tiled
A_tiled [2 4 3] =
       10.0000    11.0000    12.0000    13.0000
       14.0000    15.0000    16.0000    17.0000

       10.0000    11.0000    12.0000    13.0000
       14.0000    15.0000    16.0000    17.0000

       10.0000    11.0000    12.0000    13.0000
       14.0000    15.0000    16.0000    17.0000

B_tiled [2 4 3] =
       20.0000    20.0000    20.0000    20.0000
       23.0000    23.0000    23.0000    23.0000

       21.0000    21.0000    21.0000    21.0000
       24.0000    24.0000    24.0000    24.0000

       22.0000    22.0000    22.0000    22.0000
       25.0000    25.0000    25.0000    25.0000

Once we have the tiled versions of the matrices, a single reduction generates the desired results in a single step. The code for this optimization can be seen below.

static array dist_tile2(array a, array b)
{
    int feat_len = a.dims(0);
    int alen = a.dims(1);
    int blen = b.dims(1);

    // Shape of a is (feat_len, alen, 1)
    array a_mod = a;
    // Reshape b from (feat_len, blen) to (feat_len, 1, blen)
    array b_mod = moddims(b, feat_len, 1, blen);

    // Tile both matrices to be (feat_len, alen, blen)
    array a_tiled = tile(a_mod, 1, 1, blen);
    array b_tiled = tile(b_mod, 1, alen, 1);

    // Do The sum operation along first dimension
    // Output is of shape (1, alen, blen)
    array dist_mod = sum(abs(a_tiled - b_tiled));

    // Reshape dist_mat from (1, alen, blen) to (alen, blen)
    array dist_mat = moddims(dist_mod, alen, blen);
    return dist_mat;
}

Results

The following results have been generated on a NVIDIA Quadro K5000.

Time for dist_vec  : 2177.73ms
Time for dist_gfor1: 10.80ms
Time for dist_gfor2: 7.35ms
Time for dist_tile1: 7.23ms
Time for dist_tile2: 0.25ms

As you can see the final code is about 8,700 times faster than the 1D vectorized code. It is also 29 times faster than the next fastest methods.
gfor2 (GFOR over B) is faster than gfor1 (GFOR over A) because B has more points than A.
tile1 and gfor2 have similar performance because they employ similar mechanisms internally.

Considerations

tile2 is the fastest method, but it also uses up the most memory as scratch space. It requires two temporary matrices of sizes [feat_len, Alen, Blen]. In this particular case, feat_len was 3, Alen was 200 and Blen was 300. However, it is easy for the scratch space requirements to grow to a few gigabytes when the number for points is much higher (as is the case with most machine learning algorithms).
Though gfor2 and tile1 give the same performance, tile1 should be preferred. This is because GFOR can not be nested. However using the tile1 method enables us to use GFOR at a higher level in the program.
Of the functions presented here, tile1 is perhaps the best compromise when constrained by memory.
Hybrid algorithms, which combine one or two of the methods shown here, can provide better performance than tile1. We will talk about such algorithms at a later date.

The relevant code for this post can be found on our github page over here.