Programmers and Data Scientists want to take advantage of fast and parallel computational devices. Writing vectorized code is becoming a necessity to get the best performance out of the current generation parallel hardware and scientific computing software. However, writing vectorized code may not be intuitive immediately. There are many ways you can vectorize a given code segment. Each method has its own benefits and drawbacks. Hence, writing vectorized code involves analyzing the pros and cons of the available methods and choosing the right one to solve your problem.

In this post, I present various ways to vectorize your code using ArrayFire. ArrayFire is chosen because of my familiarity with the software. The same methods can be easily used in numpy, octave, julia or other scientific computing software.

### The problem

The problem we will be looking at is the calculation of distances from every point in a set `A`

to every point in set `B`

. For example, if ` A `

has ` 200 `

points and ` B `

has ` 300 `

points, the distance matrix will be of size `200 x 300`

. Each point has coordinates in 3 dimensional space. We can generate the necessary data in the following manner.

A = randu(3, 200); B = randu(3, 300);

### Naive implementation

The naive method would be to have three loops. The outer most loop iterating through points in A. The second loop iterating through points in B. The inner most loop iterating through the three dimensions. You can find difference in coordinates between the two points and find the distance between the two points.

This can be implemented in the following manner. * The code is shown for reference. This is extremely slow on GPUs *

// CODE SHOWN FOR REFERENCE. DO NOT USE. IT IS EXTREMELY SLOW. static array dist_naive(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // Iterate through columns a for (int ii = 0; ii < a.dims(1); ii++) { // Iterate through columns of b for (int jj = 0; jj < b.dims(1); jj++) { // Get the sum of absolute differences for (int kk = 0; kk < a.dims(0); kk++) { dist_mat(ii, jj) += abs(a(kk, ii) - b(kk, jj)); } } } return dist_mat; }

### Vectorizing along one dimension

Keeping the outer two loops the same, you can improve the performance by treating the set of three coordinates as a single vector. You can perform the difference between these vectors in parallel. A reduction operation after getting the absolute differences will give us the distance without the need of a third loop.

static array dist_vec(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // Iterate through columns a for (int ii = 0; ii < a.dims(1); ii++) { array avec = a(span, ii); // Iterate through columns of b for (int jj = 0; jj < b.dims(1); jj++) { array bvec = b(span, jj); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; }

### Using GFOR

GFOR is a parallel version of the for loop in ArrayFire. Since GFOR's cannot be nested, we can replace one of the for loops using GFOR. This can be done in the following manners.

static array dist_gfor1(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // GFOR along columns of a gfor (array ii, a.dims(1)) { array avec = a(span, ii); // Itere through columns of b for (int jj = 0; jj < b.dims(1); jj++) { array bvec = b(span, jj); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; } static array dist_gfor2(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // GFOR along columns of b gfor (array jj, b.dims(1)) { array bvec = b(span, jj); // Iterate through columns of A for (int ii = 0; ii < a.dims(1); ii++) { array avec = a(span, ii); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; }

### Vectorizing along two dimensions

We can get rid of another FOR loop to have 2-D vectorized code. You can get a single point from set `A `

and replicate it to create a matrix that is the same size as `B`

. You can now find distance of a

point in *single*`A `

from

points in *all *`B`

. The code can be seen below.

static array dist_tile1(array a, array b) { // int feat_len = a.dims(0); // Same as b.dims(0); int alen = a.dims(1); int blen = b.dims(1); array dist_mat = constant(0, alen, blen); // Iterate through columns of b for (int jj = 0; jj < blen; jj++) { // Get the column vector of b // shape of bvec is (feat_len, 1) array bvec = b(span, jj); // Tile avec to be same size as a // shape of bvec_tiled is (feat_len, alen) array bvec_tiled = tile(bvec, 1, alen); // Get the sum of absolute differences array sad = sum(abs(bvec_tiled - a)); // sad is row vector, dist_mat needs column vector // transpose sad and fill in dist_mat dist_mat(span, jj) = sad.T(); } return dist_mat; }

### Vectorizing along three dimensions

The previous method can be further extended. This would involve getting the differences between every point in `A`

and every point in * *`B `

in parallel. This can be achieved by modifying and tiling `A`

, `B `

to get matrices of size `[3, Alen, Blen]`

.

For `A, `

we need to tile it for each point present in `B`

.` `

This generates a 3D array of the desired dimensions. For `B, `

we need to first modify it such that each point in `B`

is represented in the 3rd dimension rather than the second. This is followed up by a tiling operation in the second dimension to generate a 3D array of the desired dimensions.

A simple illustration of what tiling does to the data can be seen below.

Original A [2 4] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B [2 3] = 20.0000 21.0000 22.0000 23.0000 24.0000 25.0000 Modified A_mod [2 4] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B_mod [2 1 3] = 20.0000 23.0000 21.0000 24.0000 22.0000 25.0000 Tiled A_tiled [2 4 3] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B_tiled [2 4 3] = 20.0000 20.0000 20.0000 20.0000 23.0000 23.0000 23.0000 23.0000 21.0000 21.0000 21.0000 21.0000 24.0000 24.0000 24.0000 24.0000 22.0000 22.0000 22.0000 22.0000 25.0000 25.0000 25.0000 25.0000

Once we have the tiled versions of the matrices, a single reduction generates the desired results in a single step. The code for this optimization can be seen below.

static array dist_tile2(array a, array b) { int feat_len = a.dims(0); int alen = a.dims(1); int blen = b.dims(1); // Shape of a is (feat_len, alen, 1) array a_mod = a; // Reshape b from (feat_len, blen) to (feat_len, 1, blen) array b_mod = moddims(b, feat_len, 1, blen); // Tile both matrices to be (feat_len, alen, blen) array a_tiled = tile(a_mod, 1, 1, blen); array b_tiled = tile(b_mod, 1, alen, 1); // Do The sum operation along first dimension // Output is of shape (1, alen, blen) array dist_mod = sum(abs(a_tiled - b_tiled)); // Reshape dist_mat from (1, alen, blen) to (alen, blen) array dist_mat = moddims(dist_mod, alen, blen); return dist_mat; }

### Results

The following results have been generated on a NVIDIA Quadro K5000.

Time for dist_vec : 2177.73ms Time for dist_gfor1: 10.80ms Time for dist_gfor2: 7.35ms Time for dist_tile1: 7.23ms Time for dist_tile2: 0.25ms

- As you can see the final code is about
than the 1D vectorized code. It is also`8,700`

`times`

`faster`

than the next fastest methods.`29 times faster`

`gfor2`

(GFOR over B) is faster than`gfor1`

(GFOR over A) becausehas more points than`B`

`A.`

`tile1`

and`gfor2`

have similar performance because they employ similar mechanisms internally.

### Considerations

`tile2`

is the fastest method, but it also uses up the most memory as scratch space. It requires two temporary matrices of sizes`[feat_len, Alen, Blen].`

In this particular case,`feat_len`

was 3,`Alen`

was 200 and`Blen`

was 300. However, it is easy for the scratch space requirements to grow to a few gigabytes when the number for points is much higher (as is the case with most machine learning algorithms).- Though
`gfor2`

and`tile1`

give the same performance,`tile1`

should be preferred. This is because`GFOR`

can not be nested. However using the`tile1`

method enables us to use GFOR at a higher level in the program. - Of the functions presented here,
`tile1`

is perhaps the best compromise when constrained by memory. - Hybrid algorithms, which combine one or two of the methods shown here, can provide better performance than
`tile1.`

We will talk about such algorithms at a later date.

The relevant code for this post can be found on our github page over here.