How to write vectorized code

Pavan ArrayFire, C/C++ Leave a Comment

Programmers and Data Scientists want to take advantage of fast and parallel computational devices. Writing vectorized code is becoming a necessity to get the best performance out of the current generation parallel hardware and scientific computing software. However, writing vectorized code may not be intuitive immediately. There are many ways you can vectorize a given code segment. Each method has its own benefits and drawbacks. Hence, writing vectorized code involves analyzing the pros and cons of the available methods and choosing the right one to solve your problem.

In this post, I present various ways to vectorize your code using ArrayFire. ArrayFire is chosen because of my familiarity with the software. The same methods can be easily used in numpy, octave, julia or other scientific computing software.

The problem

The problem we will be looking at is the calculation of distances from every point in a set A to every point in set B. For example, if A has 200 points and B has 300 points, the distance matrix will be of size 200 x 300. Each point has coordinates in 3 dimensional space. We can generate the necessary data in the following manner.

Naive implementation

The naive method would be to have three loops. The outer most loop iterating through points in A. The second loop iterating through points in B. The inner most loop iterating through the three dimensions. You can find difference in coordinates between the two points and find the distance between the two points.

This can be implemented in the following manner. The code is shown for reference. This is extremely slow on GPUs

Vectorizing along one dimension

Keeping the outer two loops the same, you can improve the performance by treating the set of three coordinates as a single vector. You can perform the difference between these vectors in parallel. A reduction operation after getting the absolute differences will give us the distance without the need of a third loop.

Using GFOR

GFOR is a parallel version of the for loop in ArrayFire. Since GFOR's cannot be nested, we can replace one of the for loops using GFOR. This can be done in the following manners.

Vectorizing along two dimensions

We can get rid of another FOR loop to have 2-D vectorized code. You can get a single point from set and replicate it to create a matrix that is the same size as B. You can now find distance of a single point in from all points in B. The code can be seen below.

Vectorizing along three dimensions

The previous method can be further extended. This would involve getting the differences between every point in A  and every point in in parallel. This can be achieved by modifying and tiling A, to get matrices of size [3, Alen, Blen].

For A, we need to tile it for each point present in B. This generates a 3D array of the desired dimensions. For B, we need to first modify it such that each point in B is represented in the 3rd dimension rather than the second. This is followed up by a tiling operation in the second dimension to generate a 3D array of the desired dimensions.

A simple illustration of what tiling does to the data can be seen below.

Once we have the tiled versions of the matrices, a single reduction generates the desired results in a single step. The code for this optimization can be seen below.


The following results have been generated on a NVIDIA Quadro K5000.

  • As you can see the final code is about 8,700 times faster than the 1D vectorized code. It is also 29 times faster than the next fastest methods.
  • gfor2 (GFOR over B) is faster than gfor1 (GFOR over A) because B has more points than A. 
  • tile1 and gfor2 have similar performance because they employ similar mechanisms internally.


  • tile2 is the fastest method, but it also uses up the most memory as scratch space. It requires two temporary matrices of sizes [feat_len, Alen, Blen]. In this particular case, feat_len was 3, Alen was 200 and Blen was 300. However, it is easy for the scratch space requirements to grow to a few gigabytes when the number for points is much higher (as is the case with most machine learning algorithms).
  • Though gfor2 and tile1 give the same performance, tile1 should be preferred. This is because GFOR can not be nested. However using the tile1 method enables us to use GFOR at a higher level in the program.
  • Of the functions presented here, tile1 is perhaps the best compromise when constrained by memory.
  • Hybrid algorithms, which combine one or two of the methods shown here, can provide better performance than tile1. We will talk about such algorithms at a later date.

The relevant code for this post can be found on our github page over here.