Programmers and Data Scientists want to take advantage of fast and parallel computational devices. Writing vectorized code is becoming a necessity to get the best performance out of the current generation parallel hardware and scientific computing software. However, writing vectorized code may not be intuitive immediately. There are many ways you can vectorize a given code segment. Each method has its own benefits and drawbacks. Hence, writing vectorized code involves analyzing the pros and cons of the available methods and choosing the right one to solve your problem.
In this post, I present various ways to vectorize your code using ArrayFire. ArrayFire is chosen because of my familiarity with the software. The same methods can be easily used in numpy, octave, julia or other scientific computing software.
The problem
The problem we will be looking at is the calculation of distances from every point in a set A
to every point in set B
. For example, if A
has 200
points and B
has 300
points, the distance matrix will be of size 200 x 300
. Each point has coordinates in 3 dimensional space. We can generate the necessary data in the following manner.
A = randu(3, 200); B = randu(3, 300);
Naive implementation
The naive method would be to have three loops. The outer most loop iterating through points in A. The second loop iterating through points in B. The inner most loop iterating through the three dimensions. You can find difference in coordinates between the two points and find the distance between the two points.
This can be implemented in the following manner. The code is shown for reference. This is extremely slow on GPUs
// CODE SHOWN FOR REFERENCE. DO NOT USE. IT IS EXTREMELY SLOW. static array dist_naive(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // Iterate through columns a for (int ii = 0; ii < a.dims(1); ii++) { // Iterate through columns of b for (int jj = 0; jj < b.dims(1); jj++) { // Get the sum of absolute differences for (int kk = 0; kk < a.dims(0); kk++) { dist_mat(ii, jj) += abs(a(kk, ii) - b(kk, jj)); } } } return dist_mat; }
Vectorizing along one dimension
Keeping the outer two loops the same, you can improve the performance by treating the set of three coordinates as a single vector. You can perform the difference between these vectors in parallel. A reduction operation after getting the absolute differences will give us the distance without the need of a third loop.
static array dist_vec(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // Iterate through columns a for (int ii = 0; ii < a.dims(1); ii++) { array avec = a(span, ii); // Iterate through columns of b for (int jj = 0; jj < b.dims(1); jj++) { array bvec = b(span, jj); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; }
Using GFOR
GFOR is a parallel version of the for loop in ArrayFire. Since GFOR's cannot be nested, we can replace one of the for loops using GFOR. This can be done in the following manners.
static array dist_gfor1(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // GFOR along columns of a gfor (array ii, a.dims(1)) { array avec = a(span, ii); // Itere through columns of b for (int jj = 0; jj < b.dims(1); jj++) { array bvec = b(span, jj); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; } static array dist_gfor2(array a, array b) { array dist_mat = constant(0, a.dims(1), b.dims(1)); // GFOR along columns of b gfor (array jj, b.dims(1)) { array bvec = b(span, jj); // Iterate through columns of A for (int ii = 0; ii < a.dims(1); ii++) { array avec = a(span, ii); // get SAD using sum on the vector dist_mat(ii, jj) = sum(abs(avec - bvec)); } } return dist_mat; }
Vectorizing along two dimensions
We can get rid of another FOR loop to have 2-D vectorized code. You can get a single point from set A
and replicate it to create a matrix that is the same size as B
. You can now find distance of a single
point in A
from all
points in B
. The code can be seen below.
static array dist_tile1(array a, array b) { // int feat_len = a.dims(0); // Same as b.dims(0); int alen = a.dims(1); int blen = b.dims(1); array dist_mat = constant(0, alen, blen); // Iterate through columns of b for (int jj = 0; jj < blen; jj++) { // Get the column vector of b // shape of bvec is (feat_len, 1) array bvec = b(span, jj); // Tile avec to be same size as a // shape of bvec_tiled is (feat_len, alen) array bvec_tiled = tile(bvec, 1, alen); // Get the sum of absolute differences array sad = sum(abs(bvec_tiled - a)); // sad is row vector, dist_mat needs column vector // transpose sad and fill in dist_mat dist_mat(span, jj) = sad.T(); } return dist_mat; }
Vectorizing along three dimensions
The previous method can be further extended. This would involve getting the differences between every point in A
and every point in B
in parallel. This can be achieved by modifying and tiling A
, B
to get matrices of size [3, Alen, Blen]
.
For A,
we need to tile it for each point present in B
.
This generates a 3D array of the desired dimensions. For B,
we need to first modify it such that each point in B
is represented in the 3rd dimension rather than the second. This is followed up by a tiling operation in the second dimension to generate a 3D array of the desired dimensions.
A simple illustration of what tiling does to the data can be seen below.
Original A [2 4] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B [2 3] = 20.0000 21.0000 22.0000 23.0000 24.0000 25.0000 Modified A_mod [2 4] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B_mod [2 1 3] = 20.0000 23.0000 21.0000 24.0000 22.0000 25.0000 Tiled A_tiled [2 4 3] = 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.0000 17.0000 B_tiled [2 4 3] = 20.0000 20.0000 20.0000 20.0000 23.0000 23.0000 23.0000 23.0000 21.0000 21.0000 21.0000 21.0000 24.0000 24.0000 24.0000 24.0000 22.0000 22.0000 22.0000 22.0000 25.0000 25.0000 25.0000 25.0000
Once we have the tiled versions of the matrices, a single reduction generates the desired results in a single step. The code for this optimization can be seen below.
static array dist_tile2(array a, array b) { int feat_len = a.dims(0); int alen = a.dims(1); int blen = b.dims(1); // Shape of a is (feat_len, alen, 1) array a_mod = a; // Reshape b from (feat_len, blen) to (feat_len, 1, blen) array b_mod = moddims(b, feat_len, 1, blen); // Tile both matrices to be (feat_len, alen, blen) array a_tiled = tile(a_mod, 1, 1, blen); array b_tiled = tile(b_mod, 1, alen, 1); // Do The sum operation along first dimension // Output is of shape (1, alen, blen) array dist_mod = sum(abs(a_tiled - b_tiled)); // Reshape dist_mat from (1, alen, blen) to (alen, blen) array dist_mat = moddims(dist_mod, alen, blen); return dist_mat; }
Results
The following results have been generated on a NVIDIA Quadro K5000.
Time for dist_vec : 2177.73ms Time for dist_gfor1: 10.80ms Time for dist_gfor2: 7.35ms Time for dist_tile1: 7.23ms Time for dist_tile2: 0.25ms
- As you can see the final code is about
8,700
times
faster
than the 1D vectorized code. It is also29 times faster
than the next fastest methods. gfor2
(GFOR over B) is faster thangfor1
(GFOR over A) becauseB
has more points thanA.
tile1
andgfor2
have similar performance because they employ similar mechanisms internally.
Considerations
tile2
is the fastest method, but it also uses up the most memory as scratch space. It requires two temporary matrices of sizes[feat_len, Alen, Blen].
In this particular case,feat_len
was 3,Alen
was 200 andBlen
was 300. However, it is easy for the scratch space requirements to grow to a few gigabytes when the number for points is much higher (as is the case with most machine learning algorithms).- Though
gfor2
andtile1
give the same performance,tile1
should be preferred. This is becauseGFOR
can not be nested. However using thetile1
method enables us to use GFOR at a higher level in the program. - Of the functions presented here,
tile1
is perhaps the best compromise when constrained by memory. - Hybrid algorithms, which combine one or two of the methods shown here, can provide better performance than
tile1.
We will talk about such algorithms at a later date.
The relevant code for this post can be found on our github page over here.