A while back I wrote a blog post on triangle counting in networks using CUDA (see Part 1 for the post). In this post, I will cover in more detail the internals of the algorithm and the CUDA implementation. Before taking a deep dive into the details of the algorithm, I want to remind the reader that there are multiple ways of finding triangles in a graph. Our approach is based on intersecting two adjacency lists and counting the elements common to both. Two additional approaches would be to compare all possible node triplets, either directly in the graph or via matrix multiplication of the adjacency matrix. These approaches can be computationally demanding, especially for sparse networks, which makes the adjacency list intersection approach preferable in most cases. For the intersection to be efficient, the lists are typically presorted; this is by no means an onerous requirement.

An extended discussion of the pros and cons of the different approaches can be found in the papers by Schank & Wagner and by Green et al.

The fact that the adjacency lists are sorted is important: intersecting two sorted lists is conceptually very similar to merging two sorted lists. Thus, parallel merge algorithms can be adapted for the intersection.

The key difference between a merge and an intersection is that most parallel merge algorithms offer exclusive writing (in terms of location) in the output array: each thread is responsible for merging an independent section of the output and, in fact, knows the exact position in the output array that it will write to. For an intersection this is more complicated, as the exact position a thread needs to write to in a sorted fashion depends on the sorted input and on how many intersections were found before it. Fortunately, in the case of triangle counting, the actual set of overlapping elements is not necessary; only the number of common elements is required. This allows us to use the recently developed Merge Path and GPU Merge Path algorithms for our purposes. Merge Path is a visual approach for merging two sorted arrays into one sorted array.
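To make the counting idea concrete, here is a minimal serial C++ sketch (my own illustration, not the CUDA kernel itself) of counting the common elements of two sorted adjacency lists with two pointers; the function name is mine:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count common elements of two sorted lists (the per-edge kernel of
// triangle counting). Only the count is needed, so no output list is built.
std::size_t intersect_count(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = 0, j = 0, count = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      ++i;       // advance the pointer at the smaller value
        else if (a[i] > b[j]) ++j;
        else { ++count; ++i; ++j; }      // common neighbor found
    }
    return count;
}
```

Summing this count over the adjacency lists of every edge's two endpoints yields the triangle count (with each triangle counted once per edge direction considered, so a constant correction factor applies depending on the representation).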

Merge Path restates the problem of merging two sorted arrays as the traversal of a 2-D grid from the top-left corner to the bottom-right corner, where the only legal moves are downwards and to the right. This can be seen in the figure below. The traversal of the grid depends on the values of the input arrays (one is placed along the top of the grid and the other along its left side). The decision to move down or to the right depends on the elements at the current indices of the path. The path starts at index (0,0), where the value A[0] is compared with the value B[0]; based on the comparison, the movement direction is decided. Each move essentially advances the pointer-index into one of the two arrays. Thus, only elements on the path are compared.
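As a serial illustration of this grid traversal (a sketch of the concept, not the GPU implementation), the following merges two sorted arrays by repeatedly comparing the elements at the current indices and moving "right" (consume from A) or "down" (consume from B):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge Path viewed as a grid walk: start at (i, j) = (0, 0), compare A[i]
// with B[j], and move in the direction of the smaller element until one
// array is exhausted. The sequence of moves traces the merge path.
std::vector<int> merge_by_path(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < A.size() && j < B.size()) {
        if (A[i] <= B[j]) out.push_back(A[i++]);   // move "right": consume from A
        else              out.push_back(B[j++]);   // move "down":  consume from B
    }
    out.insert(out.end(), A.begin() + i, A.end()); // remainder of A
    out.insert(out.end(), B.begin() + j, B.end()); // remainder of B
    return out;
}
```

Note that exactly one comparison is made per move, so only elements on the path are ever compared.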

Using the Merge Path concept, it is possible to partition the work into equal-length paths such that each core is responsible for merging an equal number of elements. With a small augmentation of the movement pattern, Merge Path can be changed to support intersection: an additional move, diagonally down-right, is taken when the elements pointed to by the indices in both lists are equal. So why is all this important? The simple truth is that it allows us to split a single adjacency list intersection between multiple GPU threads. It also permits executing several different adjacency list intersections concurrently in a single CUDA block.
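The partitioning idea can be sketched serially as follows. This is my own simplified variant, not the GPU code: a binary search along a cross-diagonal of the grid locates where the merge path crosses it, and each partition then counts the common elements along its own segment of the path. Instead of the diagonal move, this sketch keeps the two-move path and detects an equal pair at the step that consumes the B element, which assumes each adjacency list contains distinct vertex IDs (true for simple graphs):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Binary search along cross-diagonal d (all points with i + j = d) for the
// unique point where the merge path crosses it, using the convention that
// ties consume from A first. Returns i; the crossing point is (i, d - i).
std::size_t path_cross(const std::vector<int>& A, const std::vector<int>& B,
                       std::size_t d) {
    std::size_t lo = d > B.size() ? d - B.size() : 0;
    std::size_t hi = std::min(d, A.size());
    while (lo < hi) {                      // smallest i with A[i] > B[d-1-i]
        std::size_t i = lo + (hi - lo) / 2;
        if (A[i] <= B[d - 1 - i]) lo = i + 1;
        else                      hi = i;
    }
    return lo;
}

// Count common elements along one segment of the path, from diagonal d0 to
// diagonal d1. Since each list holds distinct IDs, an equal pair is detected
// exactly once: at the step that consumes B[j], right after its twin in A,
// so a match is recorded when A[i-1] == B[j]. Partial counts from disjoint
// segments covering [0, |A|+|B|) sum to the total intersection size.
std::size_t segment_intersect_count(const std::vector<int>& A,
                                    const std::vector<int>& B,
                                    std::size_t d0, std::size_t d1) {
    std::size_t i = path_cross(A, B, d0);  // segment's start point on the path
    std::size_t j = d0 - i;
    std::size_t count = 0;
    for (std::size_t d = d0; d < d1; ++d) {
        if (i < A.size() && (j >= B.size() || A[i] <= B[j])) {
            ++i;                                     // consume from A
        } else {
            if (i > 0 && A[i - 1] == B[j]) ++count;  // B[j] matched in A
            ++j;                                     // consume from B
        }
    }
    return count;
}
```

On the GPU, each thread would run one such segment, so a single adjacency list intersection is split across threads and the per-thread counts are reduced at the end.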

In my next blog post I will cover performance and give some additional details on the benefits of doing multiple list intersections concurrently. A full tech report on this will be published in the near future, so stay tuned.