In this blog I will wrap up our work on triangle counting in graphs on the GPU using CUDA. The two previous blogs can be found here: Part 1 and Part 2. The first part introduces the significance of the problem and the second part explains the algorithms we used in our solution.
This work was presented in greater detail at the Workshop on Irregular Applications: Architectures and Algorithms, which took place as part of Supercomputing 2014. The full report can be found here.
In the previous blogs, we discussed how the performance of triangle counting depends on both the algorithm and the CUDA kernel. Our implementation gives the data scientist control over several important parameters:
- Number of thread blocks – the number of thread blocks that are created determines how many vertices each thread block is responsible for. Each thread block receives an equal number of vertices for which it counts triangles via the adjacency-list intersection approach. A small number of thread blocks can cause workload imbalance, because many real-world networks follow a power-law degree distribution. It is also possible to allocate a single thread block to each vertex, given the large number of available thread blocks – this, however, increases overhead. As such, it is necessary to find a good balance between over-allocating and under-allocating – we used 16000 thread blocks.
- Number of concurrent intersections (in a thread block) – while it is feasible to dedicate an entire thread block to a single intersection, this typically leads to poor performance due to under-utilization. Thread blocks are usually at least as large as a warp, yet for adjacency-list intersection even a full warp can sit mostly idle: social networks are sparse, and many vertices have only a handful of adjacent vertices, so an entire warp is simply not necessary for a single intersection. The other extreme is to allocate a single thread to every intersection of a given vertex. Once again, this can lead to under-utilization, as there may not be enough concurrent intersections to occupy the entire thread block. Typically it is also preferable to allocate thread blocks that are larger than a single warp.
- Number of threads per intersection – the number of threads allocated to an intersection also plays an important role in achieving a high degree of system utilization. As mentioned above, there are two extremes: one thread per intersection and an entire thread block per intersection (with a minimum block size of one warp). We do not recommend allocating more than a single warp to a given intersection, as this requires additional synchronization and thus overhead. For sparse networks, where the average adjacency list is not larger than a warp, it is simply not efficient. However, as networks become denser and the work per intersection grows, it may become necessary to allocate more than a single warp per intersection. A minimal sketch of how these three parameters fit together in a kernel is shown right after this list.
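To make these knobs concrete, below is a minimal sketch of a triangle-counting kernel in the spirit of this approach. It is not the kernel from our report: the macro names, the CSR arrays (`rowPtr`, `colIdx`), and the binary-search intersection are illustrative choices, and the adjacency lists are assumed to be sorted.

```cuda
// Illustrative names and values; the actual kernel in the report differs.
#define NUM_BLOCKS               16000  // number of thread blocks launched
#define BLOCK_SIZE               128    // threads per block
#define THREADS_PER_INTERSECTION 8      // threads cooperating on one intersection
#define GROUPS_PER_BLOCK (BLOCK_SIZE / THREADS_PER_INTERSECTION)  // concurrent intersections

// Binary search: does the sorted adjacency list contain w?
__device__ bool contains(const int *list, int len, int w)
{
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (list[mid] == w) return true;
        if (list[mid] < w)  lo = mid + 1; else hi = mid - 1;
    }
    return false;
}

__global__ void countTriangles(const int *rowPtr, const int *colIdx,
                               int numVertices, unsigned long long *count)
{
    int group = threadIdx.x / THREADS_PER_INTERSECTION;  // which intersection
    int lane  = threadIdx.x % THREADS_PER_INTERSECTION;  // position within it

    // Each block is handed an equal share of the vertices.
    int perBlock = (numVertices + gridDim.x - 1) / gridDim.x;
    int first = blockIdx.x * perBlock;
    int last  = min(first + perBlock, numVertices);

    unsigned long long local = 0;
    for (int v = first; v < last; ++v) {
        const int *adjV = colIdx + rowPtr[v];
        int lenV = rowPtr[v + 1] - rowPtr[v];
        // Each group picks one neighbor u of v; groups work concurrently.
        for (int i = group; i < lenV; i += GROUPS_PER_BLOCK) {
            int u = colIdx[rowPtr[v] + i];
            const int *adjU = colIdx + rowPtr[u];
            int lenU = rowPtr[u + 1] - rowPtr[u];
            // Lanes stride over adj(u) and probe adj(v) for common neighbors.
            for (int j = lane; j < lenU; j += THREADS_PER_INTERSECTION)
                if (contains(adjV, lenV, adjU[j]))
                    ++local;
        }
    }
    // With a full (symmetric) CSR, each triangle is found once per ordered
    // edge and per shared endpoint, so divide the total by 6 on the host.
    atomicAdd(count, local);
}
```

A launch of the form `countTriangles<<<NUM_BLOCKS, BLOCK_SIZE>>>(...)` then exercises all three parameters at once: the grid size, the block size, and (via the compile-time constant) the number of threads per intersection.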
In the figure above we show the execution time of the triangle counting algorithm for the “preferential attachment” network from the DIMACS 10 Graph Challenge. The abscissa represents the different configurations and the ordinate represents time, so lower is better. Each configuration is denoted by a triplet x-y-z, where x is the block size, y is the number of threads used for each intersection, and z is the number of concurrent intersections in each block. Our full report contains additional performance figures for other networks.
Using smaller block sizes such as 32 or 64 threads requires each SM to execute several intersections (i.e., multiple blocks) for different vertices concurrently in order to achieve full system utilization, and these smaller block sizes incurred scheduling overhead. At the other extreme, large block configurations (such as 256 threads) also showed slower performance due to low utilization. We found the best performance to be in the range of 96, 128, or 192 threads per block.
In our implementation, the number of threads used for a single intersection was limited to 32 threads, which is the warp size – this was done to avoid synchronization across multiple warps. The configurations presented use either 1, 2, 4, 8, 16, or 32 threads per intersection, which in turn allows 32, 16, 8, 4, 2, or 1 concurrent intersections (respectively).
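To illustrate why staying within a single warp is convenient: when an intersection group never spans more than one warp, its per-lane partial counts can be combined with warp-shuffle intrinsics alone, with no shared memory and no `__syncthreads()`. The sketch below is an illustration on our part, not code from the report (it uses the modern `__shfl_down_sync` intrinsic; in 2014 the equivalent would have been `__shfl_down`).

```cuda
// Reduce the per-lane partial counts of one intersection group.
// Valid only because THREADS_PER_INTERSECTION <= 32 (a single warp),
// so no block-wide synchronization is required.
__device__ unsigned long long groupReduce(unsigned long long local)
{
    for (int offset = THREADS_PER_INTERSECTION / 2; offset > 0; offset >>= 1)
        local += __shfl_down_sync(0xffffffff, local, offset);
    return local;  // lane 0 of each group now holds that group's total
}
```

Lane 0 of each group can then issue a single atomicAdd to the global counter, rather than one per thread as in the simpler sketch earlier.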
We noticed that the best performance is typically obtained when allocating 8 threads per intersection. These intermediate-size configurations allow for good memory locality and a good balance between the work-partitioning stage and the intersection stage. The two extreme configurations (a single thread per intersection and a full warp per intersection) do not offer the best performance. In fact, the difference between the best-performing and the worst-performing configuration is in the range of 3X-7X. Thus, fine-tuning these control parameters is important.
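Expressed with the illustrative macros from the first sketch, the best-performing configuration (128-thread blocks, 8 threads per intersection, 16 concurrent intersections per block) is simply a particular choice of compile-time constants and launch parameters:

```cuda
// Hypothetical host-side launch using the 128 / 8 / 16 configuration;
// d_rowPtr, d_colIdx, and d_count are assumed device allocations.
// BLOCK_SIZE = 128 and THREADS_PER_INTERSECTION = 8 give
// GROUPS_PER_BLOCK = 16 concurrent intersections per block.
countTriangles<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_rowPtr, d_colIdx, numVertices, d_count);
```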