A Fortune 300 financial company located in the northeastern United States ported their CPU code (C/C++) to CUDA to speed up their financial calculations. Their primary objective was to reduce the run time of this code. Before the port, a full run took ten to twelve hours. After the port, the same work finished in about 30 minutes, a speedup of roughly 20x to 24x.
The Client had several C/C++ programmers on their team and trained many of them to use CUDA. Their code used CUDA to accelerate Monte Carlo calculations that analyze various scenarios in their hedge-fund projection system (HPS). Most of the calculations were carried out on independent policies, so the workload was well suited to parallel programming, with no communication bottlenecks.
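The independence of policies is what makes this workload parallelize so well, and a minimal host-side C++ sketch can illustrate the idea. Everything here is an assumption made for illustration: the `Policy` fields, the function names `project_policy` and `project_all`, and the placeholder return model (normally distributed yearly returns) are hypothetical and are not the Client's code or model. Each policy gets its own random stream seeded from its index, so its result does not depend on which worker processes it or in what order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical policy record; the real fields are not described in this article.
struct Policy {
    double initial_value;  // starting account value
    int    horizon_years;  // number of yearly steps to project
};

// Monte Carlo estimate of the mean projected value of ONE policy.
// The return model below is a placeholder assumption, not the Client's model.
double project_policy(const Policy& p, std::size_t policy_index,
                      int num_scenarios, std::uint64_t base_seed) {
    assert(num_scenarios > 0);
    // The seed depends only on the policy index, so the result is the same
    // no matter which worker handles this policy.
    std::mt19937_64 rng(base_seed + policy_index);
    std::normal_distribution<double> yearly_return(0.05, 0.15);

    double sum = 0.0;
    for (int s = 0; s < num_scenarios; ++s) {
        double value = p.initial_value;
        for (int y = 0; y < p.horizon_years; ++y) {
            value *= 1.0 + yearly_return(rng);
        }
        sum += value;
    }
    return sum / num_scenarios;
}

// Project every policy. No iteration reads or writes another iteration's
// data, so each iteration could be handed to a separate GPU thread or block.
std::vector<double> project_all(const std::vector<Policy>& policies,
                                int num_scenarios, std::uint64_t base_seed) {
    std::vector<double> results(policies.size());
    for (std::size_t i = 0; i < policies.size(); ++i) {
        results[i] = project_policy(policies[i], i, num_scenarios, base_seed);
    }
    return results;
}
```

In a CUDA version, the outer loop in `project_all` would typically become the kernel launch grid, with a device-side random number generator taking the place of `std::mt19937_64`.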
The CPU code was slow, and this was holding them back. Because a run took the entire night, they could not run several different scenarios and take corrective action based on the results. The primary reason for moving to GPUs was to speed up the calculations.
The Client invested heavily in building CPU+GPU clusters, relying on AMAX. They had the code running on four compute nodes, each with four NVIDIA K20X GPUs, for 16 GPUs in total. They were in the process of installing an even bigger system with 64 GPUs in total.
Our initial plan was adjusted as the project progressed and we identified key areas that needed improvement. The first couple of weeks were somewhat tricky because we could not get a working version of the codebase that was representative of the code they were actually running. The code was in a very active state of development and integration on their side.
We made several suggestions to improve the efficiency of the codebase, most of which were minor. The code had two levels of parallelism: the kernel level (GPU CUDA code) and the cluster level (MPI). One of our most useful suggestions applied at the MPI level, where we recommended ordering the policies randomly so that the computing load would be distributed more uniformly. Their engineers implemented this immediately, and it led to a major efficiency increase. Another useful contribution was to stress-test their code to estimate how many more policies it could handle. This was simple to do but was a big hit within the company.
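The effect of randomizing the policy order can be sketched with a small C++ example. It assumes, purely for illustration, that policies are handed to ranks in contiguous blocks and that expensive policies are clustered together in the input. The function names `rank_loads`, `imbalance`, and `shuffled`, and the per-policy cost values, are hypothetical. The sketch runs on the host without MPI; a "rank" here is just an index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Total cost assigned to each rank when policies are split into
// contiguous blocks (rank r gets indices [n*r/R, n*(r+1)/R)).
std::vector<double> rank_loads(const std::vector<double>& costs, int num_ranks) {
    assert(num_ranks > 0);
    const std::size_t n = costs.size();
    const std::size_t ranks = static_cast<std::size_t>(num_ranks);
    std::vector<double> loads(ranks, 0.0);
    for (std::size_t r = 0; r < ranks; ++r) {
        const std::size_t begin = n * r / ranks;
        const std::size_t end = n * (r + 1) / ranks;
        for (std::size_t i = begin; i < end; ++i) {
            loads[r] += costs[i];
        }
    }
    return loads;
}

// Ratio of the busiest rank's load to the mean load. 1.0 means perfectly
// balanced; the run finishes only when the busiest rank is done.
double imbalance(const std::vector<double>& loads) {
    assert(!loads.empty());
    double total = 0.0;
    for (double x : loads) total += x;
    const double mean = total / static_cast<double>(loads.size());
    const double busiest = *std::max_element(loads.begin(), loads.end());
    return mean > 0.0 ? busiest / mean : 1.0;
}

// Return a randomly permuted copy of the per-policy costs.
// A fixed seed keeps the permutation reproducible from run to run.
std::vector<double> shuffled(std::vector<double> costs, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(costs.begin(), costs.end(), rng);
    return costs;
}
```

With 1,000 policies where the first half cost 1 unit each and the second half cost 10 units each, a contiguous split across four ranks gives the two heaviest ranks ten times the work of the two lightest. Shuffling first spreads the expensive policies across all ranks, so the busiest rank ends up much closer to the mean.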
At the beginning of the project, we were unable to access a working version of their code. Additionally, we were not able to access their machines from our laptops. Both of these challenges were overcome in the following week and no significant adjustments to the original plan were required.
The final deliverable was a detailed technical report organized into six modules, each assessing one aspect of the Client's HPS: code base, performance and optimization, fault tolerance, scalability, compatibility with emerging CUDA trends, and disaster recovery. We also delivered an executive summary of our findings suitable for higher management.
The Client had already developed the code base for their HPS. They engaged ArrayFire to obtain an expert assessment of their work and a "stamp of approval" that the code would work as intended and was of sufficiently good quality. We successfully delivered on all objectives. Our assessment and findings were fully documented in nearly 100 pages of detailed technical reports and slide modules, which contained benchmarks and visuals related to the code base and our assessment of their HPS.
The Client was extremely satisfied with our input and suggestions. We met or exceeded their expectations in all project objectives and deliverables.