A Fortune 200 insurance company located on the West Coast was using a third-party, closed-source application to make inferences about future market conditions in order to hedge their investments. To carry out calculations over millions of possible scenarios, they ran a small cluster of 1,300 hyperthreaded CPU cores in blade servers. They projected that their business needs would double in the near term, and they were already near capacity with their existing system.
Mission
The Client hired ArrayFire to conduct a comprehensive audit of their system to assess its performance in nine key areas:
- The technological platform – a high-level overview of the system, example execution flows, a discussion of shortcomings in the current approach, and recommendations for how the system could be made more efficient, faster, and more reliable.
- The code base – our engineers evaluated the code for composition, quality, conformance with industry best practices, and performance characteristics.
- Data management and governance policies – a comparison of the Client’s policies with industry standards.
- Performance and optimization strategies – using runtime profiling, ArrayFire explored approaches for optimizing the system to improve throughput.
- Fault tolerance, stability, and reliability – through on-site inspections, interviews with employees, and a review of documentation, ArrayFire assessed whether machines were sufficiently redundant to meet business needs, inspected flows of data throughout the system (storage, network, compute infrastructure, etc.), and attempted to locate single points of failure and unreliable cross-over points.
- Scalability – ArrayFire tested various assumptions about how the platform would tolerate projected business growth and suggested how the computational environment could be modified to accommodate future conditions.
- Compatibility with emerging trends – ArrayFire surveyed the academic and business literature for state-of-the-art implementations relevant to the Client’s application.
- Disaster recovery platforms – through a review of the disaster recovery documentation and interviews with employees, ArrayFire compared the Client’s disaster recovery platform with industry standards.
- Change management process – ArrayFire compared the Client’s process for implementing and testing code changes with industry best practices.
Action
Over a period of two months, ArrayFire investigated the items above, conducted on-site and phone interviews with employees, and profiled all aspects of the existing software. This included profiling disk I/O, database access patterns, network bandwidth utilization, processor utilization, and load balancing, as well as evaluating execution-order modifications and code optimizations.
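The case study does not include the profiling tooling itself, but a minimal sketch of this kind of system-level utilization sampling, assuming the Python psutil library, might look like the following; the interval, duration, and chosen metrics are illustrative, not the Client’s configuration.

```python
import time
import psutil

def sample_utilization(duration_s=60, interval_s=1.0):
    """Periodically sample CPU, disk, and network counters.

    Returns a list of per-interval snapshots that can be inspected
    for idle cores, disk I/O saturation, and network bottlenecks.
    """
    snapshots = []
    last_disk = psutil.disk_io_counters()
    last_net = psutil.net_io_counters()
    end = time.time() + duration_s
    while time.time() < end:
        # Per-core CPU utilization over the interval (percent).
        cpu = psutil.cpu_percent(interval=interval_s, percpu=True)
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        snapshots.append({
            "cpu_percent_per_core": cpu,
            "disk_read_bytes": disk.read_bytes - last_disk.read_bytes,
            "disk_write_bytes": disk.write_bytes - last_disk.write_bytes,
            "net_bytes_sent": net.bytes_sent - last_net.bytes_sent,
            "net_bytes_recv": net.bytes_recv - last_net.bytes_recv,
        })
        last_disk, last_net = disk, net
    return snapshots

if __name__ == "__main__":
    for snap in sample_utilization(duration_s=5):
        print(snap)
```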
Results
These techniques uncovered previously unknown bottlenecks. In particular, poorly chosen virtual disk layouts in a SAN limited disk I/O, the job execution order caused heavy database lock contention, and poor load balancing left a significant portion of the cluster idle during production runs.
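The report does not detail the scheduling fix, but the lock-contention finding can be illustrated with a small, hypothetical scheduler sketch: jobs that write to the same database table are interleaved so that jobs running concurrently touch disjoint tables. The `Job` type and table names below are illustrative assumptions, not the Client’s schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest

@dataclass
class Job:
    name: str
    table: str  # hypothetical: the DB table this job locks while writing

def interleave_by_table(jobs):
    """Reorder jobs so consecutive jobs hit different tables.

    If a pool of workers pulls jobs from the front of this list,
    concurrently running jobs are far less likely to contend for
    the same table lock than with a table-sorted ordering.
    """
    groups = defaultdict(list)
    for job in jobs:
        groups[job.table].append(job)
    ordered = []
    # Round-robin across tables: one job per table per pass.
    for batch in zip_longest(*groups.values()):
        ordered.extend(j for j in batch if j is not None)
    return ordered

jobs = [Job(f"scenario-{i}", table) for i, table in
        enumerate(["rates"] * 3 + ["claims"] * 3 + ["assets"] * 3)]
for job in interleave_by_table(jobs):
    print(job.name, job.table)
```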
Through profiling, ArrayFire predicted that portions of their code could be accelerated by 1.3–18×. Furthermore, some components of the system could be improved by nearly 50% with minor modifications. Using known runtime information, ArrayFire predicted that, on the existing hardware, the total process runtime could be reduced by approximately 45% if all of our suggestions were implemented. Subsequent communications with the Client indicated that they implemented most of our prioritized suggestions and decreased total production runtimes by nearly 39%.
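The whole-process estimate follows the usual Amdahl’s-law bookkeeping: each component’s share of total runtime is divided by its predicted speedup, and the shares are summed. The fractions and speedups below are hypothetical placeholders, not the Client’s measured profile; they simply show how per-component gains in the 1.3–18× range can net out to an overall reduction of roughly 45%.

```python
def overall_reduction(components):
    """components: list of (fraction_of_runtime, predicted_speedup).

    Fractions must sum to 1.0. Returns the fractional reduction in
    total runtime if every component achieves its predicted speedup.
    """
    new_runtime = sum(frac / speedup for frac, speedup in components)
    return 1.0 - new_runtime

# Hypothetical breakdown (not the Client's measured numbers):
components = [
    (0.40, 1.4),   # dominant phase, modest speedup
    (0.30, 2.0),   # e.g., fixed SAN layout and load balancing
    (0.20, 18.0),  # small hot spot with a large predicted speedup
    (0.10, 1.0),   # unchanged portions
]
print(f"predicted reduction: {overall_reduction(components):.0%}")
# -> predicted reduction: 45%
```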