C++ Design Patterns For Low-Latency Applications

Even though performance optimizations may seem to have lost their relevance in an era of ever-increasing hardware performance, there are still many good reasons to spend some time optimizing code. A recent preprint article by [Paul Bilokon] and [Burak Gunduz] of Imperial College London focuses specifically on low-latency patterns that are relevant for applications such as high-frequency trading (HFT). In HFT, the small margins per trade are compensated for by churning through absolutely massive volumes of trades, all of which relies on extremely low latency to gain every advantage. Although FPGA-based solutions are very common in HFT due to their low latency and high parallelism, C++ is the main language being used beyond FPGAs.

Although many of the optimizations listed in the paper are quite obvious, such as prewarming the CPU caches, using constexpr, loop unrolling, and inlining, other patterns are less obvious, such as hot path versus cold path separation. This overlaps with the branch reduction pattern: both involve separating commonly executed code from rarely executed code (like error handling and logging), which improves the use of the CPU's caches and prevents branch mispredictions, as the benchmarks (using Google Benchmark) clearly demonstrate. All of the design patterns can also be found in the GitHub repository.
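As a rough sketch of the idea (our illustration, not the paper's code), the rarely taken error path can be pushed into a separate, never-inlined function so that it stays out of the instruction cache lines and branch-predictor slots the fast path needs. The attributes below are GCC/Clang and C++20 specifics, and the function names are made up for the example:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Cold path: error handling lives in its own never-inlined function,
// keeping its code away from the hot path's cache lines.
[[gnu::cold]] [[gnu::noinline]]
static void handle_bad_order(std::uint64_t id) {
    throw std::runtime_error("rejected order " + std::to_string(id));
}

// Hot path: the common case runs straight through; the rare branch is
// annotated so the compiler lays it out off the fast path.
inline void process_order(std::uint64_t id, std::int64_t qty) {
    if (qty <= 0) [[unlikely]] {
        handle_bad_order(id);  // rare case: delegate to the cold function
    }
    // ... common-case order handling here ...
}
```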

Other interesting tidbits are the impact of signed versus unsigned comparisons, of mixing floating point datatypes, and of course lock-free programming using a ring buffer design. The only things that appear to be missing from the list are aligned versus unaligned memory accesses and zero-copy optimizations, but those should be easy additions to implement and test next to the other optimizations in this paper.
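To give a flavor of the lock-free pattern, here is a minimal single-producer/single-consumer ring buffer sketch using C++11 atomics; it is a generic illustration under our own naming, and the paper's version in the GitHub repository may differ in its details:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring buffer sketch. Capacity must be a power of two so
// the index wrap is a cheap bitmask instead of a modulo.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    // Keep the counters on separate cache lines to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // advanced by consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // advanced by producer

public:
    bool push(const T& v) {               // producer thread only
        const auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                 // buffer full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {              // consumer thread only
        const auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;          // buffer empty
        T v = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return v;
    }
};
```

The acquire/release pairing on the head and tail counters is all the synchronization the single-producer/single-consumer case needs; no locks or compare-and-swap loops are involved.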

Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

What is the right time to optimize code? This is a very good question, which usually comes down to two answers. The first answer is to have a good design for the code to begin with, because ‘optimization’ does not mean ‘fixing bad design decisions’. The second answer is that it should happen after the application has been sufficiently debugged and its developers are at risk of getting bored.

There should also be a goal for the optimization, based on what makes sense for the application. Does it need to process data faster? Should it send less data over the network or to disk? Shouldn’t one really have a look at that memory usage? And just what is going on inside those CPU caches that makes performance sometimes drop off a cliff on a single core?

All of this and more can be analyzed using tools from the Valgrind suite, including Cachegrind, Callgrind, DHAT and Massif.
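As a small example of putting one of these tools to work (our own snippet, not from the article), Callgrind's client-request macros let you confine measurement to a region of interest, assuming the Valgrind development headers are installed:

```cpp
// Build normally, then run under, e.g.:
//   valgrind --tool=callgrind --instr-atstart=no ./app
//   callgrind_annotate callgrind.out.<pid>
#include <valgrind/callgrind.h>  // macros are cheap no-ops outside Valgrind
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);

    CALLGRIND_START_INSTRUMENTATION;  // only profile the region we care about
    long sum = std::accumulate(data.begin(), data.end(), 0L);
    CALLGRIND_STOP_INSTRUMENTATION;
    CALLGRIND_DUMP_STATS;             // flush the counts collected so far

    std::printf("%ld\n", sum);
}
```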

Keeping Those Cores Cool

Modern-day processors are designed with low power usage in mind, regardless of whether they are aimed at servers, desktop systems, or embedded applications. This essentially means that they sit in a low-power state when not doing any work (the idle loop), with some CPUs and microcontrollers turning off power to parts of the chip that are not being used. Consequently, the more the processor has to do, the more power it will use and the hotter it will get.
