C++ Design Patterns For Low-Latency Applications

With performance optimizations seemingly having lost their relevance in an era of ever-increasing hardware performance, there are still many good reasons to spend some time optimizing code. In a recent preprint article by [Paul Bilokon] and [Burak Gunduz] of the Imperial College London the focus is specifically on low-latency patterns that are relevant for applications such as high-frequency trading (HFT). In HFT the small margins are compensated for by churning through absolutely massive volumes of trades, all of which relies on extremely low latency to gain every advantage. Although FPGA-based solutions are very common in HFT due their low-latency, high-parallelism, C++ is the main language being used beyond FPGAs.

Although many of the optimizations listed in the paper are quite obvious, such as prewarming the CPU caches, using constexpr, loop unrolling and use of inlining, other patterns are less obvious, such as hotpath versus coldpath. This overlaps with the branch reduction pattern, with both patterns involving the separation of commonly and rarely executed code (like error handling and logging), improving use of the CPU’s caches and preventing branch mispredictions, as the benchmarks (using Google Benchmark) clearly demonstrates. All design patterns can also be found in the GitHub repository.

Other interesting tidbits are the impact of signed and unsigned comparisons, mixing floating point datatypes and of course lock-free programming using a ring buffer design. Only missing from this list appears to be aligned vs unaligned memory accesses and zero-copy optimizations, but those should be easy additions to implement and test next to the other optimizations in this paper.

C Compiler Exists Entirely In Vim

8cc.vim is a C compiler that exists as pure Vimscript. Is it small? It sure is! How about fast? Absolutely not! Efficient? Also no. But does it work and is it neat? You betcha!

Ever typed :wq to write the buffer and exit in Vim? When you do that, you’re using Vimscript. Whenever one enters command mode : in Vim, one is in fact using a live Vimscript interpreter. That’s the space in which this project exists and does its magic. Given enough time, anyway.

Vimscript itself was created by [Bram Moolenaar] in 1991. The idea was to execute batches of vim commands programmatically. It’s been used for a variety of purposes since then.

8cc is a lightweight C compiler that has been supplanted by chibicc, but that doesn’t matter much because as author [rhysd] admits, this is really just a fun concept project more than anything. It may take twenty minutes or more to compile “hello world”, but doing it entirely from within Vim is a trip.

The Performance Impact Of C++’s `final` Keyword For Optimization

In the world of software development the term ‘optimization’ is generally reason for experienced developers to start feeling decidedly nervous, especially when a feature is marked as an ‘easy and free optimization’. The final keyword introduced in C++11 is one of such features. It promises a way to speed up object-oriented code by omitting the vtable call indirection by marking a class or member function as – unsurprisingly – final, meaning that it cannot be inherited from or overridden. Inspired by this promise, [Benjamin Summerton] figured that he’d run a range of benchmarks to see what performance uplift he’d get on his ray tracing project.

To be as thorough as possible, the tests were run on three different systems, including 64-bit Intel and AMD systems, as well as on Apple Silicon (M1). For the compilers various versions of GCC (12.x, 13.x), as well as Clang  (15, 17) and MSVC (17) were employed, with rather interesting results for final versus no final tests. Clang was probably the biggest surprise, as with the keyword added, performance with Clang-generated code absolutely tanked. MSVC was a mixed bag, as were the GCC versions other than GCC 13.2 on AMD Ryzen, which saw a bump of a few percent faster.

Ultimately, it seems that there’s no free lunch as usual, and adding final to your code falls distinctly under ‘only use it if you know what you’re doing’. As things stand, the resulting behavior seems wildly inconsistent.

Processes, Threads, And… Fibers?

You’ve probably heard of multithreaded programs where a single process can have multiple threads of execution. But here is yet another layer of creating multitasking programs known as a fiber. [A Graphics Guy] lays it out in a lengthy but well-done post. There are examples for both x64 and arm64, although the post mainly focuses on x64 for Windows. However, the ideas will apply anywhere.

In the old days, there was a CPU and when your program ran on it, it was in control. But that’s wasteful, so software quickly moved to where many programs could share the CPU simultaneously. Then, as that got overloaded, computers got more CPUs. Most operating systems have the idea of a process, which is a program that thinks it is in complete control, but it is really sharing the CPU with other processes. The problem arises when you want to have multiple “little” programs that cooperate. Processes are not really supposed to know about one another and, if they do, there’s usually some heavy-weight communication mechanism allowing them to talk.

Continue reading “Processes, Threads, And… Fibers?”

QSPICE Picks Up Where LTSpice Left Us

[Mike Engelhardt] is a name that should be very familiar to the hardcore electronics nerd. [Mike] is the developer responsible for LTSpice, which is quite likely the most widely used spice-compatible simulator in the free software domain. When you move away from digital electronics and the comfort of software with its helpful IDEs and toolchains, and dip a wary toe into the murky grey waters of analog or power electronics, LTSpice is your best friend. And, like all best friends, it’s a bit quirky, but it always has your back. Sadly, LTSpice development seems to have stalled some years ago, but luckily for us [Mike] has been busy on the successor, QSpice, under the watchful eye of Qorvo.

It does look in its early stages, but from a useability point of view, it’s much improved over LTSpice. Performance is excellent (based on this scribe’s limited testing while mobile.) Gone (thankfully!) is the uncommon verb-noun usage paradigm — replaced with a more usual cut-n-paste flow. Visually it still kind of looks like LTspice in places, but nicer with a clear and uncluttered design that gets straight to the point. Internally, the simulation engine has improved in speed and accuracy, as well as adding native support for modern semiconductor types, such as wide bandgap materials like SiC. Noted is that this updated software has a particular emphasis on power integrity and noise analysis, which are sticky problems that have a big impact on modern high-power systems.

Continue reading “QSPICE Picks Up Where LTSpice Left Us”

In Praise Of RPN (with Python Or C)

HP calculators, slide rules, and Forth all have something in common: reverse polish notation or RPN. Admittedly, slide rules don’t really have RPN, but you work problems on them the same way you do with an RPN calculator. For whatever reason, RPN didn’t really succeed in the general marketplace, and you might wonder why it was ever a thing. The biggest reason is that RPN is very easy to implement compared to working through proper algebraic, or infix, notation. In addition, in the early years of computers and calculators, you didn’t have much to work with, and people were used to using slide rules, so having something that didn’t take a lot of code that matched how users worked anyway was a win-win.

What is RPN?

If you haven’t encountered RPN before, it is an easy way to express math without ambiguity. For example, what’s 5 + 3 * 6?  It’s 23 and not 48. By order of operations you know that you have to multiply before you add, even if you wrote down the multiplication second. You have to read through the whole equation before you can get started with math, and if you want to force the other result, you’ll need parentheses.

With RPN, there is no ambiguity depending on secret rules or parentheses, nor is there any reason to remember things unnecessarily. For instance, to calculate our example you have to read all the way through once to figure out that you have to multiply first, then you need to remember that is pending and add the 5. With RPN, you go left to right, and every time you see an operator, you act on it and move on. With RPN, you would write 3 6 * 5 +.

While HP calculators were the most common place to encounter RPN, it wasn’t the only place. Friden calculators had it, too. Some early computers and calculators supported it but didn’t name it. Some Soviet-era calculators used it, too, including the famous Elektronika B3-34, which was featured in a science fiction story in a Soviet magazine aimed at young people in 1985. The story set problems that had to be worked on the calculator.

Continue reading “In Praise Of RPN (with Python Or C)”

C++17’s Useful Features For Embedded Systems

Although the world of embedded software development languages seem to span somewhere between ASM and C89 all the way to MicroPython, there is a lot to be said for a happy medium between ease of development and features that makes the software more robust without adding overhead or bloat to the final firmware image.

This is where C++ has objectively many advantages over even C99, and as [Çağlayan Dökme] argues in a recent blog post C++17 adds many developer critter comforts to C++98 and the more recent C++11 C++14 standards.

First stepping back a generation (technically two, with C++20 also being a thing already), the addition of binary literals (e.g. 0b1010'1100) in C++14 and the expanded use of constexpr is addressed, with the latter foreshadowing C++17’s increased focus on compile time optimizations. A new attribute in C++17 that is part of this is [[nodiscard]], which when added before to the return type of a function or method requires the return value to be used in some manner, much like with functions in Ada (contrasted with procedures).

As [Çağlayan] notes, the biggest strength of compile-time checks is that it can save a lot of deploy-test-fix round-trips, with the total number of issues caught after deployment that could have been caught during compilation ideally being zero. Here C++17 streamlines the static_assert() mechanism and simplifies using if constexpr to instantiate code depending on compile-time conditions. Beyond compile-time optimizations there are a few other niceties, such as C++17 guaranteeing copy elision (return value optimization) when an object is returned directly, which is a welcome feature in hard real-time environments.

With today even MCUs having enough grunt to run multi-threaded applications and potentially firmware compiled from a many-thousand LoC codebase, picking a programming language that assists the developer with such an arduous task is very important, with Ada being the primary choice for high-reliability embedded platforms, but C++ along with C enjoying the most widespread (free) compiler support. Even if C++ isn’t supported on every single MCU out there (8051-based and most PIC MCUs mostly), whenever it is an option, it’s a pretty solid choice, especially with knowledge of these new language features.