Optimizing Linux Pipes

In CPU design, there is Ahmdal’s law. Simply put, it means that if some process is contributing to 10% of your execution, optimizing it can’t improve things by more than 10%. Common sense, really, but it illustrates the importance of knowing how fast or slow various parts of your system are. So how fast are Linux pipes? That’s a good question and one that [Mazzo] sets out to answer.

The inspiration was a highly-optimized fizzbuzz program that clocked in at over 36GB/s on his laptop. Is that a common speed? Nope. A simple program using pipes on the same machine turned in not quite 4 GB/s. What accounts for the difference?

Continue reading “Optimizing Linux Pipes”

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Modern CPUs Are Smarter Than You Might Realize

When it comes to programming, most of us write code at a level of abstraction that could be for a computer from the 1960s. Input comes in, you process it, and you produce output. Sure, a call to strcpy might work better on a modern CPU than on an older one, but your basic algorithms are the same. But what if there were ways to define your programs that would work better on modern hardware? That’s what a pre-print book from [Sergey Slotin] answers.

As a simple example, consider the effects of branching on pipelining. Nearly all modern computers pipeline. That is, one instruction is fetching data while an older instruction is computing something, while an even older instruction is storing its results. The problem arises when you already have an instruction partially executed when you realize that an earlier instruction caused a branch to another part of your code. Now the pipeline has to be backed out and performance suffers while the pipeline refills. Anything that had an effect has to reverse and everything else needs to be discarded.

That’s bad for performance. Because of this, some CPUs try to predict if a branch is likely to occur or not and then speculatively fill the pipeline for the predicted case. However, you can structure your code, for example, so that it is more obvious how branching will occur or even, for some compilers, explicitly inform the compiler if the branch is likely or not.

As you might expect, techniques like this depend on your CPU and you’ll need to benchmark to show what’s really going on. The text is full of graphs of execution times and an analysis of the generated assembly code for x86 to explain the results. Even something you think is a pretty good algorithm — like binary search, for example, suffers on modern architectures and you can improve its performance with some tricks. Actually, it is interesting that the tricks work on GCC, but don’t make a difference on Clang. Again, you have to measure these things.

Probably 90% of us will never need to use any of the kind of optimization you’ll find in this book. But it is a marvelous book if you enjoy solving puzzles and analyzing complex details. Of course, if you need to squeeze those extra microseconds out of a loop or you are writing a library where performance is important, this might be just the book you are looking for. Although it doesn’t cover many different CPUs, the ideas and techniques will apply to many modern CPU architectures. You’ll just have to do the work to figure out how if you use a different CPU.

We’ve looked at pieces of this sort of thing before. Pipelining, for example. Sometimes, though, optimizing your algorithm isn’t as effective as just changing it for a better one.

View of a well-organized workspace in front of a window view to outdoors

How To Optimize Your Workspace: Analyze How You Work

[Jay Carlson] has shared some fantastic guidance on how to optimize one’s home workspace, and you just might want to emulate some of his layout, especially if you routinely juggle multiple projects. He makes the important point that different people have different needs, so one size does not fit all. Optimizing one’s workspace must first take into account what kind(s) of work one does, and many of his tips and tricks are pretty broadly applicable.

A rack of trays, each with a project
Looking online for these? A common industry term is “bun rack”. This one is “half-height” in size.

[Jay] works on embedded systems, and often switches between many different jobs and projects. Get your notepads ready, because there are plenty of great takeaways.

For example, to get a good top-down camera view of what’s on the workbench, he uses a camera mounted on an articulated arm (the kind that usually has a lamp attached to the end.) This makes the camera easy to deploy and easy to stow, and he can effortlessly save footage or share video with colleagues online.

Another great tip is using what most of us would call cafeteria trays and a matching rack. With each tray devoted to a different project or version of hardware, it makes switching between jobs as simple as sliding in one tray and pulling out another. It’s also a highly space-efficient way to store a lot of in-progress hardware. [Jay] gives a detailed walkthrough of his workspace and explains every decision, it’s well worth a read.

It’s always better to save space, as long as doing so doesn’t negatively impact the work itself. If you’re looking for space-saving tips, be sure to check out this tiny workshop’s space-saving hacks for more ideas.

DIY Forth On Arduino

On a recent rainy afternoon, [Thanassis Tsiodras] decided to build his own Forth for the Arduino to relieve the boredom. One week of intense hacking later, he called it done and released his project as MiniForth on GitHub. [Thanassis] says he was inspired by our series of Forth articles from a few years back, and his goal was to build a Forth interpreter / compiler from scratch, put it into a Blue Pill microcontroller. That accomplished, he naturally decides to squeeze it into an Arduino Uno with only 2K of RAM.

Even if you are ambivalent about the Forth language, [Thanissis]’s project has some great ideas to check out. For example, he’s a big proponent of Makefile automation for repetitive tasks, and the project’s Makefile targets implements almost every task needed for development, building and testing his code.

Some development and testing tasks are easier to perform on the host computer. To that end, [Thanassis] tests his programs locally using the simavr simulator. The code is also portable, and he can compile it locally on the host and debug it using GDB along with Valgrind and AddressSanitizer to check for memory issues. He chose to write the program in C++ using only zero-cost abstractions, but found that compiling with the ArduinoSTL was too slow and used too much memory. No problem, [Thanassis] writes his own minimalist STL and implements several memory-saving hacks. As a final test, the Makefile can also execute a test suite of Forth commands, including a FizzBuzz algorithm, to check the resulting implementation.

Here’s a short video of MiniForth in action, blinking an LED on an UNO, and the video below the break shows each of the various Makefile tasks in operation. If you want to learn more, check out Elliot Williams’s Forth series which inspired [Thanassis] and this 2017 article discussing several different Forth implementations. Have you ever built your own compiler? Let us know in the comments below.

Continue reading “DIY Forth On Arduino”

Honey, I Shrunk The Arduino Core

High-level programming languages do a great job of making a programmer’s job easier, but these languages often leave a lot of efficiency on the table as a compromise. While a common thought is to move into a lower-level language like assembly to improve on a program’s speed or memory use, there’s often a lot that can be done at the high level before resorting to such extremes. This, of course, is true of the Arduino platform as well, as [NerdRalph] demonstrates by shrinking the size of the Arduino core itself.

[NerdRalph] had noticed that the “blink” example program actually includes over 1 kB of extraneous code, and that more complicated programs include even more cruft. To combat this issue, he created ArduinoShrink, which seeks to make included libraries more modular and self-contained. It modifies some of the default registers and counters to use less memory and improve speed, and is also designed to improve interrupt latency as well by changing when the Arduino would otherwise disable interrupts.

While there are some limits to ArduinoShrink, such as needing to know specifics about the pins at compile time, for anyone writing programs for Arduinos that are memory-intensive or need improvements in timing, this could be a powerful new tool. If you’d prefer to go in the opposite direction to avoid ever having to learn C or assembly, though, you can always stick with running Python on your embedded devices.

Balanced Design And How To Know When To Quit Optimizing

I got a relatively inexpensive 6040 CNC machine, and have been spending most weekends making the thing work, and then cutting stuff, learning the toolchain, and making subsequent improvements. Probably 90% of my machine time has been on making improvements. It’s not that the machine was bad — I got the version with ballscrews and a decently solid frame — but it’s that it somehow didn’t work together as a whole. It’s just an incredibly unbalanced design.

Let’s start with the spindle motor. It’s a 2.2 kW water-cooled beast that is capable of putting tons of work into a piece and spinning at very high speed. Yet to keep up with the high speed spindle, the motors that move it around would have to be capable of high speeds as well — it’s a feeds and speeds thing if you’re not a CNC geek. And they can’t. Instead, the stepper motors that came with the kit are designed for maximum force at low speeds. Which can make sense for some machines, but for one with a slightly flexible X-axis like this one, that’s wasted as well. The frame just can’t handle the low-end grunt that the motors are capable of, so it can’t take advantage of the spindle’s power either. The design is all over the place.

Over the last two months’ of weekends, I’ve been going through this iterative procedure of asking “what is my limiting factor right now?”, working on fixing that thing up, running it some, and then asking the question again. And it’s a good general procedure, and I believe that it’s getting me to the machine I want at the minimum cost of time, money, and effort.

At first, it was the driver hardware/software with its emulated USB parallel port, so I swapped out the controller for an Arduino running GRBL, soldered directly to the DB-25 that comes out of the back. At least it can put out pulses fast enough to order the motors around, but they would still stall out at high speeds. Swapping the stepper motors out for a high-speed pair only cost me €40, which makes you wonder why they didn’t just put the right motors on in the first place. The machine now travels fast enough to make use of the high-speed spindle, and I’m flying through plywood and plastics without leaving burn marks. It’s a huge win for not much money.

The final frontier is taking big bites out of aluminum. The spindle can do it, but I fear I’m up against the frame’s rigidity on the X-axis. For whatever reason, they went with unsupported rods on the X, which are significantly more flexible than an axis that’s backed up by more metal. And this is where the limiting factor may actually be my time and patience, rather than money. I just can’t bear to disassemble and reassemble the thing again. So for now, it’s going to be small nibbles, taking advantage of the machine’s speed, if not yet the spindle’s full horsepower.

But it’s odd, because this machine is a bundle of good parts. It’s just that they haven’t been chosen to work together optimally; the frame doesn’t work with the stepper motors, which don’t work with the spindle. If they went through my procedure of saying “what’s the limiting factor?” they could have saved themselves €100 by just shipping it with a wimpier spindle, which would have been a balanced, if anemic, machine. Or they could have built it with the right motors for more speed. Or supported rails for more grunt. Or both!

I’ll never know why they quit optimizing their design when they did. Maybe they never got past the slow USB/parallel port speed? But I’m near the end of my path, and I can tell because the limiting ingredient isn’t a simple upgrade, or even mere money anymore, but my own willpower.

How can you tell when you’re at the top of a mountain in a dense fog? A step you take in any direction would lead you downhill. How can you tell when you’re satisfied with a project’s state? When you don’t have the need, or desire, to undertake the next most obvious improvement.