Faster Integer Division With Floating Point

Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can run between 9 and 15 cycles. Doing array processing with SIMD (single instruction multiple data)  instructions like AVX or NEON often don’t offer division at all (although the RISC-V vector extensions do). However, many processors support floating point division. Does it make sense to use floating point division to replace simpler division? According to [Wojciech Mula] in a recent post, the answer is yes.

The plan is simple: cast the 8-bit numbers into 32-bit integers and then to floating point numbers. These can be divided in bulk via the SIMD instructions and then converted in reverse to the 8-bit result. You can find several code examples on GitHub.

Continue reading “Faster Integer Division With Floating Point”

50-Year-Old Program Gets Speed Boost

At first glance, getting a computer program to run faster than the first electronic computers might seem trivial. After all, most of us carry enormously powerful processors in our pockets every day as if that’s normal. But [Mark] isn’t trying to beat computers like the ENIAC with a mobile ARM processor or other modern device. He’s now programming with the successor to the original Intel integrated circuit processor, the 4040, but beating the ENIAC is still little more complicated than you might think with a processor from 1974.

For this project, the goal was to best the 70-hour time set by ENIAC for computing the first 2035 digits of pi. There are a number of algorithms for performing this calculation, but using a 4-bit processor and an extremely limited memory of only 1280 bytes makes a number of these methods impossible, especially with the self-imposed time limit. The limited instruction set is a potential bottleneck as well with these early processors. [Mark] decided to use [Fabrice Bellard]’s algorithm given these limitations. He goes into great detail about the mathematics behind this method before coding it in JavaScript. Generating assembly language from a working JavaScript was found to be fairly straightforward.

[Mark] is also doing a lot of work on the 4040 to get this program running as well, including upgrades to the 40xx tool stack, the compiler and linker, and an emulator he’s using to test his program before sending it to physical hardware. The project is remarkably well-documented, including all of the optimizations needed to get these antique processors running fast enough to beat the ENIAC. We won’t spoil the results for you, but as a hint to how it worked out, he started this project using the 4040 since his original attempt using a 4004 wasn’t quite fast enough.

Optimizing Linux Pipes

In CPU design, there is Ahmdal’s law. Simply put, it means that if some process is contributing to 10% of your execution, optimizing it can’t improve things by more than 10%. Common sense, really, but it illustrates the importance of knowing how fast or slow various parts of your system are. So how fast are Linux pipes? That’s a good question and one that [Mazzo] sets out to answer.

The inspiration was a highly-optimized fizzbuzz program that clocked in at over 36GB/s on his laptop. Is that a common speed? Nope. A simple program using pipes on the same machine turned in not quite 4 GB/s. What accounts for the difference?

Continue reading “Optimizing Linux Pipes”

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Modern CPUs Are Smarter Than You Might Realize

When it comes to programming, most of us write code at a level of abstraction that could be for a computer from the 1960s. Input comes in, you process it, and you produce output. Sure, a call to strcpy might work better on a modern CPU than on an older one, but your basic algorithms are the same. But what if there were ways to define your programs that would work better on modern hardware? That’s what a pre-print book from [Sergey Slotin] answers.

As a simple example, consider the effects of branching on pipelining. Nearly all modern computers pipeline. That is, one instruction is fetching data while an older instruction is computing something, while an even older instruction is storing its results. The problem arises when you already have an instruction partially executed when you realize that an earlier instruction caused a branch to another part of your code. Now the pipeline has to be backed out and performance suffers while the pipeline refills. Anything that had an effect has to reverse and everything else needs to be discarded.

That’s bad for performance. Because of this, some CPUs try to predict if a branch is likely to occur or not and then speculatively fill the pipeline for the predicted case. However, you can structure your code, for example, so that it is more obvious how branching will occur or even, for some compilers, explicitly inform the compiler if the branch is likely or not.

As you might expect, techniques like this depend on your CPU and you’ll need to benchmark to show what’s really going on. The text is full of graphs of execution times and an analysis of the generated assembly code for x86 to explain the results. Even something you think is a pretty good algorithm — like binary search, for example, suffers on modern architectures and you can improve its performance with some tricks. Actually, it is interesting that the tricks work on GCC, but don’t make a difference on Clang. Again, you have to measure these things.

Probably 90% of us will never need to use any of the kind of optimization you’ll find in this book. But it is a marvelous book if you enjoy solving puzzles and analyzing complex details. Of course, if you need to squeeze those extra microseconds out of a loop or you are writing a library where performance is important, this might be just the book you are looking for. Although it doesn’t cover many different CPUs, the ideas and techniques will apply to many modern CPU architectures. You’ll just have to do the work to figure out how if you use a different CPU.

We’ve looked at pieces of this sort of thing before. Pipelining, for example. Sometimes, though, optimizing your algorithm isn’t as effective as just changing it for a better one.

View of a well-organized workspace in front of a window view to outdoors

How To Optimize Your Workspace: Analyze How You Work

[Jay Carlson] has shared some fantastic guidance on how to optimize one’s home workspace, and you just might want to emulate some of his layout, especially if you routinely juggle multiple projects. He makes the important point that different people have different needs, so one size does not fit all. Optimizing one’s workspace must first take into account what kind(s) of work one does, and many of his tips and tricks are pretty broadly applicable.

A rack of trays, each with a project
Looking online for these? A common industry term is “bun rack”. This one is “half-height” in size.

[Jay] works on embedded systems, and often switches between many different jobs and projects. Get your notepads ready, because there are plenty of great takeaways.

For example, to get a good top-down camera view of what’s on the workbench, he uses a camera mounted on an articulated arm (the kind that usually has a lamp attached to the end.) This makes the camera easy to deploy and easy to stow, and he can effortlessly save footage or share video with colleagues online.

Another great tip is using what most of us would call cafeteria trays and a matching rack. With each tray devoted to a different project or version of hardware, it makes switching between jobs as simple as sliding in one tray and pulling out another. It’s also a highly space-efficient way to store a lot of in-progress hardware. [Jay] gives a detailed walkthrough of his workspace and explains every decision, it’s well worth a read.

It’s always better to save space, as long as doing so doesn’t negatively impact the work itself. If you’re looking for space-saving tips, be sure to check out this tiny workshop’s space-saving hacks for more ideas.

DIY Forth On Arduino

On a recent rainy afternoon, [Thanassis Tsiodras] decided to build his own Forth for the Arduino to relieve the boredom. One week of intense hacking later, he called it done and released his project as MiniForth on GitHub. [Thanassis] says he was inspired by our series of Forth articles from a few years back, and his goal was to build a Forth interpreter / compiler from scratch, put it into a Blue Pill microcontroller. That accomplished, he naturally decides to squeeze it into an Arduino Uno with only 2K of RAM.

Even if you are ambivalent about the Forth language, [Thanissis]’s project has some great ideas to check out. For example, he’s a big proponent of Makefile automation for repetitive tasks, and the project’s Makefile targets implements almost every task needed for development, building and testing his code.

Some development and testing tasks are easier to perform on the host computer. To that end, [Thanassis] tests his programs locally using the simavr simulator. The code is also portable, and he can compile it locally on the host and debug it using GDB along with Valgrind and AddressSanitizer to check for memory issues. He chose to write the program in C++ using only zero-cost abstractions, but found that compiling with the ArduinoSTL was too slow and used too much memory. No problem, [Thanassis] writes his own minimalist STL and implements several memory-saving hacks. As a final test, the Makefile can also execute a test suite of Forth commands, including a FizzBuzz algorithm, to check the resulting implementation.

Here’s a short video of MiniForth in action, blinking an LED on an UNO, and the video below the break shows each of the various Makefile tasks in operation. If you want to learn more, check out Elliot Williams’s Forth series which inspired [Thanassis] and this 2017 article discussing several different Forth implementations. Have you ever built your own compiler? Let us know in the comments below.

Continue reading “DIY Forth On Arduino”