AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Assembly Language For Real

We all probably know that for ultimate control and maximum performance, you need assembly language. No matter how good your compiler is, you’ll almost always be able to do better by using your human smarts to map your problem onto a computer’s architecture. Programming in assembly for PCs though is a little tricky. A lot of information about PC assembly language dates back from when assembly was more common, but it also covers old modes that, while still available, aren’t the best answer for the latest processors. [Gpfault] has launched a series on 64-bit x86 assembly that tries to remedy that, especially if you are working under Windows.

So far there are three entries. The first covers setting up your toolchain and creating a simple program that does almost nothing. But it is a start.

Continue reading “Assembly Language For Real”

Odyssey Is A X86 Computer Packing An Arduino Along For The Trip

We love the simplicity of Arduino for focused tasks, we love how Raspberry Pi GPIO pins open a doorway to a wide world of peripherals, and we love the software ecosystem of Intel’s x86 instruction set. It’s great that some products manage to combine all of them together into a single compact package, and we welcome the recent addition of Seeed Studio’s Odyssey X86J4105.

[Ars Technica] recently looked one over and found it impressive from the perspective of a small networked computer, but they didn’t dig too deeply into the maker-friendly side of the product. We can look at the product documentation to see some interesting details. This board is larger than a Raspberry Pi, but its GPIO pins were laid out in exactly the same order as that on a Pi. Some HATs could plug right in, eliminating all the electrical integration leaving just the software issue of ARM vs x86. Tasks that are not suitable for CPU-controlled GPIO (such as generating reliable PWM) can be offloaded to an on-board Arduino-compatible microcontroller. It is built around the SAMD21 chip, similar to the Arduino MKR and Arduino Zero but the pinout does not appear to match any of the popular Arduino form factors.

The Odyssey is not the first x86 single board computer (SBC) to have GPIO pins and an onboard Arduino assistant. LattePanda for example has been executing that game plan (minus the Raspberry Pi pin layout) for the past few years. We’ve followed them since their Kickstarter origins and we’ve featured creative uses here and there. LattePanda’s current offerings are built around Intel CPUs ranging from Atom to Core m3. The Odyssey’s Celeron is roughly in the middle of that range, and the SAMD21 is more capable than the ATmega32U4 (Arduino Leonardo) on board a LattePanda. We always love seeing more options in a market for us to find the right tradeoff to match a given project, and we look forward to the epic journeys yet to come.

Creating Black Holes: Division By Zero In Practice

Dividing by zero — the fundamental no-can-do of arithmetic. It is somewhat surrounded by mystery, and is a constant source for internet humor, whether it involves exploding microcontrollers, the collapse of the universe, or crashing your own world by having Siri tell you that you have no friends.

It’s also one of the few things gcc will warn you about by default, which caused a rather vivid discussion with interesting insights when I recently wrote about compiler warnings. And if you’re running a modern operating system, it might even send you a signal that something’s gone wrong and let you handle it in your code. Dividing by zero is more than theoretical, and serves as a great introduction to signals, so let’s have a closer look at it.

Chances are, the first time you heard about division itself back in elementary school, it was taught that dividing by zero is strictly forbidden — and obviously you didn’t want your teacher call the cops on you, so you obeyed and refrained from it. But as with many other things in life, the older you get, the less restrictive they become, and dividing by zero eventually turned from forbidden into simply being impossible and yielding an undefined result.

And indeed, if a = b/0, it would mean in reverse that a×0 = b. If b itself was zero, the equation would be true for every single number there is, making it impossible to define a concrete value for a. And if b was any other value, no single value multiplied by zero could result in anything non-zero. Once we move into the realms of calculus, we will learn that infinity appears to be the answer, but that’s in the end just replacing one abstract, mind-boggling concept with another one. And it won’t answer one question: how does all this play out in a processor? Continue reading “Creating Black Holes: Division By Zero In Practice”

Learn ARM Assembly With The Raspberry Pi

We live in a time when you don’t have to know assembly language to successfully work with embedded computers. The typical processor these days has resources that would shame early PCs and some of the larger ones are getting close to what was a powerful desktop machine only a few years ago. Even so, there are some cases where you really want to use assembly language. Maybe you need more speed. Or maybe you need very precise control over timing. Maybe you just like the challenge. [Robert G. Plantz] from Sonoma State University has an excellent book online titled “Introduction to Computer Organization: ARM Assembly Langauge Using the Raspberry Pi.” If you are interested in serious ARM assembly language, you really need to check out this book.

If you are more interested in x86-64 assembly and Linux [Plantz] has you covered there, too. Both books are free to read on the Internet, and you can pick up a printed version of the Linux book for a small payment if you want.

Continue reading “Learn ARM Assembly With The Raspberry Pi”