AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Compilers have become quite fantastic at generating code and processors are just so much faster, so it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively little custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise AND), [Daniel] wanted to list every flag set in a given bitset. The easiest way would be to loop over the set bits like so:

while (word != 0) {
  result[i] = trailingzeroes(word);  // index of the lowest set bit
  word = word & (word - 1);          // clear that lowest set bit
  i++;                               // one more index written
}
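For anyone who wants to try the loop as-is, here is a minimal self-contained sketch in C. The function name decode_bitset_scalar and the out buffer are made up for illustration, and the GCC/Clang builtin __builtin_ctzll stands in for trailingzeroes; the original code may well do this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the index of every set bit in `word` to `out`,
   lowest index first, and returns how many indexes were written.
   `out` must have room for up to 64 entries. */
size_t decode_bitset_scalar(uint64_t word, uint32_t *out) {
    assert(out != NULL);
    size_t count = 0;
    while (word != 0) {
        /* __builtin_ctzll (GCC/Clang) counts trailing zero bits, which is
           the index of the lowest set bit. It is undefined for 0, hence
           the loop condition. */
        out[count++] = (uint32_t)__builtin_ctzll(word);
        word &= word - 1; /* clear the lowest set bit */
    }
    return count;
}
```

Calling it with the word 0x29 (binary 101001) fills the buffer with 0, 3, 5 and returns 3.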

The naive version of this loop is very likely to have a branch misprediction, since the number of iterations depends on how many bits are set, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function writes out an array of integers, one for each set bit in the mask, and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bit registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on GitHub if you want to try it yourself.
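To make the idea concrete, here is a rough sketch of how the compress-store approach can look in C. It is not [Daniel]'s code: the function name decode_bitset_compress and the 16-bits-at-a-time layout are assumptions for illustration. The vector path only compiles when AVX-512F is enabled (for example with -mavx512f on GCC/Clang); otherwise the sketch falls back to the plain loop so it still builds everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: writes the index of every set bit in `word` to `out`,
   lowest index first, and returns the count. `out` needs room for 64 entries. */
size_t decode_bitset_compress(uint64_t word, uint32_t *out) {
    assert(out != NULL);
    size_t count = 0;
#if defined(__AVX512F__)
    /* Lane k starts out holding the value k. */
    __m512i indexes = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i step = _mm512_set1_epi32(16);
    for (int chunk = 0; chunk < 4; chunk++) {
        /* Take the next 16 bits of the word as a lane mask. */
        __mmask16 mask = (__mmask16)(word >> (16 * chunk));
        /* vpcompressd: store only the lanes selected by the mask,
           packed contiguously starting at out + count. */
        _mm512_mask_compressstoreu_epi32(out + count, mask, indexes);
        /* popcnt tells us how many integers were just written. */
        count += (size_t)__builtin_popcount((unsigned)mask);
        /* Move every lane on to the next 16 bit positions. */
        indexes = _mm512_add_epi32(indexes, step);
    }
#else
    /* Scalar fallback when AVX-512F is not enabled at compile time. */
    while (word != 0) {
        out[count++] = (uint32_t)__builtin_ctzll(word);
        word &= word - 1;
    }
#endif
    return count;
}
```

The vector path has no data-dependent branch: it always runs four iterations, and the mask decides how many integers each store emits. Whether it actually beats the scalar loop on a given machine is something to measure rather than assume.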

Print-in-Place Engine Aims To Be The Next Benchy

While there are many in the 3D-printing community who loudly and proudly proclaim never to have stooped to printing a 3DBenchy, there are far more who have turned a new printer loose on the venerable test model, just to see what it can do. But Benchy is getting a little long in the tooth, and with 3D-printers getting better and better, perhaps a better benchmarking model is in order.

Knocking Benchy off its perch is the idea behind this print-in-place engine benchmark, at least according to [SunShine]. And we have to say that he’s come up with an impressive model. It’s a cutaway of a three-cylinder reciprocating engine, complete with crankshaft, connecting rods, pistons, and engine block. It’s designed to print all in one go, with only a little cleanup needed after printing before the model is ready to go. The print-in-place aspect seems to be the main test of a printer — if you can get this engine to actually spin, you’re probably set up pretty well. [SunShine] shares a few tips to get your printer dialed in, and shows a few examples of what can happen when things go wrong. In addition to the complexities of the print-in-place mechanism, the model has a few Easter eggs to really challenge your printer, like the tiny oil channel running the length of the crankshaft.

Whether this model supplants Benchy is up for debate, but even if it doesn’t, it’s still a cool design that would be fun to play with. Either way, as [SunShine] points out, you’ll need a really flat bed to print this one; luckily, he recently came up with a compliant mechanism dial indicator to help with that job.
