AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

16 thoughts on “AVX-512: When The Bits Really Count

  1. Which is speaks against AVX.
    All that extra hardware just to get 33% speed increase in some obscure case.
    Vector units have been marketed as massive speed boost for many cases.
    With opening of GPUs for compute loads, they are becoming more and more obsolete, at least in mainstream machines.

    1. my thought exactly. there are loads where that 33% would be worth something but they are so obscure. personally, if i’m going to sacrifice expressivitiy and portability, i would want more than 33%.

    2. SIMD in general can often result in 2 to 4x performance improvements. What you might be seeing here is that the non-AVX512 code is already autovectorized to some extent.

  2. Seems like a good excuse to keep the number of transistors on a chip growing exponentially for an extra while.
    In the mean time, the amount of useful work those extra transistors do seem to be logarithmic. and at some time you can on adding transistors without the thing being able to do more work.

    That said, PC’s are still getting faster.
    I bought parts for a new one at the end of last year, and for that did a bit of reading about AMD processors, and they’ve managed to squeeze a handful of percent of speed increase out of each new generation

    Recently they’ve also managed to exceed a passmark rating of 100.000.

    A bunch (10+ maybe ?) of years ago GPU processing became a significant performance boost.
    These days there are special accelerators for AI stuff.
    It seems that if you want to get a significant performance boost, you have to do something special, instead of the ever diminishing percentages of performance increments.

    Sometimes I wonder whether adding an extra (standardized) FPGA to a PC will be the way of the future.
    Or the quantum computers., they’re also steadily growing in capabilities.

    1. The problem with FPGAs is that you generally end up with thousands of transistors doing the work of one. Consider the simplest case: We build a 4-input NAND in the FPGA. On a chip that’d require 8 transistors.
      In an FPGA, you have a 16 bit shift register for the table, a 16-to-one-selector. I’d count close to 200 transistors (I’m probably seriously underestimating this: I haven’t done chip layout in like 30 years) and that’s being favorable by neglecting routing and input selection logic that exists in the FPGA.

      So… by the time you have your FPGA doing “excel” , you’re better off buying a general purpose CPU.

      My “computer architecture” group was doing research into configurable CPUs in the 1990s. Wouldn’t it be great that you can add extra integer performance when you need it? Or more multiply-add units? Yeah, that’d be great. However this way you can get a custom CPU at enormous upfront costs, while the scale of a general purpose CPU that also has the “extra integer performance” (and a bunch of other goodies that you don’t need) is much more economically sensible.

  3. > The naive version of this look is very likely to have a branch misprediction

    1 compiler will encode it in a way to give CPU a hint to speculative ‘branch always’
    2 it has _one_ branch and its _guaranteed_ to have exactly one branch misprediction when it finally finishes (word == 0)
    3 the naive part here is not the branch but data dependency, every consecutive loop depends on data mangled by the last one (word = word & (word – 1)), meaning you cant speculate far enough to saturate available execution unit.

    Point 2 can be fixed by using popcnt on ‘word’ and set fixed number of loops with decrementing index ‘i’, modern branch predictors will recognize this and should be able to eliminate pipeline flush.
    Point 3 might be able to remove data dependency by reshuffling instructions around

    I click on first link and get this

    >(e.g., 0b110011) to the integers (e.g., 0, 1, 5, 6 in this instance)

    0,1, 4, 5

    1. Not dropping, it was just not supposed to be enabled on the heterogenous core chips in the first place. If cores do not have the same instruction set support, a thread being moved from a Golden Cove core to a Gravemont core whilst still issuing AVX-512 instructions will crash. Until Windows 11 is ubiquitous, the all cores regardless of architecture need to support the exact same instruction set. Once Win 11 is the norm and Windows 10 support can be dropped, instruction set heterogeny can be supported with software written to accommodate it. Alternatively if future Gracemont successors gain AVX-512 support that would work too, though run counter to the concept of simpler cores of lower die area.

  4. AVX512 is an interesting beast.

    Though, it is a fairly large set of functions and the “512” part of the name is a bit misleading since a lot of AVX512 instructions aren’t using 512 bit registers, but rather 512 bit worth of data spread across registers. It is just a library of Single Instruction Multiple Data instructions. Effectively explicit parallelism for a fair few functions.

    To a degree, AVX512 is a lot more generally applicable than most give it credit for, but programming with parallelism in consideration isn’t trivial, and the added hassle of register management can at a lot of times make the whole concept less appealing to most. Not to mention the lackluster support and platform dependence it comes with…

    However, AVX512 isn’t without faults and it seems a fair bit bloated in practice. And the fact that Intel can’t seem to decide what AVX512 extensions it wants and don’t want to include in a given microarchitecture makes the whole library frankly a bit unpleasant.

    1. I built a high performance signal processing program for a professional satcom project. When I looked at improving performance, I first looked at AVX-512. Running on a single core, it gave me an incredible performance boost, but I thought I could squeeze more by also using parallelism. Using AVX-512 on multiple cores actually tanked the performance to worse than single core use due to the throttling. There’s no throttling with AVX2 and I can use it on all cores at the same time. There are no incentives for me to use AVX-512 over AVX2 when the latter can give me better overall performance, no matter how many cores are using it or even which cores use it. Keep in mind that on some CPUs, AVX-512 is restricted to only some of the cores and can only be used on even fewer cores to avoid throttling. To squeeze the maximum I could have used AVX-512 on one core for the maximum boost, and AVX2 on the rest, but there comes a point where engineer time is more precious than CPU time.

      1. AVX512 is likely more focused on the more single threaded workloads where peak serial execution speed is the only thing of importance.

        For a lot of applications, AVX512 would be used to speed up certain sections of a larger program. While for more multithreaded workloads, it is indeed a bit excessive.

        Optimizing for serial performance vs parallel performance is fairly different worlds compared to each other. However, they do overlap quite a bit, since few applications fits squarely inside one of the two. But a lot of CPU architectures have spent more effort on serial above parallelism, and sometimes even uses parallelism to speculate about serial behavior, making power efficiency even worse.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.