Behind The X86 Pipeline Curtain

We’ve often heard that modern x86 CPUs don’t really execute x86 instructions. Instead, they decode them into RISC instructions that are easier to schedule, pipeline, and execute. But we never really looked into that statement to see if it is true. [Fanael] did, though, and the results are very interesting.

The post starts with a very simple loop containing four x86 instructions. On a typical RISC CPU — RISC-V, in this case — the same loop requires six instructions. However, a modern CPU is likely to do much more than blindly convert one instruction set to another.
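Fanael's exact loop differs in its details, but a loop of this shape shows why the counts diverge. The scaled-index memory operand below is an illustrative assumption: x86 folds the address arithmetic and the load into the add, while RISC-V spells each step out.

    ; x86-64 (Intel syntax): sum an array of 64-bit values -- 4 instructions
    .loop:
        add rax, [rdi + rcx*8]   ; load and accumulate with a scaled index
        inc rcx                  ; advance the index
        cmp rcx, rdx             ; done yet?
        jne .loop

    # RISC-V: no memory operands or scaled indexing -- 6 instructions
    loop:
        slli t0, t1, 3           # scale the index by 8
        add  t0, a0, t0          # compute the element address
        ld   t2, 0(t0)           # load the element
        add  a1, a1, t2          # accumulate
        addi t1, t1, 1           # advance the index
        bne  t1, a2, loop        # compare and branch in one instruction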

The reason is that CPUs aim to increase the average number of operations performed per clock cycle. There are many ways to maximize instructions per clock. One of them is pipelining, where instructions execute in overlapping phases: the CPU can fetch one instruction while decoding a second and executing a third.
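As a rough sketch (real x86 pipelines have a dozen or more stages, not three), the overlap looks like this:

    cycle:    1      2       3       4       5
    insn A:  fetch  decode  execute
    insn B:         fetch   decode  execute
    insn C:                 fetch   decode  execute

Once the pipeline is full, one instruction completes every cycle even though each individual instruction takes three.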

There is a problem, though. Suppose a loop adds three numbers and then increments a counter, and the additions don't depend on the counter. In a classic pipeline, the increment must wait for the additions to finish before the loop can continue. But with an out-of-order pipeline, the CPU can figure out that the increment can run in parallel with the additions. To further improve parallel operation, register renaming lets the CPU place results in temporary registers that it can commit or discard later.
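Here is a hedged sketch of renaming in action; the physical register names (p3, p4, p7, ...) and the three-operand internal form are made-up illustrations, not any real CPU's micro-op format:

    ; before renaming: (3) reuses rax, so it must not overwrite it
    ; while (1) and (2) still need the old value
    add rax, rbx        ; (1) rax = rax + rbx
    mov [rdi], rax      ; (2) store truly depends on (1)
    mov rax, rcx        ; (3) no data flows here from (1) or (2)

    ; after renaming: (3) writes a fresh physical register, so it can
    ; execute in parallel with (1) and (2)
    add p10, p9, p3     ; (1) rax now maps to p10
    mov [p7], p10       ; (2) reads p10
    mov p11, p4         ; (3) rax now maps to p11, independent of p10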

The P6 (the Pentium Pro) from 1995 was the first x86 that did out-of-order execution. This CPU does, in fact, convert x86 instructions into RISC-like micro-operations. However, the later Pentium M introduced micro-operation fusion, which lets the CPU move certain pairs of micro-ops through the pipeline as a single unit, and each subsequent architecture diverged further and further from the P6 model.

It is an interesting look behind the curtain. Modern computers are very complicated internally. If you want a detailed look at pipelining, we can help with that too.

12 thoughts on “Behind The X86 Pipeline Curtain”

    1. That’s not how RISC is defined though. RISC is an empirical design process which attempts to optimise chip resources by making the most common cases execute quickly and the less common cases work adequately.

      Features like load/store architecture, large register sets, uniform instruction length, pipelining, few addressing modes, and fewer basic instruction opcodes are merely products of the methodology. For example, it was always hard to get compilers to optimise for multiple addressing modes, so cutting them down to one or two didn't compromise the compilers, but it did allow silicon resources to be ploughed into mechanisms for making those addressing modes quick (e.g. dedicated address adders).

      Or, making instructions a uniform length means it gets easier to pipeline them (because you can more easily analyse instructions further downstream). By reducing decoding complexity, you don't need microcode and can substitute more registers and a larger on-chip cache. By simplifying instruction formats, you can support 3-operand ALU instructions (which will *reduce* the number of instructions executed). And by having a large number of registers, you don't need to use the stack so much, and you get stunning bandwidth for multi-ported data (i.e. via the registers).

      Etc, etc. The definitive guide is “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson.

      1. Maybe in the early 80s RISC was defined as you say. But these days, for CPU designers, RISC simply means a load/store architecture with a decent register address space (at least 16 addressable registers). Few addressing modes is a consequence of dedicated load/store, obviously. Nothing stops you from designing a very complex OoO RISC core with a lot of dedicated circuitry for obscure corner cases; it'll still be a RISC. And you don't necessarily reduce the decoding complexity either, as you may have a very wide superscalar issue with complex dependency analysis between instructions in a batch. It's still a RISC though.

        Of course, I can be very biased: as a former ARM engineer I can be accused of not knowing what RISC is, since there's a lot of talk of how ARM is not a RISC in the slightest. But then, who cares what “pure RISC” is? All we want is efficient CPU cores and convenient (for compilers) ISAs. Arbitrary labels like “CISC” or “RISC” do not help to get there.

  1. While this is an excellent comparison, there are some omissions that would be notable for anyone looking to draw performance comparisons or determine “who’s better.”

    The first is the penalty x86 pays in instruction decoding, since its instructions don't map to uOPs as closely as in many RISC architectures. This is particularly painful with x86's variable-length encoding, which probably seemed like a great idea in the late '70s (yay, now some of our instructions can be shorter!) but now makes it hard to break a single byte stream into individual instructions for parallel decoding. From what I've heard, this was a real pain with AMD's Zen, since they ended up with a lot of cross-decoder wiring so they could try every possible decoding length and inform each other of known-incorrect decodings.
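    To illustrate (the encodings below are real x86-64 bytes; the selection is mine): the decoder can't know where the next instruction starts until it has worked out the length of the current one, whereas base RISC-V instructions sit a fixed four bytes apart.

      48 01 d8              add rax, rbx          ; 3 bytes
      48 05 78 56 34 12     add rax, 0x12345678   ; 6 bytes
      90                    nop                   ; 1 byte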

    The second, of course, is that there are RISC architectures that perform instruction fusion and lots of other tricks as well. It's rather unfair to compare only against RISC architectures from the '90s. While most RISC-V cores are too cost-reduced to employ such tricks, there are some that do. There's also the high-performing IBM POWER architecture to consider, which routinely leapfrogs x86 chips.

  2. Some time ago I had this crazy idea (Torvalds would probably say that it is “mental masturbation” :-D ) about building a hybrid x86-64/RISC-V processor by taking advantage of the instruction-translation stage.

    The idea would be to keep unmodified everything that is specific to x86-64 and NOT the instruction set (page tables, calling conventions, exception formats…), and add an extra bit to the page-table entry that specifies whether that page contains x86-64 or RISC-V instructions, thus selecting which translator to use for the instructions contained in that page.

    The big advantage would be having fixed-length instructions, which should increase the number of instructions that can be decoded in a single cycle. Also, RISC-V has roughly twice as many general-purpose registers (31 usable versus 16), which is also good for compiler optimization.

    Current operating systems would work on that architecture unchanged, because it runs x86-64 code as-is. The only change required to begin supporting this would be a new executable type for these “modified RISC-V” files (more on that later), plus a modification in the loader/linker to mark the pages of those executables with the new bit.

    A new executable type would be required because the RISC-V calling convention is different enough from the x86-64 one that calling a shared library in one instruction set from code in the other needs special handling (x86-64 passes the first six parameters in registers, while RISC-V passes the first eight). Of course, by duplicating the libraries, or by creating a “translation library” (in a similar way to Wine), it would be possible to run unmodified RISC-V executables, but not unmodified RISC-V kernels, because those run in “supervisor mode” and expect a different way of doing things like paging.
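    Concretely, the integer argument registers in the two conventions are:

      x86-64 System V ABI (6):        rdi, rsi, rdx, rcx, r8, r9
      RISC-V standard convention (8): a0, a1, a2, a3, a4, a5, a6, a7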

    Also, porting a kernel/drivers from x86-64 to this hybrid CPU would be much simpler, because it would be basically the current x86-64 architecture, so the code itself for paging, exceptions, etc would be exactly the same.

    But, of course, I’m not a CPU designer, so probably this “mental masturbation” has a lot of flaws… Please, feel free to enumerate all of them below :-)

    1. The idea is redeemable, but as presented it has many problems.

      First, the x86 ISA is designed to branch to byte boundaries, and RISC-V is not. A TLB/PTE flag also means a pipeline bubble every time you cross a page boundary since you can’t know the target instruction set. It’s much better to have a mode bit and special instructions to switch. Really, though, it doesn’t make sense to support interworking between the two ISAs.

      Second, the notion of an instruction is somewhat nebulous. RISCs often define aliases for special cases; PowerPC has a single instruction (rlwinm) that implements 9 different pseudo-ops, including shifts, rotates, and zero-extension. Generally AND, XOR, ADD, SUB, and others are produced by the ALU and selected in a mux using opcode bits. The problem is that RISC-V and x86 use different patterns, requiring a translation ROM. Translating RISC-V to Intel micro-ops would take significant chip space and thus force tradeoffs, though the micro-op cache might mitigate that.
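      For a taste of how one PowerPC instruction fans out into many mnemonics, here are four of rlwinm's standard extended forms (the registers are chosen arbitrarily):

        slwi   r3, r4, 4     # shift left 4         = rlwinm r3, r4, 4, 0, 27
        srwi   r3, r4, 4     # shift right 4        = rlwinm r3, r4, 28, 4, 31
        rotlwi r3, r4, 8     # rotate left 8        = rlwinm r3, r4, 8, 0, 31
        clrlwi r3, r4, 16    # clear top 16 bits    = rlwinm r3, r4, 0, 16, 31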

      Third, RISC-V CSRs and vectors are fundamentally different from x86 equivalents.

      Now, what would make a lot of sense is to reencode the x86 ISA, likely closely following micro-ops. This requires far less silicon to support. Thumb mode was originally implemented in ARM chips as a pipeline stage that translated Thumb instructions to ARM, reusing the rest of the pipeline.

      1. I don’t see the problem with the byte boundaries, because any jump between code types would be between code in a library and code outside a library, and it goes through a table of addresses filled in by the linker.

        About the flag: AFAIK the processor already caches page-table entries internally (in the TLBs) to avoid re-reading them every time the PC crosses a page boundary, so the flag could be cached in them too.

      2. Hmmm… so what you propose is to create a “Risc86” instruction set which, basically, maps the current x86-64 instruction set (or a reasonable subset) onto fixed-size instructions… Is that it?

  3. There are very few actual, pure RISC architectures available today that I'm aware of. IMO the vast majority are RISC/CISC hybrids.
    The original intent of RISC was to take CISC, break every instruction down into its atomic components, and build a new architecture from that. Originally it was imagined that this set of atomic RISC instructions would all take the same amount of time to execute, ideally one clock cycle. Knowing that every instruction completes in the same, predictable time makes on-chip optimization not only better but simpler: things like pipelining and OoO become easier to plan and schedule. The real payoff would come through parallelism, with higher core counts yielding great performance scaling.
    However, RISC very quickly drifted away from that goal and design ethos for various reasons. By the time IBM released the first commercial “RISC” product, the ROMP in the IBM RT PC, significant compromises had already been made compared to even the 801 research processor it was based on, which itself failed to meet the goal of atomic instructions with equal execution times.

    Personally, I think it would be interesting to see someone look into this idea again, now that we have more advanced and intelligent design tools and lower cost-per-transistor manufacturing. Imagine a pure RISC system (all atomic instructions, all meeting the one-clock-cycle criterion) with multiple such cores and an advanced hardware scheduler that can parallelize and dispatch from deep in the instruction queue across a large number of simple but very fast RISC cores.

    1. I would also like to see that, and I feel like we really lost an interesting branch of history in which one of the major players in the industry made that move.

      I mean transistor counts have risen silly amounts over the years, but core counts and frequency have not.
      If frequency and IPC had remained even more stagnant but core counts had doubled every 2.5 years, then maybe software development would have gotten the memo, and we'd have the majority of software actually using the available compute fully.

      And cheap, 100+core machines available to everyone… One can at least dream!
