Checking In On The ISA Wars And Their Impact On CPU Architectures

An Instruction Set Architecture (ISA) defines the software interface through which, for example, a central processing unit (CPU) is controlled. Unlike early computer systems, which didn’t define a standard ISA as such, over time the compatibility and portability benefits of having a standard ISA became obvious. But of course the best part about standards is that there are so many of them, and thus every CPU manufacturer came up with their own.

Throughout the 1980s and 1990s, the number of mainstream ISAs dropped sharply as the computer industry coalesced around a few major ones for each type of application. Intel’s x86 won out on the desktop and in smaller servers, ARM proclaimed victory in low-power and portable devices, and for Big Iron you always had IBM’s Power ISA. Since we last covered the ISA Wars in 2019, quite a lot has changed, including Apple shifting its desktop systems from x86 to ARM with Apple Silicon, and MIPS finally experiencing an afterlife in the form of LoongArch.

Meanwhile, six years after the aforementioned ISA Wars article in which newcomer RISC-V was covered, this ISA seems not to have made the splash some had expected. This raises questions about what we can expect from RISC-V and other ISAs in the future, as well as how relevant having different ISAs is when it comes to aspects like CPU performance and microarchitecture.

RISC Everywhere

Unlike in the past, when CPU microarchitectures were still very much in flux, these days they all seem to have coalesced around a similar set of features, including out-of-order execution, prefetching, superscalar parallelism, speculative execution, branch prediction, and multi-core designs. Most performance gains these days come from addressing specific bottlenecks and optimizing for specific usage scenarios, which has resulted in things like simultaneous multithreading (SMT) and various pipelining and instruction decoder designs.

CPUs today are almost all what in the olden days would have been called RISC (reduced instruction set computer) architectures, with a relatively small number of heavily optimized instructions. Using approaches like register renaming, CPUs can handle many simultaneous threads of execution, all of which is completely invisible to the software that talks to the ISA. As far as the software is concerned, there is just the one register file, and unless something breaks the illusion, like when speculative execution has a bad day, each thread of execution is only aware of its own context and nothing else.

So if CPU microarchitectures have pretty much merged at this point, what difference does the ISA make?

Instruction Set Nitpicking

Within the world of ISA flamewars, the battle lines are currently drawn mostly around topics like the pros and cons of delay slots, as well as those of compressed instructions, and setting status flags versus checking results in a branch. It is incredibly hard to compare ISAs in an apples-to-apples fashion, as the underlying microarchitecture of a commercially available ARMv8-based CPU will differ from that of a comparable x86_64-, RV64I-, or RV64IMAC-based CPU. Here the highly modular nature of RISC-V adds significant complications as well.

If we look at where RISC-V is being used commercially today, it is primarily in simple embedded controllers, where this modularity is an advantage and compatibility with the zillion other possible RISC-V extension combinations is of no concern. Here, using RISC-V has an obvious advantage over in-house proprietary ISAs, thanks to the savings from outsourcing the ISA to an open standard project. This is, however, also one of the major weaknesses of this ISA: the lack of a fixed ISA along the pattern of ARMv8 and x86_64 makes tasks like supporting it in the Linux kernel much more complicated than they should be.

This has led Google to pull initial RISC-V support from Android due to the ballooning support complexity. Since every RISC-V-based CPU is only required to support the base integer instruction set, and so much is left optional, from integer multiplication (M) and atomics (A) to bit manipulation (B) and beyond, all software targeting RISC-V has to explicitly test that the required instructions and functionality are present, or fall back to an alternative code path.
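
What such a runtime check can look like on Linux is sketched below. On RISC-V, the kernel exposes the single-letter base extensions as a bitmask in AT_HWCAP (newer kernels also offer the more detailed riscv_hwprobe() interface), so take this as a minimal probe rather than a complete feature-detection scheme:

```c
#include <stdio.h>
#include <sys/auxv.h>

/* Minimal sketch of RISC-V extension probing on Linux: each base
 * single-letter extension maps to one AT_HWCAP bit ('a' -> bit 0,
 * 'm' -> bit 12, and so on). Multi-letter extensions like Zba are
 * not visible this way and need riscv_hwprobe() or /proc/cpuinfo. */
int main(void) {
    unsigned long hwcap = getauxval(AT_HWCAP);

    printf("M (integer multiply/divide): %s\n",
           (hwcap & (1UL << ('m' - 'a'))) ? "yes" : "no, use a fallback");
    printf("A (atomics): %s\n",
           (hwcap & (1UL << ('a' - 'a'))) ? "yes" : "no, use a fallback");
    printf("C (compressed instructions): %s\n",
           (hwcap & (1UL << ('c' - 'a'))) ? "yes" : "no");
    return 0;
}
```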

Tempers are also running hot when it comes to RISC-V’s lack of integer overflow traps and carry instructions. As for whether compressed instructions are a good idea, the ARMv8 camp does not see any need for them, while the RISC-V camp is happy to defend them, and meanwhile x86_64 still happily uses instruction lengths anywhere from one to fifteen bytes courtesy of its CISC legacy, which makes it either many times worse or many times better than RISC-V, depending on who you ask.
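
To make the carry-and-overflow complaint concrete, here is a quick C sketch (ours, not taken from any of the linked discussions): with a carry flag the compiler can chain a wide addition through an add/adc-style pair, while without one the carry and any overflow check have to be recomputed with explicit comparisons:

```c
#include <stdint.h>

/* 128-bit addition built from two 64-bit halves. With a carry flag
 * this can compile to add + adc; on RISC-V the carry-out has to be
 * recovered by comparing the result to an input (an sltu, in effect). */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry-out of the low word */
    return r;
}

/* Overflow checking is likewise explicit: without trapping adds, a
 * GCC/Clang builtin like this emits a few extra instructions rather
 * than relying on the hardware to fault. */
static int checked_add(int64_t a, int64_t b, int64_t *out) {
    return __builtin_add_overflow(a, b, out);  /* nonzero on overflow */
}
```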

Meanwhile, an engineer with strong experience on the ARM side of things wrote a lengthy dissertation a while back on the pros and cons of these three ISAs. Their conclusion is that RISC-V is ‘minimalist to a fault’, with overlapping instructions and no condition codes or flags, instead requiring compare-and-branch instructions. This latter point cascades into a number of compromises, which is one of the major reasons why many see RISC-V as problematic.
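
As a small illustration of the compare-and-branch point, consider how an everyday loop bound compiles. The code generation described in the comments is the typical pattern, not the output of any particular compiler:

```c
#include <stdint.h>

/* On a flags-based ISA the loop test is usually two steps: a cmp that
 * sets the condition codes, then a conditional jump that reads them.
 * RISC-V fuses both into a single compare-and-branch such as blt. */
void scale(int32_t *v, int n, int32_t k) {
    for (int i = 0; i < n; i++)
        v[i] *= k;
}
```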

In summary, in the absence of clear advantages for RISC-V in fields where other ISAs are already established, its strong points seem to lie mostly where its extreme modularity and lack of licensing requirements are seen as convincing arguments. None of which should keep anyone from enjoying a good flame war now and then.

The China Angle

The Loongson 3A6000 (LS3A6000) CPU. (Credit: Geekerwan, Wikimedia)

Although everywhere that is not China has pretty much coalesced around the three ISAs already described, there are always exceptions. Unlike Russia’s ill-fated very long instruction word (VLIW) Elbrus architecture, China’s CPU-related efforts have borne significantly more fruit. Starting with the Loongson CPUs, China’s home-grown microprocessor architecture scene began to take real shape.

Originally these were MIPS-compatible CPUs, but starting with the 3A5000 in 2021, Chinese CPUs began to use the new LoongArch ISA. Described as being ‘a bit like MIPS or RISC-V’ in the Linux kernel documentation on this ISA, it features three variants, ranging from a reduced 32-bit version (LA32R) and a standard 32-bit version (LA32S) to a 64-bit version (LA64). The current LS3A6000 CPU has four cores with SMT support, and in reviews these chips are shown to be rapidly catching up to modern x86_64 CPUs, including when it comes to overclocking.

Of course, with this being China-only hardware, few Western reviewers have subjected the LS3A6000, or its upcoming successor, the LS3A7000, to an independent test.

In addition to LoongArch, other Chinese companies are using RISC-V for their own microprocessors, such as SpacemiT, an AI-focused company whose products also include more general-purpose processors. These include the K1 octa-core CPU, which saw use in the MuseBook laptop. As with all commercial RISC-V-based cores out today, these are no speed monsters, and even the SiFive P550-based HiFive Premier gets soundly beaten by the Raspberry Pi 4’s already rather long-in-the-tooth ARM-based SoC.

Perhaps the most successful use of RISC-V in China is in the cores of Espressif’s popular ESP32-C range of MCUs, although here too they are the lower-end designs relative to the Xtensa LX6 and LX7 cores that power Espressif’s higher-end MCUs.

Considering all this, it wouldn’t be surprising if China’s ISA scene outside of embedded ends up featuring mostly LoongArch, a lot of ARM, some x86_64, and a sprinkling of RISC-V to round it all out.

It’s All About The IP

The distinction between ISA and microarchitecture can be clearly seen by contrasting Apple Silicon with other ARMv8-based CPUs. Although these all support a version of the same ARMv8 ISA, the magic sauce is in the intellectual property (IP) blocks that are integrated into the chip. These range from memory controllers, PCIe SerDes blocks, and integrated graphics (iGPU) to encryption and security features. Unless you are an Apple or an Intel with your own GPU solution, you will be licensing the iGPU block along with other IP blocks from IP vendors.

These IP blocks offer the benefit of being able to use off-the-shelf functionality with known performance characteristics, but they are also where much of the cost of a microprocessor design ends up going. Developing such functionality from scratch can pay for itself if you reuse the same blocks over and over like Apple or Qualcomm do. For a start-up hardware company this is one of the biggest investments, which is why they tend to license a fully manufacturable design from Arm.

The actual cost of the ISA in terms of licensing is effectively a rounding error, while the benefit of being able to leverage existing software and tooling is the main driver. This is why a new ISA like LoongArch may very well pose a real challenge to established ISAs in the long run, because it is being given a chance to develop in a very large market with guaranteed demand.

Spoiled For Choice

Meanwhile, the Power ISA is also freely available for anyone to use without licensing costs; the only major requirement is compliance with the Power ISA. The OpenPOWER Foundation is now also part of the Linux Foundation, with a range of IBM Power cores open sourced. These include the A2O core, a successor to the A2I core that IBM used in its Blue Gene/Q supercomputers, as well as the Microwatt reference design that’s based on the much newer Power ISA 3.0.

Whatever your fancy, and regardless of whether you’re tinkering on a hobby or a commercial project, it would seem that there is plenty of diversity in the ISA space to go around. Although it’s only human to pick a favorite and root for it, there’s something to be said for each ISA. Whether it’s a better teaching tool, more suitable for highly customized embedded designs, or simply the one that runs decades’ worth of software without fuss, they all have their place.

16 thoughts on “Checking In On The ISA Wars And Their Impact On CPU Architectures”

  1. CPU performance hit a wall around 2003. Since then, rather than addressing the problem, the industry has been working around it by adding more cores, more cache, and more speculation (and providing unfixable backdoors to defeat disk encryption). Unless humans can come up with something novel, I predict that by 2030 the entire computer industry will collapse due to lack of innovation.

    1. And yet today’s workloads demand parallelism over single-threaded performance. Say what you will about the AI hype train, but that renaissance wouldn’t have been possible without being able to cram thousands of vector cores onto a single chip.

    2. It’s not so much “CPU performance” that hit a wall, but fab process technology. That was when the gate oxide thickness limit was reached, which halted voltage scaling (why chips stagnated around the 1 V operating level), which in turn halted frequency scaling (the 5 GHz wall) and transistor size scaling. With transistor density plateauing and transistor speed plateauing, individual core speed also plateaued, since if your core gets too physically large you end up hitting speed-of-light issues trying to keep all the different bits in sync. It’s no coincidence that multi-core CPUs suddenly went from a weird niche to the norm at that time. And when throwing more cores at the problem rapidly sputtered out (due to most jobs scaling with Amdahl’s law rather than Gustafson’s; see the sketch below), we then saw dedicated function accelerator hardware start cropping up, which is where we are today.

      Unless we see a truly dramatic change in how chips are fabbed (i.e. a move away from silicon chemistry), this situation is not likely to change. It’s not a problem of ‘laziness’ or lack of ‘creativity’ in CPU design, it’s a physics problem.
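
      To put a number on the Amdahl’s law point above: with a parallel fraction p and n cores, the speedup is S(n) = 1 / ((1 − p) + p/n), so the serial fraction caps the gain no matter the core count. A quick C sketch (the 90% figure is just an illustrative assumption):

      ```c
      #include <stdio.h>

      /* Amdahl's law: here the serial 10% limits speedup to 10x,
       * however many cores you throw at the job. */
      int main(void) {
          const double p = 0.90;  /* assumed parallelizable fraction */
          for (int n = 1; n <= 64; n *= 2)
              printf("%2d cores -> %.2fx speedup\n",
                     n, 1.0 / ((1.0 - p) + p / n));
          return 0;
      }
      ```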

    3. We are still improving in performance and efficiency even though Moore’s law has been dead for a long time, together with Dennard scaling (which is what made smaller manufacturing processes both faster and more power efficient); that’s very impressive.
      More cores aren’t a problem, nor are more caches or deeper speculative execution; those are required due to physical limitations. I don’t really see a problem as such for a while, but of course we’re starting to realize it’s getting crowded at the bottom…

      1. “We are still improving in performance and efficiency.”

        Except we’re not. Disable multiple cores, hyperthreading, speculative execution, and the L3 cache, and run some benchmarks. You’ll see the performance of your i9 14K drop to late Pentium III levels (the design on which all current Intel CPUs are based, after the space-heater disaster that was the P4).

  2. Sometimes I wonder what they were smoking when they came up with the opcodes for those ISAs;
    x86 (16/32-bit) has no instruction to read the instruction pointer and no reverse subtract,
    the 6502 has ADC but no plain ADD, and ARM must be 4-byte aligned and has no 32-bit immediate loads,
    resulting in more instructions to do the same work. Yes, you can fix most things in software, but why?

    1. Transistor count and complexity.
      ARM at the time of its inception was a low-power architecture, and making unaligned accesses work is costly in circuit complexity (much easier to just read 32 bits at a time and not worry). As soon as you start allowing unaligned accesses, you need to add a second memory read (in some cases) and also the logic to gather the data from those reads (see the sketch after this comment).

      As for x86 – why do you need to read the IP? You can do relative jumps without it, absolutely no problem.
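
      As a C sketch of what that gathering costs, here is one unaligned 32-bit load built from two aligned reads plus shifting and merging (little-endian assumed; this mirrors what the hardware or a trap handler has to do, and is illustrative rather than strictly portable C):

      ```c
      #include <stdint.h>

      /* Service an unaligned 32-bit load on a machine that can only
       * read aligned words: two aligned reads, then shift and merge. */
      static uint32_t load32_unaligned(const uint8_t *p) {
          uintptr_t addr = (uintptr_t)p;
          const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);
          unsigned shift = (unsigned)(addr & 3) * 8;
          if (shift == 0)
              return w[0];               /* already aligned: one read */
          /* Misaligned: the second read, plus the gather logic. */
          return (w[0] >> shift) | (w[1] << (32 - shift));
      }
      ```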

    2. Because instructions have a cost?
      Your example of not having a directly accessible IP is a good one: for many ISAs, exposing the IP added significant overheads, like for instance those that exposed it as a general register, which needed extra hardware to keep the system coherent.
      And what would you gain? Reading the IP is a very rare thing; I can’t remember ever needing to know it.
      ARM adding 32-bit immediates would suddenly make it a variable-length ISA, with all the complications that requires. There are reasons ARM64 kept a lot of design decisions while making a considerably more complex design.
      You should look at optimized 6502 assembly and try to notice how rare clearing the carry flag can be made, and of course the reason there isn’t an ADD instruction is that when it was designed, every single transistor counted.

  3. I hear there is a massive reluctance from silicon designers to design ANYTHING; they prefer to buy IP blocks from someone and stitch them together instead of designing them in house. Sounds kinda silly to me.

    Disclaimer: This is all hearsay from someone I know, but I choose to believe them (for better or worse).

    1. Of course they do, for the same reasons you wouldn’t want to reimplement libc.

      It’s extremely expensive to tape out design changes, so it’s worth it to buy IP that’s already been proven in silicon rather than spin your own. And not having to invest engineer time into design, DV, and DFT lets you put more resources into the parts of the design that make your chip unique.
