Checking In On The ISA Wars And Their Impact On CPU Architectures

An Instruction Set Architecture (ISA) defines the software interface through which, for example, a central processing unit (CPU) is controlled. Unlike early computer systems, which didn’t define a standard ISA as such, over time the compatibility and portability benefits of having a standard ISA became obvious. But of course the best part about standards is that there are so many of them, and thus every CPU manufacturer came up with their own.

Throughout the 1980s and 1990s, the number of mainstream ISAs dropped sharply as the computer industry coalesced around a few major ones for each type of application. Intel’s x86 won out on desktops and smaller servers, ARM proclaimed victory in low-power and portable devices, and for Big Iron you always had IBM’s Power ISA. Since we last covered the ISA Wars in 2019, quite a lot has changed, including Apple shifting its desktop systems from x86 to ARM with Apple Silicon, and MIPS finally experiencing an afterlife in the form of LoongArch.

Meanwhile, six years after the aforementioned ISA Wars article in which newcomer RISC-V was covered, that ISA seems to not have made the splash some had expected. This raises questions about what we can expect from RISC-V and other ISAs in the future, as well as how much the choice of ISA really matters for aspects like CPU performance and microarchitecture.

RISC Everywhere

Unlike in the past, when CPU microarchitectures were still very much in flux, these days they have all coalesced around a similar set of features, including out-of-order execution, prefetching, superscalar parallelism, speculative execution, branch prediction, and multi-core designs. Most of the performance gains these days come from addressing specific bottlenecks and optimizing for specific usage scenarios, which has resulted in things like simultaneous multithreading (SMT) and various pipelining and instruction decoder designs.

CPUs today are almost all what in the olden days would have been called RISC (reduced instruction set computer) architectures, with a relatively small number of heavily optimized instructions. Using approaches like register renaming, CPUs can handle many simultaneous threads of execution, all of which is completely invisible to the software that talks to the ISA. To the software, there is just the one register file, and unless something breaks the illusion, like when speculative execution has a bad day, each thread of execution is only aware of its own context and nothing else.

So if CPU microarchitectures have pretty much merged at this point, what difference does the ISA make?

Instruction Set Nitpicking

Within the world of ISA flamewars, the battle lines have mostly coalesced around topics like the pros and cons of delay slots, compressed instructions, and setting status flags versus checking results in a branch. It is incredibly hard to compare ISAs in an apples-to-apples fashion, as the underlying microarchitecture of a commercially available ARMv8-based CPU will differ from that of a comparable x86_64-, RV64I-, or RV64IMAC-based CPU. Here the highly modular nature of RISC-V adds significant complications as well.

If we look at where RISC-V is being used commercially today, it is primarily in simple embedded controllers, where this modularity is an advantage and compatibility with the zillion other possible RISC-V extension combinations is of no concern. Here, using RISC-V has an obvious advantage over an in-house proprietary ISA, due to the savings from outsourcing it to an open standard project. This is, however, also one of the major weaknesses of the ISA, as the lack of a fixed instruction set along the pattern of ARMv8 and x86_64 makes tasks like supporting a Linux kernel for it much more complicated than they should be.

This has led Google to pull initial RISC-V support from Android due to the ballooning support complexity. Since every RISC-V-based CPU is only required to support the base integer instruction set, and so many things are left optional, from integer multiplication (M) and atomics (A) to bit manipulation (B) and beyond, all software targeting RISC-V has to explicitly test that the required instructions and functionality are present, or fall back to an alternative code path.
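
As a rough illustration of what that testing looks like in practice, here is a minimal sketch of runtime extension detection on a RISC-V Linux system, assuming the kernel’s convention of exposing the single-letter base extensions as bits in the ELF auxiliary vector (multi-letter extensions such as Zba need the newer riscv_hwprobe() interface instead); the has_ext() helper is just for illustration:

#include <stdio.h>
#include <sys/auxv.h>

/* Each single-letter RISC-V extension maps to one bit in AT_HWCAP on Linux. */
static int has_ext(char letter, unsigned long hwcap)
{
    return (hwcap >> (letter - 'a')) & 1UL;
}

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);

    if (has_ext('m', hwcap))
        puts("M: hardware integer multiply/divide available");
    else
        puts("M: not present, falling back to a software multiply routine");

    if (has_ext('c', hwcap))
        puts("C: compressed instructions supported");

    return 0;
}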

Tempers also run hot when it comes to RISC-V’s lack of integer overflow traps and carry instructions. As for whether compressed instructions are a good idea, the ARMv8 camp does not see any need for them, while the RISC-V camp is happy to defend them, and x86_64 meanwhile still happily uses double the number of instruction lengths courtesy of its CISC legacy, which makes it either twice as bad or twice as good as RISC-V, depending on who you ask.
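
To make the overflow point concrete, here is a hedged sketch of how portable C code typically copes with the absence of traps or flags: the GCC/Clang builtin __builtin_add_overflow() lets the compiler pick the local idiom, which generally means a branch on the carry/overflow flag on x86_64 and ARMv8, and a compare-and-branch sequence on RV64 (roughly, checking whether the result ended up smaller than an operand):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t a = UINT64_MAX;
    uint64_t b = 1;
    uint64_t sum;

    /* Returns true if the addition wrapped around. */
    if (__builtin_add_overflow(a, b, &sum))
        puts("unsigned addition overflowed (carry out)");
    else
        printf("sum = %llu\n", (unsigned long long)sum);

    return 0;
}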

Meanwhile, an engineer with extensive experience on the ARM side of things wrote a lengthy dissertation a while back on the pros and cons of these three ISAs. Their conclusion is that RISC-V is ‘minimalist to a fault’, with overlapping instructions and no condition codes or flags, instead requiring compare-and-branch instructions. This latter point cascades into a number of compromises, which is one of the major reasons why many consider RISC-V problematic.

In summary, in the absence of clear advantages for RISC-V in fields where other ISAs are already established, its strong points seem to lie mostly where its extreme modularity and lack of licensing requirements are seen as convincing arguments, which should not keep anyone from enjoying a good flame war now and then.

The China Angle

The Loongson 3A6000 (LS3A6000) CPU. (Credit: Geekerwan, Wikimedia)

Although everywhere that is not China has pretty much coalesced around the three ISAs already described, there are always exceptions. Unlike Russia’s ill-fated very long instruction word (VLIW) Elbrus architecture, China’s CPU-related efforts have borne significantly more fruit. Starting with the Loongson CPUs, China’s home-grown microprocessor architecture scene began to take on real shape.

Originally these were MIPS-compatible CPUs, but starting with the 3A5000 in 2021, Loongson CPUs began to use the new LoongArch ISA. Described as being a ‘bit like MIPS or RISC-V’ in the Linux kernel documentation, it comes in three variants, ranging from a reduced 32-bit version (LA32R) and a standard 32-bit version (LA32S) to a 64-bit version (LA64). The current LS3A6000 CPU has four cores with SMT support, for eight threads total, and in reviews these chips are shown to be rapidly catching up to modern x86_64 CPUs, including when it comes to overclocking.

Of course, with this being China-only hardware, few Western reviewers have subjected the LS3A6000, or its upcoming successor, the LS3A7000, to an independent test.

In addition to LoongArch, other Chinese companies are using RISC-V for their own microprocessors, such as SpacemiT, an AI-focused company whose products also include more general-purpose processors. These include the K1 octa-core CPU, which saw use in the MuseBook laptop. As with all commercial RISC-V-based cores out today, these are no speed monsters, and even the SiFive Premier P550 SoC gets soundly beaten by a Raspberry Pi 4’s already rather long-in-the-tooth ARM-based SoC.

Perhaps the most successful use of RISC-V in China is the cores in Espressif’s popular ESP32-C range of MCUs, although here too they are the lower-end designs relative to the Xtensa LX6 and LX7 cores that power Espressif’s higher-end MCUs.

Considering all this, it wouldn’t be surprising if China’s ISA scene outside of embedded ends up featuring mostly LoongArch, a lot of ARM, some x86_64, and a sprinkling of RISC-V to round it all out.

It’s All About The IP

The distinction between ISA and microarchitecture can be clearly seen by contrasting Apple Silicon with other ARMv8-based CPUs. Although these all support a version of the same ARMv8 ISA, the magic sauce is in the intellectual property (IP) blocks that are integrated into the chip. These range from memory controllers, PCIe SerDes blocks, and integrated graphics (iGPU) to encryption and security features. Unless you are an Apple or an Intel with your own GPU solution, you will be licensing the iGPU block along with other IP blocks from IP vendors.

These IP blocks offer the benefit of being able to use off-the-shelf functionality with known performance characteristics, but they are also where much of the cost of a microprocessor design ends up going. Developing such functionality from scratch can pay for itself if you reuse the same blocks over and over like Apple or Qualcomm do. For a start-up hardware company this is one of the biggest investments, which is why they tend to license a fully manufacturable design from Arm.

The actual cost of the ISA in terms of licensing is effectively a rounding error, while the benefit of being able to leverage existing software and tooling is the main driver. This is why a new ISA like LoongArch may very well pose a real challenge to established ISAs in the long run, because it is being given a chance to develop in a very large market with guaranteed demand.

Spoiled For Choice

Meanwhile, the Power ISA is also freely available for anyone to use without licensing costs; the only major requirement is compliance with the Power ISA specification. The OpenPOWER Foundation is now also part of the Linux Foundation, with a range of IBM Power cores open sourced. These include the A2O core, based on the A2I core that powered the Xbox 360 and the PlayStation 3’s Cell processor, as well as the Microwatt reference design that is based on the much newer Power ISA 3.0.

Whatever your fancy is, and regardless of whether you’re just tinkering on a hobby or commercial project, it would seem that there is plenty of diversity in the ISA space to go around. Although it’s only human to pick a favorite and favor it, there’s something to be said for each ISA. Whether it’s a better teaching tool, more suitable for highly customized embedded designs, or simply because it runs decades worth of software without fuss, they all have their place.

33 thoughts on “Checking In On The ISA Wars And Their Impact On CPU Architectures”

  1. CPU performance hit a wall around 2003. Since then, rather than addressing the problem, the industry has been working around it by adding more cores, more cache, and more speculation (and providing unfixable backdoors to defeat disk encryption). Unless humans can come up with something novel, I predict that by 2030 the entire computer industry will collapse due to lack of innovation.

    1. And yet today’s workloads demand parallelism over single threaded performance. Say what you will about the AI hype train, but that renaissance wouldn’t have been possible without being able to cram thousands of vector cores on a single chip.

    2. It’s not so much “CPU performance” that hit a wall, but fab process technology. That was when the gate oxide thickness limit was reached, which halted voltage scaling (why chips stagnated around the 1 V operating level), which in turn halted frequency scaling (hitting the 5 GHz wall) and transistor size scaling. With transistor density plateauing and transistor speed plateauing, individual core speed also plateaued, since if your core gets too physically large you end up hitting speed-of-light issues trying to keep all the different bits in sync. It’s no coincidence that multi-core CPUs suddenly went from a weird niche to the norm at that time. And when throwing more cores at the problem rapidly sputtered out (due to most jobs scaling with Amdahl’s law rather than Gustafson’s; see the formula below), we then saw dedicated function accelerator hardware start cropping up, which is where we are today.

      Unless we see a truly dramatic change in how chips are fabbed (i.e. a move away from silicon chemistry), this situation is not likely to change. It’s not a problem of ‘laziness’ or lack of ‘creativity’ in CPU design, it’s a physics problem.
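
      For reference, Amdahl’s law gives the speedup from N cores when only a fraction p of the work can run in parallel:

      S(N) = 1 / ((1 - p) + p/N), which tends to 1 / (1 - p) as N grows.

      With p = 0.9, for example, no number of cores can ever deliver more than a 10x speedup, which is why the easy multi-core wins ran out so quickly.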

    3. We are still improving in performance and efficiency even though Moore’s law has been dead for a long time, together with Dennard scaling (which is what made smaller manufacturing processes both faster and more power-efficient); that’s very impressive.
      More cores aren’t a problem, nor are more caches or deeper speculative execution; those are required due to physical limitations. I don’t really see a problem as such for a while, but of course we’re starting to realize it’s getting crowded at the bottom…

      1. “We are still improving in performance and efficiency.”

        Except we’re not. Disable multiple cores, hyperthreading, speculative execution, and the L3 cache, and run some benchmarks. You’ll see the performance of your i9 14k drop to late Pentium III levels (which is what all current Intel CPUs are based on, after the space heater disaster that was the P4).

        1. But that’s an absurd standard; disable all the technology invented in the last 20+ years and OF COURSE you’re going to get early-2000s-level performance. But there are no single-threaded OSes any more, and few languages that are single-threaded either. And there’s a reason for that, too.

    4. I see Apple taking some quite serious steps to make concurrency mainstream in Swift, with the Swift Concurrency model.

      This doesn’t mean more performance per se, but it does mean apps that are making better use of concurrency.

      Concurrency has always been a complex thing: it either works or it doesn’t work at all. There is hardly a middle ground. And if it doesn’t work, it’s extremely hard to debug. So developers avoid it if you give them a chance.

      Supplying frameworks that make the complexity much more visible, and therefore make it easier to identify mistakes and hopefully easier to debug, means that developers will adopt concurrency much more readily, and then all those multicore CPUs will be used far more efficiently than they are now.

      I’ve always wondered about Occam. That never became mainstream, but there must have been quite some good lessons learned there. And it should be possible to incorporate those lessons in modern programming languages.

    5. i kind of thought this way for a while. and certainly, since i stopped playing videogames so much, i do not feel the need to upgrade very often anymore. and i have been extremely underwhelmed generally with most upgrades, as i simply don’t take advantage of the power. for example, i have had a series of laptops since 2010, some much slower than the one before, and some much faster, and i have not noticed a speed difference between any of them in any scenario.

      but i certainly have noticed some improvement in single-threaded execution over time. it’s honestly a little bittersweet, as many of the programs i run most often are programs i wrote, and there’s a kind of feeling that faster hardware is making it easier to live with what could be considered my mistakes.

      i don’t know when it was that ‘make -j’ became normal to me, i think it was shortly after 2010. i thought of it as a way to make very slow compiles slightly less slow. half hour build jobs became ten minute build jobs. still too long. still long enough for me to walk away from the computer and lose the context of why i was doing the build in the first place.

      last year i had a kind of epiphany and i upgraded my main pc from a 4-core low-end 2018 processor to a used 8-core high-end 2018 processor that fit in the same socket. and i upgraded from 16GB of RAM to 32GB of RAM. and wow wow wow!!

      computers are so gosh darn fast these days. in some sense they’re so fast that i just don’t care, but they are absolutely still getting faster, and there are jobs that benefit from it. a big source tree that used to take minutes compiles in 10 seconds. a bigger source tree that used to take half an hour takes 70 seconds. test suites that used to take all day complete in a handful of minutes. the linux kernel has bloated tremendously from when i used to wait half an hour to build 1.2.13, but i just built 5.10.235 and modules in only 3 minutes yesterday. 3 minutes!!

      i run a ton of background jobs that i don’t hardly care about — VMs, minecraft server, all sorts of garbage — and i just don’t ever feel any slowdown from it. i’ve got plenty of RAM and plenty of cores to spare.

      so i no longer think of throwing more cores at it as simply compensating for a lack of progress…it is real progress.

  2. sometimes I wonder what they were smoking when they came up with the opcodes for those ISAs;
    x86 (16/32) has no instruction to read the instruction pointer and no reverse subtract,
    the 6502 has adc but no add, and arm must be 4-byte aligned and has no 32-bit immediate loads,
    resulting in more instructions to do the same work… yes, you can fix most things in software (see the sketch below), but why?
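
    as a hedged aside, here is a sketch of the classic software workaround for reading the instruction pointer on ISAs that don’t expose it directly: make a call and inspect the return address, which is the address of the instruction just past the call site. it relies on the GCC/Clang builtin __builtin_return_address() and is only meant as an illustration:

    #include <stdio.h>

    /* noinline so there is a real call whose return address we can read */
    __attribute__((noinline))
    static void *current_ip(void)
    {
        /* the caller's return address, i.e. the IP just after the call */
        return __builtin_return_address(0);
    }

    int main(void)
    {
        printf("executing near %p\n", current_ip());
        return 0;
    }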

    1. Transistor count and complexity.
      Arm at the time of its inception was a low-power architecture, and making unaligned accesses work is costly in circuit complexity (much easier to just read 32 bits at a time and not worry). As soon as you start adding unaligned accesses, you need to add a second memory read (in some cases) and also the gathering of data from those reads.

      As for x86: why do you need to read the IP? You can do relative jumps without it, absolutely no problem.

    2. Because instructions have a cost?
      Your example of not having a directly accessible IP is a good one; for many ISAs, exposing the IP added significant overheads, for instance those that exposed it as a general register and needed extra hardware to keep the system coherent.
      And what would you gain? Reading the IP is a very rare thing; I can’t remember ever needing to know it.
      ARM adding 32-bit immediates would suddenly make it a variable-length ISA, with all the complications that requires. There are reasons ARM64 kept a lot of design decisions while making a considerably more complex design.
      You should look at optimized 6502 assembly and try to notice how rare clearing the carry flag can be made, and of course the reason there isn’t an ADD instruction is that when the 6502 was designed, every single transistor counted.

      1. a 32-bit immediate load would be: ldr r1, [pc++]
        similar to a pop: ldr r1, [sp++]
        so it would not really be variable-length instructions, just data embedded in the instruction stream that gets jumped over.

    3. It’s always worth reading “Computer Architecture: A Quantitative Approach”, particularly the first and second editions, to understand what really matters in architecture design.

      As we know, the x86 series (now x64) was an evolved design. The Intel 8008 was a clone of the ISA of the TTL-based Datapoint 2200 and therefore inherited its limitations, including the basic A, BC, DE, HL registers (of which only HL could be used to reference memory). The Intel 8080 could be source-code translated from 8008 assembler (an easy upgrade path), but improved the ISA notably (with some 16-bit operations, memory access via DE and BC, and a proper stack pointer). As a result, the 8080 formed the basis of the early 1970s business microcomputers based on CP/M. The 8085 was an object-code-compatible successor, which wasn’t used much.

      The Intel 8086 was developed as a stop-gap 16-bit CPU while Intel tried to sort out the 32-bit iAPX 432. In an email exchange with (I think) Stephen P. Morse, he said that basically they had a customer who needed more than 64 kB of direct addressing space, so they bolted on the segmentation scheme. The 8086 was source-code translatable from 8080 assembler, but with different opcodes. Thus it inherited some of the issues from the 8080. It was designed to support high-level languages with frame pointers, and also anticipated future protected, segmented memory as per the Multics OS.

      The 8086 then became super-popular thanks to the IBM PC (though it would have been anyway via CP/M-86), which locked in that ISA for decades. However, the use of the 8-bit-bus 8088 version meant programmers had to abuse the segmentation scheme (Address = SegmentReg*16 + Reg + Offset; see the worked example at the end of this comment) for performance, so when Intel introduced the segment-based, protected-mode 80286 in 1982, existing IBM PC software couldn’t run properly on it. It was only with the 80386, which extended the CPU to 32 bits (and is not truly opcode-compatible either), that a virtual 8086 mode was introduced to fix that earlier problem, while also adding page-based protected memory.

      The 80386’s architecture, later called x86, held sway until the early 2000s, when AMD extended it further to 64 bits: essentially a new but related set of opcodes, plus support for 16 GPRs.

      So, the x86 is clumsy and awkward because its primary objective was to maintain compatibility with earlier CPUs going back to the 8008 (the 4004 wasn’t related), and that CPU had an architecture based more on necessity than on any other objective.

      Other CPUs didn’t have those objectives. The Motorola 680x0 series was designed as a 32-bit version of the hugely popular PDP-11. ARM was designed as a 32-bit CPU inspired by early RISC research, with elements taken from the 16-bit Data General Nova, but it too had to be incompatibly extended to proper 32-bit addressing in the 1990s, and then again to another incompatible 64-bit instruction set in the 2000s. RISC-V was supposed to be a RISC successor to the early MIPS and SPARC CPUs, but without their mistakes (though it has its own weaknesses). All of these, though, have architectural design objectives better than the x86/x64 series.
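
      To put numbers on the segmentation scheme mentioned above, here is a small, hedged C sketch of real-mode 8086 address formation, showing how different segment:offset pairs alias the same 20-bit physical address (the phys() helper is just for illustration):

      #include <stdint.h>
      #include <stdio.h>

      /* Real-mode 8086 address formation: physical = segment * 16 + offset. */
      static unsigned phys(uint16_t seg, uint16_t off)
      {
          return ((unsigned)seg << 4) + off;   /* 20-bit result on the 8086 */
      }

      int main(void)
      {
          printf("0x1234:0x0010 -> 0x%05X\n", phys(0x1234, 0x0010));   /* 0x12350 */
          printf("0x1235:0x0000 -> 0x%05X\n", phys(0x1235, 0x0000));   /* also 0x12350 */
          return 0;
      }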

    4. if you’re writing a compiler then these kinds of nuisances are just addressed one at a time, and then everyone using your compiler gets the benefit of your implementation. there are a very small handful of these sorts of problems that are genuinely deep (8087-compatible floating point) but most of them are just a little head scratch and then you ‘when in Rome’ it.

      if you’re not writing a compiler then are you posting on hackaday to say that you wrote assembly code for no reason and regretted it???

  3. I hear there is a massive reluctance among silicon designers to design ANYTHING, and that they prefer to buy IP blocks from someone and stitch them together instead of designing them in house. Sounds kinda silly to me.

    Disclaimer: This is all hearsay from someone who I know, but I choose to believe them (for better or worse)

    1. Of course they do, for the same reasons you wouldn’t want to reimplement libc.

      It’s extremely expensive to tape out design changes, so it’s worth it to buy IP that’s already been proven in silicon rather than spin your own. And not having to invest engineer time into design, DV, and DFT lets you put more resources into the parts of the design that make your chip unique.

    2. Probably not a massive personal reluctance as such, but it is a lot faster (and simpler) to buy pre-validated IP and integrate that than to reinvent the wheel; effectively it becomes a cost trade-off between paying someone else to have done it and paying with your own time to DIY.
      It’s also the biggest challenge with the more open-source-oriented projects, as they may not be as well validated as commercial packages, which can also come with support for integrating them.

      1. i agree with this and i marvel that this didn’t pan out for rp2350. i want to understand how they used a defective i/o cell in a production run. did raspberry make a choice to use unvalidated IP for some reason that now looks stupid? or did their upstream vendor lie about their validation? or did the cell not have these defects until raspberry integrated it?

        the idea of slipping massive world-stopping defects to core functionality of a reusable cell just makes me question all these assumptions

  4. I think in the case of RISC-V, what needs to happen for stable support is to create a sub-spec of definite expected features for platform X, which could be, say, phones, laptops, PCs, servers, etc.
    RISC-V’s flexibility is both its advantage and its disadvantage, but that sub-spec idea could make it a non-issue.
    Guarantee that a processor built for RV-XY (insert platform type and version here) has those features, and support will massively speed up.
    The community could easily do this unofficially or officially.

      1. thanks! i was sure that part of the article was completely bogus, because you can hardly distinguish that problem from ARM’s. ARM is the exact same: a huge mess of optional features that is a nuisance to anyone orienting to it. and they more or less dealt with it exactly this way. i was sure RISC-V had to have the same problem and the same resolution.

  5. Perhaps the most successful use of RISC-V in China is the cores in Espressif’s popular ESP32-C range of MCUs, (…)

    Let’s not forget about the CH32V family, covered on HaD a number of times.
