Checking In On The ISA Wars And Its Impact On CPU Architectures

An Instruction Set Architecture (ISA) defines the software interface through which, for example, a central processing unit (CPU) is controlled. Early computer systems didn’t define a standard ISA as such, but over time the compatibility and portability benefits of having one became obvious. But of course, the best part about standards is that there are so many of them, and thus every CPU manufacturer came up with their own.

Throughout the 1980s and 1990s, the number of mainstream ISAs dropped sharply as the computer industry coalesced around a few major ones in each type of application. Intel’s x86 won out on desktops and smaller servers, while ARM proclaimed victory in low-power and portable devices, and for Big Iron you always had IBM’s Power ISA. Since we last covered the ISA Wars in 2019, quite a lot has changed, including Apple shifting its desktop systems from x86 to ARM with Apple Silicon, and MIPS finding an afterlife in the form of LoongArch.

Meanwhile, six years after that ISA Wars article covered newcomer RISC-V, the ISA seems not to have made the splash some had expected. This raises questions about what we can expect from RISC-V and other ISAs in the future, as well as how relevant the choice of ISA really is when it comes to aspects like CPU performance and microarchitecture.


Faster Integer Division With Floating Point

Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can take between 9 and 15 cycles. SIMD (single instruction, multiple data) instruction sets for array processing, like AVX or NEON, often don’t offer integer division at all (although the RISC-V vector extension does). However, many processors support floating point division. Does it make sense to use floating point division to replace integer division? According to [Wojciech Mula] in a recent post, the answer is yes.

The plan is simple: cast the 8-bit numbers to 32-bit integers and then to floating point numbers. These can be divided in bulk via SIMD instructions and then converted back to 8-bit results. You can find several code examples on GitHub.
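
As a rough scalar sketch of the idea (our own illustration, not code from the post), the conversion chain looks like this; a SIMD version performs the same steps with packed conversions and a packed divide, such as AVX’s _mm256_cvtepi32_ps and _mm256_div_ps:

#include <stdint.h>
#include <stddef.h>

/* Divide 8-bit values via floating point (divisors assumed non-zero).
   Both operands fit easily within a float's 24-bit mantissa and IEEE
   division is correctly rounded, so truncating the float quotient
   yields the exact integer quotient. */
void div_u8(const uint8_t *a, const uint8_t *b, uint8_t *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t wa = a[i];                /* 8-bit -> 32-bit integer */
        uint32_t wb = b[i];
        float fq = (float)wa / (float)wb;  /* floating point division */
        q[i] = (uint8_t)fq;                /* truncate back to 8 bits */
    }
}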


What Is X86-64-v3?

You may have heard Linux pundits discussing x86-64-v3. Can recompiling Linux code to use this bring benefits? To answer that question, you probably need to know what x86-64-v3 is, and [Gary Explains]… well… explains it in a recent video.

If you’d rather digest text, Red Hat has a recent article about their experiments using the instruction set in RHEL 10. From that article, you can see that most of the new instructions add enhancements for vector and bit-manipulation operations. The set also allows for more flexible instructions that leave their results in an explicit destination register instead of overwriting one of the operand registers.

Of course, none of this matters for high-level code unless the compiler supports it. However, gcc version 12 will automatically vectorize code when using the -O2 optimization flag.
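
As a quick, hypothetical illustration, a plain loop like the one below becomes a candidate for AVX2 and FMA instructions when built with gcc -O2 -march=x86-64-v3, whereas the baseline x86-64 target limits the compiler to SSE2:

#include <stddef.h>

/* A simple multiply-accumulate loop; with -march=x86-64-v3 the
   compiler can emit 256-bit AVX2 loads/stores and vfmadd
   instructions for it instead of 128-bit SSE2 code. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}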


Emulating X86 On Apple’s AARCH64 X64 Emulator

You might know [Evan Martin] as the developer of retrowin32. It’s a Windows and x86 emulator designed to run on a Mac or on the web. He’s recently been exploring how to run 32-bit x86 binaries on the AArch64 (aka ARM64) architecture.

[Evan] realized that Apple’s ARM-based Macs feature a high-quality x86 emulator, used via the Rosetta binary translation system. It only supports 64-bit x86-64 binaries, also known as x64, and thus he had initially discounted it for running older 32-bit x86 software. However, as it turns out, x64 features a special compatibility mode for running 32-bit code. [Evan] was able to leverage this to run 32-bit Windows executables rather neatly via the high-performance Rosetta emulator.

To run a 32-bit executable on a 64-bit processor in this way, one creates a 64-bit program that is tasked with loading the 32-bit executable. It’s a little fussy, involving some tricks to handle memory management between the 32-bit code and the 64-bit wrapper and to interface with the OS, but [Evan] deftly explains how it’s all done.

[Evan] notes that this hack may not work forever, especially if Apple changes or deprecates Rosetta’s remaining x86-64 emulation in the future. Regardless, Apple’s “Game Porting Toolkit” relies on similar techniques used by Wine. If you find yourself dancing across platforms, you might learn some nifty tricks from [Evan]’s example!

Learning X86_64 Assembly By Building A GUI From Scratch

Some professional coders are absolutely adamant that learning to program in assembly language in these modern times is simply a waste of time, and this post is not for them. This is for the rest of us, who still think there is value in knowing what is going on at a low level, and that a deeper appreciation can be developed from it. [Philippe Gaultier] was certainly in this latter camp and figured the best way to learn was to work on a substantial project.

Now, there are some valid reasons to write directly in assembler; for example, hand-crafting the unusual code sequences needed for software exploits would be hindered by an optimising compiler. Creating code optimised for speed and size is unlikely to be among those reasons, as doing a better job than a modern compiler would be a considerable challenge. Some folks would follow the tried and trusted route and work towards getting a “hello world!” output to the console or a serial port, but not [Philippe]. This project aimed to get a full-custom GUI application running as a client to the X11 server atop Linux, although the theory should be good for any *nix OS.

Hello World! In X11!

The first part of the process was to create a valid ELF executable that Linux would work with. Using nasm to assemble and the standard linker to link, only a few X86_64 instructions are needed to create a tiny executable that just exits cleanly. Next, we learn how to manipulate the stack in order to set up a non-trivial system call that sends some text to STDOUT.
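
For flavour, here’s a rough C analogue of those two steps (our own sketch, not [Philippe]’s code); the assembly version simply places the same syscall numbers and arguments in rax, rdi, rsi, and rdx itself before executing the syscall instruction:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    static const char msg[] = "hello world!\n";
    /* write(2): fd 1 is STDOUT */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    /* exit(2): terminate cleanly with status 0 */
    syscall(SYS_exit, 0);
}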

To perform any GUI actions, we must remember that X11 is a network-orientated system, where our executable will be a client connected to the X server via a socket. In the simple case, we connect a locally created socket to the well-known path of the local X server, which is just a matter of populating a sockaddr_un data structure on the stack and calling the connect() system call.
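
In C, that step might look something like this minimal sketch (display :0 and the conventional socket path are our assumptions); the assembly issues the same socket() and connect() calls directly as syscalls:

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the local X server over its Unix-domain socket. */
int connect_x11(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strcpy(addr.sun_path, "/tmp/.X11-unix/X0");    /* display :0 */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;    /* socket is ready for the X11 handshake */
}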

With the connection made, we can follow the usual X11 route of creating client IDs and then allocating resources with them. Next, a font is opened and used to create a graphics context, which in turn is needed to create a window. Finally, after mapping the window, it becomes visible and can be drawn into with subsequent commands. X11 is a low-level GUI system, so quite a few steps are needed to create even the simplest drawable object, but this also makes it quite simple to understand and thus well suited to such a project.

We applaud [Philippe] for the fabulous documentation of this learning hack and can’t wait to see what’s next in store!

Not too long ago, we covered Snowdrop OS, which is written entirely in X86 assembly, and we also found out a thing or two about some oddball X86 instructions. We’ve also done our own Linux assembly primer.

Intel Suggests Dropping Everything But 64-Bit From X86 With Its X86-S Proposal

In a move that has a significant part of the internet flashing back to the innocent days of 2001 – when Intel launched its Itanium architecture as a replacement for the then 32-bit-only x86 architecture, before it got bludgeoned by AMD’s competing x86_64 architecture – Intel has now released a whitepaper with an associated X86-S specification that seeks to probe the community’s thoughts on essentially removing all pre-x86_64 features from x86 CPUs.

While today you can essentially still install your copy of MS-DOS 6.22 on a brand-new Intel Core i7 system, with some caveats, it’s undeniable that most PC users wouldn’t notice the removal of the 16- and 32-bit modes, nor the suggested removal of rings 1 and 2 along with a range of other low-level (I/O) features. Rather than the boot process going from 16-bit real mode to protected mode, and from 32-bit to 64-bit mode, the system would boot straight into 64-bit mode, which Intel figures is what everyone uses anyway.

Where things get a bit hazy is that you cannot just install and boot your current 64-bit operating systems on this theoretical X86-S, as they have no concept of its new boot procedure or of the other low-level features that would be dropped. This is where the Itanium comparison seems most apt, as that was Intel’s attempt at a clean break with its x86 legacy, only for literally everything about the concept (VLIW) and ‘legacy software’ support to go horribly wrong.

Although X86-S seems much less ambitious than Itanium, it would nevertheless be interesting to hear AMD’s thoughts on the matter.

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generating code, and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively little custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than just checking whether a specific flag is present (a bitwise AND), [Daniel] wanted to know all the flags set in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word); /* index of the lowest set bit */
  word = word & (word - 1);         /* clear the lowest set bit    */
  i++;
}

The naive version of this loop is very likely to suffer a branch misprediction, and either you or the compiler would speed it up by unrolling it. However, the AVX-512 instruction set on the latest Intel processors has a handy instruction just for this kind of thing: vpcompressd, for which Intel provides the suitably memorable C/C++ intrinsic _mm512_mask_compressstoreu_epi32.

The function generates an array of integers, and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering: doesn’t the processor downclock when the wide 512-bit registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on GitHub if you want to try it yourself.
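
In the meantime, here’s a hedged sketch of how the compress-store approach can be wired up (our own reconstruction, not [Daniel]’s exact code); it assumes an AVX-512F-capable CPU, a gcc/clang-style __builtin_popcount, and the -mavx512f compiler flag:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Decode the positions of all set bits in a 64-bit word, 16 bits at a
   time: each 16-bit chunk of the word acts as a write mask, and
   vpcompressd stores only the lane indices whose mask bit is set. */
size_t decode_bitset(uint64_t word, uint32_t *out)
{
    const __m512i lanes = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    size_t n = 0;
    for (int chunk = 0; chunk < 4; chunk++) {
        __mmask16 m = (__mmask16)(word >> (16 * chunk));
        /* candidate positions for this chunk: 16*chunk .. 16*chunk+15 */
        __m512i idx = _mm512_add_epi32(lanes, _mm512_set1_epi32(16 * chunk));
        _mm512_mask_compressstoreu_epi32(out + n, m, idx);
        n += (size_t)__builtin_popcount((unsigned)m); /* ones = stores */
    }
    return n; /* number of set-bit positions written to out */
}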