What Is X86-64-v3?

You may have heard Linux pundits discussing x86-64-v3. Can recompiling Linux code to use this bring benefits? To answer that question, you probably need to know what x86-64-v3 is, and [Gary Explains]… well… explains it in a recent video.

If you’d rather digest text, RedHat has a recent article about their experiments using the instructions set in RHEL10. From that article, you can see that most of the new instructions support some enhancements for vectors and bit manipulation. It also allows for more flexible instructions that leave their results in an explicit destination register instead of one of the operand registers.

Of course, none of this matters for high-level code unless the compiler supports it. However, gcc version 12 will automatically vectorize code when using the -O2 optimization flags.

Continue reading “What Is X86-64-v3?”

Emulating X86 On Apple’s AARCH64 X64 Emulator

You might know [Evan Martin] as the developer of retrowin32. It’s a Windows and x86 emulator designed to run on a Mac or on the web. He’s recently been exploring how to run 32-bit x86 binaries on the AArch64 (aka ARM64) architecture.

[Evan] realized that Apple’s ARM-based Macs feature a high-quality x86 emulator, used via the Rosetta binary translation system. It only supports 64-bit x86-64 binaries, also known as x64, and thus he had initially discounted it for running older 32-bit x86 software. However, as it turns out, x64 features a special compatibility mode for running 32-bit code. [Evan] was able to leverage this to run 32-bit Windows executables rather neatly via the high-performance Rosetta emulator.

To run a 32-bit executable on a 64-bit processor in this way, one creates a 64-bit program that is tasked with loading the 32-bit executable. It’s a little fussy, involving some tricks to handle memory management between the 32-bit code and the 64-bit wrapper, and how to interface with the OS, but [Evan] explains deftly how it’s all done.

[Evan] notes that this hack may not work forever, especially if Apple changes or deprecates Rosetta’s remaining x86-64 emulation in the future. Regardless, Apple’s “Game Porting Toolkit” relies on similar techniques used by Wine. If you find yourself dancing across platforms, you might learn some nifty tricks from [Evan]’s example!

Learning X86_64 Assembly By Building A GUI From Scratch

Some professional coders are absolutely adamant that learning to program in assembly language in these modern times is simply a waste of time, and this post is not for them. This is for the rest of us, who still think there is value in knowing at a low level what is going on, a deeper appreciation can be developed. [Philippe Gaultier] was certainly in this latter camp and figured the best way to learn was to work on a substantial project.

Now, there are some valid reasons to write directly in assembler; for example hand-crafting unusual code sequences for creating software exploits would be hindered by an optimising compiler. Creating code optimised for speed and size is unlikely to be among those reasons, as doing a better job than a modern compiler would be a considerable challenge. Some folks would follow the tried and trusted route and work towards getting a “hello world!” output to the console or a serial port, but not [Philippe]. This project aimed to get a full-custom GUI application running as a client to the X11 server running atop Linux, but the theory should be good for any *nix OS.

Hello World! In X11!

The first part of the process was to create a valid ELF executable that Linux would work with. Using nasm to assemble and the standard linker, only a few X86_64 instructions are needed to create a tiny executable that just exits cleanly. Next, we learn how to manipulate the stack in order to set up a non-trivial system call that sends some text to the system STDOUT.

To perform any GUI actions, we must remember that X11 is a network-orientated system, where our executable will be a client connected via a socket. In the simple case, we just connect the locally created socket to the server path for the local X server, which is just a matter of populating the sockaddr_un data structure on the stack and calling the connect() system call.

Now the connection is made, we can follow the usual X11 route of creating client ids, then allocating resources using them. Next, fonts are opened, and a graphical context is created to use it to create a window. Finally, after mapping the window, something will be visible to draw into with subsequent commands. X11 is a low-level GUI system, so there are quite a few steps to create even the most simple drawable object, but this also makes it quite simple to understand and thus quite suited to such a project.

We applaud [Phillip] for the fabulous documentation of this learning hack and can’t wait to see what’s next in store!

Not too long ago, we covered Snowdrop OS, which is written entirely in X86 assembly, and we also found out a thing or two about some oddball X86 instructions. We’ve also done our own Linux assembly primer.

Intel Suggests Dropping Everything But 64-Bit From X86 With Its X86-S Proposal

In a move that has a significant part of the internet flashing back to the innocent days of 2001 when Intel launched its Itanium architecture as a replacement for the then 32-bit only x86 architecture – before it getting bludgeoned by AMD’s competing x86_64 architecture – Intel has now released a whitepaper with associated X86-S specification that seeks to probe the community’s thoughts on it essentially removing all pre-x86_64 features out of x86 CPUs.

While today you can essentially still install your copy of MSDOS 6.11 on a brand-new Intel Core i7 system, with some caveats, it’s undeniable that to most users of PCs the removal of 16 and 32-bit mode would likely go by unnoticed, as well as the suggested removal of rings 1 and 2, as well as range of other low-level (I/O) features. Rather than the boot process going from real-mode 16-bit to protected mode, and from 32- to 64-bit mode, the system would boot straight into the 64-bit mode which Intel figures is what everyone uses anyway.

Where things get a bit hazy is that on this theoretical X86-S you cannot just install and boot your current 64-bit operating systems, as they have no concept of this new boot procedure, or the other low-level features that got dropped. This is where the Itanium comparison seems most apt, as it was Intel’s attempt at a clean cut with its x86 legacy, only for literally everything about the concept (VLIW) and ‘legacy software’ support to go horribly wrong.

Although X86-S seems much less ambitious than Itanium, it would nevertheless be interesting to hear AMD’s thoughts on the matter.

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Assembly Language For Real

We all probably know that for ultimate control and maximum performance, you need assembly language. No matter how good your compiler is, you’ll almost always be able to do better by using your human smarts to map your problem onto a computer’s architecture. Programming in assembly for PCs though is a little tricky. A lot of information about PC assembly language dates back from when assembly was more common, but it also covers old modes that, while still available, aren’t the best answer for the latest processors. [Gpfault] has launched a series on 64-bit x86 assembly that tries to remedy that, especially if you are working under Windows.

So far there are three entries. The first covers setting up your toolchain and creating a simple program that does almost nothing. But it is a start.

Continue reading “Assembly Language For Real”

Odyssey Is A X86 Computer Packing An Arduino Along For The Trip

We love the simplicity of Arduino for focused tasks, we love how Raspberry Pi GPIO pins open a doorway to a wide world of peripherals, and we love the software ecosystem of Intel’s x86 instruction set. It’s great that some products manage to combine all of them together into a single compact package, and we welcome the recent addition of Seeed Studio’s Odyssey X86J4105.

[Ars Technica] recently looked one over and found it impressive from the perspective of a small networked computer, but they didn’t dig too deeply into the maker-friendly side of the product. We can look at the product documentation to see some interesting details. This board is larger than a Raspberry Pi, but its GPIO pins were laid out in exactly the same order as that on a Pi. Some HATs could plug right in, eliminating all the electrical integration leaving just the software issue of ARM vs x86. Tasks that are not suitable for CPU-controlled GPIO (such as generating reliable PWM) can be offloaded to an on-board Arduino-compatible microcontroller. It is built around the SAMD21 chip, similar to the Arduino MKR and Arduino Zero but the pinout does not appear to match any of the popular Arduino form factors.

The Odyssey is not the first x86 single board computer (SBC) to have GPIO pins and an onboard Arduino assistant. LattePanda for example has been executing that game plan (minus the Raspberry Pi pin layout) for the past few years. We’ve followed them since their Kickstarter origins and we’ve featured creative uses here and there. LattePanda’s current offerings are built around Intel CPUs ranging from Atom to Core m3. The Odyssey’s Celeron is roughly in the middle of that range, and the SAMD21 is more capable than the ATmega32U4 (Arduino Leonardo) on board a LattePanda. We always love seeing more options in a market for us to find the right tradeoff to match a given project, and we look forward to the epic journeys yet to come.