Multiplication on a common microcontroller is easy, but division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can take between 9 and 15 cycles. SIMD (single instruction, multiple data) instruction sets like AVX or NEON, which you would reach for when processing arrays, often don’t offer integer division at all (although the RISC-V vector extensions do). However, many processors support floating point division. Does it make sense to use floating point division to replace simpler integer division? According to [Wojciech Mula] in a recent post, the answer is yes.
The plan is simple: cast the 8-bit numbers into 32-bit integers and then to floating point numbers. These can be divided in bulk via the SIMD instructions and then converted in reverse to the 8-bit result. You can find several code examples on GitHub.
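As a rough scalar sketch of that plan (not [Wojciech]'s actual SIMD code), the C below widens each byte, converts it to a float, divides, and truncates back to 8 bits. The name div_u8_via_float is made up for illustration, and every divisor is assumed to be nonzero. The same steps could be written with SIMD intrinsics, or left to a vectorizing compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: q[i] = a[i] / b[i] for unsigned 8-bit values,
 * computed through single-precision floating point.
 * Assumption: every b[i] is nonzero. Dividing by zero would produce an
 * infinity or NaN, and converting those back to an integer is undefined
 * behavior in C. */
void div_u8_via_float(const uint8_t *a, const uint8_t *b, uint8_t *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t wide_a = a[i];          /* 8-bit -> 32-bit integer */
        uint32_t wide_b = b[i];
        float fa = (float)wide_a;        /* 32-bit integer -> float */
        float fb = (float)wide_b;
        q[i] = (uint8_t)(fa / fb);       /* conversion truncates toward zero */
    }
}
```

A float carries a 24-bit significand, so 8-bit operands are represented exactly, and the truncated quotient matches integer division for these inputs.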
You may have heard Linux pundits discussing x86-64-v3. Can recompiling Linux code to use this bring benefits? To answer that question, you probably need to know what x86-64-v3 is, and [Gary Explains]… well… explains it in a recent video.
If you’d rather digest text, Red Hat has a recent article about their experiments using the instruction set in RHEL10. From that article, you can see that most of the new instructions support some enhancements for vectors and bit manipulation. It also allows for more flexible instructions that leave their results in an explicit destination register instead of one of the operand registers.
Of course, none of this matters for high-level code unless the compiler supports it. However, gcc version 12 will automatically vectorize code when using the -O2 optimization flag.
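To see this for yourself, a loop like the one below is a typical candidate. This is only a sketch: the function name and array size are made up, and whether a given gcc release actually vectorizes it depends on its cost model. Compiling with gcc -O2 -march=x86-64-v3 -S and looking for ymm registers in the generated assembly is one way to check.

```c
#include <stddef.h>

#define N 1024  /* fixed trip count, a multiple of the 8 floats in a 256-bit AVX register */

/* Hypothetical example: dst[i] = a[i] * k + b[i].
 * `restrict` promises that the arrays do not overlap, so the compiler does
 * not need a runtime aliasing check before vectorizing the loop. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k)
{
    for (size_t i = 0; i < N; i++) {
        dst[i] = a[i] * k + b[i];
    }
}
```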
[Evan] realized that Apple’s ARM-based Macs feature a high-quality x86 emulator, used via the Rosetta binary translation system. It only supports 64-bit x86-64 binaries, also known as x64, and thus he had initially discounted it for running older 32-bit x86 software. However, as it turns out, x64 features a special compatibility mode for running 32-bit code. [Evan] was able to leverage this to run 32-bit Windows executables rather neatly via the high-performance Rosetta emulator.
To run a 32-bit executable on a 64-bit processor in this way, one creates a 64-bit program that is tasked with loading the 32-bit executable. It’s a little fussy, involving some tricks to handle memory management between the 32-bit code and the 64-bit wrapper, and how to interface with the OS, but [Evan] explains deftly how it’s all done.
[Evan] notes that this hack may not work forever, especially if Apple changes or deprecates Rosetta’s remaining x86-64 emulation in the future. Regardless, Apple’s “Game Porting Toolkit” relies on techniques similar to those used by Wine. If you find yourself dancing across platforms, you might learn some nifty tricks from [Evan]’s example!
Some professional coders are absolutely adamant that learning to program in assembly language in these modern times is simply a waste of time, and this post is not for them. This is for the rest of us, who still think there is value in knowing what is going on at a low level, so that a deeper appreciation can be developed. [Philippe Gaultier] was certainly in this latter camp and figured the best way to learn was to work on a substantial project.
Now, there are some valid reasons to write directly in assembler; for example hand-crafting unusual code sequences for creating software exploits would be hindered by an optimising compiler. Creating code optimised for speed and size is unlikely to be among those reasons, as doing a better job than a modern compiler would be a considerable challenge. Some folks would follow the tried and trusted route and work towards getting a “hello world!” output to the console or a serial port, but not [Philippe]. This project aimed to get a full-custom GUI application running as a client to the X11 server running atop Linux, but the theory should be good for any *nix OS.
The first part of the process was to create a valid ELF executable that Linux would work with. Using nasm to assemble and the standard linker, only a few X86_64 instructions are needed to create a tiny executable that just exits cleanly. Next, we learn how to manipulate the stack in order to set up a non-trivial system call that sends some text to the system STDOUT.
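For readers who would rather see the register convention from C first, here is a hedged sketch of the same kind of system call made through one line of inline assembly. It assumes Linux on x86-64 with gcc or clang, and falls back to the libc write() elsewhere. The name raw_write is made up and is not taken from [Philippe]'s code.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes from `buf` to file descriptor `fd`.
 * On x86-64 Linux the system call number goes in rax (1 is write) and the
 * first three arguments go in rdi, rsi and rdx. The `syscall` instruction
 * overwrites rcx and r11, and the result comes back in rax: the number of
 * bytes written, or a negative errno value on failure. */
long raw_write(int fd, const void *buf, size_t len)
{
#if defined(__linux__) && defined(__x86_64__) && !defined(__ILP32__) && defined(__GNUC__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Assumption: any other platform simply goes through the C library. */
    return (long)write(fd, buf, len);
#endif
}
```

Calling raw_write(1, "hello\n", 6) sends the text to standard output, since descriptor 1 is STDOUT.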
To perform any GUI actions, we must remember that X11 is a network-orientated system, where our executable will be a client connected via a socket. In the simple case, we just connect the locally created socket to the server path for the local X server, which is just a matter of populating the sockaddr_un data structure on the stack and calling the connect() system call.
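In C, the same two steps look roughly like the sketch below. The function names are made up for illustration, and the socket path follows the usual /tmp/.X11-unix/X<display> convention, which is an assumption about how the local X server is set up.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: fill in the Unix-domain socket address of a local
 * X server, e.g. /tmp/.X11-unix/X0 for display 0. */
void x11_socket_address(struct sockaddr_un *addr, int display)
{
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    snprintf(addr->sun_path, sizeof addr->sun_path,
             "/tmp/.X11-unix/X%d", display);
}

/* Hypothetical helper: returns a connected socket, or -1 on failure. */
int x11_connect(int display)
{
    struct sockaddr_un addr;
    x11_socket_address(&addr, display);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (connect(fd, (const struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The assembly version does the same thing, except that the structure is laid out by hand on the stack and socket() and connect() are invoked as raw system calls.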
Now that the connection is made, we can follow the usual X11 route of creating client ids, then allocating resources using them. Next, a font is opened, a graphics context is created that uses it, and a window is created. Finally, after mapping the window, something will be visible to draw into with subsequent commands. X11 is a low-level GUI system, so there are quite a few steps to create even the simplest drawable object, but this also makes it quite simple to understand and thus quite suited to such a project.
We applaud [Philippe] for the fabulous documentation of this learning hack and can’t wait to see what’s next in store!
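The client ids mentioned above are picked by the client itself, within a range the server hands out when the connection is set up: a resource id base plus a mask of bits the client may set. A minimal sketch of such an allocator follows. The struct, the function name and the sample values in the usage note are made up for illustration, and nothing here handles running out of ids.

```c
#include <stdint.h>

/* Hypothetical allocator state. id_base and id_mask would come from the
 * server's connection setup reply; next is a simple counter. */
typedef struct {
    uint32_t id_base;
    uint32_t id_mask;
    uint32_t next;
} x11_id_allocator;

/* Return a fresh resource id: the counter is shifted into the bits covered
 * by the mask and then combined with the base.
 * Assumption: the caller never allocates more ids than the mask can hold. */
uint32_t x11_next_id(x11_id_allocator *alloc)
{
    /* Find the position of the lowest set bit in the mask. */
    uint32_t shift = 0;
    while (alloc->id_mask != 0 && ((alloc->id_mask >> shift) & 1u) == 0) {
        shift++;
    }
    uint32_t id = alloc->id_base | ((alloc->next << shift) & alloc->id_mask);
    alloc->next++;
    return id;
}
```

With a base of 0x04400000 and a mask of 0x001fffff, for example, successive calls would hand out 0x04400000, 0x04400001 and so on, one for each font, graphics context or window the client creates.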
In a move that has a significant part of the internet flashing back to the innocent days of 2001 when Intel launched its Itanium architecture as a replacement for the then 32-bit only x86 architecture – before it got bludgeoned by AMD’s competing x86_64 architecture – Intel has now released a whitepaper with an associated X86-S specification that seeks to probe the community’s thoughts on essentially removing all pre-x86_64 features from x86 CPUs.
While today you can essentially still install your copy of MSDOS 6.11 on a brand-new Intel Core i7 system, with some caveats, it’s undeniable that for most PC users the removal of 16 and 32-bit mode would likely go unnoticed, as would the suggested removal of rings 1 and 2 and a range of other low-level (I/O) features. Rather than the boot process going from real-mode 16-bit to protected mode, and from 32- to 64-bit mode, the system would boot straight into the 64-bit mode which Intel figures is what everyone uses anyway.
Where things get a bit hazy is that on this theoretical X86-S you cannot just install and boot your current 64-bit operating systems, as they have no concept of this new boot procedure, or the other low-level features that got dropped. This is where the Itanium comparison seems most apt, as it was Intel’s attempt at a clean cut with its x86 legacy, only for literally everything about the concept (VLIW) and ‘legacy software’ support to go horribly wrong.
Although X86-S seems much less ambitious than Itanium, it would nevertheless be interesting to hear AMD’s thoughts on the matter.
For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generating code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively little custom assembly.
Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:
while (word != 0) {
    result[i] = trailingzeroes(word); // position of the lowest set bit
    word = word & (word - 1);         // clear that bit
    i++;
}
The naive version of this loop is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.
The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bit registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on GitHub if you want to try it yourself.
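To make the idea concrete, here is a sketch that decodes a single 64-bit word this way. It is not [Daniel]'s benchmark code: the function names are made up, it assumes gcc or clang, and it falls back to the scalar loop when the CPU (or the target architecture) has no AVX-512. The output buffer is assumed to have room for 64 entries.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

/* AVX-512 path: handle the word 16 bits at a time. Each 16-bit slice acts
 * as a mask that selects lanes from a vector of candidate bit positions,
 * and the compress-store writes the selected positions contiguously. */
__attribute__((target("avx512f")))
static size_t decode_avx512(uint64_t word, uint32_t *out)
{
    __m512i positions = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i sixteen = _mm512_set1_epi32(16);
    size_t count = 0;
    for (int slice = 0; slice < 4; slice++) {
        __mmask16 mask = (__mmask16)(word >> (16 * slice));
        _mm512_mask_compressstoreu_epi32(out + count, mask, positions);
        /* The number of ones in the mask is how many entries were written. */
        count += (size_t)__builtin_popcount((unsigned)mask);
        positions = _mm512_add_epi32(positions, sixteen);
    }
    return count;
}
#endif

/* Scalar path: the loop from the text, with a compiler builtin standing in
 * for trailingzeroes(). */
static size_t decode_scalar(uint64_t word, uint32_t *out)
{
    size_t count = 0;
    while (word != 0) {
        out[count++] = (uint32_t)__builtin_ctzll(word);
        word &= word - 1;
    }
    return count;
}

/* Hypothetical entry point: writes the positions of the set bits of `word`
 * to `out` in increasing order and returns how many there were. */
size_t decode_bitset(uint64_t word, uint32_t *out)
{
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        return decode_avx512(word, out);
    }
#endif
    return decode_scalar(word, out);
}
```

Note that the vector path has no data-dependent branch at all: it always runs four iterations per word, however many bits are set.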
We all probably know that for ultimate control and maximum performance, you need assembly language. No matter how good your compiler is, you’ll almost always be able to do better by using your human smarts to map your problem onto a computer’s architecture. Programming in assembly for PCs though is a little tricky. A lot of information about PC assembly language dates back to when assembly was more common, but it also covers old modes that, while still available, aren’t the best answer for the latest processors. [Gpfault] has launched a series on 64-bit x86 assembly that tries to remedy that, especially if you are working under Windows.
So far there are three entries. The first covers setting up your toolchain and creating a simple program that does almost nothing. But it is a start.