Abusing X86 SIMD Instructions To Optimize PlayStation 3 Emulation

Key to efficient hardware emulation is an efficient mapping to the underlying CPU’s opcodes. Here one is free to target opcodes that may or may not have been imagined for that particular use. For projects like the RPCS3 PlayStation 3 emulator this has led to some interesting mappings, as detailed in a video by [Whatcookie].

It’s important to remember here that the Cell processor in the PlayStation 3 is a bit of an odd duck, combining a single general-purpose PowerPC core (the PPE) with multiple much simpler co-processors called synergistic processing elements (SPEs), all connected by a high-speed bus. A lot of the focus with Cell was on floating point vector – i.e. SIMD – processing, which is part of why for a while the PlayStation 3 was not going to have a dedicated GPU.

As a result, it makes perfect sense to creatively map the Cell’s SIMD instructions onto those of e.g. SSE and AVX, even if Intel removing AVX-512 from its consumer CPUs for a while caused major headaches. Fortunately some of those instructions have since reappeared in AVX2 form.

The video goes through a whole range of Cell-specific instructions, how they work, and which x86 SIMD instructions they were mapped to and why. The SPU’s SUMB instruction, for example, is mapped to VPDPBUSD as well as to VDBPSADBW in AVX-512, the latter of which mostly targets things like video encoding. In the end it’s the result that matters, even if it also shows why the Cell processor was so interesting for high-performance compute clusters back in the day.
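
To give a flavor of the trick: VPDPBUSD computes dot products of groups of four unsigned bytes against four signed bytes, so feeding it a vector of ones turns it into a byte-summing instruction, which is the kind of work SUMB does. Below is a minimal sketch of the idea (ours, not RPCS3’s actual code), assuming a CPU and compiler with AVX-512 VNNI and VL support (e.g. gcc -mavx512vnni -mavx512vl):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch, not RPCS3 code: sum groups of four bytes by
 * dotting them against a vector of ones with VPDPBUSD. */
int main(void)
{
    /* Sixteen test bytes; VPDPBUSD treats them as four groups of four. */
    __m128i bytes = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16);
    __m128i ones  = _mm_set1_epi8(1);
    __m128i acc   = _mm_setzero_si128();

    /* Dot product of unsigned bytes with signed ones: each 32-bit lane
     * ends up holding the sum of its four bytes. */
    __m128i sums = _mm_dpbusd_epi32(acc, bytes, ones);

    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, sums);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 10 26 42 58 */
    return 0;
}
```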

Continue reading “Abusing X86 SIMD Instructions To Optimize PlayStation 3 Emulation”

Surviving The RAM Apocalypse With Software Optimizations

To the surprise of almost nobody, the unprecedented build-out of datacenters and the equipping of them with servers for so-called ‘AI’ has led to a massive shortage of certain components. With random access memory (RAM) so far the most heavily affected and storage in the form of HDDs and SSDs not far behind, this has led many to ask how we will survive the coming months, years, decades, or however long the current AI bubble lasts.

One thing is already certain: we will have to make our current computer systems last longer, and forgo simply tossing in more sticks of RAM in favor of doing more with less. This is easy to imagine for those of us who remember running a full-blown Windows desktop on a sub-GHz x86 system with less than a GB of RAM, but it might require some adjustment for everyone else.

In short, what can we software developers do differently to make a hundred MB of RAM stretch further, and make a GB of storage space look positively spacious again?
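
To pick one classic example of the kind of technique meant here (an illustration of ours, not from the article): on a 64-bit machine, simply reordering the members of a struct to minimize alignment padding shrinks every element of a large in-memory table by a third:

```c
#include <stdio.h>

/* Illustrative, not from the article: the compiler pads fields to their
 * alignment, so member order changes the total size. */
struct wasteful {      /* typical x86-64 layout: 24 bytes */
    char   flag;       /* 1 byte + 7 bytes padding before the double */
    double value;      /* 8 bytes */
    int    id;         /* 4 bytes + 4 bytes tail padding */
};

struct packed {        /* same data, largest members first: 16 bytes */
    double value;
    int    id;
    char   flag;       /* only 3 bytes of tail padding remain */
};

int main(void)
{
    printf("wasteful: %zu bytes, packed: %zu bytes\n",
           sizeof(struct wasteful), sizeof(struct packed));
    return 0;
}
```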

Continue reading “Surviving The RAM Apocalypse With Software Optimizations”

Libxml2 Narrowly Avoids Becoming Unmaintained

In an excellent example of one of the most overused XKCD images, the libxml2 library found itself without a maintainer for a little while, with [Nick Wellnhofer] making good on his plan to step down by the end of the year.

Modern-day infrastructure, as visualized by XKCD. (Credit: Randall Munroe)

While this might not sound like a big deal, the real scope of this problem is rather profound. Not only is libxml2 part of GNOME, it’s also used as a dependency by a huge number of projects, including web browsers and just about anything that processes XML or XSLT. Not having a maintainer around when a fresh, high-risk CVE pops up would obviously be less than desirable.

As for why [Nick] stepped down, it’s a long story. It starts in the early 2000s, when the original author [Daniel Veillard] decided he no longer had time for the project and left [Nick] in charge. It should be said here that both of them worked as volunteers on the project, for no financial compensation. This was around the time that large companies began to use projects like libxml2 in their software, and they were happy to send bug reports. Beyond a single donation from Google it was effectively unpaid work, requiring a lot of time spent researching and processing the potential security flaws that were sent in.

Of note is that when such a security report comes in, the expectation is that you, as a volunteer software developer, drop everything you’re working on to figure out the cause, the fix, and a patched-by date, alongside filing a CVE. This rather than being sent a merge request or similar with an accompanying test case. Obviously these kinds of cases seem to have played a major role in [Nick] burning out on maintaining both libxml2 and libxslt.

Fortunately for the project, two new developers have stepped up to take over as maintainers, but it should be obvious that such churn is not a good sign. It also highlights the central problem with the conflicting expectations of open source software being both totally free in a monetary sense and unburdened by critical bugs. This is unfortunately an issue that doesn’t seem to have an easy solution, with e.g. software bounties mostly resulting in headaches.

Bare Metal STM32: Increasing The System Clock And Running Dhrystone

When you start an STM32 MCU with its default configuration, its CPU will tick along at a leisurely 8 to 16 MHz, using the high-speed internal (HSI) clock source as a safe default to bootstrap from. After this phase, we are free to go wild with the system clock, as well as with the various clock sources that are available beyond the HSI.

Increasing the system clock doesn’t just affect the CPU, either: via the prescalers it also affects the MCU’s internal buses, and with them peripherals like the timers on those buses. Hence it’s essential to understand the clock fabric of the target MCU. This article will focus on the general case of increasing the system clock on an STM32F103 MCU from the default to the maximum rated clock speed using the relevant registers, taking into account aspects like Flash wait states and the APB and AHB prescalers.
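
For the STM32F103 the magic numbers are well documented: with an 8 MHz external crystal the PLL multiplies by nine to reach the rated maximum of 72 MHz, which in turn requires two Flash wait states and keeping APB1 at or below 36 MHz. A minimal sketch of that sequence, written against the RM0008 reference manual rather than taken from the article, might look like this:

```c
#include <stdint.h>

#define RCC_CR    (*(volatile uint32_t *)0x40021000u)
#define RCC_CFGR  (*(volatile uint32_t *)0x40021004u)
#define FLASH_ACR (*(volatile uint32_t *)0x40022000u)

/* Hedged sketch of the usual STM32F103 recipe: 8 MHz HSE crystal,
 * PLL x9 = 72 MHz SYSCLK. Addresses and bit positions per RM0008;
 * illustrative, not the article's exact code. */
void clock_init_72mhz(void)
{
    /* 1. Start the external 8 MHz crystal oscillator (HSE). */
    RCC_CR |= (1u << 16);                 /* HSEON */
    while (!(RCC_CR & (1u << 17)))        /* wait for HSERDY */
        ;

    /* 2. Two Flash wait states are required above 48 MHz; also
     *    enable the prefetch buffer. */
    FLASH_ACR = (FLASH_ACR & ~0x7u) | 0x2u | (1u << 4);

    /* 3. Bus prescalers: AHB and APB2 stay at the reset default of /1
     *    (72 MHz); APB1 must be divided by 2 for its 36 MHz maximum. */
    RCC_CFGR |= (0x4u << 8);              /* PPRE1 = /2 */

    /* 4. PLL: source = HSE, multiplier = 9, i.e. 72 MHz. */
    RCC_CFGR |= (1u << 16);               /* PLLSRC = HSE */
    RCC_CFGR |= (0x7u << 18);             /* PLLMUL = x9 */
    RCC_CR   |= (1u << 24);               /* PLLON */
    while (!(RCC_CR & (1u << 25)))        /* wait for PLLRDY */
        ;

    /* 5. Switch SYSCLK over to the PLL and wait for the switch. */
    RCC_CFGR = (RCC_CFGR & ~0x3u) | 0x2u; /* SW = PLL */
    while ((RCC_CFGR & 0xCu) != 0x8u)     /* SWS reports PLL */
        ;
}
```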

Although the Dhrystone benchmark is rather old-fashioned these days, it’ll be used to demonstrate the difference that a faster CPU makes, as well as how complex accurate benchmarking is. Plus it’s just interesting to get an idea of how a lowly Cortex-M3 based MCU compares to a once top-of-the-line Intel Pentium 90 CPU.
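
Part of what makes accurate benchmarking on a Cortex-M3 tractable at all is its DWT cycle counter, which counts raw CPU clocks. A minimal sketch, using the standard ARMv7-M register addresses (our illustration, not necessarily how the article instruments Dhrystone):

```c
#include <stdint.h>

/* Illustrative sketch: ARMv7-M debug registers, per the ARM
 * architecture reference manual. */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)

/* Return the number of CPU cycles fn() takes to run. */
uint32_t cycles_for(void (*fn)(void))
{
    DEMCR     |= (1u << 24);  /* TRCENA: power up the DWT block */
    DWT_CYCCNT = 0;
    DWT_CTRL  |= 1u;          /* CYCCNTENA: start counting cycles */
    fn();
    return DWT_CYCCNT;        /* 32-bit counter wraps after ~59 s at 72 MHz */
}
```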

Continue reading “Bare Metal STM32: Increasing The System Clock And Running Dhrystone”

Debugging The AMD GPU

Although Robert F. Kennedy gets the credit for popularizing it, George Bernard Shaw said: “Some men see things as they are and say, ‘Why?’ I dream of things that never were and say, ‘Why not?'” Well, [Hadz] didn’t wonder why there weren’t many GPU debuggers. Instead, he decided to create one.

It wasn’t the first; he found some blog posts by [Marcell Kiss] that helped, and that led to a series of experiments you’ll enjoy reading about. Plus, don’t miss the video below that shows off a live demo.

It seems that if you don’t have an AMD GPU, this may not be directly useful. But it is still a fascinating peek under the covers of a modern graphics card. Ever wonder how to interact with a video card without using something like Vulkan? This post will tell you how.

Writing a debugger is usually a tricky business anyway, and the GPU’s unusual architecture makes it even trickier. Traps let you gain control, but implementing features like breakpoints and single-stepping isn’t simple.
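
For a CPU-side analogy of how a trap hands control to a debugger (our illustration, not [Hadz]’s GPU code): plant a trap instruction where you want a breakpoint, and the resulting exception lands with whoever is listening. On x86 Linux you can watch the same dance in-process:

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Illustration, not the GPU debugger's code: when the CPU executes the
 * planted trap, the kernel delivers SIGTRAP; a real debugger would stop
 * the target here and inspect its registers before patching the
 * original instruction back in. */
static void on_trap(int sig)
{
    (void)sig;
    const char msg[] = "trap taken: a debugger would inspect state here\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
}

int main(void)
{
    signal(SIGTRAP, on_trap);
    __asm__ volatile("int3");   /* x86's one-byte breakpoint opcode, 0xCC */
    puts("resumed after the trap");
    return 0;
}
```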

We’ve used things like CUDA and OpenCL, but we haven’t been this far down in the weeds. At least, not yet. CUDA, of course, is specific to NVIDIA cards, isn’t it?

Continue reading “Debugging The AMD GPU”

Creating User-Friendly Installers Across Operating Systems

After you have written the code for some awesome application, you of course want other people to be able to use it. Although simply directing them to the source code on GitHub or similar is an option, not every project lends itself to the traditional configure && make && make install, with dependencies often being the sticking point.

Asking the user to install dependencies and set up any filesystem links is an option, but having an installer of some type tackle all of this is of course significantly easier. Typically this would contain the precompiled binaries, along with any other required files, which the installer can then copy to their final location before handling any remaining tasks, like updating configuration files, tweaking a registry, setting up filesystem links and so on.

As simple as this sounds, it comes with a lot of gotchas, with Linux distributions in particular being a tough nut to crack. Whereas on macOS, Windows, Haiku, and many other OSes you can provide a single installer file for the respective platform, for Linux things get interesting.

Continue reading “Creating User-Friendly Installers Across Operating Systems”

Your Supercomputer Arrives In The Cloud

For as long as there have been supercomputers, people like us have seen the announcements and said, “Boy! I’d love to get some time on that computer.” But now that most of us have computers and phones that greatly outpace a Cray-2, what are we doing with them? Of course, a supercomputer today is still bigger than your PC by a long shot, and if you actually have a use case for one, [Stephen Wolfram] shows you how you can easily scale up your processing by borrowing resources from the Wolfram Compute Services. It isn’t free, but you pay with Wolfram service credits, which are not terribly expensive, especially compared to buying a supercomputer.

[Stephen] says he has about 200 cores of local processing at his house, and he still sometimes has programs that run overnight. If your program is already written in the Wolfram Language and uses parallelism — something easy to do with that toolbox — you can simply submit a remote batch job.

Continue reading “Your Supercomputer Arrives In The Cloud”