Make Your Code Slower With Multithreading

With the performance of modern CPU cores plateauing recently, the main performance gains are with multiple cores and multithreaded applications. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. So, it would seem prudent for our applications to try to code in a multithreaded fashion to take advantage of this parallelism. Or so it would seem, but as [Marc Brooker] illustrates, it’s not as simple as one would assume, and it’s very easy to end up with far worse overall performance and no easy way to fix it.

[Marc] was rerunning an old experiment to calculate the expected number of birthdays in a shared group of people using brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code for single-thread and multithreaded versions and noted the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted that there were significant L1 cache misses, but the real killer for performance was the increase in expensive context switches.  Perf indicated that for four threads, the was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified.  Something in there was using a futex (or fast userspace mutex) a whole lot.

Continue reading “Make Your Code Slower With Multithreading”

Processes, Threads, And… Fibers?

You’ve probably heard of multithreaded programs where a single process can have multiple threads of execution. But here is yet another layer of creating multitasking programs known as a fiber. [A Graphics Guy] lays it out in a lengthy but well-done post. There are examples for both x64 and arm64, although the post mainly focuses on x64 for Windows. However, the ideas will apply anywhere.

In the old days, there was a CPU and when your program ran on it, it was in control. But that’s wasteful, so software quickly moved to where many programs could share the CPU simultaneously. Then, as that got overloaded, computers got more CPUs. Most operating systems have the idea of a process, which is a program that thinks it is in complete control, but it is really sharing the CPU with other processes. The problem arises when you want to have multiple “little” programs that cooperate. Processes are not really supposed to know about one another and, if they do, there’s usually some heavy-weight communication mechanism allowing them to talk.

Continue reading “Processes, Threads, And… Fibers?”

Minecraft Finally Gets Multi-Threaded Servers

Minecraft servers are famously single-threaded and those who host servers for large player bases often pay handsomely for a server that has gobs of memory and ripping fast single-core performance. Previous attempts to break Minecraft into separate threads haven’t ended successfully, but it seems like the folks over at [PaperMC] have finally cracked it with Folia.

Minecraft is one of (if not the most) hacked and modded games in history. Mods have been around since the early days, made possible by a dedicated group who painstakingly decompiled the Java bytecode and reverse-engineered it. Bukkit was a server mod back in the Alpha days that tried to support plugins and extend the default Minecraft. From Bukkit, Spitgot was forked. From Spitgot, Paper was forked, which focused on performance and gameplay mechanics. And now from Paper, Folia is a new fork focused on multi-threading.

A Minecraft world is split up into worlds (such as the nether or the overworld) and chunks. Chunks are 16x16xZ vertical columns of blocks. Folia breaks up sections of chunks into regions that can be ticked independently. Of course, moving to a multi-threaded model will cause existing plugins to fail. Very little was made thread-safe and the idea is that data cannot move easily across ticking regions. Regions tick in parallel, not synchronously.

Naturally, the people benefiting from Folia the most are those running servers that support hundreds of players. On a server with a vanilla-like configuration only around a hundred or so players can be online. Increasing single-core performance isn’t usually an option past this point. By moving to other cores, suddenly you can scale out significantly without restoring to complex proxying. Previous attempts have had multiple Minecraft servers and then synced players and entities between them. Of course, this can cause its own share of issues.

It’s simply incredible to us what the modding community continues to develop and create. It takes deep patience to reverse-engineer the system and rearchitect it from the outside. The Folia codebase is available on GitHub under a GNU GPL 3.0 license if you’d like to look through it.

Python Ditches The GILs And Comes Ashore

The Python world has been fractured a few times before. The infamous transition from version 2 to version 3 still affects people today, and there could be a new schism in the future. [Sam Gross] proposed a solution to drop the Global Interrupt Interpreter Lock (GIL), which would have enormous implications for many projects that leverage the CPython internals, such as Pandas and NumPy.

The fact that Python is interpreted is a double edge sword. It means there can be different runtimes, such as Pyston, Cinder, MicroPython, PyPy, and others, that might support the whole language, a specific version, or a subset. But if you’re using Python, you’re probably running CPython. And it has something known as global interpreter lock that affects threaded code. In a nutshell, only one thread can run in the interpreter at a time. There are some ways around it, such as moving performance-critical sections to C or having multiple interpreters. However, most existing solutions come with considerable downsides. Continue reading “Python Ditches The GILs And Comes Ashore”

Running 57 Threads At Once On The Arduino Uno

When one thinks of the Arduino Uno, one thinks of a capable 8-bit microcontroller platform that nonetheless doesn’t set the world alight with its performance. Unlike more modern parts like the ESP32, it has just a single core and no real multitasking abilities. But what if one wanted to run many threads on an Uno all at once? [Adam] whipped up some code to do just that.

Threads are useful for when you have multiple jobs that need to be done at the same time without interfering with each other. The magic of [Adam]’s ThreadHandler library is that it’s designed to run many threads and do so in real time, with priority management as well. On the Arduino Uno, certainly no speed demon, it can run up to 57 threads concurrently at 6ms intervals with a minumum timing error of 556 µs and a maximum of 952 µs. With a more reasonable number of 7 threads, the minimum error drops to just 120 µs.  Each thread comes with an estimated overhead of 1.3% CPU load and 26 bytes of RAM usage.

While we struggle to think of what we could do with more than a handful of threads on an Arduino Uno, we’re sure you might have some ideas – sound off in the comments. ThreadHandler is available for your perusal here, and runs on SAMD21 boards as well as any AVR-based boards that are compatible with TimerOne. We’ve seen other work in the same space before, such as ChibiOS for the Arduino platform. Video after the break.

Continue reading “Running 57 Threads At Once On The Arduino Uno”