Make Your Code Slower With Multithreading

With the performance of modern CPU cores plateauing recently, the main performance gains are with multiple cores and multithreaded applications. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. So, it would seem prudent for our applications to try to code in a multithreaded fashion to take advantage of this parallelism. Or so it would seem, but as [Marc Brooker] illustrates, it’s not as simple as one would assume, and it’s very easy to end up with far worse overall performance and no easy way to fix it.

[Marc] was rerunning an old experiment to calculate the expected number of birthdays in a shared group of people using brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code for single-thread and multithreaded versions and noted the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted that there were significant L1 cache misses, but the real killer for performance was the increase in expensive context switches.  Perf indicated that for four threads, the was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified.  Something in there was using a futex (or fast userspace mutex) a whole lot.

Continue reading “Make Your Code Slower With Multithreading”

Grabbing The Thread: Spinlocks Vs Mutexes

Getting into the weeds of operating systems is daunting work. Especially when the operating system involved is a fully featured modern PC operating system with millions of lines of code all working together to integrate hardware and software seamlessly. One such operating system “weed” is figuring out how to handle simultaneous tasks when the processor can only really handle one thing at time. For that, you’ll be looking at the difference between spinlocks and mutexes.

Both of these are methods of making sure that the processor completes a task sufficiently before moving on to the next task. Modern computers are so fast (even ignoring multiple cores) that it seems as if they are doing many things at once. In order to maintain this illusion, tasks need ways of locking the processor to that specific task for a certain amount of time. Of course the queue for performing the next task can get complicated as there are often many tasks waiting to use processor time. Spinlocks are a simple way of holding the processor and mutexes are a slightly more complicated way, but which one is the most efficient use of system resources isn’t that straightforward.

If you’ve ever been interested in operating system details, this one goes deep into the intricacies of features most of us have never even considered the existence of. It’s definitely worth a read, though, and is very well written by someone who is clearly an expert. If you want an operating system challenge, you can build your own operating system as well.