Make Your Code Slower With Multithreading

With the performance of individual CPU cores plateauing in recent years, the main performance gains now come from multiple cores and multithreaded applications. Indeed, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. It would seem prudent, then, to write our applications in a multithreaded fashion and take advantage of this parallelism. But as [Marc Brooker] illustrates, it's not as simple as one might assume, and it's very easy to end up with far worse overall performance and no obvious way to fix it.

[Marc] was rerunning an old experiment that calculates the expected number of shared birthdays in a group of people by brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code in single-threaded and multithreaded versions and noted that the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted significant L1 cache misses, but the real performance killer was the increase in expensive context switches. Perf indicated that with four threads, there was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified. Something in there was using a futex (fast userspace mutex) a whole lot.

Delving into the glibc source code, [Marc] found a comment that said it all:

/* POSIX.1c requires that there is mutual exclusion for the `rand' and
   `srand' functions to prevent concurrent calls from modifying common
   data.  */

__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);

By replacing the call to rand() with random_r(), which takes an explicit per-thread state instead of locking a shared one, the program’s performance with four threads improved dramatically, with the runtime dropping to roughly the theoretical quarter of the single-threaded version. As [Marc] summarizes, multithreaded programming is not always as straightforward as it might appear: performance can end up significantly worse, and while improvements are possible, they are never guaranteed.
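For illustration, here’s a minimal sketch of the per-thread approach using glibc’s nonstandard random_r()/initstate_r() API. The thread count, draw count, and the odd/even tallying are placeholder choices for the demo, not [Marc]’s actual benchmark:

#define _DEFAULT_SOURCE        /* random_r() and friends are glibc extensions */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { NTHREADS = 4, DRAWS = 10000000 };

static void *worker(void *arg)
{
    /* Each thread owns its own PRNG state, so glibc takes no lock. */
    struct random_data rd;
    char statebuf[64];
    memset(&rd, 0, sizeof rd);             /* glibc requires a zeroed struct */
    initstate_r((unsigned)(uintptr_t)arg,  /* distinct seed per thread */
                statebuf, sizeof statebuf, &rd);

    long odd = 0;
    for (int i = 0; i < DRAWS; i++) {
        int32_t r;
        random_r(&rd, &r);                 /* no futex, no contention */
        odd += r & 1;
    }
    return (void *)odd;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(t + 1));
    for (int t = 0; t < NTHREADS; t++) {
        void *res;
        pthread_join(tid[t], &res);
        printf("thread %d: %ld odd draws\n", t, (long)res);
    }
    return 0;
}

Build with gcc -O2 -pthread. Since each struct random_data lives on its own thread’s stack, the threads never touch shared state, and the futex traffic disappears.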

The art of debugging and profiling code is complex, so here’s how to use Valgrind to look for problems you might not even know about. Even the humble Linux pipe needs to be thought out to get decent performance. What a surprise!

51 thoughts on “Make Your Code Slower With Multithreading”

  1. Next time, he should consider if the bofusenick is conflicting counter-contusively with the tordanula or if the massive rocwooselon buffer is full of trambignofast calls. 😆

  2. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks…

    I don’t quite understand multiple cores and multithreaded applications. Let’s say I have a very large bash script, maybe 2000 lines. When I hit the run button, is it like every single core goes to line 1, then line 2, and so on, like having “copies” of the same process until the job is done? Or is it more like “well, we can divide the work,” with one core doing this set of lines while another works on a different set in a non-blocking manner, merging results when necessary? Or maybe it’s none of the above???

    1. Unless you’re doing something intentionally wild with gnu parallel or similar, the script itself is done sequentially on a single thread. Commands you call in the script may invoke parallelism and task out whatever you’re piping into them.

      1. Exactly.

        For example, ffmpeg’s video encoders are multi-threaded and will happily use more than one CPU core to convert videos. So even if you use a bash “for” command to sequentially transcode a set of video files, it will not be limited to a single thread/core.
        However, if you have a 16-core (or more) monster like a Threadripper or EPYC, you can speed up processing even further by using GNU parallel to run multiple instances of ffmpeg, allowing you to convert at least 4 videos concurrently.

        1. Wow! Mind blowing, so if you want to make use of more than a single thread/core you have to invoke it in your code, right? I remember there are ways to “unlock” cores during startup, but that doesn’t mean my ordinary script will run faster, right? What about software like LibreOffice or KiCad? Won’t they be handled by multiple cores if they don’t use parallelism?

          I was happy playing Doom in all my cores :( Knowledge hurts :)

          1. The 1st rule of parallel programming is to share as little data as possible. In this case, a separate random number generator should be used for each thread.

          2. UI apps are a little different and are treated as multithreaded/multicore as often as possible. There’s a management thread to handle UI (click/type/update) and background threads to handle long-running processes. That’s how you get something like a little spinning wheel while things are working in the background. It’s pretty easy to tell if an app is “sitting on the UI thread” since the window seems like it locks up for a bit. For games, you may have multiple management threads – a UI thread, a physics thread, etc.

            For things like scripts, single thread is usually what you want except for tiny bits like aforementioned parallel encoding, parallel compiling, etc. If it’s something that’s run on an infrequent basis, then it’s not even worth tuning. Keeping it sequential actually ends up being better, because if something breaks, it’s easier to dissect and fix.
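            As a concrete illustration, here is a minimal C/pthreads sketch of that pattern (a made-up spinner demo, not any particular app’s code; real UI toolkits supply their own event loop):

            #include <pthread.h>
            #include <stdatomic.h>
            #include <stdio.h>
            #include <unistd.h>

            static atomic_int done;           /* worker -> UI completion flag */

            static void *long_task(void *arg)
            {
                (void)arg;
                sleep(3);                     /* stand-in for the long-running work */
                atomic_store(&done, 1);       /* tell the "UI" thread we're finished */
                return NULL;
            }

            int main(void)
            {
                pthread_t worker;
                pthread_create(&worker, NULL, long_task, NULL);

                /* The "UI" loop keeps animating a spinner instead of blocking. */
                const char spin[] = "|/-\\";
                for (int i = 0; !atomic_load(&done); i++) {
                    printf("\rworking %c", spin[i % 4]);
                    fflush(stdout);
                    usleep(100000);           /* ~100 ms per frame */
                }
                pthread_join(worker, NULL);
                printf("\rdone.     \n");
                return 0;
            }

            Build with -pthread. If main() instead called pthread_join() immediately, the “UI” would appear frozen for the full three seconds, which is exactly the “sitting on the UI thread” effect described above.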

        2. Nope.
          Thermals say ‘HI’.
          So does RAM bandwidth.

          Devil is in the details, but core count is mostly dick measuring.
          Aren’t you impressed by idle cores?
          Spend the money on more/better RAM and a better heat sink.

    2. So, two things:
      1) It depends entirely on the task you are trying to parallelize. “Nick’s class” (NC) is the category of algorithms considered to parallelize well, while P is the class of algorithms generally considered tractable to implement and run in a single-threaded manner. That is the same P as in the P versus NP debate; just as P is generally believed not to equal NP, NC is generally believed not to equal P. While there are no proven concrete examples, it’s expected that some tractable single-threaded algorithms do not parallelize well.

      2) The number of lines in a program tells you literally nothing about its runtime. If they only run once each, in order, even an e-core on an older budget chip will handle 2000 lines of script faster than you can blink. On the other hand, even a few hundred lines would be more than sufficient to (try to) print out the Goodstein sequence of an input number. For an input of 3, this runs quickly, outputting 6 numbers. For an input of 4, it will eventually output over 1e100,000,000 numbers, most with over 100 million digits each. Determining how long a program can run based on its size alone is so difficult that it is *literally impossible* in general.

      1. I remember some gamers commenting that games ran better when they used fewer cores than they have, so I guess those games don’t parallelize well.

        Damn! and I’m still struggling trying to “multitask” while not blocking my Arduino. Thanks for the explanations guys!

    3. A simple bash script is very ordered and does not take a lot of advantage of available parallelism at the top level, though individual commands and programs may.
      Using pipes and IO redirects you add more parallelism since producers and consumers run in different processes, which can run in parallel.
      You can also run jobs (one or more commands) in the background, which also inherently takes advantage of parallelism.

    4. I did some lazy-man parallel processing (we’re talking many separate text files, wgets, data extraction/processing) years ago with bash. Very bad practice, but it involved grep-ing the output of ps aux in a loop with a wait period, only firing off new tasks when the number of active tasks was below a certain threshold, and this number could be altered depending on how big a box it was running on.

      The whole thing was dreadfully slow but it worked, and at the time it was only supposed to be used once, so I cared more about the result than the process. I re-did the whole thing in python afterwards, and although that was multi-threaded it ran many times faster on much less powerful hardware.

  3. Parallel software usually has a different architecture to conventional software which runs on a single core/thread. It requires a bit of unlearning, and then learning the new architecture designs.

    1. so true. I’ve been writing multi-threaded multi-core (there is a difference..) programs since the 80’s, and I’m still surprised today at the percentage of programmers that don’t design it in from the ground up. I think it’s partly a conceptualization problem, they just don’t really understand how to use parallel processes..
      It’s also not taught well at uni, or most workplaces..

      1. Optimizing performance is rarely the most important design goal. Usually minimizing time to release or minimizing overall cost are. Sometimes security or just minimizing the risk of bugs also plays a role. So it really shouldn’t be so surprising that most projects aren’t designed from the ground up for parallel processing. Starting with a simple, straightforward approach and later redesigning as (if) needed is almost always a more successful strategy than building a more complex design from the very beginning.

        1. ‘Premature optimization is the root of all evil’.

          It’s sometimes quoted as ‘early optimization…’ but that is incorrect.
          If you know you will need to optimize it, it is never too early.
          If you don’t know you need to optimize it, doing it late is still stupid.

    2. There are some fundamental limitations to running parallel code irrespective of the architecture.

      One is Amdahl’s Law, which states that as you keep splitting your task into parallel threads, you can only go as fast as the part of the task that has to be completed in order, no matter how many computations you can run concurrently.
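      In the usual formulation (a standard statement of the law, with p the parallelizable fraction of the work and N the number of threads):

      S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

      So even with 95% of the work parallelizable (p = 0.95), the speedup can never exceed 1/0.05 = 20x, however many cores you add.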

      The second law, the name of which I’ve forgotten, states that as you add more parallel computing units or threads for your task, the portion of your code that has to be completed in order begins to grow – that is, the task of managing and organizing the computation and collating the results takes more computing time.

      When you take Amdahl’s Law alone, you get diminishing returns, but every additional thread or computational unit still speeds things up somewhat. When you take the whole bureaucracy of parallel computation into account, you get negative returns: throw too many computing units or threads at the job and it will actually come out slower.

      1. This makes more sense in context now. I was reading this and thinking about programming languages that execute in order they’re written. How can you multitask stuff that must be serially executed by the design of the program?

  4. The rules are simple: *never* ever use sync primitives that involve system calls. That includes mutexes, futexes, semaphores, and the like. Want to synchronise? Spin on an atomic. But in most cases you just should not synchronise at all. It’s easy.
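    For illustration, “spin on an atomic” might look like this minimal C11 sketch (a sketch of the idea, not a production lock):

    #include <stdatomic.h>

    /* A tiny spinlock: no syscalls, no futex, just a shared atomic flag. */
    typedef struct { atomic_flag flag; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l)
    {
        /* Busy-wait until we win the flag; the acquire ordering here
           pairs with the release in spin_unlock. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

    Note the trade-off: a waiting thread burns CPU instead of sleeping, so this only pays off when critical sections are very short.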

  5. In years gone by, context switching has been a nemesis. Saving the ‘minimal context’ needed to come back to a ‘blocked’ task quickly, and picking up a ‘blocked’ task that is now free, has always been ‘expensive’. We had that problem in the ’80s/’90s. Using parallel processes takes special programming, or special kinds of problems, to make it most worthwhile. We called the difference ‘vertical MIPS’ (typical big single processors) vs ‘horizontal MIPS’ (parallel processors). Both architectures are great, just different tools that specialize in different problem solutions.

      1. Try it for real sometime, as I had to do recently.
        1. A hundred or so source files in C# and VB.
        2. Several tens of thousands of lines of code.
        3. Several GB of program logs.
        4. A deadlock that occurs once in a blue moon.

        I don’t see how you expect to feed all of that to a chatbot and get any kind of reasonable answer. You’ll probably just get an enormous bill for throwing such large amounts of data at it. That and a smackdown from your boss for handing off so much proprietary code to an external agency.

        The cause turned out to be a database access followed by a synclock in about 20 places (out of hundreds of places that access the database), and a synclock followed by a database access in one other place. Order matters.
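        A minimal C sketch of that class of hazard (the original was C#/VB with SyncLock; these names are made up, and the “DB lock” is modeled as a plain mutex):

        #include <pthread.h>

        pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
        pthread_mutex_t db_lock  = PTHREAD_MUTEX_INITIALIZER;

        void *common_path(void *arg)   /* the ~20 call sites: app lock, then DB */
        {
            pthread_mutex_lock(&app_lock);
            pthread_mutex_lock(&db_lock);
            /* ... query the database ... */
            pthread_mutex_unlock(&db_lock);
            pthread_mutex_unlock(&app_lock);
            return arg;
        }

        void *odd_path(void *arg)      /* the one place that reverses the order */
        {
            pthread_mutex_lock(&db_lock);
            pthread_mutex_lock(&app_lock);  /* deadlocks if common_path holds
                                               app_lock and waits on db_lock */
            /* ... */
            pthread_mutex_unlock(&app_lock);
            pthread_mutex_unlock(&db_lock);
            return arg;
        }

        The standard fix is to pick one global lock order and stick to it everywhere.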

        It took two of us a day to narrow down the possibilities, and another day to jigger up a way to verify the actual cause of the deadlock. The path that caused the deadlock led through something like six different source files in as many projects.

        Real code rarely bears any resemblance to the toy projects that people like to feed to the chatbots.

        “Oh, look. The chatbot found the error in my five line example program! Isn’t it smart?”

        Nah.

  6. I must be too familiar with writing multi-threaded code, but I spotted the problem the second I saw the rand() call. There is a learning curve, but it’s surmountable.

    1. same. when you’re doing multithreaded programming you have to be aware every time you’re accessing a shared resource. it doesn’t seem arcane or difficult to me, but it is mandatory.

  7. Even after taking a graduate course in parallel algorithm design, I would have been spinning my wheels looking for inefficient memory access. I would not have guessed rand() would be the bottleneck!

    1. Python’s Global Interpreter Lock means most stuff runs on a single thread.
      You would need a Python fork without this mutex, and then you’d need to be more explicit about profiling to make sure a multithreaded Python script doesn’t run like crap.

  8. Is it possible these days to boot into a single-task interactive OS on an ARM M- or A-series or RISC-V hardware platform, have access to all hardware instructions with an interactive incremental assembler/interpreter/compiler, then use the OS to load an app and run it?

    Like a server app?

    It was before the multitasking c/c++ OS invasions.

    Multitasking OSes vulnerable to (1) malware penetration, (2) bugs, (3) ~unmaintainable obfuscated code requiring an endless series of expensive updates?

    But multitasking Linuxes valuable despite above bankruptcy-causing issues?

    Like Ubuntu hosting the gcc c compiler. :)

    Unnecessary power consumption by threads irrelevant to the app, another issue?

  9. Define “modern CPU cores” “plateauing” “recently”. These terms are vague at best.

    Because to my recollection Alder Lake got something like a 60-100% boost in single core performance over the 6th-10th gen (all those were similar cores, seems like that was the plateau). That may not exactly have been ‘recent’ or ‘modern’ enough, but you didn’t define it.

    But consider that a 4 GHz Haswell i7 is perfectly capable of running Windows 11 acceptably, and an Alder Lake Celeron at 3 GHz can handily best it with only 2 cores 🤷🏼‍♂️

    The thrust of your article, however, has a good point. You do need to tune whatever implementation you have to the task at hand.
