Make Your Code Slower With Multithreading

With the performance of individual CPU cores plateauing in recent years, the main performance gains now come from multiple cores and multithreaded applications. Indeed, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. It would seem prudent, then, to write our applications in a multithreaded fashion and take advantage of this parallelism. But as [Marc Brooker] illustrates, it's not as simple as one might assume, and it's very easy to end up with far worse overall performance and no obvious way to fix it.

[Marc] was rerunning an old experiment that calculates the expected number of shared birthdays in a group of people by brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code in single-threaded and multithreaded versions and noted that the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted significant L1 cache misses, but the real performance killer was the increase in expensive context switches. Perf indicated that with four threads, there was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified. Something in there was using a futex (fast userspace mutex) a whole lot.

Delving into the glibc source code, [Marc] found a comment that said it all:

/* POSIX.1c requires that there is mutual exclusion for the `rand' and
   `srand' functions to prevent concurrent calls from modifying common
   data.  */

__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);

By replacing the call to rand() with random_r(), which takes an explicit per-thread state instead of locking a shared one, the program’s performance with four threads improved dramatically, with the runtime dropping to roughly the theoretical quarter of the single-threaded version. As [Marc] summarizes, multithreaded programming is not always as straightforward as it might appear: performance can end up significantly worse, and while improvements are possible, they are never guaranteed.
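For illustration, here’s a minimal sketch of the per-thread approach using glibc’s nonstandard random_r()/initstate_r() API. The thread count, draw count, and the odd/even tallying are placeholder choices for the demo, not [Marc]’s actual benchmark:

#define _DEFAULT_SOURCE        /* random_r() and friends are glibc extensions */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { NTHREADS = 4, DRAWS = 10000000 };

static void *worker(void *arg)
{
    /* Each thread owns its own PRNG state, so glibc takes no lock. */
    struct random_data rd;
    char statebuf[64];
    memset(&rd, 0, sizeof rd);             /* glibc requires a zeroed struct */
    initstate_r((unsigned)(uintptr_t)arg,  /* distinct seed per thread */
                statebuf, sizeof statebuf, &rd);

    long odd = 0;
    for (int i = 0; i < DRAWS; i++) {
        int32_t r;
        random_r(&rd, &r);                 /* no futex, no contention */
        odd += r & 1;
    }
    return (void *)odd;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(t + 1));
    for (int t = 0; t < NTHREADS; t++) {
        void *res;
        pthread_join(tid[t], &res);
        printf("thread %d: %ld odd draws\n", t, (long)res);
    }
    return 0;
}

Build with gcc -O2 -pthread. Since each struct random_data lives on its own thread’s stack, the threads never touch shared state, and the futex traffic disappears.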

The art of debugging and profiling code is complex, so here’s how to use Valgrind to look for problems you might not even know about. Even the humble Linux pipe needs to be thought out to get decent performance. What a surprise!

51 thoughts on “Make Your Code Slower With Multithreading”

  1. Next time, he should consider if the bofusenick is conflicting counter-contusively with the tordanula or if the massive rocwooselon buffer is full of trambignofast calls. 😆

  2. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks…

    I don’t quite understand multiple cores and multithreaded applications. Let’s say I have a very large bash script, maybe 2000 lines. When I hit the run button, is it like every single core goes to line 1, then line 2, and so on, like having “copies” of the same process until the job is done? Or is it more like “well, we can divide the work,” with one core doing this set of lines while another works on a different set in a non-blocking manner, merging results when necessary? Or maybe it’s none of the above???

    1. Unless you’re doing something intentionally wild with gnu parallel or similar, the script itself is done sequentially on a single thread. Commands you call in the script may invoke parallelism and task out whatever you’re piping into them.

      1. Exactly.

        For example, ffmpeg’s video encoders are multi-threaded and will happily use more than one CPU core to convert videos. So even if you use a bash “for” command to sequentially transcode a set of video files, it will not be limited to a single thread/core.
        However, if you have a 16-core (or more) monster like a Threadripper or EPYC, you can speed up processing even further by using GNU parallel to run multiple instances of ffmpeg, allowing you to convert at least 4 videos concurrently.

        1. Wow! Mind blowing, so if you want to make use of more than a single thread/core you have to invoke it in your code, right? I remember there are ways to “unlock” cores during startup, but that doesn’t mean my ordinary script will run faster, right? What about software like LibreOffice or KiCad? Won’t they be handled by multiple cores if they don’t use parallelism?

          I was happy playing Doom in all my cores :( Knowledge hurts :)

          1. The 1st rule of parallel programming is to share as little data as possible. In this case, a separate random number generator should be used for each thread.

          2. UI apps are a little different and are treated as multithreaded/multicore as often as possible. There’s a management thread to handle UI (click/type/update) and background threads to handle long-running processes. That’s how you get something like a little spinning wheel while things are working in the background. It’s pretty easy to tell if an app is “sitting on the UI thread” since the window seems like it locks up for a bit. For games, you may have multiple management threads – a UI thread, a physics thread, etc.

            For things like scripts, single thread is usually what you want except for tiny bits like aforementioned parallel encoding, parallel compiling, etc. If it’s something that’s run on an infrequent basis, then it’s not even worth tuning. Keeping it sequential actually ends up being better, because if something breaks, it’s easier to dissect and fix.
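            As a concrete illustration, here is a minimal C/pthreads sketch of that pattern (a made-up spinner demo, not any particular app’s code; real UI toolkits supply their own event loop):

            #include <pthread.h>
            #include <stdatomic.h>
            #include <stdio.h>
            #include <unistd.h>

            static atomic_int done;           /* worker -> UI completion flag */

            static void *long_task(void *arg)
            {
                (void)arg;
                sleep(3);                     /* stand-in for the long-running work */
                atomic_store(&done, 1);       /* tell the "UI" thread we're finished */
                return NULL;
            }

            int main(void)
            {
                pthread_t worker;
                pthread_create(&worker, NULL, long_task, NULL);

                /* The "UI" loop keeps animating a spinner instead of blocking. */
                const char spin[] = "|/-\\";
                for (int i = 0; !atomic_load(&done); i++) {
                    printf("\rworking %c", spin[i % 4]);
                    fflush(stdout);
                    usleep(100000);           /* ~100 ms per frame */
                }
                pthread_join(worker, NULL);
                printf("\rdone.     \n");
                return 0;
            }

            Build with -pthread. If main() instead called pthread_join() immediately, the “UI” would appear frozen for the full three seconds, which is exactly the “sitting on the UI thread” effect described above.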

        2. Nope.
          Thermals say ‘HI’.
          So does RAM bandwidth.

          Devil is in the details, but core count is mostly dick measuring.
          Aren’t you impressed by idle cores?
          Spend the money on more/better RAM and a better heat sink.

    2. So, two things:
      1) It depends entirely on the task you are trying to parallelize. “Nick’s class” (NC) is the category of algorithms considered to parallelize well, while P is the class of algorithms generally considered tractable to implement and run in a single-threaded manner. That is the same P as in the P versus NP debate; just as P is generally believed not to equal NP, NC is generally believed not to equal P. While there are no proven concrete examples, it’s expected that some tractable single-threaded algorithms do not parallelize well.

      2) The number of lines in a program tells you literally nothing about its runtime. If they only run once each, in order, even an e-core on an older budget chip will handle 2000 lines of script faster than you can blink. On the other hand, even a few hundred lines would be more than sufficient to (try to) print out the Goodstein sequence of an input number. For an input of 3, this runs quickly, outputting 6 numbers. For an input of 4, it will eventually output over 1e100,000,000 numbers, most with over 100 million digits each. Determining how long a program can run based on its size alone is so difficult that it is *literally impossible* in general.

      1. I remember some gamers commenting that games ran better when they used fewer cores than they have, so I guess those games don’t parallelize well.

        Damn! and I’m still struggling trying to “multitask” while not blocking my Arduino. Thanks for the explanations guys!

    3. A simple bash script is very ordered and does not take a lot of advantage of available parallelism at the top level, though individual commands and programs may.
      Using pipes and IO redirects you add more parallelism since producers and consumers run in different processes, which can run in parallel.
      You can also run jobs (one or more commands) in the background, which also inherently takes advantage of parallelism.

    4. I did some lazy-man parallel processing (we’re talking many separate text files, wgets, data extraction/processing) years ago with bash. Very bad practice, but it involved grep-ing the output of ps aux in a loop with a wait period, only firing off new tasks when the number of active tasks was below a certain threshold, and this number could be altered depending on how big a box it was running on.

      The whole thing was dreadfully slow but it worked, and at the time it was only supposed to be used once, so I cared more about the result than the process. I re-did the whole thing in python afterwards, and although that was multi-threaded it ran many times faster on much less powerful hardware.

  3. Parallel software usually has a different architecture to conventional software which runs on a single core/thread. It requires a bit of unlearning, and then learning the new architecture designs.

    1. so true. I’ve been writing multi-threaded multi-core (there is a difference..) programs since the 80’s, and I’m still surprised today at the percentage of programmers that don’t design it in from the ground up. I think it’s partly a conceptualization problem, they just don’t really understand how to use parallel processes..
      It’s also not taught well at uni, or most workplaces..

      1. Optimizing performance is rarely the most important design goal. Usually minimizing time to release or minimizing overall cost are. Sometimes security or just minimizing the risk of bugs also plays a role. So it really shouldn’t be so surprising that most projects aren’t designed from the ground up for parallel processing. Starting with a simple, straightforward approach and later redesigning as (if) needed is almost always a more successful strategy than building a more complex design from the very beginning.

        1. ‘Premature optimization is the root of all evil’.

          It’s sometimes quoted as ‘early optimization…’ but that is incorrect.
          If you know you will need to optimize it, it is never too early.
          If you don’t know you need to optimize it, doing it late is still stupid.

    2. There are some fundamental limitations to running parallel code irrespective of the architecture.

      One is Amdahl’s Law, which states that as you keep splitting your task into parallel threads, you can only go as fast as the part of the task that has to be completed in order, no matter how many computations you can run concurrently.
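      In the usual formulation (a standard statement of the law, with p the parallelizable fraction of the work and N the number of threads):

      S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

      So even with 95% of the work parallelizable (p = 0.95), the speedup can never exceed 1/0.05 = 20x, however many cores you add.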

      The second law, the name of which I’ve forgotten, states that as you add more parallel computing units or threads for your task, the portion of your code that has to be completed in order begins to grow – that is, the task of managing and organizing the computation and collating the results takes more computing time.

      When you take Amdahl’s Law alone, you get diminishing returns, but every additional thread or computational unit still speeds things up somewhat. When you take the whole bureaucracy of parallel computation into account, you get negative returns: throw too many computing units or threads at the job and it will actually come out slower.

      1. This makes more sense in context now. I was reading this and thinking about programming languages that execute in order they’re written. How can you multitask stuff that must be serially executed by the design of the program?

  4. The rules are simple: *never* ever use sync primitives that involve system calls. That includes mutexes, futexes, semaphores, and the like. Want to synchronise? Spin on an atomic. But in most cases you just should not synchronise at all. It’s easy.
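    For illustration, “spin on an atomic” might look like this minimal C11 sketch (a sketch of the idea, not a production lock):

    #include <stdatomic.h>

    /* A tiny spinlock: no syscalls, no futex, just a shared atomic flag. */
    typedef struct { atomic_flag flag; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l)
    {
        /* Busy-wait until we win the flag; the acquire ordering here
           pairs with the release in spin_unlock. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

    Note the trade-off: a waiting thread burns CPU instead of sleeping, so this only pays off when critical sections are very short.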

  5. In years gone by, context switching has been a nemesis. Saving the ‘minimal context’ needed to come back to a ‘blocked’ task quickly, and picking up a ‘blocked’ task that is now free, has always been ‘expensive’. We had that problem in the ’80s/’90s. Using parallel processes takes special programming, or special kinds of problems, to make it most worthwhile. We called the difference ‘vertical MIPS’ (typical big single processors) vs ‘horizontal MIPS’ (parallel processors). Both architectures are great, just different tools that specialize in different problem solutions.

      1. Try it for real sometime, as I had to do recently.
        1. A hundred or so source files in C# and VB.
        2. Several tens of thousands of lines of code.
        3. Several GB of program logs.
        4. A deadlock that occurs once in a blue moon.

        I don’t see how you expect to feed all of that to a chatbot and get any kind of reasonable answer. You’ll probably just get an enormous bill for throwing such large amounts of data at it. That and a smackdown from your boss for handing off so much proprietary code to an external agency.

        The cause turned out to be a database access followed by a synclock in about 20 places (out of hundreds of places that access the database), and a synclock followed by a database access in one other place. Order matters.
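        A minimal C sketch of that class of hazard (the original was C#/VB with SyncLock; these names are made up, and the “DB lock” is modeled as a plain mutex):

        #include <pthread.h>

        pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
        pthread_mutex_t db_lock  = PTHREAD_MUTEX_INITIALIZER;

        void *common_path(void *arg)   /* the ~20 call sites: app lock, then DB */
        {
            pthread_mutex_lock(&app_lock);
            pthread_mutex_lock(&db_lock);
            /* ... query the database ... */
            pthread_mutex_unlock(&db_lock);
            pthread_mutex_unlock(&app_lock);
            return arg;
        }

        void *odd_path(void *arg)      /* the one place that reverses the order */
        {
            pthread_mutex_lock(&db_lock);
            pthread_mutex_lock(&app_lock);  /* deadlocks if common_path holds
                                               app_lock and waits on db_lock */
            /* ... */
            pthread_mutex_unlock(&app_lock);
            pthread_mutex_unlock(&db_lock);
            return arg;
        }

        The standard fix is to pick one global lock order and stick to it everywhere.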

        It took two of us a day to narrow down the possibilities, and another day to jigger up a way to verify the actual cause of the deadlock. The path that caused the deadlock led through something like six different source files in as many projects.

        Real code rarely bears any resemblance to the toy projects that people like to feed to the chatbots.

        “Oh, look. The chatbot found the error in my five line example program! Isn’t it smart?”

        Nah.

  6. I must be too familiar with writing multi-threaded code, but I spotted the problem the second I saw the rand() call. There is a learning curve, but it’s surmountable.

    1. same. when you’re doing multithreaded programming you have to be aware every time you’re accessing a shared resource. it doesn’t seem arcane or difficult to me, but it is mandatory.

  7. Even after taking a graduate course in parallel algorithm design, I would have been spinning my wheels looking for inefficient memory access. I would not have guessed rand() would be the bottleneck!

    1. Python’s Global Interpreter Lock means most stuff runs on a single thread.
      You would need a Python fork without this mutex, and then you’d need to be more explicit about profiling to make sure a multithreaded Python script doesn’t run like crap.

  8. Is it possible these days to boot into a single-task interactive OS on an ARM M- or A-series or RISC-V hardware platform, have access to all hardware instructions with an interactive incremental assembler/interpreter/compiler, then use the OS to load an app and run it?

    Like a server app?

    It was before the multitasking c/c++ OS invasions.

    Multitasking OSes vulnerable to (1) malware penetration, (2) bugs, (3) ~unmaintainable obfuscated code requiring an endless series of expensive updates?

    But multitasking Linuxes valuable despite above bankruptcy-causing issues?

    Like Ubuntu hosting the gcc c compiler. :)

    Unnecessary power consumption by threads irrelevant to the app, another issue?

  9. Define “modern CPU cores” “plateauing” “recently”. These terms are vague at best.

    Because to my recollection Alder Lake got something like a 60-100% boost in single core performance over the 6th-10th gen (all those were similar cores, seems like that was the plateau). That may not exactly have been ‘recent’ or ‘modern’ enough, but you didn’t define it.

    But consider that a 4 GHz Haswell i7 is perfectly capable of running Windows 11 acceptably, and an Alder Lake Celeron at 3 GHz can handily best it with only 2 cores 🤷🏼‍♂️

    The thrust of your article, however, has a good point. You do need to tune whatever implementation you have to the task at hand.
