Fork And Run: The Definitive Guide To Getting Started With Multiprocessing

Since the early 2000s, the CPU industry has shifted its focus from raw clock speed to core counts. Pat Gelsinger famously took the stage in 2002 and gave the talk the industry needed, arguing that processors needed specialty silicon or multiple cores to reduce power requirements and spread out heat. A few years later, Intel's Core series arrived in two- and four-core configurations to compete with the AMD Athlon 64 X2.

Nowadays, we’re seeing heterogeneous chip designs with big and little cores, chiplets, and other crazy fabrication techniques that are fundamentally the same concept: spread the thermal load across multiple pieces of silicon. This writer is willing to put good money on betting that you’ll see consumer desktop machines with 32 physical cores in less than five years. It might be hard to believe, but a 2013 Intel Haswell i7 came with just four cores compared to the twenty you’ll get in an i7 today. Even an ESP32 has two cores, with support in FreeRTOS for pinning tasks to each. With so many cores, how do you even write software for them? What’s the difference between processes and threads? And how does this all work in straight vanilla C?

The Theory of Multi-Threading

Processes, threads, and lightweight threads are the most common units of multi-processing. Processes have their own memory space and execution context and have to coordinate via IPC (inter-process communication): pipes, sockets, FIFO files, or explicitly shared memory. Threads are different execution contexts that make up a process but share the same memory space. Because of this, great care must be taken to ensure the different pieces of state are locked and unlocked to preserve data integrity and ensure correct behavior.

Lightweight threads are userspace threads. For most operating systems, threads and processes are constructs managed by the OS, with context switching handled by the kernel. This adds overhead. Some languages implement threads inside the program itself, in userspace. These sorts of lightweight threads are often called green threads or fibers.

Forking in Vanilla C

Since processes and threads are controlled by the OS, Windows and Unix have different approaches to creating and managing them. For this section, we will focus on the Unix/POSIX way.

For processes, there is a single function that does all the heavy lifting: fork.

Calling fork returns twice, once in each process: the child gets a zero, and the parent gets the new child’s pid. On error, a negative number is returned instead. The parent can call waitpid to wait for the child to finish executing and collect its status.

#include <unistd.h>
#include <sys/types.h> 
#include <sys/wait.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char**argv) {
  pid_t child = fork();
  if (child == -1) return EXIT_FAILURE;
  if (child) { /* I have a child! */
    int status;
    waitpid(child, &status, 0);
    return EXIT_SUCCESS;

  } else { /* I am the child */
    // Other versions of exec pass in arguments as arrays
    // Remember first arg is the program name
    // Last arg must be a char pointer to NULL

    execl("/bin/ls", "ls","-alh", (char *) NULL);

    // If we get to this line, something went wrong!
    perror("exec failed!");
    return EXIT_FAILURE;
  }
}

The child picks up from where the parent was. And while the two don’t share a memory address space, Unix does copy-on-write: the processes share pages until one of them changes something, and then that process gets its own copy. Of course, this transparent copying can cause some weird issues:

#include <unistd.h> /*fork declared here*/
#include <stdio.h> /* printf declared here*/
int main() {
   int answer = 21 << 1;
   printf("Answer: %d", answer);
   fork();
   return 0;
}

Running this, you’ll see 42 twice. The print is before the fork, so it is only executed once. But the buffer hasn’t been flushed to stdout, so when both processes exit, they flush their buffers containing “Answer: 42”.

Threads are a bit different. Rather than fork, we use POSIX threads (or pthreads, as the kids call them). The pthread.h header provides functionality for creating and managing threads. It is a leaky abstraction, as many aspects expose what POSIX-compliant kernels are doing under the hood. Rather than forking the whole process at specific points, threads are usually focused on running a single function. So the code looks something like this:

#include <pthread.h> /*pthread declared here*/
#include <stdio.h> /* printf declared here*/
void *threadedFunction(void* varp) {
   int answer = 21 << 1;
   printf("Answer: %d\n", answer);
   return NULL;
}
int main() {
   pthread_t thread_id;
   printf("Before Thread\n");
   pthread_create(&thread_id, NULL, threadedFunction, NULL);
   pthread_join(thread_id, NULL);
   printf("After Thread\n");
   return 0;
}

The output you’ll see from this is:

Before Thread
Answer: 42
After Thread

Instead of a process id, we get a thread id used to keep track of the various threads. pthread_join is similar to the waitpid example above, as the parent thread waits on the termination or completion of another.

As mentioned earlier, threads live in the same memory space, and things can get weird. Static variables inside functions and global state are all shared and accessed concurrently. For larger structures, one thread can be reading a value while another is writing it; you’ll get half of the new value and half of the old value, resulting in weird and hard-to-debug behavior. There are many solutions to these sorts of problems, all with different tradeoffs, but at the heart of them all is the mutex.

Let’s start with a simple color printing example with five threads that Faye Williams has on her blog that shows off a simple mutex.

#include <iostream>
#include "pthread.h"
#include <string>
 
using namespace std;
 
#define NUM_THREADS 5
 
#define BLACK   "\033[0m"
#define RED     "\033[1;31m"
#define GREEN   "\033[1;32m"
#define YELLOW  "\033[1;33m"
#define BLUE    "\033[1;34m"
#define CYAN    "\033[1;36m"
 
void* PrintAsciiText(void *id) {
    const char* colour;
 
    switch((long)id) {
    case 0:
        colour = RED;
        break;
    case 1:
        colour = GREEN;
        break;
    case 2:
        colour = YELLOW;
        break;
    case 3:
        colour = BLUE;
        break;
    case 4:
        colour = CYAN;
        break;
    default:
        colour = BLACK;
        break;
   }
 
   printf("%s", colour);
   printf("I'm a new thread, I'm number ");
   printf("%ld", (long)id);
   printf("%s\n", BLACK); 
   pthread_exit(NULL);
}
 
int main() {
    pthread_t threads[NUM_THREADS];
 
    for (long int i = 0 ; i < NUM_THREADS ; ++i) {
        int t = pthread_create(&threads[i], NULL, PrintAsciiText, (void*)i);
 
        if (t != 0) {
            printf("Error in thread creation: %d\n", t);
        }
    }
 
    for(int i = 0 ; i < NUM_THREADS; ++i) {
        void* status;
        int t = pthread_join(threads[i], &status);
        if (t != 0) {
            printf("Error in thread join: %d\n", t);
        }
    }
 
    return 0;
}

At first glance, this code seems fine. However, when we run it, we get some confusing output.

I'm a new thread, I'm number I'm a new thread, I'm number, I'm a new thread, I'm number, I'm a new thread, I'm number 3
2
I'm a new thread, I'm number 4
1
0

Of course, the ordering largely depends on your OS and current processor conditions as it is all scheduler-dependent. These threads are managed by the OS, and the scheduler will run them in the order that it pleases. This is where a mutex comes in.

#include <iostream>
#include "pthread.h"
#include <string>
 
using namespace std;
 
#define NUM_THREADS 5
 
#define BLACK   "\033[0m"
#define RED     "\033[1;31m"
#define GREEN   "\033[1;32m"
#define YELLOW  "\033[1;33m"
#define BLUE    "\033[1;34m"
#define CYAN    "\033[1;36m"

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
 
void* PrintAsciiText(void *id) {
    const char* colour;
    pthread_mutex_lock(&mutex);
    switch((long)id) {
    case 0:
        colour = RED;
        break;
    case 1:
        colour = GREEN;
        break;
    case 2:
        colour = YELLOW;
        break;
    case 3:
        colour = BLUE;
        break;
    case 4:
        colour = CYAN;
        break;
    default:
        colour = BLACK;
        break;
    }
 
   printf("%s", colour);
   printf("I'm a new thread, I'm number ");
   printf("%ld", (long)id);
   printf("%s\n", BLACK); 
   pthread_mutex_unlock(&mutex);
 
   pthread_exit(NULL);
}
 
int main() {
    pthread_t threads[NUM_THREADS];
 
    for (long int i = 0 ; i < NUM_THREADS ; ++i) {
        int t = pthread_create(&threads[i], NULL, PrintAsciiText, (void*)i);
 
        if (t != 0) {
            printf("Error in thread creation: %d\n", t);
        }
    }
 
    for(int i = 0 ; i < NUM_THREADS; ++i) {
        void* status;
        int t = pthread_join(threads[i], &status);
        if (t != 0) {
            printf("Error in thread join: %d\n", t);
        }
    }
 
    return 0;
}

By locking and unlocking the mutex, we make sure the critical section of our function runs to completion before the next one starts. The ordering is still semi-random as the scheduler will dispatch threads without much control on your part as the program developer.

As mentioned earlier, other solutions exist, such as condition variables or barriers. Hopefully, you are starting to see how you could break apart your program into various workers. Rud Merriam has you covered if you’re interested in a more C++-focused approach. C++ builds more of the threading into the language with std::thread, std::async, and (as of C++20) coroutines that abstract much of the threading away.

OpenMP

OpenMP is an open standard, supported by most C and C++ compilers, that tries to make it easier to do the right thing in a multi-threading environment with minimal headaches. Rather than separating your program into different threads by hand, the goal is to split the workload into different parts. Let’s take this program that approximates pi:

#include <stdio.h>

static long num_steps = 100000;
double step;
int main () { 
  int i;
  double x;
  double pi;
  double sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0; i< num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("pi = %f\n", pi);
  return 0;
}

OpenMP includes a lot of handy pragmas and functions that do the right thing whatever OS you’re targeting. It also manages creating and joining the thread pool. Once you pull in the OpenMP header, our simple pi-calculating program changes (but not that much):

#include <omp.h> 
#include <stdio.h>

static long num_steps = 100000; 
double step;
#define NUM_THREADS 2 
int main () {
  int i;
  int nthreads;
  double pi;
  double sum[NUM_THREADS]; // make this an array to prevent race conditions
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS); 
  #pragma omp parallel 
  { 
    int i;
    int id;
    int nthrds;
    double x;
    id = omp_get_thread_num(); 
    nthrds = omp_get_num_threads(); 
    if (id == 0) nthreads = nthrds; // only one thread should copy the number of threads to the global
    for (i=id, sum[id]=0.0; i< num_steps; i=i+nthrds) { // each thread only calculates every nth part of the sum
       x = (i+0.5)*step;
       sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
  printf("pi = %f\n", pi);
  return 0;
}

We specify the omp parallel pragma, and it creates the thread pool for us, but we still need to keep track of our chunk size and the number of threads. Of course, even with a handy library, there are still all sorts of tricks and optimizations you can make. For instance, you can pad the sum array so that each thread’s chunk of data sits on its own cache line and doesn’t need to sync between core caches:

double sum[NUM_THREADS][8]; // pad of 8 assuming a 64 byte L1 cache line
// ....
sum[id][0] += 4.0/(1.0+x*x);

Alternatively, OpenMP has a more terse syntax that takes care of many of these details for us.

#include <omp.h> 
#include <stdio.h>

static long num_steps = 100000; 
double step;
int main () {
  double pi = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel 
  { 
    double lsum = 0.0;
    double x;
    #pragma omp for 
    for (long i=0; i< num_steps; i++) { // the loop iterations are split between the threads
       x = (i+0.5)*step;
       lsum += 4.0/(1.0+x*x);
    }
    // Mark the next section as critical so only one thread runs at a time
    #pragma omp critical
    { pi += lsum * step; }
  }
  printf("pi = %f\n", pi);
  return 0;
}

We don’t need to specify the number of threads, as the framework will pick a good number for us, such as the number of cores available. OpenMP is incredibly powerful, and there’s much more to it than we can cover here: reductions, scheduling, barriers, and conditions, to name a few. Under the hood, OpenMP is still using OS threads. Other languages provide userspace threads for added speed.

Green Threads/Fibers

That leads us nicely to the next topic. We won’t give any examples here for brevity, but Golang (or Go) is a language that primarily evolved out of a dislike for C++. It has threading built in, in the form of “goroutines”. These goroutines are lightweight, or green, threads scheduled across some number of OS threads, but they are managed by the Go runtime, not the OS. This avoids paying the somewhat expensive cost of context switching from user-level permissions into the kernel.

The OS threads are needed to run on multiple cores. This is called an M:N scheduler, as it schedules M number of OS threads to run N number of Go fibers. It is not without downsides. There’s limited pre-emption, and fairness is not guaranteed. There is a good writeup of the scheduler in Go on Morsing’s blog, and it has helpful colored diagrams. Other languages with fibers include CPython, D, Erlang, Haskell, Julia, Lua, and Tcl. If your preferred language doesn’t implement it natively, there are dozens of libraries for different languages that offer something similar.

Conclusion

Now, with all those cores on your machine, you have an idea of how to make them work for you. There is so much more to learn about here, such as event-loop programming, spinlocks vs mutexes, how processors talk to each other, OS schedulers, syscalls, and much more. Hopefully, you’ll take what’s here, start forking, and run with it.

Banner image: “Spoon & Fork” by Muhammad Taslim Razin. Thumbnail: “Vevey Fork” by Tony Bowden

