We’re quite used to multitasking computer systems today. Our desktops run email, a couple of browsers in different workspaces, a word processor, and a few other applications, apparently all at once. Looking behind the scenes using a system monitor or task manager program reveals a multitude of other programs running in support of our activities. Of course, any given CPU core is running a maximum of one program at a time. Multitasking is simply the practice of switching between active processes fast enough to give the illusion of simultaneity.
The roots of multitasking go way back. In the early days, when computers cost tons of money, the thought of an idle system was anathema. Teletype IO was slow compared to the processor, and leaving the processor waiting idle for a card reader to slurp in the next card was outrageous. The gurus of the time worked to fill that idle time with productive work. That eventually led to systems that would run multiple programs at one time, and eventually to more finely grained multitasking within a program.
Modern multitasking depends on support from the underlying API of an operating system. Each OS uses its own techniques, making it difficult to write portable code. The C++ 2011 standard increased the portability of the language by adding concurrency routines to the standard library. These routines use the API of the OS. For instance, the Linux version uses the POSIX threading library, pthread. The result is a minimal, but useful, capability for building upon in later standards. The C++ 2017 standard development activities include work on parallelism and concurrency.
In this article, I’ll work through some of the facilities for and pitfalls in writing threaded code in C++.
Creating Threads
To implement multitasking within a single program, the code is broken up into multiple tasks, or threads, that run at the same time. Declaring and running a thread is simple. The hard parts come later while managing them and handling interactions among threads.
A thread is created from anything that is callable. That means a function or a class with, typically, an operator() method. Here is an example with three different callable objects created as threads and some management techniques.
#include <chrono>
#include <thread>
using namespace std;
// plain function taking no arguments
void fa() {
}
// function taking one argument
void fb(int a) {
}
// class whose instances are callable through operator()
class Fc {
public:
	Fc(int a) :
			mA { a } {
	}
	void operator()() {
	}
private:
	int mA;
};
int main() {
	int value { 0 };
	thread t_fa { fa };
	thread t_fb { fb, value };
	thread t_fc { Fc { value } };
	t_fc.detach();
	/* ... */
	// code waits for one second
	std::this_thread::sleep_for(std::chrono::seconds(1));
	// code continues in the ‘main’ thread
	t_fa.join();
	t_fb.join();
	return 0;
}
There is nothing special about these functions or the class. They could be used in a program as functions or as class instances. They become independent threads when passed to the constructor of class thread along with any arguments that are required of the function or the operator() method. The thread class itself is a regular, non-template class, but its constructor is a variadic template that accepts a variable number of argument types. This capability allows the constructor to forward the arguments to the function or class method.
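Lambdas are callable too, so they can be handed to the thread constructor in the same way as fa or Fc. A minimal sketch; run_lambda_thread and its variables are names invented for this illustration:

```cpp
#include <thread>

// Hypothetical helper for illustration: runs a lambda on its own thread and
// returns the value the lambda computed.
int run_lambda_thread(int a, int b) {
	int sum { 0 };
	// The lambda captures sum by reference and a, b by value.
	std::thread t { [&sum, a, b] { sum = a + b; } };
	t.join();	// once join() returns, the write to sum is visible here
	return sum;
}
```

Calling run_lambda_thread(2, 3) from main() should give back 5. Reading sum is only safe here because it happens after join().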
Now that the thread is running, how do you stop it? Unfortunately there is no method provided by thread. But it is easy to create a way. One technique I’ve used is a simple global boolean flag passed as a reference:
#include <atomic>
static std::atomic_bool run { true };
void fr(int a, std::atomic_bool& run) {
	while (run) {
		/* ... */
	}
}
int value { 0 };
thread t_fr { fr, value, std::ref(run) };
The thread checks for run becoming false and exits. We’ll see later why the std::atomic_bool is used.
Arguments can be passed by value or pointer just as in calling a regular function. Passing a reference requires the use of std::ref, as demonstrated with run. 
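Put together as a complete function, the stop flag technique might look like the sketch below. The names count_until_stopped, run_then_stop, keep_running, and iterations are made up for this example, and the flag is a local variable rather than a global to keep the sketch self-contained.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Worker in the style of fr(): loops until the flag is cleared and counts
// its passes in a counter owned by the caller.
void count_until_stopped(std::atomic_bool& keep_running, std::atomic<long>& iterations) {
	while (keep_running) {
		++iterations;
		std::this_thread::yield();
	}
}

// Hypothetical driver: starts the worker, waits for its first pass, stops it.
long run_then_stop() {
	std::atomic_bool keep_running { true };
	std::atomic<long> iterations { 0 };
	// std::ref is required for both arguments: thread copies its arguments,
	// and atomics cannot be copied.
	std::thread worker { count_until_stopped, std::ref(keep_running), std::ref(iterations) };
	while (iterations == 0) {
		std::this_thread::yield();	// wait until the worker has run at least once
	}
	keep_running = false;	// the worker sees this and leaves its loop
	worker.join();
	return iterations;
}
```

run_then_stop() returns the number of passes the worker made, which is at least one. Leaving out std::ref here is a compile error rather than a silent copy.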
Working With Threads
After the threads are created the remainder of main() is executed. When main() exits the situation gets interesting. What happens to the running threads? What happens to the resources they may have allocated, like an open file or serial port? Also consider that you can start a thread in any function, not just main(), so what if fa() created another thread and then exited? What happens to that new thread?
The standard provides two ways of handling this situation: thread::join() and thread::detach(). When a function calls join() it is held there until the thread completes. In the example this is done just before main() exits.
A detached thread runs independently of the rest of the program. Generally, a call to detach() should be made soon after the thread is created. When the creating function exits, the thread continues running. The exception is main(): when it returns the program ends, and any detached thread still running is simply cut off without a chance to clean up. If detached threads are not stopped first by some technique, like the run flag shown above, this leaves program resources in an indeterminate state.
A thread that has been neither joined nor detached is a joinable thread. This can be tested by thread::joinable(). An attempt to join a detached thread throws an exception, and destroying a thread object that is still joinable calls std::terminate(). If the state of the thread is uncertain, it should be checked by calling joinable().
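A small helper shows the check in practice. This is only a sketch, and join_if_possible is a name made up for it:

```cpp
#include <thread>

// Hypothetical helper: join only if the thread can still be joined.
// Returns true if a join was performed.
bool join_if_possible(std::thread& t) {
	if (t.joinable()) {
		t.join();
		return true;
	}
	return false;	// already joined, detached, or never started
}
```

For a freshly started thread the first call joins and returns true. A second call, or a call on a detached thread, returns false instead of throwing.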
What happens if a thread throws an uncaught exception? The standard specifies that std::terminate() is called, which by default calls std::abort() to end the program. You can avoid this by catching the exception inside the thread, or control how the program ends by installing a std::terminate_handler with std::set_terminate(). The details for this are available in a C++ reference site or book.
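One way to catch the exception inside the thread and still let the creating thread learn about it is to store it in a std::exception_ptr and rethrow it after join(). The article does not cover this facility; the sketch below assumes it suits your design, and risky_task and run_and_report are hypothetical names.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical task that always fails.
void risky_task() {
	throw std::runtime_error("task failed");
}

// Runs risky_task() on its own thread. The exception is caught inside the
// thread, stored, and rethrown in the calling thread after join().
std::string run_and_report() {
	std::exception_ptr error;	// stays empty unless the task throws
	std::thread t { [&error] {
		try {
			risky_task();
		} catch (...) {
			error = std::current_exception();	// keep it for the caller
		}
	} };
	t.join();
	try {
		if (error) {
			std::rethrow_exception(error);
		}
	} catch (const std::exception& e) {
		return e.what();
	}
	return "ok";
}
```

Because nothing escapes the thread function, std::terminate() is never involved, and run_and_report() returns the message "task failed".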
The need to join with a thread, if appropriate, requires diligence akin to management of resources, e.g. files or memory. It is solved by the same approach which underlies one reason for the existence of classes: Resource Acquisition Is Initialization (RAII). A class constructor performs initialization, which includes resource acquisition, while a destructor releases resources. A simple class (see Notes at end) to handle threads is:
struct thread_guard: thread {
	using thread::thread;
	~thread_guard() {
		if (joinable()) {
			join();
		}
	}
};
The class thread_guard is a derived class of thread. The using thread::thread declaration tells the compiler to inherit thread’s constructors, eliminating the need to write constructors for thread_guard, since we only want to provide a destructor. The destructor tests if the thread is joinable, i.e. that it hasn’t already been joined or detached, and does a join, if allowed. This provides two capabilities. The function creating a thread no longer needs to explicitly join the thread, although it safely can, before the function exits. If something interrupts the creating function, like an exception, the destructor will do the join.
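The exception case can be tried out with a short sketch. The thread_guard class is repeated from above with std:: spelled out; slow_worker, start_then_throw, guard_joined_on_exception, and worker_finished are hypothetical names added for the demonstration.

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// thread_guard as shown in the article, with std:: spelled out.
struct thread_guard : std::thread {
	using std::thread::thread;
	~thread_guard() {
		if (joinable()) {
			join();
		}
	}
};

std::atomic_bool worker_finished { false };	// set by the worker when done

void slow_worker() {
	std::this_thread::sleep_for(std::chrono::milliseconds(20));
	worker_finished = true;
}

// Hypothetical function that fails right after starting a thread. The
// guard's destructor joins the worker while the exception propagates.
void start_then_throw() {
	thread_guard guard { slow_worker };
	throw std::runtime_error("something went wrong");
}

// Returns true if the worker had finished by the time the exception was caught.
bool guard_joined_on_exception() {
	try {
		start_then_throw();
	} catch (const std::runtime_error&) {
		return worker_finished;
	}
	return false;
}
```

With a plain std::thread in place of thread_guard, the same exception would destroy a joinable thread object and end the program through std::terminate().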
In the example code the namespace this_thread is demonstrated with a call to this_thread::sleep_for(). This namespace provides three routines for controlling the timing of threads. You can, as illustrated, sleep for a period of time, sleep until a specified time, or simply yield. The fourth routine in the namespace, get_id(), gets the thread’s identification. The use of this_thread in main() points out that main() is also a thread. The chrono header is well worth studying since it provides convenient tools for working with time values in the form of clocks, specific points in time, and durations.
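All four routines fit in a short sketch. The helper names timed_sleeps and ids_differ are invented here, and the 10 millisecond delays are arbitrary.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: sleeps twice on the calling thread and returns the
// elapsed time in milliseconds, measured with steady_clock.
long long timed_sleeps() {
	using namespace std::chrono;
	const auto start = steady_clock::now();
	std::this_thread::sleep_for(milliseconds(10));	// relative: a duration
	std::this_thread::sleep_until(steady_clock::now() + milliseconds(10));	// absolute: a time point
	std::this_thread::yield();	// offer the processor to other threads
	return duration_cast<milliseconds>(steady_clock::now() - start).count();
}

// Returns true if a new thread reports a different id than its creator.
bool ids_differ() {
	const std::thread::id creator = std::this_thread::get_id();
	std::thread::id child;
	std::thread t { [&child] { child = std::this_thread::get_id(); } };
	t.join();
	return child != creator;
}
```

The sleeps are lower bounds, not exact delays: timed_sleeps() reports at least 20 ms, and usually a little more depending on the scheduler.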
Racing Threads
Let’s go back to atomic_bool, which is defined in the atomic header along with atomic versions of many standard types. Atomic variables are needed to prevent race conditions on a variable. It can take hundreds of processor cycles to read or write a variable, even something as small as a boolean or character. During that time an interrupt can occur or a new task can be swapped in for the current task. If this newly executing code reads or writes the same variable, the state of the variable is corrupted. For example, a 32 bit integer contains four bytes. The first task reads two bytes and is interrupted, the new routine writes all four bytes, and when the first task is restarted it reads the last two bytes. The first two bytes it read are now stale, so the value it assembles was never actually stored. An atomic operation prevents this from happening.
Other race conditions occur when attempting to access resources. If the tasks in the example were sending output to cerr such a race would occur. One task could start outputting text and be interrupted by a task swap, and the new task could also start writing to cerr, causing their outputs to intermix.
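A counter shared by two threads makes the point concrete. In this sketch, count_with_atomic is a made-up name and the loop count is whatever the caller chooses.

```cpp
#include <atomic>
#include <thread>

// Two threads each add to the same counter. Because the counter is atomic,
// every increment is indivisible and none is lost.
int count_with_atomic(int per_thread) {
	std::atomic<int> counter { 0 };
	auto work = [&counter, per_thread] {
		for (int i = 0; i < per_thread; ++i) {
			++counter;
		}
	};
	std::thread a { work };
	std::thread b { work };
	a.join();
	b.join();
	return counter;
}
```

count_with_atomic(100000) returns exactly 200000 every time. With a plain int in place of std::atomic<int> the code has a data race, which is undefined behavior in C++, and in practice the total often comes up short.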
The long established technique for handling this is with mutual exclusion, shortened to mutex.
#include <mutex>
std::mutex cerr_mutex;
/* ... */
cerr_mutex.lock();
cerr << "Hello Hackday! " << '\n';
cerr_mutex.unlock();
A mutex is locked when a task wants to access a resource and is unlocked when the task is done. This is a quick and inexpensive operation. If another task requests the mutex, the task is held until the mutex is released.
The use of multiple mutexes can lead to a deadlock situation. Task A requests mutex X and Y, in that order. Task B requests Y and then X. Each can gain its first mutex, but neither can obtain the second. The standard provides the function lock(X, Y, …), which locks all the mutexes in its argument list together without risk of deadlock, whatever order the tasks name them in.
The mutex header provides more classes and functions for handling race conditions, so it deserves careful study when using threads.
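Here is a sketch of the X and Y situation using std::lock. The Account type and the transfer scenario are invented for illustration; std::lock_guard with std::adopt_lock is used only to make sure both mutexes are released again.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared data: two balances, each protected by its own mutex.
struct Account {
	std::mutex m;
	int balance { 0 };
};

// Moves amount between accounts. std::lock acquires both mutexes without
// deadlock regardless of argument order; the lock_guards adopt the already
// locked mutexes so they are released when the function returns.
void transfer(Account& from, Account& to, int amount) {
	std::lock(from.m, to.m);
	std::lock_guard<std::mutex> hold_from { from.m, std::adopt_lock };
	std::lock_guard<std::mutex> hold_to { to.m, std::adopt_lock };
	from.balance -= amount;
	to.balance += amount;
}

// Two threads transfer in opposite directions, the pattern that can deadlock
// when each thread locks its two mutexes one after the other.
int total_after_transfers() {
	Account x;
	Account y;
	x.balance = 100;
	y.balance = 100;
	std::thread a { [&] { for (int i = 0; i < 1000; ++i) transfer(x, y, 1); } };
	std::thread b { [&] { for (int i = 0; i < 1000; ++i) transfer(y, x, 1); } };
	a.join();
	b.join();
	return x.balance + y.balance;
}
```

total_after_transfers() completes and returns 200. Replacing std::lock with two separate lock() calls in transfer() brings back the possibility of the two threads blocking each other forever.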
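One of those classes is std::lock_guard, which applies the RAII idea to a mutex so that unlock() cannot be forgotten. In this sketch a std::ostringstream stands in for cerr so the result can be inspected; log_line, log_from_two_threads, log_mutex, and log_stream are names invented for the example.

```cpp
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

std::mutex log_mutex;		// protects log_stream
std::ostringstream log_stream;	// stands in for cerr in this sketch

// Writes one complete line. The lock_guard locks in its constructor and
// unlocks in its destructor, even if the output operation throws.
void log_line(const std::string& text) {
	std::lock_guard<std::mutex> hold { log_mutex };
	log_stream << text << '\n';
}

// Two threads write concurrently; each line arrives whole.
std::string log_from_two_threads() {
	std::thread a { [] { for (int i = 0; i < 3; ++i) log_line("aaaa"); } };
	std::thread b { [] { for (int i = 0; i < 3; ++i) log_line("bbbb"); } };
	a.join();
	b.join();
	return log_stream.str();
}
```

The order of the six lines varies from run to run, but no line is ever split by text from the other thread.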
Peeking Behind the Curtains
It helps when doing multitasking to understand a little bit about what is happening behind the curtains. On Windows or Linux we are working with preemptive multitasking, in contrast to cooperative multitasking.
In the former, the system is driven by a timer interrupt to switch, using a scheduling algorithm, among the tasks running on the system. A cooperative multitasking system relies on tasks to voluntarily relinquish control so other tasks may be scheduled.
The simplest preemptive scheduler is time-slicing where each task is allowed to run for a specific amount of time. If a task yields for IO, to wait on a mutex, or voluntarily, the next task is allowed to proceed. More sophisticated algorithms, even with cooperative multitasking, perform priority scheduling where a high priority task gets more time. In a real-time system, tasks marked as real-time might get as much time as they need. With priority multitasking developers must ensure that all tasks receive sufficient time. One reason they might not is when high priority tasks consume all the processing, starving lower priority tasks. A problem, dubbed priority inversion, can occur when a low priority task grabs a mutex, preventing a higher priority task from running.
Multitasking Costs
Multitasking consumes time and memory resources. When a task swap occurs, the processor registers used by a task are pushed onto its stack. The next task’s registers are popped from that task’s stack to start it running. This obviously takes time and memory since each task must be allocated a stack. Setting the stack size is almost an art. Enough space must be allocated for the worst case number of function calls and local, stack based, variables the task might need. In addition, when an interrupt occurs to handle external events, say a serial port receives a character, the interrupt requires stack space.
The Arduino ecosystem generally does not support these forms of multitasking because the processors do not have the memory for the stacks required by multiple tasks. There are scheduler techniques usable on Arduinos that are cooperative but do not save the state of the task, pushing that burden onto the task itself. The C++ concurrency libraries are not usable since there is no underlying system to provide a multitasking API. The Arduino community has developed a large number of scheduler libraries to use on those systems.
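The idea behind such cooperative schedulers can be sketched in ordinary desktop C++. This is not code from any Arduino library, and std::function may not be available on the smallest boards; Task, make_counter, and run_all are names made up for the illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A task is a function that does one small step and returns. It returns
// false when it has no more work. Any state must live in the task itself,
// since there is no per-task stack to preserve it between steps.
using Task = std::function<bool()>;

// Hypothetical task factory: appends label to trace, steps times in total.
Task make_counter(std::string& trace, char label, int steps) {
	int remaining = steps;	// copied into the lambda: the task's private state
	return [&trace, label, remaining]() mutable {
		if (remaining <= 0) {
			return false;
		}
		trace += label;
		--remaining;
		return remaining > 0;
	};
}

// Round-robin loop: give each live task one step per pass until all finish.
void run_all(std::vector<Task> tasks) {
	while (!tasks.empty()) {
		for (std::size_t i = 0; i < tasks.size();) {
			if (tasks[i]()) {
				++i;	// still has work: move on to the next task
			} else {
				// finished: drop it, the next task slides into slot i
				tasks.erase(tasks.begin() + static_cast<std::ptrdiff_t>(i));
			}
		}
	}
}
```

Running tasks A, B, and C with 3, 1, and 2 steps produces the trace "ABCACA": each task gets one turn per pass and drops out when done. A task that never returns would hang the whole system, which is the price of cooperative scheduling.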
Wrap Up
Multitasking is a useful technique for keeping the processor busy. It must be remembered that it isn’t a panacea and, as always, should be tested to make sure the overhead of multitasking isn’t costing more than it provides.
Another consideration is the impact on the organization of the code. Dividing a program into its logical parts as separate tasks can make creating, testing, and debugging the code easier. Individual developers can work more easily and independently on separate portions of the code. Even a single individual, like myself, finds it easier to concentrate on the code for a specific task while ignoring other processes.
I’ve only touched on the complex requirements for creating a multitasking system. There are many other C++ capabilities available for working with tasks, including safely coordinating their activities and transferring data among them. Once you start hacking with larger code bases, breaking the code into multiple tasks may prove beneficial for your system and sanity.
Notes:
The thread_guard class is from Bjarne Stroustrup, “The C++ Programming Language”, 4th Edition. Read this book for all you ever want to know about C++.
One (virtually lost) paradigm for multitasking programming is that of occam-2 as used on the Inmos Transputer in the mid-80s to early 90s. Here, the concepts of parallel (including multicore) execution, sequential execution, synchronous channel communications and channel selection were built into the language constructs.
This means in the same way that we use { and } to denote a (sequential) block of code, occam uses
PAR
–indented list of statements
Or
SEQ
–indented list of statements
To denote parallel or sequential execution.
Frankly, it was amazing: you could take a program that ran in pseudo-parallel on a single Transputer core and re-run it at 10x the speed on a network of 10 Transputer cores!
Importantly, occam-2 programs are largely event-driven, which would make the language eminently suitable for low-power IoT applications in today’s world. Check it out:
https://en.m.wikipedia.org/wiki/Occam_(programming_language)
XMOS do use similar constructs with their C++ extensions for their multicore embedded processors https://www.xmos.com/support/tools/programming?version=latest&component=18344&page=2
+1 for xmos. Also note that one of the founders of xmos used to work for inmos, and the design is heavily influenced by the Transputer. The way xmos explains their architecture is a little unclear, but essentially each “tile” has 8 hardware threads (dedicated registers and such). So if you are running only one thread, it uses the full 500mhz available, if you run two threads, each gets 250mhz allocated, etc. The advantage is that you don’t waste cycles reallocating resources to the registers when switching threads, so its very low latency. Simplified explanation but that’s the general idea. I’ve been reading up on xmos but haven’t bought a Dev board yet, does anyone with experience with xmos microcontrollers have an opinion on them?
“C++ Concurrency in Action” by Anthony Williams is pretty much The Book on C++ multithreading.
Why is it sample code is never fully commented? I always seem to run across this with a few languages, C++ being one of them and javascript being my other favorite example. I wrote out and commented the above code but figured it was too ranty. My favorite example of this is trying to explain how an engine works and keeping it and all of the visual material in a separate room entirely from the lecture on how it works – so if you need to refer back you need to leave where you are and go to the other room. So long story short why write sample code and then disconnect the explanation in a fragmented manner requiring frequent back references to the code rather than also comment each line? In my admittedly short 14 years since graduating and originally noticing this I’ve not found a good answer.
That’s a great question. We don’t comment as much as we should mostly b/c it would make the code sections appallingly long. If you can keep the code snippet short, and then talk about it just afterwards, it’s a lot like having the comments inline, no? It’s only when the code gets longer that this gets tricky. But then you can’t break it up too much either…
That said, some of my favorite code to read has been narrated in the comments. For instance, Jones Forth: (http://git.annexia.org/?p=jonesforth.git;a=blob_plain;f=jonesforth.S;hb=HEAD) which is a full-fledged book/tutorial on writing a programming language, and the assembly code to do so.
How that would work in a HaD article is another question. I’m game for the experiment if you promise to read it! :)
BTW: I think Rud did pretty well breaking things up here. I don’t see how he could have broken them any further.
The Amiga allowed you to attach notes to files, giving more clarity than the 7.3 format. Perhaps a separate file synced and linked to the source that could be displayed with the code but not part of the code could address this. Commenting is such an important thing, and how often do you see code with little or no commenting?
I agree with that in every case but for education. Writing for educating shouldn’t be looked at like “Well this is how we do it in production” which is another thing developers tend to do. Switching from working mode to training mode is pretty difficult and not everyone realizes that they’re really two different ways of working.
http://kotaku.com/5975610/the-exceptional-beauty-of-doom-3s-source-code
Scroll down to the section on “Minimal Comments”, too many comments are bad.
I don’t agree with everything in that article but it is a good starting point for discussions on coding standards. I suggest that an organization adopt a standard and set up a pretty printer to format code to the standard. Either the developer or, preferably, a scripted check-in process runs the pretty printer. All that code in the version control system is formatted to standard. My Eclipse environment is setup to automatically format the code when I do a save.
One specific criticism is “Why is he looking at the STL code”? Seriously, that’s why it is in a library so you just use it. It’s in the library to hide the uglies. Does he go peeking at the code in the C libraries? Probably some pretty ugly stuff there. Microsoft’s STL is pretty obscure intentionally at least the version they got from Dinkumware. Plauger sells it so doesn’t want it comprehensible.
People interested in embedded multitasking should also check out FreeRTOS.
Or maybe one of the multitude of other free, embedded RTOSes that doesn’t have such a butt-ugly API.
I haven’t done C and I didn’t read all of this article. I usually do read all of [Rud Merriam]’s articles as I find them quite informative.
I have however written multitasking routines. All you need is an interrupt and fast or dedicated register access so they are fairly hardware dependent but at a bare metal code level they are surprisingly simple.
The core of it is just a look up table that works like programmable timers but you may have 20 or a 100 of them (256 comes to mind). Then some simple code to interface requests to change the table : add task, remove task. You just have a stack for each task (which is why it’s hard to multitask with a PIC – only one possible stack).
Thanks for finding all my other articles informative. LOL
Did a multitasker for a ‘286 system in the late ’80s in C using longjmp / setjmp calls to save the registers. As I mentioned in the article – that you didn’t read – but you reiterated, it’s the stack requirements that keep these off the smaller Arduinos. The C++ libraries for the Due have the header but it is all disabled with an #ifdef probably because there is no underlying OS to provide a multitasking API.
There are tons of other issues with multithreading, which you can’t address with simple thread libraries. You are better off with a decent real-time OS, which does not start with free in its name, and is a real real-time OS, not just a bunch of lame functions put together to act like a scheduler.
The C++ threading libraries add a portable interface on top of the underlying threading implementation, which can be anything. Eg. RTEMS has C++11 threading support.
I did one for Arduino and Energia with setjmp and longjmp. It uses a template parameter to specify stack depth.
https://github.com/jscrane/Tasks
Nice design using the template parameter to pass the stack size and recognizing the need for an implementation class to avoid code duplication.
Small nitpick: std::thread is not, as the article states, a variadic template class. It is a regular non-template class with a variadic constructor.
Good catch.
A CPU core may be able to only execute one instruction at a time, but as an embedded developer I write code for microprocessors not CPUs. My beloved STM32F4 ARM chip has the ability to execute multiple tasks at the same time. I try to lean heavily on the integrated hardware features (DMA, SPI, UART, etc) so my code can do other things while the MPU is doing background work.
Obviously this doesn’t work for custom logic, but often the background task is streaming or some other IO.
My C++ is a bit rusty, can anyone point me to an explanation of the braces (not the ones around the code blocks, the other ones…) ?
e.g. as used around the “a” on line 11:
public:
Fc(int a) :
mA { a } {
}
and lines 22-24:
thread t_fa { fa };
thread t_fb { fb, value };
thread t_fc { Fc { value } };
C++ 11 added “uniform initialization” using braces. See https://en.wikipedia.org/wiki/C%2B%2B11#Uniform_initialization for a start.