The Linux Scheduler And How It Handles More Cores

Sometimes you read an article headline and you find yourself re-reading it a few times before diving into the article. This was definitely the case for a recent blog post by [The HFT Guy], where the claim was made that the Linux kernel has for fifteen years now been hardlocked into not scheduling for more than 8 cores. Obviously this caused a lot of double-checking and context discovery on both Hacker News and the Level 1 Techs forum. So what is going on exactly? Did the Linux developers make an egregious error more than a decade ago that has crippled Linux performance to this day?

Where the blog author takes offence is in the claim made in the Linux kernel code and documentation that the base time slice scales with the number of CPUs (or cores), pointing out the commit in which the number of CPUs taken into account was limited to a maximum of 8. So far so good, even if at this point quite a few readers had already jumped to showing that their Linux system could definitely load more than 8 cores to 100%.

As pointed out by [sirn] on the Level 1 Techs forum, this limit was intentional, as discussed on the Linux kernel mailing list (LKML) in November and December of 2009. Essentially – as a few commenters in the Hacker News thread also noted – with fewer cores the scheduler should switch tasks more frequently (shorter time slices) to preserve the illusion of concurrency. With more cores that illusion matters less, and beyond roughly 8 CPUs the returns diminish to the point where the overhead of the task switching itself becomes the more important cost.

That means that this ‘hardcoded limit’ was put in there on purpose back in 2009, based on solid empirical evidence using many-core workstations and servers. It also shows that writing good schedulers is hard, which is why the LKML is famous for its Scheduler Wars and why you can pick alternative schedulers if you compile your own kernel. The current Completely Fair Scheduler (CFS) is also likely going to be replaced in the Linux kernel with the EEVDF scheduler as the default.

18 thoughts on “The Linux Scheduler And How It Handles More Cores”

  1. After 2005 the majority of computers in the TOP500 list have been running Linux (before that it was mostly Unix), and since 2018 Linux has completely taken over: all of them run some Linux flavor.

    So this never seems to have been an issue for those Linux machines.

    The applications I run on my home PC are unfortunately still mostly single threaded. Thunar (in Mint 20) has a bug that sometimes consumes 100% of a single thread and I have to kill it. FreeCAD 0.21 takes 3 minutes to load one of my drawings, and during that time it barely gets above a single thread (though there are some peaks). I have a quite modest Ryzen 5600G (6 cores, 12 threads), and in the almost two years I’ve had it, compiling GCC stuff with -j12 passed to make is still the only way I have made good use of it. And when I do that, it’s finished before I can get a good look at the statistics :) so I don’t even know if it’s limited to a mere 8 threads (cores). Disk (SSD) I/O may be another bottleneck, but I have never investigated it thoroughly enough to be sure.

    But overall, CPUs have been stuck around 3 to 5 GHz for some 20 years, and multi-threaded applications are the only way forward. It’s high time application programmers started taking multi-core utilization more seriously.

    But apart from the problems with applications (And the people who write them), it’s nice to see that the Linux kernel(s) are getting improvements too. :)

    1. Top500 is a special case, most of the systems listed are not one homogeneous system with a single instance of the kernel scheduler, but many distinct systems operating as a cluster with a workload scheduler on top of the scheduling on each compute member. Some of those clusters even have different configurations of member compute nodes. To top it off, the kernel scheduler doesn’t really come into play much when you are talking about accelerated compute with GPU or other coprocessors doing the actual compute and the Linux host systems just facilitating support tasks.

      1. That’s a bad analogy. The computer’s “engine” is its hardware, and is already decided. The generalist supercomputer would be more of a machine made to go fast in varying circumstances, and would have a lot of money put into getting as much of the engine’s performance into the tires as possible, while keeping from overheating or running out of fuel. If you take one of those cars and make it road legal, with a much smaller engine, you may still find that it performs better than a regular suv with the same engine and a lot of extra weight and drag. Now, if you end up realizing you need support for hauling something, or you find the ride uncomfortable in the racecar and wish to sacrifice performance to do it, that’s one thing.
        But when we remember that computers are different from cars and use the metaphor yet another way, Linux is not so much of a racecar in that scenario as a kit car you can spec out with whatever capabilities you can find, minimalist or luxury, and the other two options are effectively a tesla that might be sleek but you can only do what it lets you do, or a fleet vehicle with a bunch of junk rattling around that you still sometimes can’t do anything about but everyone’s familiar with it.

    2. FreeCAD’s use of cores or threads is down to the program itself. Only some portions (OpenGL, if in use, a few specific tasks in the OCCT kernel, and Coin) are coded to use multiple threads. For the most part it is single threaded. The Linux scheduler has little to do with it, as is the case with many programs.

      1. Thing is, while an application may not make use of more cores, the OS does. Just doing a ‘ps aux’ shows all the moving parts that are handled by other cores/threads while your single-threaded application(s) is/are running. Win-win.

    3. I’d like to know what the “top 500” computers means. Top performing? Top sellers? Top popularity in Reddit threads? In any case, a limit that was decided fourteen years ago may need to be re-evaluated.
      I ran into Linux hard-coded limits about a decade ago, when the kernel module for Video4Linux wouldn’t allow more than three simultaneous video devices. This made it a no-go for me. Fortunately, that limit has been re-evaluated and upgraded.

    1. Thank you, I never knew that had a name. Like many people I had to develop that concept for myself through repeated mistakes. It took a long time to learn to go from “That makes no sense, it must be a mistake!” to “That makes no sense, what am I missing?” Reminds me of a certain colorful quote from the senior officer in the movie Colors concerning cows.

      1. And that is why you should always put comments in your code any time you do something out of the ordinary. You will remember now, but you won’t remember in six months. How many times have I ‘fixed’ something only to break it?

    2. I worked at a small repair shop where this was kind of part of the intro for new techs. We used the example of a zip tie threaded through the holes in a device’s wall plug (think North American plug) and closed. In our shop that meant ‘unsafe to power on’, but it really should give pause in any context. If you see something weird, stop and think, and always be looking out for something weird.
      Before I get dogpiled we definitely did have conventional and I’d say respectable safety training for our field, but this was the culture we tried to bring to new folks.

  2. The eight-CPU limit in the HFT Guy post brought back memories of the old VAX architecture, when VMS 5.0 started supporting multiple processors under Symmetric Multi-Processing (SMP). It was interesting to note the total reported system performance as CPUs were added. A base CPU was considered to have one VAX unit of speed for its CPU model. A two-CPU system was reported to have a value of 1.8. Three CPUs (as in the VAX 6330 my employer had) was 2.7. Four CPUs, 3.6, and finally 5.4 on the top-of-the-line six-CPU systems. Every added CPU increased the overhead a bit, to the point that a theoretical eight-processor system would rate only 7.2 if one extrapolated the 0.1 loss per CPU: you basically had one CPU dedicated to managing all the others.
