Python Ditches The GILs And Comes Ashore

The Python world has been fractured a few times before. The infamous transition from version 2 to version 3 still affects people today, and there could be a new schism in the future. [Sam Gross] has proposed a way to drop the Global Interpreter Lock (GIL), which would have enormous implications for many projects that leverage the CPython internals, such as Pandas and NumPy.

The fact that Python is interpreted is a double-edged sword. It means there can be different runtimes, such as Pyston, Cinder, MicroPython, PyPy, and others, that might support the whole language, a specific version, or a subset. But if you’re using Python, you’re probably running CPython, and CPython has something known as the global interpreter lock that affects threaded code. In a nutshell, only one thread can run in the interpreter at a time. There are some ways around it, such as moving performance-critical sections to C or having multiple interpreters. However, most existing solutions come with considerable downsides.
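To make the effect on threaded code concrete, here is a minimal sketch that runs the same CPU-bound loop sequentially and on several threads. The function names and workload sizes are made up for illustration. On a GIL-enabled CPython the two timings would be expected to come out roughly equal, though the actual numbers depend on the machine and interpreter build.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy loop; returns the number of iterations performed."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def run_sequential(n, jobs):
    """Run the workload `jobs` times, one after another, on the calling thread."""
    return [count_down(n) for _ in range(jobs)]


def run_threaded(n, jobs):
    """Run the same workload on `jobs` threads at once."""
    results = [0] * jobs

    def worker(slot):
        # Each thread writes only to its own slot, so no lock is needed here.
        results[slot] = count_down(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    value = fn(*args)
    return value, time.perf_counter() - start


if __name__ == "__main__":
    n, jobs = 5_000_000, 4
    _, seq = timed(run_sequential, n, jobs)
    _, thr = timed(run_threaded, n, jobs)
    print(f"sequential: {seq:.2f}s  threaded: {thr:.2f}s")
```

Both versions do identical work, so any gap between the two timings reflects how much real parallelism the interpreter allows.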

Why Was the GIL There, and How do You Remove It?

Program state is much easier to reason about when you can guarantee that only one thread will be running at a time. Reference counting, memory allocation, method resolution order caches, and garbage collection are just some of the things that aren’t thread-safe without the GIL. [Sam] discusses this evolution in his overview document.

Getting free of the GIL begins by making significant changes to reference counting. To know whether an object in memory can be freed, the interpreter counts all of the references to that object. Currently, reference counting is non-atomic, and changing all reference counting operations to be atomic carries a massive performance hit.
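Reference counts can be observed from Python itself with `sys.getrefcount`. The sketch below (the helper name is hypothetical) measures only the change in the count, since the absolute value includes temporary references that vary between interpreter versions.

```python
import sys


def references_added(obj, copies):
    """Return how much obj's reference count grows when `copies` new
    references to it are stored in a list."""
    before = sys.getrefcount(obj)
    holder = [obj] * copies      # each list slot holds one strong reference
    after = sys.getrefcount(obj)
    del holder                   # dropping the list releases those references
    return after - before


if __name__ == "__main__":
    payload = object()
    print(references_added(payload, 3))
```

Every one of those increments and decrements is a write to the object's header, which is why making them all atomic is so costly.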

The proposal uses a technique known as biased reference counting, which splits the count into local and shared references. Local references can leverage non-atomic operations, and the owning thread combines the local references and the shared references to keep track of ownership. This approach works great for objects that are single-threaded or only lightly used by a few threads. Several objects, such as interned strings, True, False, and None, exist for the program’s lifetime and can be marked as immortal, reducing their reference counting overhead to zero. An object is marked immortal by leveraging the least-significant bit in the reference count field. Objects that are frequently accessed but not guaranteed to be immortal have deferred reference counting. This means that the only reference counting needed is when the reference is stored on the heap. A side effect of this change is that an object can’t be immediately reclaimed since the stack will need to be scanned for any remaining references.
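The split between local and shared counts can be illustrated with a toy model. This is only a sketch of the idea, not CPython's actual data layout: the class and its fields are invented for illustration, a `threading.Lock` stands in for an atomic instruction, and immortal and deferred objects are not modeled.

```python
import threading


class BiasedRefCount:
    """Toy model of a biased reference count (not CPython's real layout)."""

    def __init__(self):
        self.owner = threading.get_ident()    # thread that created the object
        self.local = 1                        # touched only by the owner: no lock
        self.shared = 0                       # touched by other threads: locked
        self._shared_lock = threading.Lock()  # stands in for an atomic operation

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path, non-atomic
        else:
            with self._shared_lock:           # slow path, "atomic"
                self.shared += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._shared_lock:
                self.shared -= 1

    def total(self):
        """Combine both counters, as the owning thread would."""
        with self._shared_lock:
            return self.local + self.shared


def demo():
    rc = BiasedRefCount()
    rc.incref()                               # owner thread: local count
    rc.incref()

    def other_thread():
        rc.incref()                           # non-owner thread: shared count
        rc.incref()
        rc.decref()

    t = threading.Thread(target=other_thread)
    t.start()
    t.join()
    return rc.local, rc.shared, rc.total()


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the common case, an object used only by the thread that created it, never pays for synchronization.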

[Sam] replaced the standard pymalloc memory allocator with mimalloc, a drop-in replacement for malloc that offers thread safety and performance. The upside of this swap is that this allocator allows the runtime to find GC-tracked objects without an explicit list. This is a significant performance boost, but it means that you can’t just swap out another malloc-compatible allocator and expect the same thread safety for garbage collection and collections.

Speaking of collections and dictionaries, they get tweaked slightly in this surgery. In CPython of today, their design is “thread-safe,” but it leans on the GIL. For example, they have a lock for writes but not for reads, and without the GIL to make a read atomic, a concurrent write can come in the middle of a read.

Perhaps the most surprising change was moving the interpreter from a stack-based virtual machine to a register-based virtual machine, roughly based on V8. This was needed for reference counting changes to be efficient. Functionality-wise, it operates the same, but it causes significant code churn.

What Does This Mean for the Community?

For Python extension library authors, there will be some required work on their end. For example, all C libraries will need to be re-compiled as the ABI has changed. However, the GIL APIs (such as PyEval_ReleaseThread) are still required for marking states as attached or detached, which influences garbage collection behavior.

Initial performance benchmarks show performance matching version 3.10, and running 10% faster than 3.9 in single-threaded workloads because it incorporates some optimizations and fixes that went into 3.10 and 3.11. These are averages of single-threaded benchmarks. For multi-threaded workloads, the lack of a GIL allows it to shine, blowing past the default interpreter with an 18x speedup by running 20 threads. Not too shabby.
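As a back-of-the-envelope check on those figures, an 18x speedup on 20 threads works out to 90% parallel efficiency. If we additionally assume Amdahl's law applies (an assumption made here for illustration, not a claim from the benchmarks), we can solve for the fraction of the workload that must have run in parallel.

```python
def parallel_efficiency(speedup, threads):
    """Speedup achieved per thread used."""
    return speedup / threads


def amdahl_parallel_fraction(speedup, threads):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for p."""
    return (1 - 1 / speedup) / (1 - 1 / threads)


if __name__ == "__main__":
    s, n = 18, 20
    print(f"efficiency: {parallel_efficiency(s, n):.0%}")
    print(f"implied parallel fraction: {amdahl_parallel_fraction(s, n):.4f}")
```

Under that model, less than 1% of the benchmark's run time can have been serialized, which is what you would hope to see once the single global lock is gone.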

As for merging into mainline, the debate is ongoing among core maintainers. Some are calling for the unrelated performance optimizations to be merged in and for the GIL surgery to be left behind. There are concerns that the overhead introduced by the augmented reference counting will slow down many existing Python programs, the vast majority of which are not multi-threaded yet.

However, many prominent players and companies use Python for machine learning and ETL workloads, and would benefit significantly from this change. This proposal could be the chance for some more notable players to offer a fork of CPython that has these performance increases. Maybe it would gain enough of a following to become a serious contender against CPython? Only time will tell.

What Happens Next?

The code is up on GitHub, and there is also a place for community discussion. Significant amounts of testing and validation need to occur before maintainers have confidence that this significant change won’t break things. Extensions need to be re-compiled, and existing multi-threaded code needs validation, checking for masked concurrency bugs. This could take years.
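To give a flavor of what a masked concurrency bug can look like, here is a hypothetical check-then-act example. The GIL never made the unsafe version correct, but serialized execution can make the bad interleaving rare enough to go unnoticed. The class and method names are invented for illustration.

```python
import threading


class Inventory:
    """Hypothetical shared state updated by several threads."""

    def __init__(self, stock):
        self.stock = stock
        self.sold = 0
        self._lock = threading.Lock()

    def sell_unsafe(self):
        # Check-then-act: another thread can run between the test and the
        # update, so stock can be oversold.
        if self.stock > 0:
            self.stock -= 1
            self.sold += 1

    def sell_safe(self):
        # Holding the lock makes the check and the update one indivisible step.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                self.sold += 1


def run_sales(inventory, threads, attempts_per_thread):
    """Have several threads call sell_safe() and return (stock, sold)."""

    def worker():
        for _ in range(attempts_per_thread):
            inventory.sell_safe()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return inventory.stock, inventory.sold


if __name__ == "__main__":
    print(run_sales(Inventory(1000), threads=8, attempts_per_thread=500))
```

Code that already guards shared state with explicit locks, as `sell_safe` does, should not care whether the GIL is present.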

But Python isn’t a static language. Python recently got match statements (its take on switch), and it’s exciting to see Python continue to evolve and change. Hopefully it is all for the better.

37 thoughts on “Python Ditches The GILs And Comes Ashore”

    1. 99% seems rather high. Any script that does multiple things that don’t need to be synchronous would benefit. Nearly everything I’ve used it for would benefit from GIL’less threads.

      In fact I know that many uses _can’t_ use threads because the GIL makes it useless to do so.

      1. Not really. 99% of people who program are low-skill “it just needs to do it” types. There is no shame in that because they aren’t really interested in programming and they aren’t using it to do anything of significance.

      1. I’d argue it’s because writing thread safe code is a PITA. Python has been sold as an easy to learn and use language that forces you to write readable code. It’s found its way into the hearts and minds of novices and pros, and offers something akin to the old BASIC days of yore. Sure, it encourages library hell and supply chain vulnerabilities, but it works with a very low barrier to entry. Pulling in libraries that require a multi thread interpreter is simply going to turn people off and likely push them even further to JS, Go, Ruby, etc. I’m not a fan of Python by any means, and most people on here that dig into low level coding are perfectly fine compiling binaries for the advance stuff that should be in a static type language in the first place. ETL performance? Switch to Java or compile from source. Need GUI separation from backend operations, again there are plenty of cross-platform languages and libraries that do that 10x better and keep the simple thread separation issues from compiling. With all that being said, maybe another interpreter fork is needed that caters to the needs outlined in the post. Multiple threads could do wonders for ML and GPU and similar “AI” coprocessors/accelerators, but if I were a betting man, I’d bet adding multiple threads and their boilerplate management code would immediately turn noobs away as they try to grok race conditions.

        1. The Python Interpreter has had multi-threading for a long time, this is not new. What *is* new is that they have made it truly parallel by removing a global semaphore that was used to prevent more than 1 thread from executing in the Interpreter at once. It’s like the difference between a system that has a single CPU vs a system with multiple CPUs.

          1. The worst code I’ve ever read was the first I’d ever seen. I taught myself Python after tinkering in it. It was written by someone who was more used to writing C# (I think) as a minimally functional test fixture program that was abandoned by Engineering as soon as the test department finished a first article. (Obviously, it’s production-ready if test can make the product.) In Python 2.7.9. Using a specific combination of libraries that could not be reproduced outside the two Windows XP machines originally configured for the task. I’m convinced that the original programmer read P.E.P. 8 meticulously with the intention of ignoring it as completely as possible. It seemed like all the worst aspects of functional and OO programming were combined into an incomprehensible mess.

            After some serious reading, I came to loathe that project every time I had to use it. It’s Python, so it’s readable, but the C accent was so thick in places you’d be driven to distraction by variable declarations and ignorance of things like list comprehensions. It was also structured in two parts that, nevertheless, needed to incorporate parts of each other, so path kludges were thrown in on top. At one point, a class was written to hold coordinate data, but none of the methods transformed the data in place. The code that needed the transform would call the method and pass it its (the object’s) own data!

            The “user interface” was knowing which script to open in IDLE and hit F5. One spat out a data file that was then drag-and-dropped to a script that made a plot. Another, subtly different script made a data file and the text from that was manually copied from Notepad into Excel to calculate nonuniformity. Another had “switches” in the source code that you edited to set up the fixture one way or another.

            I managed to build a slightly better interface on top of this mess. Calling functions within the shell is marginally better, and I was able to directly control the instruments without writing a whole script for each action. I desperately wanted to rewrite the whole thing, but I was supposed to be testing stuff and following code execution to extract functionality into sane chunks was too time-consuming after getting five files deep.

      2. More that it doesn’t warrant multi-threading, and that it actually would make simple things needlessly complex.
        You’d be surprised by how many simple & low-resource demanding python scripts are acting as backbones of larger things, basically unsung heroes taken for granted.

      1. I really don’t like Python. The amount of code that can run for days (machine learning...) and then crash on a missing variable that was originally global but that someone else (read “me”) removed, or on an unexpected type, makes me angry every time.

        But the threads in CPython are a historical artefact; they run the same way that personal computers ran when it was invented. It’s 20 years since the Pentium 4 with hyperthreading and 2 threads executing simultaneously on a PC, and Python is like 30 years old. I assume that it was a long forgotten “Oh ***” situation: “this fun interpreter project now runs on machines with multiple threads, how do I make it not break?”

    1. I suppose it’s nice when more options open up, but these options aren’t always necessarily good ones. Case in point – it is pretty much always better for the user (of the library) when the library is written in pretty much anything other than Python. This is because Python is still slow and one of the best ways to speed up Python code is to write the dependencies in compiled languages.

  1. Summary (before ramblings): Don’t forget multiprocessing and IPC as alternatives to threading…

    It’s spooky how often I’ll have a conversation at work, and then come home and see a relevant article on HAD. A few months ago a co-worker (new to Python) was perplexed that his recent effort to thread a message processing script wasn’t any faster than the monolithic version. I had the pleasure of tipping him off about the GIL and the process-based multiprocessing module. Ahhh, that lovely deer-in-the-headlights look most of us probably gave when trying to understand “Wait… so threading doesn’t actually use threads?”

    Yesterday we ran across each other again and he described his adventures in the land of multiprocessing (using multicast for his IPC mechanism) to get the performance he wanted.

    For anyone just coming across this problem, you can use the multiprocessing module to create new processes (actually forking the Python interpreter, so the work gets distributed across multiple cores) instead of the threading module. It works more like the classic UNIX process model. It works great, but since there is no common memory space, you have to put a lot more effort into working out how you communicate safely and efficiently between processes. This is both bad (it takes more work) and good (if you come up with a mechanism that suits your problem, the resulting code can be really resilient and you are 3/4 of the way to a fully distributed design that can be spread over a network just as easily as on a single system).

    (Hint: multiprocessing has its own Queue class that makes it really easy to pass messages, just keep in mind there is a lot of stuff going on under the hood to pass that data (pipes, pickling, etc). Pipes, shared memory, or network-based methods might be more efficient if you are moving a lot of data and speed is essential.)

    I guess I’m lucky that most of my problems tend to fit the process model very well.

    Then again, I’m old enough that my introduction to IPC was a Chapter in Stevens… not a Volume. Threading wasn’t a very portable option back then.

  2. “`
    But Python isn’t a static language. Python recently got switch statements, and it’s exciting to see Python continue to evolve and change. Hopefully it is all for the better.
    “`

    I liked python-0.x … -1.y but the nice small scripting language turned into hive of features and 2 Himalayas of libraries.

    Laugh at me, if you want, but meanwhile I prefer AWK over Python. I still keep an eye on it because of MicroPython but secretly am looking for alternatives for playing with those IoToys too.

  3. This title is so terribly misleading. Nobody has made any decisions on this yet. A previous attempt to get rid of the gil stranded, because it is so hard. Sam is very clear that this would be a major effort for the whole community.

  4. I don’t know where the Python hate comes from because we find it a ‘very’ useful language in our company. Very easy to use, easy to read, easy to maintain and usually there is a module out there that is going to do what you need to get done which is good for productivity. From the electrical engineers to the programmers it is a very usable language. It has eliminated most all the VB, some Perl, batch file usage in our department. Cut our maintenance way down. It has been a good ride so far. BTW, I still program a lot in assembly, ‘C’, and ‘Object Pascal’. Some C++ too. But if you want to automate the boring stuff, Python is the language to use. Of course, our BT (Business Technologies) department is stuck on VB and some C#. Hopefully we can change that….

    That said, personally I’ve only used Python cooperative threads once that I can remember for the work I now do. I used preemptive threads ‘a lot’ when I was doing real-time work in ‘C’ back when. Anyway, threading is a very nice feature to have ‘when’ it is appropriate, but otherwise just stick to the normal single thread flow. I’ve used CircuitPython and MicroPython on small projects and found no reason to knock the language there either. As an applied programmer by trade and schooling, I find Python the best interpreted language I have run into in my career so far.

  5. IMO, this seems like a good example of effort wasted fixing the wrong problem. Sure, it may speed up a *few* programs, but why are programs that need to be fast written in an interpreted language anyway? If you need performance, use a language that compiles to machine code! There are plenty of options…

    I refuse to use any interpreted language for anything but simple scripting. Python is, in my opinion, an overgrown, overused, overcomplicated mess. I can count on one hand the number of applications I have used (and I use a lot of different applications) that are written in python and work reliably. Way too many python programs have showstopper bugs, many of the kind that are IMPOSSIBLE to have (at runtime) with C.

    1. I agree. I think the cooperative multi-tasking was good enough for Python. But that is just me :) .

      Not sure why you think it is over complicated or has to be. Most of the advanced ‘features’ we never use like decorators. Keep the objects ‘simple’. If you use the KISS principle it works very well. I have lots of programs working with JSON files, excel files, csv, databases, etc. Then sending/receiving data via DDE, sftp, https, or copying files around, doing cleanup work, zipping files and folder automatically, building reports, graphing data … well, there are just lots of uses for Python that don’t have to be done really really fast — just fast enough. And when things change, I don’t have to go compile it on a development system and move it to where it needs to go. I just make the change on the fly and next time it runs the change is in place. Love it. Things that I would find difficult to do in ‘C’, Python makes it a breeze to do. I’ve used ‘C’ since ’85 and really like the language, don’t get me wrong, but it doesn’t fit at all for the tasks that I do now for moving data around and maintaining the system. Basically, use the language that fits the job at hand.

  6. How about give people the option to choose whether to use GIL or the new multi-threading infrastructure by introducing a command line parameter, such as –multithreading? Or is it too ugly to have CPython support both?

  7. This would be an absolute boon for me. For various reasons, my code needs lots of threads and I am currently living with the overhead of multiprocessing. It can’t come soon enough!

  8. If it turns out to be true that this would slow down existing single-threaded code, why not include both, and make CPython two interpreters in one? Use the current interpreter by default, and switch to the new interpreter when the code is multithreaded. It would be twice as much to maintain, and CPython would be bigger, but that may be outweighed by performance, if they’re really after performance. (Of course, there may be a way to do this only partway, sharing parts between the old and new interpreters, but I don’t have the expertise to suggest how that could be done. The change from stack-based to register-based makes that sound unlikely to be feasible, though.)

  9. I’m a software developer for a living, and i use python over 90% of the time, but almost all i do is bound by storage or some other software like the database or our CRM-system. I use threads extensively as a way of making everything asynchronous, and it works wonders! The few scripts that do “heavy-lifting” like compressing images, moving a lot of data, encryption/hashing, etc, usually runs the heavy parts natively through libraries anyway, which doesn’t affect the GIL. I can think of 1 project in my 4 years here, that would maybe benefit from removing the GIL, but that’s a once-temporary-now-permanent-solution that should’ve been made in a lower level language anyway.
    If real parallelization is needed, multiprocessing offers that, and you should consider if Python’s right for your purpose. Don’t get me wrong, if the GIL can be removed it would make it better, but it’s a lot of effort for very little reward and a lot of potential problems.

    1. “once-temporary-now-permanent-solution”

      That may as well be the motto of the place I work –> TechDebt, Inc.

      “Look at this almost functional proof-of-concept!”
      “Ship it!”

      The struggle is real and spectacular.

  10. I wonder what the EVE Online devs think of this. Years ago, when I still played, I found out about the GIL in a dev blog detailing how hard it is to process the actions of thousands of players on a single server.

    1. Typically, “application binary interface”. It refers to the agreed upon conventions by which the hardware machine’s resources are used. For example, “the first argument to the function about to be called is put into register 1, the 2nd in register 2, and the rest of the arguments put on the stack. The return value is placed into register 1.”

      The conventions are fairly arbitrary, but generally designed to use the hardware efficiently. Programs that call into libraries need to have matching ABI conventions in order for the code to run correctly.

      Different compilers can use different ABIs, and their resulting machine code won’t be compatible with each other even though the code is compatible with the same hardware. A program compiled by Visual Studio to run on Windows will likely have different conventions from gcc targeting Linux. Your operating system can generally run different ABIs in separate processes simultaneously, though, since processes are conceptually lightweight virtual computers. The kernel itself will have an ABI for doing system calls. Compatibility layers like Wine are cleverly able to translate system calls for Windows into Linux system calls, allowing Windows programs to be run in Linux.

  11. I really hope Python gets rid of the GIL. Sure you can use multiprocessing, but for complex data and objects, you lose a lot of efficiency by having to serialize and deserialize those across the process boundary (of course Python makes this pretty simple, it is just relatively slow). For a lot of things, the thread paradigm makes a lot more sense, if threads actually functioned “correctly”.
