The Python world has been fractured a few times before. The infamous transition from version 2 to version 3 still affects people today, and there could be a new schism in the future. [Sam Gross] has proposed removing the Global Interpreter Lock (GIL), which would have enormous implications for the many projects that leverage the CPython internals, such as Pandas and NumPy.
The fact that Python is interpreted is a double-edged sword. It means there can be different runtimes, such as Pyston, Cinder, MicroPython, PyPy, and others, that might support the whole language, a specific version, or a subset. But if you’re using Python, you’re probably running CPython. And it has something known as the global interpreter lock, which affects threaded code. In a nutshell, only one thread can run in the interpreter at a time. There are some ways around it, such as moving performance-critical sections to C or having multiple interpreters. However, most existing solutions come with considerable downsides.
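The effect is easy to demonstrate with a quick, unscientific benchmark (timings are illustrative and machine-dependent): a CPU-bound function run on two threads under the GIL takes roughly as long as running it twice back-to-back.

```python
import threading
import time

def spin(n):
    # Pure-Python busy work; under the GIL, only one thread
    # can execute bytecode at any given instant.
    while n:
        n -= 1

N = 5_000_000

# Run twice serially.
start = time.perf_counter()
spin(N)
spin(N)
serial = time.perf_counter() - start

# Run twice "in parallel" on two threads.
start = time.perf_counter()
threads = [threading.Thread(target=spin, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# With the GIL, the threaded run is no faster than the serial one,
# and is often a bit slower due to lock contention.
print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```

On a GIL-free build, the threaded version would be free to use both cores and finish in roughly half the time.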
Why Was the GIL There, and How Do You Remove It?
Program state is much easier to reason about when you can guarantee that only one thread will be running at a time. Reference counting, memory allocation, method resolution order caches, and garbage collection are just some of the things that aren’t thread-safe without the GIL. [Sam] discusses this evolution in his overview document.
Getting free of the GIL begins with significant changes to reference counting. To know whether it can free an object in memory, the garbage collector counts all of the references to that object. Currently, reference counting is non-atomic, and changing all reference counting operations to be atomic incurs a massive performance hit.
The proposal uses a technique known as biased reference counting to maintain local and shared reference counts. Local references can leverage non-atomic operations, and the owning thread combines the local count with the shared count to keep track of ownership. This approach works great for objects that are single-threaded or only lightly used by a few threads. Some objects, such as interned strings, True, False, and None, exist for the program’s lifetime and can be marked as immortal, reducing their reference counting overhead to zero. An object is marked immortal by leveraging the least-significant bit in the reference count field. Objects that are frequently accessed but not guaranteed to be immortal get deferred reference counting, meaning the only reference counting needed is when a reference is stored on the heap. A side effect of this change is that an object can’t be immediately reclaimed, since the stack will need to be scanned for any remaining references.
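As a rough pure-Python sketch of the idea (a toy illustration, nothing like CPython's actual C implementation), a biased count gives the owning thread a plain unsynchronized counter and routes every other thread through a synchronized shared one, with a lock standing in for the atomic operations the real design uses:

```python
import threading

class BiasedRef:
    """Toy sketch of biased reference counting (not CPython's code)."""

    def __init__(self):
        self.owner = threading.get_ident()
        self.local = 1                  # owner-only; no synchronization
        self.shared = 0                 # everyone else
        self._lock = threading.Lock()   # stands in for atomic operations

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1             # fast path: plain increment
        else:
            with self._lock:
                self.shared += 1        # slow path: synchronized

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._lock:
                self.shared -= 1

    def total(self):
        # The owner combines both counts to decide liveness.
        with self._lock:
            return self.local + self.shared

r = BiasedRef()
r.incref()                              # owner takes the fast path
t = threading.Thread(target=r.incref)   # another thread takes the slow path
t.start()
t.join()
print(r.total())  # 3
```

The payoff is that the common case, an object touched only by its creating thread, never pays for synchronization at all.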
[Sam] also replaced the standard pymalloc memory allocator with mimalloc, a drop-in replacement for malloc that offers thread safety and performance. The upside of this swap is that the new allocator allows the runtime to find GC-tracked objects without an explicit list. This is a significant performance boost, but it means that you can’t just swap in another malloc-compatible allocator and expect the same thread safety for garbage collection and collections.
Speaking of collections and dictionaries, they get tweaked slightly in this surgery. In today’s CPython, their design is “thread-safe,” but it leans on the GIL. For example, they have a lock for writes but not for reads, and without the GIL to make a read atomic, a concurrent write can land in the middle of a read.
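A toy stand-in for that discipline in pure Python (again, an illustration, not the real C-level implementation): serialize writes behind a per-object lock while leaving reads lock-free.

```python
import threading

class LockedDict:
    """Illustrative wrapper: locked writes, lock-free reads."""

    def __init__(self):
        self._d = {}
        self._lock = threading.Lock()

    def __setitem__(self, key, value):
        with self._lock:          # writers serialize with each other
            self._d[key] = value

    def __getitem__(self, key):
        return self._d[key]       # readers never take the lock

    def __len__(self):
        return len(self._d)

d = LockedDict()

def writer(i):
    # Each thread writes 1000 distinct keys.
    for j in range(1000):
        d[(i, j)] = j

threads = [threading.Thread(target=writer, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(d))  # 4000
```

The hard part the real design solves, and this sketch dodges, is making the lock-free read safe while a writer is resizing the underlying table.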
Perhaps the most surprising change was moving the interpreter from a stack-based virtual machine to a register-based virtual machine, roughly based on V8. This was needed for the reference counting changes to be efficient. Functionality-wise, it operates the same, but the change causes significant code churn.
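You can see the current stack machine at work with the dis module: LOAD instructions push operands, and the binary-op instruction pops two and pushes the result. A register VM would instead name its operands directly in each instruction. (Exact opcode names vary between CPython versions.)

```python
import dis

def add(a, b):
    return a + b

# Disassemble to CPython's stack-based bytecode. On recent versions
# this shows LOAD_FAST-style pushes followed by a BINARY_* operation
# that consumes the top two stack entries.
dis.dis(add)
```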
What Does This Mean for the Community?
For Python extension library authors, there will be some work required on their end. For example, all C extensions will need to be recompiled, as the ABI has changed. However, the GIL APIs (such as PyEval_ReleaseThread) are still required for marking thread states as attached or detached, which influences garbage collection behavior.
Initial benchmarks show performance matching version 3.10 and running 10% faster than 3.9 in single-threaded workloads, because the fork incorporates some optimizations and fixes that went into 3.10 and 3.11. These are averages of single-threaded benchmarks. For multi-threaded workloads, the lack of a GIL lets it shine, blowing past the default interpreter with an 18x speedup when running 20 threads. Not too shabby.
As for merging into mainline, the debate is ongoing among core maintainers. Some are calling for the unrelated performance optimizations to be merged in while leaving the GIL surgery behind. There are concerns that the overhead introduced by the augmented reference counting will slow down many existing Python programs, the vast majority of which are not multi-threaded.
However, many prominent players and companies use Python for machine learning and ETL workloads and would benefit significantly from this change. This proposal could also be a chance for one of the more notable players to offer a fork of CPython with these performance increases. Maybe it would gain enough of a following to become a serious contender against CPython? Only time will tell.
What Happens Next?
The code is up on GitHub, along with a place for community discussion. Significant amounts of testing and validation need to occur before maintainers have confidence that such a sweeping change won’t break things. Extensions need to be recompiled, and existing multi-threaded code needs validation to check for masked concurrency bugs. This could take years.
But Python isn’t a static language. It recently gained structural pattern matching via the match statement, and it’s exciting to see Python continue to evolve and change. Hopefully it is all for the better.