The GIL also has its virtues. When calling a C extension, it is
serialized by default unless the GIL is released.
An attempt was made (many years ago) to replace the GIL with
fine-grained locking. It slowed down serial code. That should come as no
surprise, as OS mutexes and semaphores are expensive.
There are lock-free data structures of any conceivable sort. There are
multithreaded garbage collectors in Java and .NET, that don't need a
global lock. I've heard claims that removal of the GIL would require
Python to use a garbage collector instead of reference counts. That is
not true. Lock-free data structures and garbage collectors have all one
thing in common: they use a compare-and-exchange (CAS) instruction
present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is
used to implement OS mutexes (sleeplocks) and the less expensive
spinlocks. Without a CAS instruction for the platform, there would be no
GIL. All platforms that has a working GIL has a working CAS.
Python could use the CAS instruction for lock-free management of
reference counts. All built-in types (int, float, str, list, tuple,
dict, set, etc) could be made lock-free using CAS. That would allow the
interpreter to execute bytecode without holding the GIL. Only calls to C
extensions would be serialized with the GIL.
Schematically, this is how refcounts could be managed without needing a
lock:
/*
* Returns 1 if an exhange was made, and (*addr == old) before
* the exchange, otherwise returns 0.
*
* uses opcode CMPXCHG on x86
*
*/
extern int Py_refcnt_CAS(Py_ssize_t *addr, Py_ssize_t old, Py_ssize_t
new);
inline void Py_DECREF(PyObject *obj)
{
register Py_ssize_t refcnt;
do {
refcnt = obj->refcnt;
} while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt-1));
refcnt = obj->refcnt;
if (refcnt == 0) {
if(Py_refcnt_CAS(&obj->ob_refcnt, 0, -1))
{
/* deallocate the object */
}
}
}
inline void Py_INCREF(PyObject *obj)
{
register Py_ssize_t refcnt;
refcnt = obj->refcnt;
if (refcnt >= 0) {
do {
if((refcnt = obj->refcnt)<0) break;
} while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt+1));
}
}
Regards,
Sturla Molden
_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas
I eagerly await your patch!
CPython uses *extensive* refcounting operations on objects shared
between threads (builtins, globals, literals, etc). The cache
contention over those refcounts means even 2 threads are significantly
slower than 1 thread, nevermind the increased contention of 3+
threads. That is why it is said a true tracing GC is needed to
replace the refcounting GC.
--
Adam Olsen, aka Rhamphoryncus
I just wanted to bring this up:
- The GIL has consequences on multicore CPUs that are overlooked: thread
switches are usually missed at check intervals. This could be fixed
without removing the GIL: For example, there could be a wait-queue for
the GIL; a thread that request the GIL puts itself in the back.
- Also, fine grained locking is not the only alternative to a global
lock. No locks at all is even better.
Sturla Molden
> Benjamin Peterson skrev:
>> I eagerly await your patch!
> I'd gladly contribute to the effort, but I did not say it is an easy
> task :-)
>
> I just wanted to bring this up:
[..]
>
> - Also, fine grained locking is not the only alternative to a global
> lock. No locks at all is even better.
I doubt anyone here would disagree with that, but to be taken
seriously, you'll need to be *much* more specific. Enlighten us.
-Casey
> The cooperative
> multithreading intended by using checkintervals only works properly on
> single-processor systems. Python threads only work properly on
> single-processor computers.
I'm trying to work on that. I have an experimental rewrite of the GIL for POSIX
systems which improves latencies, while decreasing the overhead of the GIL
itself (because if checked every 100 opcodes, it can be dropped and re-taken
tens of thousands of times per second...). If people have interesting workloads
for py3k, the Mercurial branch can be found here:
http://hg.pitrou.net/public/py3k/newgil/shortlog
(my measurements, by the way, are based on a small concurrency benchmark for
Python I recently wrote:
http://svn.python.org/view/sandbox/trunk/ccbench/ )
> There are lock-free data structures of any conceivable sort. There are
> multithreaded garbage collectors in Java and .NET, that don't need a
> global lock. I've heard claims that removal of the GIL would require
> Python to use a garbage collector instead of reference counts. That is
> not true. Lock-free data structures and garbage collectors have all one
> thing in common: they use a compare-and-exchange (CAS) instruction
> present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is
> used to implement OS mutexes (sleeplocks) and the less expensive
> spinlocks. Without a CAS instruction for the platform, there would be no
> GIL. All platforms that has a working GIL has a working CAS.
That does not mean that CAS (or atomic increment/decrement, for that matter) is
as fast as a regular (non-atomic) increment / decrement.
Besides, reference counting is not the only thing to worry about.
You have to:
- protect (somehow) the GC when collecting, so that things don't get modified
under its feet
- protect INCREFs and DECREFs
- protect all mutable types (dicts being probably the most performance-critical,
since they are used for namespaces and class/object attributes and methods)
- protect all static/global data in C modules which have them
- and probably other things I'm not thinking about
An interesting experiment would be to add all those protections without even
attempting to remove the GIL, and measure the overhead compared to a regular
Python. IMO, if the overhead is less than 10%, then it would probably be
acceptable to go ahead and try removing the GIL.
But someone needs to tackle this concretely for it to happen, because I guess
these ideas have already been suggested a couple of times...
Regards
Antoine.
How specific should I be?
S.M.
> Casey Duncan skrev:
>>
>> I doubt anyone here would disagree with that, but to be taken
>> seriously, you'll need to be *much* more specific. Enlighten us.
> I just showed you thread-safe incref and decref that don't require a
> lock.
They looked like spinlocks to me, I was assuming you were talking
about something else even more magical. I was still considering them
locks.
-Casey
I suspect it's going to take working code- after all, if it isn't worth your
time to implement, I don't think very many others are going to leap
up and do it, however much benefit your perceive it having.
Geremy Condra
> - protect INCREFs and DECREFs
>
Update using CAS (I just showed you C code for this).
> - protect all mutable types (dicts being probably the most performance-critical,
> since they are used for namespaces and class/object attributes and methods)
>
CAS. Lock-free lists and hash tables exist. We don't need locks to
protect mutable types and class objects. Just use a lock-free hash-table
for __dict__ or whatever.
> - protect all static/global data in C modules which have them
>
Reacquire the GIL before calling functions in C extensions functions.
The GIL is nice there, but not anywhere else.
> An interesting experiment would be to add all those protections without even
> attempting to remove the GIL, and measure the overhead compared to a regular
> Python.
Yes, one could add a nonsence CAS to existing incref/decref, and measure
overhead.
That would be the first thing to do IMO.
If it doesn't seem to decrease performance a lot, you could continue by adding
locking to dict objects, and measure again.
Then add lists and sets (although I doubt sets and even lists are
performance-critical for general interpreter operation).
If after that you have less than a 10% slowdown (personal opinion only, other
people's mileage may vary), it's probably worth going on and trying to do the
whole GIL removal (which doesn't mean it would necessarily be accepted, by the
way, but at least would avoid it being automatically rejected ;-)).
Regards
Antoine.
The failed attempt to remove the GIL used OS mutexes, which are much
more expensive sleeplocks.
This'd show you the minimum amount of overhead, but only for a trivial
case: single threaded. You're not invoking cache contention, which in
my experience is much larger than the atomic op cost, and prevents the
*real* gains you hope to make.
Python-safethread makes the datastructures safe (lockless fast-paths)
so that it can be compared properly. I tried atomic refcounting and
it sucked. My buffered bodge sucked less, but still sucked. The only
option left is a real tracing GC.
--
Adam Olsen, aka Rhamphoryncus
No, single-threaded is the reason for keeping the GIL *as an option*.
The complete lack of any other viable option (in CPython) is the
reason it's the only option.
--
Adam Olsen, aka Rhamphoryncus
> -----Original Message-----
> From: python-ideas-bounces+kristjan=ccpgam...@python.org
> [mailto:python-ideas-bounces+kristjan=ccpgam...@python.org] On
> Behalf Of Sturla Molden
> Sent: 20. október 2009 22:13
> To: python...@python.org
> Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
>
>
> - The GIL has consequences on multicore CPUs that are overlooked:
> thread
> switches are usually missed at check intervals. This could be fixed
> without removing the GIL: For example, there could be a wait-queue for
> the GIL; a thread that request the GIL puts itself in the back.
>
This depends entirely on the platform and primitives used to implement the GIL.
I'm interested in windows. There, I found this article:
http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html
So, you may be on to something. Perhaps a simple C test is in order then?
I did that. I found, on my dual-core vista machine, that running "release", that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however.
"fairness" was retained in debug builds too, strangely enough.
Now, Python uses none of these. On windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly.
The test application is attached.
I think that you ought to sustantiate your claims better, maybe with a specific platform and using some test like the above.
On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on windows. This test ought to show the danger of that. The GIL is different than a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way, if we want very fast mutexes for the rest of the system (cc to python-dev)
Cheers,
Kristján
I'm not entirely sure what you're trying to say there, but I'd like link
to what Guido said over two years ago:
http://www.artima.com/weblogs/viewpost.jsp?thread=214235
[quote]
... I'd welcome a set of patches into Py3k only if the performance for a
single-threaded program (and for a multi-threaded but I/O-bound
program) does not decrease.
I would also be happy if someone volunteered to maintain a GIL-free fork
of Python...
However, I want to warn that there are many downsides to removing the
GIL. ...
I want to point out one more time that the language doesn't require the
GIL -- it's only the CPython virtual machine that has historically been
unable to shed it.
[end quote]
--
Steven D'Aprano
In the rare case where none of those threads are releasing the GIL for
any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart,
but not a particularly difficult one to deal with.
That tangent aside, the major problem with removing the GIL has never
been the concept of doing so, but the practicality of implementing it in
a cross-platform way that doesn't hurt single threaded performance.
Cheers,
Nick.
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------
> -----Original Message-----
> From: python-ideas-bounces+kristjan=ccpgam...@python.org
> [mailto:python-ideas-bounces+kristjan=ccpgam...@python.org] On
> Behalf Of Nick Coghlan
> Sent: 21. október 2009 12:31
> To: Sturla Molden
> Cc: python...@python.org
> Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
>
> > - The GIL has consequences on multicore CPUs that are overlooked:
> thread
> > switches are usually missed at check intervals.
>
> In the rare case where none of those threads are releasing the GIL for
> any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart,
> but not a particularly difficult one to deal with.
>
I think you may be missing the point.
If the GIL is implemented in a naive way on a platform, then releasing the gil and reclaiming it (such as is done during a checkinterval) can cause the same thread to get it again. Thus, the idea that this checkinterval should allow another thread waiting for the gil to run, would not work.
Apparently, "fairness" of some locking primitives is being given up for efficiency reasons in some operatins systems, like Vista.
I tested this by implementing a GIL like mechanism on a multicore maching using windows CriticalSections and Mutexes, both of which would starve the other threads. Semaphores work, though and the python locks on windows (using an atomic counter and an Event object) also work.
So, alghough Sturla's claim isn't valid for windows, there might be systems where the syscheckinterval mechanism for thread yielding doesn't work due to the locks in question not being "fair" on multicore platforms.
K
My critisism of the GIL on python-ideas was partly motivated by this:
However, David Beazley is not talking about Windows. Since the GIL is
apparently not a mutex on Windows, it could behave differently. So I
wrote a small script that contructs a GIL battle, and record how often a
check-interval results in a thread-switch or not. For monitoring check
intervals, I used a small C extension to read _Py_Ticker from ceval.c.
It is not declared static so I could easily hack into it.
With two threads and a check interval og 100, only 61 of 100000 check
intervals failed to produce a thread-switch in the interpreter. I'd call
that rather fair. :-)
And in case someone asks, the nthreads=1 case is just for verification.
S.M.
D:\>test.py
check interval = 1
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=57809, missed=42191
nthreads=3, swiched=91535, missed=8465
nthreads=4, swiched=99751, missed=249
nthreads=5, swiched=95839, missed=4161
nthreads=6, swiched=100000, missed=0
D:\>test.py
check interval = 10
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99858, missed=142
nthreads=3, swiched=99992, missed=8
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0
D:\>test.py
check interval = 100
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99939, missed=61
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0
D:\>test.py
check interval = 1000
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99999, missed=1
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0
Anyway, if anyone wants to run a GIL battle, here is the code I used.
If it turns out the GIL is far worse with pthreads, as it is implemented
with a mutex, it might be a good idea to reimplement it with an event
object as it is on Windows.
Sturla Molden
----------------
In python:
from giltest import *
from time import clock
import threading
import sys
def thread(rank, battle, start):
while not start.isSet():
if rank == 0:
start.set()
try:
while 1:
battle.record(rank)
except:
pass
if __name__ == '__main__':
sys.setcheckinterval(1000)
print "check interval = %d" % sys.getcheckinterval()
for nthreads in range(1,7):
start = threading.Event()
battle = GIL_Battle(100000)
threads = [threading.Thread(target=thread, args=(i,battle,start))
for i in range(1,nthreads)]
for t in threads:
t.setDaemon(True)
t.start()
thread(0, battle, start)
for t in threads: t.join()
s,m = battle.report()
print "nthreads=%d, swiched=%d, missed=%d" % (nthreads, s, m)
In Cython or Pyrex:
from exceptions import Exception
cdef extern from *:
ctypedef int vint "volatile int"
vint _Py_Ticker
class StopBattle(Exception):
pass
cdef class GIL_Battle:
""" tests the fairness of the GIL """
cdef vint prev_tick, prev_rank, switched, missed
cdef int trials
def __cinit__(GIL_Battle self, int trials=100000):
self.prev_tick = _Py_Ticker
self.prev_rank = -1
self.missed = 0
self.switched = 0
self.trials = trials
def record(GIL_Battle self, int rank):
if self.trials == self.switched + self.missed:
raise StopBattle
if self.prev_rank == -1:
self.prev_tick = _Py_Ticker
self.prev_rank = rank
else:
if _Py_Ticker > self.prev_tick:
if self.prev_rank == rank:
self.missed += 1
else:
self.switched += 1
self.prev_tick = _Py_Ticker
self.prev_rank = rank
else:
self.prev_tick = _Py_Ticker
def report(GIL_Battle self):
return int(self.switched), int(self.missed)
No, I understand the fairness problem - releasing the GIL for a whole
millisecond is a workaround for exactly that issue. (Python must have
suffered from this back on Windows NT 4, as that is where I first
discovered this trick to fix a program that wasn't switching threads
properly - before I added the time.sleep call to explicitly release the
GIL and give other threads a chance to run before attempting to
reacquire it, the program was running each thread to completion before
starting the next one).
It's a sledgehammer approach to the problem though and incurs a speed
penalty even on platforms where the GIL thread scheduling is fairer. So
it's better if we can find a way to ensure proper scheduling fairness
across all platforms.
Cheers,
Nick.
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------
Which makes me agree with the commonly expressed opinion that CPython would
probably need to ditch refcounting (at least in the critical paths) if we want
to remove the GIL.
Regards
Antoine.
(*) using gcc's atomic primitives which, I have checked, are inlined as
carefully optimized assembler:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Builtins
--
--Guido van Rossum (python.org/~guido)
It's worth repeating this kind of experiment; the hardware landscape
has changed a lot in 10 years. It's interesting that the results are
the same a decade later.
Collin