[Python-ideas] Remove GIL with CAS instructions?

Sturla Molden

unread,

Oct 20, 2009, 5:24:57 PM10/20/09

to python...@python.org

The GIL is particularly evil on multiprocessor systems. Not just because
it prevents parallel excution of Python code, but thread switches are
defunct. This is because a thread that periodically releases and
re-acquires the GIL (at checkintervals), more or less always wins. On a
multiprocessor system, the thread it will continue to run its processor
and grab the GIL back before a sleeping thread wakes up. The cooperative
multithreading intended by using checkintervals only works properly on
single-processor systems. Python threads only work properly on
single-processor computers. On computers with multiple CPUs, one must
set the affinity of the Python process is set to one CPU only for proper
cooperative thread scheduling. This behaviour of the GIL is a PITA, but
often overlooked - all debates seems to be about parallel scalability,
though far less important. Today, most people have multicore desktop
computers (e.g. mine have four cores). The Linux kernel got rid of the
BKL a long time ago. It's time Python does the same.

The GIL also has its virtues. When calling a C extension, it is
serialized by default unless the GIL is released.

An attempt was made (many years ago) to replace the GIL with
fine-grained locking. It slowed down serial code. That should come as no
surprise, as OS mutexes and semaphores are expensive.

There are lock-free data structures of any conceivable sort. There are
multithreaded garbage collectors in Java and .NET, that don't need a
global lock. I've heard claims that removal of the GIL would require
Python to use a garbage collector instead of reference counts. That is
not true. Lock-free data structures and garbage collectors have all one
thing in common: they use a compare-and-exchange (CAS) instruction
present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is
used to implement OS mutexes (sleeplocks) and the less expensive
spinlocks. Without a CAS instruction for the platform, there would be no
GIL. All platforms that has a working GIL has a working CAS.

Python could use the CAS instruction for lock-free management of
reference counts. All built-in types (int, float, str, list, tuple,
dict, set, etc) could be made lock-free using CAS. That would allow the
interpreter to execute bytecode without holding the GIL. Only calls to C
extensions would be serialized with the GIL.

Schematically, this is how refcounts could be managed without needing a
lock:

/*
* Returns 1 if an exhange was made, and (*addr == old) before
* the exchange, otherwise returns 0.
*
* uses opcode CMPXCHG on x86
*
*/
extern int Py_refcnt_CAS(Py_ssize_t *addr, Py_ssize_t old, Py_ssize_t
new);

inline void Py_DECREF(PyObject *obj)
{
register Py_ssize_t refcnt;
do {
refcnt = obj->refcnt;
} while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt-1));
refcnt = obj->refcnt;
if (refcnt == 0) {
if(Py_refcnt_CAS(&obj->ob_refcnt, 0, -1))
{
/* deallocate the object */
}
}
}

inline void Py_INCREF(PyObject *obj)
{
register Py_ssize_t refcnt;
refcnt = obj->refcnt;
if (refcnt >= 0) {
do {
if((refcnt = obj->refcnt)<0) break;
} while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt+1));
}
}

Regards,
Sturla Molden

_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

Benjamin Peterson

unread,

Oct 20, 2009, 5:45:39 PM10/20/09

to python...@python.org

Sturla Molden <sturla@...> writes:
> Python could use the CAS instruction for lock-free management of
> reference counts. All built-in types (int, float, str, list, tuple,
> dict, set, etc) could be made lock-free using CAS.

I eagerly await your patch!

Adam Olsen

unread,

Oct 20, 2009, 6:08:01 PM10/20/09

to Sturla Molden, python...@python.org

On Tue, Oct 20, 2009 at 15:24, Sturla Molden <stu...@molden.no> wrote:
> There are lock-free data structures of any conceivable sort. There are
> multithreaded garbage collectors in Java and .NET, that don't need a global
> lock. I've heard claims that removal of the GIL would require Python to use
> a garbage collector instead of reference counts. That is not true. Lock-free
> data structures and garbage collectors have all one thing in common: they
> use a compare-and-exchange (CAS) instruction present in most modern
> processors, e.g. CMPXCHG on x86. In fact, CAS is used to implement OS
> mutexes (sleeplocks) and the less expensive spinlocks. Without a CAS
> instruction for the platform, there would be no GIL. All platforms that has
> a working GIL has a working CAS.

CPython uses *extensive* refcounting operations on objects shared
between threads (builtins, globals, literals, etc). The cache
contention over those refcounts means even 2 threads are significantly
slower than 1 thread, nevermind the increased contention of 3+
threads. That is why it is said a true tracing GC is needed to
replace the refcounting GC.

--
Adam Olsen, aka Rhamphoryncus

Sturla Molden

unread,

Oct 20, 2009, 6:13:04 PM10/20/09

to python...@python.org

Benjamin Peterson skrev:

> I eagerly await your patch!
>

I'd gladly contribute to the effort, but I did not say it is an easy
task :-)

I just wanted to bring this up:

- The GIL has consequences on multicore CPUs that are overlooked: thread
switches are usually missed at check intervals. This could be fixed
without removing the GIL: For example, there could be a wait-queue for
the GIL; a thread that request the GIL puts itself in the back.

- Also, fine grained locking is not the only alternative to a global
lock. No locks at all is even better.

Sturla Molden

Casey Duncan

unread,

Oct 20, 2009, 6:57:15 PM10/20/09

to Sturla Molden, python...@python.org

On Oct 20, 2009, at 4:13 PM, Sturla Molden wrote:

> Benjamin Peterson skrev:
>> I eagerly await your patch!
> I'd gladly contribute to the effort, but I did not say it is an easy
> task :-)
>
> I just wanted to bring this up:

[..]

>
> - Also, fine grained locking is not the only alternative to a global
> lock. No locks at all is even better.

I doubt anyone here would disagree with that, but to be taken
seriously, you'll need to be *much* more specific. Enlighten us.

-Casey

Antoine Pitrou

unread,

Oct 20, 2009, 6:55:04 PM10/20/09

to python...@python.org

Hello,

> The cooperative
> multithreading intended by using checkintervals only works properly on
> single-processor systems. Python threads only work properly on
> single-processor computers.

I'm trying to work on that. I have an experimental rewrite of the GIL for POSIX
systems which improves latencies, while decreasing the overhead of the GIL
itself (because if checked every 100 opcodes, it can be dropped and re-taken
tens of thousands of times per second...). If people have interesting workloads
for py3k, the Mercurial branch can be found here:
http://hg.pitrou.net/public/py3k/newgil/shortlog

(my measurements, by the way, are based on a small concurrency benchmark for
Python I recently wrote:
http://svn.python.org/view/sandbox/trunk/ccbench/ )

> There are lock-free data structures of any conceivable sort. There are
> multithreaded garbage collectors in Java and .NET, that don't need a
> global lock. I've heard claims that removal of the GIL would require
> Python to use a garbage collector instead of reference counts. That is
> not true. Lock-free data structures and garbage collectors have all one
> thing in common: they use a compare-and-exchange (CAS) instruction
> present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is
> used to implement OS mutexes (sleeplocks) and the less expensive
> spinlocks. Without a CAS instruction for the platform, there would be no
> GIL. All platforms that has a working GIL has a working CAS.

That does not mean that CAS (or atomic increment/decrement, for that matter) is
as fast as a regular (non-atomic) increment / decrement.
Besides, reference counting is not the only thing to worry about.
You have to:
- protect (somehow) the GC when collecting, so that things don't get modified
under its feet
- protect INCREFs and DECREFs
- protect all mutable types (dicts being probably the most performance-critical,
since they are used for namespaces and class/object attributes and methods)
- protect all static/global data in C modules which have them
- and probably other things I'm not thinking about

An interesting experiment would be to add all those protections without even
attempting to remove the GIL, and measure the overhead compared to a regular
Python. IMO, if the overhead is less than 10%, then it would probably be
acceptable to go ahead and try removing the GIL.

But someone needs to tackle this concretely for it to happen, because I guess
these ideas have already been suggested a couple of times...

Regards

Antoine.

Sturla Molden

unread,

Oct 20, 2009, 7:31:52 PM10/20/09

to python...@python.org

Casey Duncan skrev:

>
> I doubt anyone here would disagree with that, but to be taken
> seriously, you'll need to be *much* more specific. Enlighten us.

I just showed you thread-safe incref and decref that don't require a lock.

How specific should I be?

S.M.

Casey Duncan

unread,

Oct 20, 2009, 7:38:34 PM10/20/09

to Sturla Molden, python...@python.org

On Oct 20, 2009, at 5:31 PM, Sturla Molden wrote:

> Casey Duncan skrev:
>>
>> I doubt anyone here would disagree with that, but to be taken
>> seriously, you'll need to be *much* more specific. Enlighten us.
> I just showed you thread-safe incref and decref that don't require a
> lock.

They looked like spinlocks to me, I was assuming you were talking
about something else even more magical. I was still considering them
locks.

-Casey

geremy condra

unread,

Oct 20, 2009, 7:39:47 PM10/20/09

to python-ideas

On Tue, Oct 20, 2009 at 7:31 PM, Sturla Molden <stu...@molden.no> wrote:
> Casey Duncan skrev:
>>
>> I doubt anyone here would disagree with that, but to be taken seriously,
>> you'll need to be *much* more specific. Enlighten us.
>
> I just showed you thread-safe incref and decref that don't require a lock.
>
> How specific should I be?
>

I suspect it's going to take working code- after all, if it isn't worth your
time to implement, I don't think very many others are going to leap
up and do it, however much benefit your perceive it having.

Geremy Condra

Sturla Molden

unread,

Oct 20, 2009, 7:46:15 PM10/20/09

to python...@python.org

Antoine Pitrou skrev:

> - protect (somehow) the GC when collecting, so that things don't get modified
> under its feet
>

The GC would have to temporarily suspend all active threads registered
with the interpreter. E.g. using the SuspendThread function in Windows API.

> - protect INCREFs and DECREFs
>

Update using CAS (I just showed you C code for this).

> - protect all mutable types (dicts being probably the most performance-critical,
> since they are used for namespaces and class/object attributes and methods)
>

CAS. Lock-free lists and hash tables exist. We don't need locks to
protect mutable types and class objects. Just use a lock-free hash-table
for __dict__ or whatever.

> - protect all static/global data in C modules which have them
>

Reacquire the GIL before calling functions in C extensions functions.
The GIL is nice there, but not anywhere else.

> An interesting experiment would be to add all those protections without even
> attempting to remove the GIL, and measure the overhead compared to a regular
> Python.

Yes, one could add a nonsence CAS to existing incref/decref, and measure
overhead.

Antoine Pitrou

unread,

Oct 20, 2009, 7:50:29 PM10/20/09

to python...@python.org

Sturla Molden <sturla@...> writes:
>
> Yes, one could add a nonsence CAS to existing incref/decref, and measure
> overhead.

That would be the first thing to do IMO.
If it doesn't seem to decrease performance a lot, you could continue by adding
locking to dict objects, and measure again.
Then add lists and sets (although I doubt sets and even lists are
performance-critical for general interpreter operation).
If after that you have less than a 10% slowdown (personal opinion only, other
people's mileage may vary), it's probably worth going on and trying to do the
whole GIL removal (which doesn't mean it would necessarily be accepted, by the
way, but at least would avoid it being automatically rejected ;-)).

Regards

Antoine.

Sturla Molden

unread,

Oct 20, 2009, 7:52:39 PM10/20/09

to python...@python.org

Casey Duncan skrev:

>
> They looked like spinlocks to me, I was assuming you were talking
> about something else even more magical. I was still considering them
> locks.

The code is almost like spinlocks, except the refcount acts as its own
"spinlock", thus the overhead is halved. (A separate spinlock would have
to be released as well.)

The failed attempt to remove the GIL used OS mutexes, which are much
more expensive sleeplocks.

Adam Olsen

unread,

Oct 20, 2009, 8:00:29 PM10/20/09

to Antoine Pitrou, python...@python.org

On Tue, Oct 20, 2009 at 17:50, Antoine Pitrou <soli...@pitrou.net> wrote:
> Sturla Molden <sturla@...> writes:
>>
>> Yes, one could add a nonsence CAS to existing incref/decref, and measure
>> overhead.
>
> That would be the first thing to do IMO.
> If it doesn't seem to decrease performance a lot, you could continue by adding
> locking to dict objects, and measure again.
> Then add lists and sets (although I doubt sets and even lists are
> performance-critical for general interpreter operation).
> If after that you have less than a 10% slowdown (personal opinion only, other
> people's mileage may vary), it's probably worth going on and trying to do the
> whole GIL removal (which doesn't mean it would necessarily be accepted, by the
> way, but at least would avoid it being automatically rejected ;-)).

This'd show you the minimum amount of overhead, but only for a trivial
case: single threaded. You're not invoking cache contention, which in
my experience is much larger than the atomic op cost, and prevents the
*real* gains you hope to make.

Python-safethread makes the datastructures safe (lockless fast-paths)
so that it can be compared properly. I tried atomic refcounting and
it sucked. My buffered bodge sucked less, but still sucked. The only
option left is a real tracing GC.

--
Adam Olsen, aka Rhamphoryncus

Sturla Molden

unread,

Oct 20, 2009, 8:22:22 PM10/20/09

to python...@python.org

Adam Olsen skrev:

> This'd show you the minimum amount of overhead, but only for a trivial
> case: single threaded.

As far as I know, the single-threaded case is the reason for keeping the
GIL.

Adam Olsen

unread,

Oct 21, 2009, 12:09:01 AM10/21/09

to Sturla Molden, python...@python.org

On Tue, Oct 20, 2009 at 18:22, Sturla Molden <stu...@molden.no> wrote:
> Adam Olsen skrev:
>>
>> This'd show you the minimum amount of overhead, but only for a trivial
>> case: single threaded.
>
> As far as I know, the single-threaded case is the reason for keeping the
> GIL.

No, single-threaded is the reason for keeping the GIL *as an option*.
The complete lack of any other viable option (in CPython) is the
reason it's the only option.

--
Adam Olsen, aka Rhamphoryncus

Kristján Valur Jónsson

unread,

Oct 21, 2009, 6:15:52 AM10/21/09

to Sturla Molden, python...@python.org, pytho...@python.org

> -----Original Message-----
> From: python-ideas-bounces+kristjan=ccpgam...@python.org
> [mailto:python-ideas-bounces+kristjan=ccpgam...@python.org] On
> Behalf Of Sturla Molden
> Sent: 20. október 2009 22:13
> To: python...@python.org
> Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
>
>
> - The GIL has consequences on multicore CPUs that are overlooked:
> thread
> switches are usually missed at check intervals. This could be fixed
> without removing the GIL: For example, there could be a wait-queue for
> the GIL; a thread that request the GIL puts itself in the back.
>

This depends entirely on the platform and primitives used to implement the GIL.
I'm interested in windows. There, I found this article:
http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html
So, you may be on to something. Perhaps a simple C test is in order then?

I did that. I found, on my dual-core vista machine, that running "release", that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however.
"fairness" was retained in debug builds too, strangely enough.

Now, Python uses none of these. On windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly.

The test application is attached.

I think that you ought to sustantiate your claims better, maybe with a specific platform and using some test like the above.

On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on windows. This test ought to show the danger of that. The GIL is different than a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way, if we want very fast mutexes for the rest of the system (cc to python-dev)

Cheers,

Kristján

giltest.cpp

Steven D'Aprano

unread,

Oct 21, 2009, 7:05:34 AM10/21/09

to python...@python.org

On Wed, 21 Oct 2009 03:09:01 pm Adam Olsen wrote:
> On Tue, Oct 20, 2009 at 18:22, Sturla Molden <stu...@molden.no> wrote:
> > Adam Olsen skrev:
> >> This'd show you the minimum amount of overhead, but only for a
> >> trivial case: single threaded.
> >
> > As far as I know, the single-threaded case is the reason for
> > keeping the GIL.
>
> No, single-threaded is the reason for keeping the GIL *as an option*.
> The complete lack of any other viable option (in CPython) is the
> reason it's the only option.

I'm not entirely sure what you're trying to say there, but I'd like link
to what Guido said over two years ago:

http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[quote]
... I'd welcome a set of patches into Py3k only if the performance for a
single-threaded program (and for a multi-threaded but I/O-bound
program) does not decrease.
I would also be happy if someone volunteered to maintain a GIL-free fork
of Python...
However, I want to warn that there are many downsides to removing the
GIL. ...
I want to point out one more time that the language doesn't require the
GIL -- it's only the CPython virtual machine that has historically been
unable to shed it.
[end quote]

--
Steven D'Aprano

Nick Coghlan

unread,

Oct 21, 2009, 8:31:12 AM10/21/09

to Sturla Molden, python...@python.org

Sturla Molden wrote:
> Benjamin Peterson skrev:
>> I eagerly await your patch!
> I'd gladly contribute to the effort, but I did not say it is an easy
> task :-)
>
> I just wanted to bring this up:
>
> - The GIL has consequences on multicore CPUs that are overlooked: thread
> switches are usually missed at check intervals.

In the rare case where none of those threads are releasing the GIL for
any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart,
but not a particularly difficult one to deal with.

That tangent aside, the major problem with removing the GIL has never
been the concept of doing so, but the practicality of implementing it in
a cross-platform way that doesn't hurt single threaded performance.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------

Kristján Valur Jónsson

unread,

Oct 21, 2009, 9:12:38 AM10/21/09

to Nick Coghlan, Sturla Molden, python...@python.org

> -----Original Message-----
> From: python-ideas-bounces+kristjan=ccpgam...@python.org
> [mailto:python-ideas-bounces+kristjan=ccpgam...@python.org] On
> Behalf Of Nick Coghlan
> Sent: 21. október 2009 12:31
> To: Sturla Molden
> Cc: python...@python.org
> Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
>

> > - The GIL has consequences on multicore CPUs that are overlooked:
> thread
> > switches are usually missed at check intervals.
>
> In the rare case where none of those threads are releasing the GIL for
> any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart,
> but not a particularly difficult one to deal with.
>

I think you may be missing the point.
If the GIL is implemented in a naive way on a platform, then releasing the gil and reclaiming it (such as is done during a checkinterval) can cause the same thread to get it again. Thus, the idea that this checkinterval should allow another thread waiting for the gil to run, would not work.
Apparently, "fairness" of some locking primitives is being given up for efficiency reasons in some operatins systems, like Vista.
I tested this by implementing a GIL like mechanism on a multicore maching using windows CriticalSections and Mutexes, both of which would starve the other threads. Semaphores work, though and the python locks on windows (using an atomic counter and an Event object) also work.

So, alghough Sturla's claim isn't valid for windows, there might be systems where the syscheckinterval mechanism for thread yielding doesn't work due to the locks in question not being "fair" on multicore platforms.

K

Sturla Molden

unread,

Oct 21, 2009, 10:56:40 PM10/21/09

to pytho...@python.org, python...@python.org

Antoine Pitrou skrev:
> Kristján sent me a patch which I applied and is supposed to fix this.
> Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero
> latency with the Pi calculation in the background) than under Linux, although it
> may be caused by the limited resolution of time.time() under Windows.
>
>

My critisism of the GIL on python-ideas was partly motivated by this:

http://blip.tv/file/2232410

However, David Beazley is not talking about Windows. Since the GIL is
apparently not a mutex on Windows, it could behave differently. So I
wrote a small script that contructs a GIL battle, and record how often a
check-interval results in a thread-switch or not. For monitoring check
intervals, I used a small C extension to read _Py_Ticker from ceval.c.
It is not declared static so I could easily hack into it.

With two threads and a check interval og 100, only 61 of 100000 check
intervals failed to produce a thread-switch in the interpreter. I'd call
that rather fair. :-)

And in case someone asks, the nthreads=1 case is just for verification.

S.M.

D:\>test.py
check interval = 1
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=57809, missed=42191
nthreads=3, swiched=91535, missed=8465
nthreads=4, swiched=99751, missed=249
nthreads=5, swiched=95839, missed=4161
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 10
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99858, missed=142
nthreads=3, swiched=99992, missed=8
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 100
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99939, missed=61
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 1000
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99999, missed=1
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0

Sturla Molden

unread,

Oct 21, 2009, 11:09:44 PM10/21/09

to pytho...@python.org, python...@python.org

Sturla Molden skrev:

>
> However, David Beazley is not talking about Windows. Since the GIL is
> apparently not a mutex on Windows, it could behave differently. So I
> wrote a small script that contructs a GIL battle, and record how often
> a check-interval results in a thread-switch or not. For monitoring
> check intervals, I used a small C extension to read _Py_Ticker from
> ceval.c. It is not declared static so I could easily hack into it.

Anyway, if anyone wants to run a GIL battle, here is the code I used.

If it turns out the GIL is far worse with pthreads, as it is implemented
with a mutex, it might be a good idea to reimplement it with an event
object as it is on Windows.

Sturla Molden

----------------

In python:

from giltest import *
from time import clock
import threading
import sys

def thread(rank, battle, start):

while not start.isSet():
if rank == 0:
start.set()

try:
while 1:
battle.record(rank)
except:
pass

if __name__ == '__main__':

sys.setcheckinterval(1000)

print "check interval = %d" % sys.getcheckinterval()

for nthreads in range(1,7):

start = threading.Event()
battle = GIL_Battle(100000)
threads = [threading.Thread(target=thread, args=(i,battle,start))
for i in range(1,nthreads)]
for t in threads:
t.setDaemon(True)
t.start()

thread(0, battle, start)
for t in threads: t.join()

s,m = battle.report()

print "nthreads=%d, swiched=%d, missed=%d" % (nthreads, s, m)

In Cython or Pyrex:

from exceptions import Exception

cdef extern from *:
ctypedef int vint "volatile int"
vint _Py_Ticker

class StopBattle(Exception):
pass

cdef class GIL_Battle:

""" tests the fairness of the GIL """

cdef vint prev_tick, prev_rank, switched, missed
cdef int trials

def __cinit__(GIL_Battle self, int trials=100000):
self.prev_tick = _Py_Ticker
self.prev_rank = -1
self.missed = 0
self.switched = 0
self.trials = trials

def record(GIL_Battle self, int rank):
if self.trials == self.switched + self.missed:
raise StopBattle
if self.prev_rank == -1:
self.prev_tick = _Py_Ticker
self.prev_rank = rank
else:
if _Py_Ticker > self.prev_tick:
if self.prev_rank == rank:
self.missed += 1
else:
self.switched += 1
self.prev_tick = _Py_Ticker
self.prev_rank = rank
else:
self.prev_tick = _Py_Ticker

def report(GIL_Battle self):
return int(self.switched), int(self.missed)

Nick Coghlan

unread,

Oct 22, 2009, 5:02:27 PM10/22/09

to Kristján Valur Jónsson, python...@python.org

Kristján Valur Jónsson wrote:
> I think you may be missing the point. If the GIL is implemented in a
> naive way on a platform, then releasing the gil and reclaiming it
> (such as is done during a checkinterval) can cause the same thread to
> get it again. Thus, the idea that this checkinterval should allow
> another thread waiting for the gil to run, would not work.
> Apparently, "fairness" of some locking primitives is being given up
> for efficiency reasons in some operatins systems, like Vista. I
> tested this by implementing a GIL like mechanism on a multicore
> maching using windows CriticalSections and Mutexes, both of which
> would starve the other threads. Semaphores work, though and the
> python locks on windows (using an atomic counter and an Event object)
> also work.
>
> So, alghough Sturla's claim isn't valid for windows, there might be
> systems where the syscheckinterval mechanism for thread yielding
> doesn't work due to the locks in question not being "fair" on
> multicore platforms.

No, I understand the fairness problem - releasing the GIL for a whole
millisecond is a workaround for exactly that issue. (Python must have
suffered from this back on Windows NT 4, as that is where I first
discovered this trick to fix a program that wasn't switching threads
properly - before I added the time.sleep call to explicitly release the
GIL and give other threads a chance to run before attempting to
reacquire it, the program was running each thread to completion before
starting the next one).

It's a sledgehammer approach to the problem though and incurs a speed
penalty even on platforms where the GIL thread scheduling is fairer. So
it's better if we can find a way to ensure proper scheduling fairness
across all platforms.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------

Antoine Pitrou

unread,

Nov 24, 2009, 1:39:03 PM11/24/09

to python...@python.org

Ok, out of curiousity I gave it a try: replacing INCREF/DECREF with atomic
instructions (*) slows down the interpreter by 20 to 40% depending on the
workload. And keep in mind this is the tip of the iceberg: to remove the GIL
you'd also have to add fine-grained locking in other places (dicts etc.).

Which makes me agree with the commonly expressed opinion that CPython would
probably need to ditch refcounting (at least in the critical paths) if we want
to remove the GIL.

Regards

Antoine.

(*) using gcc's atomic primitives which, I have checked, are inlined as
carefully optimized assembler:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Builtins

Guido van Rossum

unread,

Nov 24, 2009, 2:10:14 PM11/24/09

to Antoine Pitrou, python...@python.org

Note that Greg Stein reached this same conclusion (and similar
numbers) over 10 years ago...

--
--Guido van Rossum (python.org/~guido)

Collin Winter

unread,

Nov 24, 2009, 2:21:05 PM11/24/09

to Guido van Rossum, Antoine Pitrou, python...@python.org

On Tue, Nov 24, 2009 at 11:10 AM, Guido van Rossum <gu...@python.org> wrote:
> Note that Greg Stein reached this same conclusion (and similar
> numbers) over 10 years ago...

It's worth repeating this kind of experiment; the hardware landscape
has changed a lot in 10 years. It's interesting that the results are
the same a decade later.

Collin

Reply all

Reply to author

Forward