Calls to python API without GIL causing crash

2 views
Skip to first unread message

Sam Partington

unread,
Mar 7, 2012, 6:44:17 AM3/7/12
to wxpyth...@googlegroups.com
Hello there,

For almost a month we've been trying to track down a GPF in our
application which was impossible to reproduce on demand, only
presented itself in release builds and when in py2XXX binary form, and
only on multi-core systems.

We have eventually tracked it down to this code in
http://svn.wxwidgets.org/svn/wx/wxPython/trunk/include/wx/wxPython/pytree.h

In particular :
void SetData(PyObject* obj) {
wxPyBlock_t blocked = wxPyBeginBlockThreads();
Py_DECREF(m_obj);
wxPyEndBlockThreads(blocked);
m_obj = obj;
Py_INCREF(obj);
}

Py_INCREF is being called without the GIL. The odd thing is that this
seems to be deliberate, in SetData for example the GIL is explicitly
released after the DECREF. The other Py_INCREF calls also don't
acquire the GIL.

We've added Begin/EndBlockThreads calls around the INCREF calls and
we've now been running our application for 3 days without a crash
(crashes would normally occur in a few hours).

As a result of this we are now reviewing the whole of wxPython for all
calls to Py_INCREF that are taking place without the GIL, and we'll
submit patches upstream at this point. We've already found several
places. After this we'll be reviewing all Python API calls in
wxPython. This will take a while obviously :-)

Does anyone know why this was done this way? Is there some reason why
it was thought that it was atomic? It is defined thus :

#define Py_INCREF(op) ( \
_Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
((PyObject*)(op))->ob_refcnt++)


Which is most definitely not atomic.

Another thing that occurs to me whilst reviewing the code is that the
whole system in wxPython is dreadfully error prone. Our own extension
handles this by having all Python API calls in a single module rather
than littered over the code base.

After this experience we also intend to enforce correctness in our
extension by passing in a handle to something roughly equivalent to
your 'wxPyBlock_t' type. Would there be interest in having something
like this in wx? In Phoenix perhaps, though I've not yet looked at
the phoenix code at all.

Sam

Robin Dunn

unread,
Mar 7, 2012, 5:29:20 PM3/7/12
to wxpyth...@googlegroups.com
On 3/7/12 3:44 AM, Sam Partington wrote:
[...]

>
> Does anyone know why this was done this way? Is there some reason why
> it was thought that it was atomic? It is defined thus :
>
> #define Py_INCREF(op) ( \
> _Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
> ((PyObject*)(op))->ob_refcnt++)
>
>
> Which is most definitely not atomic.

I can't say for sure since it has been so long, but I seem to recall
reading something about the GIL only needing to be held when actual
Python byte-code is being (or could be) executed, so I was probably
using that as the basis of that coding decision. Since a Py_DECREF can
potentially cause a Python __del__ method to be called then they are all
protected by the GIL, but since Py_INCREF only executes a bit of C code
which should never directly result in Python code being executed then I
didn't. At least that's my guess at this point in time, it has been a
*very* long time since then.

I'm looking forward to your patches.

--
Robin Dunn
Software Craftsman
http://wxPython.org

Sam Partington

unread,
Mar 8, 2012, 5:04:40 AM3/8/12
to wxpyth...@googlegroups.com

It seems that's not an uncommon misconception looking around the net,
but the docs, now at least, are quite clear, explicitly mentioning
incrementing ref counts:

http://www.python.org/doc//current/c-api/init.html#thread-state-and-the-global-interpreter-lock

"""
The Python interpreter is not fully thread safe. In order to support
multi-threaded Python programs, there’s a global lock, called the
global interpreter lock or GIL, that must be held by the current
thread before it can safely access Python objects. Without the lock,
even the simplest operations could cause problems in a multi-threaded
program: for example, when two threads simultaneously increment the
reference count of the same object, the reference count could end up
being incremented only once instead of twice.

Therefore, the rule exists that only the thread that has acquired the
global interpreter lock may operate on Python objects or call Python/C
API functions.
"""

> I'm looking forward to your patches.

That'll probably take a week before we have completed the initial
review, so far I've found about 8 violations, and not all of them
INCREF, some PyInt_FromLong calls for example.

As this review process feels rather error prone, I've also written a
wrapper around the python headers that make every call to the Python
API runtime raise an exception if the GIL is not held.

Sam

Amaury Forgeot d'Arc

unread,
Mar 8, 2012, 5:49:15 AM3/8/12
to wxpyth...@googlegroups.com
2012/3/7 Robin Dunn <ro...@alldunn.com>:

>> #define Py_INCREF(op) (                         \
>>     _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
>>     ((PyObject*)(op))->ob_refcnt++)
>>
>>
>> Which is most definitely not atomic.
>
>
> I can't say for sure since it has been so long, but I seem to recall reading
> something about the GIL only needing to be held when actual Python byte-code
> is being (or could be) executed, so I was probably using that as the basis
> of that coding decision.  Since a Py_DECREF can potentially cause a Python
> __del__ method to be called then they are all protected by the GIL, but
> since Py_INCREF only executes a bit of C code which should never directly
> result in Python code being executed then I didn't.  At least that's my
> guess at this point in time, it has been a *very* long time since then.

If another thread (which owns the GIL) calls Py_INCREF on the same
object at the same time,
the result is undefined: the final value of refcount will be
incremented by one or two,
depending on race condition, L1 cache, or whatever.
The GIL is not only here for Python bytecode (and this is probably why
it is so difficult to remove from CPython)

Py_INCREF should only be called with the GIL held.

--
Amaury Forgeot d'Arc

Sam Partington

unread,
Mar 16, 2012, 10:51:40 AM3/16/12
to wxpyth...@googlegroups.com
Hi there,

Whilst fixing this I came across the following comment :

// TODO: Can not wxASSERT here because inside a
wxPyBeginBlock Threads,
// will lead to a deadlock when it tries to aquire the GIL
again.

This is no longer the case when wxPyUSE_GIL_STATE != 0. is ? It looks
to me like PyGILState_Ensure/Release have been used for python >= 2.4.

I can't see any definitive minimum python version anywhere though.

Do we still need to still support Python 2.3 and earlier? Or can this code go?

The reason I ask is because if the wxPyUSE_GIL_STATE == 0 code can
deadlock if the GIL is taken twice then I will need to be a bit more
careful with where it's used. In general I have tried hard not to do
any unnecessary recursive locking, but overall policy is that a small
performance hit because a lock has been taken twice is better than an
almost impossible to detect race condition.

Thanks

Sam

PS Patches are now ready, I'm just having a bit of difficulty getting
it to build on OSX in the trunk :-)

> --
> To unsubscribe, send email to wxPython-dev...@googlegroups.com
> or visit http://groups.google.com/group/wxPython-dev?hl=en

Robin Dunn

unread,
Mar 16, 2012, 2:38:55 PM3/16/12
to wxpyth...@googlegroups.com
On 3/16/12 7:51 AM, Sam Partington wrote:
> Hi there,
>
> Whilst fixing this I came across the following comment :
>
> // TODO: Can not wxASSERT here because inside a
> wxPyBeginBlock Threads,
> // will lead to a deadlock when it tries to aquire the GIL
> again.
>
> This is no longer the case when wxPyUSE_GIL_STATE != 0. is ? It looks
> to me like PyGILState_Ensure/Release have been used for python>= 2.4.
>
> I can't see any definitive minimum python version anywhere though.
>
> Do we still need to still support Python 2.3 and earlier? Or can this code go?

It can go. I think a few people are still wanting 2.5 support but 2.6
is as far back as I'm concerned about, and Phoenix will probably require
2.7.

>
> The reason I ask is because if the wxPyUSE_GIL_STATE == 0 code can
> deadlock if the GIL is taken twice then I will need to be a bit more
> careful with where it's used. In general I have tried hard not to do
> any unnecessary recursive locking, but overall policy is that a small
> performance hit because a lock has been taken twice is better than an
> almost impossible to detect race condition.

Agreed.

> PS Patches are now ready, I'm just having a bit of difficulty getting
> it to build on OSX in the trunk :-)

I may have some changes that need to be pushed up the the SVN server.
I'll check in a few minutes.

Sam Partington

unread,
Apr 17, 2012, 12:05:58 PM4/17/12
to wxpyth...@googlegroups.com
Hello,

Sorry for the delay on this but I now have my patch ready, but it is
rather messed up because it looks like the osx_cocoa swig generated
files are out of date on the head.

So first, here is a patch for all of the osx_cocoa changes using the
head. If you can commit those then I can generate a patch of the
genuine changes I have made.

Thanks

Sam

swig-generated.diff

Sam Partington

unread,
Apr 19, 2012, 9:57:00 AM4/19/12
to wxpyth...@googlegroups.com
On 17 April 2012 17:05, Sam Partington <sam.par...@gmail.com> wrote:
> Sorry for the delay on this but I now have my patch ready, but it is
> rather messed up because it looks like the osx_cocoa swig generated
> files are out of date on the head.

After thinking about this more I realised that actually I could
generate a patch without the swig generated files quite easily. I
just needed to do a fresh checkout and apply my patch to that to get a
proper svn patch.

The main strategy for this patch was to have an RAII class to the
locking/unlocking :

493 // helper for RAII thread blocking
494 class wxPyThreadBlocker {
495 public:
496 #ifdef wxPyUSE_EXPORTED_API
497 static wxPyBlock_t wxPyBeginBlockThreads() { return
wxPyGetCoreAPIPtr()->p_wxPyBeginBlockThreads(); }
498 static void wxPyEndBlockThreads(wxPyBlock_t oldstate) {
wxPyGetCoreAPIPtr()->p_wxPyEndBlockThreads(oldstate); }
499 #endif
500
501 explicit wxPyThreadBlocker(bool block=true)
502 : m_oldstate(block ? wxPyBeginBlockThreads() :
wxPyBlock_t_default),
503 m_block(block)
504 {
505 }
506 ~wxPyThreadBlocker() {
507 if (m_block) {
508 wxPyEndBlockThreads(m_oldstate);
509 }
510 }
511
512 private:
513 void operator=(const wxPyThreadBlocker&);
514 explicit wxPyThreadBlocker(const wxPyThreadBlocker&);
515 wxPyBlock_t m_oldstate;
516 bool m_block;
517 };

Secondly to factor out the common code from the various user data
object holders (such as wxPyUserData, wxPyClientData,
wxPyOORClientData and wxPyTreeItemData) into a single template class
with a few derived classes for specialisations.

521 // helper template to make common code for all of the various
user data owners
522 template<typename Base>
523 class wxPyUserDataHelper : public Base {
524 public:
525 // This incRef flag seems to be used by the wxApp OOR stuff ONLY.
526 explicit wxPyUserDataHelper(PyObject* obj, bool incRef=true)
: m_obj(obj ? obj : Py_None) {
527 if (incRef) {
528 wxPyThreadBlocker blocker;
529 Py_INCREF(m_obj);
530 }
531 }
532 ~wxPyUserDataHelper()
533 { // normally the derived class does the clean up, or
deliberately leaks
534 // by setting m_obj to 0, but if not then do it here.
535 if (m_obj) {
536 wxPyThreadBlocker blocker;
537 Py_DECREF(m_obj);
538 m_obj = 0;
539 }
540 }
541
542 // Return Value: New reference
543 PyObject* GetData() const {
544 wxPyThreadBlocker blocker;
545 Py_INCREF(m_obj);
546 return m_obj;
547 }
548 // Return Value: Borrowed reference
549 PyObject* BorrowData() const {
550 return m_obj;
551 }
552
553 void SetData(PyObject* obj) {
554 if (obj != m_obj) {
555 wxPyThreadBlocker blocker;
556 Py_DECREF(m_obj);
557 m_obj = obj ? obj : Py_None;
558 Py_INCREF(m_obj);
559 }
560 }
561
562 // Return the object in udata or None if udata is null
563 // Return Value: New reference
564 static PyObject* SafeGetData(wxPyUserDataHelper<Base>* udata) {
565 wxPyThreadBlocker blocker;
566 PyObject* obj = udata ? udata->BorrowData() : Py_None;
567 Py_INCREF(obj);
568 return obj;
569 }
570
571 // Set the m_obj to null, this should only be used during
clean up, when
572 // the object should be leaked.
573 // Calling any other methods on this object is then undefined behaviour
574 void ReleaseDataDuringCleanup()
575 {
576 m_obj = 0;
577 }
578
579 private:
580 PyObject* m_obj;
581 };

The rest of the patch is just using these classes in the right way in
the right places.

Sam

fix-threading-issues.diff

Robin Dunn

unread,
Apr 19, 2012, 5:02:18 PM4/19/12
to wxpyth...@googlegroups.com
On 4/19/12 6:57 AM, Sam Partington wrote:
> On 17 April 2012 17:05, Sam Partington<sam.par...@gmail.com> wrote:
>> Sorry for the delay on this but I now have my patch ready, but it is
>> rather messed up because it looks like the osx_cocoa swig generated
>> files are out of date on the head.
>
> After thinking about this more I realised that actually I could
> generate a patch without the swig generated files quite easily. I
> just needed to do a fresh checkout and apply my patch to that to get a
> proper svn patch.

Making patches for generated files is kinda dangerous anyway, since
they'll be lost when the files are regenerated.

>
> The main strategy for this patch was to have an RAII class to the
> locking/unlocking :

>


> The rest of the patch is just using these classes in the right way in
> the right places.

Thanks, I'll take a closer look when I can.

Sam Partington

unread,
May 24, 2012, 12:04:54 PM5/24/12
to wxpyth...@googlegroups.com
On 19 April 2012 22:02, Robin Dunn <ro...@alldunn.com> wrote:
> On 4/19/12 6:57 AM, Sam Partington wrote:
>>
>> On 17 April 2012 17:05, Sam Partington<sam.par...@gmail.com>  wrote:
>>>
>> After thinking about this more I realised that actually I could
>> generate a patch without the swig generated files quite easily.  I
>> just needed to do a fresh checkout and apply my patch to that to get a
>> proper svn patch.
> Thanks, I'll take a closer look when I can.

Hi Robin,

Any progress on this?

Sam

Robin Dunn

unread,
May 24, 2012, 3:12:31 PM5/24/12
to wxpyth...@googlegroups.com
I plan on integrating it (and also look at Werner's bug reports) before
the next 2.9.4 preview build, which will probably be in a few days.
Reply all
Reply to author
Forward
0 new messages