"Fatal Python error: PyMUTEX_UNLOCK(gil->mutex) failed" and bool interpreted as signed int with a random signed int value in printf


Sour Ce

Jul 16, 2020, 12:43:49 AM
to cython-users
Hey there.
I have a fairly large cython project consisting of 8000 lines of code.
I've experienced a few cases of odd/buggy behavior that I'd like to share.

Issue 1)
 I get the error "Fatal Python error: PyMUTEX_UNLOCK(gil->mutex) failed
Python runtime state: initialized

Current thread 0x00007fb906ea4740 (most recent call first):
<no Python frame>
Aborted (core dumped)".
 I haven't been able to locate exactly where the failing code runs, but I've narrowed down the circumstances that trigger this error, without being able to explain exactly what is going on.
 The circumstance is a recursive asyncio loop.call_later() with a delay. This in itself does not trigger the error.
 The function being called recursively loops over a C++ set of numeric ids, fetches a cppclass pointer from a map if it exists, and calls a method on this pointer (.send). For some reason that call triggers an internal "not allocated" error, which calls an error function that longjmps to this code:
    jmpval = setjmp(jmpenv)
    if jmpval:
        printf("LS jmpval: %d\n", jmpval)
        if jmpval == -1:
            gnet.reset()
            return
        gnet.close()
        printf("LS jmpval return\n")
        return
 The error is thrown at some point after this return


Issue 2)
 I have a bool named "allocated" inside a "virtual" cppclass (defined in Cython, with no cdef extern) named "mystring".
 Only two lines of code touch this variable:
 a)
    mystring():
        this.allocated = False

 and b)
    void create(unsigned int buflen):
        # snip ...
        this.allocated = True

 I then have a cppclass "networkhandle" which has among others the variables mystring inbuf and mystring outbuf.
 Further there's a networkhandle pointer variable "gnet".
 When I then call printf("sockid: %d, closed: %d, gnet.inbuf.allocated: %d, gnet.outbuf.allocated: %d\n", gnet.sockid, gnet.inbuf.allocated, gnet.outbuf.allocated) the result I get is "sockid: 1, closed: 1, gnet.inbuf.allocated: 1, gnet.outbuf.allocated: -394987424".
 What's going on here? Why is the last variable interpreted as a signed int when the previous one is not and both variables are bools? Maybe an unrelated bug, I'm just digging for bugs/weird behavior here trying to piece it all together.
 

I've tried running traceback.print_stack() in the function that handles internal (network) errors right before I get the Fatal Python error, but alas without any luck as the result is a mere:
  File "/usr/lib/python3.8/traceback.py", line 190, in print_stack
    print_list(extract_stack(f, limit=limit), file=file)

da-woods

Jul 16, 2020, 9:32:38 AM
to cython...@googlegroups.com
Issue 2 just sounds like your standard "someone has written to an invalid pointer or past the end of an array" bug - `allocated` is being accidentally overwritten from somewhere so has a nonsense value. It's possible that this is a Cython bug, but more likely to be a user bug.

Issue 1 could be more symptoms of the same problem or could be something else. It looks very "internal" suggesting it's memory corruption of some sort.

You could try using C++-level debugger watch-points. Other than that it's just a case of cutting your code down until you can make a small example that reproduces the problem and then examining that in detail.
--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cython-users/c94e5c13-7cc2-4eba-844f-b2c55280bde6o%40googlegroups.com.


SourCe

Jul 16, 2020, 9:33:03 AM
to cython-users

Update: Made a silly mistake with the printf, forgot to add gnet.closed in the printf. I now get "sockid: 1, closed: 0, gnet.inbuf.allocated: 1, gnet.outbuf.allocated: 1".
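
For readers who hit the same symptom: a format string with four %d conversions but only three arguments is undefined behaviour in C and C++, so the last %d prints whatever happens to sit in the next argument slot. Below is a minimal sketch in plain C++ (not Cython) of the corrected call. The NetState struct and the describe() helper are hypothetical stand-ins for the networkhandle fields mentioned above, and the bools are cast to int explicitly so each %d visibly has a matching argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the fields printed in the post.
struct NetState {
    int sockid;
    bool closed;
    bool inbuf_allocated;
    bool outbuf_allocated;
};

// Formats one status line. Four %d conversions, four int arguments.
std::string describe(const NetState& s) {
    char buf[128];
    const int n = std::snprintf(
        buf, sizeof buf,
        "sockid: %d, closed: %d, gnet.inbuf.allocated: %d, gnet.outbuf.allocated: %d",
        s.sockid,
        static_cast<int>(s.closed),
        static_cast<int>(s.inbuf_allocated),
        static_cast<int>(s.outbuf_allocated));
    // snprintf returns the length it wanted to write; check it fitted.
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf);
}
```

When the format string is a literal, GCC and Clang can usually report this kind of mismatch at compile time through -Wformat (enabled by -Wall). Whether that warning surfaces for Cython-generated C depends on the build flags in use.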

Still investigating what's going on with the fatal error and related bugs, only thing I've found out so far is that the fatal error may be caused by asyncio loop.call* > func > longjmp.

da-woods

Jul 16, 2020, 3:15:02 PM
to cython...@googlegroups.com
longjmp seems dodgy to me for at least 3 reasons:
1) If you're using C++ (and it looks like you are?) then it'll bypass destructors etc.
2) If there are any Python objects in your Cython code then it'll bypass all of Cython's assumptions about reference counting.
3) There may well be other bits of internal state that Cython is using that longjmp bypasses too.

I'm not saying it's absolutely wrong, but everything about it seems like a bad idea. If you're going to use it then I'd do it from within handwritten C code, not Cython-generated code (which may have a lot of stuff going on that you don't control).
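
To illustrate point 1, here is a small plain C++ sketch (not Cython, all names hypothetical). It shows the behaviour that longjmp does not give you: when an error leaves a function through a C++ exception, stack unwinding runs the destructors of the local objects. Jumping over the same frame with longjmp would skip ~Guard(), and the C++ standard makes that undefined behaviour, which is why the sketch does not attempt it.

```cpp
#include <cassert>   // used by the check shown in the note that follows
#include <stdexcept>

// Counts how many Guard destructors have run, so the effect is observable.
static int g_released = 0;

// Hypothetical RAII guard standing in for any C++ object that owns a resource.
struct Guard {
    ~Guard() { ++g_released; }
};

// Fails via a C++ exception; stack unwinding destroys `g` on the way out.
void fail_with_exception() {
    Guard g;
    throw std::runtime_error("network error");
}

// Returns the number of Guard destructors that ran during one failed call.
int destructors_run_on_error() {
    const int before = g_released;
    try {
        fail_with_exception();
    } catch (const std::runtime_error&) {
        // Error handled; by this point `g` has already been destroyed.
    }
    return g_released - before;
}
```

A check such as assert(destructors_run_on_error() == 1) passes here. The equivalent longjmp version has no defined result to check.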



SourCe

Jul 17, 2020, 8:13:15 AM
to cython-users
@da-woods
Thanks for your replies.

Responding to your earlier point about memory corruption: previously, when I was using pure C classes and C memory allocation, I struggled with memory corruption and segfaults. Even earlier I had some memory leaks in some Cython / mixed Python/C++ code, but these have been fixed. Since moving to C++ classes I haven't experienced a single segfault or memory leak, so there are no signs of memory corruption or memory problems as far as I can see.

The thing is, this fatal Python error ONLY occurs when I mix asyncio call_* with longjmp (note: the loop call_* calls are made from inside, and target, a regular cdef function, not an async def, in case it matters).
A little bit about the longjmp: it is used to handle out-of-bounds networkhandle errors. Instead of continuing to run code that tries to use a "disabled" networkhandle, and adding if checks everywhere to make sure the networkhandle isn't disabled yet, I decided to try a longjmp, and seemingly hadn't had any issues with it until now.

I've finally found the source of the error which then triggers this fatal Python error: I forgot to declare "global gnet" in the function :P I'm not completely sure, but I think Cython code sometimes assigns to a global variable even if it is not explicitly declared global and sometimes not, so I've been a bit lazy with the global declarations.
Anyhow, the exact cause of this fatal Python error is still a mystery to me, and it would be nice to get to the bottom of it so it doesn't creep up on me later.

Are there any good alternatives to using a longjmp for my case? I guess I should raise an exception and handle it with a try/except, even though it's likely a bit more expensive?

Sour Ce

Jul 30, 2020, 8:04:25 AM
to cython-users
Hey again.

So the underlying problem seems to still be there with the longjmp. Since the only alternative to a longjmp I could think of is raising an exception, I looked into my previous benchmarks and found that raising an exception is pretty expensive (as much as 100μs, down to 25μs with except *, according to the notes in my benchmark).
Even a 25μs cost for every network exception, including user-based errors (sending invalid data), would be completely suffocating. It's a cost I can't afford in this case, as the server should be designed to handle tens of thousands of pps and still have a lot of resources (>75%) left for server-side (game progress/handling) operations.
So are there any other alternatives to longjmp that are less costly?

I can't limit the use to handwritten C code since it's necessary for asyncio code. The classes triggering the function that calls longjmp, and the function itself, contain no Python objects; they are cppclass objects. Currently (temporarily) I do use some Python objects in the same scope as the network handle object, e.g. pydata = net.getShort(); if pydata == 100: net.addShort(123), etc. At any point net.getShort() and net.addShort() could trigger an internal error (buffer underflow/overflow), which is currently handled with a longjmp due to its lower cost compared with raising an exception.
Could this (likely) turn out to be a problem?
Haven't had the time/energy to look into my code much since the last time, but was hoping to possibly get an answer to this question anyways.

Stefan Behnel

Jul 30, 2020, 2:37:23 PM
to cython...@googlegroups.com
'Sour Ce' via cython-users schrieb am 30.07.20 um 13:54:
> So the underlying problem seem to still be there with the longjmp and since
> the only alternative I could think of to a longjmp is to raise an exception
> I looked into my previous benchmarks and found that raising an exception is
> pretty expensive (as much as 100μs down to 25μs with except * according to
> the notes in my benchmark).

Did you only try "except *" (which is somewhat costly) or also a specific
error return value, e.g. "except -2" or so?
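
As a rough illustration of what a specific error return value means at the C level, here is a plain C++ sketch (not Cython, and not the code Cython generates). The ReadBuffer struct, get_short() and the kReadError constant are hypothetical names. The function reserves one return value that can never be a valid result, so the failure path is just a comparison and a return, with nothing thrown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reserved return value meaning "read failed"; valid results are 0..65535.
constexpr int kReadError = -2;

// Hypothetical read buffer; names are illustrative, not from the thread.
struct ReadBuffer {
    const std::uint8_t* data;
    std::size_t size;
    std::size_t pos;
};

// Reads a big-endian 16-bit value, or returns kReadError on underflow.
int get_short(ReadBuffer& b) {
    assert(b.pos <= b.size);              // invariant of the buffer
    if (b.size - b.pos < 2) {
        return kReadError;                // not enough bytes left
    }
    const int value = (b.data[b.pos] << 8) | b.data[b.pos + 1];
    b.pos += 2;
    return value;
}
```

The caller then compares the result against kReadError before using it. A failed read leaves pos unchanged, so the buffer stays in a consistent state.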


> Even a 25μs cost for every network exception, including user-based errors
> (sending invalid data), would be completely suffocating, it's a cost I
> can't afford in this case as the server should be designed to handles 10's
> of thousands of pps and still have a lot of resources (>75%) left for
> server-sided (game progress/handling) operations.
> So are there any other alternatives to longjmp that are less costly?
>
> I can't limit the use to handwritten C code since it's necessary for
> asyncio code, but the classes triggering the function that longjmp and the
> function itself contains no python objects, they are cppclass objects,
> though currently (temporarily) I do use some python objects in the same
> scope as the use of the network handle object, e.g. pydata =
> net.getShort(); if pydata == 100: net.addShort(123), etc, and at any point
> net.getShort() and net.addShort() could trigger an internal error (buffer
> underflow/overflow) that are currently handled with a longjmp due to the
> lower cost compared with raising an exception.
> Could this (likely) turn out to be a problem?
> Haven't had the time/energy to look into my code much since the last time,
> but was hoping to possibly get an answer to this question anyways.

Is "net" an @final class here? If it doesn't have subtypes, then you can
declare it as @final and let Cython inline the (cdef) method calls, which
might then also improve the error handling performance since the C compiler
could see the cases where the error value is returned.

Stefan

Sour Ce

Oct 25, 2020, 10:35:40 AM
to cython-users
Thought I'd ask a couple of questions here that I should have asked earlier (thanks for your answers earlier):
1) What do you mean by handwritten C code? Do you mean C from a code.c file, or does your definition include Cython C such as cdef int val?
2) If setjmp is basically not supported and/or its use is not properly documented (might or might not cause issues with other unspecified Cython bits), why is it included in the default Cython includes/libraries (libc/setjmp.pxd) with no documentation/comments about its highly limited Cython support?

I've loved Cython so far, would be sad if the adventure stops here with setjmp, while included by default, not being supported.

@stefan
I don't know what @final does; all I know is that a simple loop with try/except is somewhat costly even if the except clause is just a pass.
I would also have to add try and except everywhere in my code, since network errors can happen just about anywhere, not just in asyncio server code. So try/except is not really a solution here; it's a desperate last resort.
And if using e.g. except -1, I would have to add literally thousands of if net.method() == -1: 'raise Exception' checks, or carefully craft 'return's in thousands of different cases.
This just seems like a really ugly and inappropriate solution, but if there's no other way around it I might have to give it a shot if I don't decide to abandon the project.
More likely I'll stick with longjmp and try to make it work somehow.
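
One pattern that avoids both longjmp and a check after every call is a "sticky" error flag, similar in spirit to how C++ iostreams latch failbit. It was not proposed in this thread, so treat the following as a sketch under assumptions: it is plain C++ (not Cython), and PacketReader and parse_packet() are hypothetical names. Once a read fails, the reader marks itself as failed and every later read becomes a cheap no-op, so the caller checks once at the end of a packet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical packet reader with a sticky failure flag.
class PacketReader {
public:
    PacketReader(const std::uint8_t* data, std::size_t size)
        : data_(data), size_(size) {}

    // Returns the next big-endian 16-bit value, or 0 once the reader has failed.
    std::uint16_t get_short() {
        assert(pos_ <= size_);
        if (failed_ || size_ - pos_ < 2) {
            failed_ = true;               // latch the error; later reads are no-ops
            return 0;
        }
        const std::uint16_t v =
            static_cast<std::uint16_t>((data_[pos_] << 8) | data_[pos_ + 1]);
        pos_ += 2;
        return v;
    }

    // True as long as no read has failed so far.
    bool ok() const { return !failed_; }

private:
    const std::uint8_t* data_;
    std::size_t size_;
    std::size_t pos_ = 0;
    bool failed_ = false;
};

// Parses a hypothetical 3-field packet; one check at the end covers all reads.
bool parse_packet(PacketReader& r, std::uint16_t out[3]) {
    for (int i = 0; i < 3; ++i) {
        out[i] = r.get_short();
    }
    return r.ok();
}
```

The trade-off is that code between the failing read and the final ok() check still runs on placeholder values (0 here), so it must not act on them irreversibly before the check.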

Sour Ce

Oct 26, 2020, 3:11:06 AM
to cython-users

Did a few tests today and discovered the following about longjmp: newly created local Python objects and C++ strings in any scope (probably also other C++ objects) that have not been freed before the longjmp end up creating a memory leak. Cdef numeric types, including pointers, do not seem to cause any memory leaks, and global Python objects seem to be freed automatically anyway.
Additionally, it seems that C++ objects created with new cannot actually be freed with delete from a global scope (using the Python global statement); they have to be deleted locally. Even assigning a global pointer to the object, or appending it to a container and trying to delete it from there later, doesn't work.
A conceptual workaround for using longjmp seems to be:
1) Don't create any new C++ objects prior to a longjmp, or make sure they've already been manually freed locally before the longjmp.
2) Before any longjmp, make all Python objects global, ensuring they're freed automatically, or manually free them locally with del.
3) If you're going to create new C++ objects before a longjmp, it seems they have to be created and deleted with the new and delete keywords, so that they can be deleted with a del before the longjmp.
4) Prefer numeric variables (all C data structures?) over C++ objects and Python objects as much as possible, since simple numeric variables do not seem to need freeing manually. If you're going to use C++ or Python objects, make sure they've been created before the longjmp, with the exception of global Python objects.

Note: Never tested structs and only tested C++ vectors and C++ strings, but I expect the same negative results with all C++ objects and the same positive results with all C data structures.

So I guess you were basically correct da-woods, however doing some testing on my own helped me realize more exactly what can and can't be done without creating memory leaks. Seems to me it's still possible using longjmp with proper care without handcrafting C code.
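
The "only plain C data is live" case from the list above can be sketched in plain C++ as follows. This is not Cython and all names are hypothetical; Cython-generated code may hold temporaries that are not visible in the source, so the sketch only shows the shape that is well defined at the C++ level. Two details are easy to miss: setjmp is used directly in the if condition, which is one of the contexts the C standard permits, and a local that is modified after setjmp and read after the jump is declared volatile.

```cpp
#include <cassert>   // used by the checks shown in the note that follows
#include <csetjmp>

// Jump target shared by the handler and the code that bails out.
static std::jmp_buf g_env;

// Hypothetical low-level routine: only plain C data lives in its frame,
// so jumping out of it skips no destructors.
static void send_or_bail(int remaining) {
    char scratch[16] = {0};              // trivially destructible
    scratch[0] = static_cast<char>(remaining);
    if (remaining <= 0) {
        std::longjmp(g_env, 1);          // abandon this frame entirely
    }
}

// Returns how many sends completed before the first failure (3 if none failed).
int run_sends(int budget) {
    // Modified after setjmp and read after a longjmp, so it must be volatile.
    volatile int sent = 0;
    if (setjmp(g_env) != 0) {
        return sent;                     // reached via longjmp
    }
    for (int i = 0; i < 3; ++i) {
        send_or_bail(budget - i);
        sent = sent + 1;
    }
    return sent;
}
```

With these definitions, assert(run_sends(3) == 3) and assert(run_sends(2) == 2) both hold. Adding a std::string or std::vector local to either function would put the jump back into undefined behaviour.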

da-woods

Oct 26, 2020, 4:51:38 AM
to cython...@googlegroups.com
Your assessment of when longjmp should work sounds about right to me. The only other thing to add is that you can't rely on finally blocks.

Just to answer a few of your questions from your earlier post:

> What do you mean by handwritten C code? Do you mean C from a code.c file, or does your definition include Cython C such as cdef int val?

In principle, sufficiently carefully typed Cython code should probably be OK, but remember that you don't control it 100% and it's very easy to inadvertently generate Python objects. If you make your functions `nogil` (i.e. `cdef f() nogil:`) then that stops Cython attempting most things involving Python objects, so it might be a useful guard to add. It isn't foolproof though.

> If setjmp is basically not supported [...] why is it included in the default Cython includes/libraries

The general approach is to create thin wrappers for as much of the C and C++ standard libraries as possible. There's plenty of stuff in there that can be used to break your program. Similarly, `setjmp/longjmp` doesn't play well with most C++ classes (as you've identified) but is still available in the C++ standard library.

> I don't know what @final does

`@final` can be applied to a cdef class to promise that the class won't be inherited from. The advantage is that the compiler knows exactly what `cdef` function it's calling (i.e. that it will never call a derived class function) and that may allow it to optimize better - especially identifying where exceptions won't happen.