Spontaneous object type change


David Roe

Feb 21, 2022, 4:20:49 AM
to cython...@googlegroups.com, Julian Rüth
I am working on a new feature in Sage, which has a large Cython codebase.  We started running into segmentation faults and are struggling to find a method for debugging the problem.  Perhaps the most intriguing behavior that we've seen so far is a pair of print statements on consecutive lines that yield different results:
print(type(S))
print(type(S))
outputs
<class 'AttributeError'>
<class 'builtin_function_or_method'>
when further up the call stack S was a completely different type (a p-adic element).

We don't know of a simple way to reproduce the issue (the best we can offer is a branch of Sage on gitlab which you could build and run ./sage crash.sage).  Our best guess is that the Python stack is getting corrupted somehow, but we don't know how to find the source of this corruption.  Any suggestions of tools to try or debugging methods to attempt would be welcome.
David


Stefan Behnel

Feb 22, 2022, 11:07:12 AM
to cython...@googlegroups.com
David Roe wrote on 21.02.22 at 09:52:
> I am working on a new feature in Sage <http://www.sagemath.org>, which has
> a large Cython codebase. We started running into segmentation faults and
> are struggling to find a method for debugging the problem. Perhaps the
> most intriguing behavior that we've seen so far is a pair of print
> statements on consecutive lines that yield different results:
>
> print(type(S))
> print(type(S))
>
> outputs
>
> <class 'AttributeError'>
> <class 'builtin_function_or_method'>
>
> when further up the call stack S was a completely different type (a p-adic
> element).

What this suggests to me is that the object that S refers to might already
have been deleted before (possibly due to a reference counting issue) and
gets (partly) overwritten in memory, so that the type reference changes
when print() allocates some memory/objects on its own.

CPython allocates objects on the heap, which means that memory can end up
being reused without being returned to the operating system in between
(which would turn an access into a segfault).
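
A minimal sketch of the reuse described above, assuming CPython (where id() is the object's memory address); the function name is illustrative only. It frees an object, immediately allocates another of the same size, and reports whether the new object landed at the old address. This often happens but is not guaranteed.

```python
def address_reused():
    """Return True if a new object lands at the address of a just-freed one."""
    first = object()
    old_address = id(first)   # in CPython, id() is the object's memory address
    del first                 # last reference dropped, so the memory is freed
    second = object()         # same size: the allocator may hand back the same block
    return id(second) == old_address


if __name__ == "__main__":
    print("address reused:", address_reused())
```

A stale pointer to the first object would then see whatever now lives at that address, which would be consistent with type(S) changing between two consecutive prints.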


> We don't know of a simple way to reproduce the issue (the best we can offer
> is a branch of Sage on gitlab
> <https://gitlab.com/sagemath/dev/sage/-/tree/crash> which you could build
> and run ./sage crash.sage). Our best guess is that the Python stack is
> getting corrupted somehow, but we don't know how to find the source of this
> corruption. Any suggestions of tools to try or debugging methods to
> attempt would be welcome.
> David

Is this still using Python 2? Recent Python 3 release series come with a
couple of memory debugging features.
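
The features are not named here; a minimal sketch, assuming they include the Python Development Mode (-X dev, Python 3.7+) and the PYTHONMALLOC=debug allocator hooks (Python 3.6+). The helper name run_in_dev_mode is hypothetical, and how to pass these options through the ./sage launcher is not covered.

```python
import os
import subprocess
import sys


def run_in_dev_mode(args):
    """Run a child interpreter with CPython's memory debugging switched on."""
    # PYTHONMALLOC=debug installs debug hooks on the memory allocators.
    env = dict(os.environ, PYTHONMALLOC="debug")
    return subprocess.run(
        [sys.executable, "-X", "dev", *args],  # dev mode also enables faulthandler
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_in_dev_mode(["-c", "import sys; print(sys.flags.dev_mode)"])
    print(result.stdout.strip())
```

With the debug hooks installed, freed memory is filled with a recognizable byte pattern, so a use-after-free tends to fail earlier and more visibly instead of silently reading a recycled object.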

Stefan

Julian Rüth

Feb 23, 2022, 1:08:29 AM
to cython-users
Hi Stefan,

On Tuesday, February 22, 2022 at 11:07:12 AM UTC-5 Stefan Behnel wrote:
> David Roe wrote on 21.02.22 at 09:52:
> > print(type(S))
> > print(type(S))
> >
> > outputs
> >
> > <class 'AttributeError'>
> > <class 'builtin_function_or_method'>
>
> What this suggests to me is that the object that S refers to might already
> have been deleted before (possibly due to a reference counting issue) and
> gets (partly) overwritten in memory, so that the type reference changes
> when print() allocates some memory/objects on its own.

This indeed seems to be what is happening. We identified a method [1] that is causing the trouble. Strangely, if we just change this method from cpdef to def without any changes elsewhere, the problem disappears.

Unfortunately, we have not yet been able to create a simple reproducer for this issue. Looking at the generated C code, we find that the cpdef method has one additional DECREF in its exception handling block [2] that the def method does not have [3]. Printing sys.getrefcount for the objects involved appears to confirm that, in some cases, the refcount is not updated correctly when coming back from a cpdef call [4].
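
A minimal sketch of this kind of refcount check, assuming CPython; Element and refcount_delta are hypothetical names, not code from the Sage branch. It compares sys.getrefcount before and after a call, so the extra temporary references that getrefcount itself adds cancel out.

```python
import sys


class Element:
    """Hypothetical stand-in for the object whose refcount is suspect."""


def refcount_delta(action, obj):
    """Return how many references to obj the call action(obj) left behind."""
    before = sys.getrefcount(obj)
    action(obj)
    after = sys.getrefcount(obj)
    return after - before


if __name__ == "__main__":
    store = []
    element = Element()
    # Appending stores one new reference in the list.
    print(refcount_delta(store.append, element))
    # A call that keeps nothing should leave the count unchanged.
    print(refcount_delta(lambda o: None, element))
```

A negative delta for a call that is not supposed to release anything would point at one DECREF too many somewhere along that call.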

> Is this still using Python 2? Recent Python 3 release series come with a
> couple of memory debugging features.

We are running Python 3.9.

Thanks a lot for your suggestion. Since we can't seem to create a reproducer of the actual issue that does not involve half of the SageMath project, we are now trying to understand the generated C code better. It's strange that changing a cpdef to a def makes a difference here. Maybe you have some further thoughts on what could be going on here?

julian

Stefan Behnel

Feb 23, 2022, 1:48:42 AM
to cython...@googlegroups.com
Julian Rüth wrote on 23.02.22 at 04:46:
> We identified a method [1] that
> is causing the trouble. Strangely, if we just change this method from cpdef
> to def without any changes elsewhere, the problem disappears.

What cpdef changes, compared to def, is the way the method is *called* in other
places. It becomes a C method, into which callers pass their arguments as
straight C values instead of a Python argument tuple (and keywords). This also
means that the Python object arguments are not kept alive by an argument tuple,
and that you need to make sure that an argument does not get deallocated
while the method is running and working with it. This can happen, for
example, when you pass an argument directly from an attribute of a cdef
class, although I think there are guards for this case now that would
internally keep the reference alive. Still, there may be cases where this
can't be assured internally, so it is worth finding out.
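
A pure-Python analogy of the lifetime issue, assuming CPython's immediate reference counting; Holder, Element and argument_survives are hypothetical names, and this is not what Cython generates. It shows that an object reachable only through an attribute is freed the moment the attribute is replaced, unless someone else holds a reference for the duration of the work.

```python
import weakref


class Element:
    """Hypothetical stand-in for an object passed as an argument."""


class Holder:
    """Hypothetical owner whose attribute is the only strong reference."""

    def __init__(self):
        self.value = Element()

    def replace(self):
        # Drops the owner's reference to the old Element.
        self.value = Element()


def argument_survives(keep_reference):
    """Return True if the original Element is still alive after replace()."""
    holder = Holder()
    probe = weakref.ref(holder.value)  # observes the object without keeping it alive
    kept = holder.value if keep_reference else None  # the manual "live reference"
    holder.replace()
    return probe() is not None


if __name__ == "__main__":
    print(argument_survives(keep_reference=True))
    print(argument_survives(keep_reference=False))
```

In a def method the argument tuple plays the role of `kept`; in a C-level call nothing does unless the caller or the generated code holds a reference.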


> And printing the sys.getrefcount of the objects
> involved appears to confirm that the refcount is not updated correctly when
> coming back from a cpdef call [4] in some cases.

I would a) check if the method changes any state (i.e. replaces any
references) in the outside world, and b) look at the callers to see if they
depend on that state. If so, make sure that you keep a live reference
manually. Specifically, find out where the arguments "x" and "y" are coming
from.


> Unfortunately, we have not been able yet to create a simple reproducer for
> this issue. Looking at the generated C code, we find that the cpdef method
> has one additional DECREF in its exception handling block [2] that the def
> method does not have [3].

That's unrelated. It simply uses more temporary variables internally
because it needs to additionally handle the case that the method gets
called directly as a C method, but then detects that it's actually in a
subclass that has overridden the method as a Python method, and that it
needs to call that instead. So it generates code to call the Python method.

Is that something you do here, BTW? Override the cpdef method with a def
method in a (Python) subclass?
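
A rough pure-Python model of that dispatch, with all names (Base, Sub, c_level_call, _c_body) hypothetical; the real check lives in the generated C code and works differently in detail. The C-level entry point first looks for a Python-level override and only runs the C body if there is none.

```python
class Base:
    def _c_body(self, x):
        # Stands in for the fast C implementation of the cpdef method.
        return ("c body", x)

    def method(self, x):
        # Stands in for the Python-visible wrapper of the cpdef method.
        return self._c_body(x)


class Sub(Base):
    def method(self, x):
        # A Python subclass overriding the cpdef method with a def method.
        return ("python override", x)


def c_level_call(obj, x):
    """Model of calling the method directly as a C method."""
    if type(obj).method is not Base.method:
        # Overridden in Python: go through a regular Python call instead.
        return obj.method(x)
    return obj._c_body(x)


if __name__ == "__main__":
    print(c_level_call(Base(), 1))
    print(c_level_call(Sub(), 1))
```

The override branch has to build a Python call and hold its result, which is where the extra temporaries (and their DECREFs in the error path) come from.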

Stefan