I think I have discovered a weird bug in the Python interpreter. Under
Python2.2, when I run this script:
class foo(object):
    def __init__(self, n):
        self.n = n
    __slots__ = ['n']

o = None
for i in xrange(43550):
    o = foo(o)
del o

o = None
print "I'm fine."
for i in xrange(43551):
    o = foo(o)
del o
print "I'm dead."
I get "I'm fine" and then a segmentation fault. This only happens when
__slots__ is defined on a new-style class. Seriously, I'm not making those
numbers up -- reliably, 43551 objects will crash the interpreter; fewer will
not.
I don't know the code here, but I'm willing to bet Python is smashing the stack
because PyType_GenericAlloc() is recursive. I get a stacktrace from gdb that
looks like this:
#0 0x100a3fc4 in PyDict_New ()
#1 0x100244e8 in _PyType_Lookup ()
#2 0x100249d4 in _PyObject_SlotCompare ()
#3 0x1001a3d4 in PyType_GenericAlloc ()
#4 0x1001a4fc in PyType_GenericAlloc ()
#5 0x1001a5b8 in PyType_GenericAlloc ()
#6 0x1001a5b8 in PyType_GenericAlloc ()
#7 0x1001a5b8 in PyType_GenericAlloc ()
#8 0x1001a5b8 in PyType_GenericAlloc ()
#9 0x1001a5b8 in PyType_GenericAlloc ()
#10 0x1001a5b8 in PyType_GenericAlloc ()
#11 0x1001a5b8 in PyType_GenericAlloc ()
#12 0x1001a5b8 in PyType_GenericAlloc ()
#13 0x1001a5b8 in PyType_GenericAlloc ()
#14 0x1001a5b8 in PyType_GenericAlloc ()
... and so on, and so on, for longer than I had the patience to page through.
(I did get to about #400, though.)
Version Information:
% python2.2
Python 2.2.1 (#1, May 4 2002, 11:33:48)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
And since those arguments to xrange() above look like some weird magic to me,
this may also be useful to know:
% free
total used free shared buffers cached
Mem: 253968 146192 107776 0 8008 59184
-/+ buffers/cache: 79000 174968
Swap: 262136 3168 258968
% uname -a
Linux zelda 2.4.18 #1 SMP Fri Apr 26 19:18:50 CDT 2002 ppc unknown
Is there any in-development code I should try running?
--
| <`'> | Glyph Lefkowitz: Traveling Sorcerer |
| < _/ > | Lead Developer, the Twisted project |
| < ___/ > | http://www.twistedmatrix.com |
> I get "I'm fine" and then a segmentation fault. This only happens when
> __slots__ is defined on a new-style class. Seriously, I'm not making those
> numbers up -- reliably, 43551 objects will crash the interpreter; fewer will
> not.
Interestingly I get the construction working fine, but the 'del' crashes the
interpreter with (on Mac OS X):
-----
Date/Time: 2002-06-26 13:53:19 +0100
OS Version: 10.1.5 (Build 5S66)
Command: python
PID: 18472
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_INVALID_ADDRESS (0x0001) at 0xbff7ffe0
Thread 0 Crashed:
#0 0x00037bbc in lookdict_string
#1 0x0003804c in PyDict_GetItem
#2 0x00018994 in _PyType_Lookup
#3 0x00017348 in lookup_maybe
#4 0x00016f38 in call_finalizer
#5 0x00017054 in subtype_dealloc
#6 0x00017118 in subtype_dealloc
#7 0x00017118 in subtype_dealloc
[etc...]
-----
This seems to fail for me at 5430 objects. I couldn't get the allocation to
crash at all (or at least up to a million objects, after which I got bored).
Jonathan
> On 26/6/2002 12:03, in article
> mailman.1025089474...@python.org, "Glyph Lefkowitz"
> <gl...@twistedmatrix.com> wrote:
>
> > I get "I'm fine" and then a segmentation fault. This only happens when
> > __slots__ is defined on a new-style class. Seriously, I'm not making those
> > numbers up -- reliably, 43551 objects will crash the interpreter; fewer will
> > not.
>
> Interestingly I get the construction working fine, but the 'del' crashes the
> interpreter with (on Mac OS X):
Hmm. Sprinkling print statements about shows it dying on the 'del' (this is
how I encountered the bug originally, I was trying to time the [de]allocation
of large numbers of objects under 2.1 vs. 2.2), but the stacktrace is still the
same; is there some difference between the way allocation works on OS X
vs. Linux? (I assume if you can run OS X we're using the same hardware.)
> This seems to fail for me at 5430 objects. I couldn't get the allocation to
> crash at all (or at least up to a million objects, after which I got bored).
Yep, same here. Allocation is no problem.
> Hmm. Sprinkling print statements about shows it dying on the 'del' (this is
> how I encountered the bug originally, I was trying to time the [de]allocation
> of large numbers of objects under 2.1 vs. 2.2), but the stacktrace is still
> the same; is there some difference between the way allocation works on OS X
> vs. Linux? (I assume if you can run OS X we're using the same hardware.)
>
>> This seems to fail for me at 5430 objects. I couldn't get the allocation to
>> crash at all (or at least up to a million objects, after which I got bored).
>
> Yep, same here. Allocation is no problem.
I'm not sure about allocation differences, but if I up the stacksize limit
then I can run the example up to much higher numbers of objects. I think
it's basically a flaw in the deallocation mechanism. I don't understand
enough about the __slots__ stuff to know why it should only affect that.
No-one else seems to have noted it as a known bug, so I'd guess it's fair to
log it to Sourceforge as a new one (besides which there's no harm in
duplication). If you can't get it logged, let me know and I'll have a go
myself.
Jonathan
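[For anyone who wants to reproduce Jonathan's higher-stack-limit run from within Python rather than via the shell's `ulimit -s`, here is a minimal sketch using the standard-library `resource` module (Unix only). The names are all real stdlib API; the comments about equivalence to ulimit are the only interpretation added:]

```python
# Sketch: raise the process's soft stack limit up to its hard limit
# before running the test script.  Unix-only; this is the same knob
# the shell exposes as `ulimit -s`.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit (soft, hard):", soft, hard)

# Raising the soft limit as far as the hard limit needs no privileges.
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```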
Interesting. Using Python 2.2.1 on Mandrake 8.1, I get a segfault
without printing "I'm fine". Go ahead and file a bug.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
Project Vote Smart: http://www.vote-smart.org/
> I get "I'm fine" and then a segmentation fault. This only happens when
> __slots__ is defined on a new-style class. Seriously, I'm not making those
> numbers up -- reliably, 43551 objects will crash the interpreter; fewer will
> not.
Hmm, the script runs fine on my Slackware 8.0 system, even if I crank
the number up to 100000.
--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/ \ See the son in your bad day / Smell the flowers in the valley
\__/ Chante Moore
Bosskey.net: Aliens vs. Predator 2 / http://www.bosskey.net/avp2/
A personal guide to Aliens vs. Predator 2.
Glyph> I think I have discovered a weird bug in the Python interpreter.
Glyph> Under Python2.2, when I run this script:
...
Glyph,
This looks like a new bug to me. A quick scan for __slots__ in the SF bug
tracker didn't turn up anything. If you are unable to get ahold of the SF
site, I'll file a bug report for you, just let me know.
--
Skip Montanaro
sk...@pobox.com
consulting: http://manatee.mojam.com/~skip/resume.html
> Interesting. Using Python 2.2.1 on Mandrake 8.1, I get a segfault
> without printing "I'm fine". Go ahead and file a bug.
I finally got Sourceforge to talk to me long enough to submit a bug report
for this:
#574207 Chained __slots__ dealloc segfault
Feel free to followup with any information I've missed.
Jonathan
When run in a thread, it dies on my redhat 7.3 machine. When not run in
a thread, it seems to succeed even for very large values.
Non-main threads are presumably started with a smaller stack.
Jeff
> When run in a thread, it dies on my redhat 7.3 machine. When not run in
> a thread, it seems to succeed even for very large values.
I can reproduce that crash in a thread on Slackware 8.0, as well.
>
>From: Jonathan Hogg <jona...@onegoodidea.com>
>Subject: Re: Python 2.2 __slots__ Bug?
>Date: Wed, 26 Jun 2002 14:00:48 +0100
>
>> On 26/6/2002 12:03, in article
>> mailman.1025089474...@python.org, "Glyph Lefkowitz"
>> <gl...@twistedmatrix.com> wrote:
>>
>> > I get "I'm fine" and then a segmentation fault. This only happens when
>> > __slots__ is defined on a new-style class. Seriously, I'm not making those
>> > numbers up -- reliably, 43551 objects will crash the interpreter; fewer will
>> > not.
>>
>> Interestingly I get the construction working fine, but the 'del' crashes the
>> interpreter with (on Mac OS X):
>
>Hmm. Sprinkling print statements about shows it dying on the 'del' (this is
>how I encountered the bug originally, I was trying to time the [de]allocation
>of large numbers of objects under 2.1 vs. 2.2), but the stacktrace is still the
>same; is there some difference between the way allocation works on OS X
>vs. Linux? (I assume if you can run OS X we're using the same hardware.)
>
>> This seems to fail for me at 5430 objects. I couldn't get the allocation to
>> crash at all (or at least up to a million objects, after which I got bored).
>
>Yep, same here. Allocation is no problem.
>
Wondering if there's any 12*n byte chunks involved, fitting
into some 2**n (-overhead) allocation space? Just noticing
>>> divmod(2**19, 43551)
(12, 1676)
>>> divmod(2**16, 5430)
(12, 376)
Sometimes these things mean something, sometimes they don't ;-)
Maybe it will jog someone's thought. E.g., a last-item index/limit
problem?
Regards,
Bengt Richter
> Wondering if there's any 12*n byte chunks involved, fitting
> into some 2**n (-overhead) allocation space? Just noticing
>
>>>> divmod(2**19, 43551)
> (12, 1676)
>>>> divmod(2**16, 5430)
> (12, 376)
>
> Sometimes these things mean something, sometimes they don't ;-)
> Maybe it will jog someone's thought. E.g., a last-item index/limit
> problem?
It appears that GjR is on the case ;-) but I did some digging into the
source out of curiosity and it's definitely a stack overrun problem. If
anyone is interested, here's my analysis:
At the moment the deallocators for the builtin container types pull a neat
trick to stop the stack getting too deep. As they're recursively
deallocating objects they keep a count of how deep deallocation has gotten
so far. If this gets over a set limit then instead of deallocating the
object it jams it onto a list and returns immediately.
When the stack has unwound back to the top a new piece of magic is invoked
that checks to see if anything was put onto the "delete later" list. If
something was then it grabs the first thing off the list and starts
recursively deleting at that point again.
A hack is used to make this "delete later" list: The object's type is
checked and, since this trick is only used by a few builtin object types, a
numeric type code is stored in the reference count slot of the object (which
is no longer interesting since it must be 0). The type object pointer slot
is then re-used as a pointer to the next item in the "delete later" list.
When the item is pulled off the list and actually deallocated the reverse
trick is pulled to correct the type object pointer and reset the reference
count to 0.
This is all fine except that this trick is not pulled for __slots__
variables as these are not stored in a dict that can be added to the "delete
later" list (the point of slots variables). In the case of Glyph's example,
because the chain is made up entirely of slots pointers, there is never a
dict, tuple, or list - the only* types that can take part in this trick -
that can be added to the "delete later" list allowing the stack to be
unwound, so the deallocation simply recurses until it runs out of stack.
I can't see an easy way to fix this without changing the way that the
"delete later" list is constructed.
Chewy problem.
Jonathan
* OK, I lied slightly there. Actually frames and traceback objects take part
in this crime as well, but they're not interesting here.
> It appears that GjR is on the case ;-)
...and his alter-ego GvR.
oops.
Jonathan
Na, I assigned the bug to GvR just to irritate him <0.7 wink>: he wrote the
__slots__ deallocation code that's blowing up, yet has also called the funky
trashcan mechanism (which you described well later) "evil" more than once.
The irritating part is that the evil trashcan mechanism is the only hope of
avoiding stack overflow in absurd cases like this.
> ...
> I can't see an easy way to fix this without changing the way that the
> "delete later" list is constructed.
Sometimes we just agree to pretend that we never saw the bug report <wink>.
more-than-one-way-to-close-a-bug-ly y'rs - tim
def doStuff(x):
    o = None
    for i in xrange(x):
        o = foo(o)

# x = 43550
x = 3
doStuff(x)
print "I'm fine."
doStuff(x+1)
print "I'm dead."
>> ...
>> I can't see an easy way to fix this without changing the way that the
>> "delete later" list is constructed.
>
> Sometimes we just agree to pretend that we never saw the bug report <wink>.
Actually I think I just figured out an easy way to fix the problem.
I was thinking that if there had been a liberal sprinkling of dicts in the
chain then it wouldn't overflow. But there's no way of doing that from the
Python side of things.
Then it occurred to me that it could be done from the C side of things:
Surround the slots deallocation in 'subtype_dealloc' with Py_TRASHCAN type
macros with the modification that if the recursion limit is hit, we create a
new tuple and set its items to be the contents of the slots, then DECREF it,
set the slots to NULL, and continue. The tuple deallocating code will see
that the recursion limit has been hit and add itself to the delete later
list.
This has the effect of inserting a container into the chain every few items
which can take part in the stack unwinding and restarting magic.
Simple and effective, no?
If I want to have a go at this, should I just upload a patch to Sourceforge,
or should I email it to someone?
Jonathan
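[For what it's worth, the principle behind Jonathan's proposed fix -- bounding the C stack by handing children off to a heap-allocated container -- can be pushed all the way: with an explicit worklist, a chain of any length can be torn down in constant stack depth. A pure-Python illustration of that principle (the names are mine, not CPython's; the real fix has to work inside subtype_dealloc in C):]

```python
# Illustration: freeing an arbitrarily long chain with an explicit
# worklist instead of recursion, so stack depth stays constant.

class Node(object):
    __slots__ = ['child']
    def __init__(self, child=None):
        self.child = child

def free_chain(obj, freed):
    work = [obj]                 # heap-allocated "delete later" list
    while work:
        node = work.pop()
        freed.append(node)
        if node.child is not None:
            work.append(node.child)
            node.child = None    # detach so nothing else keeps it alive
    return freed

chain = None
for _ in range(100000):
    chain = Node(chain)
freed = free_chain(chain, [])
print(len(freed))
```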