Python might just be holding the memory, waiting for GC to be called.
Carl
I know very little about garbage collection, but Beautiful Soup
objects are very densely interconnected, exactly the sort of object
that a garbage collector would have trouble with. I've written a
method Tag.decompose which recursively disassembles the object graph:
    def decompose(self):
        """Recursively disassembles this object."""
        contents = [i for i in self.contents]
        for i in contents:
            if isinstance(i, Tag):
                i.decompose()
            else:
                i.extract()
        self.extract()
Try it out on your soup objects before letting them go out of scope,
and let me know if it helps your memory usage.
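To see why taking the graph apart can help, here is a minimal, self-contained sketch. It does not use Beautiful Soup at all: `Node`, `build_tree` and `freed_without_gc` are hypothetical stand-ins invented for illustration, and the result relies on CPython's reference counting. It checks whether a child node is released as soon as the last outside reference goes away, with the cyclic garbage collector switched off.

```python
import gc
import weakref


class Node:
    """Toy stand-in for a parse-tree tag: parent and children reference each other."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.contents = []
        if parent is not None:
            parent.contents.append(self)

    def extract(self):
        """Detach this node from its parent, breaking the parent/child cycle."""
        if self.parent is not None:
            self.parent.contents.remove(self)
            self.parent = None

    def decompose(self):
        """Recursively detach every descendant, then this node."""
        for child in list(self.contents):
            child.decompose()
        self.extract()


def build_tree():
    root = Node("html")
    body = Node("body", root)
    Node("p", body)
    return root


def freed_without_gc(decompose_first):
    """Return True if a child node is freed by reference counting alone."""
    gc.disable()  # rule out the cyclic collector for the duration of the check
    try:
        root = build_tree()
        probe = weakref.ref(root.contents[0])  # watch the "body" node
        if decompose_first:
            root.decompose()
        del root
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up whatever cycles were left behind


print(freed_without_gc(decompose_first=True))
print(freed_without_gc(decompose_first=False))
```

With the cycles broken first, nothing is left for the garbage collector to find; without that step the nodes keep each other alive until a collection runs.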
Leonard
I'm not sure about this, but I think DEBUG_LEAK *prevents* unreachable
objects from being gc'ed until the garbage collector actually runs. What
do you see if you don't set DEBUG_LEAK?
Here is some more info:
http://groups.google.com/group/comp.lang.python/msg/e7b1a081c65a79f3
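For what it's worth, the `gc` module documents `DEBUG_LEAK` as a combination of flags that includes `DEBUG_SAVEALL`, which stores unreachable objects in `gc.garbage` instead of freeing them. The sketch below is only an illustration of that one flag (the `Cycle` class and `count_saved` helper are made up for the example), and it uses `DEBUG_SAVEALL` on its own to avoid the debug output that `DEBUG_LEAK` prints.

```python
import gc


class Cycle:
    """An object that refers to itself, so only the cyclic collector can free it."""

    def __init__(self):
        self.me = self


def count_saved(flags):
    """Create one unreachable Cycle and count how many end up in gc.garbage."""
    gc.collect()  # start from a clean slate
    old_flags = gc.get_debug()
    gc.set_debug(flags)
    try:
        Cycle()  # immediately unreachable, but still part of a cycle
        gc.collect()
        return sum(isinstance(obj, Cycle) for obj in gc.garbage)
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()  # drop the saved objects again


print(count_saved(gc.DEBUG_SAVEALL))
print(count_saved(0))
```

If memory only grows while the debug flags are set, the saved objects in `gc.garbage` are a likely cause.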
Kent
It's in SVN HEAD right now (I renamed it to 'dismember'), so it'll be
in the next release.
> Is there any news on the sgmllib.py unicode bug? I am rolling my own
> version at the moment, but I'd like to use an official release if
> possible.
I've never been able to reproduce the bug I think you're talking
about. Can you send me your version and some markup that makes stock
BS fail?
Leonard
>> It's in SVN HEAD right now (I renamed it to 'dismember'), so it'll be
>> in the next release.
Brilliant! It makes such a huge difference to my project :-)
>> > Is there any news on the sgmllib.py unicode bug? I am rolling my own
>> > version at the moment, but I'd like to use an official release if
>> > possible.
>> I've never been able to reproduce the bug I think you're talking
>> about. Can you send me your version and some markup that makes stock
>> BS fail?
Cor, it was ages ago that I came across it. It was when I was downloading a
web site in Cyrillic. It was an Eastern European site, so it used a lot of weird
character sets. None of the pages were UTF-8, so BS was translating a lot of odd
code pages to Unicode.
My version of sgmllib.py has this:
    def convert_charref(self, name):
        """Convert character reference, may be overridden."""
        try:
            n = int(name)
        except ValueError:
            return
        #if not 0 <= n <= 255:
        if not 0 <= n <= 127:  # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)
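As a rough illustration of what that range check does, here is a standalone sketch. It is not the sgmllib code: `charref_to_text` and its `limit` parameter are hypothetical names, and it is written for Python 3, where `chr()` returns text. My understanding is that in the sgmllib of the time, codes 128 to 255 became 8-bit byte strings, which could then fail when mixed with Unicode text.

```python
def charref_to_text(name, limit=127):
    """Convert a decimal character reference body (e.g. "65") to text.

    Returns None when the reference is not a decimal number or falls
    outside 0..limit, so the caller can leave the reference untouched.
    """
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return chr(n)


print(charref_to_text("65"))              # inside the ASCII range
print(charref_to_text("255"))             # rejected with the stricter limit
print(charref_to_text("255", limit=255))  # accepted with the original limit
```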
So I guess you (because you are smarter than me :-) ) could create a page that has
some characters that will raise an error?
I'd guess doing:
<html>
<body>

€
ÿ
Ā
</body>
</html>
may do it? I'm not sure....
Oh, here you go, I found a simpler explanation :-) ::
http://mail.python.org/pipermail/python-bugs-list/2007-February/037082.html
Hope that helps. By the way, do you have a PayPal donate button? I may be able to persuade my boss to chuck some money your way.
Thanks again
monk.e.boy
Kind Regards,
John Glazebrook
This was fixed in 3.0.5. If you look at
BeautifulStoneSoup.convert_charref() you'll see it looks almost
exactly like that, down to the comment.
However from another user I did find a page that BS can't turn into
Unicode: http://domolink.net/. It claims to be UTF-8 but then has
random-looking binary data in the page. I think pages like this are
behind a lot of recent complaints. I haven't been able to resolve this
satisfactorily, and html5lib parses that page okay, so I may write a
BS interface for html5lib or even switch to using html5lib instead of
sgmllib.
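One possible way to cope with a page like that, sketched here purely as an assumption and not as what BS or html5lib actually does: try the declared encoding strictly, and if that fails, decode again with replacement characters and report that the declaration could not be trusted. The `decode_declared` helper is a made-up name for the example.

```python
def decode_declared(data, declared="utf-8"):
    """Decode bytes using the encoding the page claims to use.

    Returns (text, clean). clean is False when the bytes did not match the
    declared encoding and invalid bytes were replaced with U+FFFD.
    """
    try:
        return data.decode(declared), True
    except UnicodeDecodeError:
        return data.decode(declared, errors="replace"), False


text, clean = decode_declared(b"<p>ok</p>\xff\xfe")
print(clean)
print(repr(text))
```

The second return value lets the caller decide whether to accept the lossy text or try a different encoding guess.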
> Hope that helps, by the way, do you have a paypal donate thing? I may be able to persuade my boss to chuck some money your way.
I've put up a donate button on the main BS site.
Leonard
BTW, chaining queries would be very cool. Have you guys played with jQuery? It does this sort of chaining very well...