Memory leak?


monk.e.boy

Dec 13, 2007, 11:43:48 AM
to beautifulsoup
Hi guys,

Thanks for BeautifulSoup, it has saved me a ton of time, and the
code is teaching me a lot :-)

We use BS to download a fair bit of data, and I noticed that it
leaks quite a bit of memory the way I use it. I wrote a simple test:


#
# test for memory leaks:
#
import gc
from dump_garbage import *
from BeautifulSoup import BeautifulSoup

gc.enable()
gc.set_debug(gc.DEBUG_LEAK)

#
for i in range(2):
    print 'test %d' % i
    b = BeautifulSoup()
    print b
    del b

dump_garbage()



here is the dump_garbage file (I found this on the internet
somewhere):


import gc

def dump_garbage():
    """
    show us what's the garbage about
    """

    # force collection
    print "\nGARBAGE:"
    gc.collect()

    print "\nGARBAGE OBJECTS:"
    for x in gc.garbage:
        s = str(x)
        if len(s) > 80: s = s[:80]
        print type(x), "\n ", s

if __name__ == "__main__":
    import gc
    gc.enable()
    gc.set_debug(gc.DEBUG_LEAK)

    # make a leak
    l = []
    l.append(l)
    del l

    # show the dirt ;-)
    dump_garbage()


Does anyone know if it really is BS that is leaking memory? Could
anyone help me please :-)

monk.e.boy

http://teethgrinder.co.uk/open-flash-chart/

monk.e.boy

Dec 19, 2007, 9:21:22 AM
to beautifulsoup
I have put together a nice simple test file (see end of post). When I
run it against my test HTML page on Windows XP, I can watch the amount
of memory go up by about 2,000K for each page loaded. So I load this
in IDLE:

13,584K
run the python
30,548K
run the python
45,608K
run the python
60,515K

So there is my proof. I can't see that it is anything apart from BS
that is leaking. How can I find the leaks and, more importantly, if I
do find them, will anyone apply the patches?

Here is the test code:


import sys

if __name__ == "__main__":
    # hack to get Beautiful Soup imported
    sys.path.append("C:\\py")

from BeautifulSoup import BeautifulSoup
from BeautifulSoup import NavigableString
import urllib2

url = 'http://dev/mem-1.html'

def soup(html):
    soup = BeautifulSoup(html)

    for css in soup('link'):
        found = False
        for c in css.attrs:
            if c[0] == 'rel' and c[1] == 'stylesheet':
                found = True
        if found:
            print 'css: ' + str(css)


def go(url):
    request = urllib2.Request(url)
    request.add_header('Accept', 'text/html')
    opener = urllib2.build_opener()
    f = opener.open(request)
    soup(f.read())
    del f

for x in range(15):
    print x
    go(url)

Carl Zmola

Dec 19, 2007, 9:27:55 AM
to beauti...@googlegroups.com
What version of Python are you using?
I didn't check your code for errors, but older versions of Python (even
2.4) did not release memory as fast as they might have.
You do need to make sure that Garbage Collection is run to actually
prove a memory leak exists.

Python might just be holding the memory, waiting for GC to be called.
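Carl's point is easy to demonstrate in isolation. A minimal sketch (written for modern Python 3, unlike the 2.5-era code elsewhere in this thread): with automatic collection switched off, cyclic garbage simply piles up until gc.collect() is called by hand.

```python
import gc

class Node:
    """Tiny object that can take part in a reference cycle."""
    def __init__(self):
        self.ref = None

gc.disable()              # stop automatic collection so garbage visibly piles up
for _ in range(1000):
    a, b = Node(), Node()
    a.ref, b.ref = b, a   # a <-> b cycle: reference counting alone cannot free it
    del a, b
found = gc.collect()      # run the collector by hand
gc.enable()
print(found)              # thousands of objects were just sitting there, pending GC
```

Until that manual collect, a memory viewer would report all of those cycles as "leaked", even though they are merely pending collection.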


Carl

monk.e.boy

Dec 19, 2007, 10:05:42 AM
to beautifulsoup
OK, here is my previous code modified so that it now runs the garbage
collector:

url = 'http://uk.msn.com/'
import gc
gc.enable()
gc.set_debug(gc.DEBUG_LEAK)

for x in range(1):
    print x
    go(url)

gc.collect()
for x in gc.garbage:
    s = str(x)
    if len(s) > 80: s = s[:80]
    print type(x), "\n ", s

I used msn as the URL because it shows a lot of leaks on my machine.
I am using Python 2.5 on WinXP. I know something leaks because my
2-gig Linux server running my BS scraper chews up all the RAM after a
few thousand pages :-(

I thought Python2.5 was supposed to fix circular references? Or does
it only work to a certain depth/complexity?
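For what it's worth, the collector has no depth limit: it traces arbitrarily large cycles. A quick Python 3 sketch building a single 10,000-link cycle (the classic thing that *does* defeat the Python 2 collector is a cycle whose objects define `__del__`; those land in gc.garbage instead of being freed):

```python
import gc

gc.disable()
# Build one big cycle: a 10,000-link chain whose tail points back to the head.
head = node = {"prev": None}
for _ in range(10_000):
    node = {"prev": node}
head["prev"] = node        # close the loop
del head, node
found = gc.collect()       # the collector walks the whole cycle; depth is no obstacle
gc.enable()
print(found)               # every dict in the chain was found and freed
```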

Any help would be appreciated :-) I am now trying to code my own
scraper which isn't as much fun as using BS is ;-)

monk.e.boy

Leonard Richardson

Dec 19, 2007, 1:09:31 PM
to beauti...@googlegroups.com
> I thought Python2.5 was supposed to fix circular references? Or does
> it only work to a certain depth/complexity?

I know very little about garbage collection, but Beautiful Soup
objects are very densely interconnected, exactly the sort of object
that a garbage collector would have trouble with. I've written a
method Tag.decompose which recursively disassembles the object graph:

def decompose(self):
    """Recursively disassembles this object."""
    contents = [i for i in self.contents]
    for i in contents:
        if isinstance(i, Tag):
            i.decompose()
        else:
            i.extract()
    self.extract()

Try it out on your soup objects before letting them go out of scope,
and let me know if it helps your memory usage.

Leonard

Kent Johnson

Dec 19, 2007, 2:31:43 PM
to beauti...@googlegroups.com
monk.e.boy wrote:
> Hi guys,
>
> Thanks for BeautifulSoup, it has saved me a ton of time, and the
> code is teaching me a lot :-)
>
> We use BS to download a fair bit of data, and I noticed that it
> leaks quite a bit of memory the way I use it. I wrote a simple test:
>
>
> #
> # test for memory leaks:
> #
> import gc
> from dump_garbage import *
> from BeautifulSoup import BeautifulSoup
>
> gc.enable()
> gc.set_debug(gc.DEBUG_LEAK)

I'm not sure about this, but I think DEBUG_LEAK *prevents* unreachable
objects from being gc'ed until the garbage collector actually runs. What
do you see if you don't set DEBUG_LEAK?

Here is some more info:
http://groups.google.com/group/comp.lang.python/msg/e7b1a081c65a79f3
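Kent is right about the mechanism: DEBUG_LEAK includes DEBUG_SAVEALL, which tells the collector to move everything it finds into gc.garbage instead of freeing it, so the debug flag itself makes memory appear to grow. A Python 3 sketch:

```python
import gc

gc.collect()                      # start from a clean slate
del gc.garbage[:]
gc.set_debug(gc.DEBUG_SAVEALL)    # the part of DEBUG_LEAK Kent is describing
cycle = []
cycle.append(cycle)               # a one-object cycle
del cycle
gc.collect()
saved = len(gc.garbage)           # the "leaked" list is held here, on purpose
gc.set_debug(0)                   # restore normal collection
del gc.garbage[:]
```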

Kent

monk.e.boy

Dec 20, 2007, 5:53:53 AM
to beautifulsoup
OK, I ran my tests **without** your change:

Open IDLE: 13,732K
Run 1: 30,176K
Run 2: 45,132K
Run 3: 60,040K
Run 4: 74,928K

Now I enable your changes:

Open IDLE: 13,732K
Run 1: 16,824K
Run 2: 17,160K
Run 3: 17,232K
Run 4: 17,284K

So that looks pretty convincing to me ;-)

I tried to add:

def __del__(self):
    self.decompose()

but this didn't work, I guess because __del__ is only called when the
object reference count is zero... hm...
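Exactly: an object held in a cycle keeps a nonzero refcount, so `__del__` never fires (and in Python 2, defining `__del__` on cyclic objects makes the whole cycle uncollectable). Explicit teardown is more reliable. A sketch with a stand-in tag class (`FakeTag` and `scrape` are invented for illustration; only `decompose` mirrors Leonard's patch):

```python
class FakeTag:
    """Stand-in for a parse-tree node; real BS tags are cross-linked like this."""
    def __init__(self):
        self.parent = None
        self.contents = []

    def append(self, child):
        child.parent = self          # parent <-> child: a reference cycle
        self.contents.append(child)

    def decompose(self):
        """Recursively break the parent/child cycles."""
        for child in list(self.contents):
            child.decompose()
        self.contents = []
        self.parent = None

def scrape(root):
    try:
        return len(root.contents)    # stand-in for the real query work
    finally:
        root.decompose()             # always break the cycles, even on error

root = FakeTag()
root.append(FakeTag())
n = scrape(root)
```

With try/finally the teardown runs deterministically, instead of hoping the collector or `__del__` gets around to it.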

Thank you Leonard!! You really helped me out :-) any idea when/if
this will be in a release?

Is there any news on the sgmllib.py unicode bug? I am rolling my own
version at the moment, but I'd like to use an official release if
possible.

Thanks again :-)

monk.e.boy

--------------------------------

For future generations and Google searchers:

To run the new test you need to add:

#
# patch from: http://groups.google.com/group/beautifulsoup/browse_thread/thread/36a734dc6f8d2ce2
#
def decompose(self):
    """Recursively disassembles this object."""
    contents = [i for i in self.contents]
    for i in contents:
        if isinstance(i, Tag):
            i.decompose()
        else:
            i.extract()
    self.extract()

to line 418 of BeautifulSoup.py.

Save the following test to a file (note the new soup.decompose()
line), open your memory viewer (e.g. Task Manager) then run the python
test:


import sys
import gc
from BeautifulSoup import BeautifulSoup
from BeautifulSoup import NavigableString
import urllib2

url = 'http://dev/mem-1.html'


def soup(html):
    soup = BeautifulSoup(html)

    for css in soup('link'):
        found = False
        for c in css.attrs:
            if c[0] == 'rel' and c[1] == 'stylesheet':
                found = True
        if found:
            print 'css: ' + str(css)

    soup.decompose()


def go(url):
    request = urllib2.Request(url)
    request.add_header('Accept', 'text/html')
    opener = urllib2.build_opener()
    f = opener.open(request)
    soup(f.read())
    del f


#gc.enable()
#gc.set_debug(gc.DEBUG_LEAK)

for x in range(15):
    print x
    go(url)

#gc.collect()
#print len(gc.garbage)


#for x in gc.garbage:
#    s = str(x)
#    if len(s) > 80: s = s[:80]
#    print type(x), "\n ", s





Leonard Richardson

Dec 20, 2007, 1:41:40 PM
to beauti...@googlegroups.com
> Thank you Leonard!! You really helped me out :-) any idea when/if
> this will be in a release?

It's in SVN HEAD right now (I renamed it to 'dismember'), so it'll be
in the next release.

> Is there any news on the sgmllib.py unicode bug? I am rolling my own
> version at the moment, but I'd like to use an official release if
> possible.

I've never been able to reproduce the bug I think you're talking
about. Can you send me your version and some markup that makes stock
BS fail?

Leonard

John Glazebrook

Dec 21, 2007, 6:47:59 AM
to beauti...@googlegroups.com
>> > Thank you Leonard!! You really helped me out :-) any idea when/if
>> > this will be in a release?

>> It's in SVN HEAD right now (I renamed it to 'dismember'), so it'll be
>> in the next release.

Brilliant! It makes such a huge difference to my project :-)

>> > Is there any news on the sgmllib.py unicode bug? I am rolling my own
>> > version at the moment, but I'd like to use an official release if
>> > possible.

>> I've never been able to reproduce the bug I think you're talking
>> about. Can you send me your version and some markup that makes stock
>> BS fail?

Cor, it was ages ago that I came across it. It was when I was downloading a
web site in Cyrillic. It was an eastern European site with a lot of weird
character sets; none of the pages were UTF-8, so BS was translating a lot of
odd code pages to Unicode.

My version of sgmllib.py has this:

def convert_charref(self, name):
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    #if not 0 <= n <= 255:
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_codepoint(n)
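The troublesome range is 128-255: numerically those are Latin-1/C1 codes, but pages almost always mean Windows-1252, which is why converting them naively produced garbage. As a point of comparison (not something 2007-era BS used), modern Python's html.unescape resolves numeric references the way browsers do:

```python
from html import unescape

assert unescape("&#65;") == "A"          # plain ASCII is uncontroversial
assert unescape("&#151;") == "\u2014"    # 151 is remapped via Windows-1252: an em dash
assert unescape("&#256;") == "\u0100"    # codepoints above 255 convert directly
```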

So I guess you (because you are smarter than me :-) ) could create a page that has
some characters that will raise an error?

I'd guess doing:

<html>
<body>
&#127;
&#128;
&#255;
&#256;
</body>
</html>

may do it? I'm not sure....

Oh, here you go, I found a simpler explanation :-) :
http://mail.python.org/pipermail/python-bugs-list/2007-February/037082.html

Hope that helps, by the way, do you have a paypal donate thing? I may be able to persuade my boss to chuck some money your way.

Thanks again

monk.e.boy


>> Leonard

Kind Regards,

John Glazebrook



Leonard Richardson

Dec 21, 2007, 7:33:43 PM
to beauti...@googlegroups.com
> def convert_charref(self, name):
>     """Convert character reference, may be overridden."""
>     try:
>         n = int(name)
>     except ValueError:
>         return
>     #if not 0 <= n <= 255:
>     if not 0 <= n <= 127:  # ASCII ends at 127, not 255
>         return
>     return self.convert_codepoint(n)

This was fixed in 3.0.5. If you look at
BeautifulStoneSoup.convert_charref() you'll see it looks almost
exactly like that, down to the comment.

However from another user I did find a page that BS can't turn into
Unicode: http://domolink.net/. It claims to be UTF-8 but then has
random-looking binary data in the page. I think pages like this are
behind a lot of recent complaints. I haven't been able to resolve this
satisfactorily, and html5lib parses that page okay, so I may write a
BS interface for html5lib or even switch to using html5lib instead of
sgmllib.
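A sketch of the fallback idea for pages that lie about their encoding (my illustration, not BS code; html5lib and BS's own UnicodeDammit do far more sophisticated detection): decode with the declared charset, substituting U+FFFD for the undecodable spans rather than aborting the whole parse.

```python
def tolerant_decode(data: bytes, declared: str = "utf-8") -> str:
    """Decode with the declared charset, replacing bad bytes with U+FFFD."""
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        # The page lied about its encoding; keep what we can.
        return data.decode(declared, errors="replace")

good = "caf\u00e9".encode("utf-8")
bad = b"claims utf-8 \xff\xfe but carries raw binary"
clean, salvaged = tolerant_decode(good), tolerant_decode(bad)
```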

> Hope that helps, by the way, do you have a paypal donate thing? I may be able to persuade my boss to chuck some money your way.

I've put up a donate button on the main BS site.

Leonard

John Glazebrook

Jan 2, 2008, 7:24:25 AM
to beauti...@googlegroups.com
Sorry people :-( my pointy-haired boss 'helpfully' turned on my out-of-office reply.

BTW, chaining queries would be very cool. Have you guys played with jQuery? It does this sort of chaining very well...

marty

Feb 22, 2008, 12:54:27 PM
to beautifulsoup
Hello all,

I'm a bit confused about this dismember function. First of all, the
3.0.5 version on your site has neither "decompose" nor "dismember",
so I guess it didn't make it into the latest release.

What confuses me more is that you explicitly said you named it
"dismember", but according to revision r25 in SVN the method is
called "decompose" :x

Am I missing something, or what is going on here? :)

Cheers,
Martin