Memory leak on 64-bit machines ?
From: Romy Maxwell <romy.maxw...@gmail.com>
Date: Sat, 14 Jul 2012 12:33:00 -0700 (PDT)
Subject: Memory leak on 64-bit machines ?

Running 4.1.0 (pip installed) on Ubuntu 11 & 12 machines (3.4.2-x86_64),
isolated BeautifulSoup (both 3 and 4) as the cause of endless memory
leakage while using BeautifulSoup to clean HTML <http://pastebin.com/e34n7jtz> (remove script tags, add nofollow, etc.). On one machine the processes
continue to bloat until OOM killer steps in and/or the machine reboots. As
a stopgap, I had to use supervisord with the memmon module to restart procs
bloating over 100M RSS, which happened fairly rapidly.

Removing that code stops the memory bloat entirely. I've tested
lxml.html.soupparser, which uses BS, and it bloats as well. Using
lxml.html.clean and its built-in clean_html function (which does NOT rely
on BS) shows no memory leakage, however. Has anyone else seen anything
like this? I've done lots of googling and have come up empty-handed.
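
One way to confirm the kind of growth described above is to log the process's peak resident set size between batches of work. This is a minimal sketch using only the standard library, not the code from the pastebin link. The names `peak_rss_kb`, `track_growth`, and `run_batch` are hypothetical, and the kilobyte unit for `ru_maxrss` is an assumption that holds on Linux (macOS reports bytes).

```python
import resource


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_growth(run_batch, batches=5):
    """Call `run_batch` repeatedly and record peak RSS after each call."""
    samples = []
    for _ in range(batches):
        run_batch()
        samples.append(peak_rss_kb())
    return samples


# Hypothetical workload; replace it with the real HTML-cleaning call.
samples = track_growth(lambda: [str(i) for i in range(100000)])
print(samples)
```

Because `ru_maxrss` is a high-water mark, the samples never decrease. A leak would show up as a peak that keeps climbing batch after batch instead of levelling off.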


 
From: Wheaton Little <wheatont...@gmail.com>
Date: Tue, 17 Jul 2012 10:05:53 +0800
Local: Mon, Jul 16 2012 10:05 pm
Subject: Re: Memory leak on 64-bit machines ?
Did you figure this out? I found this question really interesting but
don't know enough to help out.


 
From: Aaron DeVore <aaron.dev...@gmail.com>
Date: Tue, 17 Jul 2012 13:46:29 -0700
Local: Tues, Jul 17 2012 4:46 pm
Subject: Re: Memory leak on 64-bit machines ?
This can happen when the Python interpreter has trouble deallocating
the many circular references in a Beautiful Soup tree. The method
tag.decompose() explicitly breaks the links in a tree, so the
interpreter can deallocate the whole tree easily. Example use:

from bs4 import BeautifulSoup  # Beautiful Soup 4

dom = BeautifulSoup(untrusted_html)
# operations...
dom.decompose()  # break the tree's internal links
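
The effect described above can be illustrated without Beautiful Soup at all. The sketch below uses a hypothetical `Node` class with parent and child links; its `decompose` method is a toy written for this example, not Beautiful Soup's implementation. It assumes CPython, where objects outside reference cycles are freed by reference counting as soon as the last reference goes away.

```python
import gc
import weakref


class Node:
    """Toy tree node with parent and child links (hypothetical)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        """Recursively break parent/child links (illustrative only)."""
        for child in self.children:
            child.decompose()
        self.children = []
        self.parent = None


def build_tree():
    root = Node()
    Node(root)
    Node(root)
    return root


gc.disable()  # keep the automatic cycle collector out of the demo

# Case 1: drop the tree without breaking the cycles.
root = build_tree()
probe = weakref.ref(root)
del root
alive_before_collect = probe() is not None  # cycles keep it alive
gc.collect()
alive_after_collect = probe() is not None   # the collector reclaims it

# Case 2: break the links first, so reference counting frees the tree.
root = build_tree()
probe = weakref.ref(root)
root.decompose()
del root
alive_after_decompose = probe() is not None

gc.enable()
print(alive_before_collect, alive_after_collect, alive_after_decompose)
# prints: True False False
```

In the first case the tree survives until the cycle collector runs. In the second it is gone as soon as the last outside reference is dropped, which is why calling decompose() is the usual first thing to try.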


 
From: Romy Maxwell <romy.maxw...@gmail.com>
Date: Mon, 23 Jul 2012 01:53:41 -0700 (PDT)
Local: Mon, Jul 23 2012 4:53 am
Subject: Re: Memory leak on 64-bit machines ?

Yup, this was actually the first thing I tried, as suggested by an old
thread somewhere, and the result was the same.

There's almost certainly a bug in the 64-bit version. Should I file a bug
report ?


 
From: Leonard Richardson <leona...@segfault.org>
Date: Mon, 23 Jul 2012 10:16:10 -0400
Local: Mon, Jul 23 2012 10:16 am
Subject: Re: Memory leak on 64-bit machines ?

I'm not very good at diagnosing memory leaks, but let's give this a
try. I've attached a script that should demonstrate the problem
whenever it exists. I ran it like this on my 32-bit Python
installation:

$ python memory_test.py lxml 10000
$ python memory_test.py html5lib 10000
$ python memory_test.py html.parser 10000

The results didn't show anything indicating a memory leak. Memory
usage on my system didn't increase noticeably while the script was
running, gc.garbage was always empty, and after gc.collect() the
object counts and reference counts always went down to the baseline
level. When I enabled gc.debug, I saw a lot of objects being
collected, as I'd expect, and nothing else.
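
The checks described in this paragraph can be sketched roughly as follows. This is not the attached memory_test.py, which is not reproduced in the thread; `leak_check` and `cyclic_workload` are hypothetical names, and the object count from `gc.get_objects()` covers only objects tracked by the collector.

```python
import gc


def leak_check(workload, iterations=100):
    """Run `workload` repeatedly, then compare tracked-object counts."""
    gc.collect()
    baseline = len(gc.get_objects())
    for _ in range(iterations):
        workload()
    gc.collect()
    after = len(gc.get_objects())
    return {
        "baseline": baseline,
        "after": after,
        "growth": after - baseline,
        "uncollectable": len(gc.garbage),
    }


def cyclic_workload():
    """Hypothetical workload: builds a reference cycle and drops it."""
    a, b = {}, {}
    a["other"], b["other"] = b, a


report = leak_check(cyclic_workload)
print(report)
```

A healthy workload should leave `growth` near zero and `uncollectable` at zero after the final `gc.collect()`. Growth that scales with `iterations` suggests something is still holding references.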

Since I couldn't duplicate the problem, it didn't make any difference
whether or not I called decompose() on the BeautifulSoup object--that
object got GCed either way.

I then ran the script under Python 3.

$ python3 memory_test.py lxml 10000
$ python3 memory_test.py html.parser 10000

The results were the same: no noticeable memory usage, no gc.garbage,
and no leftover objects after the gc.collect() call.

I then modified the script slightly so it would work against Beautiful
Soup 3. Again the results were the same.

Some possible reasons why I don't see the problem:

* I'm using the 32-bit version of Python, not the 64-bit version.
* The leaky code path may only be triggered by specific markup.
* My debugging code may have missed something.

> There's almost certainly a bug in the 64-bit version. Should I file a bug
> report ?

Beautiful Soup is pure Python, so there's no 64-bit version per se.
I'm not sure if that's what you meant, or if you meant the 64-bit
version of Python itself.

I see three places where there might be a memory leak (and of course
there might be multiple memory leaks):

1. Python
2. The lxml C extension
3. Beautiful Soup

If #3 were true, we would expect my test script to bloat, even on the
32-bit Python. But it doesn't.

If #2 were true, we would expect your BS3 test script not to bloat,
since BS3 never uses lxml. But it does bloat.

(The fact that lxml.html.clean doesn't show the problem doesn't
eliminate #2 from consideration. Beautiful Soup has found bugs in lxml
before, notably https://bugs.launchpad.net/lxml/+bug/963936 and
https://bugs.launchpad.net/lxml/+bug/984936.)

So I think the most likely explanation is #1. But I anticipate an
uphill battle convincing the Python devs to accept a bug report based
on this data.

I would like you to answer these questions:

1. Can you duplicate the problem with your real-world markup on a
32-bit Python installation? (assuming you have one available)

2. What parser are you using when you get the problem with BS4? Is it
lxml? Does the problem go away if you use a different parser?

3. I'd like to see you run the attached script on your 64-bit
installation using html5lib, lxml, and html.parser.

3a. What are the results?

3b. Does the problem show up for some parsers but not others?

3c. Does the problem manifest itself, but the script diagnostic says
there's nothing wrong?

3d. Does it make a difference if you set decompose_after_parse=True?

3e. Do you see anything odd in the debug messages if you uncomment the line:
 # gc.set_debug(gc.DEBUG_LEAK)

4. Let's suppose you run my script and the problem doesn't manifest.
Can you reproduce the problem by using real markup instead of the
markup generated by my random_markup() function?

5. I'd also like to see you run the attached script on your 64-bit
installation under Python 3, using both lxml and html.parser. What are
the results?
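
For question 3e, it helps to know what the debug flags change. `gc.DEBUG_LEAK` combines `DEBUG_COLLECTABLE`, `DEBUG_UNCOLLECTABLE`, and `DEBUG_SAVEALL`, so while it is set, everything the collector finds unreachable is kept in `gc.garbage` instead of being freed. The sketch below uses `DEBUG_SAVEALL` alone to avoid the stderr output; the `Tag` class is a hypothetical stand-in for a tree node.

```python
import gc


class Tag:
    """Hypothetical stand-in for a tree node that points at itself."""

    def __init__(self):
        self.me = self


gc.collect()                      # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)    # save unreachable objects in gc.garbage
Tag()                             # create a cycle and drop it immediately
gc.collect()
saved = [obj for obj in gc.garbage if isinstance(obj, Tag)]

gc.set_debug(0)                   # restore normal behaviour
gc.garbage.clear()                # release the saved objects
print(len(saved))
```

A non-empty `gc.garbage` under these flags is therefore expected and lists ordinary collectable cycles. It only points to uncollectable objects when the debug flags are off.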

Leonard

Attachment: memory_test.py (2K)

 