Re: Memory leak on 64-bit machines?


Wheaton Little

Jul 16, 2012, 10:05:53 PM7/16/12
to beauti...@googlegroups.com
Did you figure this out? I found this question really interesting but
don't know enough to help out.

On Sun, Jul 15, 2012 at 3:33 AM, Romy Maxwell <romy.m...@gmail.com> wrote:
> Running 4.1.0 (pip installed) on Ubuntu 11 & 12 machines (3.4.2-x86_64),
> I've isolated BeautifulSoup (both 3 and 4) as the cause of endless memory
> leakage while using it to clean HTML (removing script tags, adding
> nofollow, etc.). On one machine the processes continue to bloat until the
> OOM killer steps in and/or the machine reboots. As a stopgap, I had to use
> supervisord with the memmon module to restart procs bloating over 100 MB
> RSS, which happened fairly rapidly.
>
> Removing said code stops the memory bloat entirely. I've tested
> lxml.html.soupparser, which uses BS, and it bloats as well. Using
> lxml.html.clean and its built-in clean_html function (which does NOT rely
> on BS) shows no memory leakage, however. Has anyone else been seeing
> anything like this? I've done lots of googling and have come up
> empty-handed.
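
(For reference, a minimal sketch of the kind of cleanup pass described
above; this is not the actual code from the report, and the function
name and details are assumed:)

from bs4 import BeautifulSoup

def sanitize(untrusted_html):
    # Hypothetical cleanup pass: drop <script> tags, add rel="nofollow".
    soup = BeautifulSoup(untrusted_html)
    for script in soup.find_all("script"):
        script.decompose()        # remove <script> elements entirely
    for a in soup.find_all("a"):
        a["rel"] = "nofollow"     # mark links as nofollow
    return str(soup)

print(sanitize("<p><script>x()</script><a href='http://example.com'>x</a></p>"))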

Aaron DeVore

Jul 17, 2012, 4:46:29 PM7/17/12
to beauti...@googlegroups.com
This can happen when the Python interpreter has trouble deallocating
the many circular references in a Beautiful Soup tree. The decompose()
method explicitly breaks the links in a tree, allowing the interpreter
to deallocate the whole tree easily. Example use:

from bs4 import BeautifulSoup

dom = BeautifulSoup(untrusted_html)
# ... operations on the tree ...
dom.decompose()  # break parent/child links so the tree can be freed
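
(To illustrate the circular references in question -- a small sketch,
using only the standard bs4 API:)

import gc
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a href='#'>hi</a></p>", "html.parser")
a = soup.p.a
assert a.parent is soup.p      # child holds a reference to its parent...
assert a in soup.p.contents    # ...and the parent to the child: a cycle

soup.decompose()               # break the links explicitly
del soup, a
gc.collect()                   # nothing cyclic should remain to collect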

Romy Maxwell

Jul 23, 2012, 4:53:41 AM7/23/12
to beauti...@googlegroups.com
Yup, this was actually the first thing I tried, as suggested by an old thread somewhere, and the result was the same.

There's almost certainly a bug in the 64-bit version. Should I file a bug report?


On Tuesday, July 17, 2012 1:46:29 PM UTC-7, Aaron DeVore wrote:
> This can happen when the Python interpreter has trouble deallocating
> the many circular references in a Beautiful Soup tree. The decompose()
> method explicitly breaks the links in a tree, allowing the interpreter
> to deallocate the whole tree easily. Example use:
>
> from bs4 import BeautifulSoup
>
> dom = BeautifulSoup(untrusted_html)
> # ... operations on the tree ...
> dom.decompose()

Leonard Richardson

Jul 23, 2012, 10:16:10 AM7/23/12
to beauti...@googlegroups.com
I'm not very good at diagnosing memory leaks, but let's give this a
try. I've attached a script that should demonstrate the problem
whenever it exists. I ran it like this on my 32-bit Python
installation:

$ python memory_test.py lxml 10000
$ python memory_test.py html5lib 10000
$ python memory_test.py html.parser 10000

The results didn't show anything indicating a memory leak. Memory
usage on my system didn't increase noticeably while the script was
running, gc.garbage was always empty, and after gc.collect() the
object counts and reference counts always went back down to the
baseline level. When I enabled GC debugging with gc.set_debug(), I saw
a lot of objects being collected, as I'd expect, and nothing else.
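
(The attached memory_test.py isn't reproduced in this archive. A rough
sketch of the sort of measurement loop described above, with assumed
details:)

import gc
from bs4 import BeautifulSoup

# gc.set_debug(gc.DEBUG_LEAK)  # uncomment to log uncollectable objects

def run_once(markup, parser):
    soup = BeautifulSoup(markup, parser)
    soup.decompose()

markup = "<html><body><p>some text</p></body></html>"
baseline = len(gc.get_objects())

for _ in range(10000):
    run_once(markup, "html.parser")

gc.collect()
print("gc.garbage:", len(gc.garbage))
print("object delta:", len(gc.get_objects()) - baseline)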

Since I couldn't duplicate the problem, it didn't make any difference
whether or not I called decompose() on the BeautifulSoup object--that
object got GCed either way.

I then ran the script under Python 3.

$ python3 memory_test.py lxml 10000
$ python3 memory_test.py html.parser 10000

The results were the same: no noticeable increase in memory usage,
nothing in gc.garbage, and no leftover objects after the gc.collect()
call.

I then modified the script slightly so it would work against Beautiful
Soup 3. Again the results were the same.

Some possible reasons why I don't see the problem:

* I'm using the 32-bit version of Python, not the 64-bit version.
* The leaky code path may only be triggered by specific markup.
* My debugging code may have missed something.

> There's almost certainly a bug in the 64-bit version. Should I file a bug
> report?

Beautiful Soup is pure Python, so there's no 64-bit version per se.
I'm not sure if that's what you meant, or if you meant the 64-bit
version of Python itself.
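
(One quick way to check which Python build is in use, using only the
standard library:)

import platform
import struct

print(struct.calcsize("P") * 8, "bit")  # pointer size in bits: 32 or 64
print(platform.architecture()[0])       # e.g. '64bit'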

I see three places where there might be a memory leak (and of course
there might be multiple memory leaks):

1. Python
2. The lxml C extension
3. Beautiful Soup

If #3 were true, we would expect my test script to bloat, even on
32-bit Python. But I don't see that.

If #2 were true, we would expect your BS3 test script not to bloat,
since BS3 never uses lxml. But you report that it does bloat.

(The fact that lxml.html.clean doesn't show the problem doesn't
eliminate #2 from consideration. Beautiful Soup has found bugs in lxml
before, notably https://bugs.launchpad.net/lxml/+bug/963936 and
https://bugs.launchpad.net/lxml/+bug/984936.)

So I think the most likely explanation is #1. But I anticipate an
uphill battle convincing the Python devs to accept a bug report based
on this data.

I would like you to answer these questions:

1. Can you duplicate the problem with your real-world markup on a
32-bit Python installation? (assuming you have one available)

2. What parser are you using when you get the problem with BS4? Is it
lxml? Does the problem go away if you use a different parser? (A
parser-selection snippet follows these questions, for reference.)

3. I'd like to see you run the attached script on your 64-bit
installation using html5lib, lxml, and html.parser.

3a. What are the results?

3b. Does the problem show up for some parsers but not others?

3c. Does the problem manifest itself, but the script diagnostic says
there's nothing wrong?

3d. Does it make a difference if you set decompose_after_parse=True?

3e. Do you see anything odd in the debug messages if you uncomment the line:
# gc.set_debug(gc.DEBUG_LEAK)

4. Let's suppose you run my script and the problem doesn't manifest.
Can you trigger it by using real markup instead of the markup
generated by my random_markup() function?

5. I'd also like to see you run the attached script on your 64-bit
installation under Python 3, using both lxml and html.parser. What are
the results?
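
(For reference, regarding question 2: the parser is selected via the
second argument to the BeautifulSoup constructor. A minimal example,
assuming lxml and html5lib are installed:)

from bs4 import BeautifulSoup

markup = "<p>some text</p>"
soup_lxml   = BeautifulSoup(markup, "lxml")         # lxml's C parser
soup_h5lib  = BeautifulSoup(markup, "html5lib")     # pure-Python html5lib
soup_stdlib = BeautifulSoup(markup, "html.parser")  # stdlib HTMLParser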

Leonard
[Attachment: memory_test.py]