I'm not very good at diagnosing memory leaks, but let's give this a
try. I've attached a script that should demonstrate the problem
whenever it exists. I ran it like this on my 32-bit Python
$ python memory_test.py lxml 10000
$ python memory_test.py html5lib 10000
$ python memory_test.py html.parser 10000
The results didn't show anything indicating a memory leak. Memory
usage on my system didn't increase noticeably while the script was
running, gc.garbage was always empty, and after gc.collect() the
object counts and reference counts always went down to the baseline
level. When I enabled gc.debug, I saw a lot of objects being
collected, as I'd expect, and nothing else.
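As a rough sketch of the kind of check involved (this is not the attached memory_test.py; the function names and markup below are made up, and the stdlib html.parser stands in for Beautiful Soup), the loop parses repeatedly and then compares live-object counts against a post-collect baseline:

```python
import gc
from html.parser import HTMLParser

def check_for_leak(parse, markup, iterations):
    """Parse `markup` repeatedly, then see whether the collector gets
    everything back: gc.garbage stays empty and the live-object count
    returns to (roughly) the baseline."""
    # gc.set_debug(gc.DEBUG_STATS)  # uncomment for per-collection detail
    gc.collect()
    baseline = len(gc.get_objects())
    for _ in range(iterations):
        parse(markup)
    gc.collect()
    return {
        "garbage": len(gc.garbage),                  # uncollectable objects
        "growth": len(gc.get_objects()) - baseline,  # leftover live objects
    }

def parse_stdlib(markup):
    # Stand-in for BeautifulSoup(markup, "html.parser")
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()

report = check_for_leak(parse_stdlib, "<p>hello</p>" * 50, 200)
```

On a leak-free run, "garbage" should be zero and "growth" close to zero (a small positive number is normal, since module-level caches warm up during the first parse).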
Since I couldn't duplicate the problem, it didn't make any difference
whether or not I called decompose() on the BeautifulSoup object--that
object got GCed either way.
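For context, decompose() matters because the parse tree is full of parent/child reference cycles. A toy stand-in (not Beautiful Soup's actual classes) shows the idea: once the cycles are broken by hand, plain reference counting frees the tree immediately, without waiting for the cycle collector.

```python
import weakref

class Node:
    """Toy parse-tree node with the same parent/child cycles a soup has."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        # Break the cycles explicitly, the way BS4's decompose() does
        for child in list(self.children):
            child.decompose()
        self.children.clear()
        self.parent = None

root = Node()
leaves = [Node(root) for _ in range(10)]
probe = weakref.ref(root)

root.decompose()
del root, leaves
# With the cycles gone, refcounting alone reclaims the tree at once,
# so the weak reference is already dead -- no gc.collect() needed.
```

On CPython the cycle collector reclaims such trees anyway, which is consistent with decompose() making no difference in my runs.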
I then ran the script under Python 3.
$ python3 memory_test.py lxml 10000
$ python3 memory_test.py html.parser 10000
The results were the same: no noticeable memory growth, nothing in
gc.garbage, and no leftover objects after the gc.collect() call.
I then modified the script slightly so it would work against Beautiful
Soup 3. Again the results were the same.
Some possible reasons why I don't see the problem:
* I'm using the 32-bit version of Python, not the 64-bit version.
* The leaky code path may only be triggered by specific markup.
* My debugging code may have missed something.
> There's almost certainly a bug in the 64-bit version. Should I file a bug
> report?
Beautiful Soup is pure Python, so there's no 64-bit version per se.
I'm not sure if that's what you meant, or if you meant the 64-bit
version of Python itself.
I see three places where there might be a memory leak (and of course
there might be multiple memory leaks):
1. Python itself
2. The lxml C extension
3. Beautiful Soup
If #3 were true, we would expect my test script to bloat, even on
32-bit Python. But I don't see that.
If #2 were true, we would expect your BS3 test script not to bloat,
since BS3 never uses lxml. But you report that it does bloat.
(The fact that lxml.html_cleaner doesn't show the problem doesn't
eliminate #2 from consideration. Beautiful Soup has found bugs in lxml
before, notably https://bugs.launchpad.net/lxml/+bug/963936.)
So I think the most likely explanation is #1. But I anticipate an
uphill battle convincing the Python devs to accept a bug report based
on this data.
I would like you to answer these questions:
1. Can you duplicate the problem with your real-world markup on a
32-bit Python installation? (assuming you have one available)
2. What parser are you using when you get the problem with BS4? Is it
lxml? Does the problem go away if you use a different parser?
3. I'd like to see you run the attached script on your 64-bit
installation using html5lib, lxml, and html.parser.
3a. What are the results?
3b. Does the problem show up for some parsers but not others?
3c. Does the problem manifest itself, but the script diagnostic says
there's nothing wrong?
3d. Does it make a difference if you set decompose_after_parse=True?
3e. Do you see anything odd in the debug messages if you uncomment the
script's debugging line?
4. Let's suppose you run my script and the problem doesn't manifest.
Can you manifest the problem by using real markup instead of the
markup generated by my random_markup() function?
5. I'd also like to see you run the attached script on your 64-bit
installation under Python 3, using both lxml and html.parser. What are
the results?
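(For reference on question 4: the attached script's random_markup() isn't reproduced here, but a stand-in with the same job -- generating throwaway nested HTML -- might look like the sketch below. Every name and detail in it is an assumption, not the real function.)

```python
import random

def random_markup(num_tags=100, seed=None):
    """Hypothetical stand-in for the script's random_markup():
    build a document of randomly chosen, properly closed tags."""
    rng = random.Random(seed)
    tag_names = ["p", "div", "span", "b", "i"]
    parts = []
    for _ in range(num_tags):
        tag = rng.choice(tag_names)
        parts.append("<%s>%d</%s>" % (tag, rng.randrange(10000), tag))
    return "<html><body>%s</body></html>" % "".join(parts)

doc = random_markup(10, seed=0)
```

Synthetic markup like this is uniform and shallow; real-world markup exercises very different code paths (entities, broken nesting, encodings), which is exactly why question 4 asks you to try it.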