Memory leak on 64-bit machines ?
Date: Mon, 23 Jul 2012 01:53:41 -0700 (PDT)
From: Romy Maxwell <romy.maxw...@gmail.com>
To: beautifulsoup@googlegroups.com
Message-Id: <381e64c5-4421-42c3-acbe-23e066aedef3@googlegroups.com>
In-Reply-To: <CAL4sBwgTLwUg0OO0fC5PSywxZnQ3xV-FPsq4dQ9txJbOkDq5Uw@mail.gmail.com>
References: <e1420c46-4b20-4d7e-960b-a8dd7a36950c@googlegroups.com>
<CAL4sBwgTLwUg0OO0fC5PSywxZnQ3xV-FPsq4dQ9txJbOkDq5Uw@mail.gmail.com>
Subject: Re: Memory leak on 64-bit machines ?
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_481_15184161.1343033621909"
------=_Part_481_15184161.1343033621909
Content-Type: multipart/alternative;
boundary="----=_Part_482_18744463.1343033621909"
------=_Part_482_18744463.1343033621909
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Yup, calling decompose() was actually the first thing I tried, as suggested
by an old thread somewhere, and the result was the same.
There's almost certainly a bug in the 64-bit version. Should I file a bug
report?
On Tuesday, July 17, 2012 1:46:29 PM UTC-7, Aaron DeVore wrote:
>
> This can happen when the Python interpreter has trouble deallocating
> the many circular references in a Beautiful Soup tree. The function
> tag.decompose() explicitly breaks the links in a tree, allowing the
> interpreter to easily deallocate all of the tree. Example use:
>
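> from bs4 import BeautifulSoup  # Beautiful Soup 4
> dom = BeautifulSoup(untrusted_html)
> # operations...
> dom.decompose()  # break the tree's internal links so it can be freed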
>
> On Sat, Jul 14, 2012 at 12:33 PM, Romy Maxwell wrote:
> > Running 4.1.0 (pip installed) on Ubuntu 11 & 12 machines (3.4.2-x86_64),
> > isolated BeautifulSoup (both 3 and 4) as the cause of endless memory
> > leakage while using BeautifulSoup to clean HTML (remove script tags, add
> > nofollow, etc). On one machine the processes continue to bloat until OOM
> > killer steps in and/or the machine reboots. As a stopgap, I had to use
> > supervisord with the memmon module to restart procs bloating over 100M
> > RSS, which happened fairly rapidly.
> >
> > Removal of said code stops memory bloat entirely. I've tested
> > lxml.html.soupparser, which uses BS, and it bloats as well. Using
> > lxml.html.cleaner and its built-in clean_html function (which does NOT
> > rely on BS) shows no memory leakage, however. Anyone else been seeing
> > anything like this ? I've done lots of googling and have come up empty
> > handed.
>
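
For anyone following the decompose() suggestion quoted above, here is a
minimal sketch of the mechanism it relies on: parent/child links form
reference cycles, and breaking them lets plain reference counting free the
tree. Node, build_tree and freed_without_gc are hypothetical names made up for
this illustration; this is a stand-in tree, not Beautiful Soup's actual
implementation, and it does not reproduce the leak reported in this thread.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a parse-tree tag; not Beautiful Soup's class."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)  # parent <-> child forms a cycle

    def decompose(self):
        """Recursively break parent/child links so no cycles remain."""
        for child in self.children:
            child.decompose()
        self.children = []
        self.parent = None


def build_tree():
    """Build a small tree: one root with three children."""
    root = Node()
    for _ in range(3):
        Node(parent=root)
    return root


def freed_without_gc(break_links):
    """Return True if the tree is freed by reference counting alone."""
    gc.disable()  # keep the automatic cycle collector out of the picture
    try:
        root = build_tree()
        probe = weakref.ref(root)  # lets us observe deallocation
        if break_links:
            root.decompose()
        del root
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up any cycles left behind


if __name__ == "__main__":
    print("freed with cycles intact:", freed_without_gc(break_links=False))
    print("freed after decompose(): ", freed_without_gc(break_links=True))
```

If memory still grows after the links are broken, as reported at the top of
this message, then uncollected cycles alone would not explain it, and the
growth has to come from somewhere else.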
------=_Part_482_18744463.1343033621909--
------=_Part_481_15184161.1343033621909--