Message from discussion Memory leak on 64-bit machines ?

Date: Mon, 23 Jul 2012 01:53:41 -0700 (PDT)
From: Romy Maxwell <romy.maxw...@gmail.com>
To: beautifulsoup@googlegroups.com
Message-Id: <381e64c5-4421-42c3-acbe-23e066aedef3@googlegroups.com>
In-Reply-To: <CAL4sBwgTLwUg0OO0fC5PSywxZnQ3xV-FPsq4dQ9txJbOkDq5Uw@mail.gmail.com>
References: <e1420c46-4b20-4d7e-960b-a8dd7a36950c@googlegroups.com>
 <CAL4sBwgTLwUg0OO0fC5PSywxZnQ3xV-FPsq4dQ9txJbOkDq5Uw@mail.gmail.com>
Subject: Re: Memory leak on 64-bit machines ?
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_481_15184161.1343033621909"

------=_Part_481_15184161.1343033621909
Content-Type: multipart/alternative; 
	boundary="----=_Part_482_18744463.1343033621909"

------=_Part_482_18744463.1343033621909
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Yup, calling decompose() was actually the first thing I tried, as
suggested by an old thread somewhere, and the result was the same:
memory kept growing.

There's almost certainly a bug in the 64-bit version. Should I file a bug
report?

On Tuesday, July 17, 2012 1:46:29 PM UTC-7, Aaron DeVore wrote:
>
> This can happen when the Python interpreter has trouble deallocating
> the many circular references in a Beautiful Soup tree. The method
> tag.decompose() explicitly breaks the links in a tree, allowing the
> interpreter to easily deallocate all of the tree. Example use:
>
> from bs4 import BeautifulSoup  # Beautiful Soup 4
>
> dom = BeautifulSoup(untrusted_html)
> # operations...
> dom.decompose()  # break the tree's internal links so it can be freed
>
> On Sat, Jul 14, 2012 at 12:33 PM, Romy Maxwell wrote:
> > Running 4.1.0 (pip installed) on Ubuntu 11 & 12 machines (3.4.2-x86_64),
> > isolated BeautifulSoup (both 3 and 4) as the cause of endless memory
> > leakage while using BeautifulSoup to clean HTML (remove script tags, add
> > nofollow, etc). On one machine the processes continue to bloat until the
> > OOM killer steps in and/or the machine reboots. As a stopgap, I had to
> > use supervisord with the memmon module to restart procs bloating over
> > 100M RSS, which happened fairly rapidly.
> >
> > Removal of said code stops memory bloat entirely. I've tested
> > lxml.html.soupparser, which uses BS, and it bloats as well. Using
> > lxml.html.clean and its built-in clean_html function (which does NOT
> > rely on BS) shows no memory leakage, however. Anyone else been seeing
> > anything like this? I've done lots of googling and have come up
> > empty-handed.
>
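
For anyone following the circular-reference explanation quoted above, here is a rough, stdlib-only sketch of the idea. The Node class and its decompose() method are hypothetical stand-ins written for this illustration; they are not Beautiful Soup's actual implementation. The cycle collector is switched off on purpose so that only reference counting is at work; in normal operation CPython's collector would usually reclaim such cycles eventually.

```python
import gc
import weakref


class Node:
    """Toy tree node (hypothetical). Parent and child point at each other,
    so every edge in the tree is a reference cycle."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        """Recursively break parent/child links (stand-in for Tag.decompose())."""
        for child in self.children:
            child.decompose()
        self.children = []
        self.parent = None


def build_tree():
    root = Node("html")
    body = Node("body", root)
    Node("p", body)
    return root


def survives_del(break_links):
    """Return True if the tree is still alive after `del`,
    with the cyclic garbage collector disabled."""
    gc.collect()
    gc.disable()  # leave only reference counting, to make the effect visible
    try:
        root = build_tree()
        probe = weakref.ref(root)  # lets us check liveness without a strong ref
        if break_links:
            root.decompose()
        del root
        return probe() is not None
    finally:
        gc.enable()
        gc.collect()  # clean up any cycles left behind


print(survives_del(break_links=False))  # True: the cycles keep the tree alive
print(survives_del(break_links=True))   # False: refcounts drop to zero at once
```

If breaking the links by hand makes no difference in a real process, as reported in this thread, that suggests the growth comes from somewhere other than uncollected tree cycles.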

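One way to isolate a suspected leak like the one described in this thread, sketched under some assumptions: it needs a Python 3 with the stdlib tracemalloc module (which did not exist when the thread was written), the helper name memory_growth is made up for this example, and leaky() and clean() are toy workloads standing in for the real HTML-cleaning code. tracemalloc only sees memory allocated through Python's own allocators, so growth inside a C library can be invisible here even though RSS climbs.

```python
import gc
import tracemalloc


def memory_growth(func, iterations=200, warmup=20):
    """Return the bytes of traced Python memory gained over `iterations`
    calls of `func` (hypothetical helper, not part of any library)."""
    for _ in range(warmup):  # let caches and lazy imports settle first
        func()
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        gc.collect()  # rule out garbage that is merely awaiting collection
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


_cache = []


def leaky():
    """Toy workload that keeps every buffer alive."""
    _cache.append(bytearray(10_000))


def clean():
    """Toy workload that releases its buffer right away."""
    bytearray(10_000)


print(memory_growth(leaky) > 1_000_000)  # True: roughly 2 MB is retained
print(memory_growth(clean) < 100_000)    # True: nothing accumulates
```

Swapping the toy workloads for a function that parses and cleans one document would show whether the growth scales with the number of calls.
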
------=_Part_482_18744463.1343033621909--

------=_Part_481_15184161.1343033621909--