using beautifulsoup 4 for xml causes strange behaviour (memory issues?)

chobok

Mar 23, 2012, 10:19:52 AM
to beautifulsoup
This is a slightly modified version of my question at:
http://stackoverflow.com/questions/9837713/using-beautifulsoup-4-for-xml-causes-strange-behaviour-memory-issues

I'm getting strange behaviour with this:

>>> from bs4 import BeautifulSoup

>>> smallfile = 'small.xml'    # approx 600 bytes
>>> largerfile = 'larger.xml'  # approx 2300 bytes
>>> len(BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml']))
1
>>> len(BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml']))
0
Contents of small.xml:

<?xml version="1.0" encoding="us-ascii"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>
largerfile is simply the smaller file padded with spaces and
newlines (in between the last two tags, in case it's relevant), i.e.
the structure and contents of the XML should be identical in both cases.
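
For reference, the padding can be reproduced with something like this
(file names as above; the exact amount of whitespace I used may differ):

>>> with open('small.xml', 'r') as f:
...     small = f.read()
...
>>> head, sep, tail = small.rpartition('</Catalog>')
>>> padding = (' ' * 40 + '\n') * 40  # ~1600 bytes of whitespace
>>> with open('larger.xml', 'w') as f:
...     f.write(head + padding + sep + tail)
...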

On rare occasions, processing largerfile will actually yield a partial
result where only a small portion of the XML has been parsed. I can't
reliably recreate the circumstances.

Since BeautifulSoup uses lxml, I tested whether lxml could handle the
files on its own. It parsed both without trouble:

>>> from lxml import etree
>>> tree = etree.parse(smallfile)
>>> len(etree.tostring(tree))
547
>>> tree = etree.parse(largerfile)
>>> len(etree.tostring(tree))
2294
I'm using:

- a netbook with 1 GB RAM
- Windows 7
- lxml 2.3 (I had some trouble installing this; I hope a dodgy
  installation isn't causing the problem)
- Beautiful Soup 4.0.1
- Python 3.2 (I also have Python 2.7.x installed, but have been using
  3.2 for this code)

What could be preventing the larger file from being processed
properly? My current suspicion is some weird memory issue, since the
file size seems to make a difference, perhaps in conjunction with some
bug in how BeautifulSoup 4 interacts with lxml.
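
In the meantime I can get at the data by using lxml's own API and
skipping BeautifulSoup for these files; a rough sketch, using the
element names from small.xml above:

>>> from lxml import etree
>>> tree = etree.parse('larger.xml')
>>> for phase in tree.getroot().iter('MotionPhases'):
...     print(phase.get('index'), phase.find('Acceleration').get('value'))
...
1 3200
2 4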

there's also a similar question here:
http://stackoverflow.com/questions/9622474/beautifulsoup-xml-only-printing-first-line

Leonard Richardson

Mar 23, 2012, 7:05:23 PM
to beauti...@googlegroups.com
Apparently BS4+lxml won't parse an XML document that's longer than
about 550 bytes; I only tested it with small documents. The BS4
handler code is not even being called, which makes it hard to debug,
but that's no guarantee the problem is on the lxml side.

I'll look into this tomorrow or Monday.

Leonard

chobok

Mar 23, 2012, 10:53:12 PM
to beautifulsoup
Thanks Leonard,

From my count using

>>> from itertools import count
>>> from bs4 import BeautifulSoup
>>> for n in count():
...     nchildren = len(BeautifulSoup("<a>" + " " * n + "</a>", 'xml'))
...     if nchildren != 1:  # parsing just broke
...         print(n)
...         break
...
1085

the threshold is a padding of 1085 spaces. 1085 plus the 7 characters
of "<a>" and "</a>" makes 1092, i.e. the XML is processed as expected
up to a byte count of 1091; anything 1092 bytes or longer usually fails.
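
A quicker way to pin the threshold down, assuming the failure is
monotonic in document size (which these runs suggest it mostly is), is
to bisect on the padding length:

>>> from bs4 import BeautifulSoup
>>> def parses_ok(n):
...     return len(BeautifulSoup("<a>" + " " * n + "</a>", 'xml')) == 1
...
>>> lo, hi = 0, 1 << 16  # assumes the break point is below 64 KiB
>>> while lo + 1 < hi:
...     mid = (lo + hi) // 2
...     if parses_ok(mid):
...         lo = mid
...     else:
...         hi = mid
...
>>> hi  # smallest failing padding length
1085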



Leonard Richardson

Mar 24, 2012, 11:04:09 AM
to beauti...@googlegroups.com
It looks like a bug in lxml.

https://bugs.launchpad.net/lxml/+bug/963936

I've put a workaround in bzr. I'll probably release a 4.0.2 on Monday
with the workaround, but I want to see what the lxml developers say.
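
Until then, something like this can at least catch the silently empty
parses (a defensive sketch, not the committed workaround):

>>> from bs4 import BeautifulSoup
>>> def soup_or_raise(markup):
...     soup = BeautifulSoup(markup, 'xml')
...     if markup.strip() and len(soup) == 0:
...         raise RuntimeError("empty tree; probably the BS4+lxml bug")
...     return soup
...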

Leonard

Erin Hodgess

Mar 24, 2012, 7:56:52 PM
to beauti...@googlegroups.com
Beautiful soup is SO EXCELLENT

--
Erin Hodgess
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: erinm....@gmail.com

chobok

Mar 27, 2012, 2:35:34 AM
to beautifulsoup
Thanks for the workaround!


Leonard Richardson

May 24, 2013, 12:24:42 PM
to beauti...@googlegroups.com
> Is this problem resolved? I am using BS 4.2.0 and am still running into
> this problem with the lxml parser. I've tried bringing the parser up to
> the latest version, but I'm still having issues with larger HTML files.
> I've tried switching over to the html5lib parser, but its behaviour seems
> less predictable on some of the HTML content I am working with, and I
> would prefer to use lxml if possible. That said, at least html5lib
> doesn't just start dropping content from the file if it is too large.

No, it's not resolved. I haven't done any work on memory usage.
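
If you need lxml's behaviour where it works, one stopgap is to fall back
to html5lib when the result looks truncated. This is only a sketch, and
the size comparison below is a crude heuristic, not an official check:

>>> from bs4 import BeautifulSoup
>>> def parse_html(markup):
...     soup = BeautifulSoup(markup, 'lxml')
...     if len(str(soup)) < 0.5 * len(markup):  # crude truncation check
...         soup = BeautifulSoup(markup, 'html5lib')
...     return soup
...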

Leonard