bs4 cuts html if attribute is long


Thomas Güttler

Apr 24, 2015, 7:09:20 AM
to beauti...@googlegroups.com
Hi,

We have an integration test which checks that bs4 does not truncate HTML.

The very strange thing: it fails sometimes, then passes on a later run.

Can someone explain this?

    def test_beautifulsoup_does_not_cut_html(self):
        # One <tr> with a 100,000-character class attribute, then an empty <tr>.
        html = '<table><tr class="' + (10**5) * 'x' + '"></tr><tr></tr></table>'
        # No parser is named, so bs4 picks the best one installed (lxml here).
        soup = bs4.BeautifulSoup(html.encode(encoding='UTF-8'))
        for link in soup.findAll('a'):
            del link['href']
        html_post_soup = unicode(soup)
        self.assertEqual('<html><body><table><tr class=""></tr><tr></tr></table></body></html>', html_post_soup.replace('x', ''))

Our versions:

user@host> pip freeze | grep -iE 'beau|xml'
beautifulsoup4==4.3.2
lxml==3.2.3

user@host> python --version
Python 2.7.3

Is this a known bug?

Do you need more information to check this?

Thomas Güttler

May 13, 2015, 7:31:37 AM
to beauti...@googlegroups.com
OK, no reply is an answer too :-)

We will use lxml directly in the future. It can parse invalid HTML, too.

http://lxml.de/parsing.html#parsing-html
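
For readers who want to try the same route, here is a minimal sketch of parsing broken markup with lxml directly. It is written for Python 3 (the thread used 2.7), the `broken` sample string is made up for illustration, and it assumes lxml is installed.

```python
import lxml.html

# Hypothetical sample: unclosed <td> and <tr> tags, as in much real-world HTML.
broken = '<table><tr class="a"><td>one<tr><td>two</table>'

# fromstring() runs the recovering HTML parser and returns an element tree.
doc = lxml.html.fromstring(broken)

# Collect every <tr> below the returned element.
rows = list(doc.iter('tr'))
print(len(rows))

# Serialize the repaired tree back to text.
print(lxml.html.tostring(doc, encoding='unicode'))
```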

Regards,
  Thomas Güttler

leonardr

Jun 27, 2015, 9:06:39 AM
to beauti...@googlegroups.com, h...@tbz-pariv.de
For the record, I can't duplicate this and I don't know what might cause it. I've encountered problems in the past where lxml manifested a bug as soon as some aspect of an HTML file exceeded a certain length. That's the only thing that comes to mind.
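
One way to check the length theory is to vary the attribute length and count the rows that come back from each parser. This is only a sketch: `surviving_rows` is a hypothetical helper, the lengths are arbitrary, it is written for Python 3, and parsers that are not installed are skipped.

```python
import bs4


def surviving_rows(attr_length, parser):
    """Parse the test table with the named parser and count its <tr> elements."""
    html = '<table><tr class="' + attr_length * 'x' + '"></tr><tr></tr></table>'
    soup = bs4.BeautifulSoup(html.encode('utf-8'), parser)
    return len(soup.find_all('tr'))


for parser in ('lxml', 'html5lib', 'html.parser'):
    for length in (10, 10**3, 10**5):
        try:
            rows = surviving_rows(length, parser)
        except bs4.FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            print(parser, 'is not installed')
            break
        # Two rows went in; fewer coming out would mean the markup was cut.
        print(parser, length, rows)
```

Since the original failure was intermittent, a single clean run does not prove much; running the loop repeatedly may be needed.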

Leonard

Thomas Güttler

Apr 8, 2016, 9:10:12 AM
to beautifulsoup
One year later I hit the same error again. I found the solution and posted it here:

http://stackoverflow.com/a/36500434/633961

    soup = bs4.BeautifulSoup(html, 'html5lib')
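
A sketch of the same fix in context, for Python 3 (so `str` replaces `unicode`). `strip_hrefs` is a hypothetical name; it mirrors the loop in the original test and names the parser explicitly instead of letting bs4 pick one. It assumes html5lib is installed.

```python
import bs4


def strip_hrefs(html, parser='html5lib'):
    """Remove href from every <a>, parsing with an explicitly named parser."""
    soup = bs4.BeautifulSoup(html, parser)
    for link in soup.find_all('a'):
        del link['href']
    return str(soup)


html = '<table><tr class="' + (10**5) * 'x' + '"></tr><tr></tr></table>'
try:
    # Drop the filler characters so the printed structure stays readable.
    print(strip_hrefs(html).replace('x', ''))
except bs4.FeatureNotFound:
    print('html5lib is not installed')
```

Note that html5lib adds elements such as `<head>` and `<tbody>`, so the expected string in the original assertion would need updating.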

Regards,
  Thomas Güttler
