Possible bug with nested lists in v4.7.1


Alex Krupp

May 15, 2019, 11:20:16 PM
to beauti...@googlegroups.com
For nested lists in HTML, the nested list is supposed to be within an <li> element rather than a direct child of the <ol> or <ul>. However, consider this slightly non-conformant HTML snippet produced by Gmail:

from bs4 import BeautifulSoup

x = "<ol><li>1</li><ol><li>2</li></ol></ol>"
soup = BeautifulSoup(x, "lxml")

The above code snippet works the way I would expect, and creates the following tree (minus the html and body wrapper tags):

"<ol><li>1</li><ol><li>2</li></ol></ol>"

However, if we look at this very similar code snippet below:

y = "<ol><li>1</li><ul><li>*</li></ul></ol>"
soup = BeautifulSoup(y, "lxml")

We get the value:

"<ol><li>1</li></ol><ul><li>*</li></ul>"

Unless there is something in the HTML spec that I don't know about prohibiting nested lists of mixed types, this seems inconsistent and/or broken: the unordered nested list now sits outside the ordered list rather than nested within it.

I would instead expect bs4 to produce one of the following two trees:

"<ol><li>1</li><ul><li>*</li></ul></ol>"
"<ol><li>1</li><li><ul><li>*</li></ul></li></ol>"

Either would be fine, but what currently gets produced seems pretty suboptimal because it substantially changes the meaning of the text.

Alex

--
Alex Krupp
Cell: (607) 351 2671
Read my Email: www.fwdeveryone.com/u/alex3917
Subscribe to my blog: http://alexkrupp.typepad.com/
My homepage: www.alexkrupp.com

Aaron DeVore

May 15, 2019, 11:31:15 PM
to beauti...@googlegroups.com
The tree that is produced is governed by the underlying parser, not by Beautiful Soup itself. In your case, that is lxml's HTML parser. You could try one of the other parsers like html5lib or html.parser to see if the parse tree is closer to what you are looking for.
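
To make that comparison concrete, here is a minimal sketch that parses the mixed-type snippet from the original post with each parser and reports whether the <ul> ends up inside the <ol>. The helper name compare_parsers is hypothetical (not part of Beautiful Soup), and the sketch assumes lxml and html5lib may or may not be installed, so a missing parser is reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The mixed-type snippet from the original post.
MARKUP = "<ol><li>1</li><ul><li>*</li></ul></ol>"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: for each parser, report whether the <ul>
    is still a descendant of the <ol> after parsing.

    Returns a dict mapping parser name to True/False, or None when
    that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            results[name] = None
            continue
        ol = soup.find("ol")
        results[name] = ol is not None and ol.find("ul") is not None
    return results


if __name__ == "__main__":
    for name, nested in compare_parsers(MARKUP).items():
        print(f"{name}: nested list kept inside <ol>? {nested}")
```

The results can differ between parsers and between parser versions, so it is worth running this against the versions you actually have installed.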

--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To post to this group, send email to beauti...@googlegroups.com.
Visit this group at https://groups.google.com/group/beautifulsoup.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/CAOMQBP8-E3YN1C2adb%2B8p%2BmHPN%3DExWdPKiHmWLwbFcuh5AcrVw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

facelessuser

May 15, 2019, 11:36:43 PM
to beautifulsoup
This is what the lxml parser does, not Beautiful Soup. You can see this by using the lxml parser directly:

>>> from lxml import etree
>>> html = etree.HTML('<ol><li>1</li><ul><li>*</li></ul></ol>')
>>> etree.tostring(html)
b'<html><body><ol><li>1</li></ol><ul><li>*</li></ul></body></html>'

Now, as I understand it, the children of a list should be list items, so the nested list belongs inside a list item:

>>> html = etree.HTML('<ol><li>1</li><li><ul><li>*</li></ul></li></ol>')
>>> etree.tostring(html)
b'<html><body><ol><li>1</li><li><ul><li>*</li></ul></li></ol></body></html>'
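
If the input cannot be fixed at the source, one possible workaround is to normalize the markup before handing it to lxml. The sketch below is only an illustration, not something Beautiful Soup or lxml provides: the function name wrap_stray_nested_lists is hypothetical, and it assumes that html.parser leaves the stray nested list where it found it, so that it can be wrapped in an <li>.

```python
from bs4 import BeautifulSoup

LIST_TAGS = ["ol", "ul"]


def wrap_stray_nested_lists(markup):
    """Hypothetical helper: wrap any <ol>/<ul> that is a direct child
    of another <ol>/<ul> in its own <li>.

    Assumes html.parser keeps the non-conformant nesting as written.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list up front, so modifying the tree
    # inside the loop does not disturb the iteration.
    for nested in soup.find_all(LIST_TAGS):
        parent = nested.parent
        if parent is not None and parent.name in LIST_TAGS:
            nested.wrap(soup.new_tag("li"))
    return str(soup)


if __name__ == "__main__":
    print(wrap_stray_nested_lists("<ol><li>1</li><ul><li>*</li></ul></ol>"))
```

The returned string has the conformant shape shown in the second example above, which can then be parsed with lxml if its speed is needed.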

Alex Krupp

May 15, 2019, 11:42:05 PM
to beauti...@googlegroups.com
Fair enough, I'll submit this as a bug on the lxml mailing list then. (Unless there is some valid reason as to why the two code snippets would be treated differently depending on whether the nested list is of the same type as the outer list or not.)
