Parser - closing tags more promptly


Chris Angelico

Oct 24, 2022, 5:15:54 AM
to beautifulsoup
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:

from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)


On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.

Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.
The same would ideally be possible for <p> elements.

ChrisA

Isaac Muse

Oct 24, 2022, 9:56:26 AM
to beautifulsoup

This is less an issue with Beautiful Soup than with html.parser. Python’s built-in HTML parser (html.parser) is not very sophisticated. It does not infer the implied end of an <li> element, even though omitting </li> is allowed by the spec and handled by browsers.
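To see this at the level of the standard library alone, here is a minimal sketch that counts the start-tag and end-tag events html.parser reports. The class name TagEventCounter and the shortened three-item list are made up for illustration; only the HTMLParser callbacks come from the standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class TagEventCounter(HTMLParser):
    """Hypothetical helper: tally start and end tag events per tag name."""

    def __init__(self):
        super().__init__()
        self.starts = Counter()
        self.ends = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this.
        self.starts[tag] += 1

    def handle_endtag(self, tag):
        # Only called for end tags that literally appear in the markup.
        self.ends[tag] += 1


counter = TagEventCounter()
counter.feed("<OL>\n<LI>first line\n<LI>second line\n<LI>third line\n</OL>\n")
counter.close()

print("li starts:", counter.starts["li"], "li ends:", counter.ends["li"])
print("ol starts:", counter.starts["ol"], "ol ends:", counter.ends["ol"])
```

The parser reports every <li> start but never an <li> end, so anything built on top of these events has to decide for itself where each item closes.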

To parse HTML that relies on rules like this, you may need a more advanced parser such as html5lib or lxml (assuming you’ve installed them):

from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""

soup = BeautifulSoup(blob, "lxml")
print(soup)

Output:

$ python3 example.py
<html><body><ol>
<li>'THERE sinks the nebulous star we call the Sun,
</li><li>If that hypothesis of theirs be sound,'
</li><li>Said Ida;' let us down and rest:' and we
</li><li>Down from the lean and wrinkled precipices,
</li><li>By every coppice-feather'd chasm and cleft,
</li><li>Dropt thro' the ambrosial gloom to where below
</li><li>No bigger than a glow-worm shone the tent
</li><li>Lamp-lit from the inner. Once she lean'd on me,
</li><li>Descending; once or twice she lent her hand,
</li><li>And blissful palpitations in the blood,
</li><li>Stirring a sudden transport rose and fell.
</li></ol>
</body></html>
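If you want to check how deep the nesting goes with each parser before running into the recursion limit, something like the following rough sketch may help. It assumes bs4 is installed and simply skips lxml if it is not; the helper max_li_nesting and the shortened three-item list are invented for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the poem: three list items, no closing </li> tags.
blob = b"<OL>\n<LI>first line\n<LI>second line\n<LI>third line\n</OL>\n"


def max_li_nesting(soup):
    """Hypothetical helper: largest number of <li> ancestors of any <li>."""
    return max(
        (len(li.find_parents("li")) for li in soup.find_all("li")),
        default=0,
    )


depths = {}
for parser in ("html.parser", "lxml"):
    try:
        soup = BeautifulSoup(blob, parser)
    except FeatureNotFound:
        # The requested parser is not installed; skip it.
        continue
    depths[parser] = max_li_nesting(soup)

print(depths)
```

With html.parser each <li> ends up inside the previous one, so the depth grows with the number of items; with lxml the items should come out as siblings, giving a depth of 0.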