This is less an issue with Beautiful Soup than with html.parser. Python’s built-in HTML parser (html.parser) is not very sophisticated: it does not infer the implied end of an <li> tag, even though omitting </li> is allowed by the HTML spec and handled by browsers.
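As a rough illustration of that point, the sketch below feeds a tiny list to the standard library's HTMLParser and records the tag events it reports. The TagLogger class and tag_events helper are hypothetical names made up for this example; only HTMLParser and its handle_starttag/handle_endtag hooks come from the standard library.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record the start/end tag events that html.parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events seen while parsing markup."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Two <LI> items with no closing </LI>, as in the poem markup.
events = tag_events("<OL><LI>first<LI>second</OL>")
print(events)
```

No ("end", "li") event appears in the list, because html.parser only reports the tags that are literally present in the input. Anything built on top of it has to decide for itself where each <li> ends.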
To parse HTML that relies on such features, you may need a more capable parser such as html5lib or lxml (assuming you’ve installed them):
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "lxml")
print(soup)
Output:
➜ soupsieve git:(main) ✗ python3 example.py
<html><body><ol>
<li>'THERE sinks the nebulous star we call the Sun,
</li><li>If that hypothesis of theirs be sound,'
</li><li>Said Ida;' let us down and rest:' and we
</li><li>Down from the lean and wrinkled precipices,
</li><li>By every coppice-feather'd chasm and cleft,
</li><li>Dropt thro' the ambrosial gloom to where below
</li><li>No bigger than a glow-worm shone the tent
</li><li>Lamp-lit from the inner. Once she lean'd on me,
</li><li>Descending; once or twice she lent her hand,
</li><li>And blissful palpitations in the blood,
</li><li>Stirring a sudden transport rose and fell.
</li></ol>
</body></html>
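To compare the parsers yourself, here is a minimal sketch that counts how many <li> elements end up as direct children of the <ol>. It assumes bs4 and lxml are installed; the shortened markup and the direct_li_count helper are made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the poem: three items, none of them closed.
markup = "<ol><li>first<li>second<li>third</ol>"


def direct_li_count(parser_name):
    """Count the <li> elements that are direct children of the <ol>."""
    soup = BeautifulSoup(markup, parser_name)
    return len(soup.ol.find_all("li", recursive=False))


for name in ("html.parser", "lxml"):
    print(name, direct_li_count(name))
```

With html.parser, each unclosed <li> should end up nested inside the previous one, so the <ol> has a single direct <li> child, while lxml yields three siblings. That nesting is what throws off sibling-based selectors.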