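This is the bug. With 1572827 leading spaces, find('div') returns a div that swallows the markup after its closing tag; with 1572826 leading spaces the result is correct: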
>>> bs4.BeautifulSoup(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>not in div</body></html></style></div>
>>> bs4.BeautifulSoup(' '*1572826+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>
BS4 seems to put the blame on lxml: in the diagnostic output below, only the lxml parser produces the broken tree. However, I tried to reproduce the same parse failure using lxml directly, and could not.
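One thing that might be worth trying when reproducing with lxml alone is feeding the markup incrementally instead of in one call. The sketch below assumes, without anything in this report confirming it, that BS4 hands the document to lxml piece by piece through the parser's feed() interface. The 512-character chunk size and the names chunks, feed_in_chunks and try_lxml_directly are all hypothetical choices made for illustration.

```python
from typing import Iterator

TAIL = "<html><body><div><p></p><style></style></div>not in div</body></html>"


def chunks(markup: str, size: int) -> Iterator[str]:
    """Yield consecutive slices of `markup`, each at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(markup), size):
        yield markup[start:start + size]


def feed_in_chunks(parser, markup: str, size: int):
    """Feed `markup` to any object with feed()/close(), one chunk at a time."""
    for piece in chunks(markup, size):
        parser.feed(piece)
    return parser.close()


def try_lxml_directly(padding: int = 1572827, size: int = 512) -> str:
    """Parse the repro string with lxml's feed interface and serialize the <div>.

    The chunk size is an assumption, not a value taken from this report.
    """
    from lxml import etree  # imported lazily so the helpers above need no lxml

    root = feed_in_chunks(etree.HTMLParser(), " " * padding + TAIL, size)
    div = root.find(".//div")
    # with_tail=False keeps the text that follows </div> out of the output.
    return etree.tostring(div, encoding="unicode", method="html", with_tail=False)
```

Calling try_lxml_directly() and try_lxml_directly(padding=1572826) and comparing the two strings would show whether chunked feeding matters. Under the 512-character assumption, 1572827 + 37 = 3072 × 512, so a chunk boundary would fall inside the closing style tag, between `</styl` and `e>`. That may be a coincidence.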
The small repro string above took quite a bit of trimming down from a real document. Some parts of it can be changed without affecting the bug. In fact, the <html> and <body> tags could be removed entirely, but I left them in so that the HTML stays valid.
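For anyone who wants to check which padding lengths trigger the problem on their own setup, a small scan like the following may help. It is only a sketch: the helper names bs4_lxml_leaks and failing_paddings are made up here, and the check simply looks for the trailing text inside the serialized div, as in the broken output shown above.

```python
from typing import Callable, Iterable, List

TAIL = "<html><body><div><p></p><style></style></div>not in div</body></html>"


def bs4_lxml_leaks(markup: str) -> bool:
    """Return True if the text after </div> shows up inside the parsed <div>."""
    import bs4  # imported lazily; requires beautifulsoup4 and lxml

    div = bs4.BeautifulSoup(markup, "lxml").find("div")
    return div is not None and "not in div" in str(div)


def failing_paddings(
    leaks: Callable[[str], bool],
    paddings: Iterable[int],
    tail: str = TAIL,
) -> List[int]:
    """Return every padding length for which `leaks` reports the bug."""
    return [n for n in paddings if leaks(" " * n + tail)]
```

For example, failing_paddings(bs4_lxml_leaks, range(1572820, 1572835)) scans a small window around the lengths reported here. On the versions listed in the diagnostic output, the result would be expected to include 1572827 and not 1572826.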
>>> bs4.diagnose.diagnose(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>')
Diagnostic running on Beautiful Soup 4.13.5
Python version 3.13.7 (main, Aug 14 2025, 00:00:00) [GCC 14.3.1 20250523 (Red Hat 14.3.1-1)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 5.3.2.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<html>
<body>
<div>
<p>
</p>
<style>
</style>
</div>
not in div
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<div>
<p>
</p>
<style>
</style></div>not in div</body></html>
</style>
</div>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<html>
<body>
<div>
<p/>
<style/>
</div>
not in div
</body>
</html>