invalid parsing on long html input with a <style> tag in a <div>

John

Sep 19, 2025, 1:16:37 PM
to beautifulsoup
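This is the bug: with 1572827 leading spaces, find('div') returns a <div> whose <style> has swallowed the rest of the document. With one space fewer, the result is correct.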
>>> bs4.BeautifulSoup(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>not in div</body></html></style></div>
>>> bs4.BeautifulSoup(' '*1572826+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>

BS4 seems to point at the lxml library as the culprit in some sense. However, I tried to reproduce the parse failure using lxml directly and could not.
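
For anyone who wants to try the same comparison, here is a minimal sketch of one way to drive lxml directly through its parser-target interface and log the events it emits. It assumes that Beautiful Soup's lxml builder also works through a target object and feed(), which is my understanding and not something confirmed in this thread. build_markup, EventRecorder and lxml_events are hypothetical names made up for this illustration.

```python
def build_markup(padding: int) -> str:
    """Return the repro document preceded by `padding` spaces."""
    return " " * padding + (
        "<html><body><div><p></p><style></style></div>"
        "not in div</body></html>"
    )


class EventRecorder:
    """Hypothetical parser target that records the events lxml emits."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # Skip whitespace-only text so the padding does not flood the log.
        if text.strip():
            self.events.append(("data", text))

    def close(self):
        # lxml returns this value from parser.close().
        return self.events


def lxml_events(markup: str):
    """Feed markup to lxml's HTML parser through the target interface."""
    from lxml import etree  # imported lazily; requires lxml to be installed

    parser = etree.HTMLParser(target=EventRecorder())
    parser.feed(markup)
    return parser.close()
```

Comparing lxml_events(build_markup(1572826)) with lxml_events(build_markup(1572827)) should show whether lxml itself reports different events for the two inputs, or whether the difference only appears once Beautiful Soup is involved.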

The small repro string above took quite a bit of trimming from an original real document. Some parts of it can be changed without affecting the bug. In fact, the <html> and <body> tags could be removed entirely, but I left them in so the HTML would be valid.
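
The exact padding length at which the result flips can be located by bisection. This is only a sketch: first_failing_padding and the is_broken callback are hypothetical names, and it assumes the parse is correct below some length and broken at or above it, which the two data points above suggest but do not prove.

```python
from typing import Callable


def first_failing_padding(
    is_broken: Callable[[int], bool], lo: int, hi: int
) -> int:
    """Return the smallest n in (lo, hi] for which is_broken(n) is True.

    Assumes is_broken(lo) is False, is_broken(hi) is True, and that the
    result flips exactly once between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(mid):
            hi = mid  # failure already present at mid; look lower
        else:
            lo = mid  # still fine at mid; look higher
    return hi
```

In practice is_broken(n) would build the document with n leading spaces, parse it with the 'lxml' builder, and check whether "not in div" ended up inside the <div>.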

>>> bs4.diagnose.diagnose(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>')
Diagnostic running on Beautiful Soup 4.13.5
Python version 3.13.7 (main, Aug 14 2025, 00:00:00) [GCC 14.3.1 20250523 (Red Hat 14.3.1-1)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 5.3.2.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<html>
 <body>
  <div>
   <p>
   </p>
   <style>
   </style>
  </div>
  not in div
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <div>
   <p>
   </p>
   <style>
    </style></div>not in div</body></html>
   </style>
  </div>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<html>
 <body>
  <div>
   <p/>
   <style/>
  </div>
  not in div
 </body>
</html>

