invalid parsing on long html input with a <style> tag in a <div>


John

Sep 19, 2025, 1:16:37 PM
to beautifulsoup
This is the bug:
>>> bs4.BeautifulSoup(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>not in div</body></html></style></div>
>>> bs4.BeautifulSoup(' '*1572826+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>

BS4 seems, in some sense, to be blaming the lxml library. However, I tried to get lxml itself to reproduce BS4's parse failure directly and could not.

The small repro string above took quite a bit of trimming from the original real document. Some details can vary; in fact, the <html> and <body> tags could be removed entirely, but I left them in so the HTML would be valid.
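For anyone trimming a similar repro: once the failure depends only on the amount of leading padding, the exact boundary can be located with a binary search. A hedged sketch (the `triggers_bug` predicate below is a stand-in; in practice it would run the BeautifulSoup parse and check whether `find('div')` comes back mangled):

```python
def find_threshold(triggers_bug, lo=0, hi=2**21):
    """Binary-search the smallest padding size for which
    triggers_bug(size) is True. Assumes the predicate is
    monotonic: False below the threshold, True at and above it."""
    assert not triggers_bug(lo) and triggers_bug(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if triggers_bug(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Stand-in predicate mimicking the boundary observed above:
print(find_threshold(lambda size: size >= 1572827))  # → 1572827
```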

>>> bs4.diagnose.diagnose(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>')
Diagnostic running on Beautiful Soup 4.13.5
Python version 3.13.7 (main, Aug 14 2025, 00:00:00) [GCC 14.3.1 20250523 (Red Hat 14.3.1-1)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 5.3.2.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<html>
 <body>
  <div>
   <p>
   </p>
   <style>
   </style>
  </div>
  not in div
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <div>
   <p>
   </p>
   <style>
    </style></div>not in div</body></html>
   </style>
  </div>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<html>
 <body>
  <div>
   <p/>
   <style/>
  </div>
  not in div
 </body>
</html>


John

Oct 31, 2025, 1:28:10 PM
to beautifulsoup
For anyone else blocked on this critical bug, I found a bizarre workaround:

Fails:
>>> bs4.BeautifulSoup(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml').find('div')
<div><p></p><style></style></div>not in div</body></html></style></div>

Works:
>>> bs4.BeautifulSoup(str(bs4.BeautifulSoup(' '*1572827+'<html><body><div><p></p><style></style></div>not in div</body></html>','lxml')),'lxml').find('div')

<div><p></p><style></style></div>

What's particularly bizarre about this is that the output from str() is itself invalid. Yet the second round of parsing somehow triggers cleanup in bs4 that turns a failing parse into a working one.

I now always do decode, str(), decode again on every input file, and it seems to let everything work normally. I do wonder, though, whether this merely makes the bug less likely rather than fixing it entirely.
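A minimal helper capturing this double-parse workaround (`robust_parse` is my name for it, not anything in bs4; assumes bs4 and lxml are installed):

```python
from bs4 import BeautifulSoup

def robust_parse(markup):
    """Parse once, serialize, and parse the serialization again.
    The first pass discards the huge leading whitespace run, so
    the second pass avoids the misparse described above."""
    first_pass = BeautifulSoup(markup, 'lxml')
    return BeautifulSoup(str(first_pass), 'lxml')

soup = robust_parse(' ' * 1572827
                    + '<html><body><div><p></p><style></style></div>'
                      'not in div</body></html>')
print(soup.find('div'))
```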

leonardr

Oct 31, 2025, 2:09:24 PM
to beautifulsoup
John,

Thanks for taking the time to explain your issue. You're running into a bug/shortcoming in the lxml library that was fixed in the 6.0.0 release in June.

I don't know exactly what the problem is, but it looks like if lxml 5.x has to keep more than a certain amount of unresolved markup in memory, it starts processing incoming markup incorrectly. That's why the string </style></div>not in div</body></html> shows up, as text, inside your example's <style> tag.

Here's some example code I wrote based on your demonstration:

from bs4 import BeautifulSoup
import lxml
def test_markup(size):
    markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
    soup = BeautifulSoup(markup, 'lxml')
    div = soup.find('div')
    print(size, div, div.style.contents)

print(lxml.__version__)
test_markup(0)
test_markup(1024)
test_markup(1572826)
test_markup(1572827)


Here's the output when I have lxml 5.4.0 installed:

5.4.0
0 <div><p></p><style></style></div> []
1024 <div><p></p><style></style></div> []
1572826 <div><p></p><style></style></div> []
1572827 <div><p></p><style></style></div>not in div</body></html></style></div> ['</style></div>not in div</body></html>']


Here's the output after updating to the subsequent lxml release, 6.0.0:

6.0.0
0 <div><p></p><style></style></div> []
1024 <div><p></p><style></style></div> []
1572826 <div><p></p><style></style></div> []
1572827 <div><p></p><style></style></div> []


Here's a second test program which uses the diagnose.lxml_trace function to see what events lxml issues to Beautiful Soup:

from bs4.diagnose import lxml_trace
import lxml

print(lxml.__version__)
size = 1572827
markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
print(lxml_trace(markup))


The output on lxml 5.4.0:

5.4.0
end,    p, None
end, style, </style></div>not in div</body></html>
end,  div, None
end, body, None

The <style> tag is being given the textual contents "</style></div>not in div</body></html>", which is wrong. That doesn't happen with lxml 6.0.0:

6.0.0
end,    p, None
end, style, None
end,  div, None
end, body, None
end, html, None
None

Running the data through Beautiful Soup a second time works because, when the document is parsed the first time, the whitespace before the <html> tag gets stripped out:

from bs4 import BeautifulSoup
import lxml
print(lxml.__version__)
size = 1572827
markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
soup = BeautifulSoup(markup, 'lxml')
print(soup.encode())

Output:

5.4.0
b'<html><body><div><p></p><style></style></div>not in div</body></html></style></div></body></html>'

A large document with a lot of whitespace at the beginning has become a very small (albeit invalid) HTML document with no whitespace at the beginning, and lxml can handle it normally.
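This analysis suggests a simpler mitigation for anyone stuck on lxml 5.x (my own inference, not something tested in this thread): strip the leading whitespace yourself before the first parse. Whitespace before the <html> tag is discarded by the HTML parser anyway, so nothing meaningful is lost:

```python
from bs4 import BeautifulSoup

def parse_stripped(markup):
    """Drop leading whitespace before parsing, so lxml 5.x never
    has to buffer the huge run that triggers the misparse."""
    return BeautifulSoup(markup.lstrip(), 'lxml')

soup = parse_stripped(' ' * 1572827
                      + '<html><body><div><p></p><style></style></div>'
                        'not in div</body></html>')
print(soup.find('div'))
```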

Leonard