John,
Thanks for taking the time to explain your issue. You're running into a bug/shortcoming in the lxml library which was fixed by the 6.0.0 release in June.
I don't know exactly what the problem is, but it looks like when lxml 5.x has to keep more than a certain amount of unparsed markup buffered in memory, it starts processing the incoming markup incorrectly. That's why the string </style></div>not in div</body></html> shows up, as text, inside your example's <style> tag.
Here's some example code I wrote based on your demonstration:
from bs4 import BeautifulSoup
import lxml
def test_markup(size):
    markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
    soup = BeautifulSoup(markup, 'lxml')
    div = soup.find('div')
    print(size, div, div.style.contents)
print(lxml.__version__)
test_markup(0)
test_markup(1024)
test_markup(1572826)
test_markup(1572827)
Here's the output when I have lxml 5.4.0 installed:
5.4.0
0 <div><p></p><style></style></div> []
1024 <div><p></p><style></style></div> []
1572826 <div><p></p><style></style></div> []
1572827 <div><p></p><style></style></div>not in div</body></html></style></div> ['</style></div>not in div</body></html>']
Here's the output after updating to the subsequent lxml release, 6.0.0:
6.0.0
0 <div><p></p><style></style></div> []
1024 <div><p></p><style></style></div> []
1572826 <div><p></p><style></style></div> []
1572827 <div><p></p><style></style></div> []
Here's a second test program which uses the diagnose.lxml_trace function to see what events lxml issues to Beautiful Soup:
from bs4.diagnose import lxml_trace
import lxml
print(lxml.__version__)
size = 1572827
markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
print(lxml_trace(markup))
The output on lxml 5.4.0:
5.4.0
end, p, None
end, style, </style></div>not in div</body></html>
end, div, None
end, body, None
The <style> tag is being given the textual contents "</style></div>not in div</body></html>", which is wrong. That doesn't happen with lxml 6.0.0:
6.0.0
end, p, None
end, style, None
end, div, None
end, body, None
end, html, None
None
Running the data through Beautiful Soup a second time works because when the document is parsed the first time, the whitespace before the <html> tag gets stripped out:
from bs4 import BeautifulSoup
import lxml
print(lxml.__version__)
size = 1572827
markup = ' '*size + '<html><body><div><p></p><style></style></div>not in div</body></html>'
soup = BeautifulSoup(markup, 'lxml')
print(soup.encode())
Output:
5.4.0
b'<html><body><div><p></p><style></style></div>not in div</body></html></style></div></body></html>'
A large document with a lot of whitespace at the beginning has become a very small (albeit invalid) HTML document with no whitespace at the beginning, and lxml can handle it normally.
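That also suggests a workaround if upgrading to lxml 6.0.0 isn't an option for you. This is just a sketch based on the behavior above, not an officially supported fix: strip the leading whitespace yourself before parsing, so lxml 5.x never has to buffer it.

```python
# Sketch of a workaround for lxml 5.x: if the leading whitespace is
# insignificant to you, remove it before parsing so lxml never has to
# buffer megabytes of unparsed text.
markup = ' ' * 1572827 + '<html><body><div><p></p><style></style></div>not in div</body></html>'
cleaned = markup.lstrip()
print(len(markup), len(cleaned))  # ~1.5 MB shrinks to a 69-byte document
# soup = BeautifulSoup(cleaned, 'lxml')  # should now parse correctly even on 5.x
```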
Leonard