Possible bug introduced in 4.12.1


Alex Krupp

Apr 9, 2023, 7:40:07 PM
to beauti...@googlegroups.com
One of my test cases is breaking in 4.12.1, presumably because of the new tree copying behavior. The issue is still present in 4.12.2. I'm not sure how to describe this well enough to turn into a ticket (assuming it's a bug), but here is a simple reproduction:

from bs4 import BeautifulSoup

html = "<ul></ul><li><span>1</span><span>2</span></li>"

soup = BeautifulSoup(html, "lxml")

ul_tag = soup.ul
li_tag = soup.li

li_tag.contents = li_tag.contents[1:]
li_tag.previous_sibling.append(li_tag)

print(soup)

# Output in 4.12.0: `<html><body><ul><li><span>2</span></li></ul></body></html>`

# Output in 4.12.2: `<html><body><ul><li><span>1</span><span>2</span></li></ul></body></html>`

--
Alex Krupp
Read my Email: www.fwdeveryone.com/u/alex3917
Subscribe to my blog: https://alexkrupp.typepad.com/
My homepage: www.alexkrupp.com

leonardr

Apr 9, 2023, 10:28:07 PM
to beautifulsoup
Alex,

I can see how the behavior you see changed because of the new code in 4.12.1, but the underlying problem is this line of code:

li_tag.contents = li_tag.contents[1:]

Modifying Tag.contents directly, rather than using the tree modification methods (Tag.extract in this case), disconnects the parse tree. Beyond that point, the behavior of the tree is undefined. That's why you get different behavior in 4.12.0 and 4.12.1. To the extent there's a bug here, I think it's that this line didn't raise an exception.
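
For reference, here is a minimal sketch of the reproduction rewritten to use the tree modification methods instead of assigning to Tag.contents. It assumes the intent was to drop the first span and then move the li into the ul. It uses the built-in "html.parser" rather than "lxml" so it runs without extra dependencies; that parser does not add the html and body wrappers, so the printed document is shorter than in the original report.

```python
from bs4 import BeautifulSoup

html = "<ul></ul><li><span>1</span><span>2</span></li>"

# Assumption: "html.parser" is swapped in for "lxml" to avoid an extra dependency.
soup = BeautifulSoup(html, "html.parser")

ul_tag = soup.ul
li_tag = soup.li

# Remove the first child through the tree API so the parent and sibling
# links are updated along with the list of children.
li_tag.contents[0].extract()

# Move the <li> into the <ul>; append() detaches it from its old parent first.
ul_tag.append(li_tag)

result = str(soup)
print(result)
```

Because every change goes through extract() and append(), the tree stays connected and the result should not depend on the copying behavior introduced in 4.12.1.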

See Bug #1948661 for the original report of this problem and my abandoned attempt to fix it. Even if I had fixed it, it wouldn't have caught your problem because it would only have intercepted method calls. The only realistic solution I can think of is to make it impossible for external code to modify Tag.contents, e.g. to make it into a private variable and expose Tag.contents as an iterator over it.
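
The idea of hiding the child list and exposing only an iterator can be sketched roughly as follows. This is a hypothetical toy class, not Beautiful Soup's Tag and not a proposed patch; the names Node, _contents, append, and extract are chosen only for illustration.

```python
class Node:
    """Hypothetical toy tree node; not Beautiful Soup's Tag class."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self._contents = []  # private storage, changed only by the methods below

    @property
    def contents(self):
        # Read-only view: callers can iterate over the children, but the
        # property has no setter, so assigning to it raises AttributeError.
        return iter(self._contents)

    def append(self, child):
        # Detach the child from any previous parent before attaching it here.
        if child.parent is not None:
            child.extract()
        child.parent = self
        self._contents.append(child)

    def extract(self):
        # Remove this node from its parent and clear the back-reference.
        if self.parent is not None:
            self.parent._contents.remove(self)
            self.parent = None
        return self


li = Node("li")
first, second = Node("span-1"), Node("span-2")
li.append(first)
li.append(second)

try:
    li.contents = list(li.contents)[1:]  # the pattern from the bug report
    assignment_rejected = False
except AttributeError:
    assignment_rejected = True

first.extract()  # the supported way to drop the first child
remaining = [child.name for child in li.contents]
```

In this sketch the direct assignment fails loudly instead of silently disconnecting the tree, while extract() keeps the parent link and the child list in sync.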

Leonard

Alex Krupp

Apr 10, 2023, 8:53:17 AM
to beauti...@googlegroups.com
Makes sense, and it's certainly easy enough to work around. Raising an exception would be a nice enhancement though imho!



--
Alex Krupp
Cell: (607) 351 2671