from bs4 import BeautifulSoup

soup = BeautifulSoup("""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.id\
rg/2007/ops">
<body>
<p epub:type="toc"></p>
</body>
</html>
""", "xml")
# Existing tag is found:
print(soup.select("[epub|type=toc]")) # [<p epub:type="toc"/>]
# Creating new tag
new_tag = soup.new_tag("p", attrs={"epub:type": "pagebreak"})
soup.body.append(new_tag)
# Created tag is not found
print(soup.select("[epub|type=pagebreak]")) # []
# Copying an existing tag and changing its attribute instead
tag = soup.p
new_tag = tag.copy_self()
new_tag.attrs["epub:type"] = "pagebreak"
soup.body.append(new_tag)
# Created tag is found
print(soup.select("[epub|type=pagebreak]")) # [<p epub:type="pagebreak"/>]
I've tried reproducing the copy_self method by calling the Tag constructor directly, but had no success. I've done a little digging into soupsieve's code but couldn't work out what the problem was. Any pointers to help me understand this bug (if it is a bug at all) are welcome.
If this is in fact a bug, I'd also like help understanding where to file it: at beautifulsoup or at soupsieve.
Thanks in advance,
João
This is not a bug in soupsieve. Normally, when attributes that have namespaces are created, the attribute key looks like a normal string but carries extra attributes that hold the namespace:
>>> element = soup.select("[epub|type=toc]")[0]
>>> [(k.namespace, k.name) for k in element.attrs.keys()]
[('http://www.id rg/2007/ops', 'type')]
It seems Beautiful Soup doesn’t create the new tag with the namespace context, maybe because it has no parent with the namespace, or maybe for other reasons, I haven’t looked into it. Regardless, soupsieve expects namespace attributes to be constructed as namespace attributes, and when they are not, they are assumed to be normal attributes. We can force the tag to have the namespace attributes by copying the attribute key, which is what copy_self does.
element = soup.select("[epub|type=toc]")[0]
attrs2 = element.attrs.copy()
attrs2["epub:type"] = "pagebreak"
new_tag = soup.new_tag("p", attrs=attrs2)
soup.body.append(new_tag)
Anyway, BeautifulSoup would need to have a way to create a new tag directly under another element so it could use that context to populate namespaces correctly based on where it is being inserted. Soupsieve is simply doing what is expected, checking if the attribute follows the namespace attribute convention, and when it doesn’t, it is assumed to be a normal attribute.
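To make it clearer why copying the attribute dict works while a plain "epub:type" string does not, here is a rough standalone model of the idea. It is only a sketch and not the actual Beautiful Soup or soupsieve code: NamespacedKey, find_namespaced_attr, and the EPUB_NS placeholder URI are hypothetical names made up for illustration. It assumes the matcher looks for `namespace` and `name` attributes on each key, as described above.

```python
# Simplified, hypothetical model; not the real bs4 or soupsieve source.

class NamespacedKey(str):
    """A string that also remembers its prefix, local name, and namespace URI."""

    def __new__(cls, prefix, name, namespace):
        obj = super().__new__(cls, f"{prefix}:{name}")
        obj.prefix = prefix
        obj.name = name
        obj.namespace = namespace
        return obj


def find_namespaced_attr(attrs, namespace, name):
    """Return the value of the attribute with this namespace and local name, or None."""
    for key, value in attrs.items():
        # A plain str has no `namespace` attribute, so it is skipped here
        # and would only ever be treated as an ordinary attribute.
        if (getattr(key, "namespace", None) == namespace
                and getattr(key, "name", None) == name):
            return value
    return None


EPUB_NS = "http://example.com/epub-ns"  # placeholder URI, assumption for this sketch

# What a namespace-aware parser would build for an existing tag.
parsed = {NamespacedKey("epub", "type", EPUB_NS): "toc"}

# What a tag built from a plain dict ends up with: an ordinary string key.
manual = {"epub:type": "pagebreak"}

# Copying keeps the original key object. Assigning through an equal plain
# string replaces only the value, so the namespace metadata survives.
copied = parsed.copy()
copied["epub:type"] = "pagebreak"

print(find_namespaced_attr(parsed, EPUB_NS, "type"))  # toc
print(find_namespaced_attr(manual, EPUB_NS, "type"))  # None
print(find_namespaced_attr(copied, EPUB_NS, "type"))  # pagebreak
```

The point of the model is the `copied` case: the two keys compare and hash as equal strings, so the dict update keeps the key that was already there, including its namespace. That matches the behaviour seen with `element.attrs.copy()` above.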