Shorter way to create and add new tag + children?

Heck Lennon

Jul 15, 2024, 2:18:12 PM

to beautifulsoup

Hello,

This code works, but I was wondering if there's a shorter way to create a new element that holds children before inserting the chunk into the tree.

============

"""
Goal:
<Document>
<Style id="red_line">
<LineStyle>
<color>FF0000FF</color>
<width>6</width>
</LineStyle>
</Style>
"""
LS = "<Style><LineStyle><color>FF0000FF</color><width>6</width></LineStyle></Style>"
LS_inner = "<LineStyle><color>FF0000FF</color><width>6</width></LineStyle>"

#OK doc = soup.find(name=re.compile('document', re.IGNORECASE))
doc = soup.kml.Document
if doc:
    print("Found doc")

    #BAD doc.insert(0,LS)
    red_line = soup.new_tag('Style')
    red_line.attrs['id'] = "red_line"
    red_line.append(BeautifulSoup(LS_inner,features='xml'))
    doc.insert(0,red_line)

============

Thank you.

Chris Papademetrious

Jul 28, 2024, 8:21:25 AM

to beautifulsoup

Hi frdt,

When I need to insert a complex HTML fragment like this, I use helper functions such as the following:

import re

import bs4


def get_bs4_fragment(html: str) -> bs4.Tag:
    """
    Returns a Tag object for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string to parse. (Must have a single top-level root element.)

    Returns:
        bs4.Tag: The Tag object for the HTML root element.
    """
    soup = bs4.BeautifulSoup(html, "lxml")
    # Read the root element's name from the input, then look it up in the parsed tree.
    tag_name = re.search(r"^\s*<(\w+)", html).group(1)
    return soup.find(tag_name)


def get_bs4_fragments(html: str) -> list[bs4.PageElement]:
    """
    Returns a list of Tag/NavigableString objects for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string. (Can have any content except <html> or <body>.)

    Returns:
        list[bs4.PageElement]: A list of Tag/NavigableString objects.
    """
    # Wrap the fragment in a <div> so that all top-level nodes share one known parent.
    soup = bs4.BeautifulSoup(f"<div>{html}</div>", "lxml")
    return [x.extract() for x in list(soup.find("div").children)]

The first function returns a single element; the second returns a list of elements, for when the fragment mixes strings and elements at the top level.

These functions handle the case where lxml adds extra <html>/<body> content around the input HTML. I haven't tried other parsers.
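
To illustrate the wrapping mentioned above, here is a minimal sketch (assuming bs4 and lxml are installed; the sample fragment is made up for illustration). It shows the <html>/<body> elements that the lxml HTML parser adds, and that looking the root element up by name returns just the fragment:

```python
from bs4 import BeautifulSoup

# Parsing a bare fragment with the lxml HTML parser wraps it in <html><body>.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "lxml")
wrapped = str(soup)

# Finding the root element by name skips the wrapper elements.
fragment = soup.find("p")
unwrapped = str(fragment)

print(wrapped)
print(unwrapped)
```
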

In your case, you should be able to do something like:

doc.insert(0, get_bs4_fragment(f"<Style>...your stuff here...</Style>"))
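
One caveat for KML: Beautiful Soup's HTML parsers convert tag names to lowercase, so a mixed-case name such as <Style> may not be found after parsing with "lxml". The following is a sketch of an XML variant, not part of the helpers above: get_xml_fragment is a hypothetical name, the sample KML string is made up, and it assumes lxml is installed so that the "xml" feature is available.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_xml_fragment(xml: str) -> Tag:
    """Hypothetical helper: parse an XML fragment with a single root element and return that element."""
    soup = BeautifulSoup(xml, "xml")  # the XML parser keeps tag-name case
    root = soup.find(True)            # first Tag in the parsed fragment
    if root is None:
        raise ValueError("fragment contains no element")
    return root.extract()             # detach it so it can be inserted elsewhere


# Made-up document standing in for the real KML file.
soup = BeautifulSoup("<kml><Document><name>demo</name></Document></kml>", "xml")
doc = soup.kml.Document

style = get_xml_fragment(
    '<Style id="red_line"><LineStyle><color>FF0000FF</color>'
    "<width>6</width></LineStyle></Style>"
)
doc.insert(0, style)
print(doc.prettify())
```

With this approach the id attribute can be written directly in the fragment string, so the separate new_tag() and attrs steps are not needed.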