Shorter way to create and add new tag + children?


Heck Lennon

Jul 15, 2024, 2:18:12 PM
to beautifulsoup
Hello,

This code works, but I was wondering if there's a shorter way to create a new element that holds children before inserting the chunk into the tree.

============
"""
Goal:
<Document>
<Style id="red_line">
<LineStyle>
<color>FF0000FF</color>
<width>6</width>
</LineStyle>
</Style>
"""
LS = "<Style><LineStyle><color>FF0000FF</color><width>6</width></LineStyle></Style>"
LS_inner = "<LineStyle><color>FF0000FF</color><width>6</width></LineStyle>"

#OK doc = soup.find(name=re.compile('document', re.IGNORECASE))
doc = soup.kml.Document
if doc:
    print("Found doc")

#BAD doc.insert(0,LS)
red_line = soup.new_tag('Style')
red_line.attrs['id'] = "red_line"
red_line.append(BeautifulSoup(LS_inner,features='xml'))
doc.insert(0,red_line)
============

Thank you.

Chris Papademetrious

Jul 28, 2024, 8:21:25 AM
to beautifulsoup
Hi frdt,

When I need to insert a complex HTML fragment like yours, I use helper functions such as these:

import re

import bs4


def get_bs4_fragment(html: str) -> bs4.Tag:
    """
    Returns a Tag object for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string to parse. (Must have a single top-level root element.)

    Returns:
        bs4.Tag: The Tag object for the HTML root element.
    """
    soup = bs4.BeautifulSoup(html, "lxml")
    # Take the name of the first opening tag, then look it up in the parsed tree.
    tag_name = re.search(r"^\s*<(\w+)", html).group(1)
    return soup.find(tag_name)


def get_bs4_fragments(html: str) -> list[bs4.PageElement]:
    """
    Returns a list of Tag/NavigableString objects for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string. (Can have any content except <html> or <body>.)

    Returns:
        list[bs4.PageElement]: A list of Tag/NavigableString objects.
    """
    # Wrap the fragment in a <div> so its top-level nodes can be collected as children.
    soup = bs4.BeautifulSoup(f"<div>{html}</div>", "lxml")
    return [x.extract() for x in list(soup.find("div").children)]

The first function returns a single element; the second returns a list of nodes (for when you want to create a mix of strings and elements).
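
As a rough illustration of the list-returning variant, the sketch below inserts mixed text and element content into an existing tag. The name `fragments`, its `parser` argument, and the sample markup are assumptions made up for this example; it defaults to "html.parser" only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup


def fragments(html, parser="html.parser"):
    """Hypothetical variant of the list-returning helper with a selectable parser."""
    # Wrap the fragment so its top-level nodes become children of one <div>.
    soup = BeautifulSoup(f"<div>{html}</div>", parser)
    # Copy the children into a list first, because extract() mutates the tree.
    return [child.extract() for child in list(soup.find("div").children)]


# Made-up target document for the demonstration.
page = BeautifulSoup("<p id='target'></p>", "html.parser")
target = page.find("p")

# Append each node in order: a string, a <b> tag, then another string.
for node in fragments("Hello <b>world</b>!"):
    target.append(node)
```

After the loop, `target` holds three children, and the strings stay ordinary text nodes rather than being wrapped in extra tags.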

These functions handle the case where lxml adds extra <html>/<body> content around the input HTML. I haven't tried other parsers.
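
The wrapping behaviour mentioned above is easy to check. This small sketch parses the same made-up fragment with "lxml" and with the built-in "html.parser", and shows that only the former produces a <body> element around it:

```python
from bs4 import BeautifulSoup

fragment = "<p>hi</p>"  # made-up sample fragment

# lxml's HTML parser builds a complete document around the fragment.
wrapped = BeautifulSoup(fragment, "lxml")
# html.parser keeps the fragment as-is, without <html>/<body>.
bare = BeautifulSoup(fragment, "html.parser")

wrapped_has_body = wrapped.body is not None
bare_has_body = bare.body is not None

# Looking the element up by name skips the wrappers, as the helpers do.
paragraph = wrapped.find("p")
```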

In your case, you should be able to do something like:

doc.insert(0, get_bs4_fragment(f"<Style>...your stuff here...</Style>"))
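
One caveat for KML specifically: the Beautiful Soup documentation notes that the HTML parsers convert tag names to lowercase, so a mixed-case name such as Style may not survive an "lxml" (HTML) parse. A minimal sketch of an XML-flavoured variant follows, assuming lxml is installed so that the "xml" feature is available. The helper name `get_xml_fragment` and the sample KML string are hypothetical, not part of the original code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_xml_fragment(xml: str) -> Tag:
    """Hypothetical helper: parse an XML fragment and return its root Tag."""
    soup = BeautifulSoup(xml, "xml")  # XML parsing keeps the case of tag names
    root = soup.find(True)            # the first Tag in the parsed fragment
    if root is None:
        raise ValueError("fragment contains no element")
    return root.extract()             # detach it so it can be inserted elsewhere


# Made-up stand-in for the real KML document.
soup = BeautifulSoup("<kml><Document><name>demo</name></Document></kml>", "xml")
doc = soup.kml.Document

style = get_xml_fragment(
    '<Style id="red_line"><LineStyle><color>FF0000FF</color>'
    "<width>6</width></LineStyle></Style>"
)
doc.insert(0, style)
```

Here the id attribute is written directly in the fragment string, which removes the separate new_tag() and attrs steps from the original code.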

 - Chris