Thanks Leonard! I just wanted to make sure I wasn't missing something obvious. I have helper functions for stuff like this:

import bs4
import re


def get_bs4_fragment(html: str) -> bs4.Tag:
    """
    Returns a Tag object for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string to parse. (Must have a single top-level root element.)

    Returns:
        bs4.Tag: The Tag object for the HTML root element.
    """
    soup = bs4.BeautifulSoup(html, "lxml")
    # lxml wraps fragments in <html><body>, so look the root element up by name.
    match = re.search(r"^\s*<(\w+)", html)
    if match is None:
        raise ValueError("html must start with an opening tag")
    return soup.find(match.group(1))


def get_bs4_fragments(html: str) -> list[bs4.PageElement]:
    """
    Returns a list of Tag/NavigableString objects for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string. (Can have any content except <html> or <body>.)

    Returns:
        list[bs4.PageElement]: A list of Tag/NavigableString objects.
    """
    # Wrap the content in a throwaway <div>, then detach each of its children.
    soup = bs4.BeautifulSoup(f"<div>{html}</div>", "lxml")
    return [x.extract() for x in list(soup.find("div").children)]

which can be demonstrated with:

print(get_bs4_fragment("<html><head/></html>"))
print(get_bs4_fragment("<body><p>text</p></body>"))
print(get_bs4_fragment("<p>text</p>"))
print(get_bs4_fragment("<div>text</div>"))
print(get_bs4_fragment("<ul><li>text</li><li>text</li></ul>"))
print()
print(get_bs4_fragments("foo <p>text</p> bar"))
print(get_bs4_fragments("foo <div>text</div> bar"))
print(get_bs4_fragments("foo <ul><li>text</li><li>text</li></ul> bar"))
I just wanted to be sure I wasn't missing some more elegant way.
The difference between the two functions is that get_bs4_fragment() returns a single Tag object (and requires that the HTML have exactly one root element), while get_bs4_fragments() returns a list of Tag/NavigableString objects, which is useful for stuffing any free-form HTML content into another tag.
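
As a rough sketch of that last use case (not a claim about the recommended way to do it): fill_tag() below is a hypothetical helper name, and the parser is swapped to "html.parser" purely so the snippet runs without lxml installed.

```python
import bs4


def get_bs4_fragments(html: str) -> list[bs4.PageElement]:
    # Same wrapping trick as in the helper above; "html.parser" is an
    # assumption made here so the sketch has no lxml dependency.
    soup = bs4.BeautifulSoup(f"<div>{html}</div>", "html.parser")
    return [x.extract() for x in list(soup.find("div").children)]


def fill_tag(target: bs4.Tag, html: str) -> bs4.Tag:
    """Hypothetical helper: replace the contents of target with parsed html."""
    target.clear()  # drop whatever the tag currently contains
    for node in get_bs4_fragments(html):
        target.append(node)  # works for both Tag and NavigableString nodes
    return target


doc = bs4.BeautifulSoup('<section id="main"><p>old</p></section>', "html.parser")
section = doc.find("section")
fill_tag(section, "foo <b>bold</b> bar")
print(doc)
```

The old <p> is removed and the section ends up holding the text and tag nodes from the fragment string, in order.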
Thanks!
- Chris