With "lxml", can I parse an HTML fragment without normalizing it to a full HTML document?


Chris Papademetrious

May 28, 2024, 3:28:57 PM
to beautifulsoup
Hello fellow Soupers!

I am using the lxml parser. When I parse an HTML fragment with a single tag at the top, it wraps the result in <html> and <body> elements to make a full HTML document. For example,

>>> bs4.BeautifulSoup("<div>TEXT</div>", "lxml")
<html><body><div>TEXT</div></body></html>

Is there a way to suppress this document normalization so that I get only the content I asked it to parse?

Thanks!

 - Chris


leonardr

May 28, 2024, 4:24:59 PM
to beautifulsoup
I don't think you can do this with lxml through Beautiful Soup. You can do this with lxml alone, by passing the document into lxml.html.document_fromstring and setting ensure_head_body=False. But Beautiful Soup uses lxml's incremental parser API, which is very different and doesn't have that feature. (Because of this, there are a couple of lxml features that Beautiful Soup doesn't have access to, like line numbers, but the incremental parser is the only way to parse an enormous document.)

Leonard

Chris Papademetrious

May 28, 2024, 5:45:46 PM
to beautifulsoup
Thanks Leonard! I just wanted to make sure I wasn't missing something obvious. I have helper functions for stuff like this:

import bs4
import re

def get_bs4_fragment(html: str) -> bs4.Tag:
    """
    Returns a Tag object for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string to parse. (Must have a single top-level root element.)

    Returns:
        bs4.Tag: The Tag object for the HTML root element.
    """
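    # Find the name of the root element; fail clearly if there isn't one.
    match = re.search(r"^\s*<([\w-]+)", html)
    if match is None:
        raise ValueError("HTML fragment must start with an element tag")
    soup = bs4.BeautifulSoup(html, "lxml")
    # lxml may wrap the fragment in <html>/<body>, so look up the root by name.
    return soup.find(match.group(1))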


def get_bs4_fragments(html: str) -> list[bs4.PageElement]:
    """
    Returns a list of Tag/NavigableString objects for a given HTML fragment string.

    Args:
        html (str): The HTML fragment string. (Can have any content except <html> or <body>.)

    Returns:
        list[bs4.PageElement]: A list of Tag/NavigableString objects.
    """
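    # Wrap the content in a <div> so there is a single known parent to unwrap.
    soup = bs4.BeautifulSoup(f"<div>{html}</div>", "lxml")
    # Copy the child list first, because extract() modifies it during iteration.
    return [x.extract() for x in list(soup.find("div").children)]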

which can be demonstrated with

print(get_bs4_fragment("<html><head/></html>"))
print(get_bs4_fragment("<body><p>text</p></body>"))
print(get_bs4_fragment("<p>text</p>"))
print(get_bs4_fragment("<div>text</div>"))
print(get_bs4_fragment("<ul><li>text</li><li>text</li></ul>"))
print()
print(get_bs4_fragments("foo <p>text</p> bar"))
print(get_bs4_fragments("foo <div>text</div> bar"))
print(get_bs4_fragments("foo <ul><li>text</li><li>text</li></ul> bar"))

I just wanted to be sure I wasn't missing some more elegant way.

The difference between the two functions is that get_bs4_fragment() returns a single Tag object (and requires that the HTML have exactly one top-level element), while get_bs4_fragments() returns a list of Tag/NavigableString objects, useful for stuffing any free-form HTML content into another tag.
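
One alternative worth noting, if switching parsers is acceptable: Beautiful Soup's built-in "html.parser" builder does not add <html> or <body> wrappers, so a fragment comes back as written. This is only a sketch of that option. It does not help when lxml specifically is required, and the two parsers can handle malformed markup differently.

```python
import bs4

# With the "html.parser" builder, the fragment is not wrapped
# in <html>/<body>.
soup = bs4.BeautifulSoup("<div>TEXT</div>", "html.parser")
print(soup)  # <div>TEXT</div>

# Mixed content stays at the top level of the soup, so the pieces
# can be read straight from .contents.
pieces = list(bs4.BeautifulSoup("foo <p>text</p> bar", "html.parser").contents)
print(pieces)
```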

Thanks!

 - Chris