is there a way to create a new_tag() without a soup object handy?


Chris Papademetrious

Jan 19, 2024, 2:22:01 PM
to beautifulsoup
Hi everyone,

The only way I know to call new_tag() is as a method of a bs4.BeautifulSoup object. For example,

from bs4 import BeautifulSoup
soup = BeautifulSoup('<body/>', 'lxml')
body = soup.find('body')

def add_p_with_text(soup, tag, text):
    p = soup.new_tag('p')
    p.string = text
    tag.append(p)

add_p_with_text(soup, body, 'abc')
add_p_with_text(soup, body, 'def')

But because I have complex processing code with nested functions, I have to thread the soup object down into everything so it's available to create new tags where needed.

Is there a way to call new_tag() directly instead of as a method of a bs4.BeautifulSoup object? I can almost get to it with this:

BeautifulSoup.new_tag(???, 'p')

but I can't find anything to pass for ??? (which is normally the self of the soup object) to make it work.

Thanks!

 - Chris


leonardr

Jan 19, 2024, 3:49:11 PM
to beautifulsoup
You can try instantiating a Tag object directly, with the constructor. I don't generally recommend this because there are a large number of subtle rules about how different tags and attributes should be treated, rules which are managed by the TreeBuilder object (BeautifulSoup.builder). That's the thing that really needs to be passed into the Tag constructor. new_tag is an instance method of BeautifulSoup mainly so we can guarantee that a TreeBuilder is available when Tag needs it.

If you invoke the Tag constructor without passing in a TreeBuilder, you'll get a Tag, but it will probably behave very slightly differently from Tags that were created when a TreeBuilder was available, which can lead to subtle problems.

Here's a simple example. TreeBuilders designed to parse HTML have special rules for handling HTML multi-valued attributes such as "class":

>>> html_soup = BeautifulSoup("", "html.parser")
>>> html_soup.new_tag(name="a", attrs={"class": "class1 class2"})['class']
['class1', 'class2']

TreeBuilders designed to parse XML don't have those rules:

>>> xml_soup = BeautifulSoup("", "lxml-xml")
>>> xml_soup.new_tag(name="a", attrs={"class": "class1 class2"})['class']
'class1 class2'

If you invoke the Tag constructor with no builder, Beautiful Soup won't know which set of rules to apply, so you'll always get the "no rules" behavior:

>>> from bs4 import Tag
>>> Tag(name="a", attrs={"class": "class1 class2"})['class']
'class1 class2'
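
For comparison, here is a minimal sketch of passing a TreeBuilder into the Tag constructor explicitly, so the new tag follows that builder's rules. The names _FACTORY_SOUP and make_tag are hypothetical, made up for this illustration; the sketch assumes a module-level soup created with "html.parser" is an acceptable source for the builder, and that its parser type (HTML vs. XML) matches the documents the tags will be added to.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical module-level "factory" soup; it exists only to supply a TreeBuilder.
# Assumption: an HTML builder is the right choice for the documents being edited.
_FACTORY_SOUP = BeautifulSoup("", "html.parser")


def make_tag(name, attrs=None):
    """Create a detached Tag that follows the factory soup's TreeBuilder rules."""
    return Tag(builder=_FACTORY_SOUP.builder, name=name, attrs=attrs or {})


# With a builder, "class" is treated as a multi-valued attribute.
with_builder = make_tag("a", {"class": "class1 class2"})

# Without a builder, the value stays a plain string.
without_builder = Tag(name="a", attrs={"class": "class1 class2"})
```

Because the builder lives at module level, nested functions can call make_tag() without a soup being threaded through them. Having the helper call _FACTORY_SOUP.new_tag() instead of the Tag constructor should serve the same purpose while staying on the documented API.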

Leonard

Chris Papademetrious

Mar 24, 2024, 7:18:44 AM
to beautifulsoup
Hi everyone,

I ended up going with an approach where I construct the desired new content as HTML:

import bs4
from bs4.element import PageElement  # needed for the return annotation

def get_bs4_fragment(html: str) -> list[PageElement]:
    """Given an HTML5 string, return the BS4 content elements"""
    soup = bs4.BeautifulSoup(f'<div>{html}</div>', 'lxml')
    return [x.extract() for x in list(soup.find('div').children)]


Now I can do stuff like this:

first_tag.insert_before(*get_bs4_fragment(f'<p class="default">Default value: {dv}</p>'))

This makes it super-easy to insert complex nested elements, attributes, use f-strings to customize the content, etc. Perhaps it's not as performant as using lower-level constructors, but the flexibility is fantastic.

Inside get_bs4_fragment(), by wrapping the given content in a <div> then extracting it,

  • It doesn't matter what kind of surrounding context is added by the selected HTML parser.
  • The function can parse and return HTML strings containing multiple sibling elements.
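
As a rough, self-contained illustration of the two points above, the sketch below returns two sibling elements from one string and inserts both before an existing tag. It makes a few assumptions that are not in the original: a parser argument defaulting to "html.parser" (so the example runs without lxml installed), and a made-up document and default value.

```python
from bs4 import BeautifulSoup
from bs4.element import PageElement


def get_bs4_fragment(html: str, parser: str = "html.parser") -> list[PageElement]:
    """Parse an HTML string and return its top-level nodes, detached from the temporary soup."""
    soup = BeautifulSoup(f"<div>{html}</div>", parser)
    # Copy the children into a list first, because extract() mutates the tree being iterated.
    return [node.extract() for node in list(soup.find("div").children)]


# Hypothetical document, used only for illustration.
doc = BeautifulSoup("<body><h1>Title</h1></body>", "html.parser")
first_tag = doc.find("h1")

# One string, two sibling elements.
nodes = get_bs4_fragment('<p class="default">Default value: 5</p><hr/>')
first_tag.insert_before(*nodes)

child_names = [child.name for child in doc.body.children]
```

After this runs, child_names lists the <p> and <hr> ahead of the original <h1>, in the order they appeared in the string.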

 - Chris