how do I extend BeautifulSoup to add my own convenience methods?


Chris Papademetrious

Apr 18, 2024, 6:09:07 PM
to beautifulsoup
Hello fellow Soupers!

I would like to add my own convenience methods to BeautifulSoup. For example, I might want a contains_a_single() method that indicates whether a tag contains exactly one element, and whether that element matches the specified conditions:

ul = soup.find('ul')
if ul.contains_a_single('li'):
    # this is a single-item list

For reference, the XML::Twig Perl library contains many such convenience methods, and I would like to start implementing my own BeautifulSoup versions of some of them.

My question is, how do I approach this with BeautifulSoup? It has many classes (BeautifulSoup, PageElement, Tag, NavigableString). Do I have to subclass everything? For example, if I subclassed PageElement, how do I get BeautifulSoup objects to start working with those?

I'm new to Python and subclassing. I think I understand how to subclass a single existing class, but not a whole collection of them at once.

 - Chris


leonardr

Apr 19, 2024, 10:23:06 AM
to beautifulsoup
Chris,

The supported way is to use the element_classes argument to the BeautifulSoup constructor:

        :param element_classes: A dictionary mapping BeautifulSoup
         classes like Tag and NavigableString, to other classes you'd
         like to be instantiated instead as the parse tree is
         built. This is useful for subclassing Tag or NavigableString
         to modify default behavior.

Here's a simple example:

from bs4 import BeautifulSoup, Tag

class MyTag(Tag):

    @property
    def number_of_attributes(self):
        return len(self.attrs)

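markup = '<a id="1" class="a" href="/">text</a>'
# Ask the parser to build MyTag objects wherever it would build Tag objects.
soup = BeautifulSoup(markup, 'html.parser', element_classes={Tag: MyTag})

print(soup.a.number_of_attributes)  # custom property: 3
print(soup.a['class'])              # standard Tag behavior still works: ['a']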

I've never heard back from anyone who used this in a real application, so your feedback would be useful. I suspect there is some odd edge-case behavior that arises from replacing the standard classes, but nothing has been reported so far.

Leonard
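
Applied to the contains_a_single() idea from the original question, the same pattern might look like the sketch below. The semantics are an assumption, since the question does not pin them down: here the method returns True when the tag has exactly one direct child tag with the given name, and text nodes (including whitespace) are ignored. make_soup() is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup, Tag


class MyTag(Tag):
    def contains_a_single(self, name):
        """Return True if this tag has exactly one child tag, named `name`.

        Assumption: only direct child tags are counted; strings between
        them (such as newlines) are ignored.
        """
        children = self.find_all(True, recursive=False)
        return len(children) == 1 and children[0].name == name


def make_soup(markup):
    # Hypothetical helper: build a soup whose tags are MyTag instances.
    return BeautifulSoup(markup, "html.parser", element_classes={Tag: MyTag})


single = make_soup("<ul>\n<li>only item</li>\n</ul>")
several = make_soup("<ul><li>one</li><li>two</li></ul>")

print(single.ul.contains_a_single("li"))
print(several.ul.contains_a_single("li"))
```

A stricter variant could also reject non-whitespace text alongside the single child; which behavior is right depends on what the XML::Twig method being ported actually does.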

Chris Papademetrious

Apr 21, 2024, 8:57:15 AM
to beautifulsoup
Hi Leonard,

As a test, I adapted my_get_text() from [link missing] into a custom method per your example:

import bs4

class MyTag(bs4.Tag):
    def my_get_text(self, block_elements=True) -> str:
        ...  # body omitted


and I created a helper function to standardize the creation of BeautifulSoup objects (as this happens in many places in the code):

def MySoup(html_doc: str) -> bs4.BeautifulSoup:
    return bs4.BeautifulSoup(html_doc, "lxml", element_classes={bs4.Tag: MyTag})

soup = MySoup(html_doc)

print(soup.html.my_get_text())

and it works perfectly! This is super cool. I can't wait to start cleaning up my code by converting various helper functions over to custom methods.

 - Chris

Chris Papademetrious

Apr 28, 2024, 8:53:05 AM
to beautifulsoup
Okay, I think I hit my first wrinkle.

There are various methods and properties that work universally across BeautifulSoup, Tag, and NavigableString object types:

decompose()
encode()
extract()
find()
find_all_next()
find_all_previous()
find_next()
find_next_sibling()
find_next_siblings()
find_parent()
find_parents()
find_previous()
find_previous_sibling()
find_previous_siblings()
format_string()
formatter_for_name()
get_text()
index()
insert_after()
insert_before()
match()
replace_with()
setup()
wrap()

next_elements
next_siblings
parents
previous_elements
previous_siblings
strings
stripped_strings
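
A list like the one above can be derived by intersecting the public attribute names of the three classes, as in the rough sketch below. public_names() is a hypothetical helper, and the exact result depends on the installed bs4 version. One caveat worth checking: NavigableString subclasses Python's built-in str, so names such as find() and index() also exist on str, and a shared name does not guarantee shared behavior.

```python
import bs4


def public_names(cls):
    """Hypothetical helper: public attribute names visible on a class."""
    return {name for name in dir(cls) if not name.startswith("_")}


# Names available on all three node types.
shared = (
    public_names(bs4.BeautifulSoup)
    & public_names(bs4.Tag)
    & public_names(bs4.NavigableString)
)

# Shared names that also exist on the built-in str type.
also_on_str = shared & public_names(str)

print(sorted(shared - also_on_str))
print(sorted(also_on_str))
```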

I created a mybs4.py file with the following content:

# mybs4.py
import bs4

class PageElement(bs4.PageElement):
    def get_text(self):
        return "SURPRISE! (PageElement)"

class Tag(bs4.Tag):
    def get_text(self):
        return "SURPRISE! (Tag)"

class NavigableString(bs4.NavigableString):
    def get_text(self):
        return "SURPRISE! (NavigableString)"

class BeautifulSoup(bs4.BeautifulSoup):
    def get_text(self):
        return "SURPRISE! (BeautifulSoup)"

def BeautifulSoup(*args) -> bs4.BeautifulSoup:
    return bs4.BeautifulSoup(
        *args,
        element_classes={
            bs4.PageElement: PageElement,
            bs4.Tag: Tag,
            bs4.NavigableString: NavigableString,
            bs4.BeautifulSoup: BeautifulSoup,
        },
    )

then I tried the following testcase:

import mybs4

soup = mybs4.BeautifulSoup("<p>text</p>")

print('BeautifulSoup:', soup.get_text())
print('Tag:', soup.p.get_text())
print('NavigableString:', soup.find(string=True).get_text())

Problem #1 is that my custom method is called for Tag and NavigableString objects, but not for BeautifulSoup objects:

BeautifulSoup:
Tag: SURPRISE! (Tag)
NavigableString: SURPRISE! (NavigableString)

Problem #2 is that I would like to simplify the code to have only a single copy of each method. I tried commenting out the Tag/NavigableString/BeautifulSoup stuff and left only the PageElement stuff, but that failed completely. (I hoped this would work, but I wasn't expecting it to work.)

If it seems like I'm a novice with Python subclassing and I'm making obvious mistakes, that's because I am and I am! Don't assume that anything I did above is out of cleverness.  :)

What's the best way to define custom methods that work across all PageElement-derived objects?

Thank you!

 - Chris
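
One possible approach to both problems is sketched below; it is not a confirmed answer from the thread. For problem #2, shared methods go in a plain mixin class that is combined with each bs4 class, so each method is written once. For problem #1, the sketch assumes that element_classes is only consulted for objects created while the tree is built, so the top-level object is customized by subclassing bs4.BeautifulSoup and instantiating that subclass directly. Note also that in mybs4.py the later `def BeautifulSoup` rebinds the module-level name that the earlier `class BeautifulSoup` had defined, so the class is no longer reachable by that name. ConvenienceMixin, MyTag, MyString, MySoup, and kind() are all hypothetical names.

```python
import bs4


class ConvenienceMixin:
    """Hypothetical mixin: one copy of each method shared by all node types."""

    def kind(self):
        # Report which concrete class this node is an instance of.
        return type(self).__name__


class MyTag(ConvenienceMixin, bs4.Tag):
    pass


class MyString(ConvenienceMixin, bs4.NavigableString):
    pass


class MySoup(ConvenienceMixin, bs4.BeautifulSoup):
    def __init__(self, *args, **kwargs):
        # Default to the custom node classes unless the caller overrides them.
        kwargs.setdefault(
            "element_classes",
            {bs4.Tag: MyTag, bs4.NavigableString: MyString},
        )
        super().__init__(*args, **kwargs)


soup = MySoup("<p>text</p>", "html.parser")

print("BeautifulSoup:", soup.kind())
print("Tag:", soup.p.kind())
print("NavigableString:", soup.p.string.kind())
```

The sketch uses a new method name rather than overriding get_text(), because overriding built-in methods on replacement classes may run into the edge cases Leonard mentions above.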