Okay, I think I hit my first wrinkle.
There are various methods and properties that work universally across BeautifulSoup, Tag, and NavigableString object types:
decompose()
encode()
extract()
find()
find_all_next()
find_all_previous()
find_next()
find_next_sibling()
find_next_siblings()
find_parent()
find_parents()
find_previous()
find_previous_sibling()
find_previous_siblings()
format_string()
formatter_for_name()
get_text()
index()
insert_after()
insert_before()
match()
replace_with()
setup()
wrap()
next_elements
next_siblings
parents
previous_elements
previous_siblings
strings
stripped_strings
I created a mybs4.py file with the following content:
# mybs4.py
import bs4

class PageElement(bs4.PageElement):
    def get_text(self):
        return "SURPRISE! (PageElement)"

class Tag(bs4.Tag):
    def get_text(self):
        return "SURPRISE! (Tag)"

class NavigableString(bs4.NavigableString):
    def get_text(self):
        return "SURPRISE! (NavigableString)"

class BeautifulSoup(bs4.BeautifulSoup):
    def get_text(self):
        return "SURPRISE! (BeautifulSoup)"

def BeautifulSoup(*args) -> bs4.BeautifulSoup:
    return bs4.BeautifulSoup(
        *args,
        element_classes={
            bs4.PageElement: PageElement,
            bs4.Tag: Tag,
            bs4.NavigableString: NavigableString,
            bs4.BeautifulSoup: BeautifulSoup,
        },
    )
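A side note on the file above, offered as an observation rather than a diagnosis: it defines a class named BeautifulSoup and then a function with the same name. In Python the later definition rebinds the name, so by the time the element_classes dictionary is built, BeautifulSoup refers to the function, not the subclass. A standalone sketch of that behavior, using a hypothetical Widget name and no bs4 at all:

```python
# Hypothetical names for illustration only; nothing here comes from bs4.
class Widget:
    """Defined first, as a class."""


def Widget(*args):
    """Defined second, as a function; this rebinds the name Widget."""
    return args


# The module-level name now refers to the function, not the class.
print(type(Widget).__name__)     # function
print(isinstance(Widget, type))  # False
```

Whether this has anything to do with Problem #1 below is a separate question; the sketch only shows what the name refers to.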
Then I tried the following test case:
import mybs4
soup = mybs4.BeautifulSoup("<p>text</p>")
print('BeautifulSoup:', soup.get_text())
print('Tag:', soup.p.get_text())
print('NavigableString:', soup.find(string=True).get_text())
Problem #1 is that my custom method is called for Tag and NavigableString objects, but not for BeautifulSoup objects:
BeautifulSoup:
Tag: SURPRISE! (Tag)
NavigableString: SURPRISE! (NavigableString)
Problem #2 is that I would like to simplify the code to have only a single copy of each method. I tried commenting out the Tag/NavigableString/BeautifulSoup stuff and left only the PageElement stuff, but that failed completely. (I hoped this would work, but I wasn't expecting it to work.)
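For reference, the general Python pattern for keeping a single copy of a method shared by several subclasses is a mixin class. Below is a minimal sketch of that pattern using stand-in classes, not the real bs4 classes. Every name in it (BaseElement, BaseTag, BaseString, SurpriseMixin, MyTag, MyString) is hypothetical, and it makes no claim about how bs4 itself wants this done:

```python
# Stand-in classes only; these are NOT the real bs4 classes.
class BaseElement:
    def get_text(self):
        return "original"


class BaseTag(BaseElement):
    pass


class BaseString(str, BaseElement):
    pass


# Hypothetical mixin holding the single shared copy of the custom method.
class SurpriseMixin:
    def get_text(self):
        return f"SURPRISE! ({type(self).__name__})"


# Listing the mixin first puts it ahead of the base class in the
# method resolution order, so its get_text() is the one that runs.
class MyTag(SurpriseMixin, BaseTag):
    pass


class MyString(SurpriseMixin, BaseString):
    pass


print(MyTag().get_text())           # SURPRISE! (MyTag)
print(MyString("text").get_text())  # SURPRISE! (MyString)
```

Each subclass body stays empty, so adding another shared method means touching only the mixin.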
If it seems like I'm a novice with Python subclassing and I'm making obvious mistakes, that's because I am and I am! Don't assume that anything I did above is out of cleverness. :)
What's the best way to define custom methods that work across all PageElement-derived objects?
Thank you!
- Chris