Hi everyone,
I thought I had found an elegant solution using custom filtering functions, but not quite.
If I read in the following HTML and grab the first child of the <p> element:
html_doc = """
<p>
<b>bold</b>
<i>italic</i>
<u>underline</u>
text
<br>
</p>
"""
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(html_doc, 'lxml')
p = soup.find('p')
first_p_child = p.next_element
I see that the next_siblings attribute returns both NavigableString (whitespace and non-whitespace) and Tag objects:
>>> list(first_p_child.next_siblings)
[<b>bold</b>,
'\n',
<i>italic</i>,
'\n',
<u>underline</u>,
'\n text\n ',
<br/>,
'\n']
And I see that the find_next_siblings() method can return either NavigableString or Tag objects, depending on its arguments:
>>> first_p_child.find_next_siblings(string=True)
['\n', '\n', '\n text\n ', '\n']
>>> first_p_child.find_next_siblings(True)
[<b>bold</b>, <i>italic</i>, <u>underline</u>, <br/>]
But if I define a filter function to ignore whitespace strings and return only non-whitespace strings and tags:
def is_non_whitespace(tag) -> bool:
    return not (isinstance(tag, NavigableString) and tag.text.isspace())
it seems that find_*() filtering considers only Tag objects (the string containing "text" is missing below):
>>> first_p_child.find_next_siblings(is_non_whitespace)
[<b>bold</b>, <i>italic</i>, <u>underline</u>, <br/>]
If the find_*() methods considered both NavigableString and Tag objects when filtering, I could have elegantly skipped over whitespace-only nodes with find_previous_sibling() and find_next_sibling().
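For comparison, here is a rough sketch of the manual workaround: iterate over next_siblings directly and skip whitespace-only strings. The helper names non_whitespace_siblings() and next_non_whitespace_sibling() are made up for illustration and are not part of bs4, and I'm assuming the built-in 'html.parser' here only so the snippet doesn't depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = "<p>\n<b>bold</b>\n<i>italic</i>\n<u>underline</u>\ntext\n<br>\n</p>"


def is_whitespace_string(node) -> bool:
    # True only for NavigableString nodes made up entirely of whitespace.
    return isinstance(node, NavigableString) and node.isspace()


def non_whitespace_siblings(node):
    # Hypothetical helper: all following siblings (tags and strings)
    # except whitespace-only strings.
    return [s for s in node.next_siblings if not is_whitespace_string(s)]


def next_non_whitespace_sibling(node):
    # Hypothetical helper: first following sibling that is not a
    # whitespace-only string, or None if there is none.
    for sibling in node.next_siblings:
        if not is_whitespace_string(sibling):
            return sibling
    return None


# 'html.parser' is an assumption made to avoid the lxml dependency.
soup = BeautifulSoup(html_doc, "html.parser")
first_p_child = soup.find("p").next_element

for sibling in non_whitespace_siblings(first_p_child):
    print(repr(sibling))
```

This works, but it means reimplementing by hand what find_next_sibling() and find_previous_sibling() already do for tags.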
- Chris