skipping whitespace string tags when navigating the tree?

Chris Papademetrious

Dec 27, 2023, 8:03:54 PM

to beautifulsoup

Hello everyone,

I'm performing HTML5 content restructuring that requires me to frequently explore surrounding content with the previous_sibling and next_sibling attributes.

These attributes can return the whitespace strings (e.g. newlines) that separate tags.

I'm currently wrapping these attributes with my own functions that skip over whitespace strings and return the next tag or non-whitespace string of interest.
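A minimal sketch of what such wrapper functions might look like, for readers who want a starting point. The names is_whitespace_string, next_significant_sibling, and previous_significant_sibling are hypothetical, and the sample HTML is invented for illustration; only next_siblings and previous_siblings come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def is_whitespace_string(node) -> bool:
    # NavigableString subclasses str, so str.isspace() is available directly.
    return isinstance(node, NavigableString) and node.isspace()


def next_significant_sibling(node):
    # Walk forward through siblings, skipping whitespace-only strings.
    for sibling in node.next_siblings:
        if not is_whitespace_string(sibling):
            return sibling
    return None


def previous_significant_sibling(node):
    # Same idea, walking backward.
    for sibling in node.previous_siblings:
        if not is_whitespace_string(sibling):
            return sibling
    return None


# Invented sample markup, parsed with the standard-library parser.
sample = "<p>\n<b>bold</b>\n<i>italic</i>\n text\n<u>underline</u>\n</p>"
soup = BeautifulSoup(sample, "html.parser")
b_tag = soup.find("b")
i_tag = soup.find("i")

after_b = next_significant_sibling(b_tag)          # the <i> tag
after_i = next_significant_sibling(i_tag)          # the string containing "text"
before_b = previous_significant_sibling(b_tag)     # None: only a newline precedes <b>
```

Note that Comment objects are also NavigableString instances, so a comment containing only whitespace would be skipped by this sketch as well.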

Has anyone else encountered this need in their scripts? Did you find a more clever or elegant solution to it than wrapper functions?

Thanks!

- Chris

Chris Papademetrious

Dec 29, 2023, 10:44:11 AM

to beautifulsoup

Hi everyone,

I thought I found an elegant solution with custom filtering functions! But not quite.

If I read in the following HTML and grab the first child of the <p> element:

html_doc = """

bold
italic
underline
text
 

"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
p = soup.find('p')
first_p_child = p.next_element

I see that the next_siblings attribute returns both NavigableString (whitespace and non-whitespace) and Tag objects:

>>> list(first_p_child.next_siblings)
[bold,
'\n',
italic,
'\n',
underline,
'\n text\n ',
 ,
'\n']

And I see that the find_next_siblings() method can return both NavigableString and Tag objects:

>>> first_p_child.find_next_siblings(string=True)
['\n', '\n', '\n text\n ', '\n']

>>> first_p_child.find_next_siblings(True)
[bold, italic, underline]

But if I define a filter function to ignore whitespace strings and return only non-whitespace strings and tags:

from bs4 import NavigableString

def is_non_whitespace(tag) -> bool:
    return not (isinstance(tag, NavigableString) and tag.text.isspace())

it seems that find_*() filtering considers only Tag objects (the string containing "text" is missing below):

>>> first_p_child.find_next_siblings(is_non_whitespace)
[bold, italic, underline, ]

If the find_*() methods considered both NavigableString and Tag objects when filtering, I could have elegantly skipped over whitespace-only nodes with find_previous_sibling() and find_next_sibling().

- Chris

Chris Papademetrious

Dec 29, 2023, 12:24:14 PM

to beautifulsoup

Ahhh, I realized that the primary argument is a Tag filter by construction, and that passing my filter function to the string argument does indeed return the non-whitespace string:

>>> first_p_child.find_next_siblings(string=is_non_whitespace)
['\n text\n ']

So I think to do what I want, the find*() methods would need a new page_element argument that considers both Tag and NavigableString objects:

# get next non-whitespace sibling

next_thing = this_thing.find_next_sibling(page_element=is_non_whitespace)
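Until something like that exists, the effect can be approximated outside the library. The following is only a sketch under stated assumptions: first_sibling is a hypothetical helper (not a Beautiful Soup API), the page_element argument above remains a proposal, and the sample HTML is invented. It applies one predicate to every sibling, whether Tag or NavigableString, and returns the first match.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def is_non_whitespace(node) -> bool:
    # Same test as the filter function earlier in this thread.
    return not (isinstance(node, NavigableString) and node.isspace())


def first_sibling(node, predicate, forward=True):
    # Hypothetical helper: the predicate sees every sibling, regardless of type.
    siblings = node.next_siblings if forward else node.previous_siblings
    return next((s for s in siblings if predicate(s)), None)


# Invented sample markup, parsed with the standard-library parser.
soup = BeautifulSoup("<p>\n<b>bold</b>\n text\n<br/>\n</p>", "html.parser")
b_tag = soup.find("b")
br_tag = soup.find("br")

next_thing = first_sibling(b_tag, is_non_whitespace)                 # string with "text"
prev_thing = first_sibling(br_tag, is_non_whitespace, forward=False) # same string
nothing = first_sibling(br_tag, is_non_whitespace)                   # None
next_tag = first_sibling(b_tag, lambda n: isinstance(n, Tag))        # the <br/> tag
```

Because the predicate is an ordinary callable, the same helper covers "next tag", "next non-whitespace node", or any other node-level test.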

I am used to processing content in XSLT/XQuery. Beautiful Soup and XSLT/XQuery object types correlate as follows:

Tag is like * (element nodes)
NavigableString is like text() (text nodes)
Comment is like comment() (comment nodes)
ProcessingInstruction is like processing-instruction() (PI nodes)
PageElement is like node() (any node containable in a document)
Filter tests and functions are like [...] XQuery predicates
- Except that they are type-specific in Beautiful Soup
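To make the correspondence concrete, here is a rough sketch that labels each child node with the XPath node test it most resembles. The function name xpath_node_test and the sample HTML are hypothetical. Comment and ProcessingInstruction are subclasses of NavigableString in Beautiful Soup, so they have to be checked before the general string case.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, ProcessingInstruction, Tag


def xpath_node_test(node) -> str:
    # Hypothetical helper: map a Beautiful Soup object to a similar XPath node test.
    if isinstance(node, Tag):
        return "*"
    # Check the specialized string subclasses before NavigableString itself.
    if isinstance(node, Comment):
        return "comment()"
    if isinstance(node, ProcessingInstruction):
        return "processing-instruction()"
    if isinstance(node, NavigableString):
        # Other string subclasses (e.g. Doctype) also land here in this sketch.
        return "text()"
    return "node()"


# Invented sample markup, parsed with the standard-library parser.
soup = BeautifulSoup("<p>a<!-- c --><b>x</b></p>", "html.parser")
kinds = [xpath_node_test(child) for child in soup.p.children]
```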

The idea of filtering page_element objects with a function is analogous to filtering node() objects with a predicate - a fundamentally common action in XSLT. It would be powerful to have this in Beautiful Soup too. If I can figure out how to implement it, I'll submit a merge request.

For comparison, here's how I get the adjacent non-whitespace node in XSLT:

following-sibling::node()[not(mine:is-whitespace-text(.))][1]

preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]

where is-whitespace-text() is the following XSLT function:

<xsl:function name="mine:is-whitespace-text" as="xs:boolean">
  <xsl:param name="node" as="node()"/>
  <xsl:apply-templates mode="is-whitespace-text" select="$node"/>
</xsl:function>

<xsl:template match="node()" mode="is-whitespace-text">
  <xsl:value-of select="false()"/>
</xsl:template>
<xsl:template match="text()[normalize-space(.) = '']" mode="is-whitespace-text">
  <xsl:value-of select="true()"/>
</xsl:template>

- Chris

Chris Papademetrious

Dec 29, 2023, 3:34:47 PM

to beautifulsoup

My understanding of PageElement was incorrect (I thought it was the parent type of Tag, NavigableString, Comment, and ProcessingInstruction); please disregard that part.

I filed the following enhancement request:

#2047713: enhance find*() methods to filter through all object types

Thanks!

- Chris
