skipping whitespace string tags when navigating the tree?


Chris Papademetrious

Dec 27, 2023, 8:03:54 PM
to beautifulsoup
Hello everyone,

I'm performing HTML5 content restructuring that requires me to frequently explore surrounding content with the previous_sibling and next_sibling attributes.

These attributes can return the whitespace strings (e.g. newlines) that separate tags.
I'm currently wrapping these attributes with my own functions that skip over whitespace strings and return the next tag or non-whitespace string of interest.
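For reference, a minimal sketch of what such wrappers might look like. The helper names (is_whitespace_string, next_significant_sibling, previous_significant_sibling) are made up for illustration and are not part of Beautiful Soup, and the example uses the built-in html.parser only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def is_whitespace_string(node) -> bool:
    """True only for string nodes that consist entirely of whitespace."""
    # NavigableString subclasses str, so str.isspace() is available directly.
    return isinstance(node, NavigableString) and node.isspace()


def next_significant_sibling(node):
    """Return the next sibling that is not a whitespace-only string, or None."""
    sibling = node.next_sibling
    while sibling is not None and is_whitespace_string(sibling):
        sibling = sibling.next_sibling
    return sibling


def previous_significant_sibling(node):
    """Return the previous sibling that is not a whitespace-only string, or None."""
    sibling = node.previous_sibling
    while sibling is not None and is_whitespace_string(sibling):
        sibling = sibling.previous_sibling
    return sibling


# Hypothetical sample markup: two tags separated by a whitespace-only string.
soup = BeautifulSoup("<p><b>bold</b>\n  <i>italic</i>\n</p>", "html.parser")
bold = soup.find("b")
italic = next_significant_sibling(bold)
```

Note that Comment and similar objects are NavigableString subclasses, so a comment containing only whitespace would also be skipped by this sketch.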

Has anyone else encountered this need in their scripts? Did you find a more clever or elegant solution than wrapper functions?

Thanks!

 - Chris



Chris Papademetrious

Dec 29, 2023, 10:44:11 AM
to beautifulsoup
Hi everyone,

I thought I found an elegant solution with custom filtering functions! But not quite.

If I read in the following HTML and grab the first child of the <p> element:

html_doc = """
  <p>
    <b>bold</b>
    <i>italic</i>
    <u>underline</u>
    text
    <br>
  </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
p = soup.find('p')
first_p_child = p.next_element


I see that the next_siblings attribute returns both NavigableString (whitespace and non-whitespace) and Tag objects:

>>> list(first_p_child.next_siblings)
[<b>bold</b>,
 '\n',
 <i>italic</i>,
 '\n',
 <u>underline</u>,
 '\n    text\n    ',
 <br/>,
 '\n']

And I see that the find_next_siblings() method can return either NavigableString or Tag objects, depending on the arguments:

>>> first_p_child.find_next_siblings(string=True)
['\n', '\n', '\n    text\n    ', '\n']


>>> first_p_child.find_next_siblings(True)
[<b>bold</b>, <i>italic</i>, <u>underline</u>, <br/>]

But if I define a filter function to ignore whitespace strings and return only non-whitespace strings and tags:

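from bs4 import NavigableString

# True for tags and for strings that contain something other than whitespace
def is_non_whitespace(tag) -> bool:
    return not (isinstance(tag, NavigableString) and tag.text.isspace())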

it seems that find_*() filtering considers only Tag objects (the string containing "text" is missing below):

>>> first_p_child.find_next_siblings(is_non_whitespace)
[<b>bold</b>, <i>italic</i>, <u>underline</u>, <br/>]


If the find_*() methods considered both NavigableString and Tag objects when filtering, I could have elegantly skipped over whitespace-only nodes with find_previous_sibling() and find_next_sibling().
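A small experiment along these lines shows which object types a positional filter function actually receives. This is only a sketch: record_and_accept and seen_types are names invented for the test, the markup is a trimmed-down version of the sample above, and html.parser stands in for lxml:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<p>\n<b>bold</b>\n<u>underline</u>\n text\n<br>\n</p>"
soup = BeautifulSoup(html_doc, "html.parser")
first_p_child = soup.find("p").next_element  # the leading "\n" string

seen_types = []


def record_and_accept(node) -> bool:
    """Accept every node offered, remembering the type of each one."""
    seen_types.append(type(node))
    return True


# Passed positionally, the function acts as the tag-name filter.
matches = first_p_child.find_next_siblings(record_and_accept)
only_tags_seen = all(issubclass(t, Tag) for t in seen_types)
```

Even though the function accepts everything, the string containing "text" never reaches it, so it cannot appear in the result.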

 - Chris

Chris Papademetrious

Dec 29, 2023, 12:24:14 PM
to beautifulsoup
Ahhh, I realized that the primary argument is a Tag filter by construction, and that passing my filter function to the string argument does indeed return the non-whitespace string:

>>> first_p_child.find_next_siblings(string=is_non_whitespace)
['\n    text\n    ']

So I think that to do what I want, the find_*() methods would need a new page_element argument that considers both Tag and NavigableString objects:

# proposed API (page_element is not an existing argument): get next non-whitespace sibling
next_thing = this_thing.find_next_sibling(page_element=is_non_whitespace)

I am used to processing content in XSLT/XQuery. Beautiful Soup and XSLT/XQuery object types correlate as follows:

  • Tag is like * (element nodes)
  • NavigableString is like text() (text nodes)
  • Comment is like comment() (comment nodes)
  • ProcessingInstruction is like processing-instruction() (PI nodes)
  • PageElement is like node() (any node containable in a document)
  • Filter tests and functions are like [...] XQuery predicates
    • Except that they are type-specific in Beautiful Soup

The idea of filtering page_element objects with a function is analogous to filtering node() objects with a predicate - a fundamentally common action in XSLT. It would be powerful to have this in Beautiful Soup too. If I can figure out how to implement it, I'll submit a merge request.
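In the meantime, a predicate over any kind of node can be approximated by iterating the sibling generators directly. The following is a rough sketch, not a Beautiful Soup feature: first_next_sibling and first_previous_sibling are hypothetical helper names, and html.parser is assumed in place of lxml:

```python
from typing import Callable

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def is_non_whitespace(node) -> bool:
    """True for tags and for strings containing non-whitespace characters."""
    return not (isinstance(node, NavigableString) and node.isspace())


def first_next_sibling(node, predicate: Callable[[object], bool]):
    """First following sibling (tag or string) accepted by predicate, else None."""
    return next((s for s in node.next_siblings if predicate(s)), None)


def first_previous_sibling(node, predicate: Callable[[object], bool]):
    """First preceding sibling (nearest first) accepted by predicate, else None."""
    return next((s for s in node.previous_siblings if predicate(s)), None)


soup = BeautifulSoup("<p><u>underline</u>\n    text\n    <br></p>", "html.parser")
u = soup.find("u")
br = soup.find("br")

after_u = first_next_sibling(u, is_non_whitespace)        # the "text" string
before_br = first_previous_sibling(br, is_non_whitespace)  # the same string
```

Because the predicate sees every sibling regardless of type, the same helpers work for any node test, not just whitespace skipping.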

For comparison, here's how I get the adjacent non-whitespace node in XSLT:

following-sibling::node()[not(mine:is-whitespace-text(.))][1]
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]

where is-whitespace-text() is the following XSLT function:

  <!-- returns true() if node is a whitespace text() node -->
  <xsl:function name="mine:is-whitespace-text" as="xs:boolean">
    <xsl:param name="node" as="node()"/>
    <xsl:apply-templates mode="is-whitespace-text" select="$node"/>
  </xsl:function>

  <xsl:template match="node()" mode="is-whitespace-text">
    <xsl:value-of select="false()"/>
  </xsl:template>
  <xsl:template match="text()[normalize-space(.) = '']" mode="is-whitespace-text">
    <xsl:value-of select="true()"/>
  </xsl:template>


 - Chris

Chris Papademetrious

Dec 29, 2023, 3:34:47 PM
to beautifulsoup
My understanding of PageElement was incorrect (I thought it was the parent type of Tag, NavigableString, Comment, and ProcessingInstruction); please disregard that part.

I filed the following enhancement request:


Thanks!

 - Chris