Why does BS ignore paragraphs with an italics sub-element?

Heck Lennon

Sep 3, 2024, 8:34:51 AM
to beautifulsoup
Hello,

For some reason, this code ignores paragraphs with italics but has no problem seeing those with plain text. Any idea why?

Thank you.
=============
<div id="page17"><p>body</p>
<p>1. footnote
blah.</p>
<p>2. blah <i>et a l,</i> blah
<i>blah,</i> blah
blah.</p>
<p>3. blah, <i>blah
and</i> <i>Change,</i> blah</p>
</div>
=============
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
print("----- page:", div['id'])
ps = div.find_all("p", string=re.compile(r"^\d+\. "))
for p in ps:
print(p.string)
=============

leonardr

Sep 3, 2024, 10:18:42 AM
to beautifulsoup
Hello,

You're seeing unexpected behavior because the "string" argument to find() methods works differently than you expect.

"Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument

So you're searching for tags whose .string matches your regular expression. How does .string work?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

In your example, the first two paragraphs each contain one thing, so .string is defined to be "body" and "1. footnote blah", respectively.
The third paragraph contains five things: a string, an <i> tag, another string, a second <i> tag, and a third string. So that tag's .string is defined as None, and it doesn't match your filter.
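
A quick illustration of the difference, as an untested sketch on a trimmed-down version of your markup:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>1. plain</p><p>2. blah <i>italic</i> blah</p>", "html.parser")
plain, mixed = soup.find_all("p")

print(plain.string)  # "1. plain" -- a single child, so .string is that child
print(mixed.string)  # None -- more than one child, so .string is undefined
print(mixed.text)    # "2. blah italic blah" -- .text joins all descendant strings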

There's a detailed discussion of the tradeoffs in these issues:

And in a comment on issue 1645513, Isaac Muse suggests using the ":contains()" pseudo-class with Soup Sieve to do what you want:
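
I don't have his exact snippet handy, but the general shape would be something like this (my rough sketch; note that :contains() matches literal text, not regular expressions):

ps = div.select('p:contains("footnote")')  # <p> tags whose text, including text inside <i>, contains "footnote"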

Leonard

Heck Lennon

Sep 3, 2024, 11:30:07 AM
to beautifulsoup
Thanks for the pointer.

Google returned no example with a regex, and this (unsurprisingly) doesn't work:
=======
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))  # one page = <div id="page123">blah</div>
for div in divs:
    ps = div.select('p:contains(^\d+\. )')  # footnotes start with "x. "
    for p in ps:
        print(p.string)
=======

Heck Lennon

Sep 3, 2024, 1:48:43 PM
to beautifulsoup
If there's no way to tell BS to treat paragraphs that contain <i> as a plain string, I could pre-edit the HTML file with a regex (e.g. replace all <i> and </i> with __ à la Markdown), and then, after parsing, use BS to turn those markers back into <i> sub-elements. Kludgy, but it could work.
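
Something like this rough, untested sketch:

=======
import re
from bs4 import BeautifulSoup

with open("book.html", encoding="utf-8") as f:  # placeholder file name
    html = f.read()

# Pre-edit: flatten italics into __ markers so each <p> holds a single string.
flattened = re.sub(r"</?i>", "__", html)
soup = BeautifulSoup(flattened, "html.parser")

# ...run the find_all(string=re.compile(r"^\d+\. ")) search here...
# ...then turn the __...__ markers back into <i> sub-elements...
=======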

leonardr

Sep 3, 2024, 3:05:04 PM
to beautifulsoup
If the ":contains()" pseudo-class idea doesn't work for whatever reason, you can pass a function into the find() methods that checks .text instead of .string. The performance of this is bad, which is why it's not part of the find() API the way .string is, but it'll do what you want:

from bs4 import BeautifulSoup
import re

markup = """<div id="page17"><p>body</p>

<p>1. footnote
blah.</p>
<p>2. blah <i>et a l,</i> blah
<i>blah,</i> blah
blah.</p>
<p>3. blah, <i>blah
and</i> <i>Change,</i> blah</p>
</div>"""

soup = BeautifulSoup(markup, 'html.parser')

FOOTNOTE_RE = re.compile(r"^\d+\. ")

def containsText(x):
    # Match <p> tags whose full text (including text inside <i> children) starts with "1. ", "2. ", etc.
    return x.name == "p" and FOOTNOTE_RE.match(x.text)


divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
    ps = div.find_all(containsText)
    for p in ps:
        print(p.text + "\n")

Leonard

Heck Lennon

Sep 4, 2024, 5:40:03 AM
to beautifulsoup
Thanks very much.

Isaac Muse

Sep 5, 2024, 9:16:21 AM
to beautifulsoup

It should be noted that Soup Sieve offers :-soup-contains() and :-soup-contains-own(). The -soup prefix was added to signify that they are not CSS-provided pseudo-classes. The -own variant ensures that the text is found in the target element's own text, not in its descendants.
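
A quick sketch of the difference:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>2. blah <i>et al,</i> blah</p>', 'html.parser')

# Matches: "et al" appears in the <p>, counting text inside descendants like <i>.
print(soup.select('p:-soup-contains("et al")'))

# No match: "et al" only appears inside the <i> child, not in the <p>'s own text.
print(soup.select('p:-soup-contains-own("et al")'))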
