Why does BS ignore paragraphs with an italics sub-element?

Heck Lennon

Sep 3, 2024, 8:34:51 AM
to beautifulsoup
Hello,

For some reason, this code ignores paragraphs with italics but has no problem seeing those with plain text. Any idea why?

Thank you.
=============
<div id="page17"><p>body</p>
<p>1. footnote
blah.</p>
<p>2. blah <i>et a l,</i> blah
<i>blah,</i> blah
blah.</p>
<p>3. blah, <i>blah
and</i> <i>Change,</i> blah</p>
</div>
=============
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
print("----- page:", div['id'])
ps = div.find_all("p", string=re.compile(r"^\d+\. "))
for p in ps:
print(p.string)
=============

leonardr

Sep 3, 2024, 10:18:42 AM
to beautifulsoup
Hello,

You're seeing unexpected behavior because the "string" argument to find() methods works differently than you expect.

"Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument

So you're searching for tags whose .string matches your regular expression. How does .string work?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

In your example, the first two paragraphs each contain one thing, so .string is defined to be "body" and "1. footnote blah", respectively.
The third paragraph contains five things: a string, an <i> tag, another string, a second <i> tag, and a third string. So that tag's .string is defined as None, and it doesn't match your filter.
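
A quick illustration of the difference, as an untested sketch on a trimmed-down version of your markup:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>1. plain</p><p>2. blah <i>italic</i> blah</p>", "html.parser")
plain, mixed = soup.find_all("p")

print(plain.string)  # "1. plain" -- a single child, so .string is that child
print(mixed.string)  # None -- more than one child, so .string is undefined
print(mixed.text)    # "2. blah italic blah" -- .text joins all descendant strings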

There's a detailed discussion of the tradeoffs in these issues:

And in a comment on issue 1645513, Isaac Muse suggests using the ":contains()" pseudo-class with Soup Sieve to do what you want:
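
I don't have his exact snippet handy, but the general shape would be something like this (my rough sketch; note that :contains() matches literal text, not regular expressions):

ps = div.select('p:contains("footnote")')  # <p> tags whose text, including text inside <i>, contains "footnote"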

Leonard

Heck Lennon

Sep 3, 2024, 11:30:07 AM
to beautifulsoup
Thanks for the pointer.

Google returned no example with a regex, and this (unsurprisingly) doesn't work:
=======
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))  # one page = <div id="page123">blah</div>
for div in divs:
    ps = div.select('p:contains(^\d+\. )')  # footnotes start with "x. "
    for p in ps:
        print(p.string)
=======

Heck Lennon

Sep 3, 2024, 1:48:43 PM
to beautifulsoup
If there's no way to tell BS to treat paragraphs that contain <i> as a plain string, I could pre-edit the HTML file with a regex (e.g. replace all <i> and </i> with __ à la Markdown), and then, after parsing, use BS to turn those markers back into <i> sub-elements. Kludgy, but it could work.
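
Something like this rough, untested sketch:

=======
import re
from bs4 import BeautifulSoup

with open("book.html", encoding="utf-8") as f:  # placeholder file name
    html = f.read()

# Pre-edit: flatten italics into __ markers so each <p> holds a single string.
flattened = re.sub(r"</?i>", "__", html)
soup = BeautifulSoup(flattened, "html.parser")

# ...run the find_all(string=re.compile(r"^\d+\. ")) search here...
# ...then turn the __...__ markers back into <i> sub-elements...
=======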

leonardr

Sep 3, 2024, 3:05:04 PM
to beautifulsoup
If the ":contains()" pseudo-class idea doesn't work for whatever reason, you can pass a function into the find() methods that checks .text instead of .string. The performance of this is bad, which is why it's not part of the find() API the way .string is, but it'll do what you want:

from bs4 import BeautifulSoup
import re

markup = """<div id="page17"><p>body</p>

<p>1. footnote
blah.</p>
<p>2. blah <i>et a l,</i> blah
<i>blah,</i> blah
blah.</p>
<p>3. blah, <i>blah
and</i> <i>Change,</i> blah</p>
</div>"""

soup = BeautifulSoup(markup, 'html.parser')

FOOTNOTE_RE = re.compile(r"^\d+\. ")

def containsText(x):
    # Match <p> tags whose full text (including text inside <i> children) starts with "1. ", "2. ", etc.
    return x.name == "p" and FOOTNOTE_RE.match(x.text)


divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
    ps = div.find_all(containsText)
    for p in ps:
        print(p.text + "\n")

Leonard

Heck Lennon

Sep 4, 2024, 5:40:03 AM
to beautifulsoup
Thanks very much.

Isaac Muse

Sep 5, 2024, 9:16:21 AM
to beautifulsoup

It should be noted that Soup Sieve offers :-soup-contains() and :-soup-contains-own(). The -soup prefix was added to signify that they are not CSS-provided pseudo-classes. The -own variant ensures that the text is found in the target element's own text, not in its descendants.
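
A quick sketch of the difference:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>2. blah <i>et al,</i> blah</p>', 'html.parser')

# Matches: "et al" appears in the <p>, counting text inside descendants like <i>.
print(soup.select('p:-soup-contains("et al")'))

# No match: "et al" only appears inside the <i> child, not in the <p>'s own text.
print(soup.select('p:-soup-contains-own("et al")'))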
