Cannot find Element with specific text!


Tillmann Brodbeck

Feb 16, 2021, 9:29:27 AM
to beautifulsoup
Hi. I have an issue with finding an HTML tag with specific text:
```
h2 = html.findAll('h2')[1]
print(h2)

<h2><a id="wichtige_Informationen"></a>Was passiert mit E-Mails, E-Mail Adressen und der Webseite bei einem Domainumzug?</h2>

text = h2.text
print(html.findAll('h2', text=text))

[]
```
As you can see in this example, I cannot find an element by its specific text. Am I missing something, or is this a bug? Thank you!
I am using `bs4==0.0.1`, by the way. In some other cases, such as `<h2>Interner Domainumzug oder externer Domainumzug - was ist der Unterschied?</h2>`, it works.

leonardr

Feb 16, 2021, 10:28:12 AM
to beautifulsoup

Thanks for writing in. This is a common question and there's a discussion of the issue in bug #1645513. In that issue, Isaac Muse mentions that the ":contains()" CSS pseudoclass may let you run the query you want using the select() method.

Here's what's going on:

When you combine a tag name and a string in a find()-type method, Beautiful Soup searches for a tag with that name whose .string value is that string. Your problem happens because .string is only set when a tag contains a single string and nothing else. Otherwise, .string is None.

That's why your find() call works for <h2>Interner Domainumzug oder externer Domainumzug - was ist der Unterschied?</h2> -- the <h2> tag contains a single string and nothing else.

The <h2> tag in your first example contains two things -- an <a> tag and a string -- so .string is None, and the find() call finds nothing.
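
To make the behavior described above concrete, here is a minimal sketch. The shortened HTML and the `h2_with_text` helper are made up for illustration and are not part of Beautiful Soup. It uses `string=`, the newer name for the `text=` argument in the original post, and shows one possible workaround: passing a function that compares the tag's full text instead of its `.string`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the two headings from the original post.
html = (
    '<h2><a id="wichtige_Informationen"></a>Was passiert mit E-Mails?</h2>'
    '<h2>Interner Domainumzug oder externer Domainumzug?</h2>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("h2")

# The first <h2> has two children (<a> and a string), so .string is None.
# The second <h2> has a single string child, so .string is that string.
print(first.string)
print(second.string)

# Matching by string therefore finds the second heading but not the first.
by_string_first = soup.find_all("h2", string=first.get_text())
by_string_second = soup.find_all("h2", string=second.get_text())


def h2_with_text(wanted):
    """Hypothetical helper: build a filter that compares an <h2>'s full text."""
    def match(tag):
        return tag.name == "h2" and tag.get_text() == wanted
    return match


# Workaround: find_all() accepts a function that is called with each tag.
by_text = soup.find_all(h2_with_text(first.get_text()))
print(by_text)
```

The function filter visits every tag in the document, so it may be slower than a plain string match on large pages.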

There are a number of possible strategies for changing this behavior, but they would break backwards compatibility pretty badly, and some of them also have very bad performance characteristics, so I don't think I'm going to implement a change until the next major release of Beautiful Soup.

Incidentally, the "bs4" package is an alias for the "beautifulsoup4" package, which I created to prevent someone else from registering a package called "bs4". I recommend installing the "beautifulsoup4" package instead of "bs4". It won't change any functionality, but it'll be easier to see which version is in use.

Leonard

facelessuser

Feb 16, 2021, 11:07:46 AM
to beautifulsoup

Yes, Leonard is right: `:contains()`, or in recent versions `:-soup-contains()`, will do this. (As `:contains()` is non-standard, we've decided to use prefixes moving forward to avoid future conflicts, but `:contains()` will still work with a warning.)


```
>>> print(soup.select("h2:-soup-contains('{}')".format(h2.text)))
[<h2><a id="wichtige_Informationen"></a>Was passiert mit E-Mails, E-Mail Adressen und der Webseite bei einem Domainumzug?</h2>]
```

Obviously, you may need to escape quotes if the text contains quotes that conflict with the ones surrounding the argument of `:-soup-contains()`, but generally it works.
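
As a rough sketch of that escaping, assuming a Soup Sieve version recent enough to support `:-soup-contains()`: the `css_string` helper below is hypothetical (it is not part of Beautiful Soup or Soup Sieve) and the HTML is made up. It backslash-escapes backslashes and double quotes before wrapping the text in double quotes. It does not handle newlines in the text.

```python
from bs4 import BeautifulSoup


def css_string(text):
    """Hypothetical helper: quote text as a double-quoted CSS string.

    Escapes backslashes first, then double quotes. Newlines are not handled.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# Made-up heading whose text contains both kinds of quotes.
html = '<h2><a id="faq"></a>What does "Domainumzug" mean, and what\'s next?</h2>'
soup = BeautifulSoup(html, "html.parser")

wanted = soup.h2.get_text()
selector = "h2:-soup-contains({})".format(css_string(wanted))
matches = soup.select(selector)
print(selector)
print(matches)
```

Note that `:-soup-contains()` matches substrings, so it can also match headings that merely include the given text.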

facelessuser

Feb 16, 2021, 11:53:15 AM
to beautifulsoup

Also, I'm a little confused about your version, bs4==0.0.1. The current beautifulsoup4 version is:

>>> import bs4
>>> bs4.__version__
'4.9.3'

facelessuser

Feb 16, 2021, 11:55:24 AM
to beautifulsoup

Never mind, I guess it is a name used to prevent name squatting: https://pypi.org/project/bs4/. But you should probably install beautifulsoup4 directly and pin an appropriate version instead.

Elton Senne

Apr 19, 2023, 9:16:25 PM
to beautifulsoup
Hi Leonard,

I'm facing exactly the same problem.
The find() and findAll() methods return None or [] for some tags.
My bs4 version is 4.12.2. Was this issue solved in this version? Is there a solution?

When I try:

site =  '<div class="bz1lBb"><form class="Pg70bf" id="sf"><input name="ie" type="hidden" value="ISO-8859-1"/><div class="H0PQec"><div class="sbc esbc"><input autocapitalize="none" autocomplete="off" class="noHIxc" name="q" spellcheck="false" type="text" value="desentupidora na consolacao"/><input name="oq" type="hidden"/><input name="aqs" type="hidden"/><div class="x">×</div><div class="sc"></div></div></div><button id="qdClwb" type="submit"></button></form></div>'

link = site.find('div' , attrs={'class': 'bz1lBb'})
print(link)

It returns None.

Can you help me?

Thank you,

Elton Senne

Isaac Muse

Apr 20, 2023, 9:08:06 AM
to beautifulsoup

I have no idea exactly what you are doing, as you seem to be calling find directly on a string, site, instead of using BeautifulSoup, but when I use BeautifulSoup, I have no issues finding elements:

from bs4 import BeautifulSoup
site = '<div class="bz1lBb"><form class="Pg70bf" id="sf"><input name="ie" type="hidden" value="ISO-8859-1"/><div class="H0PQec"><div class="sbc esbc"><input autocapitalize="none" autocomplete="off" class="noHIxc" name="q" spellcheck="false" type="text" value="desentupidora na consolacao"/><input name="oq" type="hidden"/><input name="aqs" type="hidden"/><div class="x">×</div><div class="sc"></div></div></div><button id="qdClwb" type="submit"></button></form></div>'
soup = BeautifulSoup(site, 'html.parser')
link = soup.find('div', attrs={'class': 'bz1lBb'})
print(link)

Output:

<div class="bz1lBb"><form class="Pg70bf" id="sf"><input name="ie" type="hidden" value="ISO-8859-1"/><div class="H0PQec"><div class="sbc esbc"><input autocapitalize="none" autocomplete="off" class="noHIxc" name="q" spellcheck="false" type="text" value="desentupidora na consolacao"/><input name="oq" type="hidden"/><input name="aqs" type="hidden"/><div class="x">×</div><div class="sc"></div></div></div><button id="qdClwb" type="submit"></button></form></div>

Elton Senne

Apr 20, 2023, 9:10:50 PM
to beautifulsoup
I simplified the code just to make the example easier to follow. In my code, site is a URL.

Isaac Muse

Apr 20, 2023, 9:46:45 PM
to beautifulsoup
Well, if you have a minimal, reproducible example, and maybe outline what you are expecting to get vs what you are actually getting, I'm more than happy to take a look, but based on what you provided, it seems to work as expected. Currently, I'm not understanding exactly what issue you are having.

Elton Senne

Apr 20, 2023, 10:29:26 PM
to beautifulsoup
I'll give a better briefing:

I'm trying to scrape Google search results.
I ran BeautifulSoup on the URL 'https://www.google.com.br/search?q=desentupidora+na+consolacao' and realized that in some cases I cannot get a tag using either the .find() or the findAll() method.

resposta = requests.get('https://www.google.com.br/search?q=desentupidora+na+consolacao')
site = BeautifulSoup(resposta.content, 'html.parser')

This one works fine:
links = site.findAll('div', attrs={'class': 'v5yQqb'})

But this one doesn't:
links = site.findAll('div', attrs={'class': 'vdQmEd'})

See the lines highlighted in blue in the screenshot attached.

Screen Shot 2023-04-20 at 23.16.29.png

Thank you very much,

Elton Senne

Isaac Muse

Apr 20, 2023, 11:41:01 PM
to beautifulsoup

You would need to dump out a real example to test. I see you are viewing these classes in a browser, but you may not be considering that some of these classes are injected dynamically via JavaScript after the page loads. So you would need to dump the actual HTML that you scraped to see whether the classes you think are there are actually there.
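
One way to run that check is sketched below. The `classes_in` helper and the sample markup are made up for illustration. In practice you would pass in the HTML you actually received (for example `resposta.text` from the earlier post) rather than what the browser's inspector shows.

```python
from bs4 import BeautifulSoup


def classes_in(html):
    """Hypothetical helper: collect every class name present in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    # class_=True matches any tag that has a class attribute at all.
    for tag in soup.find_all(class_=True):
        found.update(tag.get("class", []))
    return found


# Made-up stand-in for the HTML actually returned by the request.
received = '<div class="v5yQqb"><span class="x y">hit</span></div>'

present = classes_in(received)
print("v5yQqb" in present)
print("vdQmEd" in present)
```

If a class shows up in the browser but not in this set, it was most likely added after page load or served only to the browser, and find() cannot see it in the scraped HTML.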
