find_all on a class with exact string value


Matt LaPlante

May 28, 2019, 4:23:40 PM
to beautifulsoup
I'm having trouble with the documentation on CSS class searching behavior. My goal is to match tags where a class has a specific, limited value only. For example:

soup.find_all('div', class_='abc')

I want this to match

<div class="abc">

but not match

<div class="def abc hij">

In referencing the documentation for Searching by CSS class (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class), it's clear that "when you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes." This means the standard behavior will cause both of the above to match. However, the workarounds are less clear. The next bit gives hope, stating that "You can also search for the exact string value of the class attribute," which sounds like exactly what I want, and it gives this example:

css_soup.find_all("p", class_="body strikeout")

However, this seems predicated on having a multi-word string for the class. Removing one of the two words makes the call syntactically identical to the default single-class search, so it will match every tag that carries that class. Is there a way to denote that, even though I'm only providing one class name, I want the exact string value?


facelessuser

May 28, 2019, 9:37:07 PM
to beautifulsoup
So, Beautiful Soup seems to handle certain attributes, those known to be space-separated lists, specially. Internally, class is stored as a list. When you give `class_` a value, it tries to match each value individually, and it also tries to match the whole attribute as a single string with the values separated by spaces.
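To make that concrete, here is a small sketch of the behavior described above, using the two divs from the question. It only assumes the built-in 'html.parser'; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="abc"></div><div class="def abc hij"></div>'
soup = BeautifulSoup(html, 'html.parser')

# The class attribute is stored as a list of individual class names.
class_lists = [div['class'] for div in soup.find_all('div')]
print(class_lists)

# A single class name is compared against each entry in that list,
# so both divs match.
single = soup.find_all('div', class_='abc')
print(len(single))  # 2

# A multi-word string can only match the whole attribute joined by spaces,
# so only the second div matches.
whole = soup.find_all('div', class_='def abc hij')
print(len(whole))  # 1
```

Since a one-word value always hits the per-class comparison first, there is no way to tell `class_` on its own that 'abc' is meant as the whole attribute.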

This is why it is difficult to do what you want. I sometimes find this kind of "magic" confusing, much like you do. Still, there are ways to get what you want.

One is to provide a function that evaluates the class exactly as you mean it to be evaluated. The other, at least for this case, is to use a CSS selector:

from bs4 import BeautifulSoup

html = """
<div class="abc"></div>
<div class="def abc hij"></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Custom function
def class_filter(el):
    attr = el.attrs.get('class')
    return attr is not None and ' '.join(attr) == 'abc'

print('===function===')
print(soup.find_all(class_filter))

# CSS selector
print('===selector===')
print(soup.select('[class="abc"]'))


Output:

===function===
[<div class="abc"></div>]
===selector===
[<div class="abc"></div>]
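Note that neither version above restricts the match to div tags, which the original find_all('div', ...) call did. A possible variation, as a rough sketch: `exact_class` is a hypothetical helper name (not part of Beautiful Soup) that builds the filter function for any class string and optional tag name, and the selector simply gets a `div` prefix. The extra `<p>` in the sample markup is added only to show the tag restriction.

```python
from bs4 import BeautifulSoup


def exact_class(value, name=None):
    """Build a find_all() filter for tags whose whole class attribute equals value.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    def matcher(tag):
        classes = tag.get('class')
        if classes is None:
            return False
        if name is not None and tag.name != name:
            return False
        # class is normally a list; tolerate a plain string as well.
        joined = classes if isinstance(classes, str) else ' '.join(classes)
        return joined == value
    return matcher


html = """
<div class="abc"></div>
<div class="def abc hij"></div>
<p class="abc"></p>
"""

soup = BeautifulSoup(html, 'html.parser')

# Function filter, limited to div tags.
divs_by_function = soup.find_all(exact_class('abc', 'div'))
print(divs_by_function)

# Equivalent CSS selector, limited to div tags.
divs_by_selector = soup.select('div[class="abc"]')
print(divs_by_selector)
```

Both calls should return only the first div, skipping the multi-class div and the `<p>`.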


Matt LaPlante

May 29, 2019, 11:43:16 AM
to beautifulsoup
Awesome reply, thank you. This was very helpful!