The problem with regex

SK

Jun 19, 2016, 5:15:24 PM

to beautifulsoup

import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('p', {'class':re.compile('^story$')}).text)


I want to get "one" but I get "two". Please tell me where the error is.
stackoverflow.com/questions/37897087/beautifulsoup-regex-not-working

Jim Tittsler

Jun 19, 2016, 11:33:07 PM

to beautifulsoup

On Mon, Jun 20, 2016 at 6:14 AM, SK <semyon....@gmail.com> wrote:

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""


[...]


print(soup.find('p', {'class':re.compile('^story$')}).text)

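The HTML class attribute is a list of space-separated values (not simply a string that might contain spaces). Beautiful Soup tests your pattern against each value separately, so '^story$' matches the "story" value in class="story two", and that tag comes first in the document.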

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
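To make that concrete, here is a small sketch that reuses the html_doc from the question. It prints the class lists as Beautiful Soup stores them with html.parser, then repeats the original regex search to show which tag it lands on. The variable names class_lists and first_match are only illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each class attribute is parsed into a list of separate values.
class_lists = [p["class"] for p in soup.find_all("p")]
print(class_lists)  # [['title'], ['story', 'two'], ['story']]

# The pattern is tried against each value on its own, so the "story"
# entry of ['story', 'two'] satisfies '^story$' and that tag wins.
first_match = soup.find("p", {"class": re.compile("^story$")})
print(first_match.text)  # two
```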

Here is an explicit way of doing it:

def p_only_story(tag):
    # Match only <p> tags whose class list is exactly ['story'].
    classes = tag['class'] if 'class' in tag.attrs else []
    return tag.name == 'p' and len(classes) == 1 and classes[0] == 'story'

print(soup.find(p_only_story))
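If you need this for more than one tag or class, the same check can be wrapped in a small factory. This is only a sketch: has_only_class is a hypothetical helper name, not something Beautiful Soup provides. It relies on tag.get('class') returning the list of class values, or None when the attribute is missing.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_only_class(name, cls):
    """Build a filter for <name> tags whose class list is exactly [cls].

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    def matcher(tag):
        # tag.get("class") is a list of values, or None if there is no class.
        return tag.name == name and tag.get("class") == [cls]
    return matcher


result = soup.find(has_only_class("p", "story")).text
print(result)  # one
```

Comparing the whole list against ['story'] rules out "story two" without any regex, and tags without a class attribute are skipped because None never equals a list.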
