The problem with regex


SK

Jun 19, 2016, 5:15:24 PM
to beautifulsoup
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('p', {'class':re.compile('^story$')}).text)


I want to get "one", but I get "two".
Please tell me where the error is.
stackoverflow.com/questions/37897087/beautifulsoup-regex-not-working

Jim Tittsler

Jun 19, 2016, 11:33:07 PM
to beautifulsoup
On Mon, Jun 20, 2016 at 6:14 AM, SK <semyon....@gmail.com> wrote:
html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

[...]

print(soup.find('p', {'class':re.compile('^story$')}).text)

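The HTML class attribute is a list of space-separated values (not simply a string that might contain spaces). Beautiful Soup therefore tests your regex against each class value separately, so the "story" in class="story two" matches ^story$ and that tag is found first.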
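To see this for yourself, here is a small sketch that reuses the html_doc from the question and prints how each class attribute is stored. The names class_lists and first_match are only for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Each class attribute is stored as a list of separate values.
class_lists = [p['class'] for p in soup.find_all('p')]
print(class_lists)  # [['title'], ['story', 'two'], ['story']]

# The regex is checked against the individual values, so the second <p>
# (which has the value 'story' in its list) is the first match.
first_match = soup.find('p', {'class': re.compile('^story$')})
print(first_match['class'], first_match.text)
```

This is why anchoring the pattern with ^ and $ does not exclude the "story two" paragraph.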

Here is an explicit way of doing it:

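def p_only_story(tag):
    # 'class' is a list of values; fall back to an empty list if the tag has none
    classes = tag.get('class', [])
    # match only <p> tags whose single class is exactly 'story'
    return tag.name == 'p' and len(classes) == 1 and classes[0] == 'story'

print(soup.find(p_only_story))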
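If you need the same check for other tags or class sets, one possible variation is to compare the whole class list at once. This is only a sketch; has_exact_classes is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title">The Dormouse's story</p>
<p class="story two">two</p>
<p class="story">one</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

def has_exact_classes(tag, name, wanted):
    # Hypothetical helper: true when the tag has the given name and its
    # class list equals `wanted` exactly (same values, same order).
    return tag.name == name and tag.get('class', []) == wanted

only_story = soup.find(lambda tag: has_exact_classes(tag, 'p', ['story']))
print(only_story.text)  # one
```

Comparing lists is order-sensitive, so class="two story" would not equal ['story', 'two']; compare sets instead if the order should not matter.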

