Matching an empty 'selected' attr with built-in parser

166 views
Skip to first unread message

Sam Lai

unread,
Sep 7, 2013, 7:02:53 AM9/7/13
to beauti...@googlegroups.com
I think I've found a deliberate quirk with BeautifulSoup but I'm not sure why it exists.

I'm trying to match an empty attribute, specifically the 'selected' attribute in the option tag. Below is an example chunk of HTML - I'm trying to match the option with value 'B'.

<select class="period" name="period">
<option value=" ">21 May 2013 until 01 September 2013</option>
<option value="A">02 September 2013 until 03 September 2013</option>
<option value="B" selected>04 September 2013 until 07 September 2013</option>
<option value="C">08 September 2013 until 12 September 2013</option>
<option value="D">13 September 2013 onwards</option>
</select>

According to the docs, the following command should match it.

soup.find_all('option',attrs={'selected':True})

... however, it matches nothing; an empty list is returned.

I'm using Python 3.3 with beautifulsoup4 4.3.1. I do not have any other XML/HTML parsers installed, i.e. neither lxml or html5lib are installed, therefore I believe the built-in parser is used.

I've tracked this problem down to this line in the code - http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1575. Is there any reason why there is a 'is not None' condition on that line? The built-in parser parses the 'selected' attribute with a value of None (because no value has been specified), hence the matcher fails incorrectly.

I believe this doesn't happen with lxml or html5lib because those parsers implicitly converts the missing attribute value into an empty string.

The fix is simple (remove the condition on that line), but I'm not sure what the consequences are.

Sam


Leonard Richardson

unread,
Sep 8, 2013, 9:41:41 AM9/8/13
to beautifulsoup
Sam,

This is a quirk, and the documentation is inaccurate. There are four cases to consider:

1. The attribute is present and has a value
2. The attribute is present and has no value (i.e. its value is None)
3 The attribute is present and its value is the empty string
4. The attribute is not present

Searching for attr=True will find 1 and 3. Searching for attr=None will find 2. Searching for attr=False (which is not supported) will also find 2. To find 4 you will need to pass in a custom function that checks has_attr.

So attr=True doesn't mean "attr exists" (per the doc), it means "attr has a value".

The other problem is that the html.parser treebuilder interprets "<foo selected>" as a <foo> tag with the 'selected' attribute set to None (#2). This is different from lxml and html5lib, which interpret the same markup as a <foo> tag with the 'selected' attribute set to the empty string (#3).

It's difficult to fix this without breaking existing code, but fixing the tree builders to be consistent (such that <foo selected> always becomes <foo selected="">) would be a good first step. That would break some code but it would also fix some code--code that currently breaks randomly depending on which builder is in use.

At that point, case #2 would never happen on parsing. It would only happen if the user explicitly set an attribute value to None.

And at that point, I could change  the match algorithm that so that attr=True means "attr exists" (1,2,3), attr=False means "attr does not exist" (4), and attr=None still means "attr has the literal value None" (2), but attr=None would almost never work.

I'm not promising any of this, but I think I could make this consistent in a way that makes sense.

Leonard


--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To post to this group, send email to beauti...@googlegroups.com.
Visit this group at http://groups.google.com/group/beautifulsoup.
For more options, visit https://groups.google.com/groups/opt_out.

Sam Lai

unread,
Sep 10, 2013, 1:22:41 AM9/10/13
to beauti...@googlegroups.com, leon...@segfault.org
Leonard,

Thanks for the quick response. That sounds like a plan, and after having a quick look at the code, it seems like the change should be relatively simple. In builder/_htmlparser.py, in the handle_starttag method on BeautifulSoupHTMLParser, change the contents to the following -

attrs = dict(attrs)
for k,v in attrs.items():
    if v is None:
        attrs[k] = ''
self.soup.handle_starttag(name, None, None, attrs)

Then it's just a matter of removing the condition on http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1575 and adding a condition into the search_tag method on SoupStrainer to handle the attr=False scenario. 

I agree with most of the breaking changes as the goal of find/find_all is to make a positive match on nodes, therefore this change should have minimal impact and will ensure the correct nodes are found with all parsers. I'm not so sure about the attrs=False change however, because even though it is unsupported, with this change, any one relying on that behaviour will now get an almost opposite result. It is a useful feature to have though.

Sam
Reply all
Reply to author
Forward
0 new messages