Beautiful Soup 4 beta 5


Leonard Richardson

Feb 9, 2012, 4:52:14 PM
to beauti...@googlegroups.com
Another day, another beta. My focus this time was on fixing
accumulated bugs from the bug tracker. The changelog is below, but
there's one thing in particular I want opinions on.

Here's some markup:

<p class="body strikeout">

Up to this point, the value of the "class" attribute in that tag has
been the string "body strikeout". There was some limited support for
searching by CSS class, but it was a hack and not implemented
consistently.

In beta 5, the value of that class attribute is the list ["body",
"strikeout"]. The 'class' attribute, and a few others that are very
obscure, can have more than one value. This is documented here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#multivalue

A semi-related feature is that you can also search by CSS class in a
consistent way. Any kind of search against CSS class will be run
separately against all of a tag's CSS classes. Documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class

The only thing I'm not sure about is whether it's good to have
p['class'] be sometimes a list and sometimes a string:

In <p class="body strikeout">, p['class'] == ['body', 'strikeout']
In <p class="body">, p['class'] == 'body'

This may confuse users. I certainly don't want to make this attribute
*always* be a list. I could present p['class'] as "body strikeout"
when it was accessed, but treat it as ["body", "strikeout"] when
searching. Would that itself be confusing? If you have a strong
opinion, let me know.

Leonard

---

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
belonging to multiple CSS classes is treated as having a list of
values for the 'class' attribute. Searching for a CSS class will
match *any* of the CSS classes.

This actually affects all attributes that the HTML standard defines
as taking multiple values (class, rel, rev, archive, accept-charset,
and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
to one of the find* methods, it'll assume you want to use that
object to search against a tag's CSS classes. Previously this only
worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
page, it will try each of its guesses again, with errors="replace"
instead of errors="strict". This may mean that some data gets
replaced with REPLACEMENT CHARACTER, but at least most of it will
get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
begin with.
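
The errors="replace" fallback described in the changelog above can be pictured with a small standard-library sketch. This is not Beautiful Soup's actual code: the function name decode_with_fallback and the list of guessed encodings are assumptions made purely for illustration.

```python
def decode_with_fallback(data, guesses):
    """Try each guessed encoding strictly, then again with replacement.

    Hypothetical helper, not part of Beautiful Soup or Unicode, Dammit.
    Returns (text, encoding, lossy), where lossy is True if characters
    may have been replaced with U+FFFD REPLACEMENT CHARACTER.
    """
    for errors in ("strict", "replace"):
        for encoding in guesses:
            try:
                text = data.decode(encoding, errors)
            except (UnicodeDecodeError, LookupError):
                # Wrong guess or unknown codec name: try the next one.
                continue
            return text, encoding, errors == "replace"
    raise ValueError("none of the guessed encodings could decode the data")


# b"\xff" is invalid in both ASCII and UTF-8, so the strict pass fails
# for every guess and the second pass substitutes a replacement character.
text, encoding, lossy = decode_with_fallback(b"caf\xff", ["ascii", "utf-8"])
```

The point of the two passes is that a clean decode with any guess is always preferred over a lossy decode with an earlier guess.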

Derek Litz

Feb 10, 2012, 12:31:41 AM
to beautifulsoup
I think the value of the class attribute should be a list, always.

Mainly because this Python object most closely matches the
specification for what the values of a class attribute represent
in an html file. The only difference I can see in meaning is the
order doesn't matter, but it does for a list. I assume, of course,
that this special treatment of the class attribute only applies
when using an htmlparser (as opposed to xml).

http://dev.w3.org/html5/spec/Overview.html#classes

Yes, it says set, but the spec never says anything about the
values being unique in this case.

It would also be nice to match the meanings of the other special
attributes. For example, an unordered set of unique space-separated
tokens could be represented with a set; this applies to the
"headers" attribute.

What to do about an ordered set? Well... a list is again ideal
for this situation, but it might be worth using an ordered
set.

If you want to take it that far you could even implement an
unordered list! But, in this case I believe practicality
beats purity (but it's close.)

html5 set - Python list
html5 unordered unique set - Python set
html5 ordered unique set - Python list
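
As a rough sketch of the mapping above, here is how the three kinds of space-separated token values might be turned into Python objects. The function names are hypothetical, and dropping duplicates in the ordered case is an assumption that goes slightly beyond a plain list.

```python
def as_token_list(value):
    """Set of space-separated tokens (e.g. class): keep order and duplicates."""
    return value.split()


def as_unordered_unique(value):
    """Unordered set of unique space-separated tokens: a plain Python set."""
    return set(value.split())


def as_ordered_unique(value):
    """Ordered set of unique tokens: a list where the first occurrence wins.

    Removing duplicates here is an assumption made for this sketch.
    """
    seen = set()
    result = []
    for token in value.split():
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result
```

str.split() with no arguments splits on any run of whitespace, so stray spaces or newlines inside an attribute value do not produce empty tokens.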

(There are more special attribute values to be considered
as well.)

(Note: I would NOT be opposed to special objects like
an unordered list etc, that match the html meanings more
closely.)

This will provide some consistency with the specification and
is grounded in reality rather than being a somewhat arbitrary
design decision.

Perhaps some way to get the underlying text values as well
for each of these objects too :), hmm now I'm starting to
lean towards specialty attribute value objects. What do
you think?


Tikitu

Feb 10, 2012, 5:19:58 AM
to beautifulsoup
I agree that having a polymorphic type (sometimes string, sometimes
list) is likely to lead to confusion. Sloppy code will assume one or
the other, and the fact that strings are iterable can make debugging
this stuff harder ("Why do I have a list of one-char classes?").
Careful code will have to check, which is a pain.
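
To make the pitfall concrete, here is a minimal sketch in plain Python 3 (no Beautiful Soup needed). The helper class_names is hypothetical; it shows the kind of check careful code would need if the value can be either a string or a list.

```python
def class_names(value):
    """Return a list of class names whether value is a string or a list.

    Hypothetical helper for the sometimes-string, sometimes-list
    behaviour discussed in this thread.
    """
    if isinstance(value, str):
        return value.split()
    return list(value)


# The pitfall: looping directly over a single-class string yields characters.
naive = [name for name in "body"]

# The check makes both shapes behave the same way.
single = class_names("body")
multiple = class_names(["body", "strikeout"])
```

Here naive ends up as four one-character strings, while single is a one-element list.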

How about a convention that if attribute x can be multivalue,
accessing [x] gives you the string and accessing [x + '_list'] always
gives a (possibly empty or singleton) list? That seems more pythonic
to me (highly explicit), at the expense of diverging from the html.
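
A minimal sketch of that convention, assuming attributes live in a dict-like object. MultiValueAttrs is a hypothetical class, not anything Beautiful Soup provides; the set of multi-valued attribute names is taken from the changelog earlier in the thread.

```python
class MultiValueAttrs(dict):
    """Hypothetical attribute mapping, not part of Beautiful Soup.

    attrs["class"] returns the raw string as written in the markup;
    attrs["class_list"] always returns a list, possibly empty.
    """

    MULTI_VALUED = {"class", "rel", "rev", "archive", "accept-charset", "headers"}
    SUFFIX = "_list"

    def __getitem__(self, key):
        if key.endswith(self.SUFFIX):
            name = key[: -len(self.SUFFIX)]
            if name in self.MULTI_VALUED:
                # A missing attribute gives an empty list, not a KeyError.
                return self.get(name, "").split()
        return super().__getitem__(key)


attrs = MultiValueAttrs({"class": "body strikeout", "id": "intro"})
```

With this, attrs["class"] stays a string and attrs["class_list"] is a list, so neither access ever changes type depending on the markup.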

Searching is also a bit tricky: find_all(class='strikeout') should
find a class="body strikeout", but I would expect
find_all(class=['body', 'strikeout']) *not* to find a class="strikeout"
element. That's against the usual OR semantics of lists in filters,
though, if I'm reading the documentation right.

(Disclaimer: I don't actually use Beautiful Soup, so take these
opinions with a large grain of salt.)

Cheers,
Tikitu

Derek Litz

Feb 10, 2012, 10:24:52 AM
to beautifulsoup
Yeah, if the consensus is that using objects for attribute
values may lead to confusion then the attribute values should
be strings always.

In this case I certainly would like, as Tikitu points out,
to have some way to access a Pythonic representation of the
attribute that matches the meaning of the html spec.

hmm from the docs:

soup.find_all(["a", "b"])

If you pass in a list, Beautiful Soup will allow a string
match against any item in that list. This code finds all
the <a> tags and all the <b> tags.

So I disagree on my expectation, it should retain OR
semantics.

soup.find_all('p', ['body', 'strikeout'])

If I wanted to do an AND, I could do something like:

soup.find_all('p', ['body'])
soup.find_all('p', ['strikeout'])

However, I'd also expect soup.find_all('p', 'body strikeout')
either to raise an exception because it's not a list or to
try to match an identical string. Currently it returns
an empty list, though, which I find highly unexpected, and which
could lead to subtle errors in code.

I actually lean towards raising an exception, we should always
have to pass a list when searching for classes on a tag.
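
As a sketch of what raising an exception could look like, here is a hypothetical validation step. The name validate_class_query is invented for this example, and still accepting a single space-free string is an assumption, a looser rule than always requiring a list.

```python
def validate_class_query(query):
    """Hypothetical check, not Beautiful Soup behaviour.

    Accepts a single class name or a list of class names and rejects
    any name that is empty or contains whitespace.
    """
    names = [query] if isinstance(query, str) else list(query)
    for name in names:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(
                "CSS class names cannot contain whitespace: %r" % (name,)
            )
    return names
```

Under this rule validate_class_query('body strikeout') fails loudly instead of quietly matching nothing.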

Tikitu

Feb 11, 2012, 6:56:12 PM
to beautifulsoup
Another 2c worth:

On Feb 10, 4:24 pm, Derek Litz <litzoma...@gmail.com> wrote:
> [...] So I disagree on my expectation, it should retain OR
> semantics.

Sorry, I was unclear: I meant that my expectation *without knowing
Beautiful Soup* would be that ['oneclass', 'anotherclass'] should do
an AND search (based on how you would read the html). As you say, the
OR semantics should be maintained.

(Also: only now, re-reading the docs and trying some examples in a
terminal, do I realise why find_all(class='body') won't work: "class"
is a reserved word in Python. How bloody annoying...)

But CSS class AND seems to me quite an important thing to make easy to
do. It would be awesome (although presumably quite a rewrite) if
soup.find_all('p', 'body').find_all('p', 'strikeout') would do the AND.
(I don't see a nice way to avoid the repeated 'p', but as I said I'm
just working from the docs rather than from actually using this
stuff.) Of course it's also possible to do:

l1 = soup.find_all('p', 'body')
l2 = soup.find_all('p', 'strikeout')
l3 = [p for p in l1 if p in l2]

but it's not very satisfying.
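
For comparison, the OR and AND semantics can be written as two small predicates over a tag's class value. This is only a sketch in plain Python with hypothetical names; it operates on the class value itself (string or list) rather than on Beautiful Soup objects.

```python
def classes_of(value):
    """Normalise a class value (string or list) to a set of names."""
    return set(value.split() if isinstance(value, str) else value)


def matches_any(value, wanted):
    """OR semantics: at least one wanted class is present."""
    return not classes_of(value).isdisjoint(wanted)


def matches_all(value, wanted):
    """AND semantics: every wanted class is present."""
    return set(wanted) <= classes_of(value)
```

With these, a tag whose class value is "strikeout" satisfies matches_any for ['body', 'strikeout'] but not matches_all.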

> However, I'd also expect soup.find_all('p', 'body strikeout')
> either to raise an exception because it's not a list or to
> try to match an identical string.

I agree, leaning towards the exception; as it stands, making a call
with a string containing a space is definitely a user error (either a
typo or they intend something they won't get).

cheers,
Tikitu