Beautiful Soup 3.2.1, and Beautiful Soup 4 beta 6

Leonard Richardson

Feb 16, 2012, 8:49:49 AM
to beauti...@googlegroups.com
Not one but two releases today. First, the first real 3.x release in
almost two years.

http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz

This fixes a bug that can allow cross-site scripting attacks if
Beautiful Soup is used to sanitize HTML:

https://bugs.launchpad.net/beautifulsoup/+bug/868921

On output, angle brackets and bare ampersands are now escaped to XML
entities in strings. Previously they were only escaped in attribute
values. Beautiful Soup 4 escapes XML entities by default, so the
problem does not exist there unless you deliberately cause it (e.g. by
setting formatter=None).
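
To make the escaping rule concrete, here is a minimal standard-library sketch of what "escape angle brackets and bare ampersands" means for a text node. It is an illustration only, not Beautiful Soup's own code; the names BARE_AMP_OR_BRACKET, ENTITIES and escape_text are made up for this example.

```python
import re

# Matches "<", ">", or an "&" that does not already start an entity
# such as "&amp;", "&#38;" or "&#x26;" (a "bare" ampersand).
BARE_AMP_OR_BRACKET = re.compile(r"[<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")

# Replacement entity for each character the pattern can match.
ENTITIES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}


def escape_text(text):
    """Escape angle brackets and bare ampersands in a piece of text."""
    return BARE_AMP_OR_BRACKET.sub(lambda m: ENTITIES[m.group(0)], text)


if __name__ == "__main__":
    # Markup smuggled into a string comes out inert.
    print(escape_text("<script>alert(1)</script>"))
    # The bare "&" is escaped; the existing "&amp;" is left alone.
    print(escape_text("AT&T &amp; friends"))
```

Without this step, a string containing "<script>" would be written out as live markup, which is the hole a sanitizer cannot afford.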

-----

Now, on to the BS4 beta.

http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.0.0b6.tar.gz

It's almost done at this point. All the reported bugs are fixed except
the lack of namespace support. I'd like to add that before the
release, but I don't know how much work it'll be.

Changelog:

* Multi-valued attributes like "class" always have a list of values,
even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
empty-element tag. This may come back if I add a special XHTML mode
(http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
useless.

* Passing text along with tag-specific arguments to a find* method:

find("a", text="Click here")

will find tags that contain the given text as their
.string. Previously, the tag-specific arguments were ignored and
only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
partially disconnected tree. Generally cleaned up the html5lib tree
builder.

* If you restrict a multi-valued attribute like "class" to a string
that contains spaces, Beautiful Soup will only consider it a match
if the values correspond to that specific string.

That last one is implemented as a big hack, but I can remove the hack
later without changing the API.
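
The two "class" entries in the changelog can be modeled in a few lines of plain Python. This is a rough sketch of the behavior as described above, not the library's implementation; split_classes and class_matches are hypothetical helpers, and comparing against the values joined by single spaces is an assumption about what "correspond to that specific string" means.

```python
def split_classes(raw):
    """Model a multi-valued attribute: always a list, even for one value."""
    return raw.split()


def class_matches(values, wanted):
    """Return True if a tag's class values satisfy the restriction `wanted`."""
    if any(ch.isspace() for ch in wanted):
        # The restriction contains spaces: only the exact string matches.
        # (Assumption: the values are compared joined by single spaces.)
        return " ".join(values) == wanted
    # Otherwise any one of the tag's class values may match.
    return wanted in values


if __name__ == "__main__":
    values = split_classes("body strikeout")
    print(split_classes("body"))                     # a one-element list
    print(class_matches(values, "strikeout"))        # single value
    print(class_matches(values, "body strikeout"))   # exact string
    print(class_matches(values, "strikeout body"))   # different order
```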

Leonard

leonardr

Feb 16, 2012, 9:15:44 AM
to beautifulsoup
BTW, this would be a great time to try and port your BS3 scripts to
BS4, and let me know how difficult it was and what you had to change.

Leonard

Bruce Eckel

Feb 16, 2012, 3:57:14 PM
to beauti...@googlegroups.com
Beta 6 installed without problem using pip.

One issue I came across when running my new app with beta 6 is the lists returned by find_all, based on the new search behavior. Here's my code:

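    # remove() needs a single class name, so unwrap a one-element list
    # once, before the loop.
    name = klass[0] if isinstance(klass, list) else klass
    for tag in soup.body.find_all(True, klass):
        tag['class'].remove(name)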

So here, klass can be either a string (I think with only a single class id in it, right?) or a list of strings (each string with an individual class id).

find_all will find all tags with any of those class ids in them. remove(), however, requires that klass be a single item, not a list. In my case klass could only be a list of one, so I pulled off the only element.

This is not a bug report, just an observation.
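
For the more general case, where klass might be a list with more than one entry, something along these lines could work. It is only a sketch on plain Python lists: remove_classes is a made-up helper, class_list stands in for tag['class'], and the str check assumes Python 3.

```python
def remove_classes(class_list, klass):
    """Remove one class name, or each name in a list, from class_list in place."""
    # Normalize the argument so a bare string and a list are handled alike.
    names = [klass] if isinstance(klass, str) else list(klass)
    for name in names:
        # Skip names the tag does not carry, so remove() cannot raise ValueError.
        if name in class_list:
            class_list.remove(name)
    return class_list


if __name__ == "__main__":
    print(remove_classes(["note", "draft", "wide"], "draft"))
    print(remove_classes(["note", "draft", "wide"], ["note", "wide"]))
```

Inside the loop this would be called as remove_classes(tag['class'], klass).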
