Beautiful Soup 3.1.0 released

646 views
Skip to first unread message

Leonard Richardson

unread,
Dec 27, 2008, 5:01:33 PM12/27/08
to beauti...@googlegroups.com
The only thing special about this version is that it can be made to
run under Python 3. The code is ripped up but it'll improve over the
course of a few minor versions.

http://www.crummy.com/software/BeautifulSoup/#Download

Leonard

CHANGELOG:

A hybrid version that supports 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

<a href="foo</a>, </a><a href="bar">baz</a>
<a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2),HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

<a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.

Reply all
Reply to author
Forward
0 new messages