bugs

Kovid

Oct 9, 2007, 2:30:08 PM

to beautifulsoup

1) Handling of &nbsp;
&nbsp; by itself is converted to a regular space:
unicode(BeautifulSoup('&nbsp;', convertEntities='html'))
u' '
unicode(BeautifulSoup(' ', convertEntities='html'))
u' '
However, when there is other text in the tag, it works correctly:
unicode(BeautifulSoup('a&nbsp;', convertEntities='html'))
u'a\xa0'

2) Handling of hex numeric entities
Doesn't handle hex numeric entities of the form &#x12AB;
Patch to handle_charref:
if ref.lower().startswith('x'):
    ref = int(ref[1:], 16)
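
The idea behind the patch in 2) can be sketched as a standalone function. This is only an illustration in modern Python, not BeautifulSoup's actual code: the name charref_to_codepoint is hypothetical, and it assumes the caller passes the text between "&#" and ";".

```python
def charref_to_codepoint(ref):
    """Return the integer code point for the body of a numeric character
    reference, i.e. the text between '&#' and ';'.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if ref.lower().startswith('x'):
        # Hexadecimal form, e.g. 'x12AB' taken from '&#x12AB;'
        return int(ref[1:], 16)
    # Decimal form, e.g. '160' taken from '&#160;'
    return int(ref)


# Decimal and hex spellings of the same character give the same code point.
for ref in ('160', 'xa0', 'XA0', 'x12AB'):
    print(ref, hex(charref_to_codepoint(ref)))
```

The lower() call makes both "&#x..." and "&#X..." spellings take the hex branch, which is the same check the patch uses.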

tr

Dec 2, 2007, 11:23:16 AM

to beautifulsoup

3) Entities in attributes aren't converted.

print BeautifulSoup('Foe!', convertEntities="html")
Foe!
Should be:
Foe!

4) Handling of numeric entities differs from browsers:

Seen while parsing <http://www.alexanderblum.de/cgi-bin/weblog.php.cgi?weblog=1> with
soup = BeautifulSoup(
    urlopen(URL),
    parseOnlyThese=SoupStrainer("div", align="center"),
    convertEntities="html",  # I tried xml, too
    fromEncoding="windows-1252")

Test it with print(unicode(soup)) - browsers interpret the &#128; as a
cp1252 char 128 (EURO), but soup treats it like unicode char 128. From
what I read such use of entities is incorrect, but "I don't care if
it's valid html", I just want it to work :-).
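
The browser behaviour described in 4) could be approximated roughly as below. This is a sketch in modern Python under stated assumptions, not anything BeautifulSoup provides: the function name browser_style_codepoint is hypothetical, and the choice to fall back to the plain Unicode code point for bytes that Python's cp1252 codec leaves undefined is an assumption of this example.

```python
def browser_style_codepoint(codepoint):
    """Turn a numeric character reference value into a character, treating
    values in the C1 range (128-159) as cp1252 bytes.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode('cp1252')
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined;
            # fall back to the plain Unicode code point for those.
            pass
    return chr(codepoint)


print(repr(browser_style_codepoint(128)))  # cp1252 byte 0x80 is the euro sign
print(repr(browser_style_codepoint(160)))  # outside the C1 range, left alone
```

Values outside 128-159 pass through unchanged, so the usual Unicode reading of numeric entities is kept everywhere else.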

Kent Johnson

Dec 2, 2007, 11:53:38 AM

to beauti...@googlegroups.com

tr wrote:
> 3) Entities in attributes aren't converted.
>
> print BeautifulSoup('Foe!',
> convertEntities="html")
> Foe!
> Should be:
> Foe!

Your example works for me:
In [1]: from BeautifulSoup import BeautifulSoup as bs
In [4]: print bs('Foe!', convertEntities="html")
Foe!

It fails if the entity is not ascii:

In [5]: print bs('Foß!', convertEntities="html")
------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
0xdf in position 0: ordinal not in range(128)

Here is a fix:

In [9]: class MySoup(bs):
   ...:     def convert_codepoint(self, codepoint):
   ...:         return unichr(codepoint)
   ...:
In [10]: print MySoup('Foß!', convertEntities="html")
Foß!

>
> 4) Handling of numeric entities differs from browsers:

> Test it with print(unicode(soup)) - browsers interpret the &#128; as a
> cp1252 char 128 (EURO), but soup treats it like unicode char 128.

FWIW both HTML 4 and XML specify that numeric entities represent Unicode
(ISO 10646) code points:
http://www.w3.org/TR/html4/charset.html#h-5.3.1
http://www.w3.org/TR/2006/REC-xml-20060816/#dt-charref

tr

Dec 3, 2007, 8:39:32 AM

to beautifulsoup

Hi,

On Dec 2, 5:53 pm, Kent Johnson <ken...@tds.net> wrote:
> tr wrote:
> > 3) Entities in attributes aren't converted.
>
> > print BeautifulSoup('Foe!',
> > convertEntities="html")
> > Foe!
> > Should be:
> > Foe!
>
> Your example works for me:
> In [1]: from BeautifulSoup import BeautifulSoup as bs
> In [4]: print bs('Foe!', convertEntities="html")
> Foe!

I have Python 2.4.4 - maybe it's some difference in the SGML parser?

> > 4) Handling of numeric entities differs from browsers:
> > Test it with print(unicode(soup)) - browsers interpret the &#128; as a
> > cp1252 char 128 (EURO), but soup treats it like unicode char 128.
>
> FWIW both HTML 4 and XML specify that numeric entities represent Unicode
> (ISO 10646) code points:
> http://www.w3.org/TR/html4/charset.html#h-5.3.1
> http://www.w3.org/TR/2006/REC-xml-20060816/#dt-charref

Yes, and this is not the only problem this particular site has. Would
it be possible to have numeric entities from the "C1 Controls" range
treated as cp1252 code points for sites like this? A request from the
"Just parse it, Dammit!" department :-).

Regards,
Thomas
