bugs

Kovid

Oct 9, 2007, 2:30:08 PM
to beautifulsoup
1) Handling of &nbsp;
&nbsp; by itself is converted to a regular space:
unicode(BeautifulSoup('&nbsp;', convertEntities='html'))
u' '
unicode(BeautifulSoup('<p>&nbsp;</p>', convertEntities='html'))
u'<p> </p>'
However, when there is other text in the tag, it works correctly:
unicode(BeautifulSoup('<p>a&nbsp;</p>', convertEntities='html'))
u'<p>a\xa0</p>'
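
For comparison, here is a minimal Python 3 sketch of the expected result. It uses the standard library's html.unescape rather than BeautifulSoup (an assumption made purely for illustration; it is not part of the BeautifulSoup 3 API discussed here). The point is that &nbsp; should decode to U+00A0 whether or not other text surrounds it.

```python
import html

# html.unescape is the Python 3 standard-library entity decoder,
# used here only as a reference for the expected output.
alone = html.unescape('<p>&nbsp;</p>')
with_text = html.unescape('<p>a&nbsp;</p>')

# Both results should contain a no-break space (U+00A0), not a plain space.
print(repr(alone))
print(repr(with_text))
```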

2) Handling of hex numeric entities
Hex numeric entities of the form &#x12ab; are not handled.
Patch to handle_charref:
if ref.lower().startswith('x'):
    ref = int(ref[1:], 16)
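
The idea behind the patch can be sketched as a standalone function. This is Python 3, and the name decode_charref is hypothetical: it is not BeautifulSoup's handle_charref, only an illustration of parsing the text between &# and ; in either decimal or hex form.

```python
def decode_charref(ref):
    """Decode the body of a numeric character reference.

    `ref` is the text between '&#' and ';', for example '101' or 'x12ab'.
    (Hypothetical helper, not part of BeautifulSoup.)
    """
    if ref.lower().startswith('x'):
        # Hex form: drop the leading 'x' or 'X' and parse base 16.
        codepoint = int(ref[1:], 16)
    else:
        # Decimal form.
        codepoint = int(ref, 10)
    return chr(codepoint)
```

With this, decode_charref('101') and decode_charref('x65') both give 'e', and decode_charref('x12ab') gives the single character U+12AB.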

tr

Dec 2, 2007, 11:23:16 AM
to beautifulsoup
3) Entities in attributes aren't converted.

print BeautifulSoup('<b class="&#101;">Fo&#101;!</b>',
convertEntities="html")
<b class="&#101;">Foe!</b>
Should be:
<b class="e">Foe!</b>

4) Handling of numeric entities differs from browsers:

Seen while parsing <http://www.alexanderblum.de/cgi-bin/weblog.php.cgi?weblog=1> with
soup = BeautifulSoup(
    urlopen(URL),
    parseOnlyThese=SoupStrainer("div", align="center"),
    convertEntities="html",  # I tried xml, too
    fromEncoding="windows-1252")

Test it with print(unicode(soup)) - browsers interpret the &#128; as a
cp1252 char 128 (EURO), but soup treats it like unicode char 128. From
what I read such use of entities is incorrect, but "I don't care if
it's valid html", I just want it to work :-).

Kent Johnson

Dec 2, 2007, 11:53:38 AM
to beauti...@googlegroups.com
tr wrote:
> 3) Entities in attributes aren't converted.
>
> print BeautifulSoup('<b class="&#101;">Fo&#101;!</b>',
> convertEntities="html")
> <b class="&#101;">Foe!</b>
> Should be:
> <b class="e">Foe!</b>

Your example works for me:
In [1]: from BeautifulSoup import BeautifulSoup as bs
In [4]: print bs('<b class="&#101;">Fo&#101;!</b>', convertEntities="html")
<b class="e">Foe!</b>

It fails if the entity is not ascii:

In [5]: print bs('<b class="&#223;">Fo&#223;!</b>', convertEntities="html")
------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
0xdf in position 0: ordinal not in range(128)

Here is a fix:

In [9]: class MySoup(bs):
   ...:     def convert_codepoint(self, codepoint):
   ...:         return unichr(codepoint)
   ...:
In [10]: print MySoup('<b class="&#223;">Fo&#223;!</b>', convertEntities="html")
<b class="ß">Foß!</b>

> 4) Handling of numeric entities differs from browsers:
>
> Test it with print(unicode(soup)) - browsers interpret the &#128; as a
> cp1252 char 128 (EURO), but soup treats it like unicode char 128.

FWIW both HTML 4 and XML specify that numeric entities represent Unicode
(ISO 10646) code points:
http://www.w3.org/TR/html4/charset.html#h-5.3.1
http://www.w3.org/TR/2006/REC-xml-20060816/#dt-charref

tr

Dec 3, 2007, 8:39:32 AM
to beautifulsoup
Hi,

On Dec 2, 5:53 pm, Kent Johnson <ken...@tds.net> wrote:
> tr wrote:
> > 3) Entities in attributes aren't converted.
>
> > print BeautifulSoup('<b class="&#101;">Fo&#101;!</b>',
> > convertEntities="html")
> > <b class="&#101;">Foe!</b>
> > Should be:
> > <b class="e">Foe!</b>
>
> Your example works for me:
> In [1]: from BeautifulSoup import BeautifulSoup as bs
> In [4]: print bs('<b class="&#101;">Fo&#101;!</b>', convertEntities="html")
> <b class="e">Foe!</b>

I have Python 2.4.4, so maybe it's some difference in the SGML parser?

> > 4) Handling of numeric entities differs from browsers:
> > Test it with print(unicode(soup)) - browsers interpret the &#128; as a
> > cp1252 char 128 (EURO), but soup treats it like unicode char 128.
>
> FWIW both HTML 4 and XML specify that numeric entities represent Unicode
> (ISO 10646) code points:
> http://www.w3.org/TR/html4/charset.html#h-5.3.1
> http://www.w3.org/TR/2006/REC-xml-20060816/#dt-charref

Yes, and this is not the only problem this particular site has. Would
it be possible to map numeric entities in the "C1 Controls" range
(128-159) to the corresponding cp1252 characters? A request from the
"Just parse it, Dammit!" department :-).

Regards,
Thomas
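
What is being asked for could look roughly like the following Python 3 sketch. The function name charref_to_char is hypothetical and this is not existing BeautifulSoup behavior. It assumes that numeric references in the C1 range (128-159) were really meant as windows-1252 bytes, which matches what browsers display for &#128;.

```python
def charref_to_char(codepoint):
    """Map a numeric character reference to a character, browser-style.

    Hypothetical helper: code points in the C1 control range are
    reinterpreted as cp1252 bytes; everything else is plain Unicode.
    """
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode('cp1252')
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes (such as 0x81)
            # undefined; fall through and keep the original code point.
            pass
    return chr(codepoint)
```

Under that assumption, charref_to_char(128) gives the euro sign (U+20AC), while references outside the C1 range, such as 223, are left as ordinary Unicode code points.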