2) Handling of hex numeric entities
Doesn't handle hex numeric entities of the form &#x12ab; (ካ).
Patch to handle_charref:
    if ref.lower().startswith('x'):
        ref = int(ref[1:], 16)
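For illustration, here is a minimal standalone sketch of the same idea, written for Python 3. The function name parse_charref is hypothetical and is not part of BeautifulSoup; it only shows the decimal/hex branching that the patch adds.

```python
def parse_charref(ref):
    """Convert the body of a numeric character reference to a character.

    `ref` is the text between '&#' and ';', e.g. '101' or 'x12ab'.
    (Hypothetical helper, not BeautifulSoup's handle_charref.)
    """
    if ref.lower().startswith('x'):
        codepoint = int(ref[1:], 16)   # hex form, e.g. &#x12ab;
    else:
        codepoint = int(ref)           # decimal form, e.g. &#101;
    return chr(codepoint)              # would be unichr() on Python 2
```

With this, parse_charref('x12ab') and parse_charref('4779') both give the same character.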
Your example works for me:
In [1]: from BeautifulSoup import BeautifulSoup as bs
In [4]: print bs('<b class="e">Foe!</b>', convertEntities="html")
<b class="e">Foe!</b>
It fails if the entity is not ascii:
In [5]: print bs('<b class="ß">Foß!</b>', convertEntities="html")
------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
0xdf in position 0: ordinal not in range(128)
Here is a fix:
In [9]: class MySoup(bs):
   ...:     def convert_codepoint(self, codepoint):
   ...:         return unichr(codepoint)
   ...:
In [10]: print MySoup('<b class="ß">Foß!</b>', convertEntities="html")
<b class="ß">Foß!</b>
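The error message suggests that the code point ends up as a single byte (as Python 2's chr(223) would produce) and is later mixed with unicode text, which triggers an implicit ASCII decode. That is an assumption about the cause, not something verified against the BeautifulSoup source. A rough Python 3 sketch of the same failure, with the decode made explicit:

```python
raw = bytes([0xDF])          # one byte, like Python 2's chr(223)

message = None
try:
    raw.decode('ascii')      # the conversion Python 2 attempts implicitly
except UnicodeDecodeError as exc:
    message = str(exc)       # same wording as the traceback above

as_text = chr(0xDF)          # Python 3 chr() behaves like Python 2 unichr()
```

Returning a text character directly, as the MySoup override does with unichr, avoids the byte-to-text conversion altogether.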
>
> 4) Handling of numeric entities differs from browsers:
> Test it with print(unicode(soup)) - browsers interpret &#128; as a
> cp1252 char 128 (EURO), but soup treats it like unicode char 128.
FWIW both HTML 4 and XML specify that numeric entities represent Unicode
(ISO 10646) code points:
http://www.w3.org/TR/html4/charset.html#h-5.3.1
http://www.w3.org/TR/2006/REC-xml-20060816/#dt-charref
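The two readings of &#128; can be compared directly. The sketch below uses Python 3's standard library only; html.unescape follows the HTML5 parsing rules, which remap references in the 128-159 range to their cp1252 characters, so it sides with the browser behaviour rather than with the HTML 4 and XML wording.

```python
import html

as_unicode = chr(128)                      # U+0080, a C1 control character
as_cp1252 = bytes([128]).decode('cp1252')  # U+20AC EURO SIGN
as_html5 = html.unescape('&#128;')         # HTML5 rules: remapped to cp1252
```

So a parser that wants to match browsers would need an explicit cp1252 remapping for that range, while a strict HTML 4 or XML reading keeps the code point as-is.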