Hi Joe,
I think there is a bug in the Unicode conversion that only occurs when
convertEntities is specified. You can confirm this quickly by removing
that parameter:
>>> soup = BeautifulSoup(page)
>>> len(soup.prettify())
82127
The page in question is not actually UTF-8, despite declaring itself as such:
>>> page.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
7586: unexpected code byte
That byte is a "smart quote" (one of the windows-1252 characters in the
range \x80-\x9f):
>>> page[7580:7600]
' so it\x92s completely '
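As a quick check of the above, here is a minimal sketch in Python 3 syntax (the session above is Python 2) that uses only the 20 bytes shown. It assumes the page is really windows-1252, which is my guess and not something the page declares:

```python
# The 20-byte slice from the page, written as a bytes literal.
raw = b" so it\x92s completely "

# 0x92 is a continuation byte with no lead byte, so strict UTF-8 rejects it.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In windows-1252, 0x92 maps to U+2019 (right single quotation mark).
text = raw.decode("windows-1252")
print(utf8_ok, repr(text))
```

The offending byte sits at index 6 of the slice, which matches position 7586 in the error message.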
BeautifulSoup will try a number of encodings before giving up. In this
case it tries UTF-8, ASCII and windows-1252. Normally, when it tries
windows-1252 or a similar encoding, it converts these smart quotes
into a suitable entity reference (e.g. &#x2019; for \x92) before decoding.
However, whenever convertEntities is specified, smart quote conversion
is never done:
if self.convertEntities:
    # It doesn't make sense to convert encoded characters to
    # entities even while you're converting entities to Unicode.
    # Just convert it all to Unicode.
    self.smartQuotesTo = None
# SNIP
Commenting out line 1063 (# self.smartQuotesTo = None) seems to resolve
the issue for me. However, I don't fully understand the ramifications of
this change.
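To make the behaviour described above more concrete, here is a rough sketch of the idea in Python 3. It is not BeautifulSoup's actual code: decode_with_fallback, SMART_BYTES and the smart_quotes_to_entities flag are names I made up for illustration, and the encoding order simply mirrors the one mentioned above.

```python
import re

# Bytes 0x80-0x9f are where windows-1252 keeps its "smart" punctuation.
SMART_BYTES = re.compile(rb"[\x80-\x9f]")


def _to_entity(match):
    """Replace one windows-1252 byte with a numeric character reference."""
    try:
        char = match.group(0).decode("windows-1252")
    except UnicodeDecodeError:
        # A few bytes in this range are undefined in windows-1252; keep them.
        return match.group(0)
    return b"&#x%X;" % ord(char)


def decode_with_fallback(raw, smart_quotes_to_entities=True):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in ("utf-8", "ascii", "windows-1252"):
        data = raw
        # Hypothetical flag: only rewrite smart quotes for windows-1252.
        if smart_quotes_to_entities and encoding == "windows-1252":
            data = SMART_BYTES.sub(_to_entity, raw)
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")
```

With the flag on, the slice quoted earlier comes back with the quote written as the entity &#x2019;. With it off, the byte is decoded straight to U+2019, which corresponds to the path described above when convertEntities is set.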
Hope this helps.
- Zulq