This is a bug in UnicodeDammit. The _detectEncoding method converts the
UTF-16LE markup to UTF-8, and then _convert_from tries to decode that
UTF-8 into Unicode as though it were still UTF-16LE.
https://bugs.launchpad.net/beautifulsoup/+bug/988980
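To make the failure concrete, here is a minimal sketch in plain Python 3 that mimics the two steps without using Beautiful Soup at all. The sample string and variable names are made up for illustration; only the sequence of encode/decode calls mirrors what is described above.

```python
# Illustration only: plain Python 3, no Beautiful Soup involved.
text = "<html>caf\u00e9</html>"
raw = b"\xff\xfe" + text.encode("utf-16-le")  # UTF-16LE document with a BOM

# Step 1 (what the sniffing step does): strip the BOM, decode the
# UTF-16LE bytes, and re-encode the result as UTF-8.
transcoded = raw[2:].decode("utf-16-le").encode("utf-8")

# Step 2 (the bug): decode the UTF-8 bytes with the *sniffed* encoding.
# Depending on the input this yields garbage or raises an error.
try:
    wrong = transcoded.decode("utf-16-le")
except UnicodeDecodeError:
    wrong = None

# What should happen instead: the bytes are UTF-8 now, so decode them as UTF-8.
right = transcoded.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Either way, the second step never recovers the original text, because the bytes it is given are no longer in the encoding it assumes.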
I don't plan to fix the bug in the 3.2 series, but the patch below
would make a good starting point for fixing it there; I think the
code is mostly the same. I've committed a fix, which will go out
in the next 4.0 release.
Leonard
=== modified file 'bs4/dammit.py'
--- bs4/dammit.py 2012-04-16 14:35:13 +0000
+++ bs4/dammit.py 2012-04-26 16:03:59 +0000
@@ -187,16 +187,24 @@
             self.original_encoding = None
             return
-        self.markup, document_encoding, sniffed_encoding = \
+        new_markup, document_encoding, sniffed_encoding = \
             self._detectEncoding(markup, is_html)
+        self.markup = new_markup
         u = None
-        for proposed_encoding in (
-            override_encodings + [document_encoding, sniffed_encoding]):
-            if proposed_encoding is not None:
-                u = self._convert_from(proposed_encoding)
-                if u:
-                    break
+        if new_markup != markup:
+            # _detectEncoding modified the markup, then converted it to
+            # Unicode and then to UTF-8. So convert it from UTF-8.
+            u = self._convert_from("utf8")
+            self.original_encoding = sniffed_encoding
+
+        if not u:
+            for proposed_encoding in (
+                override_encodings + [document_encoding, sniffed_encoding]):
+                if proposed_encoding is not None:
+                    u = self._convert_from(proposed_encoding)
+                    if u:
+                        break
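
For anyone porting this to 3.2, the control flow of the patch can be sketched as a standalone Python 3 function. This is not the bs4 code: `sniff` and `to_unicode` are hypothetical stand-ins for _detectEncoding and the constructor logic, and `sniff` only handles a UTF-16LE BOM (it ignores, for example, the UTF-32LE BOM that starts with the same two bytes).

```python
import codecs


def sniff(markup):
    """Hypothetical stand-in for _detectEncoding.

    Returns (possibly transcoded markup, sniffed encoding or None).
    Only a UTF-16LE BOM is recognized in this sketch.
    """
    if markup.startswith(codecs.BOM_UTF16_LE):
        transcoded = markup[2:].decode("utf-16-le").encode("utf-8")
        return transcoded, "utf-16le"
    return markup, None


def to_unicode(markup, override_encodings=()):
    """Mirror the patched flow: returns (text or None, original encoding)."""
    new_markup, sniffed = sniff(markup)
    u = None
    original_encoding = None

    if new_markup != markup:
        # The sniffing step already transcoded the markup to UTF-8, so
        # decode it as UTF-8 and report the sniffed encoding as the original.
        u = new_markup.decode("utf-8")
        original_encoding = sniffed

    if u is None:
        # Otherwise fall back to trying each proposed encoding in order.
        for enc in list(override_encodings) + [sniffed]:
            if enc is None:
                continue
            try:
                u = new_markup.decode(enc)
                original_encoding = enc
                break
            except (UnicodeDecodeError, LookupError):
                continue

    return u, original_encoding


bom_doc = codecs.BOM_UTF16_LE + "h\u00e9llo".encode("utf-16-le")
print(to_unicode(bom_doc))
print(to_unicode("h\u00e9llo".encode("utf-8"), ["utf-8"]))
```

The point of the first branch is that the loop over proposed encodings is skipped entirely once the markup is known to have been rewritten as UTF-8.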