BS3 can't decode utf-16le strings

csantos

Apr 22, 2012, 7:24:37 PM
to beautifulsoup
Hello there,
it seems I have a serious problem when dealing with UTF-16LE strings.
To reproduce: create a UTF-8 file with the content "<a>áé</a>" (without
quotes). Convert it to UTF-16LE (on Linux: iconv input.txt -t unicode
> output.txt). Open output.txt with a Python script and try to parse
it with BeautifulSoup:
print BeautifulSoup(unparsed, fromEncoding='utf-16le')
When I execute this command, Python prints out the following string:
愼쌾쎡㲩愯ਾ
However, if I change the fromEncoding parameter to 'utf8', the
string is printed as expected. To make sure the problem was in
BeautifulSoup, I converted the string from UTF-16LE to UTF-8 myself
with:
print unparsed.decode('utf-16le').encode('utf8')
which produced the expected output (áé).
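
For anyone who wants to try the manual decode without iconv, here is a minimal sketch. It is written in Python 3 syntax (the commands above are Python 2), the variable names are only illustrative, and it assumes the input has no byte order mark and ends with a newline:

```python
# Python 3 sketch; names are illustrative, not taken from the thread.
text = "<a>áé</a>\n"

# Encode to UTF-16LE without a byte order mark (an assumption about the input).
unparsed = text.encode("utf-16-le")

# Every character in this sample occupies two bytes in UTF-16LE.
print(len(text), len(unparsed))

# Decoding by hand with the right codec recovers the original text.
decoded = unparsed.decode("utf-16-le")
print(decoded == text)
```

Handing the already-decoded string to the parser instead of the raw bytes may be a usable workaround, but that part is not exercised here.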

Additional info: I have the chardet package installed, as suggested by
the BS3 documentation. Also, according to `locate BeautifulSoup`, I
have only the BS 3.2.0 version installed on my machine.

Is that a bug? Or am I missing something?
Thanks,
.csantos

Leonard Richardson

Apr 26, 2012, 12:40:11 PM
to beauti...@googlegroups.com
This is a bug in UnicodeDammit. The _detectEncoding method transforms
the UTF-16LE into UTF-8, and then _convert_from tries to convert the
UTF-8 into Unicode as though it were UTF-16LE.
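
The effect of that double conversion can be reproduced outside the library. This is only a sketch in Python 3 syntax with made-up variable names, and it assumes the sample file ended with a newline: it takes the UTF-8 form of the markup and decodes it as UTF-16LE, which yields the same six code points as the garbled string in the report.

```python
# Python 3 sketch of the mismatch, independent of BeautifulSoup.
text = "<a>áé</a>\n"

# Step 1: the markup ends up as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: those UTF-8 bytes are then read as if they were UTF-16LE,
# so each pair of bytes is fused into one unrelated character.
garbled = utf8_bytes.decode("utf-16-le")

# Print code points rather than the characters to stay terminal-safe.
print([hex(ord(ch)) for ch in garbled])
```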

https://bugs.launchpad.net/beautifulsoup/+bug/988980

I don't plan to fix the bug in the 3.2 series, but the code below
would make a good starting point for fixing it in 3.2--I think the
code is mostly the same. I've committed a fix which will be released
in the next 4.0 release.

Leonard

=== modified file 'bs4/dammit.py'
--- bs4/dammit.py 2012-04-16 14:35:13 +0000
+++ bs4/dammit.py 2012-04-26 16:03:59 +0000
@@ -187,16 +187,24 @@
             self.original_encoding = None
             return

-        self.markup, document_encoding, sniffed_encoding = \
+        new_markup, document_encoding, sniffed_encoding = \
             self._detectEncoding(markup, is_html)
+        self.markup = new_markup

         u = None
-        for proposed_encoding in (
-            override_encodings + [document_encoding, sniffed_encoding]):
-            if proposed_encoding is not None:
-                u = self._convert_from(proposed_encoding)
-                if u:
-                    break
+        if new_markup != markup:
+            # _detectEncoding modified the markup, then converted it to
+            # Unicode and then to UTF-8. So convert it from UTF-8.
+            u = self._convert_from("utf8")
+            self.original_encoding = sniffed_encoding
+
+        if not u:
+            for proposed_encoding in (
+                override_encodings + [document_encoding, sniffed_encoding]):
+                if proposed_encoding is not None:
+                    u = self._convert_from(proposed_encoding)
+                    if u:
+                        break
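
The idea behind the patch can be tried in isolation. What follows is a simplified sketch in Python 3 syntax, not the library's code: `detect_encoding` and `to_unicode` are hypothetical stand-ins for `_detectEncoding` and the surrounding logic, and the stand-in only handles input that starts with a UTF-16LE byte order mark, which is an assumption about the file.

```python
import codecs


def detect_encoding(markup):
    """Hypothetical stand-in for _detectEncoding (UTF-16LE BOM case only)."""
    # The second check avoids mistaking a UTF-32LE BOM (FF FE 00 00) for UTF-16LE.
    if markup[:2] == codecs.BOM_UTF16_LE and markup[2:4] != b"\x00\x00":
        # Strip the BOM and re-encode as UTF-8, as the patch comment describes.
        return markup[2:].decode("utf-16-le").encode("utf-8"), "utf-16le"
    return markup, None


def to_unicode(markup, override_encodings=()):
    """Return (text, encoding_used), or (None, None) if nothing decodes."""
    new_markup, sniffed = detect_encoding(markup)
    if new_markup != markup:
        # The sniffing step already produced UTF-8, so decode it as UTF-8
        # and report the sniffed encoding as the original one.
        return new_markup.decode("utf-8"), sniffed
    # Otherwise fall back to trying each proposed encoding in order.
    for encoding in list(override_encodings) + [sniffed]:
        if encoding is None:
            continue
        try:
            return new_markup.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    return None, None


text = "<a>áé</a>\n"
with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
result, used = to_unicode(with_bom, override_encodings=["utf-16le"])
print(result == text, used)
```

Unlike the patch, this sketch returns as soon as the UTF-8 decode succeeds instead of falling through to the loop when it fails; that shortcut is a simplification for readability.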