Unable to fix encoding with UnicodeDammit detwingle

Mauro Meloni

Mar 8, 2016, 8:07:27 PM
to beautifulsoup
Hi everyone. I was trying to use UnicodeDammit's detwingle method to fix some of the dreadful encoding mixes in the Spanish pages I'm scraping from the web.

In some cases there are sequences like the following:

from bs4 import UnicodeDammit

text = 'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico' # summarized for brevity

which, as you may have guessed, are a mixture of UTF-8 and Windows-1252 byte sequences.

UnicodeDammit(text).unicode_markup   # returns
u'C\u0102\u0104mara, Denominaci\u0102\u0142n, Regi\u0102\u0142n, Pa\u0102\xads, A\u0144os, T\xedtulo, P\xfablico'

whereas

UnicodeDammit.detwingle(text)   # returns
'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xc3\xbablico'

neither of which is the correct representation of the string, which should be

'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xc3\xb1os, T\xc3\xadtulo, P\xc3\xbablico'

The conversion probably fails because the string contains bytes that look like UTF-8 multibyte sequence markers but are really ISO-8859-1/Windows-1252 characters.
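
To illustrate what I mean, here is a small sketch (Python 3, not taken from dammit.py) that classifies the three stray bytes using the standard UTF-8 lead-byte ranges. I'm assuming detwingle uses equivalent ranges; the names LEAD_BYTE_RANGES and utf8_sequence_length are made up for this example.

```python
# Standard UTF-8 lead-byte ranges: (first, last, total sequence length in bytes).
LEAD_BYTE_RANGES = [(0xC2, 0xDF, 2), (0xE0, 0xEF, 3), (0xF0, 0xF4, 4)]

def utf8_sequence_length(byte: int) -> int:
    """Return the length announced by a UTF-8 lead byte, or 0 if it is not one."""
    for first, last, size in LEAD_BYTE_RANGES:
        if first <= byte <= last:
            return size
    return 0

# The three Windows-1252 bytes from the sample: ñ (0xF1), í (0xED), ú (0xFA).
for byte in (0xF1, 0xED, 0xFA):
    print(hex(byte), utf8_sequence_length(byte))
# 0xf1 4
# 0xed 3
# 0xfa 0
```

If that assumption holds, 0xF1 and 0xED would be taken for the start of 4-byte and 3-byte sequences and skipped together with the ASCII letters after them, while 0xFA is outside every range and gets converted. That would match the output above, where only "P\xfablico" was fixed.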

I'm asking here if I'm missing something when doing the conversion, and if anyone has faced this behavior in the past.

In the meantime, I worked around this by tweaking the method to double-check that a byte in the multibyte marker range really starts a valid sequence.
I can send a patch if someone is interested.
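
For reference, a minimal standalone sketch of the idea (Python 3, not the actual patch and not bs4 code): a lead byte is only trusted when the bytes after it are valid UTF-8 continuation bytes (0x80-0xBF); otherwise it is re-encoded from Windows-1252. The function name fix_mixed_encoding is hypothetical.

```python
def fix_mixed_encoding(data: bytes) -> bytes:
    """Re-encode stray Windows-1252 bytes inside mostly-UTF-8 data as UTF-8."""
    # (first lead byte, last lead byte, total sequence length)
    markers = [(0xC2, 0xDF, 2), (0xE0, 0xEF, 3), (0xF0, 0xF4, 4)]
    out = bytearray()
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte < 0x80:
            # Plain ASCII: copy through unchanged.
            out.append(byte)
            pos += 1
            continue
        # Sequence length announced by this byte, or 0 if it is not a lead byte.
        size = next((n for lo, hi, n in markers if lo <= byte <= hi), 0)
        tail = data[pos + 1:pos + size]
        if size and len(tail) == size - 1 and all(0x80 <= b <= 0xBF for b in tail):
            # Lead byte followed by the right number of continuation bytes:
            # treat the whole sequence as UTF-8 and keep it.
            out += data[pos:pos + size]
            pos += size
        else:
            # Otherwise assume a single Windows-1252 character.
            try:
                char = bytes([byte]).decode("windows-1252")
            except UnicodeDecodeError:
                # A few bytes are undefined in Windows-1252; fall back to Latin-1.
                char = bytes([byte]).decode("latin-1")
            out += char.encode("utf-8")
            pos += 1
    return bytes(out)

text = b'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico'
fixed = fix_mixed_encoding(text)
print(fixed.decode("utf-8"))
```

This is still a heuristic: Windows-1252 text that genuinely contains something like "Ã¡" would be read as UTF-8, and the check does not reject overlong or otherwise invalid sequences.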

Thanks,

- maurom


PS: Also, in dammit.py:737

        0xe1 : b'\xa1',     # á

shouldn't that be...

        0xe1 : b'\xc3\xa1',     # á

?
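
A quick way to check the expected value (Python 3, independent of bs4):

```python
# Byte 0xE1 is 'á' in Windows-1252; encode that character as UTF-8.
char = bytes([0xE1]).decode("windows-1252")
utf8 = char.encode("utf-8")
print(ascii(char), utf8)

# The UTF-8 form is two bytes long, not the single byte b'\xa1'.
assert utf8 == b'\xc3\xa1'
```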
