Unable to fix encoding with UnicodeDammit detwingle

Mauro Meloni

Mar 8, 2016, 8:07:27 PM

to beautifulsoup

Hi everyone. I was trying to use UnicodeDammit's method detwingle to fix some of the dreadful encoding mixes in the Spanish pages I'm scraping from the web.

In some cases there are sequences like the following:

text = 'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico' # summarized for brevity

which, as you may have guessed, are a mixture of UTF-8 and Windows-1252 byte sequences.

UnicodeDammit(text).unicode_markup # returns

u'C\u0102\u0104mara, Denominaci\u0102\u0142n, Regi\u0102\u0142n, Pa\u0102\xads, A\u0144os, T\xedtulo, P\xfablico'

whereas

UnicodeDammit.detwingle(text) # returns

'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xc3\xbablico'

neither of which is the correct representation of the string, which should be

'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xc3\xb1os, T\xc3\xadtulo, P\xc3\xbablico'
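As a side check on the first result: the unicode_markup output above is what you get if the whole byte string is decoded as ISO-8859-2. This only shows that the output is consistent with that codec; I am not claiming to know how UnicodeDammit arrived at its guess. A small Python 3 sketch (note the b'' prefix, since the examples in this post are Python 2 strings):

```python
# Python 3: the mixed UTF-8 / Windows-1252 byte string from the post.
text = (b'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, '
        b'Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico')

# Decoding every byte as ISO-8859-2 reproduces the unicode_markup result.
as_latin2 = text.decode('iso-8859-2')
print(ascii(as_latin2))
```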

Probably the conversion fails because the string contains ISO-8859-1/Windows-1252 bytes (such as \xf1 and \xed) that look like UTF-8 multibyte sequence markers but are really single characters.
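To illustrate that guess, here is a small sketch of the UTF-8 lead-byte ranges. The ranges come from the UTF-8 definition; the helper name utf8_lead_width is made up for this post, and I am assuming detwingle decides what to skip from the lead byte alone.

```python
def utf8_lead_width(byte: int) -> int:
    """Return the sequence length a UTF-8 lead byte announces, or 0 if none."""
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    return 0  # ASCII, continuation byte, or never valid as a lead byte

# The three stray Windows-1252 bytes from the example string.
for stray in (0xF1, 0xED, 0xFA):
    print(hex(stray), utf8_lead_width(stray))
```

Under that assumption, \xf1 (ñ) looks like a 4-byte marker and \xed (í) like a 3-byte marker, so both are skipped, while \xfa (ú) is not a marker and gets converted. That matches the detwingle output above.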

I'm asking here if I'm missing something when doing the conversion, and if anyone has faced this behavior in the past.

In the meantime, I solved this by tweaking the method to also check that the bytes following a multibyte marker are valid sequence bytes before treating them as UTF-8.
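For reference, a rough Python 3 sketch of that idea. This is not the actual patch and not bs4 code; the function names (fix_mixed_encoding, _is_utf8, _cp1252_to_utf8) are invented for illustration, and it leans on Python's strict UTF-8 decoder to validate each candidate sequence.

```python
def _is_utf8(chunk: bytes) -> bool:
    """True if chunk is one well-formed UTF-8 sequence (strict decode)."""
    try:
        chunk.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def _cp1252_to_utf8(byte: int) -> bytes:
    """Re-encode a single Windows-1252 byte as UTF-8."""
    raw = bytes([byte])
    try:
        return raw.decode('windows-1252').encode('utf-8')
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are undefined in Windows-1252;
        # falling back to ISO-8859-1 is an assumption of this sketch.
        return raw.decode('iso-8859-1').encode('utf-8')


def fix_mixed_encoding(data: bytes) -> bytes:
    """Keep valid UTF-8 sequences; convert stray Windows-1252 bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:              # plain ASCII, copy through
            out.append(byte)
            i += 1
            continue
        # Length announced by the byte if it really is a UTF-8 lead byte.
        if 0xC2 <= byte <= 0xDF:
            width = 2
        elif 0xE0 <= byte <= 0xEF:
            width = 3
        elif 0xF0 <= byte <= 0xF4:
            width = 4
        else:
            width = 0
        chunk = data[i:i + width]
        # The extra test: the following bytes must complete the sequence.
        if width and len(chunk) == width and _is_utf8(chunk):
            out += chunk
            i += width
        else:
            out += _cp1252_to_utf8(byte)
            i += 1
    return bytes(out)


text = (b'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, '
        b'Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico')
fixed = fix_mixed_encoding(text)
print(fixed.decode('utf-8'))
```

The trade-off is that a genuine Windows-1252 byte pair that happens to form valid UTF-8 would still be kept as UTF-8, so this is a heuristic like detwingle itself.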

I can send a patch if someone is interested.

Thanks,

- maurom

PS: Also, in dammit.py:737

0xe1 : b'\xa1', # á

shouldn't that be

0xe1 : b'\xc3\xa1', # á
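A quick way to check the expected value, using only the standard codecs in Python 3:

```python
# Byte 0xE1 in Windows-1252 is "á"; re-encode it as UTF-8.
a_acute = bytes([0xE1]).decode('windows-1252')
utf8 = a_acute.encode('utf-8')
print(a_acute, utf8)
```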
