Hi everyone. I was trying to use UnicodeDammit's detwingle method to fix some of the dreadful encoding mixes in the Spanish pages I'm scraping from the web.
In some cases there are sequences like the following:
text = 'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xfablico' # summarized for brevity
which, as you may have guessed, are a mixture of UTF-8 and Windows-1252 byte sequences.
UnicodeDammit(text).unicode_markup # returns
u'C\u0102\u0104mara, Denominaci\u0102\u0142n, Regi\u0102\u0142n, Pa\u0102\xads, A\u0144os, T\xedtulo, P\xfablico'
whereas
UnicodeDammit.detwingle(text) # returns
'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xf1os, T\xedtulo, P\xc3\xbablico'
neither of which is the correct representation of the string, which should be
'C\xc3\xa1mara, Denominaci\xc3\xb3n, Regi\xc3\xb3n, Pa\xc3\xads, A\xc3\xb1os, T\xc3\xadtulo, P\xc3\xbablico'
The conversion probably fails because the string contains UTF-8 multibyte lead bytes that are really ISO-8859-1/Windows-1252 bytes.
I'm asking here whether I'm missing something in the conversion, and whether anyone has run into this behavior before.
In the meantime, I worked around it by tweaking the method to additionally check that the bytes following each multibyte marker are valid continuation bytes.
I can send a patch if someone is interested.
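For illustration, here is a minimal standalone sketch of the idea (not the actual patched bs4 code): walk the byte string, keep runs that really are valid UTF-8 sequences, and re-encode any leftover byte as Windows-1252. It's a heuristic, of course; a byte run that happens to be valid UTF-8 but was meant as Windows-1252 will be kept as UTF-8.

```python
def fix_mixed_utf8_cp1252(data):
    """Repair a byte string that mixes UTF-8 sequences with stray
    Windows-1252 bytes, returning pure UTF-8 bytes (heuristic)."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        # Expected UTF-8 sequence length, judged from the lead byte.
        if byte < 0x80:
            length = 1
        elif 0xC2 <= byte <= 0xDF:
            length = 2
        elif 0xE0 <= byte <= 0xEF:
            length = 3
        elif 0xF0 <= byte <= 0xF4:
            length = 4
        else:
            length = 0  # not a valid UTF-8 lead byte
        if length:
            chunk = data[i:i + length]
            try:
                chunk.decode('utf-8')
                out += chunk          # genuinely valid UTF-8: keep as-is
                i += length
                continue
            except UnicodeDecodeError:
                pass                  # lead byte without valid continuation
        # Fall back: treat this lone byte as Windows-1252 and re-encode.
        # (Bytes undefined in Windows-1252, e.g. 0x81, would raise here.)
        out += bytes([byte]).decode('windows-1252').encode('utf-8')
        i += 1
    return bytes(out)
```

On the sample above it produces the expected all-UTF-8 result, e.g. A\xf1os becomes A\xc3\xb1os while the already-valid C\xc3\xa1mara is left alone.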
Thanks,
- maurom
PS: Also, in dammit.py:737
0xe1 : b'\xa1', # á
shouldn't it be...
0xe1 : b'\xc3\xa1', # á
?
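A quick sanity check supports that: byte 0xE1 is 'á' in Windows-1252, and the UTF-8 encoding of 'á' is the two-byte sequence C3 A1, so a Windows-1252-to-UTF-8 table entry should map 0xE1 to that pair rather than to the single byte 0xA1:

```python
# 0xE1 decodes to 'á' under Windows-1252...
assert b'\xe1'.decode('windows-1252') == '\u00e1'
# ...and 'á' re-encoded as UTF-8 is the two-byte sequence C3 A1,
# so a Windows-1252-to-UTF-8 mapping should yield b'\xc3\xa1',
# not the bare continuation byte b'\xa1'.
assert '\u00e1'.encode('utf-8') == b'\xc3\xa1'
```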