Unicode conversion

Edward K. Ream

Oct 3, 2002, 9:10:44 AM

My app presently will write Unicode in any format the user desires as long
as it is UTF-8 ;-)

Here is the code that I use to translate from the UTF-8 delivered by the Tk
Text widget to the desired encoding:

print `xml_encoding`
# Tk always uses utf-8 encoding.
print `s`,"tk"
s = s.encode("utf-8") # result is a string.
print `s`,"utf-8"
s = s.decode(xml_encoding) # result is unicode.
s = s.encode(xml_encoding) # result is a string.
print `s`,`xml_encoding`

If I start with:

aĂßÉd

a
U+0102(Latin Capital Letter A with Breve)
U+00df(Latin Small Letter Sharp S)
U+00c9(Latin Capital Letter E with Acute)
d

and delete the trailing d, the output is:

u'a\u0102\xdf\xc9\n' tk
'a\xc4\x82\xc3\x9f\xc3\x89\n' utf-8
'a\xc4\x82\xc3\x9f\xc3\x89\n' 'ISO-8859-1'

As you can see, the results of the two encodes are identical. My app writes
the result of the second encode to the file. Viewing a file (say with MS
Word) with these characters works properly only if UTF-8 is used. Weird
characters appear when the desired ISO-8859-1 encoding is used.

BTW, without the first encode/decode pair I can get exceptions in the last
encode.

Can anyone explain what is happening and what I should be doing? I'm totally
confused. Thanks.

Edward
--------------------------------------------------------------------
Edward K. Ream email: edr...@tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------

Martin v. Löwis

Oct 3, 2002, 1:17:08 PM

"Edward K. Ream" <edr...@tds.net> writes:

> # Tk always uses utf-8 encoding.

You may get that impression, but it is slightly wrong. It is more
reliable if you pass Unicode strings to Tk, instead of UTF-8 encoded
byte strings.

For a byte string, Tk will guess the encoding. If it looks like UTF-8,
it is treated as UTF-8. Otherwise, it is treated as the locale's
encoding. Unfortunately, if you ever manage to mix the two, you get
byte salad that you can't ever chew. By using Unicode strings to
interface with Tk only, you can avoid those problems.

> print `s`,"tk"
> s = s.encode("utf-8") # result is a string.
> print `s`,"utf-8"
> s = s.decode(xml_encoding) # result is unicode.
> s = s.encode(xml_encoding) # result is a string.

Since xml_encoding is iso-8859-1, you are making a mistake here. You
have UTF-8 data, but you are decoding it as Latin-1. This will
succeed, but it will give an incorrect result. It succeeds because
iso-8859-1 is a single-byte encoding in which every byte value is valid.
That means an arbitrary byte sequence can be interpreted as Latin-1,
but for many byte sequences, the resulting string is nonsense
(mojibake, as the Japanese say).
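
A small sketch of this effect, using the string from the original post. It is written for Python 3, which is an assumption since the thread itself uses Python 2; there, `str` is Unicode text and `bytes` is a byte string. The variable names are made up for illustration.

```python
# Python 3 sketch: str is Unicode text, bytes is a byte string.
text = "a\u0102\xdf\xc9\n"   # the Unicode string obtained from Tk

# Step 1: encode to UTF-8. This part is fine.
utf8_bytes = text.encode("utf-8")

# Step 2: decode the UTF-8 bytes as Latin-1. This never fails, because
# every byte value is a valid Latin-1 character, but the text is wrong.
wrong_text = utf8_bytes.decode("iso-8859-1")

# Step 3: encode the wrong text as Latin-1. This just undoes step 2,
# so the "Latin-1" output is byte-for-byte the UTF-8 data.
round_trip = wrong_text.encode("iso-8859-1")

print(repr(utf8_bytes))
print(repr(wrong_text))
print(round_trip == utf8_bytes)
```

The last line prints True, which matches the identical second and third lines of output in the original post.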

> BTW, without the first encode/decode pair I can get exceptions in
> the last encode.

Nevertheless, a single encode is the correct processing. If you have a
Unicode object, as originally obtained from Tk, you should encode it as
Latin-1 using

s = s.encode("iso-8859-1")

Now, for this specific string (u'a\u0102\xdf\xc9\n'), you will get a
Unicode error. The reason is that one character (\u0102) is not
supported in Latin-1 - this encoding supports only the first 256
Unicode characters.
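
To see which character is rejected, one can inspect the exception. This is a Python 3 sketch (an assumption; the thread uses Python 2), and the names `bad` and `failed_at` are only illustrative.

```python
# Python 3 sketch: find the character that Latin-1 cannot represent.
text = "a\u0102\xdf\xc9\n"

bad = None
failed_at = None
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the unencodable slice of the input.
    failed_at = exc.start
    bad = text[exc.start:exc.end]

print(repr(bad), failed_at)

# The other characters are all below U+0100, so they encode fine.
print("a\xdf\xc9\n".encode("iso-8859-1"))
```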

So, when saving this as XML, the proper representation would be

'a&#258;\xdf\xc9\n'

i.e. you'll have to use a character reference. Python 2.2 does not
support generating such text very well - you'll have to catch the
Unicode error yourself, find the offending character, encode it as a
character reference, and encode all other characters as requested.
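
A rough sketch of that manual approach follows. It is written for Python 3 (an assumption; the thread uses Python 2), and `encode_with_charrefs` is a hypothetical helper, not anything from the standard library. It encodes one character at a time for clarity, so it only suits simple ASCII-compatible encodings such as Latin-1, not encodings that emit a byte order mark or keep state.

```python
def encode_with_charrefs(text, encoding):
    """Encode text, writing unencodable characters as &#N; references.

    Hypothetical helper for illustration. Assumes an ASCII-compatible,
    stateless target encoding such as iso-8859-1.
    """
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(encoding))
        except UnicodeEncodeError:
            # Fall back to a decimal XML character reference.
            pieces.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(pieces)


print(encode_with_charrefs("a\u0102\xdf\xc9\n", "iso-8859-1"))
```

Note that character references are only meaningful in XML character data and attribute values, so this should be applied to content, not to markup such as element names.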

Alternatively, you can refuse to encode a document in a certain
encoding (such as Latin-1), and fall back to UTF-8.
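
The fallback could look roughly like this in Python 3 (an assumption; the thread uses Python 2). `encode_or_fall_back` is a hypothetical name. It returns the encoding actually used so that the caller can write a matching XML declaration.

```python
def encode_or_fall_back(text, preferred):
    """Return (data, encoding_used); hypothetical helper.

    Tries the preferred encoding first and falls back to UTF-8 when
    the text contains characters the preferred encoding lacks.
    """
    try:
        return text.encode(preferred), preferred
    except UnicodeEncodeError:
        return text.encode("utf-8"), "utf-8"


data, used = encode_or_fall_back("a\u0102\xdf\xc9\n", "iso-8859-1")
print(used, repr(data))
```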

PEP 293 (http://www.python.org/peps/pep-0293.html) will provide a
mechanism to generate character references more conveniently - in
Python 2.3, you can specify

s.encode('iso-8859-1',errors='xmlcharrefreplace')
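
For reference, the same error handler as it behaves in Python 3 (an assumption; the thread predates it), applied to the string from the original post:

```python
# Python 3 sketch: let the codec emit character references itself.
text = "a\u0102\xdf\xc9\n"
data = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(data)   # b'a&#258;\xdf\xc9\n'
```

U+0102 is 258 in decimal, which is why the reference reads `&#258;`.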

HTH,
Martin

Edward K. Ream

Oct 3, 2002, 3:24:41 PM

Many, many thanks, Martin, for all this good information. If ever you come
to Madison, Wisconsin, you may redeem this coupon for a dinner at L'etoile
restaurant, one of the best in the United States :-)

Edward