Passing Unicode strings to Tk widgets on Windows XP

andrey.nakin

Nov 8, 2011, 5:02:59 AM
Hi!

I'm developing a cross-platform (Linux & Windows) Tcl/Tk application.
All sources are in utf-8 encoding. Here is a short example of the code:

package require Tk
encoding system utf-8
label .l -text "some text with international characters "
pack .l -expand no

Label .l contains some non-ASCII text which is typed directly in my
text editor and stored as utf-8 in the source file.

When I run this example on Debian Linux 6, the label displays the
correct text.

When I run this example on Windows XP with default fonts, the label
text is completely wrong. It seems that Windows treats the utf-8 text
as some 8-bit character set, e.g. cp1252.

Is it possible to pass "native" utf-8 text to Tk widgets?

Thanks

JarekL

Nov 8, 2011, 5:36:05 AM
On 8 Nov, 11:02, "andrey.nakin" <andrey.na...@gmail.com> wrote:
> Hi!
>
> I'm developing cross-platform (Linux & Windows) Tcl/Tk application.
> All sources are in utf-8 encoding. Here is a short example of code:
>
> package require Tk
> encoding system utf-8
> label .l -text "some text with international characters "

I have a UTF encoder:

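#!/bin/sh
# the next line restarts using tclsh \
exec tclsh "$0" "$@"

# Join the arguments so that several words are handled as one string.
set s [join $argv " "]
set res ""
foreach i [split $s ""] {
    scan $i %c c
    if {$c < 128} {
        # Plain ASCII passes through unchanged.
        append res $i
    } else {
        # Everything else becomes a \uXXXX escape.
        append res \\u[format %04X $c]
    }
}
puts -nonewline $res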

This script encodes e.g. "łódź" as "\u0142\u00F3d\u017A". The result
is cross-platform.
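
The escaping approach above can be checked with a round trip. This is only a sketch: the proc name escapeNonAscii is made up for illustration, and it assumes all characters fit in four hex digits (the \u form used in this thread).

```tcl
# Hypothetical helper: replace every non-ASCII character with a \uXXXX escape.
# Assumes characters in the Basic Multilingual Plane (four hex digits).
proc escapeNonAscii {s} {
    set res ""
    foreach ch [split $s ""] {
        scan $ch %c code
        if {$code < 128} {
            append res $ch
        } else {
            append res [format {\u%04X} $code]
        }
    }
    return $res
}

# The sample word from the post, written with escapes so this file is ASCII.
set original "\u0142\u00F3d\u017A"
set escaped [escapeNonAscii $original]
puts $escaped

# subst turns the escapes back into characters, which is what Tcl does
# when it parses a script containing them.
set restored [subst -nocommands -novariables $escaped]
puts [string equal $original $restored]
```

The escaped form is pure ASCII, so it reads the same no matter which encoding Tcl assumes for the source file.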

<jl>

Mark Janssen

Nov 8, 2011, 6:13:23 AM
Do *not* use [encoding system]; it does not work (as you noticed). Tk
and Tcl handle Unicode very capably on all platforms with fairly
little effort. To make Unicode source files work you have several
options:

1) Start wish with the -encoding command-line option
2) Use the -encoding utf-8 option with source
3) Escape the Unicode as explained elsewhere in the thread and save
the file as plain ASCII

In all cases the [encoding system utf-8] call should be removed. And
promise never to use encoding system again; it's never the right
answer.

If you are wondering why it did work on Linux, my guess is that the
system encoding on Linux is already utf-8, so the encoding system call
doesn't change it.
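
A minimal sketch of option 2, assuming Tcl 8.5 or later. The file name utf8_demo.tcl and the variable greeting are made up for illustration; the script writes a small UTF-8 file itself so that it can be run as-is.

```tcl
# Create a throwaway script file containing raw UTF-8 bytes (hypothetical name).
set path [file join [pwd] utf8_demo.tcl]
set fd [open $path w]
fconfigure $fd -encoding utf-8
puts $fd "set greeting \"\u0142\u00F3d\u017A\""
close $fd

# Tell source how the file is encoded instead of changing the system encoding.
source -encoding utf-8 $path
puts [string length $greeting]

file delete $path
```

Because the encoding is named at the point where the file is read, the result should not depend on the platform's system encoding.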

Mark

andrey.nakin

Nov 8, 2011, 6:40:51 AM
Mark,

thanks for the helpful note!

I tried "wish -encoding utf-8 myscript.tcl" on Windows XP and it
works! My only question is: why do I need step 3 of your instructions?
I did not convert my sources, because that is not convenient for me.
I'd like to edit sources natively in utf-8 and then run them
immediately without conversion.

BTW, I then modified the first line of the script to the following:

#!/usr/bin/wish -encoding utf-8

and now my script does not work on Linux :) I have to manually type
"wish -encoding utf-8 myscript.tcl" rather than simply
"./myscript.tcl". What is the proper header for Tk files in utf-8
encoding?

Regards,
Andrey

Donal K. Fellows

Nov 8, 2011, 6:46:36 AM
On 08/11/2011 11:40, andrey.nakin wrote:
> BTW I then modified first script line to following:
>
> #!/usr/bin/wish -encoding utf-8
>
> and now my script does not work in Linux :) I have to manually type
> "wish -encoding utf-8 myscript.tcl" rather than simply
> "./myscript.tcl". What is a proper header for Tk files in utf-8 encoding?

If that 'wish' is 8.4, then it indeed won't work. The -encoding option
was added in 8.5 to allow this sort of thing to be sorted out.

I advise encoding your files so they're pure ASCII, or perhaps it would
be better to say only using ASCII directly in scripts (the \u syntax
helps) and then keeping all the human-readable parts in message catalogs
(which are always UTF-8). The msgcat package (supplied with Tcl) makes
this easy (and you're also then much more prepared to support multiple
languages).
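
A minimal sketch of the msgcat approach, under stated assumptions: the key "Hello", the locale pl and the translation are invented for illustration, and in a real application the mcset line would normally live in a pl.msg catalog file loaded with ::msgcat::mcload.

```tcl
package require msgcat

# Register a translation for the Polish locale. The script itself stays
# pure ASCII because the translated text uses \u escapes.
::msgcat::mcset pl "Hello" "Cze\u015B\u0107"

# Select the locale, then look the message up by its ASCII key.
::msgcat::mclocale pl
set text [::msgcat::mc "Hello"]
puts [string length $text]
```

A widget would then use something like -text [::msgcat::mc "Hello"], so the script never contains non-ASCII literals.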

Donal.

Mark Janssen

Nov 8, 2011, 6:49:20 AM
You don't need step 3, because there was no step 3 :) I prefer option
3 because it prevents problems such as the one you were having. It is
also more resilient to editing from different machines with different
encodings (although a decent editor helps a lot in that respect).

Mark

Andreas Leitgeb

Nov 8, 2011, 7:10:39 AM
andrey.nakin <andrey...@gmail.com> wrote:
> #!/usr/bin/wish -encoding utf-8

That is some kind of unix-(mis?)feature:
the hash-bang line doesn't split the options, so that what really
gets called is effectively:
/usr/bin/wish "-encoding utf-8" myscript.tcl
but wish doesn't have a single option "-encoding utf-8".

There is however a "plan B" to circumvent this hashbang limitation:

#!/bin/sh
# a dummy comment ending with a backslash-character: \
exec /usr/bin/wish -encoding utf-8 "$0" "$@"

The trick is that the exec line is seen only by /bin/sh; Tcl treats it
as part of the preceding comment, because the trailing backslash
continues the comment onto the next line.

andrey.nakin

Nov 8, 2011, 7:38:59 AM
Andreas,

your solution works for me, thanks.

Regards,
Andrey

Jay

Dec 25, 2011, 12:29:16 PM
I am running Windows XP with ActiveState Tcl 8.6b2 / Tk 8.6b2

I have the following single line in my jj.tcl (saved in utf-8 encoding)
puts "अच"

Launching wish86.exe -encoding utf-8

(Administrator) 1 % source -encoding utf-8 Desktop/jj.tcl
invalid command name "puts"
(Administrator) 2 %

However, if I do
(Administrator) 2 % puts "अच"
अच

So why does a Tcl file saved in utf-8 encoding not work for me?

Any help appreciated.
Jay

Andreas Leitgeb

Dec 25, 2011, 5:33:54 PM
Jay <apna...@gmail.com> wrote:
> I am running window XP activestate Tcl8.6b2 / Tk8.6b2
> I have following one line in my jj.tcl ( saved as utf-8 encoding )
> puts "अच"
> Launching wish86.exe -encoding utf-8
> (Administrator) 1 % source -encoding utf-8 Desktop/jj.tcl
> invalid command name "puts"
> (Administrator) 2 %
> however, If do
> (Administrator) 2 % puts "अच"
> अच
> So why tcl file saved in utf-8 encoding does not work for me?

Perhaps jj.tcl isn't really utf-8? What tool or editor did
you use to create it? Can you do a hexdump of jj.tcl and post
all the hex bytes? I suspect your editor wrote some extra
bytes before the content (while saving in utf-8 encoding).

If you post the exact bytes, then we can probably decide whether
your editor wrote bogus data, or whether tcl is supposed to
understand the input.

If you don't have a hexdump tool, then use this script:

set fd [open Desktop/jj.tcl "r"]
fconfigure $fd -translation binary
set data [read $fd]; close $fd
binary scan $data H* bytes
puts $bytes

as an approximation.

Jay

Dec 25, 2011, 9:11:41 PM
On Dec 25, 2:33 pm, Andreas Leitgeb <a...@gamma.logic.tuwien.ac.at>
wrote:
I was using Windows Notepad for a quick experiment. Notepad's save-as
says utf-8.... Normally I don't use Notepad. MS is misleading.

notepad utf-8 format hexdump: efbbbf707574732022e0a485e0a49a22
gedit utf-8 format hexdump: 707574732022e0a485e0a49a220d0a

(Administrator) 199 % source -encoding utf-8 Desktop/jj.tcl
invalid command name "puts"
(Administrator) 200 % source -encoding utf-8 Desktop/jjj.tcl
अच

Thanks a lot for your help...
Jay

tomas

Dec 26, 2011, 6:16:25 AM
Jay <apna...@gmail.com> writes:

> On Dec 25, 2:33 pm, Andreas Leitgeb <a...@gamma.logic.tuwien.ac.at>
> wrote:
>> Jay <apnay...@gmail.com> wrote:
>> > I am running window XP activestate Tcl8.6b2 / Tk8.6b2
[...]

>>
>> Perhaps, jj.tcl isn't really utf-8?  What tool or editor did
>> you use to create it?
[...]

> I was using windows notepad for quick experiment. notepad save-as says
> utf-8.... Normally I don't use notepad. MS is misleading.
>
> notepad utf-8 format hexdump: efbbbf707574732022e0a485e0a49a22
> gedit utf-8 format hexdump: 707574732022e0a485e0a49a220d0a

Argh. This is Microsoft's well-known UTF-8 idiocy. One of the
advantages of UTF-8 is that plain ASCII files don't need a change:
they still are plain ASCII and they are at the same time UTF-8.

Not so in the Wonderful World of Microsoft: their tools stupidly insist
on inserting a Byte Order Mark (BOM) at the beginning of the file. You'd
need this BOM[1] if you are reading/writing 16-bit encodings (the reader
needs to know which byte sex the file was written with, to know whether
to swap the bytes in each 16-bit word). The BOM has the hexadecimal
value 0xfeff, so if you are reading 0xfffe you know to swap.

Now UTF-8 is a byte stream, so it doesn't need a BOM. The cited
Wikipedia page is more polite than me ("The Unicode Standard does
permit the BOM in UTF-8,[2] but does not require or recommend its
use"). My take is that Microsoft should be banned from standards bodies
until they learn to behave themselves. Brrr.
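
To see what the three extra bytes in the Notepad hexdump amount to, the bytes can be decoded by hand. This is only a sketch; the hex string is copied from the hexdump posted earlier in the thread, and the variable names are arbitrary.

```tcl
# Bytes of the Notepad-saved jj.tcl, as posted in the hexdump.
set bytes [binary format H* efbbbf707574732022e0a485e0a49a22]

# Decode them as UTF-8 and inspect the first character.
set text [encoding convertfrom utf-8 $bytes]
scan [string index $text 0] %c first
puts [format "first character: U+%04X" $first]

# The next four characters are the command name the author intended.
puts [string range $text 1 4]
```

The first character comes out as U+FEFF, so the first word of the script is presumably the BOM followed by "puts" rather than plain "puts", which would explain the "invalid command name" error.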

[1] <https://secure.wikimedia.org/wikipedia/en/wiki/Byte_order_mark>

Regards
-- tomás

Andreas Leitgeb

Dec 26, 2011, 2:27:34 PM
tomas <to...@axelspringer.de> wrote:
> Jay <apna...@gmail.com> writes:
>> I was using windows notepad for quick experiment. notepad save-as says
>> utf-8.... Normally I don't use notepad. MS is misleading.
>> notepad utf-8 format hexdump: efbbbf707574732022e0a485e0a49a22
>> gedit utf-8 format hexdump: 707574732022e0a485e0a49a220d0a
> Argh. This is the well-known Microsoft's UTF-8 Idiocy.
>
> Now UTF-8 is a byte stream, so it doesn't need a BOM. The cited
> Wikipedia page is more polite than me ("The Unicode Standard does
> permit the BOM in UTF-8,[2] but does not require or recommend its
> use").

If it *is* allowed, (no matter what the reasoning) then Tcl *should*
allow it, imho.

There have already been discussions, iirc, about filtering away the BOM
automatically for certain encodings, but I think they were declined
for reasons I don't remember.

It might be worth a discussion, though, to ignore the BOM specifically
for [source ...] if the specified (or system's default) encoding *allows*
for a BOM (even if unrecommended).

Code that does something like: eval [read [open $script]] would still
have to check for the BOM manually, like this (where $BOM holds \uFEFF):
eval [string trimleft [read [open $script]] $BOM]
if the authors really cared.
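
A slightly fuller sketch of such a manual check, assuming the script is UTF-8. The proc name readScript and the file name bom_demo.tcl are made up; the demo file is rebuilt from the Notepad hexdump posted earlier so that the example runs on its own.

```tcl
# Hypothetical helper: read a UTF-8 script and drop one leading BOM (U+FEFF).
proc readScript {path} {
    set fd [open $path r]
    fconfigure $fd -encoding utf-8
    set script [read $fd]
    close $fd
    if {[string index $script 0] eq "\uFEFF"} {
        set script [string range $script 1 end]
    }
    return $script
}

# Recreate the Notepad file from the hexdump posted in this thread.
set path [file join [pwd] bom_demo.tcl]
set fd [open $path w]
fconfigure $fd -translation binary
puts -nonewline $fd [binary format H* efbbbf707574732022e0a485e0a49a22]
close $fd

set script [readScript $path]
file delete $path

# The script now starts with the command name itself.
puts [string range $script 0 3]
```

After this, eval $script (or uplevel #0 $script) sees "puts" as the first word instead of a name that begins with the BOM.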

tomas

Dec 27, 2011, 5:53:06 PM
Andreas Leitgeb <a...@gamma.logic.tuwien.ac.at> writes:

> tomas <to...@axelspringer.de> wrote:
[...]
>> Argh. This is the well-known Microsoft's UTF-8 Idiocy.
>>
>> Now UTF-8 is a byte stream, so it doesn't need a BOM [...]
[...]
> If it *is* allowed, (no matter what the reasoning) then Tcl *should*
> allow it, imho.

No question. I was just lamenting that we're stuck with Yet Another
Broken Standard thanks to some corporate stupidity. Life as usual, sigh.

> There have been already discussions, iirc, about filtering away BOM
> automatically for certain encodings, but I think they were declined
> for some reasons I don't remember.

Yes, some heuristics seem to make sense here. I guess the UTF-8 BOM trio
is sufficiently unlikely at the beginning of a file that it'd be a safe
bet.

Regards
-- tomás