encoding questions

EL

Jun 1, 2008, 2:14:00 PM

Hi,
I am a bit confused about encodings in Tcl. Until now I thought it was
utf-8 by default, but obviously...

% puts [encoding convertto utf-8 München]
MÃ¼nchen

Is this a bug in puts? What encoding does the string have before [encoding
convertto], and what exactly does [encoding convertto] do to it?

TIA

--
Eckhard

bill...@alum.mit.edu

Jun 1, 2008, 2:58:44 PM

Your terminal's encoding is not UTF-8; it is ISO-8859-1 or Microsoft
CP1252. Tcl is correctly generating a UTF-8 string. The umlauted u is
U+00FC. In UTF-8 this becomes the two-byte sequence 0xC3 0xBC. In
ISO-8859-1 and MS CP1252, 0xC3 is upper case A with tilde; 0xBC is
"vulgar fraction one quarter".

If you are using an MS Windows system, the default system encoding is,
I believe, CP1252.
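
The byte values described above can be checked from tclsh. This is a minimal sketch; the string is written with a \u escape so that the result does not depend on how the script file or terminal is encoded, and the variable names are only illustrative.

```tcl
# The string from the original post, with the umlaut as U+00FC.
set s "M\u00fcnchen"

# Encode it as UTF-8. The result is a byte array, one byte per element.
set bytes [encoding convertto utf-8 $s]

# Dump the bytes as hex: the umlaut shows up as the pair c3 bc.
binary scan $bytes H* hex
puts $hex                                  ;# 4dc3bc6e6368656e

# Interpreting the same bytes as ISO-8859-1 gives one character per byte,
# which is the "A with tilde" plus "one quarter" pair seen on screen.
set wrong [encoding convertfrom iso8859-1 $bytes]
puts [string length $wrong]                ;# 8
```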

EL

Jun 1, 2008, 3:53:44 PM

bill...@alum.mit.edu wrote:

> Your terminal's encoding is not UTF-8 - it is ISO-8859-1 or Microsoft
> CP1252.

Hmm, on my Mac OSX Leopard Terminal:

$ locale
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=
$

That looks pretty much like UTF-8, doesn't it? Setting LC_ALL does not help.
According to the preferences, the Terminal encoding is also set to UTF-8.

Strange...

--
Eckhard

Jan Kandziora

Jun 1, 2008, 4:19:15 PM

EL wrote:

>
> % puts [encoding convertto utf-8 München]

> MÃ¼nchen

>
> Is this a bug in puts? What encoding has the string before [encoding
> convertto], and what exactly does [encoding convertto] do to it?
>

The statement above does double encoding. puts writes to the stdout channel, and
as long as you don't do

% fconfigure stdout -translation binary

strings are automatically converted to the channel's encoding (the system
encoding by default), regardless of Tcl's internal representation.
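
The double encoding can be reproduced without involving the terminal at all. In this sketch the second [encoding convertto] stands in for what a UTF-8 stdout channel does to the value handed to puts; the variable names are illustrative.

```tcl
set s "M\u00fcnchen"

# First conversion: what the original command did explicitly.
set once [encoding convertto utf-8 $s]

# Second conversion: stands in for the channel. Each byte of $once is
# treated as a character in the range U+0000..U+00FF and encoded again.
set twice [encoding convertto utf-8 $once]

binary scan $once  H* hexOnce
binary scan $twice H* hexTwice
puts $hexOnce     ;# 4dc3bc6e6368656e
puts $hexTwice    ;# 4dc383c2bc6e6368656e
```

The bytes c3 83 c2 bc are the UTF-8 forms of "Ã" and "¼", which is why a UTF-8 terminal shows those two characters. A plain "puts München" avoids the problem, since the channel already does the one conversion that is needed.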

Kind regards

Jan

bill...@alum.mit.edu

Jun 1, 2008, 4:19:48 PM

Hmm. This is odd. What do you get if you type:

encoding system

to tclsh?

Also, how did you enter the original string Muenchen?

EL

Jun 1, 2008, 4:43:01 PM

Jan Kandziora wrote:

Ah, okay... thanks.
Does that mean that the internal representation is already UTF-8? If so,
why does [encoding convertto] do the conversion again?

--
Eckhard

Kevin Kenny

Jun 1, 2008, 8:41:55 PM

EL wrote:

> Ah, okay... thanks.
> Does that mean that the internal representation is already UTF-8? If so,
> why does [encoding convertto] the conversion again?

Yes, the internal representation is already UTF-8[*]. The output of
[encoding convertto] is an array of bytes - which internally are
then encoded in UTF-8 again[**].

[encoding convertto] is almost never the right thing to do. The right
thing is usually to [fconfigure] a file channel to use the
encoding that you want in the file, and then not worry. (And all
your file channels will be in the system encoding by default, so
you usually don't need to worry about encodings at all!)
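
A minimal sketch of the [fconfigure] approach follows. The file name is hypothetical, and the file is read back in binary only to show which bytes the channel wrote.

```tcl
# Hypothetical scratch file in the current directory.
set path [file join [pwd] encoding-demo.txt]

# Write an ordinary Tcl string; the channel does the conversion.
set f [open $path w]
fconfigure $f -encoding utf-8
puts -nonewline $f "M\u00fcnchen"
close $f

# Read the raw bytes back to see what ended up on disk.
set f [open $path r]
fconfigure $f -translation binary
set raw [read $f]
close $f
file delete $path

binary scan $raw H* hex
puts $hex    ;# 4dc3bc6e6368656e
```

The script never calls [encoding convertto]; choosing the encoding on the channel is enough.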

The reason for [encoding convertto] is that some applications have
to deal with binary data that include encoded strings as a subset.
They can use [binary format] to produce things like integers
and floats, and [encoding convertto] to produce byte arrays in
a known encoding.
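
As an illustration of that use case, here is a sketch that packs a string into a length-prefixed binary record. The record layout (a 32-bit big-endian byte count followed by UTF-8 bytes) and the proc names packString and unpackString are assumptions made up for this example, not something from the thread.

```tcl
# Build a record: 4-byte big-endian length, then the UTF-8 bytes.
proc packString {s} {
    set payload [encoding convertto utf-8 $s]
    # I = 32-bit big-endian integer, a* = all bytes of the payload as-is.
    return [binary format Ia* [string length $payload] $payload]
}

# Reverse the process: read the length, then decode that many bytes.
proc unpackString {record} {
    binary scan $record I len
    # @4 skips the length field; a$len reads exactly len bytes.
    binary scan $record @4a$len payload
    return [encoding convertfrom utf-8 $payload]
}

set record [packString "M\u00fcnchen"]
binary scan $record H* hex
puts $hex                                   ;# 000000084dc3bc6e6368656e
puts [string length [unpackString $record]] ;# 7
```

Note that the length field counts bytes (8), not characters (7), which is exactly the distinction [encoding convertto] makes visible.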

[*] Well, it differs from UTF-8 in one respect: the null byte is
represented by the illegal UTF-8 sequence C0 80. If you don't
know what this means, you surely don't have an application
that would care. In any case, at script level, Everything Is
A String, and a string can contain any Unicode character.

[**] More precisely, they behave as if the bytes of the string
were interpreted as ISO8859-1 characters and re-encoded in
UTF-8. The "behave as if" in the previous sentence reflects
the fact that the UTF-8 representation will not be generated
unless it is needed. A useful subset of things you could
do with the byte array (including writing it to a channel,
passing it through [binary format], and executing operations
such as [string range]) functions happily without the
UTF-8 conversion. This is an optimization; at script level,
Everything Is A String.

--
73 de ke9tv/2, Kevin

yahalom

Jun 1, 2008, 9:12:06 PM

I asked this in a different thread, but as it is related I am posting it
here as well.
I have an XML request that I am getting over a TCP socket. I do not
know in advance what the XML encoding will be. When I configure my
channel with

fconfigure $sock -encoding utf-8

(on both client and server) I get the XML properly.
But what will happen if the client socket/XML is iso8859-8, or some other
encoding?
How can I work with the socket in the server without forcing an encoding,
and then use the proper encoding based on the XML header tag? Is there a
way to know what encoding my client uses (without agreeing in advance)? Do I
need to have a server listener per encoding?

EL

Jun 2, 2008, 4:03:07 AM

yahalom wrote:

> I asked this in a different thread but as this is related I post it
> here also:
> I have an xml request that I am getting over a tcp socket. I do not
> know in advance what will be the xml encoding. when I configure my
> channel with:
> fconfigure $sock -encoding utf-8
> on both client/server) I get the xml properly.
> but what will happen if the client socket/xml is iso8859-8? or other
> encoding?

My problem is actually of quite a similar nature,
except that it is XHTML, which is sent as UTF-8 from the browser and then
seems to be encoded to UTF-8 again by Tclhttpd. At least the XHTML has
a <meta http-equiv="Content-Type" ... charset="utf-8"/>.

--
Eckhard

schlenk

Jun 2, 2008, 4:55:10 AM

Appendix F of the XML specification describes an algorithm for
detecting the encoding of an XML document. If you happen to use tDOM, it
provides some procs in the tdom.tcl file that implement this:
tDOM::xmlOpenFile or so...

Michael

yahalom

Jun 2, 2008, 7:08:20 AM

The problem is not finding the XML encoding, which is written
explicitly. It is knowing in advance how to fconfigure the socket so
that the XML will arrive properly. tDOM parses the XML fine when it arrives
in the correct encoding. So how can I know how to fconfigure my
socket/channel in advance? I am looking for a channel encoding "sniffer", if
there is such a thing. And if not, do I need a socket server per
encoding if I want to support request XML in multiple encodings?
Doesn't this make the XML encoding header unhelpful (since I need to
know the socket (XML) encoding in advance anyway)?

schlenk

Jun 2, 2008, 7:41:21 AM

You have two cases:
1. the <?xml version="1.0" encoding="xyz" ?> header is correct
2. the xml header is incorrect

In case 1 you can use the Appendix F algorithm, which you can steal from the
tDOM procs I mentioned above: configure the channel as binary, read the
beginning of the data, then reconfigure once you know what's right. Look at
the tdom.tcl file to see how it's done.

In case 2 you're in trouble and need to fix your sender (or make wild
guesses similar to the stuff Mozilla does to guess encodings of web
pages). Your setup is broken and the only sane assumption would be to
use UTF-8 as the encoding (the default for XML without an encoding
declaration).
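
For case 1, the read-then-reconfigure idea can be sketched roughly as follows. This is not the tdom.tcl code and not the full Appendix F algorithm: it assumes the declaration sits on the first line, that the encoding is ASCII-compatible (so no UTF-16), and the proc names and the small label table are made up for illustration.

```tcl
# Map an XML encoding label to a Tcl encoding name.
# Illustrative only: a real table would cover many more labels.
proc tclEncodingFor {label} {
    set label [string tolower $label]
    switch -glob -- $label {
        utf-8        { return utf-8 }
        iso-8859-*   { return iso8859-[string range $label 9 end] }
        windows-125* { return cp[string range $label 8 end] }
        default      { error "unsupported XML encoding \"$label\"" }
    }
}

# Read the first line as raw bytes, look for an encoding declaration,
# then switch the channel to that encoding for the rest of the data.
# Returns the first line (still undecoded) so the caller can keep it.
proc configureFromXmlDecl {chan} {
    fconfigure $chan -translation binary
    set first [gets $chan]
    set enc utf-8   ;# XML default when nothing is declared
    if {[regexp {^<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']} \
            $first -> label]} {
        set enc [tclEncodingFor $label]
    }
    fconfigure $chan -translation auto -encoding $enc
    return $first
}

# Usage sketch (the socket variable is hypothetical):
#   configureFromXmlDecl $sock
#   set body [read $sock]
```

If the first line turns out not to be a declaration, it has already been read as bytes, so the caller would have to decode it separately, for example with [encoding convertfrom utf-8].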

Michael

yahalom

Jun 2, 2008, 8:29:25 AM

Thanks for the reply, it is what I was looking for.
I am interested in option 1, where the XML header is correct. In case 2 we
can tell the client to fix his code.
I looked at tdom.tcl and saw the implementation. It looks a bit
clumsy (not because of the code, just the way to do it). Is this the
only option? Must I fconfigure the channel? Or can I get the client's
encoding configuration by some magical TCP command? My problem is on
Windows, where we use a Tcl server. On Linux we use xinetd to run our
Tcl process for each request, so xinetd manages communication. How is
it that xinetd works well with the encoding (at least with UTF-8)? I
assume it does not decide based on my XML declaration.

Fredderic

Jun 2, 2008, 3:01:41 PM

On Mon, 2 Jun 2008 05:29:25 -0700 (PDT),
yahalom <yaha...@gmail.com> wrote:

> I looked at tdom.tcl and I saw the implementation. it looks a bit
> clumsy (not because of the code - just the way to do it). is this the
> only option? must I fconfigure the channel? or can I get the client
> encoding configuration by some magical tcp command? my problem is on
> windows where we use a tcl server. on linux we use xinetd to run our
> tcl process for each request. so xinetd manages communication. how is
> it that xinetd works well with the encoding (at least the utf-8) - I
> assume it does not decide based on my xml declaration.

If I remember correctly, xinetd just passes the socket through to your
application, so the data comes through exactly as it's sent, and
xinetd has nothing further to do with it. No translation or encoding
going on at all.

As I understand the problem, you basically read the first line, parse
out the encoding, [fconfigure] the channel with that encoding, and
continue on your merry way. If there's no encoding specified, you
assume it's UTF-8. Not sure what it does if the first line isn't a
header line but is the start of the data... That may be where it
starts to look clumsy...?

In any case using [fconfigure] to switch encodings is fine. It's much
better than reading everything in binary and using [encoding convertfrom]
to convert it, especially when you get multi-byte characters split
between reads and similar ugly issues. It's never a particularly clean
issue to deal with in any case.
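
The split-character problem can be shown in a few lines. This sketch cuts the two UTF-8 bytes of U+00FC into separate "reads" by hand; what the piecewise decode produces depends on the Tcl version (wrong characters in some versions, an error in others), so the code only checks that it does not match.

```tcl
# One character, two bytes in UTF-8: C3 BC.
set bytes [encoding convertto utf-8 "\u00fc"]
set part1 [string range $bytes 0 0]   ;# pretend this arrived first
set part2 [string range $bytes 1 1]   ;# and this arrived in the next read

# Decoding the complete sequence works.
set whole [encoding convertfrom utf-8 $bytes]

# Decoding each chunk on its own does not give the same result.
if {[catch {
    set piecewise [encoding convertfrom utf-8 $part1][encoding convertfrom utf-8 $part2]
}]} {
    set piecewise "<decode error>"
}

puts [expr {$whole eq "\u00fc"}]      ;# 1
puts [expr {$piecewise eq $whole}]    ;# 0
```

A channel configured with [fconfigure -encoding] keeps the undecoded tail between reads, so a script that lets the channel do the decoding does not have to handle this case itself.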

Fredderic

Donald Arseneau

Jun 2, 2008, 3:53:52 PM

On Jun 2, 5:29 am, yahalom <yahal...@gmail.com> wrote:
> > > > > I have an xml request that I am getting over a tcp socket.
> > > > > I do not know in advance what will be the xml encoding.
> > > > > how can I work with the socket in the server without
> > > > > forcing encoding and then use
> > > > > the proper encoding based on the xml header tag?

> > tdom procs i mentioned above, configure as binary, read the beginning
> > of the data, than reconfigure if you know whats right,

> only option? must I fconfigure the channel? or can I get the client
> encoding configuration by some magical tcp command?

It sounds like exactly what you asked for. I suppose you could
keep reading "binary" and explicitly convert the encoding of the
data, but that seems more clumsy. (It could be necessary if you
get your data in a chunk some other way.)

> windows where we use a tcl server. on linux we use xinetd to run our
> tcl process for each request. so xinetd manages communication. how is
> it that xinetd works well with the encoding (at least the utf-8)

You may as well ask how a German-made tape recorder speaks English
so well! It just reproduces data without understanding it at all.

Donald Arseneau as...@triumf.ca

yahalom

Jun 4, 2008, 5:07:09 AM

I made some progress, but this issue pops up in different places, now
with tclhttpd .tml pages.
I made a simple test.tml page with no code at all:
שלום

I run this page both on Linux and on Windows. On Linux it works fine.
On Windows the Hebrew שלום is gibberish.
When I change the test.tml content to:
[encoding convertto [encoding system] שלום] the Hebrew is fine.
Why the different behaviour? What should be changed so that Windows
will work fine?

EL

Jun 4, 2008, 2:02:15 PM

yahalom wrote:

What encoding is your HTML or XHTML header referring to? Did you try
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
in <head>?

Also, do you have [Doc_Dynamic] enabled? I saw something on the wiki
recently (currently it seems to be down, together with google, youtube
and probably a few other sites in NA... don't know why).

--
Eckhard

yahalom

Jun 5, 2008, 5:46:48 AM

My meta is cp-1255. I do not have Doc_Dynamic enabled as I want the
page to be dynamic.
What I want to know is why there is a difference in behaviour between the
Windows and Linux Tcl servers. They have exactly the same Tcl/libraries/code.
I see the same difference in other places when I work with directUrl:
on Windows I need to add the "encoding convertto" and on Linux it
works fine without it. I can of course add if
{$tcl_platform(platform) eq "windows"} {... and handle things differently
between the platforms, but this is not why we use Tcl.
Is this a bug? We use the latest, 8.5.2.
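
The thread does not settle the cause, but one thing worth checking on both machines is the system encoding, since that is the default for channels. The sketch below prints it and shows how the same bytes turn into different strings under two encodings. It assumes the page was saved as UTF-8, and uses iso8859-1 only as a stand-in for a single-byte Windows code page; both are assumptions, not facts from the posts.

```tcl
# Report what this interpreter considers the defaults.
puts "platform:        $tcl_platform(platform)"
puts "system encoding: [encoding system]"

# The Hebrew word from the test page, written with escapes.
set text "\u05e9\u05dc\u05d5\u05dd"

# Assumption: the file on disk holds this text as UTF-8.
set onDisk [encoding convertto utf-8 $text]

# Decoded with the matching encoding, the text survives.
set asUtf8 [encoding convertfrom utf-8 $onDisk]

# Decoded with a single-byte encoding, every byte becomes a character.
set asSingleByte [encoding convertfrom iso8859-1 $onDisk]

puts [expr {$asUtf8 eq $text}]        ;# 1
puts [string length $asSingleByte]    ;# 8
```

If [encoding system] differs between the Linux and Windows servers, opening the file with an explicit [fconfigure -encoding] on both would remove that difference without any platform test.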

EL

Jun 5, 2008, 8:22:33 AM

yahalom wrote:

> On Jun 5, 3:02 am, EL <eckhardnos...@gmx.de> wrote:
>> yahalom wrote:
>>
>>> I made some progress but this issue pops up in different places. now
>>> with tclhttp tml pages.
>>> I made a simple test.tml page with no code at all:
>>> שלום
>>> I run this page both on linux and on windows. on linux it works fine.
>>> on windows this Hebrew שלום is gibberish.
>>> when I change the test.tml content to:
>>> [encoding convertto [encoding system] שלום] the hebrew is fine.
>>> why is the different behaviour? what should be changed so the windows
>>> will work fine?
>> What encoding is your HTML or xhtml header refering to? Did you try
>> <meta http-equiv="Content-Type" content="text/html; charset="UTF-8" />
>> in <head>?
>>
>> Also, do you have [Doc_Dynamic] enabled? I saw something on the wiki
>> recently (currently it seems to be down, together with google, youtube
>> and probably a few other sites in NA... don't know why).
>>
>> --
>> Eckhard
>
> my meta is cp-1255. I do not have Doc_Dynamic enabled as I want the
> page to be dynamic.

You need [Doc_Dynamic] to have /always/ dynamic content. The page is
cached if you leave it out.

> What I want to know it why the difference in behaviour between the
> window/linux tcl server.

How do you access the server, with which browsers?
Overall I assume you talk about form values? That's where I have trouble
as well, as I described above.

--
Eckhard

yahalom

Jun 5, 2008, 9:01:36 AM

So probably it is disabled. My pages are not being cached.

>
> > What I want to know it why the difference in behaviour between the
> > window/linux tcl server.
>
> How do you access the server, with which browsers?

In both cases with Internet Explorer 7.

> Overall I assume you talk about form values? That's where I have trouble
> as well, as I described above.

My problem is even simpler: just a plain .tml file that should
show correct text. The text is written in the file. For some reason it
works differently on Windows. Truly, I do not mind how it works as
long as it works the same on both platforms. I am trying to understand
what can make the difference.