% puts [encoding convertto utf-8 München]
München
Is this a bug in puts? What encoding has the string before [encoding
convertto], and what exactly does [encoding convertto] do to it?
TIA
--
Eckhard
If you are using an MS Windows system, the default system encoding is,
I believe, CP1252.
Hmm, on my Mac OSX Leopard Terminal:
$ locale
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=
$
Looks pretty much like UTF-8, or no? Setting LC_ALL does not help.
According to the preferences the Terminal encoding is also set to utf-8.
Strange...
--
Eckhard
% fconfigure stdout -translation binary
strings are automatically translated to the system encoding, regardless of
tcl's internal representation.
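A minimal sketch of the effect, assuming a tclsh whose stdout is a normal text channel. The \u00fc escape stands for "ü" so the script does not depend on the encoding of the source file:

```tcl
set s "M\u00fcnchen"
puts [string length $s]            ;# counts characters

# [encoding convertto] returns the UTF-8 bytes of the string;
# "ü" needs two bytes, so the result is one element longer.
set bytes [encoding convertto utf-8 $s]
puts [string length $bytes]        ;# counts bytes

# On a text channel each of those bytes would be treated as a character
# and encoded a second time. A binary channel writes them unchanged.
fconfigure stdout -translation binary
puts $bytes
```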
Kind regards
Jan
encoding system
to tclsh?
Also, how did you enter the original string Muenchen?
Ah, okay... thanks.
Does that mean that the internal representation is already UTF-8? If so,
why does [encoding convertto] do the conversion again?
--
Eckhard
> Ah, okay... thanks.
> Does that mean that the internal representation is already UTF-8? If so,
> why does [encoding convertto] do the conversion again?
Yes, the internal representation is already UTF-8[*]. The output of
[encoding convertto] is an array of bytes - which internally are
then encoded in UTF-8 again[**].
[encoding convertto] is almost never the right thing to do. (The right
thing is usually to [fconfigure] a file channel to accept the
encoding that you want in the file, and then not worry. And all
your file channels will be in the system encoding by default, so
you usually don't need to worry about encodings at all!)
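As a rough illustration of the [fconfigure] approach, here is a sketch that writes a string to a file as UTF-8 and reads it back. The file name encoding-demo.txt is made up for the example:

```tcl
set s "M\u00fcnchen"
set path [file join [pwd] encoding-demo.txt]   ;# hypothetical file name

# Tell the channel which encoding the file should have, then write
# ordinary strings; the channel does the conversion.
set out [open $path w]
fconfigure $out -encoding utf-8
puts -nonewline $out $s
close $out
set size [file size $path]    ;# bytes on disk, not characters

# Reading works the same way in reverse.
set in [open $path r]
fconfigure $in -encoding utf-8
set back [read $in]
close $in
file delete $path

puts "bytes on disk: $size, round trip ok: [string equal $s $back]"
```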
The reason for [encoding convertto] is that some applications have
to deal with binary data that include encoded strings as a subset.
They can use [binary format] to produce things like integers
and floats, and [encoding convertto] to produce byte arrays in
a known encoding.
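A small sketch of that use case. The record layout (a 16-bit big-endian byte count followed by UTF-8 bytes) and the proc names packString and unpackString are invented for the example:

```tcl
# Build a binary record: 2-byte length, then the string as UTF-8 bytes.
proc packString {s} {
    set bytes [encoding convertto utf-8 $s]
    return [binary format Sa* [string length $bytes] $bytes]
}

# Reverse the process: read the length, cut out the bytes, decode them.
proc unpackString {record} {
    binary scan $record S len
    set bytes [string range $record 2 [expr {$len + 1}]]
    return [encoding convertfrom utf-8 $bytes]
}

set record [packString "M\u00fcnchen"]
puts [string length $record]
puts [string equal [unpackString $record] "M\u00fcnchen"]
```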
[*] Well, it differs from UTF-8 in one respect: the null byte is
represented by the illegal UTF-8 sequence C0 80. If you don't
know what this means, you surely don't have an application
that would care. In any case, at script level, Everything Is
A String, and a string can contain any Unicode character.
[**] More precisely, they behave as if the bytes of the string
were interpreted as ISO8859-1 characters and re-encoded in
UTF-8. The "behave as if" in the previous sentence reflects
the fact that the UTF-8 representation will not be generated
unless it is needed. A useful subset of things you could
do with the byte array (including writing it to a channel,
passing it through [binary format], and executing operations
such as [string range]) functions happily without the
UTF-8 conversion. This is an optimization; at script level,
Everything Is A String.
--
73 de ke9tv/2, Kevin
My problem is actually of quite a similar nature.
Just that it is XHTML, which is sent as utf-8 from the browser and then
seems to be encoded again to utf-8 from Tclhttpd. At least the XHTML has
a <meta http-equiv="Content-Type" ... charset="utf-8"/>.
--
Eckhard
Appendix F of the XML specification provides some algorithm for
detecting the encoding of the xml. If you happen to use tdom, it
provides some procs in the tdom.tcl file that implement this.
tDOM::xmlOpenFile or so...
Michael
the problem is not finding the xml encoding which is written
explicitly. it is knowing in advance how to fconfigure the socket so
the xml will arrive properly. tdom parses the xml fine when it arrives
in the correct encoding. so how can I know how to fconfigure my socket/
channel in advance? I am looking for a channel encoding "sniffer", if
there is such a thing. and if not, do I need a socket server per
encoding if I want to support request xml in multiple encodings?
doesn't this make the xml encoding header unhelpful (since I need to
know the socket (xml) encoding in advance anyway)?
you have two cases:
1. the <?xml version="1.0" encoding="xyz" ?> header is correct
2. the xml header is incorrect
In case 1. you can use the Appendix F algorithm, which you can steal from the
tdom procs I mentioned above: configure the channel as binary, read the
beginning of the data, then reconfigure once you know what's right. Look at
the tdom.tcl file to see how it's done.
In case 2. you're in trouble and need to fix your sender (or make wild
guesses similar to the stuff mozilla does to guess encodings of web
pages). Your setup is broken and the only sane assumption would be to
use utf-8 as the encoding (default for xml without encoding
declaration).
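A rough sketch of case 1, not the tdom code itself. The proc name xmlEncodingFromProlog and the demo file are invented, and the sketch only looks at the encoding declaration; the full Appendix F procedure also checks byte order marks and UTF-16/UTF-32 patterns, which are left out here. The demo uses a file so that [seek] can rewind; on a socket you would have to convert the bytes already read yourself.

```tcl
# Guess a Tcl encoding name from the first bytes of an XML document.
# Falls back to utf-8 when there is no usable declaration.
proc xmlEncodingFromProlog {head} {
    set enc utf-8
    if {[regexp {^<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']} $head -> name]} {
        set name [string tolower $name]
        # Map two common IANA spellings to the names Tcl uses.
        if {[regexp {^iso-8859-(\d+)$} $name -> n]} {set name iso8859-$n}
        if {[regexp {^windows-(\d+)$} $name -> n]} {set name cp$n}
        if {$name in [encoding names]} {set enc $name}
    }
    return $enc
}

# Demo: write a Latin-1 document, then read it back by sniffing first.
set path [file join [pwd] sniff-demo.xml]   ;# hypothetical file name
set out [open $path w]
fconfigure $out -encoding iso8859-1
puts $out "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><city>M\u00fcnchen</city>"
close $out

set chan [open $path r]
fconfigure $chan -translation binary        ;# raw bytes for the sniffing step
set head [read $chan 200]
seek $chan 0 start
fconfigure $chan -translation auto -encoding [xmlEncodingFromProlog $head]
set xml [read $chan]
close $chan
file delete $path

puts [xmlEncodingFromProlog $head]
puts [expr {[string first "M\u00fcnchen" $xml] >= 0}]
```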
Michael
thanks for the reply - it is what I was looking for.
I am interested in option 1. xml header is correct. in case 2 we can
tell the client to fix his code.
I looked at tdom.tcl and I saw the implementation. it looks a bit
clumsy (not because of the code - just the way to do it). is this the
only option? must I fconfigure the channel? or can I get the client
encoding configuration by some magical tcp command? my problem is on
windows where we use a tcl server. on linux we use xinetd to run our
tcl process for each request. so xinetd manages communication. how is
it that xinetd works well with the encoding (at least the utf-8) - I
assume it does not decide based on my xml declaration.
> I looked at tdom.tcl and I saw the implementation. it looks a bit
> clumsy (not because of the code - just the way to do it). is this the
> only option? must I fconfigure the channel? or can I get the client
> encoding configuration by some magical tcp command? my problem is on
> windows where we use a tcl server. on linux we use xinetd to run our
> tcl process for each request. so xinetd manages communication. how is
> it that xinetd works well with the encoding (at least the utf-8) - I
> assume it does not decide based on my xml declaration.
If I remember correctly, xinetd just passes the socket through to your
application, so the data comes through exactly as it's sent, and
xinetd has nothing further to do with it. No translation or encoding
going on at all.
As I understand the problem, you basically read the first line, parse
out the encoding, [fconfigure] the channel with that encoding, and
continue on your merry way. If there's no encoding specified, you
assume it's UTF-8. Not sure what it does if the first line isn't a
header line but is the start of the data... That may be where it
starts to look clumsy...?
In any case using [fconfigure] to switch encodings is fine. It's much
better than reading everything in binary and using [encoding convertfrom]
to convert it, especially when you get multi-byte characters split
between reads and similar ugly issues. It's never a particularly clean
issue to deal with in any case.
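To make the split-read problem concrete, here is a small sketch that cuts a UTF-8 byte string in the middle of a two-byte character and converts the pieces separately. What happens to the broken pieces depends on the Tcl version (older versions are assumed to substitute characters silently, newer ones may raise an error), so the sketch catches both outcomes:

```tcl
set bytes [encoding convertto utf-8 "M\u00fcnchen"]

# Simulate two reads that split the two-byte "ü" down the middle.
set first [string range $bytes 0 1]
set rest  [string range $bytes 2 end]

# Converting the complete byte string gives the original text back.
set whole [encoding convertfrom utf-8 $bytes]

# Converting chunk by chunk does not: each chunk has half a character.
if {[catch {
    set piecewise [encoding convertfrom utf-8 $first][encoding convertfrom utf-8 $rest]
} err]} {
    puts "chunked conversion failed: $err"
} else {
    puts "chunked result matches: [string equal $piecewise $whole]"
}
```

A channel configured with [fconfigure -encoding] keeps the incomplete bytes until the rest arrives, which is why it is the less painful option.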
Fredderic
> > tdom procs i mentioned above, configure as binary, read the beginning
> > of the data, then reconfigure once you know what's right,
> only option? must I fconfigure the channel? or can I get the client
> encoding configuration by some magical tcp command?
It sounds like exactly what you asked for. I suppose you could
keep reading "binary" and explicitly convert the encoding of the
data, but that seems more clumsy. (It could be necessary if you
get your data in a chunk some other way.)
> windows where we use a tcl server. on linux we use xinetd to run our
> tcl process for each request. so xinetd manages communication. how is
> it that xinetd works well with the encoding (at least the utf-8)
You may as well ask how a German-made tape recorder speaks English
so well! It just reproduces data without understanding it at all.
Donald Arseneau as...@triumf.ca
I run this page both on linux and on windows. on linux it works fine.
on windows this Hebrew שלום is gibberish.
when I change the test.tml content to:
<b>[encoding convertto [encoding system] שלום]</b> the hebrew is fine.
why is the behaviour different? what should be changed so that windows
will work fine?
What encoding is your HTML or xhtml header referring to? Did you try
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
in <head>?
Also, do you have [Doc_Dynamic] enabled? I saw something on the wiki
recently (currently it seems to be down, together with google, youtube
and probably a few other sites in NA... don't know why).
--
Eckhard
my meta is cp-1255. I do not have Doc_Dynamic enabled as I want the
page to be dynamic.
What I want to know is why the behaviour differs between the
windows/linux tcl servers. they are totally the same tcl/libraries/code.
I have the same difference in other places when I work with directUrl.
on windows I need to add the "encoding convertto" and on linux it
works fine without it. I can of course add if
{$tcl_platform(platform) eq "windows"} {... and handle things differently
between the platforms but this is not why we use tcl.
is this a bug? we use the latest, 8.5.2.
You need [Doc_Dynamic] to have /always/ dynamic content. The page is
cached if you leave it out.
> What I want to know is why the behaviour differs between the
> windows/linux tcl servers.
How do you access the server, with which browsers?
Overall I assume you talk about form values? That's where I have trouble
as well, as I described above.
--
Eckhard
in both cases with internet explorer 7.
> Overall I assume you talk about form values? That's where I have trouble
> as well, as I described above.
my problem is even more simple. just a simple .tml file that should
show correct text. the text is written in the file. for some reason on
windows it works differently. truly I do not mind how it will work as
long as it will work the same on both platforms. I try to understand
what can make the difference.