
encoding problem - handling a char in cp1252 on unix


Robert Karen

Dec 24, 2013, 6:00:12 PM
I have a file on unix that contains a 'Z with caron' character: 8E in cp1252,
017D in Unicode. I am able to read the file and write it to another file, and
the character is unchanged. However, for some reason, when I pass it to a Tcl
command we created in C, the character looks like it has another character
prepended to it, as if it's passing through a conversion to UTF somewhere.
It didn't help to change the encoding, like this:
set fp [open $file r]
fconfigure $fp -encoding cp1252   ;# also tried binary
set cmd [read $fp]
eval $cmd

Thanks for any help!

Robert Karen

tomás zerolo

Dec 25, 2013, 5:41:34 AM
Tcl's internal encoding is UTF-8. In this encoding, the Z with caron
(official name: "LATIN CAPITAL LETTER Z WITH CARON", formerly called
"LATIN CAPITAL LETTER Z HACEK") is represented by two bytes, in
hexadecimal notation "c5 bd".
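For the record, the byte values mentioned in this thread can be checked with a quick Python one-off (Python is used here only to illustrate the bytes; Tcl's internal form corresponds to the UTF-8 one):

```python
# U+017D, LATIN CAPITAL LETTER Z WITH CARON, in the encodings discussed here
zh = "\u017d"
print(zh.encode("utf-8"))   # b'\xc5\xbd' -- the two bytes of the UTF-8 form
print(zh.encode("cp1252"))  # b'\x8e'     -- the single byte in the unix file
```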

So what you first have to find out is what encoding your file is
in. Hexdumping it will help with that. There are so many possibilities
that it's best to prepare a minimal example and post it here.

Then, to your command implemented in C: C is in itself naïve about
encodings. If you don't explicitly do anything about it, it'll just
"see" the byte stream in whatever encoding is there. That might be the
right thing or not, depending on what you are doing. If you are,
e.g. trying to loop over *characters* as opposed to *bytes*, you'll be
in for surprises when working on UTF-8 encoded strings, since
characters now occupy a variable number of bytes (and those beyond the
ASCII set always occupy more than one). For example, if you try viewing
the UTF-8 encoded Ž as if it were Latin-1, you will "see" two
characters: an A with ring above and a one-half sign, like so: Å½
(hopefully your news reader doesn't play games with that ;-D), because
c5 interpreted as Latin-1 is the A-ring and bd the one-half.
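That mis-reading is easy to reproduce outside of C; here is a small Python sketch (illustrative only) of what happens when UTF-8 bytes are viewed through Latin-1 glasses:

```python
# encode Ž as UTF-8, then wrongly decode the bytes as Latin-1
data = "\u017d".encode("utf-8")    # b'\xc5\xbd'
mojibake = data.decode("latin-1")  # 'Å½': A-ring followed by one-half
print(mojibake)
```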

If this is what you are trying to do, see, e.g. the functions Tcl_Uni*
and Tcl_Utf*. For example, Tcl_UtfNext and Tcl_UtfPrev allow you to move
along a UTF-8 encoded buffer, character by character.

If you want to use the standard C functions instead of Tcl, there are
the "wide character" functions like mbsrtowcs and their ilk.

Just come back with more details if you need help on specifics.

Regards
-- tomás

tomás zerolo

Dec 25, 2013, 5:46:03 AM

Following up on myself (yeah, bad style: may the Net Gods forgive
me). Here's a tutorial on multibyte characters in C which seems quite
nice:

<http://www.cprogramming.com/tutorial/unicode.html>

Caveat emptor: I didn't go through it thoroughly. So you come back and
tell us ;-)

regards
-- tomás

Robert Karen

Dec 26, 2013, 11:46:39 AM
Thanks for taking the time to answer.
Here is some more info: in this file on unix, the first character of
'Zena Deti' shows as octal 216 (decimal 142):
# grep CENTRUM /dbdev/elc/cspx0713/nfyth.dlg | od -c
0000000 1 \t C E N T R U M . C Z - 216
0000020 e n a D e t i ( M o n t h l
0000040 y N e t ) ~ n 1 1 1 3 0 8 0 3
0000060 \n
0000061

this is pasted from a cp1252 page:
8E Ž 017D LATIN CAPITAL LETTER Z WITH CARON
8E (in hex) = 142 (decimal) = 216 (octal), right?
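Right; the three notations name the same byte value, which a one-liner confirms (od prints octal by default, while cp1252 tables usually use hex):

```python
# the same byte value in the three notations used in this thread
assert 0x8E == 142 == 0o216
print("0x8E == 142 == 0o216")
```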
so I think that this unix file was edited by someone using a PC, which
inserted the cp1252 code '8E' for the Z with caron. The unix file above is
part of a database. I later send this data to a Tcl client program running
on a PC and it is handled OK. Then I send it back to the server from the PC.
I have to keep the same encoding so the database on the server recognizes
it, and I managed to do that, keeping the same cp1252 encoding for this
character. When sent back to the unix server it is run in a command, like:
save_selections {CENTRUM.CZ - Žena Deti (Monthly Net)~n11130803}
The character still has its cp1252 encoding in the file that arrives on the
client, but for some reason, once inside the Tcl command 'save_selections',
it gets converted to UTF-8, and it sees the characters that you mentioned:
the A with ring and the 1/2.
I'd like to pass the cp1252 encoding of the string to the Tcl command.
I hope that makes sense. It doesn't make a difference whether that Tcl
command is sourced from a file or read in with encoding cp1252 and then
eval'd (as I said in the original post).
Does that make sense?

RK


Robert Karen

Dec 26, 2013, 12:47:01 PM
In my last post, I wasn't clear about how I was running the command on the server.
What I should have said is that I have tried sourcing the command from a file,
which I know uses the local OS's encoding (iso-8859-1). So I tried reading the
file with either of these encodings:
#fconfigure $in -encoding cp1252
fconfigure $in -encoding binary
set useEvalNotSource 1
set runningStyle eval
set entireFile [read $in]
logputs "getReturnMessage: using encoding of [fconfigure $in -encoding] for localfile's contents. "
... eval $entireFile

but in the log files it doesn't look right for either of these. The first
diagnostic from within the command shows a different character than the 8E
(cp1252 encoding):
with binary encoding :
12:34:51:<tclserver> getReturnMessage: using encoding of binary for localfile's contents.
12:34:51:<tclserver> getReturnMessage: running command ADD_MULTI_DATA_FILE 2 5 {^Tn11130803~CENTRUM.CZ - \216ena Det...thly Net)}
(multiline input).
12:34:51:t 2 n 5 C n11130803 T CENTRUM.CZ - \302\216ena Deti (Monthly Net)

with cp1252 encoding:
12:21:15:<tclserver> getReturnMessage: using encoding of cp1252 for localfile's contents.
12:21:15:<tclserver> getReturnMessage: running command ADD_MULTI_DATA_FILE 2 4 {^Tn11130803~CENTRUM.CZ - ?ena Det...thly Net)}
(multiline input).
12:21:15:t 2 n 4 C n11130803 T CENTRUM.CZ - \305\275ena Deti (Monthly Net)

DrS

Dec 26, 2013, 3:48:22 PM
On 12/26/2013 12:47 PM, Robert Karen wrote:
> In last post, I wasn't clear about how I was running the command on the server.
> What I should have said is that I have tried sourceing the command from a file,
> which I know uses local os' encoding (iso-8859-1) . so I tried reading the file

You mentioned that someone must have edited the file. In that case,
have you tried using the local os' encoding: iso-8859-1?

You may also try the unicode encoding. That did the trick for me a
while back.


DrS

tomás zerolo

Dec 26, 2013, 3:52:00 PM
Robert Karen <robert....@gmail.com> writes:

> Thanks for taking time to answer.

You're welcome

> here is some more info. in this file on unix the 'Zena Deti'
> has octal 216 (142)
> # grep CENTRUM /dbdev/elc/cspx0713/nfyth.dlg | od -c
> 0000000 1 \t C E N T R U M . C Z - 216
> 0000020 e n a D e t i ( M o n t h l
> 0000040 y N e t ) ~ n 1 1 1 3 0 8 0 3
> 0000060 \n
> 0000061
>
> this is pasted from a cp1252 page:
> 8E Ž 017D LATIN CAPITAL LETTER Z WITH CARON
> 8E (in hex) = 142 (decimal) = 216 (octal), right?
> so I think that this unix file is edited by someone using a pc and it
> inserts cp1252 code '8E' for the Z w/ caron.

Agreed so far.

> The unix file above is part of a database. I later send this data to
> a tcl client program running on a pc and it is handled ok. then I send it
> back to the server from the pc. I have to keep the same encoding so the
> database on the server recognizes it and I mangaged to do that,
> keeping the same cp1252 encoding for this character. When sent back to unix
> server it is run in a command - like:
> save_selections {CENTRUM.CZ - Žena Deti (Monthly Net)~n11130803}
> The character still has cp1252 in the file that arrives on the client,
> but, for some reason once inside the tcl command 'save_selections',
> it gets converted to utf-8
> and it sees the characters that you mentinoed: A w/ half circle and
> 1/2.

Hm. I fear I got a bit lost.

> I'd like to pass the cp1252 encoding of the string to the tcl command.
> I hope that makes sense. It doesn't make a difference whether that tcl
> command is sourced in a file or read in with encoding cp1252 and then eval'd
> (as I said in original post) .
> Does that make sense?

I think it would help if you made an account of the different places
this string passes through and the encodings you want it to be at each
of these places (note that whereas the "official internal encoding" of
Tcl is UTF-8, nobody forces you to keep your strings UTF-8 encoded: Tcl
is happy to keep any binary hunk of data in a string. But then string
operations might not work as you expect).

Regards
-- tomás

tomás zerolo

Dec 26, 2013, 4:31:54 PM
Robert Karen <robert....@gmail.com> writes:

> In last post, I wasn't clear about how I was running the command on the server.
> What I should have said is that I have tried sourceing the command from a file,
> which I know uses local os' encoding (iso-8859-1) . so I tried reading the file
> and with either of these encodings:
> #fconfigure $in -encoding cp1252
> fconfigure $in -encoding binary
> set useEvalNotSource 1
> set runningStyle eval
> set entireFile [read $in]
> logputs "getReturnMessage: using encoding of [fconfigure $in -encoding] for localfile's contents. "
> ... eval $entireFile
>
> but in the log files I don't see it right for either of these. the first
> diagnostic from within the command has a differnt character than the 8E (cp1252
> encoding):
> with binary encoding :
> 12:34:51:<tclserver> getReturnMessage: using encoding of binary for localfile's contents.
> 12:34:51:<tclserver> getReturnMessage: running command ADD_MULTI_DATA_FILE 2 5 {^Tn11130803~CENTRUM.CZ - \216ena Det...thly Net)}
> (multiline input).
> 12:34:51:t 2 n 5 C n11130803 T CENTRUM.CZ - \302\216ena Deti (Monthly Net)

This \302\216 looks quite strange. Trying to interpret it as a UTF-8
sequence leads to:

\302\216 -> 11000010 10001110 -> payload 00010 001110 -> 0x8e (octal 216)

It is as if something were trying to represent code point 0x8e as UTF-8
(which technically is possible, but not really a useful Unicode
character).

I think that happens when reading a string as binary and outputting it as
UTF-8. I'd guess the output channel for the log file is trying to do UTF-8.
If you logged in binary (or iso-8859-1, or cp1252, or any other unibyte
encoding) on output, you'd see just \216 and not \302\216
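That deconstruction can be double-checked in Python (again just to illustrate the byte-level effect, not Tcl's code path): treating the raw byte \216 as code point U+008E and then writing it out as UTF-8 yields exactly the \302\216 pair seen in the log.

```python
raw = 0x8E                       # the cp1252 byte, misread as a code point
utf8 = chr(raw).encode("utf-8")  # U+008E encoded as UTF-8
print(utf8)                      # b'\xc2\x8e', i.e. octal \302\216
```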

> with cp1252 encoding:
> 12:21:15:<tclserver> getReturnMessage: using encoding of cp1252 for localfile's contents.
> 12:21:15:<tclserver> getReturnMessage: running command ADD_MULTI_DATA_FILE 2 4 {^Tn11130803~CENTRUM.CZ - ?ena Det...thly Net)}
> (multiline input).
> 12:21:15:t 2 n 4 C n11130803 T CENTRUM.CZ - \305\275ena Deti (Monthly Net)

One question: which encoding is the log file channel set to? (Does the
logputs function call puts somewhere, or is it implemented in C?)

What would you expect to be in the log file: cp1252 or UTF-8?
In the latter case this one looks good.

BTW, Ž is (strictly) not representable in iso-8859-1; cp1252 is just an
extension of iso-8859-1, using some of the "control space" for extra
glyphs, like the Ž. But most applications just don't care.
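A quick Python check of that difference (illustrative; byte 0x8E is a C1 control code in iso-8859-1 but the Ž glyph in cp1252):

```python
b = b"\x8e"
print(repr(b.decode("cp1252")))   # 'Ž'    -- a real glyph in cp1252
print(repr(b.decode("latin-1")))  # '\x8e' -- an unprintable C1 control character
```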

Sorry all of this is a bit handwavy, but it's important to look at each
single step to get an idea on what's going on. Watch out especially for
transcoding at debug and log outputs, since those could send you off the
false path.

Besides, you can keep strings "in" Tcl in an encoding different to
UTF-8, but you should know what you are doing then.

Don't hesitate to ask again. Perhaps we can shed some light in that.

Regards
-- tomás

tomás zerolo

Dec 27, 2013, 4:25:15 AM
DrS <drsc...@gmail.com> writes:

> On 12/26/2013 12:47 PM, Robert Karen wrote:
>> In last post, I wasn't clear about how I was running the command on the server.
>> What I should have said is that I have tried sourceing the command from a file,
>> which I know uses local os' encoding (iso-8859-1) . so I tried reading the file
>
> You mentioned that someone must have edited the file. In that case,
> have you tried using the local os' encoding: iso-8859-1?

The encoding is (judging by the evidence we see) cp1252 [1], which is a
bastardized^H^H^H superset of iso8859-1. Ahhh Microsoft, always
embrace-n-extend.

Anyway, it doesn't make a big difference. Unibyte, that's it.

> You may also try the unicode encoding. That did the trick for me a
> while back.

Be careful -- there be dragons. Officially, there is no "unicode"
encoding. Unicode specifies a mapping of numbers to glyphs (at the
moment there are over one million of "code points" available, of which
about one-hundred-thousand are mapped to glyphs). This is usually called
the "character set". In the Unicode case, this mapping is simply "UCS",
for Universal Character Set. It seems we are slowly settling on this one
as The Only One (note that one can see ASCII and iso-8859-1 as "partial
mappings" of UCS, but *not* cp1252).

Then there is the mapping of those integers to byte sequences: one of
those mappings is UTF-8, then there's UTF-16 (an extension of the now
dead UCS-2). This is what these days is called "character encoding",
often shortened to "encoding". Some encodings (like UTF-8) strive to be
a bit self-healing: if you start in the middle of a multibyte
sequence you'll get a spurious character, but the algorithm will
re-synchronize eventually.

When talking about "Unicode encoding", the meaning is often this
"two-byte-a-char" encoding of the good ol' Apple/Microsoft/Sun times.

This one is spelled (more-or-less) UTF-16 these days, AFAIK.

People thought 16 bit were enough to represent all glyphs of
humankind. A fatal repetition of "64K should be sufficient for yadda
yadda".

Java still carries the scars of that time, with surrogate chars and all
that goodness.
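The surrogate machinery is visible in any language that exposes UTF-16; a Python sketch (illustrative): Ž fits in one 16-bit code unit, while a character beyond U+FFFF needs a surrogate pair.

```python
# Ž is one 16-bit unit; an emoji needs a d83d/de00 surrogate pair
print("\u017d".encode("utf-16-le"))      # b'}\x01'  -- two bytes
print("\U0001F600".encode("utf-16-le"))  # four bytes: two surrogate units
```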

Enjoy reading <https://en.wikipedia.org/wiki/Unicode>

Regards
-- tomás

DrS

Dec 27, 2013, 10:27:09 AM
On 12/27/2013 4:25 AM, tomás zerolo wrote:
>
> Be careful -- there be dragons. Officially, there is no "unicode"
> encoding. Unicode specifies a mapping of numbers to glyphs (at the
> moment there are over one million of "code points" available, of which
> about one-hundred-thousand are mapped to glyphs). This is usually called


Himmm...


% lsearch [encoding names] unicode
79



DrS





> Regards
> -- tomás

tomás zerolo

Dec 27, 2013, 11:53:50 AM
DrS <drsc...@gmail.com> writes:

> On 12/27/2013 4:25 AM, tomás zerolo wrote:
>>
>> Be careful -- there be dragons. Officially, there is no "unicode"
>> encoding.

Correction: there's no *single* Unicode encoding. Quoth the Wikipedia
[1]:

"Unicode can be implemented by different character encodings.
The most commonly used encodings are UTF-8, UTF-16 and the
now-obsolete UCS-2."

> Himmm...
>
>
> % lsearch [encoding names] unicode
> 79

This is unfortunate. My best guess (from the few experiments I've done)
is that it is being used as an alias for "UTF-8 encoded Unicode", also
known as UTF-8.

But thanks for the puzzle. I'll keep digging :-)

Regards
-- tomás

Robert Karen

Dec 27, 2013, 12:10:56 PM
You really know your encodings :)
In those log lines below, the one which prints it out 'correctly' with \216
is in a script that calls our Tcl command to log it.
The last one, with the \216 broken up into two separate codes, comes from
within C and has no Tcl in it. Both do the writing to the log file with no
Tcl; that's what's puzzling. It seems to me that the only thing between the
log line which has it correct (with the \216) and the log line which has it
as \302\216 are the lines of the Tcl command itself:
int Add_multi_data_file ( ClientData clientData, Tcl_Interp *interp,
                          int objc, Tcl_Obj *CONST objv[] )
{
    char *code_ptr;
    ...
    code_ptr = Tcl_GetStringFromObj(objv[3], NULL);

It sounds like you think Tcl_GetStringFromObj() would not do any encoding
changes and would just take the \216 in the string as-is. Are you sure?
Thanks again for your trouble.


Robert Karen

Dec 27, 2013, 12:16:44 PM
On Thursday, 26 December 2013 15:48:22 UTC-5, DrS wrote:
> On 12/26/2013 12:47 PM, Robert Karen wrote:
> > In last post, I wasn't clear about how I was running the command on the server.
> > What I should have said is that I have tried sourceing the command from a file,
> > which I know uses local os' encoding (iso-8859-1) . so I tried reading the file
>
> You mentioned that someone must have edited the file. In that case,
> have you tried using the local os' encoding: iso-8859-1?

As Tomás said, iso-8859-1 has no character for the Z with caron.

> You may also try the unicode encoding. That did the trick for me a
> while back.

Thanks, that sounds like an idea. What do I enter in vi (an ascii editor),
the octal codes?

DrS

Dec 27, 2013, 1:18:25 PM
On 12/27/2013 12:16 PM, Robert Karen wrote:
>
> thanks. that sounds like an idea. what do I enter in vi (ascii editor),
> the octal codes?
>

Well, I guess it depends on what kind of keyboard you have. I
thought your problem was in programmatically processing that kind of
input. I am not sure why you need to edit the file in vi manually. You
can look at vi's options to see how to change the default encoding
and how to enter those characters.


DrS

Robert Karen

Dec 27, 2013, 1:38:40 PM
My problem is the programming, but if I can make it easier by editing
the original char to make it unicode, I can do that too. Thanks for the idea.

BTW, do you know why the utf-8 for Z w/ caron (c5 bd) is different
from the Unicode code point (017d) for this char?

tomás zerolo

Dec 27, 2013, 4:31:05 PM
to...@tuxteam.de (tomás zerolo) writes:

> DrS <drsc...@gmail.com> writes:
>
>> On 12/27/2013 4:25 AM, tomás zerolo wrote:
>>>
>>> Be careful -- there be dragons. Officially, there is no "unicode"
>>> encoding.
[...]

>> Himmm...
>>
>>
>> % lsearch [encoding names] unicode
>> 79
>
> This is unfortunate. My best guess (from the few experiments I've done)
> is that it is being used as an alias for "UTF-8 encoded Unicode", also
> known as UTF-8.

HAH. How wrong I was...

tomas@rasputin:~$ tclsh
tclsh8.5 [~]set str "Železný_Brod"
Železný Brod
tclsh8.5 [~]binary scan $str {H*} hex
1
tclsh8.5 [~]set hex
7d656c657a6efd2042726f64
# See below [default encoding] for deconstruction
tclsh8.5 [~]set str1 [encoding convertto unicode $str]
}elezný_Brod
tclsh8.5 [~]binary scan $str1 {H*} hex1
1
tclsh8.5 [~]set hex1
7d0165006c0065007a006e00fd002000420072006f006400
# see below [unicode] for deconstruction

default encoding:
7d 65 6c 65 7a 6e fd 20 42 72 6f 64
Ž e l e z n ý B r o d

First surprise for me at least: this is not UTF-8. The code points stand
all by themselves. An artifact of [binary scan]? Besides, the
higher-order part of Ž has been cut off (remember: it was hex 17d).

unicode encoding:

This is obviously a 16 bit encoding:

7d01 6500 6c00 6500 7a00 6e00 fd00 2000 4200 7200 6f00 6400

Nothing really surprising -- since I'm on a little endian machine, the
bytes are always in lo-hi order. This should be either the obsolete
UCS-2 or the more modern UTF-16 encoding (I guess the latter).

OK, next try:

tclsh8.5 [~]set str2 [encoding convertto utf-8 $str]
Železný Brod
# Ah, this looks familiar :-)
tclsh8.5 [~]binary scan $str2 {H*} hex2
1
tclsh8.5 [~]set hex2
c5bd656c657a6ec3bd5f42726f64

Now this looks like utf8
c5bd 65 6c 65 7a 6e c3bd 5f 42 72 6f 64

Conclusion? Well, however Tcl represents its strings internally, to the
user they are presented as a sequence of Unicode code points, as if each
character were an int (as big as necessary).

Heh, tomas: first look, then post :-)

Merry Xmas to all
-- tomás

tomás zerolo

Dec 27, 2013, 4:51:14 PM
Robert Karen <robert....@gmail.com> writes:

> On Friday, 27 December 2013 13:18:25 UTC-5, DrS wrote:
>> On 12/27/2013 12:16 PM, Robert Karen wrote:

[...]

I'll try to answer your other questions tomorrow, but this one is easier
for me:

> BTW, do you know why the utf-8 for Z w/ caron (c5 bd) is different
> from the Unicode code point (017d) for this char?

Hex 17d is the Unicode code point of Ž. Since it is comfortably
representable as a 16 bit binary number (and doesn't fall in the range
d800-dfff, which are special), it can be represented as 017d right away.
(cf. <http://en.wikipedia.org/wiki/Utf-16>).

Now to UTF-8. The nicest overview I know is the Linux manpage. Here's an
excerpt:

NAME
UTF-8 - an ASCII compatible multibyte Unicode encoding
[... very readable description snipped ...]
Encoding
The following byte sequences are used to represent a character.
The sequence to be used depends on the UCS code number of the
character:

0x00000000 - 0x0000007F:
0xxxxxxx

0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
[... continued ...]

For Ž, we are in the second case: 17d in binary looks like so:

1 0111 1101

Since that's more than 7f, we need two bytes: the first has the prefix
110 and carries the 5 higher-order bits of our code, the second has
prefix 10 and carries the six lower-order bits (we must left-extend our
9-bit code to be 11 bits wide):

1100 0101 1011 1101

That makes c5 bd in hex.
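That hand computation can be mechanized; the following Python sketch applies exactly the 110xxxxx/10xxxxxx rule from the manpage excerpt and checks it against a real UTF-8 encoder:

```python
cp = 0x17D
b1 = 0b11000000 | (cp >> 6)    # 110xxxxx: the 5 high-order payload bits
b2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx: the 6 low-order payload bits
assert bytes([b1, b2]) == "\u017d".encode("utf-8")
print(hex(b1), hex(b2))        # 0xc5 0xbd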

Regards
-- tomás

Schelte Bron

Dec 28, 2013, 9:53:57 AM
tomás zerolo wrote:
> default encoding:
> 7d 65 6c 65 7a 6e fd 20 42 72 6f 64
> Ž e l e z n ý B r o d
>
This is not default encoding, it's binary encoding.

> First surprise for me at least: this is not UTF-8. The code points
> stand all by themselves. An artifact of [binary scan]? Besides,
> the higher-order part of Ž has been cut off (remember: it was hex
> 17d)
>
It's not really an artifact of [binary scan]. It's just that binary
scan expects its first argument to be binary data (a ByteArray) and
you feed it a String. As a result the String is shimmered to a
ByteArray, which is done by simply taking the low byte of each
character. Same as when you would write the string to a file
configured with -encoding binary.

The proper way to turn a String into a ByteArray is to use [encoding
convertto]. As you noticed, that gives the expected results.
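The low-byte truncation described above is easy to model (a Python sketch of the effect, not of Tcl's actual ByteArray code):

```python
s = "\u017delezn\u00fd Brod"           # "Železný Brod"
low = bytes(ord(c) & 0xFF for c in s)  # keep only the low byte of each char
print(low)                             # b'}elezn\xfd Brod': 0x17D -> 0x7D '}'
```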


Schelte.

tomás zerolo

Dec 29, 2013, 11:56:49 AM
Schelte Bron <nos...@wanadoo.nl> writes:

> tomás zerolo wrote:
>> default encoding:
>> 7d 65 6c 65 7a 6e fd 20 42 72 6f 64
>> Ž e l e z n ý B r o d
>>
> This is not default encoding, it's binary encoding.
>
>> First surprise for me at least: this is not UTF-8. The code points
>> stand all by themselves. An artifact of [binary scan]? Besides,
>> the higher-order part of Ž has been cut off (remember: it was hex
>> 17d)

OK. Quite a bit of experimenting and of code reading later, it seems I'm
a tad smarter now.

> It's not really an artifact of [binary scan]. It's just that binary
> scan expects its first argument to be binary data (a ByteArray) and
> you feed it a String.

Spot on. Tcl has different types for string and byte array.

> As a result the String is shimmered to a
> ByteArray, which is done by simply taking the low byte of each
> character.

(but first the internal representation, which is UTF-8, is converted to
code points, or to UCS-2, since Tcl by default doesn't handle characters
with code points >= 2^16).

TCL_UTF_MAX is defined to be 3 in tcl.h; there's a comment there that
core Tcl would work with 6 too, which is the official maximum length of
a UTF-8 sequence, but that some extensions might break on that.

The upshot of all this is that Tcl can only represent 16-bit-wide
characters (three UTF-8 bytes, minus 4+2+2 bits of prefix, makes 16
bits). This makes clear what is meant by "Unicode encoding"
throughout: it's UCS-2 (a fixed 16-bit encoding), which at 16 bits is
indistinguishable from UTF-16.

Put in other terms: whenever Tcl wants to convert a string to a
ByteArray, it just uses the "unicode encoding", aka UCS-2, and slashes
off the higher-order byte.

> Same as when you would write the string to a file
> configured with -encoding binary.

Yes: here I can see that happening: it outputs the "code point number"
modulo 256, i.e. the lowest byte of the Unicode code point. In our case,
and as long as TCL_UTF_MAX == 3, that's the lower half.

> The proper way to turn a String into a ByteArray is to use [encoding
> convertto]. As you noticed, that gives the expected results.

Hmm. I'd say there's no "proper way". That depends on what one expects.

Thanks for your insights and for prodding me to look into this.

Regards
-- tomás

tomás zerolo

Dec 29, 2013, 1:13:49 PM
Robert Karen <robert....@gmail.com> writes:

[...]

> Add_multi_data_file ( ClientData clientData, Tcl_Interp *interp,
> int objc, Tcl_Obj *CONST objv[] )
> {
> char *code_ptr;
> ...
> code_ptr = Tcl_GetStringFromObj(objv[3], NULL);
>
> It sounds like you think Tcl_GetStringFromObj() would not do any encoding
> changes and just take \216 in the string as-is. You sure? Thanks again for
> your trouble.

Tcl_GetStringFromObj will return the "string representation" of the
object. And this is, as far as Tcl can help it, an UTF-8 string.

I just cobbled together a little Tcl extension which just calls the
above function and constructs a string with the hexadecimal
representation of what it sees. The core function is like so:

static int Dumpstring_Cmd(
ClientData __attribute__((__unused__)) cdata,
Tcl_Interp *interp,
int objc,
Tcl_Obj *const objv[])
{
const char *src;
int srclen, i;
char buf[4];
Tcl_Obj *res;

if(objc != 2) { /* we expect exactly one arg */
Tcl_WrongNumArgs(interp, 1, objv, "string");
return TCL_ERROR;
}
src = Tcl_GetStringFromObj(objv[1], &srclen);

res = Tcl_NewStringObj("", 0);
for(i=0; i<srclen; i++) {
sprintf(buf, "%02hhx ", src[i]);
Tcl_AppendToObj(res, buf, -1);
}
Tcl_SetObjResult(interp, res);

return TCL_OK;
}

Otherwise, I followed quite slavishly <http://www2.tcl.tk/11153>

This is an example session. First, the files:

tomas@rasputin:~$ ls -l /tmp/zelezny*
-rw-r--r-- 1 tomas tomas 7 Dec 29 18:07 /tmp/zelezny-cp1252
-rw-r--r-- 1 tomas tomas 7 Dec 29 18:06 /tmp/zelezny-latin2
-rw-r--r-- 1 tomas tomas 16 Dec 29 18:07 /tmp/zelezny-utf16
-rw-r--r-- 1 tomas tomas 9 Dec 29 18:08 /tmp/zelezny-utf8

Their hexdumps:

tomas@rasputin:~$ for f in /tmp/zelezny* ; do echo $f ; hexdump -C $f ; done
/tmp/zelezny-cp1252
00000000 8e 65 6c 65 7a 6e fd |.elezn.|
/tmp/zelezny-latin2
00000000 ae 65 6c 65 7a 6e fd |.elezn.|
/tmp/zelezny-utf16
00000000 fe ff 01 7d 00 65 00 6c 00 65 00 7a 00 6e 00 fd |...}.e.l.e.z.n..|
/tmp/zelezny-utf8
00000000 c5 bd 65 6c 65 7a 6e c3 bd |..elezn..|

(Hint: a really good text editor, in my case a modern Emacs, is
an invaluable help in doing this without going nuts.)

Btw, an attempt at saving this in iso-8859-1 (aka latin-1) grants me a
nag page from my editor telling me that the chosen encoding system can't
do that. Thanks, Emacs :-)

So let's play with the freshly made extension:

tomas@rasputin:~/tcltk/dumpstring$ tclsh
tclsh8.5 [~/tcltk/dumpstring]load ./dumpstring[info sharedlibextension]
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-cp1252]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding cp1252
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
Železný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
c5 bd 65 6c 65 7a 6e c3 bd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Everything as expected. Tcl_GetStringFromObj is giving us a clean,
# UTF-8 encoded string. Ž is seen as c5bd, and ý as c3bd. All is well.


tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-latin2]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding iso8859-2
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
Železný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
c5 bd 65 6c 65 7a 6e c3 bd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Exactly as above.

tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding unicode
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
�紁攀氀攀稀渀ﴀ
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
ef bf be e7 b4 81 e6 94 80 e6 b0 80 e6 94 80 e7 a8 80 e6 b8 80 ef b4 80
tclsh8.5 [~/tcltk/dumpstring]close $f
tclsh8.5 [~/tcltk/dumpstring]
# Now this is curious. It seems Tcl is trying to transform the contents
# of the file into UTF-8, but in a strange way. Let's decode the UTF-8 back:
# ef bf be -> ff fe: that's the leading "byte order mark".
# As for the rest, at the moment I'm at a loss. But it looks too much
# like a UTF-8 sequence to be just chance.

tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding identity
tclsh8.5 [~/tcltk/dumpstring]
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
þÿ}elezný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
fe ff 01 7d 00 65 00 6c 00 65 00 7a 00 6e 00 fd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Now this looks better. It's a plain, straight UCS-2 sequence.
# Surprising for me, at least, is that Tcl just chose this one instead of
# converting internally to UTF-8.

Back to your question: Tcl_GetStringFromObj will give you UTF-8 unless
you're getting 16-bit encodings (which is not your case).

Regards
-- tomás

tomás zerolo

Dec 29, 2013, 3:35:30 PM
to...@tuxteam.de (tomás zerolo) writes:

[...]
> tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16]
> file6
> tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding unicode
> tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
> �紁攀氀攀稀渀ﴀ
> tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
> ef bf be e7 b4 81 e6 94 80 e6 b0 80 e6 94 80 e7 a8 80 e6 b8 80 ef b4 80
> tclsh8.5 [~/tcltk/dumpstring]close $f
> tclsh8.5 [~/tcltk/dumpstring]
> # Now this is curious. It seems Tcl is trying to transform the contents
> # of the file into UTF-8, but in a strange way. Let's UTF-8 back it:
> # ef bf be -> ff fe: that's the leading "byte order mark".
> # As for the rest, at the moment I'm at a loss But it looks too much
> # like an UTF-8 sequence as to be just chance.

OK -- I've got an explanation for this one. The UTF-16 file above was
big endian (note the "fe ff" in the hexdumps in the above post). Tcl is
treating it as little endian. Here's a comparison between a big-endian
and a little-endian version of the file:

tclsh8.5 [~/tcltk/dumpstring]load ./dumpstring[info sharedlibextension]
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16-le]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding unicode
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
Železný
tclsh8.5 [~/tcltk/dumpstring]close $f
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
ef bb bf c5 bd 65 6c 65 7a 6e c3 bd
# plain, clean UTF-8 (with leading BOM, alas).

tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16-be]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding unicode
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
�紁攀氀攀稀渀ﴀ
tclsh8.5 [~/tcltk/dumpstring]close $f
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
ef bf be e7 b4 81 e6 94 80 e6 b0 80 e6 94 80 e7 a8 80 e6 b8 80 ef b4 80
# the hodgepodge of the post above

Corollary: Tcl's "unicode" encoding just "assumes" little endian (or
perhaps platform-endian?), ignoring the byte order mark.

Corollary 2: If you manage to get the [fconfigure] and/or the
[convertXX] right, the Tcl strings tend to be in UTF-8. At least as long
as they're in the three-byte range.
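The endianness effect can be reproduced byte for byte; this Python sketch decodes the same big-endian UTF-16 prefix both ways (Python's plain 'utf-16' codec honors the BOM, unlike Tcl's 'unicode' here):

```python
be = b"\xfe\xff\x01\x7d"           # BOM + U+017D, big endian
wrong = be.decode("utf-16-le")     # misread as LE: U+FFFE then U+7D01
right = be.decode("utf-16")        # BOM honored and consumed, leaves 'Ž'
print(repr(wrong), repr(right))
```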

Regards
-- tomás

Robert Karen

Jan 3, 2014, 12:18:28 PM
Thanks for your help!! It turns out that the cp1252 character for Z/caron
in the unix file that started this whole thing cannot be replaced
with utf-8, so I'll read it with encoding cp1252 and, when I upload the
string from the PC back to the unix server, again force a cp1252 encoding.

I was wondering: if I use the same code to upload a binary file, do I risk
corrupting it when I use encoding cp1252 on a binary file? e.g.
fconfigure $binaryOrTextFilePtr -encoding cp1252
fcopy $binaryOrTextFilePtr $socket
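There is a real risk here: cp1252 leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned, so a binary file containing them has no faithful character interpretation. A strict decoder rejects them outright, as this Python sketch shows (Tcl's channel code may substitute rather than error, but either way such bytes are not guaranteed to round-trip; configuring the channels with -translation binary is the safe choice for fcopy of binary data):

```python
# 0x81 is one of the five byte values cp1252 leaves undefined
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError as e:
    print("undefined in cp1252:", e.reason)
```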

Thanks again.

RK

Donald Arseneau

Jan 9, 2014, 2:15:10 AM
to...@tuxteam.de (tomás zerolo) writes:

> bastardized^H^H^H superset of iso8859-1. Ahhh Microsoft, always

"bastardi" hmmm... Is that Italian?

Maybe you meant "bastardized^H^H^H^H"

:-)

--
Donald Arseneau as...@triumf.ca

Uwe Klein

Jan 9, 2014, 3:04:30 AM
Donald Arseneau wrote:
> to...@tuxteam.de (tomás zerolo) writes:
>
>
>>bastardized^H^H^H superset of iso8859-1. Ahhh Microsoft, always
>
>
> "bastardi" hmmm... Is that Italian?
>
> Maybe you meant "bastardized^H^H^H^H"
>
> :-)
>
percussus interruptus

uwe