Help me understand "encoding"...

drhs...@gmail.com

Dec 17, 2017, 4:49:27 PM

My test script (named x1.txt) is this:

puts [encoding system]
puts [binary encode hex {русский}]
puts [binary encode hex [encoding convertto utf-8 {русский}]]
puts "$tcl_version $tcl_patchLevel"

The output I get from running "tclsh x1.txt" is this:

utf-8
404341413a3839
d180d183d181d181d0bad0b8d0b9
8.6 8.6.5

System is Ubuntu. I was hoping that the second line of output would be the same as the third line. Why do I need the extra [encoding convertto] step to actually see UTF-8 results? What encoding is TCL using internally? Is there anything I can do to get TCL to use UTF-8 internally for everything, so that I don't have to fear missing a required [encoding convertto] call?

Just to be clear, the input script is UTF-8, as evidenced by the following hex dump:

00000000: 70 75 74 73 20 5b 65 6e 63 6f 64 69 6e 67 20 73 puts [encoding s
00000010: 79 73 74 65 6d 5d 0a 70 75 74 73 20 5b 62 69 6e ystem].puts [bin
00000020: 61 72 79 20 65 6e 63 6f 64 65 20 68 65 78 20 7b ary encode hex {
00000030: d1 80 d1 83 d1 81 d1 81 d0 ba d0 b8 d0 b9 7d 5d ..............}]
00000040: 0a 70 75 74 73 20 5b 62 69 6e 61 72 79 20 65 6e .puts [binary en
00000050: 63 6f 64 65 20 68 65 78 20 5b 65 6e 63 6f 64 69 code hex [encodi
00000060: 6e 67 20 63 6f 6e 76 65 72 74 74 6f 20 75 74 66 ng convertto utf
00000070: 2d 38 20 7b d1 80 d1 83 d1 81 d1 81 d0 ba d0 b8 -8 {............
00000080: d0 b9 7d 5d 5d 0a 70 75 74 73 20 22 24 74 63 6c ..}]].puts "$tcl
00000090: 5f 76 65 72 73 69 6f 6e 20 24 74 63 6c 5f 70 61 _version $tcl_pa
000000a0: 74 63 68 4c 65 76 65 6c 22 0a tchLevel".

Andreas Leitgeb

Dec 17, 2017, 5:23:02 PM

drhs...@gmail.com <drhs...@gmail.com> wrote:
> My test script (named x1.txt) is this:
> puts [encoding system]
> puts [binary encode hex {русский}]
> puts [binary encode hex [encoding convertto utf-8 {русский}]]
> puts "$tcl_version $tcl_patchLevel"

Tcl has two different internal representations relevant to this topic:
a bytearray and a unicode string. Tcl converts a value to whichever
one a command wants, whenever that is possible.

The string русский is internally a unicode string (kept either as utf-8
or as something like ucs-2 in the machine's native byte order).

The "binary" command wants a bytearray, so the string gets converted
to one, which means stripping all but the 8 least significant bits
from each char's codepoint.
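
To make that truncation concrete, here is a minimal sketch that keeps only the low byte of each code point. The helper name lowByteHex is made up for illustration (it is not a Tcl built-in), and it only imitates the Tcl 8.6 behaviour described above rather than calling "binary" itself. It reproduces the second line of the output in the original post.

```tcl
# Hypothetical helper (not from the thread): imitate what Tcl 8.6 does when a
# string is used as a bytearray, by keeping the low 8 bits of each code point.
proc lowByteHex {str} {
    set hex ""
    foreach ch [split $str ""] {
        scan $ch %c cp    ;# cp is the code point of this character
        append hex [format %02x [expr {$cp & 0xFF}]]
    }
    return $hex
}

# "русский" written with \u escapes so the script's file encoding does not matter
puts [lowByteHex "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"]
# prints 404341413a3839
```

For example, р is U+0440, so only 40 survives; у is U+0443, so only 43 survives, and so on.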

To pick a specific conversion, there is the "encoding" command, whose
effect you already saw.

Just for further illustration:
puts [binary encode hex [encoding convertto unicode {русский}]]
# "unicode" here is really ucs-2 in native byteorder.
On an intel machine (LE): 40044304410441043a0438043904
On a sparc machine (BE): 0440044304410441043a04380439

> The output I get from running "tclsh x1.txt" is this:
> utf-8
> 404341413a3839
> d180d183d181d181d0bad0b8d0b9
> 8.6 8.6.5
>

> [...] Is there anything I can do to get TCL to use UTF-8 internally
> for everything, so that I don't have to fear missing a required
> [encoding convertto] call?

I'm a frayed knot :-(

Andreas Leitgeb

Dec 17, 2017, 5:29:34 PM

Andreas Leitgeb <a...@logic.at> wrote:

> drhs...@gmail.com <drhs...@gmail.com> wrote:
>> [...] Is there anything I can do to get TCL to use UTF-8 internally
>> for everything, so that I don't have to fear missing a required
>> [encoding convertto] call?
> I'm a frayed knot :-(

Well, you can of course write a procedure that wraps both the encoding and
the binary conversion, so that you only call that one proc at each spot.
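
A minimal sketch of such a wrapper, assuming you always want the UTF-8 bytes. The proc name utf8hex is just an example name, not something Tcl provides.

```tcl
# Hypothetical wrapper: hex dump of a string's UTF-8 encoded bytes.
proc utf8hex {str} {
    # Convert explicitly to UTF-8 bytes first, then hex-encode those bytes.
    return [binary encode hex [encoding convertto utf-8 $str]]
}

# "русский" written with \u escapes so the script's file encoding does not matter
puts [utf8hex "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"]
# prints d180d183d181d181d0bad0b8d0b9
```

With this in place, every call site uses utf8hex instead of calling "binary encode hex" directly, so there is no separate [encoding convertto] step to forget.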