drhs...@gmail.com <drhs...@gmail.com> wrote:
> My test script (named x1.txt) is this:
> puts [encoding system]
> puts [binary encode hex {русский}]
> puts [binary encode hex [encoding convertto utf-8 {русский}]]
> puts "$tcl_version $tcl_patchLevel"
Tcl has two different internal representations relevant to this topic:
a bytearray and a unicode-string. Tcl converts a value to whatever
representation a command wants, whenever such a conversion is possible.
The string русский is internally a unicode-string (kept either as UTF-8
or as something like UCS-2 in the native byte order).
The "binary" command wants a bytearray, so the string gets converted
to one, which means stripping all but the 8 least significant bits
from each character's code point.
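As a rough sketch of that truncation (assuming the Tcl 8.6 behaviour described here; the variable names are made up for illustration), the same bytes can be computed by hand by masking each code point with 0xff:

```tcl
# The word "русский", written with \u escapes so the result does not
# depend on the encoding of the script file itself.
set s "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"

set hex ""
foreach ch [split $s ""] {
    scan $ch %c cp                          ;# code point of this character
    append hex [format %02x [expr {$cp & 0xff}]]  ;# keep only the low 8 bits
}
puts $hex   ;# 404341413a3839
```

This matches the second line of your output: U+0440 becomes 40, U+0443 becomes 43, and so on.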
To pick a specific conversion, there is the "encoding" command, whose
effect you already saw.
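A small sketch of the explicit route, for comparison (variable names are only illustrative): converting to UTF-8 first gives two bytes per Cyrillic character, and "encoding convertfrom" turns those bytes back into the original string.

```tcl
set s "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"   ;# "русский"

# Explicit conversion: characters -> UTF-8 bytes.
set bytes [encoding convertto utf-8 $s]

puts [string length $s]          ;# 7 characters
puts [string length $bytes]      ;# 14 bytes
puts [binary encode hex $bytes]  ;# d180d183d181d181d0bad0b8d0b9

# And back again: UTF-8 bytes -> characters.
set back [encoding convertfrom utf-8 $bytes]
puts [string equal $s $back]     ;# 1
```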
Just for further illustration:
puts [binary encode hex [encoding convertto unicode {русский}]]
# "unicode" here is really UCS-2 in native byte order.
On an Intel machine (little-endian): 40044304410441043a0438043904
On a SPARC machine (big-endian):     0440044304410441043a04380439
> The output I get from running "tclsh x1.txt" is this:
> utf-8
> 404341413a3839
> d180d183d181d181d0bad0b8d0b9
> 8.6 8.6.5
>
> [...] Is there anything I can do to get TCL to use UTF-8 internally
> for everything, so that I don't have to fear missing a required
> [encoding convertto] call?
I'm a frayed knot :-(