keithv <kve...@gmail.com> wrote:
> I just ran into a situation where writing to a file in binary mode
> changes the data. Below is a short snippet showing the problem:
Harold answered your question; I'll add a bit more info.
> set data \u2019
Variable data contains a single 'character', but more than a single
'byte' (because you used a unicode code point that has a value over
255). Note, in the Unicode world, character != byte.
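To make the character/byte distinction concrete, here is a small sketch
(not from your script) that dumps the UTF-8 bytes of that one
character. It assumes Tcl 8.5 or later; the variable names 'bytes' and
'hex' are mine.

```tcl
# One character: U+2019 (right single quotation mark).
set data \u2019

# Convert the character string to its UTF-8 byte sequence.
set bytes [encoding convertto utf-8 $data]

# Dump those bytes as lower-case hex digits.
binary scan $bytes H* hex

puts "characters:  [string length $data]"
puts "utf-8 bytes: [string length $bytes] ($hex)"
```

This should report one character and three bytes (e2 80 99).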
> puts "data: '$data' length: [string length $data] bytelength: [string bytelength $data]"
> set fout [open foo wb]
> fconfigure $fout -translation binary
> puts -nonewline $fout $data
You are asking to write a string containing a single character (one
that takes more than one byte to represent) to a channel that *only*
writes single bytes. For this to work, Tcl has to 'convert' each
character into a single byte. When you do not ask for a specific
conversion, it simply truncates to the low 8 bits of the character's
value, and that is what is output (this is documented in one of the
man pages).
This is where you messed up. This line (the puts) is where your data
is being corrupted.
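As a rough illustration of that truncation, the sketch below only does
the arithmetic (it does not write to a channel). The variable names
'code' and 'low' are mine, not anything from your script.

```tcl
set data \u2019

# Get the numeric code point of the single character in $data.
scan $data %c code

# Keep only the low 8 bits, which is what the truncation leaves behind.
set low [expr {$code & 0xFF}]

puts [format "code point: U+%04X" $code]
puts [format "low byte:   0x%02X" $low]
```

So instead of the three UTF-8 bytes you expected, a single byte 0x19
is what lands in the file.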
> What am I doing wrong?
Not informing Tcl of exactly how you want it to convert character
U+2019 into a stream of bytes.
> Is it that data is a "string" of length 1, only 1 byte gets written out?
No, it is because Tcl's default conversion simply "truncates" to a
single byte.
You have two fixes:
1) Harold's fix. Inform Tcl that you want the channel to write UTF-8
encoded data by performing "fconfigure $fout -encoding utf-8" after you
open the channel.
2) Tell Tcl how you want it to convert into binary directly, by using
the 'encoding' command:
puts -nonewline $fout [encoding convertto utf-8 $data]
'encoding convertto' returns a binary string, so setting the channel to
binary is proper if you use [encoding] to convert to binary before you
'puts' the data.
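Here is a minimal sketch of both fixes side by side, assuming Tcl 8.5
or later (for the 'wb' open mode). The proc names and the file names
fix1.out and fix2.out are made up for the example, and the
'-translation lf' in the first proc is my addition so that only the
encoding differs between the two.

```tcl
# Fix 1: leave the data as characters and let the channel encode it.
proc writeViaChannelEncoding {path data} {
    set fout [open $path w]
    fconfigure $fout -encoding utf-8 -translation lf
    puts -nonewline $fout $data
    close $fout
}

# Fix 2: encode by hand, then write the resulting bytes to a
# binary channel.
proc writeViaConvertto {path data} {
    set fout [open $path wb]
    puts -nonewline $fout [encoding convertto utf-8 $data]
    close $fout
}

set data \u2019
writeViaChannelEncoding fix1.out $data
writeViaConvertto fix2.out $data

puts "fix1.out: [file size fix1.out] bytes"
puts "fix2.out: [file size fix2.out] bytes"
```

Both files should come out at three bytes. Use one fix or the other,
not both on the same channel, or the data gets encoded twice.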
Also, please read the 'string' man page portion on bytelength
carefully:
OBSOLETE SUBCOMMANDS
These subcommands are currently supported, but are likely to go
away in a future release as their functionality is either
virtually never used or highly misleading.
string bytelength string
Returns a decimal string giving the number of bytes used
to represent string in memory. Because UTF-8 uses one
to three bytes to represent Unicode characters, the byte
length will not be the same as the character length in
general. The cases where a script cares about the byte
length are rare.
In almost all cases, you should use the string length
operation (including determining the length of a Tcl byte
array value). Refer to the Tcl_NumUtfChars manual entry
for more details on the UTF-8 representation.
Compatibility note: it is likely that this subcommand
will be withdrawn in a future version of Tcl. It is
better to use the encoding convertto command to convert a
string to a known encoding and then apply string length
to that.
'bytelength' is listed as 'obsolete' for a reason. You should ignore
its presence for any new code you create.
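If you do need a byte count, the replacement the man page suggests can
be wrapped up roughly like this. 'utf8ByteCount' is a hypothetical
helper name of mine, and it reports the size in UTF-8 specifically;
pass a different encoding name to 'encoding convertto' if that is not
what will end up on disk.

```tcl
# Number of bytes $str occupies once encoded as UTF-8.
proc utf8ByteCount {str} {
    return [string length [encoding convertto utf-8 $str]]
}

puts [utf8ByteCount \u2019]
puts [utf8ByteCount "abc"]
```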