writing binary data problem

keithv

Dec 21, 2018, 12:16:19 AM

I just ran into a situation where writing to a file in binary mode changes the data. Below is a short snippet showing the problem:

set data \u2019
puts "data: '$data' length: [string length $data] bytelength: [string bytelength $data]"
set fout [open foo wb]
fconfigure $fout -translation binary
puts -nonewline $fout $data
close $fout
puts "file size: [file size foo]"

set fin [open foo rb]
set newData [read $fin]
close $fin
puts "newData: '$newData' length: [string length $newData] bytelength: [string bytelength $newData]"
if {$data ne $newData} { puts "mismatch" } else { puts "good" }

What am I doing wrong? Is it that data is a "string" of length 1, only 1 byte gets written out?

Keith

Harald Oehlmann

Dec 21, 2018, 2:39:18 AM

Keith,

thank you for the posting.

I suppose the main issue is the following:

- you take a unicode codepoint:
set data \u2019
- now you write this to a binary channel.

I would advise you either to use a channel which supports Unicode
codepoints or to encode your data as bytes yourself:

a) unicode-aware channel
fconfigure $fout -encoding utf-8

b) convert unicode to bytes
set databin [encoding convertto utf-8 $data]

The command "string bytelength" does not do what its name suggests. It
does not give the byte count of your encoded data. To get the byte count:
- first transform to bytes
- then use string length

% string length [encoding convertto utf-8 \u2019]
3

When you read, go the other way around: read the bytes first, then
convert them to characters with "encoding convertfrom".
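
The two directions can be sketched entirely in memory, without touching a file. This is only an illustration of option b) and its inverse; the variable names bytes, hex and back are made up for the example.

```tcl
# Sketch: characters -> bytes -> characters, entirely in memory.
set data \u2019                                ;# one character, U+2019

# Characters to bytes: choose the encoding explicitly.
set bytes [encoding convertto utf-8 $data]
puts "characters: [string length $data]"
puts "bytes:      [string length $bytes]"

# Show the individual byte values as hex digits.
binary scan $bytes H* hex
puts "hex:        $hex"

# Bytes back to characters: the same encoding, the other direction.
set back [encoding convertfrom utf-8 $bytes]
puts "round trip ok: [expr {$back eq $data}]"
```

The byte string is what belongs on a channel opened in binary mode; the character string is what the rest of the script should work with.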

Unfortunately, Tcl is very powerful here, and it is not easy to understand
what really happens...

Hope this helps,
Harald

Rich

Dec 21, 2018, 11:23:32 AM

keithv <kve...@gmail.com> wrote:
> I just ran into a situation where writing to a file in binary mode
> changes the data. Below is a short snippet showing the problem:

Harald answered your question; I'll add a bit more info.

> set data \u2019

Variable data contains a single 'character', but more than a single
'byte' (because you used a unicode code point that has a value over
255). Note, in the Unicode world, character != byte.

> puts "data: '$data' length: [string length $data] bytelength: [string bytelength $data]"
> set fout [open foo wb]
> fconfigure $fout -translation binary
> puts -nonewline $fout $data

You are asking to write a string containing a single character (where
that character is represented by multiple bytes) into a channel that
*only* writes single bytes. In order for this to work, Tcl has to
'convert' the character into a single byte. In this case, when
you do not ask for a specific conversion, it simply truncates to the
low 8 bits of the value and that is what is output (this is documented
in one of the man pages).
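
As a rough illustration of what "truncates to the low 8 bits" means for this particular code point, here is the arithmetic on its own. It is only a sketch of the truncation described above, not a call into Tcl's channel code, and the variable names are made up for the example.

```tcl
# Arithmetic only: what "keep the low 8 bits" means for U+2019.
set codepoint 0x2019
set low [expr {$codepoint & 0xFF}]   ;# mask off everything above bit 7
puts [format "U+%04X -> byte 0x%02X" $codepoint $low]
```

Under that truncation the file ends up holding one byte, and reading it back yields a different character from the one that was written, which is the mismatch the original script reports.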

This is where you messed up. This line (the puts) is where your data
is being corrupted.

> What am I doing wrong?

Not informing Tcl of exactly how you want it to convert character U+2019
into a stream of bytes.

> Is it that data is a "string" of length 1, only 1 byte gets written out?

No, it is because Tcl's default conversion simply "truncates" to a
single byte.

You have two fixes:

1) Harald's fix. Inform Tcl that you want the channel to write UTF-8
encoded data by performing "fconfigure $fout -encoding utf-8" after you
open the channel.

2) Tell Tcl how you want it to convert into binary directly, by using
the 'encoding' command:

puts -nonewline $fout [encoding convertto utf-8 $data]

'encoding' outputs binary strings, so setting the channel to binary is
proper if you use [encoding] to convert to binary before you 'puts' the
data.
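
Putting fix 2 together with the matching conversion on the read side, a full round trip through a file might look like the sketch below. The file name roundtrip-demo.bin is hypothetical, and the script deletes it when done.

```tcl
# Sketch: explicit UTF-8 conversion on both sides of a binary channel.
set data \u2019
set path roundtrip-demo.bin          ;# hypothetical file name

# Write: convert characters to bytes, then write the bytes unchanged.
set fout [open $path wb]
puts -nonewline $fout [encoding convertto utf-8 $data]
close $fout
puts "file size: [file size $path]"

# Read: take the raw bytes, then convert them back to characters.
set fin [open $path rb]
set newData [encoding convertfrom utf-8 [read $fin]]
close $fin

puts [expr {$data eq $newData ? "good" : "mismatch"}]
file delete $path
```

The same encoding has to be named on both sides; the file itself carries no record of which one was used.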

Also, please read the 'string' man page portion on bytelength
carefully:

OBSOLETE SUBCOMMANDS
These subcommands are currently supported, but are likely to go
away in a future release as their functionality is either
virtually never used or highly misleading.

string bytelength string
Returns a decimal string giving the number of bytes used
to represent string in memory. Because UTF-8 uses one
to three bytes to represent Unicode characters, the byte
length will not be the same as the character length in
general. The cases where a script cares about the byte
length are rare.

In almost all cases, you should use the string length
operation (including determining the length of a Tcl byte
array value). Refer to the Tcl_NumUtfChars manual entry
for more details on the UTF-8 representation.

Compatibility note: it is likely that this subcommand
will be withdrawn in a future version of Tcl. It is
better to use the encoding convertto command to convert a
string to a known encoding and then apply string length
to that.

'bytelength' is listed as 'obsolete' for a reason. You should ignore
its presence for any new code you create.