Is there a common operation that doesn't care what the data
represents and simply counts its length? When we want to send data, we
would like to do something like:
set nstring [encoding convertto <actual coding> $string]
set len [<operation looked for> $nstring]
.... send string with length in header
We tried many things, but all we came up with was formatting the string
using [binary format a* ...] and then running string length on the
result. This is conceptually not clean. There should be a function that
REALLY RETURNS THE BUFFER LENGTH of an object's content. Shouldn't there?
Any solution?
> Is there a common operation that doesn't care what the data
> represents and simply counts its length? When we want to send data, we
> would like to do something like:
> set nstring [encoding convertto <actual coding> $string]
> set len [<operation looked for> $nstring]
> .... send string with length in header
>
You can do it in two steps:
(1) turn the string into a byte array (this requires that all characters
are in the U+00xx page, i.e. ISO Latin-1)
This is done with the slightly obscure "encoding convertfrom identity"
(2) string bytelength
Demo:
% string bytelength [encoding convertfrom identity "süßöl"]
5
Best regards, Richard (sitting at your former desk :^)
No. Either you work with binary data (byte arrays), creating them with
binary format or reading them from a channel with -translation binary. In
that case you can use string length, but you have to take care of the
encoding yourself.
If you use text data, you have to specify the encoding; otherwise you
only have a blob of binary data which could mean anything. If you know
the encoding of a string, you know how to calculate its byte length.
Internally it's always 'string bytelength', as all non-binary input is
converted into (a slightly modified) UTF-8 internally if you use the
Tcl encoding system.
So you have to make a choice:
1. use the internal utf-8 encoding and manage your encoding conversion
when passing data to channels or c-code.
2. use byte arrays and binary data and manage encodings yourself
In case 1: use string bytelength on the utf-8 string, or string length
on the result of [encoding convertto].
In case 2: use string length on the bytearray and convert as necessary.
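Both cases can be tried at the prompt. A quick sketch (Tcl 8.x, where
[string bytelength] still exists; "grüße" is just a sample string I'm
using here):

```tcl
# A quick look at both choices (Tcl 8.x):
set s "grüße"

# Case 1: keep the string as text and measure its UTF-8 size.
puts [string bytelength $s]                          ;# 7: ü and ß take 2 bytes each
puts [string length [encoding convertto utf-8 $s]]   ;# 7, computed explicitly

# Case 2: convert to a byte array first, then use string length.
set bytes [encoding convertto iso8859-1 $s]
puts [string length $bytes]                          ;# 5: one byte per Latin-1 char
```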
If you read from a channel configured with -translation binary you
always get a byte array, whose length is given by string length. Only if
you apply [encoding convertfrom utf-8] to the binary data, transforming
it into the internal Tcl encoding, do you get a string.
Michael
Sounds like you've got *very* confused. What you want to get is the
binary representation of the encoded string, which you do on the sending
side like this:
set theTransferEncoding utf-8 ;# For example
set encodedBytes [encoding convertto $theTransferEncoding $string]
set encodedLen [string length $encodedBytes]
Now you have everything you need to send the data as bytes (remember to
set the channel to use the binary -translation option when you do this).
On the receiving side once you've read the required number of bytes you
can then use:
set string [encoding convertfrom $theTransferEncoding $encodedBytes]
to convert back into normal characters. It really is that easy. And you
can change to other encodings just as easily.
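Wrapped into procedures, the whole round trip might look like this
(a sketch only; sendString, receiveString and $chan are names I'm
inventing here, and the 4-byte big-endian length header is just one
possible framing, not a fixed rule):

```tcl
set theTransferEncoding utf-8   ;# For example

# --- sending side ---
proc sendString {chan string} {
    global theTransferEncoding
    set encodedBytes [encoding convertto $theTransferEncoding $string]
    set encodedLen   [string length $encodedBytes]
    fconfigure $chan -translation binary
    puts -nonewline $chan [binary format I $encodedLen]   ;# 4-byte length header
    puts -nonewline $chan $encodedBytes
    flush $chan
}

# --- receiving side ---
proc receiveString {chan} {
    global theTransferEncoding
    fconfigure $chan -translation binary
    binary scan [read $chan 4] I encodedLen               ;# read the header back
    set encodedBytes [read $chan $encodedLen]             ;# exactly that many bytes
    return [encoding convertfrom $theTransferEncoding $encodedBytes]
}
```

The same pair works over a socket or a file; only the channel changes.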
Trying to use [string bytelength] is trickier than it looks because Tcl
actually uses a denormalized UTF-8 internally, and it is almost always a
bug to be using the identity encoding. :-\ The combination can be made
to work - I admit I've done it in my own code in the past - but it is
very confusing. Use what I recommend and you'll be much happier.
Donal.
My understanding is that [encoding convertfrom identity $str] produces
a byte array, where each byte is the corresponding character of $str
masked with 0xFF.
Michael
> That is correct, but you see artifacts of 0x00 being converted into 0xc0
> 0x80 when you do this, which may change your byte count (this is the
> denormalized encoding Donal refers to).
>
Hmm.. looks like "from identity" renormalizes again...
32 % string bytelength [encoding convertfrom identity "a\u0000b"]
3
If the c080 artefact had happened, we'd see a length of 4.
> 32 % string bytelength [encoding convertfrom identity "a\u0000b"]
> 3
> If the c080 artefact had happened, we'd see a length of 4.
To introspect the byte array better, I first had to discover that
scanning a single \x00 character may return "", so my hex-dumper had to
be fixed. Here goes:
proc hexdump str {
    set res {}
    foreach c [split $str ""] {
        set i [scan $c %c]
        if {$i eq ""} {set i 0}
        lappend res [format %02x $i]
    }
    set res
}
127 % hexdump [encoding convertfrom identity a\u0000b]
61 00 62
So the \x00 byte is really normalized in the byte array.
Michael
You can use [string length] for this.
In Tcl, you can treat "binary" strings the same as any other string.
The only distinguishing features of "binary strings" are that they
only contain characters with code points in the range 0x00 - 0xFF
(including ones that aren't official Unicode characters, but Tcl doesn't
care about that), and that they may have an optimized internal
representation.
>If the binary data is actually a utf-8 string, we should use [string
>bytelength],
No, you should not use [string bytelength]. There might be valid
reasons to use [string bytelength] -- I don't know of any myself --
but this definitely isn't one of them.
> which in fact doesn't really return the length of the
>buffer, but the length of a buffer that is needed to store a string in
>utf-8 with the given content. [string length] will not do, obviously.
Quite the opposite, [string length] is exactly what you want.
If $b is a binary string -- that is, something you got from
[binary format] or read from a channel with -encoding binary --
then [string length $b] returns the number of bytes in $b.
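A quick sketch of that (the values here are arbitrary examples):

```tcl
set b [binary format a5 hello]   ;# a genuine byte array: the 5 bytes of "hello"
puts [string length $b]          ;# 5: the number of bytes

set b2 [binary format SS 258 65535]   ;# two big-endian 16-bit values: 01 02 ff ff
puts [string length $b2]              ;# 4
```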
>If the binary data is actually a say iso8859-1 string, we should use
>[string length], since this is the length of the string in chars, which
>in this case is the same as in bytes. [string bytelength] will not do,
>since somehow (and that is the strange behaviour) tcl assumes, that an
>umlaut has to be stored as utf-8 as multi-char, which it shouldn't be
>possible to know, since I didn't tell it, is was iso.
This is exactly why you should not use [string bytelength] :-)
It can only lead to confusion.
--Joe English
>You can do it in two steps:
>(1) turn string to byte-array (this requires all characters are on page
>U+00, ISO Latin 1)
>This is done with the slightly obscure "encoding convertfrom identity"
No, no! [encoding convertfrom identity] doesn't generate a
byte-array object. It generates a "pure string" with a
(probably) invalid string rep.
>(2) string bytelength
>
>Demo:
>% string bytelength [encoding convertfrom identity "süßöl"]
>5
This only *looks* like it works. It yields the expected
result, but only by accident. The intermediate value returned
by [encoding convertfrom identity] is an invalid Tcl_Obj*.
--Joe English
> Richard Suchenwirth wrote:
> >% string bytelength [encoding convertfrom identity "süßöl"]
>
> This only *looks* like it works. It yields the expected
> result, but only by accident. The intermediate value returned
> by [encoding convertfrom identity] is an invalid Tcl_Obj*.
Isn't this the only thing that *does* work with
encoding convertfrom identity? That is, you can't
use the result as a string, but you can measure its
bytelength.
--
Donald Arseneau as...@triumf.ca
There is that, but it's really rather more confusing than that as the
identity encoding can be used to create binary strings in the sense of
Tcl 8.0, and that's where the confusion *really* comes in. That can have
some very odd effects indeed. Use the other encodings instead, define
what encoding you're *really* wanting to use to ship data about, and
stop worrying.
Donal.
You write:
> We read data from a binary stream into a variable. Now: What is the
> length of the buffer actually read?
You have a byte array. At the Tcl level byte arrays are represented
as strings of pseudo-characters with values between 0 and 255, each
pseudo-character represents one byte. The length of that byte array
is calculated with [string length] the same as for strings of real
characters.
> If the binary data is actually a utf-8 string, we should use [string
> bytelength],
[string bytelength] will tell you how many bytes a string of
characters will take when you encode it as UTF-8. It doesn't tell
you anything about the current object as such, and even less so if you
have a byte array, because you never want to convert the bytes that you
already have into UTF-8 *again*.
> [string length] will not do, obviously.
Why not?
> If the binary data is actually a say iso8859-1 string,
Tcl doesn't know which encoding the byte array represents or if there
is text data in the byte array at all. You have to answer that
question first, e.g. by using [encoding convertfrom].
> [string bytelength] will not do, since somehow (and that is the
> strange behaviour) tcl assumes, that an umlaut has to be stored as
> utf-8 as multi-char, which it shouldn't be possible to know, since I
> didn't tell it, it was iso.
As I said above, [string bytelength] assumes it is handed a text
string (*not* a byte array) and that UTF-8 is the target encoding. It
is in effect a shorthand for [string length [encoding convertto
utf-8 $s]]. It is, strictly speaking, an anomaly.
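That equivalence is easy to check at the prompt (a sketch, Tcl 8.x;
note the thread's caveat that \u0000 is the one character where the
internal, denormalized UTF-8 differs from real UTF-8, so the two counts
can diverge for strings containing it):

```tcl
set s "Bäume"   ;# the umlaut takes 2 bytes in UTF-8

puts [string bytelength $s]                          ;# 6
puts [string length [encoding convertto utf-8 $s]]   ;# 6, the same
```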
> Is there a common operation that doesn't care for what the data
> represents and counts its length, since when we want to send data,
> we would like to do something like:
> set nstring [encoding convertto <actual coding> $string]
> set len [<operation looked for> $nstring]
Although it is slightly misnamed in this context, [string length] is
actually your answer.
benny
> There is that, but it's really rather more confusing than that as the
> identity encoding can be used to create binary strings in the sense of
> Tcl 8.0, and that's where the confusion *really* comes in. That can
> have some very odd effects indeed. Use the other encodings instead,
> define what encoding you're *really* wanting to use to ship data
> about, and stop worrying.
How would one go about reading a text socket with other things embedded
within it? I have a stream which lists in a text index the length and
content of various blocks, then dumps the blocks immediately following
one another, then sits there quiet for a bit before doing it all again
(another index, followed by a bunch of exact-length blocks). Most of
those blocks are binary blobs and images which need to be dumped to
disk, while others are scripts that describe what should be done with
them, and need to be interpreted as it goes.
At the moment it's being served by an old C program that no one wants
to maintain, let alone improve. I'm hoping to bring it up to speed
with the new requirements by starting from scratch in Tcl. It's not
a big job, but it does need to keep track of blob boundaries rather
closely in an otherwise ASCII-text stream with slack end-of-line
formatting (native to the source OS; Unix or Windoze). There'll be even
more ASCII text floating around with the new requirements also
(status queries, configuration changes, etc.).
Fredderic
You can change encoding/translation on the fly on a socket.
Is the size of the text section known? Is there a marker
for the split between sections?
You can read as text to get your size, then change your encoding to
binary and read that many bytes. The trick is knowing how much to read,
number-wise (always 3 bytes? 4? what?). If the length field is arbitrary,
how can you differentiate something that starts "101....."? Is it a ten,
with the first byte of the 10-byte blob just happening to look like an
ascii "1", or is it a 101-byte blob?
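If the protocol announces each block's length as a decimal number on
its own line, the switch can be sketched like this (readBlock and this
framing are my assumptions for illustration, not the poster's actual
protocol):

```tcl
proc readBlock {chan} {
    # Read the length line as text (-translation auto tolerates \r\n and \n alike).
    fconfigure $chan -translation auto -encoding ascii
    set lenLine [gets $chan]
    if {![string is integer -strict $lenLine]} {
        error "bad length line: $lenLine"
    }
    # Switch to binary and read exactly the announced number of bytes.
    fconfigure $chan -translation binary
    return [read $chan $lenLine]
}
```

Reading the count with [gets] before switching is what removes the
"is 101 a length or data?" ambiguity: the newline ends the number.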
Bruce
Since [string bytelength [encoding convertfrom identity $x]]
is always the same as [string length $x], there's no reason
to do so.
I'm not aware of any practical uses for [string bytelength].
--Joe English