
string bug?


Lisa Pearlson

Dec 26, 2005, 1:49:14 PM
Hi,

I have a binary string. The first 2 characters are hexadecimal values 0x00
and 0x17.

Oddly enough,
"[string index $mystring 0][string index $mystring 1]"
returns correct data, and should be equal to
"[string range $mystring 0 1]"
... but it's not!

While the first one returns the expected 0x00 and 0x17, the latter returns
0xc0 and 0x80.
Any clues?

Lisa


Donal K. Fellows

Dec 26, 2005, 6:59:24 PM
Speaking as the maintainer of the [string] and [binary] code, it looks like
your code is playing fast and loose with encodings. The thing is, internally
Tcl uses a (slightly) denormalized UTF-8 for most of its strings, and
that denormalization is in its handling of NUL characters (i.e. the
things you get from \x00 or \u0000) which it represents as \xc0\x80 so
that they pass through utility functions like strlen() without causing
problems. Most of the time, this is exactly what is wanted and it makes
the core both fast and easy to maintain. The down-side is that C code
that wishes to work with a bytestream can't (or at least shouldn't)
just grab the bytes directly; instead, it should use
Tcl_GetByteArrayFromObj which handles all the conversions for you. Or
failing that, Tcl_UtfToExternalDString with the correct encoding
(possibly the one named "utf-8"). But the Tcl core does provide all the
tools for working transparently with all this stuff.

Or at least it should do, and it'd be a bug if it was repeatable in
pure Tcl code (since Tcl's internal encoding scheme should be invisible
to scripts.) Although I couldn't duplicate your problem in a quick
try-out, if you can post some code that demonstrates the difficulty
you're having, I'd be only too happy to fix things.
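
A quick pure-Tcl try-out of that kind might look like the sketch below. The variable names are made up for illustration, and it only exercises [string index] and [string range] on a two-byte value built with [binary format]; it says nothing about what expect or a C extension might hand back.

```tcl
# Build the two bytes 0x00 0x17 as a proper binary string.
set mystring [binary format H* 0017]

# Extract the same two bytes both ways.
set viaIndex "[string index $mystring 0][string index $mystring 1]"
set viaRange [string range $mystring 0 1]

# Dump each result as hex so a stray c0 80 pair would be visible.
binary scan $viaIndex H* hexIndex
binary scan $viaRange H* hexRange
puts "index: $hexIndex  range: $hexRange"
```

In a plain tclsh both dumps should read 0017; if they ever differ, that script is the reproducible case worth posting.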

Donal.

Lisa Pearlson

Dec 27, 2005, 12:20:03 PM
I can't really put my finger on it either. When I 'isolate' these functions,
it seems to work. But inside my application, it doesn't.

I am developing a 'server' script, receiving BINARY data via STDIN using
XINETD as the socket server.
The binary data packets consist of a 2 byte 'size' header, followed by that
number of bytes.
So I have a function that reads the size of the package by getting the first
2 bytes of the stream.
This is my code:

proc parsebinpacketsize {bbody} {
    if {[string length $bbody] < 2} {
        return 0
    }
    binary scan [string range $bbody 0 1] S sz
    return $sz
}

The function fails, saying variable sz does not exist in call 'return $sz'.
This is because binary scan somehow fails.

The function is called in an ugly inefficient expect call that reads this
header and then depending on its size, expects the rest of the bytes of the
data package (using expect to conveniently handle timeouts):

proc breceive {{size -1} {t 10}} {
    set bdata {}
    expect {
        -timeout $t -re "." {
            append bdata $expect_out(buffer)
            set len [string length $bdata]
            if { $size == -1 && $len == 2 } {
                set size [expr 2 + [parsebinpacketsize $bdata]]
            } elseif { $size > -1 && $len >= $size } {
                return $bdata
            }
            exp_continue
        }

        timeout { return $bdata }
    }
}

Currently my work around is to use my own 'range' function:

# [range "abcdef" -3 2] => "defabc"
proc range {str start end} {
    set len [string length $str]
    if {$len == 0} { return "" }
    if {$end >= $len} { set end [expr $len-1] }
    if {$start < -$len} { set start -$len }
    set result ""
    set i $start
    while {$i <= $end} {
        if {$i < 0} {
            append result [string index $str [expr $len + $i]]
        } else {
            append result [string index $str $i]
        }
        incr i
    }
    return $result
}

I don't know if this is sufficient to reproduce the problem and figure out
the issue. I thought that string functions were binary safe. [string length]
works fine on binary data, but [string range] obviously does not. This
seems a bit inconsistent to me. If UTF encoding is used, then [string length]
should provide number of "UTF" characters. But then I don't know how to get
number of 'bytes'. Perhaps [string] should not work on bytes but on 'UTF'
encoded characters, including [string length] and there should be a [binary
length $string] for binary operations. That would seem more consistent to
me.

Lisa


"Donal K. Fellows" <donal.k...@man.ac.uk> wrote in message
news:1135641564.2...@z14g2000cwz.googlegroups.com...

Donal K. Fellows

Dec 27, 2005, 7:49:33 PM
Lisa Pearlson wrote:
> I can't really put my finger on it either. When I 'isolate' these functions,
> it seems to work. But inside my application, it doesn't.

Ick. That's going to be hard to hunt down. It'll probably be something
simple, but non-obvious. :-(

> I am developing a 'server' script, receiving BINARY data via STDIN using
> XINETD as the socket server.

[... and there's expect in the mix too ...]

Lions and tigers and bears! Oh my!

> The binary data packets consist of a 2 byte 'size' header, followed by that
> number of bytes.
> So I have a function that reads the size of the package by getting the first
> 2 bytes of the stream.
> This is my code:
>
> proc parsebinpacketsize {bbody} {
>     if {[string length $bbody] < 2} {
>         return 0
>     }
>     binary scan [string range $bbody 0 1] S sz
>     return $sz
> }
>
> The function fails, saying variable sz does not exist in call 'return $sz'.
> This is because binary scan somehow fails.

I'd be tempted to rewrite that code like this:
proc parsebinpacketsize bbody {
    if {[binary scan [string range $bbody 0 1] S sz] != 1} {
        return 0
    }
    return $sz
}
This uses the fact that [binary scan] returns the number of variables it
filled.
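
As a rough illustration of that return value, the sketch below feeds the proc one complete header and one that is a byte short. The variable names full and short are only for this example.

```tcl
proc parsebinpacketsize bbody {
    # [binary scan] returns how many variables it filled; anything
    # other than 1 here means the 16-bit header was incomplete.
    if {[binary scan [string range $bbody 0 1] S sz] != 1} {
        return 0
    }
    return $sz
}

set full  [binary format S 23]   ;# two bytes: 0x00 0x17
set short [binary format c 0]    ;# a single byte, too short for S

puts [parsebinpacketsize $full]    ;# prints 23
puts [parsebinpacketsize $short]   ;# prints 0
```

Note that this makes a short buffer indistinguishable from a genuine size of zero, which may or may not matter for the protocol.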

> The function is called in an ugly inefficient expect call that reads this
> header and then depending on its size, expects the rest of the bytes of the
> data package (using expect to conveniently handle timeouts):
>
> proc breceive {{size -1} {t 10}} {
>     set bdata {}
>     expect {
>         -timeout $t -re "." {
>             append bdata $expect_out(buffer)
>             set len [string length $bdata]
>             if { $size == -1 && $len == 2 } {
>                 set size [expr 2 + [parsebinpacketsize $bdata]]
>             } elseif { $size > -1 && $len >= $size } {
>                 return $bdata
>             }
>             exp_continue
>         }
>         timeout { return $bdata }
>     }
> }

Hmm, I'd be so tempted to write that in plain Tcl using the [fileevent]
command.

proc breceive {{size -1} {t 10}} {

    global br chan
    # Remember the timeout (in seconds) so the callback can restart it
    set br(t) $t
    set br(timeout) [after [expr {$t*1000}] set br(wait) timeout]
    set br(data) {}
    set br(size) $size
    set br(chan) $chan

    # Note that we *must* be binary and we *must* be non-blocking
    fconfigure $chan -translation binary -blocking 0
    fileevent $chan readable br_received

    vwait br(wait)

    after cancel $br(timeout)
    fileevent $chan readable {}
    return $br(data)
}

proc br_received {} {
    global br

    # Restart the timeout every time the channel becomes readable
    after cancel $br(timeout)
    set br(timeout) [after [expr {$br(t)*1000}] set br(wait) timeout]

    if {$br(size) != -1 && $br(size) > [string length $br(data)]} {
        set wanted [expr {$br(size) - [string length $br(data)]}]
        # Grab as much as is available up to the amount we want
        append br(data) [read $br(chan) $wanted]
    } else {
        # Grab one byte
        append br(data) [read $br(chan) 1]
    }

    set len [string length $br(data)]
    if {$br(size) == -1 && $len == 2} {
        if {![binary scan $br(data) S br(size)]} {
            error "failed to extract string length?!"
        }
        # The size header counts only the payload; add the 2 header bytes
        incr br(size) 2

    } elseif {$br(size) > -1 && $len == $br(size)} {
        set br(wait) done

    } elseif {[eof $br(chan)]} {
        # Watch out for this case!
        set br(wait) eof
    }
}

What does the code do? Pretty much what that expect fragment does,
except that here you've got a lot better control over each bit of what
is going on. In particular, I'm sure it won't mangle any bytes and it
will (or could be made to) handle many error cases gracefully.

For that matter, that code above is pretty much the core of what the
[expect] command does, except for the fancy matching. Not that that is
the only thing going on in the expect extension (terminal handling is
very tricky!) but it is still interesting. Which is in turn why I wrote
such a long example.
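
Once a complete packet is in hand, taking it apart is a one-liner with [binary scan]. The following is only a sketch, assuming the layout described in this thread (a 2-byte big-endian size followed by that many bytes); the payload text and variable names are invented for the example.

```tcl
# Hypothetical packet: 2-byte size header followed by the payload.
set payload "hello"
set packet [binary format Sa* [string length $payload] $payload]

# S reads the 16-bit header, a* takes every remaining byte.
if {[binary scan $packet Sa* size body] != 2} {
    error "short packet"
}

# S yields a signed value, so mask it to recover sizes of 32768 and up.
set size [expr {$size & 0xffff}]

puts "$size [string length $body] $body"
```

The masking step matters only for payloads of 32768 bytes or more, where the signed 16-bit value would otherwise come out negative.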

> I thought that string functions were binary safe.

So did I.

Donal.

Lisa Pearlson

Dec 27, 2005, 9:47:17 PM
> I'd be tempted to rewrite that code like this:
> proc parsebinpacketsize bbody {
>     if {[binary scan [string range $bbody 0 1] S sz] != 1} {
>         return 0
>     }
>     return $sz
> }
> This uses the fact that [binary scan] returns the number of variables it
> filled.

But if there are not enough bytes to fill the variable with 'S', it'll throw
an error "not enough bytes" or something.. won't it?
That's what happened with me initially when bbody was < 2 bytes long.

Lisa


Donal K. Fellows

Dec 28, 2005, 6:39:59 PM
Lisa Pearlson wrote:
> But if there are not enough bytes to fill the variable with 'S', it'll throw
> an error "not enough bytes" or something.. won't it?

No, I checked. :-) When [binary scan] doesn't have enough bytes to
satisfy even the first field, it doesn't assign to anything and instead
returns zero.

% info patch
8.4.7
% binary scan "" S x
0
% set x
can't read "x": no such variable

> That's what happened with me initially when bbody was < 2 bytes long.

You must have been doing something slightly different then. (Not wrong
in general, just different.)

Donal.
