Binary data through socket does not convert back to readable string

tombert

Jul 30, 2018, 5:16:13 AM

Hi all,

Following up on a past suggestion to implement an xor function in C: I followed your advice ... and using [xor] to scramble data back and forth works fine. But as soon as I send the data through a binary-encoded socket, I am unable to convert it back. It looks to me like a UTF issue?

Does anybody have a clue? Thanks in advance.

Doing locally in a console:
> set salt tcl8
> puts [xor [xor abcdefgh $salt] $salt]
abcdefgh

But when I do this through a socket, all I get is "¡⣴¡â30ef?"

Here is my test code:

**************************

package require Performance
set ::salt abc4

## accept a new connection
proc accept {socket args} {
    set ::ss $socket
    chan configure $::ss -blocking 0 -encoding binary -translation binary -buffering none
    puts "$::ss accepted"
}

## open the TCP connection
proc init {} {
    socket -server accept 1234
    set ::cs [socket localhost 1234]
    ## configure the channel as binary
    chan configure $::cs -blocking 0 -encoding binary -translation binary -buffering none
}

proc test {} {
    set data "abcdefgh"
    set scrambledData [xor $data $::salt]
    puts "Original: $data; Scrambled: 0x[binary encode hex $scrambledData]; Back: [xor $scrambledData $::salt]"
    chan puts -nonewline $::ss $scrambledData
    set scrambledRead [chan read $::cs]
    set result [xor $scrambledRead $::salt]
    puts "Channel: 0x[binary encode hex $scrambledRead]: Back: $result"
}

************************

Here is the output of the [test]. Note that the hex code is the same:

Original: abcdefgh; Scrambled: 0x000000500404045c; Back: abcdefgh
Channel: 0x000000500404045c: Back: ¡⣴¡â30ef?

Here is the C code:

************************

int Tcl_xor_cmd(ClientData cdata, Tcl_Interp *interp, int objc, Tcl_Obj *CONST objv[]) {
    // reset result
    Tcl_ResetResult(interp);

    // check argument count
    if (objc != 3) {
        // the argument list of Tcl_AppendResult must end with a NULL pointer
        Tcl_AppendResult(interp, "Invalid command count, use: xor <string> <salt>", (char *) NULL);
        return TCL_ERROR;
    }

    // get the string to xor
    int textLen;
    const char* text = Tcl_GetStringFromObj(objv[1], &textLen);
    // get salt to xor with
    int saltLen;
    const char* salt = Tcl_GetStringFromObj(objv[2], &saltLen);
    // init result string
    char* result = malloc(textLen);

    // xor the string, cycling through the salt
    int si = 0;
    int ti = 0;
    for (ti = 0; ti < textLen; ti++) {
        result[ti] = text[ti] ^ salt[si++];
        //result[ti] = text[ti];
        if (si >= saltLen) si = 0;
    }

    // fini
    Tcl_SetObjResult(interp, Tcl_NewStringObj(result, textLen));
    free(result);
    return TCL_OK;
}

************************

Christian Gollwitzer

Jul 30, 2018, 5:31:15 AM

On 30.07.18 at 11:16, tombert wrote:

> Hi all,
>
> Following up on a past suggestion to implement an xor function in C: I followed your advice ... and using [xor] to scramble data back and forth works fine. But as soon as I send the data through a binary-encoded socket, I am unable to convert it back. It looks to me like a UTF issue?
>
> Does anybody have a clue? Thanks in advance.
>

When you xor bytes, the output is generally not valid UTF-8. Tcl
assumes, more or less, that the strings you exchange with C using
Tcl_GetStringFromObj, Tcl_NewStringObj etc. are UTF-8 sequences (with a
small modification: the NUL character is stored as the two bytes
0xC0 0x80).

Therefore I would suggest you work on bytearrays. Change your C code to
use the functions from https://www.tcl.tk/man/tcl8.4/TclLib/ByteArrObj.htm

Change your Tcl code to convert input and output to bytes, e.g. like this:

set input "this is a bäd string with öumläuts and ʃɪbəlɛθ"

set bytes [encoding convertto utf-8 $input]
set xored [xor $bytes $salt]

# now write it to the binary channel
# fconfigure $channel -translation binary -encoding binary

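It should work when you do everything in reverse on the receiving side:
read the bytes from the binary channel, xor them with the salt again,
and convert the result back with [encoding convertfrom utf-8].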

Alternatively, of course, you could write two functions in C, xorencode
& xordecode, which accept/return strings on the "clear end". This would
make the Tcl code easier. Or you do them in Tcl once you have the plain
xor in C operating on bytes.

Christian

tombert

Jul 30, 2018, 6:25:48 AM

Thanks, that did it. I am now using Tcl_GetByteArrayFromObj and Tcl_NewByteArrayObj instead, and it works, though I did not change my Tcl code.

Is it really necessary to do [encoding convertto utf-8 ...]?
Will I run into trouble if I don't do it?

Thanks

Christian Gollwitzer

Jul 30, 2018, 7:23:04 AM

On 30.07.18 at 12:25, tombert wrote:

> On Monday, 30 July 2018 11:31:15 UTC+2, Christian Gollwitzer wrote:

>> Therefore I would suggest you work on bytearrays. Change your C code to
>> use the functions from https://www.tcl.tk/man/tcl8.4/TclLib/ByteArrObj.htm
>>
>> Change your Tcl code to convert input and output to bytes, e.g. like this:
>>
>> set input "this is a bäd string with öumläuts and ʃɪbəlɛθ"
>>

>
> Thanks, that did it. I am now using Tcl_GetByteArrayFromObj and Tcl_NewByteArrayObj instead, and it works, though I did not change my Tcl code.
>
> Is it really necessary to do [encoding convertto utf-8 ...]?
> Will I run into trouble if I don't do it?

Tcl ByteArrays behave a bit oddly. "Strings" are sequences of Unicode
codepoints, and ByteArrays are defined as strings where each character is
between 0 and 255. When you interpret a regular string as a bytearray,
only the low 8 bits of each codepoint are kept. For the IPA symbols
ʃɪbəlɛθ this results in gibberish. You can try a round trip using the
following Tcl commands:

# The following interprets the string äöüʃɪbəlɛθ as a byte array
# and then converts the bytes back
(Tests) 63 % binary scan äöüʃɪbəlɛθ c* data
1
(Tests) 64 % binary format c* $data
äöüƒjbYl[¸

As you can see, äöü survives, because these characters lie below U+0100
and fit into a single byte, whereas the IPA text comes out scrambled. If
you do the encoding step instead, you end up with a sequence of bytes
that converts back without loss:

(Tests) 70 % binary scan [encoding convertto utf-8 äöüʃɪbəlɛθ] c* data
1
(Tests) 71 % binary format c* $data
Ã¤Ã¶Ã¼ÊƒÉªbÉ™lÉ›Î¸
(Tests) 72 % encoding convertfrom utf-8 [binary format c* $data]
äöüʃɪbəlɛθ

I expect the same happens with your code, too, although I haven't read
it carefully.

Now, concerning your question: it depends on where the input comes from.
Is the input a Unicode string, or is it a sequence of bytes? In the
former case, you need to do the encoding manually. In the latter case,
e.g. if the input is the contents of a ZIP file or similar, or is read
from a channel using "encoding binary", you neither need nor want the
"encoding convertto". By the way, the salt you feed into it is a string,
not a byte array, because it comes from a string in the source code.
However, you will not notice the difference, because it only uses ASCII
chars.

Christian