On 30.07.18 at 12:25, tombert wrote:
> On Monday, 30 July 2018 11:31:15 UTC+2, Christian Gollwitzer wrote:
>> Therefore I would suggest you work on bytearrays. Change your C code to
>> use the functions from
>> https://www.tcl.tk/man/tcl8.4/TclLib/ByteArrObj.htm
>>
>> Change your Tcl code to convert input and output to bytes, e.g. like this:
>>
>> set input "this is a bäd string with öumläuts and ʃɪbəlɛθ"
>>
>
> Thanks, that did it. I am now using Tcl_GetByteArrayFromObj and Tcl_NewByteArrayObj instead and it works, though I did not change my Tcl code.
>
> Is it really necessary to do [encoding convertto utf-8 ...]?
> Will I run into trouble if I don't do it?
Tcl ByteArrays behave a bit oddly. "Strings" are sequences of Unicode
code points, and ByteArrays are defined as strings in which every
character lies between 0 and 255. When you interpret a regular string as
a bytearray, every code point above 255 is cut down to its low 8 bits.
For the IPA symbols ʃɪbəlɛθ this results in gibberish. You can try a
round trip using the following Tcl commands:
# The following interprets the string äöüʃɪbəlɛθ as a byte array
# and then converts the bytes back
(Tests) 63 % binary scan äöüʃɪbəlɛθ c* data
1
(Tests) 64 % binary format c* $data
äöüƒjbYl[¸
As you can see, äöü survives because those characters lie within the
first 256 code points of Unicode, whereas the IPA text comes out
scrambled. If you do the encoding step first, however, you end up with a
sequence of UTF-8 bytes that can be converted back to the original string:
(Tests) 70 % binary scan [encoding convertto utf-8 äöüʃɪbəlɛθ] c* data
1
(Tests) 71 % binary format c* $data
äöüʃɪbəlɛθ
(Tests) 72 % encoding convertfrom utf-8 [binary format c* $data]
äöüʃɪbəlɛθ
I expect the same happens in your code, too, although I haven't read it
carefully enough to be sure.
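The two round trips above can also be written as a small script. This is only a sketch, assuming Tcl 8.5 or 8.6, where a string used as a byte array is silently truncated (newer releases may reject such strings instead). The variable names are my own, and the text is a shortened version of the example string, written with \u escapes so that the encoding of the script file does not matter.

```tcl
# Sketch, assuming Tcl 8.5 or 8.6.
set text "\u00e4\u00f6\u00fc\u0283\u026a"   ;# the characters äöüʃɪ

# Implicit conversion: the string is treated as a byte array directly.
# "cu*" reads every byte as an unsigned 8-bit integer.
binary scan $text cu* truncated

# The same numbers, computed by masking each code point to its low 8 bits.
set masked {}
foreach ch [split $text ""] {
    lappend masked [expr {[scan $ch %c] & 0xFF}]
}

# Explicit conversion: encode to UTF-8 bytes first, then decode again.
set bytes [encoding convertto utf-8 $text]
set back  [encoding convertfrom utf-8 $bytes]

puts "truncated: $truncated"
puts "masked:    $masked"
puts "lossless:  [string equal $text $back]"
```

If the two lists agree, the implicit conversion is nothing more than a per-character mask, which is why only characters up to U+00FF survive it.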
Now, concerning your question: it depends on where the input comes from.
Is the input a Unicode string, or is it a sequence of bytes? In the
former case, you need to do the encoding manually. In the latter case,
e.g. if the input is the contents of a ZIP file or similar, or was read
from a channel configured with "-encoding binary", you neither need nor
want the "encoding convertto". By the way, the salt you feed into it is
a string, not a byte array, because it comes from a string in the source
code. However, you will not notice the difference, because it only uses
ASCII chars.
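To make the two cases concrete, here is a minimal sketch, assuming Tcl 8.6 (for "file tempfile"). The sample text and all variable names are made up for illustration. A string that originates in the script has to be encoded explicitly, whereas data read from a channel in binary mode already is a byte array and can be passed on unchanged.

```tcl
# Case 1: the input is a Unicode string, so encode it explicitly.
set text "sm\u00f8rrebr\u00f8d \u03b8"
set textBytes [encoding convertto utf-8 $text]

# Case 2: the input comes from a binary channel and already is bytes.
# A temporary file stands in for a ZIP file or similar.
set chan [file tempfile tmpPath]
fconfigure $chan -translation binary
puts -nonewline $chan $textBytes
flush $chan
seek $chan 0
set fileBytes [read $chan]   ;# no "encoding convertto" needed here
close $chan
file delete $tmpPath

puts "same bytes: [string equal $textBytes $fileBytes]"
puts "decoded ok: [string equal $text [encoding convertfrom utf-8 $fileBytes]]"
```

Running "encoding convertto" on $fileBytes would encode the data a second time, which is why you do not want it in the second case.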
Christian