Robert Karen <robert....@gmail.com> writes:
[...]
> Add_multi_data_file ( ClientData clientData, Tcl_Interp *interp,
> int objc, Tcl_Obj *CONST objv[] )
> {
> char *code_ptr;
> ...
> code_ptr = Tcl_GetStringFromObj(objv[3], NULL);
>
> It sounds like you think Tcl_GetStringFromObj() would not do any encoding
> changes and just take \216 in the string as-is. You sure? Thanks again for
> your trouble.
Tcl_GetStringFromObj will return the "string representation" of the
object. And this is, as far as Tcl can help it, a UTF-8 string.
I just cobbled together a little Tcl extension which just calls the
above function and constructs a string with the hexadecimal
representation of what it sees. The core function is like so:
    #include <stdio.h>
    #include <tcl.h>

    static int Dumpstring_Cmd(
        ClientData __attribute__((__unused__)) cdata,
        Tcl_Interp *interp,
        int objc,
        Tcl_Obj *const objv[])
    {
        const char *src;
        int srclen, i;
        char buf[4];   /* two hex digits, a space, and the NUL */
        Tcl_Obj *res;

        if (objc != 2) {   /* we expect exactly one arg */
            Tcl_WrongNumArgs(interp, 1, objv, "string");
            return TCL_ERROR;
        }
        /* Fetch the string representation and its length in bytes. */
        src = Tcl_GetStringFromObj(objv[1], &srclen);
        res = Tcl_NewStringObj("", 0);
        for (i = 0; i < srclen; i++) {
            /* Cast so that bytes >= 0x80 are not sign-extended. */
            sprintf(buf, "%02x ", (unsigned char)src[i]);
            Tcl_AppendToObj(res, buf, -1);
        }
        Tcl_SetObjResult(interp, res);
        return TCL_OK;
    }
Otherwise, I followed quite slavishly <http://www2.tcl.tk/11153>
This is an example session. First, the files:
tomas@rasputin:~$ ls -l /tmp/zelezny*
-rw-r--r-- 1 tomas tomas 7 Dec 29 18:07 /tmp/zelezny-cp1252
-rw-r--r-- 1 tomas tomas 7 Dec 29 18:06 /tmp/zelezny-latin2
-rw-r--r-- 1 tomas tomas 16 Dec 29 18:07 /tmp/zelezny-utf16
-rw-r--r-- 1 tomas tomas 9 Dec 29 18:08 /tmp/zelezny-utf8
Their hexdumps:
tomas@rasputin:~$ for f in /tmp/zelezny* ; do echo $f ; hexdump -C $f ; done
/tmp/zelezny-cp1252
00000000 8e 65 6c 65 7a 6e fd |.elezn.|
/tmp/zelezny-latin2
00000000 ae 65 6c 65 7a 6e fd |.elezn.|
/tmp/zelezny-utf16
00000000 fe ff 01 7d 00 65 00 6c 00 65 00 7a 00 6e 00 fd |...}.e.l.e.z.n..|
/tmp/zelezny-utf8
00000000 c5 bd 65 6c 65 7a 6e c3 bd |..elezn..|
(Hint: a really good text editor, in my case a modern Emacs, is
an invaluable help in doing this without going nuts.)
Btw, an attempt at saving this in iso-8859-1 (aka latin-1) grants me a
nag page from my editor telling me that the chosen encoding system can't
do that. Thanks, Emacs :-)
So let's play with the freshly made extension:
tomas@rasputin:~/tcltk/dumpstring$ tclsh
tclsh8.5 [~/tcltk/dumpstring]load ./dumpstring[info sharedlibextension]
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-cp1252]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding cp1252
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
Železný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
c5 bd 65 6c 65 7a 6e c3 bd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Everything as expected. Tcl_GetStringFromObj is giving us a clean,
# UTF-8 encoded string. Ž is seen as c5bd, and ý as c3bd. All is well.
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-latin2]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding iso8859-2
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
Železný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
c5 bd 65 6c 65 7a 6e c3 bd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Exactly as above.
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding unicode
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
�紁攀氀攀稀渀ﴀ
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
ef bf be e7 b4 81 e6 94 80 e6 b0 80 e6 94 80 e7 a8 80 e6 b8 80 ef b4 80
tclsh8.5 [~/tcltk/dumpstring]close $f
tclsh8.5 [~/tcltk/dumpstring]
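# Now this is curious. Tcl is transforming the contents of the file into
# UTF-8, but in a strange way. Let's decode the UTF-8 back:
# ef bf be -> U+FFFE: that's the leading "byte order mark" (fe ff in the
# file) with its two bytes swapped.
# The rest follows the same pattern: e7 b4 81 -> U+7D01 where the file has
# 01 7d, e6 94 80 -> U+6500 where the file has 00 65, and so on. So it
# looks like the big-endian file was read as little-endian 16-bit units,
# which were then converted to UTF-8.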
tclsh8.5 [~/tcltk/dumpstring]set f [open /tmp/zelezny-utf16]
file6
tclsh8.5 [~/tcltk/dumpstring]fconfigure $f -encoding identity
tclsh8.5 [~/tcltk/dumpstring]
tclsh8.5 [~/tcltk/dumpstring]set foo [gets $f]
þÿ}elezný
tclsh8.5 [~/tcltk/dumpstring]dumpstring $foo
fe ff 01 7d 00 65 00 6c 00 65 00 7a 00 6e 00 fd
tclsh8.5 [~/tcltk/dumpstring]close $f
# Now this looks better. These are the file's bytes, untouched: a plain,
# straight UCS-2 sequence. It surprised me that Tcl kept it as-is instead
# of converting internally to UTF-8, but that is presumably what
# "identity" asks for: no conversion at all.
Back to your question: Tcl_GetStringFromObj will give you UTF-8 as long
as the channel encoding matches the file. The odd results above only
showed up with the 16-bit file, which is not your case.
Regards
-- tomás