Phillip Brooks <phil...@gmail.com> wrote:
> Thanks for the response, Rich. It was very helpful.
>
> On Tuesday, November 15, 2022 at 1:59:36 PM UTC-8, Rich wrote:
>> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?
>
>> Most likely, Tcl became more properly Unicode aware.
>
> When I look through the 8.4 and 8.5 Tcl release notes, I am not
> finding anything about Unicode. Similarly for the list of TIPs -
> there are several Unicode TIPs for 8.7/9.0, though.
The change might not necessarily have referenced Unicode; it might have
referred to channel encodings, or used other terms. Note I'm not saying
you are wrong, just that if changes did happen (and 8.4 to 8.6 is a wide
time window) they might not have used the word "Unicode" but still
might have been impactful.
>> Unless you can:
>> 1) be informed of what actual encoding was used; or
>> 2) write a bunch of code to try to infer the encoding used (and this
>> will likely be fragile)
>> then there is not really a general way to 'interpret' any possible
>> encoding.
>
> That's what I was thinking.
>
>> you could set the channels to 'binary' mode and that will disable
>> all the translating of bytes between encodings.
>
> The binary setting didn't help - rather it breaks 8.4 in the same way
> that 8.6 is broken. This was after calling:
>
> Tcl_SetChannelOption(interp, fc, "-encoding", "binary");
Interesting...
>> You need to look at the "fconfigure" command for adjusting the
>> encoding used for file channels (the C API equivalent is the
>> Tcl_SetChannelOption function). You may simply need to set the
>> input and output channels to utf-8 for things to work correctly
>> again.
>
> Thanks for that pointer, fconfigure and Tcl_Get/SetChannelOption have
> been very illuminating.
>
> In Tcl 8.4, the "C" Tcl_Channel seems to have "-encoding" set to
> "identity" by default. In Tcl 8.6, it is set to "iso8859-1" by
> default. In the Tcl script, however, fconfigure shows default
> "-encoding" set to "utf-8" for both Tcl 8.4 and Tcl 8.6.
If your users have been sneaking in UTF-8 encoded data, and the channel
is now set for iso8859-1, you'll get ugly messes out as a result.
I.e., if your users entered a Unicode right single quote (U+2019) but
the channel is set to iso8859-1, each of its three UTF-8 bytes (E2 80
99) is read as a separate character, and you get "â" plus two stray
control characters out instead of a right single quote mark.
But if your users have been entering UTF-8 encoded text, you'd be safe
setting the channels to UTF-8.
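To make the mismatch concrete, here is a small sketch of what happens to
that one character. It is written in Python rather than Tcl purely to
show the bytes, and the variable names are mine. Tcl's own "encoding
convertto" and "encoding convertfrom" commands should let you repeat the
experiment in either Tcl version.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, as a user might type it.
quote = "\u2019"

# What a UTF-8 writer actually puts in the file: three bytes.
utf8_bytes = quote.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))   # e2 80 99

# Reading those bytes as iso8859-1 maps each byte to one character,
# so one quote mark becomes three unrelated characters.
misread = utf8_bytes.decode("iso8859-1")
print([hex(ord(c)) for c in misread])             # ['0xe2', '0x80', '0x99']

# Reading the same bytes as UTF-8 recovers the original character.
reread = utf8_bytes.decode("utf-8")
print(reread == quote)                            # True
```

The bytes on disk are identical in both cases; only the encoding the
reader assumes differs.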
> Setting "-encoding" to "identity" in Tcl 8.6 seems to reestablish the
> previous behavior. Also, setting it explicitly to "utf-8" works as
> well. Setting Tcl_SetSystemEncoding to "utf-8" changes the default
> to "utf-8" in both Tcl 8.4 and Tcl 8.6.
The Tcl wiki has this to say about the 'identity' encoding:
https://wiki.tcl-lang.org/page/encoding+system
    Can someone elaborate on the meaning of the 'identity' encoding?
    When using freewrap I get:
    % encoding system
    identity
    What is this and what is it used for?
    schlenk 2005-06-27: The identity encoding is for testing purposes,
    it should not be used without very good reasons. If you see your
    encoding system set to identity, you are missing the proper encoding
    files for your setup. This happens with tclkit-sh.exe on windows or
    other wrapped applications which do not include the right encodings
    for the local system they are running on.
    Googie 2012-08-09: The 'identity' encoding is the default encoding
    in my Tcl, even I use regular tclsh and not tclkit. Why is so? (I
    use Linux)
    PYK 2018-12-04: It is so because your Tcl configuration is borked.
Is your code running inside a 'wrapped' executable? If the Wiki
statements here are correct, the fact that you get 'identity' on 8.4
would imply that "it worked" was more a stroke of luck than anything
else.
If setting to UTF-8 'fixes things' then your best course is likely to
set the channels to UTF-8 and let it be. UTF-8 is all but the
'universal' encoding now for just about everything, so you'd be more
'future proof' explicitly setting UTF-8 than not.
> I see this in the fconfigure doc page under -encoding:
>
> "The default encoding for newly opened channels is the same platform-
> and locale-dependent system encoding used for interfacing with the
> operating system, as returned by encoding system."
>
> Does that mean that the user can alter this behavior by setting an
> environment variable on Unix? Any idea where I can find out more
> about that?
Sadly, no. And the only real mention of LANG= in the wiki is that Tcl
uses it to guess what encoding to set as 'system' when it initializes.
> I am thinking that if I can provide the user with an environment
> variable setting, then I won't have to worry about breaking someone
> else's clever use of some other international strings in some other
> place by forcing it to utf-8. I tried explicitly setting
> LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid
> breaking things in new ways for Tcl 8.7 and Tcl 9.
Try LANG=C, which might 'trick' things. But if you do want to avoid
future breakage, and switching to 'utf-8' 'fixes' things now, then that
switch should cause less breakage in the future than not. Anything
else you do would just be a band-aid over another band-aid, and itself
likely to subtly break in other ways in the future.