Reading Unicode text in a non-localized application

Phillip Brooks

Nov 15, 2022, 3:22:48 PM
Hi,

We have noticed a problem in our application that started with our transition from Tcl 8.4 to Tcl 8.6. We use Tcl to read some user-provided text that our application eventually prints. Although our application is not localized, enterprising users found that they could put Unicode text into the file and it would print from C++ exactly as it was entered. With the Tcl 8.6 version of our product, that stopped working, and garbage is now printed where the nice Unicode output used to appear.

Here is a small example program that illustrates this problem:

#include <iostream>
#include <sstream>
#include <fstream>
#include <tcl.h>
#include <string.h>

void dump_buffer( Tcl_Obj* read_obj_ptr ) {
    size_t buflen = strlen( read_obj_ptr->bytes );
    for( size_t i=0; i != buflen; ++i ) {
        if ( i > 0 && ( i % 10 == 0 )) { std::cout << std::endl; }
        unsigned char c = read_obj_ptr->bytes[i];
        std::cout << (unsigned int)c << " ";
    }
    std::cout << std::endl;
}

int main()
{
    Tcl_Interp * interp = Tcl_CreateInterp();
    //Tcl_SetSystemEncoding(interp, "utf-8");

    Tcl_Channel fc = Tcl_OpenFileChannel(interp, "file", "r", 0644);
    if (!fc)
    {
        std::cout << "ERROR: Cannot open input TVF file for reading" << std::endl;
        return 1;
    }

    Tcl_Obj *read_obj_ptr = Tcl_NewObj();
    int chars_read = Tcl_ReadChars(fc, read_obj_ptr, -1, 0);
    char* str = Tcl_GetStringFromObj( read_obj_ptr, nullptr );
    std::cout << "TCL READ String\n";
    std::cout << str << std::endl;
    dump_buffer( read_obj_ptr );

    Tcl_Close(interp, fc);

    std::ifstream fc1("file");
    if ( fc1.fail() ) {
        std::cout << "ERROR: Cannot open input TVF file for reading" << std::endl;
        fc1.close();
        return 1;
    }
    std::stringstream buffer;
    buffer << fc1.rdbuf();

    if ( fc1.fail() || buffer.str().empty() )
    {
        std::cout << "ERROR: No data read from input TVF file" << std::endl;
        fc1.close();
        return 1;
    }
    fc1.close();

    Tcl_Obj *read_obj_ptr1 = Tcl_NewObj();
    Tcl_AppendToObj(read_obj_ptr1, buffer.str().c_str(), -1);
    std::cout << "C++ READ\n";
    std::cout << read_obj_ptr1->bytes << std::endl;
    dump_buffer( read_obj_ptr1 );

    return 0;
}

The file "file" contains unicode:
$ cat file
Korean : 서요한 가나다라 아야어여
Armenian : Թեստ
English : This line is redundant :)

A Tcl only version of the program is:

set f [ open file "r" ]
set lines [ read $f ]
puts "Tcl script READ Unicode"
puts $lines

It behaves as expected in both Tcl 8.4 and Tcl 8.6.

Note that the commented call to Tcl_SetSystemEncoding will cause the program to work the same way for Tcl 8.6 and Tcl 8.4.

The questions I have are:

What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior? It seems that with Tcl 8.4, we were able to get the original content of the strings, but that Tcl 8.6 is altering the input in some way that makes it incompatible with C++.

Is setting the Tcl_SetSystemEncoding call a reasonable fix for this, or will we run into other difficulties now or in the future (I notice that there are a lot of Unicode enhancements set up for Tcl 8.7 and Tcl 9)? What happens if someone gives us some non utf-8 encoded string? Is there a way to support that in this case?

Be patient - I am not, by any means, a Unicode expert.

Thanks!

Phillip Brooks

Nov 15, 2022, 3:33:18 PM
Note that I am building and running this on Red Hat Enterprise Linux 6.

When I build and run this on Red Hat Enterprise Linux 8, the Tcl 8.4 case also fails to print properly.

Rich

Nov 15, 2022, 4:59:36 PM
Phillip Brooks <phil...@gmail.com> wrote:
> What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?

Most likely, Tcl became more properly Unicode aware.

> It seems that with Tcl 8.4, we were able to get the original content
> of the strings, but that Tcl 8.6 is altering the input in some way
> that makes it incompatible with C++.

8.4 was likely using a setting that was transparent while 8.6 is likely
trying to convert the incoming data into Tcl's internal UTF-8 variant.

> Is setting the Tcl_SetSystemEncoding call a reasonable fix for this,
> or will we run into other difficulties now or in the future (I notice
> that there are a lot of Unicode enhancements set up for Tcl 8.7 and
> Tcl 9)?

The 'system' encoding is also used when passing strings to the OS API,
modifying it /may/ cause other strange issues.

> What happens if someone gives us some non utf-8 encoded string? Is
> there a way to support that in this case?

Unless you can:
1) be informed of what actual encoding was used; or
2) write a bunch of code to try to infer the encoding used (and this
will likely be fragile)
then there is not really a general way to 'interpret' any possible
encoding.
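
A validity check is one example of the fragile inference mentioned in option 2: it can say whether a buffer is at least plausible UTF-8. Below is a minimal sketch in plain C++ with no Tcl involved; the names is_valid_utf8 and check_is_valid_utf8 are made up for illustration. Pure ASCII always passes, and text in some other encoding can occasionally pass by accident, so a true result is only a hint, not proof.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return true if `s` is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point decoded so far
        unsigned long min_cp;  // smallest code point allowed for this length
        if (c < 0x80)                { extra = 0; cp = c;        min_cp = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > n - i - 1) return false;  // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += extra + 1;
    }
    return true;
}

// Small self-check: one Hangul syllable as UTF-8, and a Latin-1 byte that
// is not valid UTF-8 on its own.
void check_is_valid_utf8() {
    assert(is_valid_utf8("\xEC\x84\x9C"));  // U+C11C encoded as UTF-8
    assert(!is_valid_utf8("caf\xE9"));      // 0xE9 is e-acute in ISO 8859-1
}
```

An application could run such a check on the raw bytes and fall back to a configured default encoding when it fails, but that policy is a design choice and not something Tcl does for you.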

However, if you just want the exact bytes present in the files to come
back out, you could set the channels to 'binary' mode and that will
disable all the translating of bytes between encodings.

You need to look at the "fconfigure" command for adjusting the encoding
used for file channels (the C API equivalent is the
Tcl_SetChannelOption function). You may simply need to set the
input and output channels to utf-8 for things to work correctly again.

Luc

Nov 15, 2022, 5:44:29 PM
On Tue, 15 Nov 2022 21:59:32 -0000 (UTC), Rich wrote:

> Phillip Brooks <phil...@gmail.com> wrote:
> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?
>
> Most likely, Tcl became more properly Unicode aware.


At least that I can attest. I've had these two applications that I made for
myself for about 15 years, they use the clipboard and text widgets.
They never handled Unicode correctly in the 8.4 and 8.5 era, and I just
gave up on that, learning to live in resignation with some occasional
garbled content.

Only a few months ago I decided to try to fix them and it was very easy
because the old problems I used to have with Unicode just weren't there
anymore. I just removed the ugly kludges I had had in place to hide some
of the problem and everything just worked.

--
Luc


Phillip Brooks

Nov 15, 2022, 7:34:11 PM
Thanks for the response, Rich. It was very helpful.

On Tuesday, November 15, 2022 at 1:59:36 PM UTC-8, Rich wrote:
> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?

> Most likely, Tcl became more properly Unicode aware.

When I look through the 8.4 and 8.5 Tcl release notes, I am not finding anything about Unicode. Similarly for the list of TIPs - there are several Unicode TIPs for 8.7/9.0, though.

> Unless you can:
> 1) be informed of what actual encoding was used; or
> 2) write a bunch of code to try to infer the encoding used (and this
> will likely be fragile)
> then there is not really a general way to 'interpret' any possible
> encoding.

That's what I was thinking.

> you could set the channels to 'binary' mode and that will
> disable all the translating of bytes between encodings.

The binary setting didn't help - rather it breaks 8.4 in the same way that 8.6 is broken. This was after calling:

Tcl_SetChannelOption(interp, fc, "-encoding", "binary");

> You need to look at the "fconfigure" command for adjusting the encoding
> used for file channels (the C API equivalent is the
> Tcl_SetChannelOption function). You may simply need to set the
> input and output channels to utf-8 for things to work correctly again.

Thanks for that pointer, fconfigure and Tcl_Get/SetChannelOption have been very illuminating.

In Tcl 8.4, the "C" Tcl_Channel seems to have "-encoding" set to "identity" by default. In Tcl 8.6, it is set to "iso8859-1" by default. In the Tcl script, however, fconfigure shows default "-encoding" set to "utf-8" for both Tcl 8.4 and Tcl 8.6.

Setting "-encoding" to "identity" in Tcl 8.6 seems to reestablish the previous behavior. Also, setting it explicitly to "utf-8" works as well. Setting Tcl_SetSystemEncoding to "utf-8" changes the default to "utf-8" in both Tcl 8.4 and Tcl 8.6.

I see this in the fconfigure doc page under -encoding:

"The default encoding for newly opened channels is the same platform- and locale-dependent system encoding used for interfacing with the operating system, as returned by encoding system."

Does that mean that the user can alter this behavior by setting an environment variable on Unix? Any idea where I can find out more about that? I am thinking that if I can provide the user with an environment variable setting, then I won't have to worry about breaking someone else's clever use of some other international strings in some other place by forcing it to utf-8. I tried explicitly setting LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid breaking things in new ways for Tcl 8.7 and Tcl 9.


saitology9

Nov 15, 2022, 8:13:13 PM
On 11/15/2022 3:22 PM, Phillip Brooks wrote:
> Hi,
>
> We have noticed a problem in our application that started with our transition from Tcl 8.4 to Tcl 8.6. We use Tcl to read some user-provided text that our application eventually prints. Although our application is not localized, enterprising users found that they could put Unicode text into the file and it would print from C++ exactly as it was entered. With the Tcl 8.6 version of our product, that stopped working, and garbage is now printed where the nice Unicode output used to appear.
>

You seem to have access to both versions of your application.
Therefore, you could find out the exact encoding that was in place in 8.4
and enforce it in 8.6, or change it to something else.

# find out current encoding
% encoding system
cp1251

# change it to something else
% encoding system unicode
unicode

# check
% encoding system
unicode

# list all
% encoding names
...

Rich

Nov 15, 2022, 9:18:01 PM
Phillip Brooks <phil...@gmail.com> wrote:
> Thanks for the response, Rich. It was very helpful.
>
> On Tuesday, November 15, 2022 at 1:59:36 PM UTC-8, Rich wrote:
>> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?
>
>> Most likely, Tcl became more properly Unicode aware.
>
> When I look through the 8.4 and 8.5 Tcl release notes, I am not
> finding anything about Unicode. Similarly for the list of TIPs -
> there are several Unicode TIPs for 8.7/9.0, though.

The change might not necessarily have referenced Unicode; it might have
referred to channel encodings, or other terms. Note I'm not saying you
are wrong, just that if changes did happen (and 8.4 to 8.6 is a wide
time window) they might not have used the word "Unicode" but still
might have been impactful.

>> Unless you can:
>> 1) be informed of what actual encoding was used; or
>> 2) write a bunch of code to try to infer the encoding used (and this
>> will likely be fragile)
>> then there is not really a general way to 'interpret' any possible
>> encoding.
>
> That's what I was thinking.
>
>> you could set the channels to 'binary' mode and that will disable
>> all the translating of bytes between encodings.
>
> The binary setting didn't help - rather it breaks 8.4 in the same way
> that 8.6 is broken. This was after calling:
>
> Tcl_SetChannelOption(interp, fc, "-encoding", "binary");

Interesting...

>> You need to look at the "fconfigure" command for adjusting the
>> encoding used for file channels (the C API equivalent is the
>> Tcl_SetChannelOption function). You may simply need to set the
>> input and output channels to utf-8 for things to work correctly
>> again.
>
> Thanks for that pointer, fconfigure and Tcl_Get/SetChannelOption have
> been very illuminating.
>
> In Tcl 8.4, the "C" Tcl_Channel seems to have "-encoding" set to
> "identity" by default. In Tcl 8.6, it is set to "iso8859-1" by
> default. In the Tcl script, however, fconfigure shows default
> "-encoding" set to "utf-8" for both Tcl 8.4 and Tcl 8.6.

If your users have been sneaking in UTF-8 encoded data, and the channel
is now set for iso8859-1, you'll get ugly messes out as a result.

I.e., if your users entered a Unicode right single quote (U+2019) but
the channel is set to iso8859-1, you get: â followed by two invisible
control characters out instead of a right single quote mark.
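
The effect can be reproduced outside Tcl. The sketch below is plain C++ and only imitates what a channel set to iso8859-1 does to bytes that were really UTF-8: each byte is taken as one character and re-encoded as UTF-8. It is not Tcl's code, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Treat every input byte as one ISO 8859-1 character and re-encode it as
// UTF-8. Bytes below 0x80 are ASCII and pass through unchanged; every
// other byte becomes a two-byte UTF-8 sequence.
std::string latin1_bytes_to_utf8(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    return out;
}

// U+2019 is the three bytes E2 80 99 in UTF-8. Read as ISO 8859-1 they
// become U+00E2 (a-circumflex), U+0080 and U+0099, i.e. six bytes of UTF-8.
void check_right_single_quote() {
    assert(latin1_bytes_to_utf8("\xE2\x80\x99") == "\xC3\xA2\xC2\x80\xC2\x99");
}
```

One character in the file turns into three characters on output, two of them unprintable, which is the kind of garbage described in the original post.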

But, if your users have been entering UTF-8 encoded text, you'd also be
safe setting the channels to UTF-8 as well.

> Setting "-encoding" to "identity" in Tcl 8.6 seems to reestablish the
> previous behavior. Also, setting it explicitly to "utf-8" works as
> well. Setting Tcl_SetSystemEncoding to "utf-8" changes the default
> to "utf-8" in both Tcl 8.4 and Tcl 8.6.

The Tcl wiki has this to say about the 'identity' encoding:

https://wiki.tcl-lang.org/page/encoding+system

Can someone elaborate on the meaning of the 'identity' encoding?
When using freewrap I get:

% encoding system
identity

What is this and what is it used for?

schlenk 2005-06-27: The identity encoding is for testing purposes,
it should not be used without very good reasons. If you see your
encoding system set to identity, you are missing the proper encoding
files for your setup. This happens with tclkit-sh.exe on windows or
other wrapped applications which do not include the right encodings
for the local system they are running on.

Googie 2012-08-09: The 'identity' encoding is the default encoding
in my Tcl, even I use regular tclsh and not tclkit. Why is so? (I
use Linux)

PYK 2018-12-04: It is so because your Tcl configuration is borked.

Is your code running inside a 'wrapped' executable -- if the Wiki
statements here are correct, the fact that you get 'identity' on 8.4
would imply that the fact that "it worked" was more of a stroke of luck
than anything else.

If setting to UTF-8 'fixes things' then your likely best course is to
set the channels to UTF-8 and let it be. UTF-8 is all but the
'universal' encoding now for just about everything, so you'd be more
'future proof' to explicitly set UTF-8 than not.

> I see this in the fconfigure doc page under -encoding:
>
> "The default encoding for newly opened channels is the same platform-
> and locale-dependent system encoding used for interfacing with the
> operating system, as returned by encoding system."
>
> Does that mean that the user can alter this behavior by setting an
> environment variable on Unix? Any idea where I can find out more
> about that?

Sadly, no. And the only real mention of LANG= in the wiki is that Tcl
uses it to guess what encoding to set as 'system' when it initializes.

> I am thinking that if I can provide the user with an environment
> variable setting, then I won't have to worry about breaking someone
> else's clever use of some other international strings in some other
> place by forcing it to utf-8. I tried explicitly setting
> LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid
> breaking things in new ways for Tcl 8.7 and Tcl 9.

Try LANG=C, which might 'trick' things. But if you do want to avoid
future breakage, if switching to 'utf-8' 'fixes' things now, then that
switch should cause less breakage in the future than not. Anything
else you do would just be a band-aid over another band-aid and itself
likely to subtly break in other ways in the future.

Ralf Fassel

Nov 16, 2022, 11:23:52 AM
* Rich <ri...@example.invalid>
| Phillip Brooks <phil...@gmail.com> wrote:
| > I am thinking that if I can provide the user with an environment
| > variable setting, then I won't have to worry about breaking someone
| > else's clever use of some other international strings in some other
| > place by forcing it to utf-8. I tried explicitly setting
| > LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid
| > breaking things in new ways for Tcl 8.7 and Tcl 9.
>
| Try LANG=C, which might 'trick' things. But if you do want to avoid
| future breakage, if switching to 'utf-8' 'fixes' things now, then that
| switch should cause less breakage in the future than not. Anything
| else you do would just be a band-aid over another band-aid and itself
| likely to subtly break in other ways in the future.

Linux/Opensuse 15.4:

$ env LANG=de_DE.UTF-8 tclsh
% fconfigure stdout -encoding
utf-8

$ env LANG=en_US.UTF-8 tclsh
% fconfigure stdout -encoding
utf-8

$ env LANG=C tclsh
% fconfigure stdout -encoding
iso8859-1

So LANG=C is probably not the Right Thing in the context of this thread.


If the LANG=en_US.UTF-8 did not work for the OP, most likely he had set
some other env-vars (namely LC_ALL or LC_CTYPE):

unix/tclUnixInit.c, Tcl_GetEncodingNameFromEnvironment():

    /*
     * Determine the current encoding from the LC_* or LANG environment
     * variables.
--<snip-snip>--
    encoding = getenv("LC_ALL");

    if (encoding == NULL || encoding[0] == '\0') {
        encoding = getenv("LC_CTYPE");
    }
    if (encoding == NULL || encoding[0] == '\0') {
        encoding = getenv("LANG");
    }

R'

Phillip Brooks

Nov 16, 2022, 12:16:31 PM

The output of the encoding system command doesn't seem to correspond to the default channel encoding I actually observe. I am finding various builds of Tcl, both 8.4 and 8.6, that seem to set it in different ways, possibly because of something in the install tree?

From my 8.4 product install tree:
$MGC_HOME/bin/tclsh
% encoding system
iso8859-1

From my generic 8.4 build:
$ /usr/local/tcl8.4b/bin/tclsh8.4
% encoding system
utf-8

As mentioned previously, we don't see issues in a pure Tcl script (see main.tcl in the original post), but only when creating a Tcl interpreter from C/C++ code.

Perhaps it is something that gets handled during initialization and isn't being initialized properly for Tcl 8.6?

I do note that there are a lot of references to iso8859-1 in the Tcl source tree. One of them is in unix/README regarding the configure script:

--with-encoding=ENCODING  Specifies the encoding for compile-time
                          configuration values. Defaults to iso8859-1,
                          which is also sufficient for ASCII.

Might it be that I can ask the customer to use iso8859-1 encoding instead of utf-8 for their localized comments?

Harald Oehlmann

Nov 16, 2022, 12:33:08 PM
Am 16.11.2022 um 18:16 schrieb Phillip Brooks:

I don't know if this was mentioned before.
The Tcl initialization code changed. To initialize static stuff,
Tcl_FindExecutable(argv[0])
should be called first.

Hope this helps,
Harald

Phillip Brooks

Nov 16, 2022, 2:03:51 PM
On Wednesday, November 16, 2022 at 9:33:08 AM UTC-8, Harald Oehlmann wrote:

> I don't know if this was mentioned before.
> The Tcl initialization code changed. To initialize static stuff,
> Tcl_FindExecutable(argv[0])
> should be called first.

That helps immensely. If I add the Tcl_FindExecutable(argv[0]) call before creating the interpreter, it resolves the issue in my small testcase. We'll try that in the main application and see how it goes.

Thanks!

Harald Oehlmann

Nov 17, 2022, 2:40:55 AM
Great to hear. Kudos to the Tcl designers, who put a lot of work into the
embedding issue.
Harald

Ralf Fassel

Nov 17, 2022, 5:33:06 AM
* Phillip Brooks <phil...@gmail.com>
| Might it be that I can ask the customer to use iso8859-1 encoding
| instead of utf-8 for their localized comments?

Don't. UTF-8 is the way to go. iso8859-1 will not even transfer
properly to Windows, where the default codepage for Europe (cp1252)
is subtly different from iso8859-1 in the range 128-159 (0x80-0x9F).
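
To make the difference concrete: the two encodings agree on every byte except 0x80 to 0x9F, where iso8859-1 has control characters and cp1252 has printable ones such as the euro sign and curly quotes. A small sketch in plain C++, with the cp1252 values taken from the published cp1252 mapping table; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Code point for a byte read as ISO 8859-1: always the byte value itself.
std::uint32_t latin1_to_codepoint(unsigned char b) {
    return b;
}

// Code point for a byte read as Windows cp1252. Returns 0xFFFD for the
// five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves unassigned.
std::uint32_t cp1252_to_codepoint(unsigned char b) {
    static const std::uint16_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };
    if (b >= 0x80 && b <= 0x9F) {
        return high[b - 0x80];
    }
    return b;  // identical to ISO 8859-1 everywhere else
}

// Byte 0x80 is a control character in ISO 8859-1 but the euro sign in
// cp1252; byte 0xE9 (e-acute) is the same in both.
void check_codepage_difference() {
    assert(latin1_to_codepoint(0x80) == 0x0080);
    assert(cp1252_to_codepoint(0x80) == 0x20AC);
    assert(cp1252_to_codepoint(0xE9) == latin1_to_codepoint(0xE9));
}
```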

R'

Phillip Brooks

Nov 17, 2022, 11:57:40 AM
Unfortunately, using Tcl_FindExecutable(argv[0]), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be doing more than setting the executable name. Also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?

Ralf - thanks for the info. Also, in searching for info about iso8859-1, it isn't suitable for Korean anyway as it only covers Roman alphabet derivatives.

Harald Oehlmann

Nov 17, 2022, 12:12:07 PM
Am 17.11.2022 um 17:57 schrieb Phillip Brooks:
> Unfortunately, using Tcl_FindExecutable(argv[0]), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be doing more than setting the executable name. Also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?
>
> Ralf - thanks for the info. Also, in searching for info about iso8859-1, it isn't suitable for Korean anyway as it only covers Roman alphabet derivatives.

In one project
https://wiki.tcl-lang.org/page/Embedding+TCL+program+in+DLL
I spent a lot of time debugging the embedding code.
Tcl_FindExecutable(NULL) does a lot more than find the executable.
I don't remember exactly where the system encoding is set,
but it happens somewhere along the way.
You may also need to call Tcl_Init after creating the interpreter...

Take care,
Harald

Ralf Fassel

Nov 17, 2022, 12:13:11 PM
* Phillip Brooks <phil...@gmail.com>
| Unfortunately, using Tcl_FindExecutable(argv[0]), which works in the
| small example program, is not working in our application. What is
| this call doing?

Read the source, Luke.

tcl8.6.13: generic/tclEncoding.c:1449
void
Tcl_FindExecutable(
    const char *argv0)      /* The value of the application's argv[0]
                             * (native). */
{
    TclInitSubsystems();
    TclpSetInitialEncodings();
    TclpFindExecutable(argv0);
}

Could you show the relevant code from your application (i.e. the
Tcl_Open* calls, the write calls etc) together with what happens, and
what you expect to happen?

| Ralf - thanks for the info. Also, in searching for info about
| iso8859-1, it isn't suitable for Korean anyway as it only covers Roman
| alphabet derivatives.

iso8859-1 also does not even contain €, you need iso8859-15 for that ;-)

R'

saitology9

Nov 17, 2022, 2:23:37 PM
On 11/16/2022 12:16 PM, Phillip Brooks wrote:
>
> As mentioned previously, we don't see issues in a pure Tcl script (see main.tcl in the original post), but only when creating a Tcl interpreter from C/C++ code.
>

Well, the idea was that you'd find out which encoding works on your
client side and enforce that everywhere. However, ....


> Perhaps it is something that gets handled during initialization and isn't being initialized properly for Tcl 8.6?
>

This is interesting. You are embedding Tcl in a larger C/C++
application and as you state, Tcl takes care of things fine. So, if you
still have the issue, it would behoove you to look at the rest of the
C/C++ program. Namely, I would expect that you'd have to handle the
encoding there as well. I am not sure if the embedded Tcl interpreter's
control reaches outwards into the embedding system.



briang

Nov 17, 2022, 3:39:54 PM
On Thursday, November 17, 2022 at 8:57:40 AM UTC-8, Phillip Brooks wrote:
> Unfortunately, using Tcl_FindExecutable(argv[0]), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be doing more than setting the executable name. Also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?

Are you running multi-threaded? Are you running multiple interps in multiple threads? I think you need to call Tcl_FindExecutable(NULL) in each thread, before creating any interps in the thread.

-Brian

Christian Werner

Nov 17, 2022, 4:21:59 PM
> Unfortunately, using Tcl_FindExecutable(argv[0]), which works in the small example program, is not working in our application......

A larger C++ based program? Does it have global constructors? Which run before main()? Which call Tcl_SomeThing()?

Phillip Brooks

Nov 28, 2022, 11:14:44 AM
It turns out that the difference between our main C++ application and the smaller test program is that there is a wrapper script that launches the main application that also unsets the LANG variable. I think this was done in response to a previous case where some particular setting of LANG was causing problems with our non-localized Tk gui code. Having LANG unset or set to blank also causes whatever initialization was happening in Tcl_FindExecutable not to happen anymore. I think we'll need to hard-wire LANG to en_US.UTF-8 or some such.
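
If the hard-wiring ends up in the application rather than in the wrapper script, it could look roughly like the sketch below. This is plain C++ using POSIX setenv; the helper names has_value and ensure_locale_default are made up for illustration, and the fallback value is only the one mentioned above. It respects LC_ALL, LC_CTYPE and LANG if the user has set any of them, and would have to run before Tcl_FindExecutable and Tcl_CreateInterp. Whether Tcl then really picks utf-8 still depends on how it derives the encoding from the locale on a given platform, so this is an assumption to test, not a guarantee.

```cpp
#include <cassert>
#include <cstdlib>

// True when the named environment variable is set to a non-empty value.
static bool has_value(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && value[0] != '\0';
}

// If LC_ALL, LC_CTYPE and LANG are all unset or empty, set LANG to
// `fallback`. Returns true only when this call changed LANG.
bool ensure_locale_default(const char* fallback) {
    assert(fallback != nullptr && fallback[0] != '\0');
    if (has_value("LC_ALL") || has_value("LC_CTYPE") || has_value("LANG")) {
        return false;  // the user already chose a locale; leave it alone
    }
    return setenv("LANG", fallback, 1) == 0;  // POSIX; 1 means overwrite
}
```

A call such as ensure_locale_default("en_US.UTF-8") at the top of main() would then stand in for the export in the wrapper script.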

Thanks for all the help in tracking this down.