Tcl and Unicode...

Georgios Petasis

Oct 8, 2018, 5:39:16 AM

Hi all,

Unfortunately I am in a situation where I need to parse texts that, in
some cases, contain Unicode characters beyond the ones "supported" by Tcl
(mainly emoticons).

Most Tcl distributions are compiled with TCL_UTF_MAX = 4. But how can I
safely detect if Tcl has been compiled with TCL_UTF_MAX = 4 or more?
Is there a way? I was expecting tcl_platform to have something, but I
couldn't find anything.

The other test is to use "\u" notation with more than 4 digits and see
whether the resulting string actually used the digits beyond the fourth.

Is there a better way?

George

Donal K. Fellows

Oct 8, 2018, 6:16:42 AM

There's *some* rudimentary support for non-BMP characters (via surrogate
pairs) in 8.7, where TCL_UTF_MAX is still definitely 4, though you need
to use "\U" to be able to use six hex digits rather than just four:

% info patchlevel
8.7a2
% puts \U01f600
😀
% string length 😀
2
% scan 😀 %c%c a b
1
% set a
128512
% format %c 128512
😀

As you can see, [string length] is wrong, but [scan] and [format] are
right (as is output to the console on at least OSX). If what you're
doing is passing things through from user input or a file to user output
or a file, then it's probably enough. (The above test was with the
current tip of the core-8-branch.)
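
To illustrate why [string length] reports 2 in the session above: a build that stores strings as 16-bit units has to split U+1F600 into a surrogate pair. The following is a minimal Python sketch of that arithmetic. The helper names to_surrogates and from_surrogates are made up for illustration, and the code follows the standard UTF-16 scheme rather than Tcl's actual internals.

```python
def to_surrogates(codepoint):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # 20-bit value left after removing the BMP
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))              # 0xd83d 0xde00
print(from_surrogates(high, low))       # 128512, the value [scan] returned
# Number of 16-bit units needed for the one emoji character:
print(len("\U0001F600".encode("utf-16-le")) // 2)   # 2
```

So a length of 2 is what a count of 16-bit units gives, while [scan] and [format] evidently recombine the pair into the single code point 128512.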

You particularly need TIP 388
(https://core.tcl.tk/tips/doc/trunk/tip/388.md) and TIP 389
(https://core.tcl.tk/tips/doc/trunk/tip/389.md) in order to make
progress. The latter is an 8.7 one. (Fixing [string length] requires
changing TCL_UTF_MAX so that's probably going to be a thing we do in 9.0
rather than 8.7.)

Donal.
--
Donal Fellows — Tcl user, Tcl maintainer, TIP editor.

Georgios Petasis

Oct 8, 2018, 3:14:52 PM

Dear Donal,

Thanks for the info; I was afraid it was going to be difficult. I also
have an additional problem: the data is in Elasticsearch, I use
python3 for the query part (using the DSL lib), and then I process it in
Tcl through the tkinter bridge. I still have not found out whether the error
message "character U+1f603 is above the range (U+0000-U+FFFF) allowed by
Tcl" comes from Tcl or Python (although I think it comes from Python).
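
Whichever layer raises that error, one way to narrow it down on the Python side is to check for characters above U+FFFF before handing the text to Tcl. This is only a sketch: the function names non_bmp_chars and replace_non_bmp are hypothetical, and replacing characters with U+FFFD is a lossy workaround, not a fix.

```python
def non_bmp_chars(text):
    """Return (index, 'U+XXXX') for every character above U+FFFF."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def replace_non_bmp(text, replacement="\uFFFD"):
    """Swap every character above U+FFFF for a placeholder (lossy)."""
    return "".join(replacement if ord(ch) > 0xFFFF else ch for ch in text)


sample = "nice \U0001F603 day"
print(non_bmp_chars(sample))        # [(5, 'U+1F603')]
print(replace_non_bmp(sample))      # the emoji is replaced by U+FFFD
```

Running non_bmp_chars over the query results before they cross the bridge shows which records contain such characters, without involving Tcl at all.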

What about an 8.7a1 build? Is it going to work?

8.7a2 is difficult to build: I don't know how to get the additional
packages in the "pkgs" directory. Is there a tool that collects them?

George

Donal K. Fellows

Oct 8, 2018, 3:42:41 PM

On 08/10/2018 20:14, Georgios Petasis wrote:
> What about an 8.7a1 build? Is it going to work?

TIP 389 is a change that is subsequent to 8.7a1, so no.

> 8.7a2 is difficult to build: I don't know how to get the additional
> packages in the "pkgs" directory. Is there a tool that collects them?

I wouldn't call Don Porter “a tool”. :-)

But seriously, you could probably just copy those straight out of an
8.7a1 distribution. It might not be a perfect solution, but it is a
simple one. However, 8.7 currently has some minor build problems on
some platforms; I'm not quite sure what they are (as it all works for me
with my toolchain) but they do seem to be real.