Tcl and Unicode...

Georgios Petasis

Oct 8, 2018, 05:39:16
Hi all,

Unfortunately I am in a situation where I need to parse texts that, in
some cases, contain Unicode characters beyond the ones "supported" by Tcl
(mainly emoticons).

Most Tcl distributions are compiled with TCL_UTF_MAX = 4. But how can I
safely detect if Tcl has been compiled with TCL_UTF_MAX = 4 or more?
Is there a way? I was expecting tcl_platform to have something, but I
couldn't figure it out.

The other test is to use "\u" notation with more than 4 hex digits, and
see whether the digits beyond the fourth were consumed in the resulting
string.

Is there a better way?

George
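
A sketch of why the extra-digit probe described above can work, written in Python (the language used on the query side later in this thread) rather than Tcl. The function `expand_escape` and its `max_digits` parameter are hypothetical: they only model an escape parser that stops after a fixed number of hex digits, and are not Tcl's actual parser. If the parser stops at four digits, the leftover digit survives as a literal character, so the result has two characters instead of one.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def expand_escape(digits: str, max_digits: int) -> str:
    """Model a backslash escape that consumes at most max_digits hex digits.

    Hypothetical helper for illustration only. Any digits left over after
    the limit is reached stay in the result as literal characters.
    """
    used = 0
    while used < len(digits) and used < max_digits and digits[used] in HEX_DIGITS:
        used += 1
    if used == 0:
        raise ValueError("escape needs at least one hex digit")
    return chr(int(digits[:used], 16)) + digits[used:]


# A parser limited to four digits reads "1f60" and leaves the final "0" behind.
four = expand_escape("1f600", 4)
# A parser that accepts six digits consumes the whole code point.
six = expand_escape("01f600", 6)

print(len(four), [hex(ord(c)) for c in four])
print(len(six), [hex(ord(c)) for c in six])
```

Comparing the length of the result against 1 is the whole probe; the same idea applies to any parser with a fixed digit limit.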

Donal K. Fellows

Oct 8, 2018, 06:16:42
There's *some* rudimentary support for non-BMP characters (via surrogate
pairs) in 8.7, where TCL_UTF_MAX is still definitely 4, though you need
to use "\U" to be able to use six hex digits rather than just four:

% info patchlevel
8.7a2
% puts \U01f600
😀
% string length 😀
2
% scan 😀 %c%c a b
1
% set a
128512
% format %c 128512
😀

As you can see, [string length] is wrong, but [scan] and [format] are
right (as is output to the console on at least OSX). If what you're
doing is passing things through from user input or a file to user output
or a file, then it's probably enough. (The above test was with the
current tip of the core-8-branch.)
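
The length of 2 reported in the session above matches the number of UTF-16 code units in a surrogate pair. A minimal sketch of that arithmetic in Python, using the standard UTF-16 formula; the helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for this illustration and are not part of Tcl or of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 128512 is the value that [scan] returned for the emoji (U+1F600).
high, low = to_surrogate_pair(128512)
print(hex(high), hex(low))

# The number of 16-bit units in the UTF-16 encoding of the same character.
utf16_units = len("\U0001F600".encode("utf-16-le")) // 2
print(utf16_units)
```

A character stored this way occupies two 16-bit slots, which is consistent with a length of 2 for a single visible character.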

You particularly need TIP 388
(https://core.tcl.tk/tips/doc/trunk/tip/388.md) and TIP 389
(https://core.tcl.tk/tips/doc/trunk/tip/389.md) in order to make
progress. The latter is an 8.7 one. (Fixing [string length] requires
changing TCL_UTF_MAX so that's probably going to be a thing we do in 9.0
rather than 8.7.)

Donal.
--
Donal Fellows — Tcl user, Tcl maintainer, TIP editor.

Georgios Petasis

Oct 8, 2018, 15:14:52
Dear Donal,

Thanks for the info, I was afraid it was going to be difficult. And I
have an additional problem: I have the data in Elasticsearch, I use
python3 for the query part (using the DSL lib), and then I process it in
Tcl through the tkinter bridge. I still have not found out if the error
message "character U+1f603 is above the range (U+0000-U+FFFF) allowed by
Tcl" comes from tcl or python (although I think it comes from python).
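
Whichever side raises that error, one possible workaround is to check strings on the Python side before they cross the bridge. The sketch below is only an assumption about how that could be done: `non_bmp_characters` and `replace_non_bmp` are hypothetical helpers, and replacing characters with U+FFFD loses information, so it only suits cases where the emoticons are not needed downstream.

```python
from typing import List, Tuple


def non_bmp_characters(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for every code point above U+FFFF."""
    return [
        (index, "U+{:04X}".format(ord(ch)))
        for index, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


def replace_non_bmp(text: str, replacement: str = "\uFFFD") -> str:
    """Swap every code point above U+FFFF for a replacement character."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


sample = "ok \U0001F603 done"

# Report which characters would be out of range before handing the text on.
print(non_bmp_characters(sample))

# Lossy fallback: keep the rest of the text, drop the out-of-range characters.
print(replace_non_bmp(sample).encode("unicode_escape").decode("ascii"))
```

Running the report step first makes it possible to log or skip affected records instead of silently altering them.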

What about a 8.7.a1 build? Is it going to work?

8.7.a2 is difficult to build, I don't know how to get the additional
packages in the "pkgs" directory. Is there a tool that collects them?

George

Donal K. Fellows

Oct 8, 2018, 15:42:41
On 08/10/2018 20:14, Georgios Petasis wrote:
> What about a 8.7.a1 build? Is it going to work?

TIP 389 is a change that is subsequent to 8.7a1, so no.

> 8.7.a2 is difficult to build, I don't know how to get the additional
> packages in the "pkgs" directory. Is there a tool that collects them?

I wouldn't call Don Porter “a tool”. :-)

But seriously, you could probably just copy those straight out of an
8.7a1 distribution. It might not be a perfect solution, but it is a
simple one. However, 8.7 is currently experiencing some small issues on
some platforms with buildability; I'm not quite sure what they are (as
it all works for me with my toolchain) but they do seem to be an issue.