Unicode vs ASCII

Grégory Vanuxem

Nov 26, 2023, 11:20:49 AM
to fricas...@googlegroups.com
Hi here,

I have read some discussions about using Unicode. Frankly speaking,
that reminds me of the past, when Debian developers did not want to
support 64 bits by default instead of 32 bits. They were wrong. From
my point of view Unicode is a must have. Otherwise the Lisp subsystem
is outdated, I think:

'a' ∈ "abcd"

must return true. We are in 2023 and almost 2024.

As of now with SBCL:

(1) -> 'a' ∈ "abcd"
Line 1: 'a' ∈ "abcd"
....AB
Error A: Improper syntax.
Error B: The character #\ELEMENT_OF is not a FriCAS character.
2 error(s) parsing

(3) -> "a" ∈ "abcd"
Line 1: "a" ∈ "abcd"
....AB
Error A: Improper syntax.
Error B: The character #\ELEMENT_OF is not a FriCAS character.
2 error(s) parsing

This is why I have kept supporting Julia String, even though it is not my
aim to keep supporting it. Try this for syntax highlighting:

"((?:[[:alpha:]_\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{So}←-⇿])(?:[[:word:]_![:word:]_\\?\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{Mn}\u0001-¡]|[^\\P{Mc}\u0001-¡]|[^\\P{Nd}\u0001-¡]|[^\\P{Pc}\u0001-¡]|[^\\P{Sk}\u0001-¡]|[^\\P{Me}\u0001-¡]|[^\\P{No}\u0001-¡]|[′-‷⁗]|[^\\P{So}←-⇿])*)({(?:[^{}]|{(?:[^{}]|{[^{}]*})*})*})?\\??(\\()",

Just my two cents

__
Greg

Waldek Hebisch

Nov 26, 2023, 12:00:37 PM
to fricas...@googlegroups.com
On Sun, Nov 26, 2023 at 05:20:11PM +0100, Grégory Vanuxem wrote:
> Hi here,
>
> I have read some discussions about using Unicode. Frankly speaking,
> that reminds me of the past, when Debian developers did not want to
> support 64 bits by default instead of 32 bits. They were wrong. From
> my point of view Unicode is a must have. Otherwise the Lisp subsystem
> is outdated, I think:
>
> 'a' ∈ "abcd"
>
> must return true. We are in 2023 and almost 2024.
>
> As of now with SBCL:
>
> (1) -> 'a' ∈ "abcd"
> Line 1: 'a' ∈ "abcd"
> ....AB
> Error A: Improper syntax.
> Error B: The character #\ELEMENT_OF is not a FriCAS character.
> 2 error(s) parsing
>
> (3) -> "a" ∈ "abcd"
> Line 1: "a" ∈ "abcd"
> ....AB
> Error A: Improper syntax.
> Error B: The character #\ELEMENT_OF is not a FriCAS character.
> 2 error(s) parsing

Well, you can do:

(4) -> α(x) == x + 1
Type: Void
(5) -> β := 2

(5) 2
Type: PositiveInteger
(6) -> α(β)
Compiling function α with type PositiveInteger -> PositiveInteger

(6) 3
Type: PositiveInteger

so, as you see, Unicode is supported. But FriCAS has no definition
for ∈, so
(7) -> _∈ + β

(7) ∈ + 2
Type: Polynomial(Integer)
works because the leading _ instructs FriCAS to treat ∈ as an identifier,
but FriCAS has no idea that you want ∈ to be an infix operator. This
does not differ significantly from:

(8) -> 'a' in "abcd"
Line 1: 'a' in "abcd"
....A
Error A: Improper syntax.
1 error(s) parsing

OK, FriCAS knows that 'in' is a keyword, so it does not complain about
the token here. But the syntax does not allow 'in' as an operator.

> This is why I have kept supporting Julia String, even though it is not my
> aim to keep supporting it. Try this for syntax highlighting:
>
> "((?:[[:alpha:]_\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{So}←-⇿])(?:[[:word:]_![:word:]_\\?\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{Mn}\u0001-¡]|[^\\P{Mc}\u0001-¡]|[^\\P{Nd}\u0001-¡]|[^\\P{Pc}\u0001-¡]|[^\\P{Sk}\u0001-¡]|[^\\P{Me}\u0001-¡]|[^\\P{No}\u0001-¡]|[′-‷⁗]|[^\\P{So}←-⇿])*)({(?:[^{}]|{(?:[^{}]|{[^{}]*})*})*})?\\??(\\()",

I am not sure what you mean here. This certainly is not a valid
definition of Vim highlighting.

--
Waldek Hebisch

Grégory Vanuxem

Nov 29, 2023, 8:02:09 AM
to fricas...@googlegroups.com
Hello,

Sorry for being so late.
Thanks for reminding me of Greek letter support. In fact, I no longer
remember where this is implemented. But it is a good thing that it is
supported. Since I sometimes interact with Julia, I like the way it
supports Unicode in a terminal: for example, \in plus <TAB> will
replace \in with ∈ automatically. It even completes functions or
Unicode commands, so \empt plus two <TAB> will complete to \emptyset
and then to ∅. I think it could be interesting to add this type of
support to FriCAS for terminals that support Unicode, and even for the
Jupyter notebook, Jfricas, why not.
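
To make the completion behaviour described above concrete, here is a minimal sketch in Python (not FriCAS or Clef code). The table SYMBOLS and the function press_tab are hypothetical names, and the table holds only a few sample entries:

```python
# Hypothetical table of backslash names; a real one would be much larger.
SYMBOLS = {
    "\\in": "∈",
    "\\emptyset": "∅",
    "\\alpha": "α",
    "\\beta": "β",
}


def press_tab(text: str) -> str:
    """Return what the typed text becomes after one <TAB> press."""
    if text in SYMBOLS:
        # Exact name: replace it with the Unicode character.
        return SYMBOLS[text]
    candidates = [name for name in SYMBOLS if name.startswith(text)]
    if len(candidates) == 1:
        # Unambiguous prefix: complete to the full name.
        return candidates[0]
    # Ambiguous or unknown prefix: leave the input unchanged.
    return text


print(press_tab("\\in"))                 # ∈
print(press_tab("\\empt"))               # \emptyset
print(press_tab(press_tab("\\empt")))    # ∅
```

The first <TAB> on \empt only completes the name and the second one substitutes the character, which matches the two-step behaviour described for Julia.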

> But FriCAS has no definition
> for ∈, so
> (7) -> _∈ + β
>
> (7) ∈ + 2
> Type: Polynomial(Integer)
> works because the leading _ instructs FriCAS to treat ∈ as an identifier,
> but FriCAS has no idea that you want ∈ to be an infix operator.

Yes, and it's a pity, I think. But when I speak of Unicode support in
the terminal or in spad/input files, I am not thinking of all Unicode
characters, just, say, Greek letters and, roughly, mathematics-related
characters. Beyond that, some "look-alike" characters could be
introduced. For example, Chrome, Firefox etc. have restricted Unicode
URL support in their browsers because of security concerns. There is a
Cyrillic character that looks very much like the 'l' (L), so gmail, google,
apple can easily be spoofed.
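
As a small illustration of the look-alike problem, the following Python sketch compares a Latin word with a copy whose first letter is Cyrillic. It uses Cyrillic 'а' (U+0430), which is a different look-alike from the one mentioned above. The helper name scripts is made up for this example, and taking the first word of the Unicode character name is only a rough script label:

```python
import unicodedata

latin = "apple"
spoof = "\u0430pple"  # first letter is CYRILLIC SMALL LETTER A


def scripts(word: str) -> set:
    """Rough script labels: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in word}


print(latin == spoof)    # False, although both usually render alike
print(scripts(latin))    # {'LATIN'}
print(len(scripts(spoof)))  # 2 (LATIN and CYRILLIC mixed in one word)
```

A check of this kind (rejecting identifiers that mix scripts) is one possible mitigation; whether FriCAS would want it is an open question.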

But what do you think of adding support for some mathematical operators
in Unicode notation?

> (8) -> 'a' in "abcd"
> Line 1: 'a' in "abcd"
> ....A
> Error A: Improper syntax.
> 1 error(s) parsing
>
> OK, FriCAS knows that 'in' is a keyword, so it does not complain about
> the token here. But the syntax does not allow 'in' as an operator.
>
> > This is why I have kept supporting Julia String, even though it is not my
> > aim to keep supporting it. Try this for syntax highlighting:
> >
> > "((?:[[:alpha:]_\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{So}←-⇿])(?:[[:word:]_![:word:]_\\?\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Nl}\\p{Sc}⅀-⅄∿⊾⊿⊤⊥∂∅-∇∎∏∐∑∞∟∫-∳⋀-⋃◸-◿♯⟘⟙⟀⟁⦰-⦴⨀-⨆⨉-⨖⨛⨜𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃ⁱ-⁾₁-₎∠-∢⦛-⦯℘℮゛-゜𝟎-𝟡]|[^\\P{Mn}\u0001-¡]|[^\\P{Mc}\u0001-¡]|[^\\P{Nd}\u0001-¡]|[^\\P{Pc}\u0001-¡]|[^\\P{Sk}\u0001-¡]|[^\\P{Me}\u0001-¡]|[^\\P{No}\u0001-¡]|[′-‷⁗]|[^\\P{So}←-⇿])*)({(?:[^{}]|{(?:[^{}]|{[^{}]*})*})*})?\\??(\\()",
>
> I am not sure what you mean here. This certainly is not a valid
> definition of Vim highlighting.

Definitely not. I only use Vim in terminals for "quick" use. This is
just a Unicode regular expression; the 'syntax highlighting' is
irrelevant here, sorry. But, again, I think it would be good to add
more support for Unicode characters (and of course Unicode-based
operators). On that point, I have not even tested adding special
Unicode characters, say, Greek letters, in Spad.

The regular expression above comes from another project, and I am
keeping it for future use. I'm writing a VSCode [1] extension for FriCAS,
but that's a big project. Even though it is time-consuming, it is
progressing. I hope to unhide it on GitHub in the next two months.

Thanks for the response.

- Greg

[1] Codium for the pure open-source version

Waldek Hebisch

Nov 29, 2023, 11:12:53 AM
to fricas...@googlegroups.com
On Wed, Nov 29, 2023 at 02:01:29PM +0100, Grégory Vanuxem wrote:
>
> On Sun, Nov 26, 2023 at 18:00, Waldek Hebisch <de...@fricas.org> wrote:
> >
> > On Sun, Nov 26, 2023 at 05:20:11PM +0100, Grégory Vanuxem wrote:
> > >
> > so, as you see Unicode is supported.
>
> Thanks for reminding me of Greek letter support. In fact, I no longer
> remember where this is implemented.

Well, one thing is that we query Lisp to check whether something is a
letter, and if so we allow it in identifiers. In a Unicode-enabled
Lisp this means that we allow Unicode letters.
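
The letter test can be illustrated outside Lisp. This Python sketch is only an analogy: is_identifier_char is a hypothetical name, and Python's str.isalpha stands in for whatever letter predicate the Lisp implementation provides. It shows why α passes as an identifier character while ∈, which Unicode classifies as a math symbol, does not:

```python
import unicodedata


def is_identifier_char(ch: str) -> bool:
    """Accept a character if the runtime classifies it as a letter."""
    # str.isalpha() is True for the Unicode letter categories
    # (Lu, Ll, Lt, Lm, Lo) and False for symbols such as Sm.
    return ch.isalpha()


print(is_identifier_char("a"))     # True
print(is_identifier_char("α"))     # True
print(is_identifier_char("∈"))     # False
print(unicodedata.category("α"))   # Ll (lowercase letter)
print(unicodedata.category("∈"))   # Sm (math symbol)
```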

The other thing is strings, where things work in a natural way, plus
'ucodeToString' and its dual, which depend on the implementation.

> But it is a good thing that it is supported. Since I sometimes interact
> with Julia, I like the way it supports Unicode in a terminal: for
> example, \in plus <TAB> will replace \in with ∈ automatically. It even
> completes functions or Unicode commands, so \empt plus two <TAB>
> will complete to \emptyset and then to ∅. I think it could be
> interesting to add this type of support to FriCAS for terminals that
> support Unicode, and even for the Jupyter notebook, Jfricas, why not.

Such things have their place in Clef (which can now handle UTF-8) and,
in the case of Jfricas, in the Jupyter frontend. There is the question of
how exactly this should work. In FriCAS '_' is an escape character,
so by analogy we should use it. But in FriCAS '_' means that
operators lose their special syntactic properties, and you would
like almost the opposite. I guess that as a user-controllable option
this would be OK.

> > But FriCAS has no definition
> > for ∈, so
> > (7) -> _∈ + β
> >
> > (7) ∈ + 2
> > Type: Polynomial(Integer)
> > works because the leading _ instructs FriCAS to treat ∈ as an identifier,
> > but FriCAS has no idea that you want ∈ to be an infix operator.
>
> Yes, and it's a pity, I think. But when I speak of Unicode support in
> the terminal or in spad/input files, I am not thinking of all Unicode
> characters, just, say, Greek letters and, roughly, mathematics-related
> characters.

Yes, that is natural. My point was that we need a table of the characters
that we want and their properties. In particular, for operators we
need priorities. We also need a corresponding function/keyword, so that
in situations where Unicode is unavailable we can still use
all functions implemented in FriCAS.
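
One possible shape for such a table, sketched in Python rather than in FriCAS's own tables. Everything here is an assumption made for illustration: the names OperatorInfo, UNICODE_OPERATORS and ascii_fallback, the priority numbers, and the choice of ASCII function names paired with each symbol:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorInfo:
    ascii_name: str  # function/keyword usable when Unicode is unavailable
    priority: int    # binding strength; the numbers are made up
    kind: str        # "infix" or "prefix"


# Hypothetical entries; the ASCII names are illustrative pairings only.
UNICODE_OPERATORS = {
    "∈": OperatorInfo("member?", 400, "infix"),
    "∪": OperatorInfo("union", 700, "infix"),
    "∩": OperatorInfo("intersect", 800, "infix"),
}


def ascii_fallback(left: str, op: str, right: str) -> str:
    """Rewrite 'left op right' as a call to the ASCII-named function."""
    info = UNICODE_OPERATORS[op]
    return f"{info.ascii_name}({left}, {right})"


print(ascii_fallback("x", "∈", "s"))  # member?(x, s)
```

Keeping the ASCII name in the same record as the priority means every Unicode operator has a spelling that still works on a terminal without Unicode.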

> Beyond that, some "look-alike" characters could be
> introduced. For example, Chrome, Firefox etc. have restricted Unicode
> URL support in their browsers because of security concerns. There is a
> Cyrillic character that looks very much like the 'l' (L), so gmail, google,
> apple can easily be spoofed.

Well, IIUC several Cyrillic glyphs are considered "the same" as
Latin glyphs. But Cyrillic-using countries traditionally had a
distinct code block for Cyrillic characters, so Unicode decided
that Unicode Cyrillic character codes should also be different from
Latin character codes. More generally, I was an early enthusiast of
Unicode, but now I have serious doubts about many Unicode decisions.
One promise of Unicode was that one should be able to simply
work with character codes; that is not true. There are combining
characters, so to properly interpret text one needs to look at
surrounding characters. There are normalization forms and
characters essentially serving as abbreviations. And largish
tables of character properties. This is all error-prone.
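
A short Python sketch of the points about combining characters and normalization forms, using only the standard unicodedata module. The variable names are chosen for this example:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Same text to a reader, different code sequences to a program.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalization (here NFC) is needed before the two compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Some characters are canonically equivalent to others:
# U+2126 OHM SIGN normalizes to U+03A9 GREEK CAPITAL LETTER OMEGA.
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")    # True

# Compatibility forms act like abbreviations: the ligature U+FB01
# expands to the two letters "fi" under NFKC.
print(unicodedata.normalize("NFKC", "\ufb01"))               # fi
```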


> But what do you think of adding support for some mathematical operators
> in Unicode notation?

Well, ATM the interpreter parser is hand-written and essentially
hardcodes the syntax. This makes it hard to add new operators.
OTOH the old parser, a modified version of which is used by the Spad
compiler, is table-driven: you add a new operator to the tables
and the parser can handle it.
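
To illustrate what "table-driven" buys, here is a toy precedence-climbing parser in Python. It is not the Spad parser and does not reflect its table format; the names INFIX, EXTENDED and parse and the priority values are assumptions made for this sketch. Operands are single whitespace-separated tokens and all operators are left-associative:

```python
# Operator table: symbol -> binding priority (values are illustrative).
INFIX = {"+": 10, "*": 20}


def parse(tokens, table):
    """Parse a token list into nested (op, left, right) tuples."""
    pos = 0

    def parse_expr(min_priority):
        nonlocal pos
        left = tokens[pos]  # operands are single tokens in this sketch
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            if op not in table:
                raise SyntaxError(f"unknown operator {op!r}")
            if table[op] < min_priority:
                break
            pos += 1
            # Require a strictly higher priority on the right so that
            # operators of equal priority associate to the left.
            right = parse_expr(table[op] + 1)
            left = (op, left, right)
        return left

    return parse_expr(0)


print(parse("a + b * c".split(), INFIX))
# ('+', 'a', ('*', 'b', 'c'))

# Supporting a new operator is one table entry; the parser is unchanged.
EXTENDED = {**INFIX, "∈": 5}
print(parse("x ∈ a + b".split(), EXTENDED))
# ('∈', 'x', ('+', 'a', 'b'))
```

With the original table, the same input raises SyntaxError, which is the situation a hand-written parser stays in until its code is edited.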

--
Waldek Hebisch