
Transliteration of (Variable) Names


Jan Burse

Nov 9, 2012, 1:45:34 PM
Dear All,

I am just browsing through the following document:

Coding Guidelines for Prolog
MICHAEL A. COVINGTON et al.
http://arxiv.org/abs/0911.2899

I am currently interested in the aspect of
internationalization, and the writing of foreign variable
names in Prolog texts.

At the moment I leave aside the use of Unicode, and
rather look at the peculiarities of romanized names.
I find at the core of the above guidelines:

"If in doubt: Use underscores to separate words
in compound identifiers, i.e., write is_well_formed,
not isWellFormed. Prefer Result_So_Far
to Result_so_far."
Page 14

Which basically says that the underscore should play
the role of a space. This works fine as long as the
words are recognized as ASCII letter/digit sequences.

I also find in the above guidelines the suggestion
to use acronymic variables with caution, i.e. sections
3.9 and 3.10 point in this direction.

But what to do with words that contain characters
other than ASCII letters/digits? Here are
some examples (since we are in ASCII, we drop the
diacritics):

French: Vis-a-vis
Modern Hebrew: Sh'ba
Etc..

The above cannot be parsed as a single variable name.
This problem only affects variable names, not atoms,
since atoms can be quoted. I find two proposals so far:

1) Expanding the role of the underscore. So
one would write something along:

Vis_a_vis
Sh_ba

2) Allow back quoted strings to play the role
of variables (currently found in Jekejeke
Prolog):

`Vis-a-vis`
`Sh'ba`

(Note: The ISO standard defines back quoted
strings but leaves its semantic use open
to the implementations)
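
A minimal Python sketch of proposal 1, assuming diacritics have already been dropped as in the examples above. The helper name underscore_variable is hypothetical and not part of any Prolog system; it only illustrates how every non-alphanumeric run could be turned into an underscore.

```python
import re


def underscore_variable(word: str) -> str:
    """Hypothetical helper: map a romanized word to an ASCII variable name."""
    # Replace every run of characters that are not ASCII letters or digits by "_".
    name = re.sub(r"[^A-Za-z0-9]+", "_", word)
    if not name:
        return "_"
    # A Prolog variable name starts with a capital letter or an underscore.
    if not (name[0].isupper() or name[0] == "_"):
        name = name[0].upper() + name[1:] if name[0].isalpha() else "_" + name
    return name


for word in ("Vis-a-vis", "Sh'ba"):
    print(word, "->", underscore_variable(word))
```

One consequence of this scheme is that the mapping is not reversible: "Vis-a-vis" and "Vis a vis" would end up as the same variable name.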

Any comments on these proposals? Are there any other
proposals around?

Best Regards

Ulrich Neumerkel

Nov 9, 2012, 2:16:23 PM
Jan Burse <janb...@fastmail.fm> writes:
> 2) Allow back quoted strings to play the role
> of variables (currently found in Jekejeke
> Prolog):
>
> `Vis-a-vis`
> `Sh'ba`
>
> (Note: The ISO standard defines back quoted
> strings but leaves its semantic use open
> to the implementations)

Your note does not fully reflect what ISO/IEC 13211-1
states. In 6.4.7 Back quoted strings, there is the following
note:

NOTE - This part of ISO/IEC 13211 does not define a token
(or term) based on a back quoted string.

It would be a valid extension of this part of ISO/IEC 13211 to
define a back quoted string as denoting a character string
constant.

(end quote)

So while this is not normative, the document is suggesting
a possible extension. It very much depends on the precise
(national) way how this will be interpreted. I would not
feel safe to go against it. The best is to ask someone
from your national member body to explain the role of
notes to you.

Jan Burse

Nov 9, 2012, 4:13:55 PM
Ulrich Neumerkel schrieb:
>> (Note: The ISO standard defines back quoted
>> > strings but leaves its semantic use open
>> > to the implementations)
> Your note does not fully reflect what ISO/IEC 13211-1
> states. In 6.4.7 Back quoted strings, there is the following
> note:

What I introduced with "Note" is not meant to be a
recap of the ISO note, but a note of my own
in my post.

Bye

Jan Burse

Nov 9, 2012, 4:25:54 PM
Ulrich Neumerkel schrieb:
> So while this is not normative, the document is suggesting
> a possible extension. It very much depends on the precise
> (national) way how this will be interpreted. I would not
> feel safe to go against it. The best is to ask someone
> from your national member body to explain the role of
> notes to you.

And what would be the result of such
an investigation? What does your national
body say, the Austrian one?

Bye

Jan Burse

Nov 9, 2012, 4:40:50 PM
Jan Burse schrieb:
Maybe adopt the following process:

"People working in conformity assessment sometimes
have questions about how to interpret ISO/IEC standards
on this topic. In order to ensure consistency in
the interpretation of its standards, CASCO has set
up an interpretation process. An ad hoc group of
experts has been created to respond to each request
for interpretation, called an interpretation
panel.

The interpretation panel does not develop new
requirements, instead it clarifies existing ones.
These interpretations are considered for inclusion
in the relevant standards when they are revised.
Until they are included, they have no binding effect
and are implemented at the discretion of interested
parties."

http://www.iso.org/iso/home/about/conformity-assessment/conformity-assessment_resources.htm

(Very interesting reflective meta level here, a
standard about how to deal with standards...)

LudovicoVan

Nov 9, 2012, 10:59:30 PM
"Ulrich Neumerkel" <ulr...@mips.complang.tuwien.ac.at> wrote in message
news:2012Nov...@mips.complang.tuwien.ac.at...
I'll try and explain that to my dev manager, that we have here a standard
where one has to ask a national member body about "the role of notes". --
He'll just retort I must be kidding, of course...

-LV


Jan Burse

Nov 16, 2012, 4:06:37 PM
Dear All,

Should a Prolog processor downcase according to the
general Unicode rules only, or also according to the
special Unicode rules? I get:

(In case somebody doesn't see the character under consideration in
their browser, it is the Turkish dotted capital I.)

/* For Jekejeke Prolog, using the Java toLowerCase() */
?- atom_codes('İ',[X]), Y is lowercase(X), atom_codes(R,[Y]).
X = 304,
Y = 105,
R = i
?- atom_codes(i,[X]), Y is uppercase(X), atom_codes(R,[Y]).
X = 105,
Y = 73,
R = 'I'

/* For SWI-Prolog */
?- Y='İ', atom_codes(Y,R), downcase_atom(Y,X), atom_codes(X,L).
Y = X, X = 'İ',
R = L, L = [304].
?- Y='i', atom_codes(Y,R), upcase_atom(Y,X), atom_codes(X,L).
Y = i,
R = [105],
X = 'I',
L = [73].

The special unicode rules are here:
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

So I get in one scenario: İ -low-> i -upper-> I
And in the other scenario: İ -low-> İ
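
The first scenario can be compared with a full case mapping. The following Python sketch is only an illustration and says nothing about what a Prolog processor should do: Python's str.lower() applies the special-casing rule and yields two code points, whereas the one-code-point mapping seen above (304 to 105) drops the dot.

```python
import unicodedata

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE (code 304)

# Full lowercase mapping: "i" followed by COMBINING DOT ABOVE (U+0307).
full_lower = DOTTED_CAPITAL_I.lower()
print([hex(ord(c)) for c in full_lower])

# Upper-casing again and composing with NFC restores the original letter.
round_trip = unicodedata.normalize("NFC", full_lower.upper())
print(round_trip == DOTTED_CAPITAL_I)

# With the one-code-point mapping the dot is lost: "i" upper-cases to plain "I".
print("i".upper() == "I")
```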

Hm...

Bye


Jan Burse

Nov 17, 2012, 12:30:23 PM
Dear All,

Just noticed that Unicode can also have an impact on old Latin1
interpretations. I found for example that the tokenizer of The
Craft of Prolog suggests classifying the following character

Latin1: µ (181) (the book calls it a symbol character)
Unicode: 'MICRO SIGN' (U+00B5)

as graphic. Now newer Prolog systems interpret it as a lower case
letter according to Unicode. We therefore have for example:

GNU Prolog 1.4.1
By Daniel Diaz
?- X = µA.
X = µA

But some Latin1 based Prolog systems still treat it as a
graphic character. For example, we have:

SICStus 4.2.3 (x86_64-win32-nt-4)
| ?- X = µA.
! Syntax error
! operator expected after expression
! in line 7
! X = µ
! <<here>>
! A .

The workaround for atoms is simple, put single quotes around it.
So we can do:

SICStus 4.2.3 (x86_64-win32-nt-4)
| ?- X = 'µA'.
X = 'µA' ?

But what about variables? Since it is a lower case letter,
we can prepend an upper case letter, and turn corresponding atoms
into variables. So we would have:

GNU Prolog 1.4.1
By Daniel Diaz
?- XµA = 3.
XµA = 3

But this will not work for those Prolog systems that interpret the
Latin1 segment differently from Unicode. We will have:

SICStus 4.2.3 (x86_64-win32-nt-4)
| ?- XµA = 3.
! Syntax error
! operator expected after expression
! in line 22
! X
! <<here>>
! µ A = 3 .

If we would have back quotes for variables, we could at least do
the following:

/* A Prolog system that has back quotes for variables */
| ?- `XµA` = 3.
`XµA` = 3 ?

But the above is still not the best example of the impact of Unicode
on old interpretations and variable names. The example would work
better with a character that is upper case in Unicode and graphic
in some old interpretation. The quest goes on.

Bye

P.S.: Just notice GNU and SWI are not 100% identical on Latin1.
For micro sign there is no problem. It is lower case letter both
in GNU and SWI Prolog. But for example the following behaves
differently:

Latin1: © (169)
Unicode: 'COPYRIGHT SIGN' (U+00A9)

I find that GNU doesn't accept it at all unquoted:

GNU Prolog 1.4.1
By Daniel Diaz
?- X = ©.
uncaught exception: error(syntax_error('user_input:4 (char:5)
right operand expected for infix operator'),read_term/3)
?- X = '©'.
X = '©'

On the other hand for SWI Prolog it is a graphic character, since
it joins with the period:

Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.3.4)
Copyright (c) 1990-2012 University of Amsterdam, VU Amsterdam
?- X = ©.
| .
X = ©. .
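
The Unicode side of the two characters discussed in this post can be looked up directly. A small Python sketch using the standard unicodedata module; the result depends only on the Unicode database shipped with the interpreter, not on any Prolog system:

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, general category) for one character."""
    return (ord(ch), unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\u00b5", "\u00a9"):
    code, name, category = describe(ch)
    print(f"{code} U+{code:04X} {name}: {category}")
```

The micro sign has category Ll (lower case letter) and the copyright sign has category So (other symbol), which matches the letter and graphic readings above.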


Ulrich Neumerkel

Nov 20, 2012, 6:45:58 AM
Jan Burse <janb...@fastmail.fm> writes:
>Dear All,
>
>Just noticed that Unicode can also have an impact on old Latin1
>interpretations. I found for example that the tokenizer of the
>Craft of Prolog suggest to classify the following character
>
> Latin1: µ (181) (the book calls it a symbol character)
> Unicode: 'MICRO SIGN' (U+00B5)
>
>as graphic. Now newer Prolog systems interpret it as lower case
>letter according to Unicode. We therefore have for example:
>
> GNU Prolog 1.4.1
> By Daniel Diaz
> ?- X = µA.
> X = µA

In my Linux version of GNU Prolog 1.4.1 I cannot reproduce this.


>But what about variables?

Same problem.


> Latin1: © (169)
> Unicode: 'COPYRIGHT SIGN' (U+00A9)
>
>I find that GNU doesn't accept it at all unquoted:
>
> GNU Prolog 1.4.1
> By Daniel Diaz
> ?- X = ©.
> uncaught exception: error(syntax_error('user_input:4 (char:5)
> right operand expected for infix operator'),read_term/3)
> ?- X = '©'.
> X = '©'
>
>On the other hand for SWI Prolog it is a graphic character, since
>it joins with the period:
>
> Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.3.4)
> Copyright (c) 1990-2012 University of Amsterdam, VU Amsterdam
> ?- X = ©.
> | .
> X = ©. .

These are excellent examples, in any case!

Still, the biggest difficulty is the low conformance within the
range of ASCII.

http://www.complang.tuwien.ac.at/ulrich/iso-prolog/conformity_assessment

Does Jekejeke fully support ISO syntax?

Jan Burse

Nov 20, 2012, 7:14:40 AM
Ulrich Neumerkel schrieb:
> Does Jekejeke fully support ISO syntax?

No. Without TC2, there are currently 3 discrepancies:
1.) [] and {} cannot be used as functor or atom.
2.) Layout, i.e. space or comment, after a minus sign
prevents detecting a negative number.
3.) Quoting rules for '|' and '.'.
4.) What else?

Concerning 3) I recently started formalizing it:

http://www.jekejeke.ch/idatab/doclet/prod/en/docs/05_run/10_docu/02_reference/06_syntax/02_term/04_expressions.html
operator --> "|" | ","
| atom.

http://www.jekejeke.ch/idatab/doclet/prod/en/docs/05_run/10_docu/02_reference/06_syntax/01_token/03_words.html
atom --> "!" | ";"
| lower { alpha | digit }
| graphic { graphic }
| str_single.

It still does not reflect what happens with '.'. But do
you think the above is otherwise correct and would allow
promotion to TC2?

Best Regards

Jan Burse

Nov 20, 2012, 7:17:29 AM
Ulrich Neumerkel schrieb:
> In my Linux version of GNU Prolog 1.4.1 I cannot reproduce this.

I am using a MacBook Pro:

$ gprolog
GNU Prolog 1.4.1
By Daniel Diaz
Copyright (C) 1999-2012 Daniel Diaz
| ?- X = µA.

µA = X

yes


Bye

Jan Burse

Nov 20, 2012, 7:18:06 AM
Jan Burse schrieb:
> 1.) [] and {} cannot be used as functor or atom.
Corr.:
1.) [] and {} cannot be used as functor or operator.

Jan Burse

Nov 20, 2012, 8:16:00 AM
Jan Burse schrieb:
Oops. It's wrong, should exclude
prefix operator. Oki Doki.

Ulrich Neumerkel

Nov 20, 2012, 8:17:39 AM
Jan Burse <janb...@fastmail.fm> writes:
>Ulrich Neumerkel schrieb:
>> In my Linux version of GNU Prolog 1.4.1 I cannot reproduce this.
>
>I am using a MacBook Pro:
>
>$ gprolog
>GNU Prolog 1.4.1
>By Daniel Diaz
>Copyright (C) 1999-2012 Daniel Diaz
>| ?- X = µA.
>
>µA = X
>
>yes


This answer is unusual, because the variable X
appears on the left-hand side. I see answers in GNU like

| ?- X = mA.

X = mA

yes

Ulrich Neumerkel

Nov 20, 2012, 8:19:57 AM
You need the IS, Cor. 1 and Cor. 2. There everything is defined.

I do not see a need to introduce new and different nonterminals.

In the IS, there is the name token (* 6.4.2 *) for this. In
fact, it is quite differently defined. Think of /*

Jan Burse

Nov 20, 2012, 9:57:40 AM
Ulrich Neumerkel schrieb:
> In the IS, there is the name token (* 6.4.2 *) for this. In
> fact, it is quite differently defined. Think of /*

Have to check. The goal in my docu is also to have a syntax def
that is easily extensible to Unicode. So will probably adapt my
atom def again, so that it depends on the delimiter class, which
is subsequently extended for Unicode. Just noticed this bug today.

There will then be already some delimiter (Jekejeke Terminology)
in the Latin1 space (it works already in release 0.9.6). For
example Jekejeke Prolog parses:

«abc»

As:

« abc »

So I guess « (0xAB) and » (0xBB) will be solo (ISO Terminology 6.5.3,
but didn't verify). This is not as the Craft of Prolog has defined
its Latin1 extension. There they were graphic.

Bye


Jan Burse

Nov 20, 2012, 10:38:08 AM
On Windows it is detected as an atom. On Mac it
is detected as a variable, since it actually reads
a 194 before the 181. Which is very strange:

MacBook Pro, GNU 1.4.1:
?- X = µA.
µA = X
?- µA = 3.
µA = 3
?- X = '\xB5\'.
X = ?
?- atom_codes(X,[181]).
X = ?
?- atom_codes('µA',X).
X = [194,181,65]

I guess the 0xC2 (=194) is from UTF-8 coding. Which
is not what Windows delivers, therefore on Windows
it is an atom:

Windows 7, GNU 1.4.1:
?- X = µA.
X = µA
?- µA = 3.
no
?- X = '\xB5\'.
X = µ
?- atom_codes(X,[181]).
X = µ
?- atom_codes('µA',X).
X = [181,65]
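
The two code lists are exactly the UTF-8 and Latin1 encodings of the same two characters. The Python sketch below reproduces them; the last step is only one possible explanation of the Mac behaviour, under the assumption (not verified against the GNU Prolog sources) that the reader classifies each byte as a Latin1 character.

```python
text = "\u00b5A"  # MICRO SIGN followed by "A"

utf8_codes = list(text.encode("utf-8"))
latin1_codes = list(text.encode("latin-1"))
print(utf8_codes)    # the list seen on the MacBook
print(latin1_codes)  # the list seen on Windows

# Assumption: a byte-oriented reader sees byte 194 as a Latin1 character.
# That character is LATIN CAPITAL LETTER A WITH CIRCUMFLEX, a capital letter,
# so the token would start like a variable name.
misread = text.encode("utf-8").decode("latin-1")
print(misread[0], misread[0].isupper())
```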

On Linux (Fedora 16) I am not able to enter the micro
sign in the GNU terminal, although it is possible in a
Unix terminal itself; for example Jekejeke Prolog with
the -h option can work with it.

It rather looks to me as if some issues of terminal
handling are also involved. Also I guess GNU Prolog does
not claim to handle these cases.

But somehow back quotes could also provide relief
here. For example to clearly distinguish between atom
and variable when in doubt (for example when converting a
Prolog text from one system to another):

100% an atom:
?- 'µA' = 3.
No

100% a variable:
?- `µA` = 3.
`µA` = 3

Or to clearly allow any character in a variable
name when in doubt (for example when converting a
Prolog text from one system to another):

Any character in a variable:
?- `\xB5\A` = 3.
`µA` = 3


Ulrich Neumerkel schrieb:
> Jan Burse <janb...@fastmail.fm> writes:
>> Ulrich Neumerkel schrieb:
>>> In my Linux version of GNU Prolog 1.4.1 I cannot reproduce this.
>>
>> I am using a MacBook Pro:
>>
>> $ gprolog
>> GNU Prolog 1.4.1
>> By Daniel Diaz
>> Copyright (C) 1999-2012 Daniel Diaz
>> | ?- X = µA.
>>
>> µA = X

Ulrich Neumerkel

Nov 20, 2012, 3:08:09 PM
Jan Burse <janb...@fastmail.fm> writes:
>Ulrich Neumerkel schrieb:
>> In the IS, there is the name token (* 6.4.2 *) for this. In
>> fact, it is quite differently defined. Think of /*
>
>Have to check. The goal in my docu is also to have a syntax def
>that is easily extensible to Unicode. So will probably adapt my
>atom def again, so that it depends on the delimiter class, which
>is subsequently extended for Unicode. Just noticed this bug today.

I can only recommend extending the existing standard grammar,
and nothing else. The grammar in 6 is already very
complex; by introducing new terminology you are only reducing
synergy.

Jan Burse

Nov 20, 2012, 5:06:24 PM
Ulrich Neumerkel schrieb:
> I can only recommend to extend the existing standard grammar.
> And not anything else. The grammar in 6 is already very
> complex, by introducing new terminology you are only reducing
> synergy.

You don't find some true new terminology in my
grammar. But you find the following synonyms
(online version 0.9.7 as of today, not PDF):

- small letter -> lower
- capital letter -> upper
- solo -> delimiter
- What else?

And no logical names for the many special
characters, they are all literal in the syntax
to make it shorter. Also instead of "decimal
digit char" etc.., simply "digit" etc.. to make
it shorter.

I guess this is not a challenge in some way.

What didn't work for me, was finding character
class based variation points in the ISO grammar
to allow a general Unicode extension. For example
the following production is not pure:

name token (* 6.4.2 *)
= letter digit token (* 6.4.2 *)
| graphic token (* 6.4.2 *)
| quoted token (* 6.4.2 *)
| semicolon token (* 6.4.2 *)
| cut token (* 6.4.2 *) ;

It is not pure since it is a mixture of character
class based tokens (letter digit token and graphic
token) and individual character based tokens (semicolon
token and cut token).

The above syntax breaks the following promise:

6.5 Processor character set

The processor character set PCS is an implementation
defined character set. The members of PCS shall include
each character defined by char (6.5).

PCS may include additional members, known as extended
characters. It shall be implementation defined for each
extended character whether it is a graphic char, or an
alphanumeric char, or a solo char, or a layout char, or a
meta char.

It especially breaks the promise, when an implementation
extends the solo char class and thus indirectly the name token
class. This happens in my general Unicode extension. Therefore
I have the following syntax for a name:

name --> delimiter except "(", "{", "[",
"]", "}", ")", ",", "|"
| lower { alpha | digit }
| graphic { graphic } except "."
| str_single.

The above syntax is fully pluggable. When the character classes
delimiter, lower and graphic change, the syntax of name changes
automatically. This is why I gave the example of:

Jan Burse schrieb (15:57):
> There will then be already some delimiter (Jekejeke Terminology)
> in the Latin1 space (it works already in release 0.9.6). For
> example Jekejeke Prolog parses:
>
> «abc»
>
> As:
>
> « abc »
>
> So I guess « (0xAB) and » (0xBB) will be solo (ISO Terminology 6.5.3,
> but didn't verify). This is not as the Craft of Prolog has defined
> its Latin1 extension. There they were graphic.

It is very important that a Unicode extension is
character class based pluggable, since the underlying
platform might change the release number of the Unicode
libraries any time. So to avoid that one has to run after
each Unicode release, and pick individual characters, it
is much easier to work with a character class based
grammar that is pure and does not contain individual
characters.

Derived character class via an except by some ASCII
characters is not a problem. Since we exclude ASCII
characters once and for all, and this is stable. The
Unicode extension definition currently uses only
the following non-ASCII excepts:

graphic' --> DASH_PUNCTUATION |
OTHER_PUNCTUATION except ",", ";", "!", "'", "\"" |
MATH_SYMBOL except "|" |
CURRENCY_SYMBOL |
MODIFIER_SYMBOL except "`" |
OTHER_SYMBOL except "\xFFFD\".

0xFFFD is a special marker indicating an invalid byte
sequence, which we don't want to land in names. Otherwise
all excepts are ASCII so far, so hopefully this has
been defined once and for all (Unicode 3.x, 4.x, 5.x,
6.x, etc..). But who knows, maybe some adaptions will
be needed in the future.
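
The graphic' definition above can be sketched in Python in terms of Unicode general categories. The two-letter category codes are assumed to correspond to the Java-style names used in the grammar (Pd, Po, Sm, Sc, Sk, So), and is_graphic is a hypothetical name, not Jekejeke's actual implementation. Since each excepted character belongs to exactly the category it is excepted from, a single flat exception set is equivalent here.

```python
import unicodedata

# DASH_PUNCTUATION, OTHER_PUNCTUATION, MATH_SYMBOL, CURRENCY_SYMBOL,
# MODIFIER_SYMBOL, OTHER_SYMBOL
GRAPHIC_CATEGORIES = {"Pd", "Po", "Sm", "Sc", "Sk", "So"}

# The "except" characters from the grammar; only U+FFFD is non-ASCII.
GRAPHIC_EXCEPTIONS = set(",;!'\"|`\ufffd")


def is_graphic(ch: str) -> bool:
    """Class-based test: category membership minus the fixed exception set."""
    return (unicodedata.category(ch) in GRAPHIC_CATEGORIES
            and ch not in GRAPHIC_EXCEPTIONS)


for ch in "+,|\u00a9\u20aca":
    print(repr(ch), unicodedata.category(ch), is_graphic(ch))
```

Because the test only consults the category table, a newer Unicode release changes the result without any change to the grammar, which is the pluggability argued for above.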

Do you have something better in mind, Ulrich?

Bye

Ulrich Neumerkel

Nov 20, 2012, 6:08:30 PM
Jan Burse <janb...@fastmail.fm> writes:
>Ulrich Neumerkel schrieb:
>> I can only recommend to extend the existing standard grammar.
>> And not anything else. The grammar in 6 is already very
>> complex, by introducing new terminology you are only reducing
>> synergy.
>
>You don't find some true new terminology in my
>grammar. But you find the following synonyms
>(online version 0.9.7 as of today, not PDF):
>
>- small letter -> lower
>- capital letter -> upper
>- solo -> delimiter
>- What else?

But the names are different. THAT is a challenge.

>What didn't work for me, was finding character
>class based variation points in the ISO grammar
>to allow a general Unicode extension. For example
>the following production is not pure:
>
> name token (* 6.4.2 *)
> = letter digit token (* 6.4.2 *)
> | graphic token (* 6.4.2 *)
> | quoted token (* 6.4.2 *)
> | semicolon token (* 6.4.2 *)
> | cut token (* 6.4.2 *) ;
>
>It is not pure since it is a mixture of character
>class based tokens (letter digit token and graphic
>token) and individual character based tokens (semicolon
>token and cut token).
>
>The above syntax breaks the following promise:
>
> 6.5 Processor character set
>
> The processor character set PCS is an implementation
> defined character set. The members of PCS shall include
> each character defined by char (6.5).
>
> PCS may include additional members, known as extended
> characters. It shall be implementation defined for each
> extended character whether it is a graphic char, or an
> alphanumeric char, or a solo char, or a layout char, or a
> meta char.

This refers in particular to

graphic char (* 6.5.1 *)
alphanumeric char (* 6.5.2 *) for small and capital too.

Adding solo chars is not very well received.
Same for meta char.

>It especially breaks the promise, when an implementation
>extends the solo char class and thus indirectly the name token
>class.

Do you have a concrete example for this other than the example
below?

> > «abc»

This is a graphic char in many systems. Please, leave it like that.
In particular, since the quoting rather goes »abc«, at least in
Germany and Austria (not in Switzerland). Not without reasons
the characters are called left/right pointing and not open/close.

> > So I guess « (0xAB) and » (0xBB) will be solo (ISO Terminology 6.5.3,
> > but didn't verify). This is not as the Craft of Prolog has defined
> > its Latin1 extension. There they were graphic.

Show me a system where these doublequotes are solos. We had lengthy
discussions on this and the outcome was: No new solos. SWI
had many, but no longer.

>It is very important that a Unicode extension is
>character class based pluggable, since the underlying
>platform might change the release number of the Unicode
>libraries any time. So to avoid that one has to run after
>each Unicode release, and pick individual characters, it
>is much easier to work with a character class based
>grammar that is pure and does not contain individual
>characters.

We cannot anticipate all the irregularities that Unicode will
introduce. If they are able to break Japanese, Chinese,
Korean and other Asian languages, if they are able to have case
problems due to languages spoken in the U.S.A. (!), they might
break everything else too. Unicode is not the solid system
it claims to be. Yes, but we cannot change that.

>Do you have something better in mind, Ulrich?

Yes, less classification.

There is no need to accept each and every character
in unquoted form. Most other programming languages that
claim to "support" Unicode, only support them in strings
and characters. If we have here both the distinction
between capital and small characters, and (some) graphic
characters that seems more than enough. So, in case
of doubt, rather reject the unquoted character.

Jan Burse

Nov 20, 2012, 6:52:57 PM
Ulrich Neumerkel schrieb:
>> Do you have something better in mind, Ulrich?
> Yes, less classification.
>
> There is no need to accept each and every character
> in unquoted form.

Except for variables, there nobody wants back
quotes. Also note, that my proposal doesn't
prevent you from using quotes. You can still
do the following:

?- X = '«abc»'.
X = '«abc»'

Do you see in my proposal some restriction
on the quoting of names? I don't understand
your panic, Ulrich, this is how I see your
long post.

Anyway, the extra solos would be for Prolog
extensions that have mixfix resp. distfix.
The advantage of solos would be that they
won't glue, so one can do the following:

?- op(300,fy,«).
Yes
?- op(300,yf,»).
Yes
?- X = ««x»+«x»».
X = ««x»+«x»»

If the parentheses were graphic, they
would glue. That they don't glue is shown
by write_canonical:

?- write_canonical(««x»+«x»»), nl.
+(ᅵ(ᅵ(ᅵ(x))),ᅵ(ᅵ(ᅵ(x))))
Yes
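
The gluing effect can be sketched with a toy tokenizer in Python. It assumes that the operators in the example are the guillemets « (0xAB) and » (0xBB), and it is not a Prolog tokenizer: runs of graphic characters and runs of alphanumerics each form one token, while every solo character is a token of its own.

```python
def tokenize(text: str, graphic: str, solo: str) -> list:
    """Toy tokenizer: graphic and alphanumeric runs glue, solo chars do not."""

    def kind(ch: str) -> str:
        if ch in solo:
            return "solo"
        if ch in graphic:
            return "graphic"
        if ch.isalnum() or ch == "_":
            return "alnum"
        return "other"

    tokens = []
    i = 0
    while i < len(text):
        k = kind(text[i])
        j = i + 1
        if k in ("graphic", "alnum"):
            # Extend the token while the character class stays the same.
            while j < len(text) and kind(text[j]) == k:
                j += 1
        if not text[i].isspace():
            tokens.append(text[i:j])
        i = j
    return tokens


term = "\u00ab\u00abx\u00bb+\u00abx\u00bb\u00bb"
print(tokenize(term, graphic="+", solo="\u00ab\u00bb"))   # guillemets as solo
print(tokenize(term, graphic="+\u00ab\u00bb", solo=""))   # guillemets as graphic
```

With the guillemets as solo, the term splits into nine tokens; with the guillemets as graphic, "»+«" becomes a single token and the operator + is no longer visible to the parser.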

That we really would need mixfix resp. distfix
is also shown by the above write_canonical,
since the outcome is not really as expected.

Bye

Jan Burse

Nov 20, 2012, 7:20:29 PM
Ulrich Neumerkel schrieb:
> Most other programming languages that
> claim to "support" Unicode, only support them in strings
> and characters.

I don't see this evidence. For example Scheme says:

"In addition to the identifier characters of the ASCII repertoire
specified below, Scheme implementations may permit any additional
repertoire of Unicode characters to be employed in identifiers,
provided that each such character has a Unicode general category of Lu,
Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or
Co, or is U+200C or U+200D (the zero-width non-joiner and joiner,
respectively, which are needed for correct spelling in Persian, Hindi,
and other languages). However, it is an error for the first character to
have a general category of Nd, Mc, or Me. It is also an error to use a
non-Unicode character in symbols or identifiers."
http://trac.sacrideo.us/wg/raw-attachment/wiki/WikiStart/r7rs-draft-7.pdf
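
The rule quoted above is purely category based, so it can be written down in a few lines. The Python sketch below is a simplification under stated assumptions: it applies the category test to every character, including ASCII ones, and ignores the rest of the Scheme lexical syntax. The function name is hypothetical.

```python
import unicodedata

ALLOWED_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
    "Pd", "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co",
}
JOIN_CONTROLS = {"\u200c", "\u200d"}        # zero-width non-joiner and joiner
FORBIDDEN_FIRST = {"Nd", "Mc", "Me"}        # not allowed as first character


def allowed_by_quoted_rule(identifier: str) -> bool:
    """Check an identifier against the category rule quoted from the draft."""
    if not identifier:
        return False
    if unicodedata.category(identifier[0]) in FORBIDDEN_FIRST:
        return False
    return all(
        ch in JOIN_CONTROLS or unicodedata.category(ch) in ALLOWED_CATEGORIES
        for ch in identifier
    )


for candidate in ("\u03bb", "a\u200cb", "\u0663x", "a b"):
    print(repr(candidate), allowed_by_quoted_rule(candidate))
```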

BTW: They have also a quoting syntax, where they use
the vertical bar as quotes, and inside quotes they also
allow escapes. But the above doesn't deal with
quoting syntax.

Your comment, Ulrich, suggests allowing Unicode only in
quoting syntax. I think this doesn't capture the goal
of a Unicode extension. I guess the goal of a Unicode
extension should be that a person can code in his own
foreign script, without quotes wherever possible.

Prolog is special here, since it has 3 syntactic
subcategories of names: Solo, graphic and letter digit.
I feel uneasy if Solo becomes a closed category.

I was also thinking about finding a different pattern,
so that for example the exclamation mark ! is extended,
so that one could use the upside down exclamation mark,
as used in Spain.

But I only arrived at a Unicode extension where I
put the parenthesis class into solo. I did not yet
find a solution for the generalization of the exclamation
mark. Problem is that although exclamation mark ! is
solo, the question mark ? is not solo.

The current function of the extended solo class is
basically to keep the scanner going in case one of
the following appears: UNASSIGNED, SURROGATE
or PRIVATE_USE. That is why they have
also been added to the delimiter class:

delimiter' --> UNASSIGNED |
SURROGATE |
PRIVATE_USE |
START_PUNCTUATION |
END_PUNCTUATION |
INITIAL_QUOTE_PUNCTUATION |
FINAL_QUOTE_PUNCTUATION |
"," | ";" | "!" | "|" | "\xFFFD\".

Names that contain UNASSIGNED, SURROGATE or
PRIVATE_USE are flagged as errors only after
tokenizing. Since this happens after tokenizing,
the full line up to the terminating period has
already been read, and consulting doesn't get
out of sync.
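
A Python sketch of this two-step idea, assuming the Java-style class names map to the Unicode general categories Cn, Cs, Co, Ps, Pe, Pi and Pf. The names is_delimiter and invalid_tokens are hypothetical; the point is only that suspicious characters become single-character tokens first and are rejected in a later pass.

```python
import unicodedata

# UNASSIGNED, SURROGATE, PRIVATE_USE, START/END_PUNCTUATION,
# INITIAL/FINAL_QUOTE_PUNCTUATION
DELIMITER_CATEGORIES = {"Cn", "Cs", "Co", "Ps", "Pe", "Pi", "Pf"}
DELIMITER_CHARS = set(",;!|\ufffd")
INVALID_CATEGORIES = {"Cn", "Cs", "Co"}


def is_delimiter(ch: str) -> bool:
    """Step 1: the tokenizer treats these characters as one-character tokens."""
    return (unicodedata.category(ch) in DELIMITER_CATEGORIES
            or ch in DELIMITER_CHARS)


def invalid_tokens(tokens: list) -> list:
    """Step 2: after tokenizing, report the tokens that must be rejected."""
    return [t for t in tokens
            if len(t) == 1 and unicodedata.category(t) in INVALID_CATEGORIES]


print(is_delimiter("\u00ab"), is_delimiter("("), is_delimiter("+"))
print(invalid_tokens(["x", "\ue000", "\u00ab", "."]))
```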

I don't know how I should model this feature
with the ISO syntax except via extending solo,
with the proviso of the check during parsing.
I guess eventually something could be done
via the distinction between processor character
set and character set.

But basically I think the ISO syntax is not
able to model this fault tolerant behaviour
usefully with a closed solo class.

Bye








Jan Burse

Nov 20, 2012, 7:27:06 PM
Jan Burse schrieb:
>
> But basically I think the ISO syntax is not
> able to model this fault tolerant behaviour
> usefully with a closed solo class.

The layout class cannot be the angle of attack,
since it doesn't always survive the tokenizing
phase. If I want to model a two step process of
tokenizing and then additional validation during
parsing, I have to do something with either solo,
graphic or letter digit.

If I do something with graphic resp. letter digit,
the danger is that the unwanted character glues
with wanted characters, and the error messages
are probably more confusing. Dunno.

Bye

Jan Burse

Nov 20, 2012, 7:56:33 PM
Jan Burse schrieb:
> "In addition to the identifier characters of the ASCII repertoire
> specified below, Scheme implementations may permit any additional
> repertoire of Unicode characters to be employed in identifiers,
> provided that each such character has a Unicode general category of Lu,
> Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or
> Co, or is U+200C or U+200D (the zero-width non-joiner and joiner,
> respectively, which are needed for correct spelling in Persian, Hindi,
> and other languages). However, it is an error for the first character to
> have a general category of Nd, Mc, or Me. It is also an error to use a
> non-Unicode character in symbols or identifiers."

Oho, 0x200C and 0x200D, are not covered as above
in my extension. They are recognized as layout in
Jekejeke Prolog (tested with manually pasting
0x200C between the quotes resp. between A and B):

?- atom_codes('‌',X).
Error: Illegal layout character in string.
atom_codes('‌',X).
^
?- X = A‌B.
Error: Superfluous token.
X = A‌B.
^

(ISO requires the error in the quoted atom)

Since they are currently layout in Jekejeke, that's
also why they get printed escaped. I guess it
needs to be improved, since Unicode also explicitly
allows the two controls in identifiers:

http://www.unicode.org/reports/tr31/
"... Thus in such circumstances, an implementation
should allow the following Join_Control ..."

SWI-Prolog also follows the same strategy,
i.e. currently recognizes it as space. But
it is more tolerant inside quoted strings
(tested with manually pasting 0x200C between
the quotes resp. between A and B):

?- atom_codes('‌',X).
X = [8204].

?- X = A‌B.
ERROR: Syntax error: Operator expected
ERROR: X = A
ERROR: ** here **
ERROR: ‌B .

Somehow also a case for backquotes for
variables. Assume you switch between two
Prolog systems, one allowing the control in
quotes/unquoted and the other not.
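
For reference, the Unicode properties of the character pasted in these tests can be checked with a short Python sketch (standard library only). It shows that 0x200C is a format character and not white space in Unicode terms, so treating it as layout is a choice made by the Prolog systems.

```python
import unicodedata

ZWNJ = "\u200c"

print(ord(ZWNJ))                   # the code reported by atom_codes/2
print(unicodedata.name(ZWNJ))
print(unicodedata.category(ZWNJ))  # Cf: format character
print(ZWNJ.isspace())              # not white space according to Unicode
```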

Bye

Jan Burse

Nov 20, 2012, 7:59:22 PM
Ulrich Neumerkel schrieb:
> Adding solo chars is not very well received.
> Same for meta char.

Don't say such things Ulrich, you are inventing
and spreading rumours.

Just reading, SWI-Prolog does the following:
- Other characters (this is mainly No: a numeric character of other
type) are currently handled as `solo'.
http://www.swi-prolog.org/pldoc/doc_for?object=section%282,%272.15%27,swi%28%27/doc/Manual/syntax.html%27%29%29

Just curious, motivation in that case behind it?

Bye

Ulrich Neumerkel

Nov 20, 2012, 8:09:36 PM
Jan Burse <janb...@fastmail.fm> writes:
>Ulrich Neumerkel schrieb:
>> Most other programming languages that
>> claim to "support" Unicode, only support them in strings
>> and characters.
>
>I don't see this evidence. For example Scheme says:

And what other language?

>Your comment Ulrich suggests to only allow Unicode in
>quoting syntax. I think this doesn't capture the goal
>of a Unicode extension. I guess the goal of a unicode
>extension should be that a person can code in his own
>foreign script, without quotes whereever possible.

Please reread:

alphanumeric (* 6.5.2 *) makes sense.
and the graphic chars.

And for the rest, yes, there, quoting makes more sense.

Ulrich Neumerkel

Nov 20, 2012, 8:12:48 PM
Jan Burse <janb...@fastmail.fm> writes:
>Do you see in my proposal some restriction
>on the quoting of names? I don't understand
>your panic, Ulrich, this is how I see your
>long post.

The more special characters are used from Unicode, the
more there will be instability.

Just, as you suggest:

Ulrich Neumerkel

Nov 20, 2012, 8:14:27 PM
Jan Burse <janb...@fastmail.fm> writes:
>Ulrich Neumerkel schrieb:
>> Adding solo chars is not very well received.
>> Same for meta char.
>
>Don't say such things Ulrich, you are inventing
>and spreading rumours.

If you do not believe me, please go to SWI's git and see
what it was like: All symbols above 128 used to be solos.

Jan Burse

Nov 20, 2012, 8:18:51 PM
Ulrich Neumerkel schrieb:
>> Don't say such things Ulrich, you are inventing
>> >and spreading rumours.
> If you do not believe me, please go to SWI's git and see
> what it was like: All symbols above 128 used to be solos.

And?

Jan Burse

Nov 20, 2012, 8:37:19 PM
Ulrich Neumerkel schrieb:
> alphanumeric (* 6.5.2 *) makes sense.
> and the graphic chars.
>
> And for the rest, yes, there, quoting makes more sense.

When you make solo closed, then everything
new is either alphanumeric, graphic or layout.
So there is no rest.

When you put something into layout you force
the end-user to use quotes, but not all systems
might follow this. See the current 0x200C 0x200D
placement in Jekejeke and SWI.

Whether it makes sense to close solo is
still inconclusive for me. I don't see
any argument in your posts, except exhibiting
illocutionary force.

Bye

Jan Burse

Nov 22, 2012, 6:07:35 PM
Ulrich Neumerkel schrieb:
> This refers in particular
>
> graphic char (* 6.5.1 *)
> alphanumeric char (* 6.5.2 *) for small and capital too.

There is one more complication with the form
of the ISO core standard, when going Unicode.
Namely if ID_START and ID_CONTINUE are different.

We don't find this in the ISO core standard. For
ASCII we have ID_START and ID_CONTINUE are the
same, both are alphanumeric plus underscore. This
makes the grammar rules in the current ISO core
standard especially simple.

SWI-Prolog has already solved the problem. For
example it splits a fronting Mark-Nonspacing,
but does not so for an inside Mark-Nonspacing.

So we have for a \x0327\a input:

?- X = ̧a.
ERROR: Syntax error: Operator expected
ERROR: X = ̧
ERROR: ** here **
ERROR: a .

But for an a\x0327\ input:

?- X = a̧.
X = a̧.

Now what is a Mark-Nonspacing that is split off?
I guess it is a solo.

So we have for a \x0327\ input:

?- X = ̧.
X = '̧'.

I guess the latter doesn't need quotes when it is
a solo. But this enlarges the solos again, when my
interpretation is correct.
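
The same start/continue asymmetry exists in other languages and can be observed without any Prolog system. A Python sketch, relying on the fact that str.isidentifier() distinguishes identifier start characters from identifier continuation characters:

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA

print(unicodedata.category(CEDILLA))   # Mn: Mark-Nonspacing
print((CEDILLA + "a").isidentifier())  # leading mark: not a valid start
print(("a" + CEDILLA).isidentifier())  # trailing mark: valid continuation
```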

Bye
