Ulrich Neumerkel wrote:
> I can only recommend to extend the existing standard grammar.
> And not anything else. The grammar in 6 is already very
> complex, by introducing new terminology you are only reducing
> synergy.
You won't find any truly new terminology in my
grammar. But you will find the following synonyms
(online version 0.9.7 as of today, not the PDF):
- small letter -> lower
- capital letter -> upper
- solo -> delimiter
- What else?
And there are no logical names for the many special
characters; they all appear literally in the syntax
to make it shorter. Likewise, instead of "decimal
digit char" etc., simply "digit" etc. to make
it shorter.
I guess mapping between the two is not much of a challenge.
What didn't work for me was finding character
class based variation points in the ISO grammar
to allow a general Unicode extension. For example,
the following production is not pure:
name token (* 6.4.2 *)
    = letter digit token (* 6.4.2 *)
    | graphic token (* 6.4.2 *)
    | quoted token (* 6.4.2 *)
    | semicolon token (* 6.4.2 *)
    | cut token (* 6.4.2 *) ;
It is not pure since it is a mixture of character
class based tokens (letter digit token and graphic
token) and individual character based tokens (semicolon
token and cut token).
The above syntax breaks the following promise:
6.5 Processor character set
The processor character set PCS is an implementation
defined character set. The members of PCS shall include
each character defined by char (6.5).
PCS may include additional members, known as extended
characters. It shall be implementation defined for each
extended character whether it is a graphic char, or an
alphanumeric char, or a solo char, or a layout char, or a
meta char.
It especially breaks the promise when an implementation
extends the solo char class and thus, indirectly, the name token
class. This happens in my general Unicode extension. Therefore
I have the following syntax for a name:
name --> delimiter except "(", "{", "[",
"]", "}", ")", ",", "|"
| lower { alpha | digit }
| graphic { graphic } except "."
| str_single.
The above syntax is fully pluggable. When the character classes
delimiter, lower and graphic change, the syntax of name changes
automatically. This is why I gave the example of:
Jan Burse wrote (15:57):
> There will then be already some delimiter (Jekejeke Terminology)
> in the Latin1 space (it works already in release 0.9.6). For
> example Jekejeke Prolog parses:
>
> «abc»
>
> As:
>
> « abc »
>
> So I guess « (0xAB) and » (0xBB) will be solo (ISO Terminology 6.5.3,
> but didn't verify). This is not as the Craft of Prolog has defined
> its Latin1 extension. There they were graphic.
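The effect of moving « and » between classes can be sketched
in a few lines of Python. This is only an illustration, not the
Jekejeke Prolog implementation: name_tokens and the predicate
names are hypothetical, the "except" lists and str_single from
the name rule are left out, and treating the Unicode categories
Pi/Pf (initial/final quote punctuation) as delimiter is an
assumption made for the demo.

```python
import unicodedata

def name_tokens(text, is_delimiter, is_lower, is_alnum, is_graphic):
    """Split text into name tokens using only the supplied class predicates.

    Simplified from the name rule: no "except" lists, no quoted names.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        j = i + 1
        if ch.isspace():
            i = j
            continue
        if is_lower(ch):
            # lower { alpha | digit }
            while j < len(text) and is_alnum(text[j]):
                j += 1
        elif is_graphic(ch):
            # graphic { graphic }
            while j < len(text) and is_graphic(text[j]):
                j += 1
        elif not is_delimiter(ch):
            raise ValueError(f"no name token starts with {ch!r}")
        # a delimiter is always a token of exactly one character
        tokens.append(text[i:j])
        i = j
    return tokens

# Assumed class definitions, keyed on Unicode general categories.
GRAPHIC_CATEGORIES = {"Pd", "Po", "Sm", "Sc", "Sk", "So"}
QUOTE_CATEGORIES = {"Pi", "Pf"}  # « is Pi, » is Pf

def is_alnum(ch):
    return ch == "_" or ch.isalnum()

def graphic_only(ch):
    return unicodedata.category(ch) in GRAPHIC_CATEGORIES

def quotes(ch):
    return unicodedata.category(ch) in QUOTE_CATEGORIES

def graphic_with_quotes(ch):
    return graphic_only(ch) or quotes(ch)

def no_delimiters(ch):
    return False

sample = "«» «abc»"
# Variant 1: quotes are delimiters (assumed Jekejeke-like behaviour).
as_delimiters = name_tokens(sample, quotes, str.islower, is_alnum, graphic_only)
# Variant 2: quotes are graphic (as the Craft of Prolog extension is described).
as_graphic = name_tokens(sample, no_delimiters, str.islower, is_alnum, graphic_with_quotes)
print(as_delimiters)
print(as_graphic)
```

The tokenizer itself is identical in both runs; only the class
predicates differ. With the quotes as delimiters, «» splits into
two tokens, while with the quotes as graphic it stays a single
graphic token.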
It is very important that a Unicode extension is
pluggable on the basis of character classes, since the
underlying platform might change the release number of
its Unicode libraries at any time. To avoid having to
chase each Unicode release and pick individual characters,
it is much easier to work with a character class based
grammar that is pure and does not contain individual
characters.
Deriving a character class via an except over some ASCII
characters is not a problem, since we exclude those ASCII
characters once and for all, and this is stable. The
Unicode extension definition currently uses only one
non-ASCII except, in the last line of the following:
graphic' --> DASH_PUNCTUATION |
OTHER_PUNCTUATION except ",", ";", "!", "'", "\"" |
MATH_SYMBOL except "|" |
CURRENCY_SYMBOL |
MODIFIER_SYMBOL except "`" |
OTHER_SYMBOL except "\xFFFD\".
0xFFFD is a special marker indicating an invalid byte
sequence, which we don't want to end up in names. All
other excepts are ASCII so far, so hopefully this has
been defined once and for all (Unicode 3.x, 4.x, 5.x,
6.x, etc.). But who knows, maybe some adaptations will
be needed in the future.
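As a rough cross-check, the graphic' rule can be transliterated
into Python. This is a sketch under assumptions: the category
names are mapped to Python's two-letter codes (DASH_PUNCTUATION
= Pd, OTHER_PUNCTUATION = Po, MATH_SYMBOL = Sm, CURRENCY_SYMBOL
= Sc, MODIFIER_SYMBOL = Sk, OTHER_SYMBOL = So), and the names
EXCEPT_BY_CATEGORY and is_graphic_prime are made up for the
example. Python's Unicode tables may also be a different
release than the one a given Prolog system runs on.

```python
import unicodedata

# One entry per alternative of graphic'; the value lists the
# characters removed from that category by "except".
EXCEPT_BY_CATEGORY = {
    "Pd": "",          # DASH_PUNCTUATION
    "Po": ",;!'\"",    # OTHER_PUNCTUATION except , ; ! ' "
    "Sm": "|",         # MATH_SYMBOL except |
    "Sc": "",          # CURRENCY_SYMBOL
    "Sk": "`",         # MODIFIER_SYMBOL except `
    "So": "\ufffd",    # OTHER_SYMBOL except U+FFFD
}

def is_graphic_prime(ch):
    """True if the single character ch belongs to graphic'."""
    excluded = EXCEPT_BY_CATEGORY.get(unicodedata.category(ch))
    return excluded is not None and ch not in excluded

for ch in "+-.,|`€\ufffd":
    print(repr(ch), unicodedata.category(ch), is_graphic_prime(ch))
```

Nothing in the predicate names a non-ASCII character other than
U+FFFD, so a newer Unicode table changes the result only through
the categories themselves. In Python, unicodedata.unidata_version
reports which release the lookup is based on.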
Do you have something better in mind, Ulrich?
Bye