
GNAT vs UTF-8 source file names


Simon Wright

Apr 30, 2017, 1:10:45 PM
ACATS 4.1 test C250002 involves unit names with UTF-8 characters (the
source has the correct UTF-8 BOM, the relevant unit is named C250002_Z
where Z is actually UTF-8 C381, latin capital letter a with acute;
gnatchop correctly generates a source file with the BOM and name
c250002_z where z is actually UTF-8 C3A1, latin small letter a with
acute).
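
(For reference, the byte sequences quoted above can be checked quickly; this is just a sketch using Python's `encode` as a convenient way to show the encodings, nothing GNAT-specific:)

```python
# The two characters from the unit/file names above.
upper = "\u00C1"   # LATIN CAPITAL LETTER A WITH ACUTE
lower = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE

assert upper.encode("utf-8").hex() == "c381"  # the C381 in the unit name
assert lower.encode("utf-8").hex() == "c3a1"  # the C3A1 in the file name
```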

On compiling, the compiler (GNAT GPL 2016, FSF GCC 7.0.1) fails to find
the file; it says e.g.

GNATMAKE GPL 2016 (20160515-49)
Copyright (C) 1992-2016, Free Software Foundation, Inc.
gcc -c -I../../../support -gnatW8 c250002.adb
gcc -c -I../../../support -gnatW8 c250002_0.ads
End of compilation
gnatmake: "c250002_?.adb" not found

I _suspect_ that the problem is down to the .ali file. macOS says

$ file -I *
c250002.adb: text/plain; charset=utf-8
c250002.ali: text/plain; charset=unknown-8bit
c250002.lst: text/plain; charset=us-ascii
c250002.o: application/x-mach-binary; charset=binary
c250002_0.ads: text/plain; charset=utf-8
c250002_á.adb: text/plain; charset=utf-8
c250002_á.ads: text/plain; charset=utf-8

(the last 2 were actually a-acute on the terminal) but the .ali file is
confused about whether the representation of the a-acute is C3A1 (good,
assuming it gets interpreted as UTF-8 without a BOM) or E3A1 (bad),
particularly about the corresponding .ali file name.

Any thoughts? Is this a known issue?

(C250001, which has BOMs and UTF-8 identifiers but not file names, works fine
with no -gnatW8 messing)

Simon Wright

Jun 17, 2017, 1:20:30 PM
Simon Wright <si...@pushface.org> writes:

> ACATS 4.1 test C250002 involves unit names with UTF-8 characters (the
> source has the correct UTF-8 BOM, the relevant unit is named C250002_Z
> where Z is actually UTF-8 C381, latin capital letter a with acute;
> gnatchop correctly generates a source file with the BOM and name
> c250002_z where z is actually UTF-8 C3A1, latin small letter a with
> acute).
>
> On compiling, the compiler (GNAT GPL 2016, FSF GCC 7.0.1) fails to find
> the file; it says e.g.
>
> GNATMAKE GPL 2016 (20160515-49)
> Copyright (C) 1992-2016, Free Software Foundation, Inc.
> gcc -c -I../../../support -gnatW8 c250002.adb
> gcc -c -I../../../support -gnatW8 c250002_0.ads
> End of compilation
> gnatmake: "c250002_?.adb" not found

PR ada/81114 refers[1].

It turns out that this failure occurs on Windows and macOS. The problem
is that GNAT smashes the file name to lower case if it knows that the
file system is case-insensitive (using an ASCII to-lower, so of course
'smash' is the right word if there are UTF-8 characters in there).
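
(The smashing can be modelled directly. This is a hypothetical per-byte sketch in Python, for illustration only; the ranges mirror a Latin-1-only to-lower applied byte-by-byte to a UTF-8 file name, which is what produces the bogus E3A1 byte pair I saw in the .ali file:)

```python
def latin1_to_lower(b: int) -> int:
    # Hypothetical model of a Latin-1-only to-lower: 'A'..'Z' plus the
    # Latin-1 uppercase ranges 16#C0#..16#D6# and 16#D8#..16#DE# are
    # shifted down by 16#20#; everything else passes through.
    if 0x41 <= b <= 0x5A or 0xC0 <= b <= 0xD6 or 0xD8 <= b <= 0xDE:
        return b + 0x20
    return b

# UTF-8 for 'á' is the two bytes C3 A1; the lead byte C3 falls inside
# the Latin-1 "uppercase" range, so it gets "lowered" to E3.
name = "c250002_\u00E1".encode("utf-8")
smashed = bytes(latin1_to_lower(b) for b in name)

assert smashed == b"c250002_\xe3\xa1"   # no longer valid UTF-8
```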

There is an undocumented environment variable that affects this:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake c250002
gcc -c c250002.adb
gcc -c c250002_á.adb
gnatbind -x c250002.ali
gnatlink c250002.ali
$ ./c250002

,.,. C250002 ACATS 4.1 17-06-17 18:05:55
---- C250002 Check that characters above ASCII.Del can be used in
identifiers, character literals and strings.
- C250002 C250002_0.TAGGED_à_ID.
==== C250002 PASSED ============================.

I wonder why, if the FS is case-insensitive, GNAT bothers at all? (there
was, I think, some remark about detecting whether two filenames
represented different files).

What do people who actually need to use international character sets do
about this? Do you just avoid using international characters in Ada unit
names? Or have I just missed the relevant part of the manual?

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

Jacob Sparre Andersen

Jun 27, 2017, 9:22:13 AM
Simon Wright wrote:

> What do people who actually need to use international character sets
> do about this? Do you just avoid using international characters in Ada
> unit names? Or have I just missed the relevant part of the manual?

One of my customers simply has a policy saying that all identifiers have
to be in English (the policy doesn't say if it should be American
English or proper English), and thus neatly works around the problem.

This reminds me that Jean-Pierre Rosen had a very entertaining tutorial
on glyphs, graphemes, alphabets, characters, character sets, encodings,
etc. at Ada-Europe 2017 in Vienna. We learnt all kinds of stuff we
really don't want to know and worry about. ;-)

Greetings,

Jacob
--
"Even god needs a bus to get there."

Niklas Holsti

Jun 27, 2017, 5:45:59 PM
On 17-06-27 16:22 , Jacob Sparre Andersen wrote:
> Simon Wright wrote:
>
>> What do people who actually need to use international character sets
>> do about this? Do you just avoid using international characters in
>> Ada unit names? Or have I just missed the relevant part of the
>> manual?

I use ISO-Latin-1 identifiers in some Ada programs written in a Finnish
context, using the Finnish alphabet letters ä, ö, and sometimes the
Swedish å. Worked OK for me until *some* of the file systems I use
changed from file names with 8-bit characters to UTF-8 file names, after
which CVS was quite messed up. I have since limited myself to ASCII in
all identifiers that become file name parts in GNAT's file-naming
convention, but I still use ISO Latin 1 for other identifiers.

> One of my customers simply has a policy saying that all identifiers
> have to be in English (the policy doesn't say if it should be American
> English or proper English), and thus neatly works around the problem.

Only if you stick to "modern" English spelling. Otherwise you could
have, for example,

package Coördinates is ...

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

G.B.

Jun 28, 2017, 1:05:42 AM
On 27.06.17 23:45, Niklas Holsti wrote:
>
>> One of my customers simply has a policy saying that all identifiers
>> have to be in English (the policy doesn't say if it should be American
>> English or proper English), and thus neatly works around the problem.
>
> Only if you stick to "modern" English spelling. Otherwise you could have, for example,
>
> package Coördinates is ...

Just like some might be tempted to use floating-point
types when they have permission to use integer types
instead: the support for the more complicated, error-prone,
and difficult new floating-point type is partially
broken, so, programmers, let us get away with the current
support situation by preferring integer types. They are much
more portable, anyway!

Simon Wright

Jul 4, 2017, 9:57:06 AM
Simon Wright <si...@pushface.org> writes:

> PR ada/81114 refers[1].
>
> It turns out that this failure occurs on Windows and macOS. The problem
> is that GNAT smashes the file name to lower case if it knows that the
> file system is case-insensitive (using an ASCII to-lower, so of course
> 'smash' is the right word if there are UTF-8 characters in there).

> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

It's worse than that, on macOS anyway[2].

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is[3] that macOS takes
the composed form (lowercase a acute) and converts it under the hood
to what HFS+ insists on, the fully decomposed form (lowercase a, combining
acute); thus the names are actually different even though they _look_
the same.
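
(The two spellings of the same visible name are easy to demonstrate; a quick sketch using Python's standard-library `unicodedata` module, purely to illustrate the composed/decomposed distinction:)

```python
import unicodedata

nfc = "p\u00E1ck3"                          # 'á' as one code point, U+00E1
nfd = unicodedata.normalize("NFD", nfc)     # 'a' + combining acute, U+0301

assert nfc != nfd                           # different code point sequences
assert (len(nfc), len(nfd)) == (5, 6)       # NFD is one code point longer
assert unicodedata.normalize("NFC", nfd) == nfc   # round-trips back to NFC
```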

I have to say that, great as it would be to have this fixed, the changes
required would be extensive, and I can’t see that anyone would think it
worth the trouble.

The recommendation would be "don’t use international characters in the
names of library units".

[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
[3] https://stackoverflow.com/a/6153713/40851

Shark8

Jul 4, 2017, 1:30:04 PM
On Tuesday, July 4, 2017 at 7:57:06 AM UTC-6, Simon Wright wrote:
>
> It's worse than that, on macOS anyway[2].
>
> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
>
> The reason for this apparently-bizarre message is[3] that macOS takes
> the composed form (lowercase a acute) and converts it under the hood
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining
> acute); thus the names are actually different even though they _look_
> the same.

This is why I maintain that unicode is crap -- a mistake along the lines of C that will likely take *decades* for the rest of "the industry" / computer science to realize.

>
> I have to say that, great as it would be to have this fixed, the changes
> required would be extensive, and I can’t see that anyone would think it
> worth the trouble.

One of unicode's biggest problems is that there's no longer any coherent vision -- it started off as an idea to offer one code-point per character in human language, but then shifted to glyph-building (hence combining characters), and as such lacks a unifying principle.

J-P. Rosen

Jul 5, 2017, 1:21:41 AM
Le 04/07/2017 à 15:57, Simon Wright a écrit :
> The reason for this apparently-bizarre message is[3] that macOS takes
> the composed form (lowercase a acute) and converts it under the hood
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining
> acute); thus the names are actually different even though they _look_
> the same.
Apparently, they use NFD (Normalization Form D). Normalization forms are
necessary to avoid a whole lot of problems, although Ada requires
normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
implementation defined if the text is not in NFC.
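
(Whether a given string is in a particular normalization form can be checked mechanically; a sketch using Python's `unicodedata.is_normalized`, available since Python 3.8, shown only to make the NFC/NFD distinction concrete:)

```python
import unicodedata

composed   = "Co\u00F6rdinates"                       # 'ö' as U+00F6
decomposed = unicodedata.normalize("NFD", composed)   # 'o' + U+0308

assert unicodedata.is_normalized("NFC", composed)     # already in NFC
assert not unicodedata.is_normalized("NFC", decomposed)
assert unicodedata.normalize("NFC", decomposed) == composed
```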

--
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr

J-P. Rosen

Jul 5, 2017, 1:25:20 AM
Le 04/07/2017 à 19:30, Shark8 a écrit :
> This is why I maintain that unicode is crap -- a mistake along the
> lines of C that will likely take *decades* for the rest of "the
> industry" / computer science to realize.
Please don't make such statements until you understand all the issues -
the problem of character sets is incredibly complicated.

>> I have to say that, great as it would be to have this fixed, the
>> changes required would be extensive, and I can’t see that anyone
>> would think it worth the trouble.
> One of unicode's biggest problems is that there's no longer any
> coherent vision -- it started off as a idea to offer one code-point
> per character in human language, but then shifted to glyph-building
> (hence combining characters), and as such lacks a unifying
> principle.
The unifying principle is the normalization forms. The fact that there
are several normalization forms comes from the difference between human
and computer needs.

Simon Wright

Jul 5, 2017, 5:47:42 AM
"J-P. Rosen" <ro...@adalog.fr> writes:

> Le 04/07/2017 à 15:57, Simon Wright a écrit :
>> The reason for this apparently-bizarre message is[3] that macOS takes
>> the composed form (lowercase a acute) and converts it under the hood
>> to what HFS+ insists on, the fully decomposed form (lowercase a,
>> combining acute); thus the names are actually different even though
>> they _look_ the same.
> Apparently, they use NFD (Normalization Form D). Normalization forms
> are necessary to avoid a whole lot of problems, although Ada requires
> normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
> implementation defined if the text is not in NFC.

That reference specifies NFKC which I suppose is near! GNAT uses this if
either you compile with -gnatW8 or the file begins with a UTF8 BOM.

The problems I've noted in this thread in the GNAT implementation are
two:

(1) On Windows and macOS (and possibly on VMS, not sure if that's
relevant any more) the file name corresponding to a unit name is
converted to lower-case assuming it's Latin-1 -
System.Case_Util.To_Lower,

   function To_Lower (A : Character) return Character is
      A_Val : constant Natural := Character'Pos (A);
   begin
      if A in 'A' .. 'Z'
        or else A_Val in 16#C0# .. 16#D6#
        or else A_Val in 16#D8# .. 16#DE#
      then
         return Character'Val (A_Val + 16#20#);
      else
         return A;
      end if;
   end To_Lower;

This is the problem that prevents use of extended characters in unit
names.

(2) On macOS, the expected file name appears to be stored in NFC, but is
retrieved from the file system in NFD.

It seems this will only cause a problem if you compile the file (on its
own, not as part of the closure of another file - weird - possibly
because the wildcard picks up the NFD representation, while compiling as
part of the closure uses the NFC representation in the ALI?) with -gnatwe:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f p*.ads -gnatwe
gcc -c -gnatwe páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
gnatmake: "páck3.ads" compilation error

(this message was copied from Terminal and pasted into Emacs, which
makes clear the difference between the two representations; previously
I've copied from Terminal and pasted into Safari/Bugzilla, which
produced identical glyphs).

J-P. Rosen

Jul 5, 2017, 7:21:00 AM
Le 05/07/2017 à 11:47, Simon Wright a écrit :
> That reference specifies NFKC which I suppose is near!
Not that near when it comes to ligatures and other crazy characters...
But you are right, it's NFKC.

> GNAT uses this if
> either you compile with -gnatW8 or the file begins with a UTF8 BOM.
Actually, this has nothing to do with encoding or coded character sets.
Even if you use Latin-1, the set of allowed characters is defined as
those that belong to NFKC.

> The problems I've noted in this thread in the GNAT implementation are
> two:
>
> (1) On Windows and macOS (and possibly on VMS, not sure if that's
> relevant any more) the file name corresponding to a unit name is
> converted to lower-case assuming it's Latin-1 -
> System.Case_Util.To_Lower,
I can talk about character issues since I gave that tutorial at AE'17...
How operating systems manage that, I don't know.

Randy Brukardt

Jul 5, 2017, 2:42:17 PM
"J-P. Rosen" <ro...@adalog.fr> wrote in message
news:ojihrl$qu2$1...@dont-email.me...
> Le 05/07/2017 à 11:47, Simon Wright a écrit :
>> That reference specifies NFKC which I suppose is near!
> Not that near when it comes to ligatures and other crazy characters...
> But you are right, it's NFKC.

Actually, you were right the first time, but it doesn't show up in the Ada
2012 as this is a recent correction (recall AI12-0004-1? It was just
approved by WG 9 at the June meeting). NFKC is *definitely* the wrong rule.

Note that we chose NFC in part because W3C recommends that all Internet
content be in NFC, and because it is the more compact representation. I'm
surprised that anyone would use NFD (since it can be three times larger than
NFC), but I suppose I shouldn't ever be surprised by the choices of others.
;-)

As always, you can see the *current* state of Ada by using the working draft
RM (see http://www.ada-auth.org/standards/ada2x.html). For this rule, that
is 2.1(4.1/5).

I suppose the working draft is a bit confusing for this use (that is,
Ada-Comment) as corrections (like this) take effect immediately upon WG 9
approval while amendments don't take effect until the next Standard update.
You can tell them apart by looking at the bottom of each subclause at the
"<something> from Ada 2012" (for instance, "Wording Changes from Ada
2012") -- "corrections" are identified that way, while amendments are not
identified specially.

Randy.


Shark8

Jul 6, 2017, 11:18:43 AM
On Tuesday, July 4, 2017 at 11:25:20 PM UTC-6, J-P. Rosen wrote:
> Le 04/07/2017 à 19:30, Shark8 a écrit :
> > This is why I maintain that unicode is crap -- a mistake along the
> > lines of C that will likely take *decades* for the rest of "the
> > industry" / computer science to realize.
> Please don't make such statements until you understand all the issues -
> the problem of character sets is incredibly complicated.

I'm not saying it isn't complicated; I'm saying that it could, and should, have been done better. Instead we get a bizarre Frankenstein's-monster of techniques where some character-glyphs are precomposed (with duplicates across multiple languages) and Zalgo-script is a thing. (see: https://eeemo.net/ )

Not only that, but there's the problem of strings; they could instead have done something sensible ("but wasteful"*) by designing a "multilanguage string" that partitions strings by language. Ex:

   type Language is (English, French, Russian); -- supported languages

   type Discriminated_String (Words : Language; Length : Natural) is record
      Data : String (1 .. Length); -- Sequence of code-points/characters.
   end record;

   package Discriminated_String_Vector is new Ada.Containers.Indefinite_Vectors
     (Index_Type => Positive, Element_Type => Discriminated_String);

   type Multi_Language_String is new Discriminated_String_Vector.Vector
     with null record;
   -- New primitive operations.
-- New primitive operations.

And *THERE* you have a sane framework for managing multilingual text; granted *most* text would only /need/ a single element vector because most text is not multi-lingual; that's ok. The important part here is that the languages are kept distinct and clearly indicated. (This would also allow far more maintainability than unicode's system because you could then allow independent subgroups to manage their own language.)

>
> >> I have to say that, great as it would be to have this fixed, the
> >> changes required would be extensive, and I can’t see that anyone
> >> would think it worth the trouble.
> > One of unicode's biggest problems is that there's no longer any
> > coherent vision -- it started off as a idea to offer one code-point
> > per character in human language, but then shifted to glyph-building
> > (hence combining characters), and as such lacks a unifying
> > principle.
> The unifying principle is the normalization forms. The fact that there
> are several normalization forms comes from the difference between human
> and computer needs.

Perhaps so, but there ought to be a way to identify such a context rather than just throwing these normalized forms in the UTF-string blender, shrugging, and handing it off to the programmers as "not my problem".

I mean as a counter-example ASN.1 has normalizing encodings like DER and CER, but these are (a) usually distinguished by being defined by their particular encoding, and when they aren't (b) are proper subsets of BER. [Much like subtypes in Ada and how we can use Natural & Positive for better describing our problem, but can use Integer when needed (i.e. foreign interfacing where the constraint might not be guaranteed).]


* -- Wasteful like keeping the bounds of an array seems wasteful to C programmers.

Simon Wright

Jul 6, 2017, 2:43:51 PM
"J-P. Rosen" <ro...@adalog.fr> writes:

>> GNAT uses this if
>> either you compile with -gnatW8 or the file begins with a UTF8 BOM.
> Actually, this has nothing to do with encoding or coded character sets.
> Even if you use Latin-1, the set of allowed characters is defined as
> those that belong to NFKC.

I don't understand.

If your source has no BOM and you don't say -gnatW8, GNAT expects
Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
expects UTF8 encoding (I haven't tried what happens if you use NFD).

I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
use in unit names - ARM 2.1(16) says it should be accepted.

(later) UTF8 is accepted in strings but not in identifiers.

J-P. Rosen

Jul 7, 2017, 4:19:57 AM
Le 06/07/2017 à 17:18, Shark8 a écrit :
> I'm not saying it isn't complicated; I'm saying that it could, and
> should, have been done better.
I'm willing to accept these kinds of statement only from people who
participated in the design...

> Instead we get a bizarre
> Frankenstein's-monster of techniques where some character-glyphs are
> precomposed (with duplicates across multiple languages) and
> Zalgo-script is a thing. (see: https://eeemo.net/ )
Yes, representation of characters is not unique. It's a compromise
between compactness, compatibility, exhaustiveness...

> Not only that, but there's the problem of strings; instead of doing
> something sensible ("but wasteful"*) by designing a "multilanguage
> string" that partitioned strings by language. Ex:
This is total confusion. Unicode is about coded sets and encodings, it
has nothing to do with languages and internationalization.

>> The unifying principle is the normalization forms. The fact that
>> there are several normalization forms comes from the difference
>> between human and computer needs.
>
> Perhaps so, but there ought to be a way to identify such a context
> rather than just throwing these normalized forms in the UTF-string
> blender, shrugging, and handing it off to the programmers as "not my
> problem".
Another confusion: normalization forms have nothing to do with encodings
(UTF or not). Normalization provides a unique representation of
composite characters that may be represented in several ways.

> I mean as a counter-example ASN.1 has normalizing encodings like DER
> and CER, but these are (a) usually distinguished by being defined by
> their particular encoding, and when they aren't (b) are proper
> subsets of BER. [Much like subtypes in Ada and how we can use Natural
> & Positive for better describing our problem, but can use Integer
> when needed (ie foreign interfacing where the constraint might not be
> guarenteed).]
I don't follow you here. ASN.1 is a representation of structured data,
and AFAIU does not specify which coded set is used.

J-P. Rosen

Jul 7, 2017, 4:26:10 AM
Le 06/07/2017 à 20:43, Simon Wright a écrit :
>> Even if you use Latin-1, the set of allowed characters is defined as
>> those that belong to NFKC.
> I don't understand.
>
> If your source has no BOM and you don't say -gnatW8, GNAT expects
> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>
> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
> use in unit names - ARM 2.1(16) says it should be accepted.
>
> (later) UTF8 is accepted in strings but not in identifiers.

This is a common confusion between characters, coded sets, and encodings...

ISO-10646 defines a coded set (code points) for a number of characters
(identical to the one defined by Unicode). Some of these characters can
be represented in NFKC. These are the allowed characters.

If you use Latin-1, you have different code points for the same
characters - and the allowed characters are still those representable in
NFKC, even with different code points.

UTF8 is an encoding, nothing more than a compression algorithm for
numerical values. It is generally used to compress Unicode strings, but
could be used for any numerical values. In any case, it doesn't change
logical values, just the way they are stored.
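
(The distinction can be made concrete with a small sketch, again in Python purely for illustration: the same character has a single code point, but different byte representations under different encodings.)

```python
c = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE

# The Unicode code point has the same numeric value as the Latin-1 code,
# but the stored bytes differ per encoding.
assert ord(c) == 0xE1
assert c.encode("latin-1") == b"\xe1"       # one byte under Latin-1
assert c.encode("utf-8") == b"\xc3\xa1"     # two bytes under UTF-8
assert b"\xc3\xa1".decode("utf-8") == c     # decoding restores the code point
```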

Simon Wright

Jul 7, 2017, 7:01:17 AM
"J-P. Rosen" <ro...@adalog.fr> writes:

> Le 06/07/2017 à 20:43, Simon Wright a écrit :
>>> Even if you use Latin-1, the set of allowed characters is defined as
>>> those that belong to NFKC.
>> I don't understand.
>>
>> If your source has no BOM and you don't say -gnatW8, GNAT expects
>> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
>> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>>
>> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
>> use in unit names - ARM 2.1(16) says it should be accepted.
>>
>> (later) UTF8 is accepted in strings but not in identifiers.
>
> This is a common confusion between characters, coded sets, and encodings...
>
> ISO-10646 defines a coded set (code points) for a number of characters
> (identical to the one defined by Unicode). Some of these characters can
> be represented in NFKC. These are the allowed characters.
>
> If you use Latin-1, you have different code points for the same
> characters - and the allowed characters are still those representable in
> NFKC, even with different code points.
>
> UTF8 is an encoding, nothing more than a compression algorithm for
> numerical values. It is generally used to compress Unicode strings, but
> could be used for any numerical values. In any case, it doesn't change
> logical values, just the way they are stored.

I think this is a response to my "I don't understand" - I think I do
understand a little better now, thank you.

The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says

"An Ada implementation shall accept Ada source code in UTF-8
encoding, with or without a BOM (see A.4.11), where every character
is represented by its code point."

which for GNAT is not met unless either there is a BOM or -gnatW8 is
used.

On the other hand, ARM 2.1(4/3) says "The coded representation for
characters is implementation defined", which seems to conflict with (16)
- but then, the AARM ramification (4.b/2) notes that the rule doesn't
have much force!

Jacob Sparre Andersen

Jul 7, 2017, 7:49:59 AM
Simon Wright wrote:

> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>
> "An Ada implementation shall accept Ada source code in UTF-8
> encoding, with or without a BOM (see A.4.11), where every character
> is represented by its code point."
>
> which for GNAT is not met unless either there is a BOM or -gnatW8 is
> used.

Which sounds perfectly okay.

There are no limitations to which command-line arguments a program can
require to behave like an Ada compiler.

> On the other hand, ARM 2.1(4/3) says "The coded representation for
> characters is implementation defined", which seems to conflict with
> (16) - but then, the AARM ramification (4.b/2) notes that the rule
> doesn't have much force!

That sounds like the classical wording.

I suppose that the intent is that UTF-8 encoded ISO-10646 (in the right
normalization form) _has_ to be supported, but that any other encoding
is allowed in addition to that.

It would of course be nice if that was also what the ARM actually said.

Greetings,

Jacob
--
"Only Hogwarts students really need spellcheckers"
-- An anonymous RISKS reader

Randy Brukardt

Jul 7, 2017, 3:40:20 PM
"Simon Wright" <si...@pushface.org> wrote in message
news:lybmow1...@pushface.org...
...
> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>
> "An Ada implementation shall accept Ada source code in UTF-8
> encoding, with or without a BOM (see A.4.11), where every character
> is represented by its code point."
>
> which for GNAT is not met unless either there is a BOM or -gnatW8 is
> used.

The Standard says "shall accept"; it has nothing to say about what
handstands are needed to get the required behavior. If GNAT required you to
chant "Ada is Great" toward New York and then Paris before accepting UTF-8
source, it would still meet the requirement of the Standard. Certainly
requiring the use of -gnatW8 to get the language required behavior is
acceptable (recall that you have to use -gnatE and used to have to
use -gnato to get the language required behavior in other areas).

> On the other hand, ARM 2.1(4/3) says "The coded representation for
> characters is implementation defined", which seems to conflict with (16)
> - but then, the AARM ramification (4.b/2) notes that the rule doesn't
> have much force!

An implementation can have other encodings (which are
implementation-defined). The new rule (2.1(16/3)) mainly just reflects that
practically, an Ada compiler has to be able to accept the source of the
ACATS; we decided to require that in the Standard so that there is a
standard source form that every compiler is going to support. Thus it is now
possible to portably write Ada source code as well as write a portable Ada
program. (Practically, this was always true, but it's better to have it
written in the Standard.)

Randy.


Randy Brukardt

Jul 7, 2017, 3:44:19 PM
"Jacob Sparre Andersen" <ja...@jacob-sparre.dk> wrote in message
news:87inj4x...@jacob-sparre.dk...
...
>> On the other hand, ARM 2.1(4/3) says "The coded representation for
>> characters is implementation defined", which seems to conflict with
>> (16) - but then, the AARM ramification (4.b/2) notes that the rule
>> doesn't have much force!
>
> That sounds like the classical wording.
>
> I suppose that the intent is that UTF-8 encoded ISO-10646 (in the right
> normalization form) _has_ to be supported, but that any other encoding
> is allowed in addition to that.

Precisely.

> It would of course be nice if that was also what the ARM actually said.

Mostly we're not changing text that doesn't have to be changed. In some
cases, it would make more sense if it was changed, but since every change
has a potential for errors and unintended consequences, its often best to
leave stuff alone. (There are many cases where a "simple" change broke
something else, leading to repeated fixes.)

Randy.


Simon Wright

Jul 7, 2017, 5:02:09 PM
"Randy Brukardt" <ra...@rrsoftware.com> writes:

> "Simon Wright" <si...@pushface.org> wrote in message
> news:lybmow1...@pushface.org...
> ...
>> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>>
>> "An Ada implementation shall accept Ada source code in UTF-8
>> encoding, with or without a BOM (see A.4.11), where every character
>> is represented by its code point."
>>
>> which for GNAT is not met unless either there is a BOM or -gnatW8 is
>> used.
>
> The Standard says "shall accept"; it has nothing to say about what
> handstands are needed to get the required behavior

I suppose I'm more used to military requirements, where (IMO) handstands
would be unacceptable, and "shall accept" means just that. Perhaps
"shall be able to accept"? But (having read your other note) I see why
this isn't going to change.