Ada and Unicode

DrPi

unread,

Apr 17, 2021, 6:03:14 PM4/17/21

to

Hi,

I have a good knowledge of Unicode : code points, encoding...
What I don't understand is how to manage Unicode strings with Ada. I've
read part of ARM and did some tests without success.

I managed to be partly successful with source code encoded in Latin-1.
Any other encoding failed.
Any way to use source code encoded in UTF-8 ?
In some languages, it is possible to set a tag at the beginning of the
source file to direct the compiler which encoding to use.
I wasn't successful using -gnatW8 switch. But maybe I made to many tests
and my brain was scrambled.

Even with source code encoded in Latin-1, I've not been able to manage
Unicode strings correctly.

What's the way to manage Unicode correctly ?

Regards,
Nicolas

Luke A. Guest

unread,

Apr 17, 2021, 8:02:02 PM4/17/21

to

On 17/04/2021 23:03, DrPi wrote:
> Hi,
>
> I have a good knowledge of Unicode : code points, encoding...
> What I don't understand is how to manage Unicode strings with Ada. I've
> read part of ARM and did some tests without success.

It's a mess imo. I've complained about it before. The official stance is
that the standard defines that a compiler should accept the ISO
equivalent of Unicode and that a compiler should implement a flawed
system, especially UTF-8 types,
http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html

Unicode is a bit painful, I've messed about with it to some degree here
https://github.com/Lucretia/uca.

There are other attempts:

1. http://www.dmitry-kazakov.de/ada/strings_edit.htm
2. https://github.com/reznikmm/matreshka (very heavy, many layers)
3. https://github.com/Blady-Com/UXStrings

I remember getting an exception converting from my unicode_string to a
wide_wide string for some reason ages ago.

Maxim Reznik

unread,

Apr 19, 2021, 4:29:36 AM4/19/21

to

воскресенье, 18 апреля 2021 г. в 01:03:14 UTC+3, DrPi:

>
> Any way to use source code encoded in UTF-8 ?

Yes, with GNAT just use "-gnatW8" for compiler flag (in command line or your project file):

-- main.adb:
with Ada.Wide_Wide_Text_IO;

procedure Main is
Привет : constant Wide_Wide_String := "Привет";
begin
Ada.Wide_Wide_Text_IO.Put_Line (Привет);
end Main;

$ gprbuild -gnatW8 main.adb
$ ./main
Привет

> In some languages, it is possible to set a tag at the beginning of the
> source file to direct the compiler which encoding to use.

You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file. Take a look:

-- main.adb:
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;

procedure Main is
Привет : constant Wide_Wide_String := "Привет";
begin
Ada.Wide_Wide_Text_IO.Put_Line (Привет);
end Main;

$ gprbuild main.adb
$ ./main
Привет

> What's the way to manage Unicode correctly ?
>

You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to process Unicode strings. But this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexp, string vectors, XML, JSON, databases, Web Servlets, template engine, etc.). URL: https://forge.ada-ru.org/matreshka

> Regards,
> Nicolas

Stephen Leake

unread,

Apr 19, 2021, 5:08:40 AM4/19/21

to

DrPi <3...@drpi.fr> writes:

> Any way to use source code encoded in UTF-8 ?

for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");

from the gnat user guide, 4.3.1 Alphabetical List of All Switches:

`-gnati`c''
Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w). For details
of the possible selections for `c', see *note Character Set
Control: 4e.

This applies to identifiers in the source code

`-gnatW`e''
Wide character encoding method (`e'=n/h/u/s/e/8).

This applies to string and character literals.

> What's the way to manage Unicode correctly ?

There are two issues: Unicode in source code, that the compiler must
understand, and Unicode in strings, that your program must understand.

(I've never written a program that dealt with utf strings other than
file names).

-gnati8 tells the compiler that the source code uses utf-8 encoding.

-gnatW8 tells the compiler that string literals use utf-8 encoding.

package Ada.Strings.UTF_Encoding provides some facilities for dealing
with utf. It does _not_ provide walking a string by code point, which
would seem necessary.

We could be more helpful if you show what you are trying to do, you've
tried, and what errors you got.

--
-- Stephe

DrPi

unread,

Apr 19, 2021, 5:09:39 AM4/19/21

to

Thanks

DrPi

unread,

Apr 19, 2021, 5:28:39 AM4/19/21

to

Wide and Wide_Wide characters and UTF-8 are two distinct things.
Wide and Wide_Wide characters are supposed to contain Unicode code
points (Unicode characters).
UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.
What's the purpose of "pragma Wide_Character_Encoding (UTF8);" ?

>
>
>> What's the way to manage Unicode correctly ?
>>
>
> You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to process Unicode strings. But this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexp, string vectors, XML, JSON, databases, Web Servlets, template engine, etc.). URL: https://forge.ada-ru.org/matreshka

Thanks
>
>> Regards,
>> Nicolas

Dmitry A. Kazakov

unread,

Apr 19, 2021, 5:34:29 AM4/19/21

to

On 2021-04-19 11:08, Stephen Leake wrote:

> (I've never written a program that dealt with utf strings other than
> file names).
>
> -gnati8 tells the compiler that the source code uses utf-8 encoding.
>
> -gnatW8 tells the compiler that string literals use utf-8 encoding.

Both are recipes for disaster, especially the second. IMO the source
must be strictly ASCII 7-bit. It is less dangerous to have UTF-8 or
Latin-1 identifiers, they could be at least checked, except when used
for external names. But string literals would be a ticking bomb.

If you need a wider set than ASCII, use named constants and integer
literals. E.g.

Celsius : constant String := Character'Val (16#C2#) &
Character'Val (16#B0#) & 'C';

> We could be more helpful if you show what you are trying to do, you've
> tried, and what errors you got.

True

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Simon Wright

unread,

Apr 19, 2021, 7:15:33 AM4/19/21

to

Maxim Reznik <rezn...@gmail.com> writes:

> воскресенье, 18 апреля 2021 г. в 01:03:14 UTC+3, DrPi:
>>
>> Any way to use source code encoded in UTF-8 ?
>
> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line
> or your project file):

But don't use unit names containing international characters, at any
rate if you're (interested in compiling on) Windows or macOS:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

Luke A. Guest

unread,

Apr 19, 2021, 7:50:34 AM4/19/21

to

On 19/04/2021 12:15, Simon Wright wrote:
> Maxim Reznik <rezn...@gmail.com> writes:
>
>> воскресенье, 18 апреля 2021 г. в 01:03:14 UTC+3, DrPi:
>>>
>>> Any way to use source code encoded in UTF-8 ?
>>
>> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line
>> or your project file):
>
> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:

There's no such thing as "character" any more and we need to move away
from that. Unicode has the concept of a code point which is 32 bit and
any "character" as we know it, or glyph, can consist of multiple code
points.

In my lib, nowhere near ready (whether it will be I don't know), I
define octets, Unicode_String (utf-8 string) which is array of octets
and Code_Points which an iterator produces as it iterates over those
strings. I was intending to have an iterator for grapheme clusters and
other units.

Luke A. Guest

unread,

Apr 19, 2021, 7:56:29 AM4/19/21

to

On 19/04/2021 10:08, Stephen Leake wrote:
>> What's the way to manage Unicode correctly ?
>
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.

And this is there the Ada standard gets it wrong, in the encodings
package re utf-8.

Unicode is a superset of 7-bit ASCII not Latin 1. The high bit in the
leading octet indicates whether there are trailing octets. See
https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data
layout. The first 128 "characters" in Unicode match that of 7-bit ASCII,
not 8-bit ASCII, and certainly not Latin 1. Therefore this:

package Ada.Strings.UTF_Encoding
...
subtype UTF_8_String is String;
...
end Ada.Strings.UTF_Encoding;

Was absolutely and totally wrong.

Luke A. Guest

unread,

Apr 19, 2021, 8:13:11 AM4/19/21

to

On 19/04/2021 12:56, Luke A. Guest wrote:

>
> package Ada.Strings.UTF_Encoding
> ...
> subtype UTF_8_String is String;
> ...
> end Ada.Strings.UTF_Encoding;
>
> Was absolutely and totally wrong.

...and, before someone comes back with "but all the upper half of latin
1" are represented and have the same values." Yes, they do, in Code
points which is a 32 bit number. In UTF-8 they are encoded as 2 octets!

Dmitry A. Kazakov

unread,

Apr 19, 2021, 8:52:45 AM4/19/21

to

It is practical solution. Ada type system cannot express differently
represented/constrained string/array/vector subtypes. Ignoring Latin-1
and using String as if it were an array of octets is the best available
solution.

Luke A. Guest

unread,

Apr 19, 2021, 9:00:35 AM4/19/21

to

On 19/04/2021 13:52, Dmitry A. Kazakov wrote:

> It is practical solution. Ada type system cannot express differently
represented/constrained string/array/vector subtypes. Ignoring Latin-1
and using String as if it were an array of octets is the best available
solution.
>

They're different types and should be incompatible, because, well, they
are. What does Ada have that allows for this that other languages
doesn't? Oh yeah! Types!

Dmitry A. Kazakov

unread,

Apr 19, 2021, 9:10:51 AM4/19/21

to

They are subtypes, differently constrained, like Positive and Integer.
Operations are same values are differently constrained. It does not make
sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It is
same glyph differently encoded. Encoding is a representation aspect,
ergo out of the interface!

BTW, subtype is a type.

Luke A. Guest

unread,

Apr 19, 2021, 9:15:18 AM4/19/21

to

On 19/04/2021 14:10, Dmitry A. Kazakov wrote:

>> They're different types and should be incompatible, because, well,
> they are. What does Ada have that allows for this that other languages
> doesn't? Oh yeah! Types!
>
> They are subtypes, differently constrained, like Positive and Integer.

No they're not. They're subtypes only and therefore compatible. The UTF
string isn't constrained in any other ways.

> Operations are same values are differently constrained. It does not make
> sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It is
> same glyph differently encoded. Encoding is a representation aspect,
> ergo out of the interface!

As I already said in Unicode the glyph is not part part of Unicode. The
single code point character concept doesn't exist anymore.

>
> BTW, subtype is a type.
>

subtype is a compatible type.

Vadim Godunko

unread,

Apr 19, 2021, 9:18:23 AM4/19/21

to

Ada doesn't have good Unicode support. :( So, you need to find suitable set of "workarounds".

There are few different aspects of Unicode support need to be considered:

1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use -gnatW8 switch and it will require use of Wide_Wide_String everywhere.
2. Internal representation during application execution. You are forced to use Wide_Wide_String at previous step, so it will be UCS4/UTF32.
3. Text encoding/decoding on input/output operations. GNAT allows to use UTF-8 by providing some magic string for Form parameter of Text_IO.

It is hard to say that it is reasonable set of features for modern world. To fix some of drawbacks of current situation we are developing new text processing library, know as VSS.

https://github.com/AdaCore/VSS

At current stage it provides encoding independent API for text manipulation, encoders and decoders API for I/O, and JSON reader/writer; regexp support should come soon.

Encoding independent API means that application always use Unicode characters to process text, independently from the real encoding used to store information in memory (UTF-8 is used for now, UTF-16 will be added later for interoperability with Windows API and WASM). Coders and encoders allow translation from/to different encodings when application exchange information with the world.

J-P. Rosen

unread,

Apr 19, 2021, 9:24:36 AM4/19/21

to

Le 19/04/2021 à 15:00, Luke A. Guest a écrit :
> They're different types and should be incompatible, because, well, they
> are. What does Ada have that allows for this that other languages
> doesn't? Oh yeah! Types!

They are not so different. For example, you may read the first line of a
file in a string, then discover that it starts with a BOM, and thus
decide it is UTF-8.

BTW, the very first version of this AI had different types, but the ARG
felt that it would just complicate the interface for the sake of abusive
"purity".

--
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52
https://www.adalog.fr

Dmitry A. Kazakov

unread,

Apr 19, 2021, 9:31:30 AM4/19/21

to

On 2021-04-19 15:15, Luke A. Guest wrote:
> On 19/04/2021 14:10, Dmitry A. Kazakov wrote:
>
>>> They're different types and should be incompatible, because, well,
>> they are. What does Ada have that allows for this that other languages
>> doesn't? Oh yeah! Types!
>>
>> They are subtypes, differently constrained, like Positive and Integer.
>
> No they're not. They're subtypes only and therefore compatible. The UTF
> string isn't constrained in any other ways.

Of course it is. There could be string encodings that have no Unicode
counterparts and thus missing in UTF-8/16.

>> Operations are same values are differently constrained. It does not
>> make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It
>> is same glyph differently encoded. Encoding is a representation
>> aspect, ergo out of the interface!
>
> As I already said in Unicode the glyph is not part part of Unicode. The
> single code point character concept doesn't exist anymore.

It does not matter from practical point of view. Some Unicode's
idiosyncrasies are better ignored.

>> BTW, subtype is a type.

> subtype is a compatible type.

Ada subtype is both a sub- and supertype, i.e. substitutable [or so the
compiler thinks] in both directions. A derived tagged type is
substitutable in only one direction.

Neither is fully "compatible", because otherwise there would be no
reason to have an exactly same thing.

Maxim Reznik

unread,

Apr 19, 2021, 9:50:43 AM4/19/21

to

понедельник, 19 апреля 2021 г. в 12:28:39 UTC+3, DrPi:

> Le 19/04/2021 à 10:29, Maxim Reznik a écrit :
> > воскресенье, 18 апреля 2021 г. в 01:03:14 UTC+3, DrPi:

> >> In some languages, it is possible to set a tag at the beginning of the
> >> source file to direct the compiler which encoding to use.
> >
> > You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file.
> >

> Wide and Wide_Wide characters and UTF-8 are two distinct things.
> Wide and Wide_Wide characters are supposed to contain Unicode code
> points (Unicode characters).
> UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.

Yes, it is.

> What's the purpose of "pragma Wide_Character_Encoding (UTF8);" ?

This pragma specifies the character encoding to be used in program source text...

https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding

I would suggest also this article to read:

https://two-wrongs.com/unicode-strings-in-ada-2012

Best regards,

DrPi

unread,

Apr 19, 2021, 11:48:23 AM4/19/21

to

A code point has no size. Like universal integers in Ada.

DrPi

unread,

Apr 19, 2021, 11:51:48 AM4/19/21

to

Le 19/04/2021 à 15:50, Maxim Reznik a écrit :
> понедельник, 19 апреля 2021 г. в 12:28:39 UTC+3, DrPi:
>> Le 19/04/2021 à 10:29, Maxim Reznik a écrit :
>>> воскресенье, 18 апреля 2021 г. в 01:03:14 UTC+3, DrPi:
>>>> In some languages, it is possible to set a tag at the beginning of the
>>>> source file to direct the compiler which encoding to use.
>>>
>>> You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file.
>>>
>> Wide and Wide_Wide characters and UTF-8 are two distinct things.
>> Wide and Wide_Wide characters are supposed to contain Unicode code
>> points (Unicode characters).
>> UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.
>
> Yes, it is.
>
>> What's the purpose of "pragma Wide_Character_Encoding (UTF8);" ?
>
> This pragma specifies the character encoding to be used in program source text...
>
> https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding

Good to know.

>
> I would suggest also this article to read:
>
> https://two-wrongs.com/unicode-strings-in-ada-2012
>

I think I've already read it. But will do again.

> Best regards,
>
Thanks

DrPi

unread,

Apr 19, 2021, 11:53:28 AM4/19/21

to

Good to know.
Thanks

DrPi

unread,

Apr 19, 2021, 12:07:37 PM4/19/21

to

I agree.

In Python2, encoded and "decoded" strings are of same type "str". Bad
design.

In Python3, "decoded" strings are of type "str" and encoded strings are
of type "bytes" (byte array). Both are different things and can't be
assigned one to the other. Much more clear for the programmer.
It should the same in Ada. Different types.

DrPi

unread,

Apr 19, 2021, 12:14:42 PM4/19/21

to

Le 19/04/2021 à 11:08, Stephen Leake a écrit :
> DrPi <3...@drpi.fr> writes:
>
>> Any way to use source code encoded in UTF-8 ?
>
> for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
>

That's interesting.
Using these switches at project level is not OK. Project source files
not always use the same encoding. Especially when using libraries.
Using these switches at source level is better. A little bit complicated
to use but better.

Björn Lundin

unread,

Apr 19, 2021, 1:12:41 PM4/19/21

to

Den 2021-04-19 kl. 18:14, skrev DrPi:
>> for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
>>
> That's interesting.
> Using these switches at project level is not OK. Project source files
> not always use the same encoding. Especially when using libraries.
> Using these switches at source level is better. A little bit complicated
> to use but better.

You did understand that the above setting only applies to the file
called 'non_ascii.ads' - and not to the rest of the files?

--
Björn

DrPi

unread,

Apr 19, 2021, 3:44:37 PM4/19/21

to

Yes, that's what I've understood.

Shark8

unread,

Apr 19, 2021, 6:40:37 PM4/19/21

to

On Saturday, April 17, 2021 at 4:03:14 PM UTC-6, DrPi wrote:
> Hi,

>
> I have a good knowledge of Unicode : code points, encoding...
> What I don't understand is how to manage Unicode strings with Ada. I've
> read part of ARM and did some tests without success.
>
> I managed to be partly successful with source code encoded in Latin-1.

Ah.
Yes, this is an issue in GNAT, and possibly other compilers.
The easiest method for me is to right-click the text-buffer for the file in GPS, click properties in the menu that pops up, then in the dialog select from the Character Set drop-down "Unicode UTF-#".
> Any other encoding failed.

> Any way to use source code encoded in UTF-8 ?

There's the above method with GPS.
IIRC there's also a Pragma and a compiler-flag for GNAT.

It's actually a non-issue for Byron, because the file-reader does a BOM-check [IIRC defaulting to ASCII in the absence of a BOM] and outputs to the lexer the Wide_Wide_Character equivalent of the input-encoding.
See: https://github.com/OneWingedShark/Byron/blob/master/src/reader/readington.adb

> In some languages, it is possible to set a tag at the beginning of the
> source file to direct the compiler which encoding to use.

> I wasn't successful using -gnatW8 switch. But maybe I made to many tests
> and my brain was scrambled.

IIRC the gnatW8 flag sets it to UTF-8, so if your editor is saving in something else like UTF-16 BE, the compiler [probably] won't read it correctly.

> Even with source code encoded in Latin-1, I've not been able to manage
> Unicode strings correctly.
>

> What's the way to manage Unicode correctly ?

I typically use the GPS file/properties method above, and then I might also use the pragma.

Simon Wright

unread,

Apr 20, 2021, 11:05:09 AM4/20/21

to

Shark8 <onewing...@gmail.com> writes:

> It's actually a non-issue for Byron, because the file-reader does a
> BOM-check [IIRC defaulting to ASCII in the absence of a BOM]

GNAT does a BOM-check also. gnatchop does one better, carrying the BOM
from the top of the input file through to each output file.

Randy Brukardt

unread,

Apr 20, 2021, 3:06:29 PM4/20/21

to

"Luke A. Guest" <lag...@archeia.com> wrote in message
news:s5jute$1s08$1...@gioia.aioe.org...

If they're incompatible, you need an automatic way to convert between
representations, since these are all views of the same thing (an abstract
string type). You really don't want 35 versions of Open each taking a
different string type.

It's the fact that Ada can't do this that makes Unbounded_Strings unusable
(well, barely usable). Ada 202x fixes the literal problem at least, but we'd
have to completely abandon Unbounded_Strings and use a different library
design in order for for it to allow literals. And if you're going to do
that, you might as well do something about UTF-8 as well -- but now you're
going to need even more conversions. Yuck.

I think the only true solution here would be based on a proper abstract
Root_String type. But that wouldn't work in Ada, since it would be
incompatible with all of the existing code out there. Probably would have to
wait for a follow-on language.

Randy.

Randy Brukardt

unread,

Apr 20, 2021, 3:13:33 PM4/20/21

to

"J-P. Rosen" <ro...@adalog.fr> wrote in message
news:s5k0ai$bb5$1...@dont-email.me...

> Le 19/04/2021 ą 15:00, Luke A. Guest a écrit :
>> They're different types and should be incompatible, because, well, they
>> are. What does Ada have that allows for this that other languages
>> doesn't? Oh yeah! Types!
>
> They are not so different. For example, you may read the first line of a
> file in a string, then discover that it starts with a BOM, and thus decide
> it is UTF-8.
>
> BTW, the very first version of this AI had different types, but the ARG
> felt that it would just complicate the interface for the sake of abusive
> "purity".

Unfortunately, that was the first instance that showed the beginning of the
end for Ada. If I remember correctly (and I may not ;-), that came from some
people who were wedded to the Linux model where nothing is checked (or IMHO,
typed). For them, a String is simply a bucket of octets. That prevented
putting an encoding of any sort of any type on file names ("it should just
work on Linux, that's what people expect"). The rest follows from that.

Those of us who care about strong typing were disgusted, the result
essentially does not work on Windows or MacOS (which do check the content of
file names - as you can see in GNAT compiling units with non-Latin-1
characters in their names), and I don't really expect any recovery from
that.

Randy.

Randy Brukardt

unread,

Apr 20, 2021, 3:17:41 PM4/20/21

to

"Simon Wright" <si...@pushface.org> wrote in message
news:lybla95...@pushface.org...

That's what the documentation says, but it didn't work on ACATS source files
(the few which use Unicode start with a BOM). I had to write a bunch of
extra code in the script generator to stick the options on the Unicode files
(that worked). Perhaps that's been fixed since, but I wouldn't trust it
(burned once, twice shy).

Randy.

Simon Wright

unread,

Apr 20, 2021, 4:04:10 PM4/20/21

to

It does now: just checked again with c250001, c250002.

Thomas

unread,

Apr 3, 2022, 12:51:59 PM4/3/22

to

In article <f9d91cb0-c9bb-4d42...@googlegroups.com>,

Vadim Godunko <vgod...@gmail.com> wrote:

> On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:

> > What's the way to manage Unicode correctly ?
> >
>
> Ada doesn't have good Unicode support. :( So, you need to find suitable set
> of "workarounds".
>
> There are few different aspects of Unicode support need to be considered:
>
> 1. Representation of string literals. If you want to use non-ASCII characters
> in source code, you need to use -gnatW8 switch and it will require use of
> Wide_Wide_String everywhere.
> 2. Internal representation during application execution. You are forced to
> use Wide_Wide_String at previous step, so it will be UCS4/UTF32.

> It is hard to say that it is reasonable set of features for modern world.

I don't think Ada would be lacking that much, for having good UTF-8
support.

the cardinal point is to be able to fill a
Ada.Strings.UTF_Encoding.UTF_8_String with a litteral.
(once you got it, when you'll try to fill a Standard.String with a
non-Latin-1 character, it'll make an error, i think it's fine :-) )

does Ada 202x allow it ?

if not, it would probably be easier if it was
type UTF_8_String is new String;
instead of
subtype UTF_8_String is String;

for all subprograms it's quite easy:
we just have to duplicate them with the new type, and to mark the old
one as Obsolescent.

but, now that "subtype UTF_8_String" exists, i don't know what we can do
for types.
(is the only way to choose a new name?)

> To
> fix some of drawbacks of current situation we are developing new text
> processing library, know as VSS.
>
> https://github.com/AdaCore/VSS

(are you working at AdaCore ?)

--
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

Thomas

unread,

Apr 3, 2022, 1:24:14 PM4/3/22

to

In article <s5k0ne$opv$1...@gioia.aioe.org>,

"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote:

> On 2021-04-19 15:15, Luke A. Guest wrote:
> > On 19/04/2021 14:10, Dmitry A. Kazakov wrote:
> >
> >>> They're different types and should be incompatible, because, well,
> >> they are. What does Ada have that allows for this that other languages
> >> doesn't? Oh yeah! Types!
> >>
> >> They are subtypes, differently constrained, like Positive and Integer.
> >
> > No they're not. They're subtypes only and therefore compatible. The UTF
> > string isn't constrained in any other ways.
>
> Of course it is. There could be string encodings that have no Unicode
> counterparts and thus missing in UTF-8/16.

1
there is missing a validity function to tell weather a given
UTF_8_String is valid or not,
and a Dynamic_Predicate on the subtype UTF_8_String connected to the
function.

2
more important, (when non-ASCII,) valid UTF_8_String *do not* represent
the same thing as themselves converted to String.

>
> >> Operations are same values are differently constrained. It does not
> >> make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It
> >> is same glyph differently encoded. Encoding is a representation
> >> aspect, ergo out of the interface!

it works because 'a' is ASCII.
if you try it with a non-ASCII character, all goes wrong.

Thomas

unread,

Apr 3, 2022, 2:04:38 PM4/3/22

to

In article <s5k0ai$bb5$1...@dont-email.me>, "J-P. Rosen" <ro...@adalog.fr>
wrote:

> Le 19/04/2021 à 15:00, Luke A. Guest a écrit :
> > They're different types and should be incompatible, because, well, they
> > are. What does Ada have that allows for this that other languages
> > doesn't? Oh yeah! Types!
>
> They are not so different. For example, you may read the first line of a
> file in a string, then discover that it starts with a BOM, and thus
> decide it is UTF-8.

could you give me an example of sth that you can do yet, and you could
not do if UTF_8_String was private, please?
(to discover that it starts with a BOM, you must look at it.)

>
> BTW, the very first version of this AI had different types, but the ARG
> felt that it would just complicate the interface for the sake of abusive
> "purity".

could you explain "abusive purity" please?

i guess it is because of ASCII.
i guess a lot of developpers use only ASCII in a lot of situation, and
they would find annoying to need Ada.Strings.UTF_Encoding.Strings every
time.

but I think a simple explicit conversion is acceptable, for a not fully
compatible type which requires some attention.

the best would be to be required to use ASCII_String as intermediate,
but i don't know how it could be designed at language level:

UTF_8_Var := UTF_8_String (ASCII_String (Latin_1_Var));
Latin_1_Var:= String (ASCII_String (UTF_8_Var));

and this would be forbidden :
UTF_8_Var := UTF_8_String (Latin_1_Var);

this would ensures to raise Constraint_Error when there are somme
non-ASCII characters.

Thomas

unread,

Apr 3, 2022, 2:37:12 PM4/3/22

to

In article <s5n8nj$cec$1...@franka.jacob-sparre.dk>,

"Randy Brukardt" <ra...@rrsoftware.com> wrote:

> "Luke A. Guest" <lag...@archeia.com> wrote in message
> news:s5jute$1s08$1...@gioia.aioe.org...
> >
> >
> > On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
> >
> > > It is practical solution. Ada type system cannot express differently
> > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and
> > using String as if it were an array of octets is the best available
> > solution.
> > >
> >
> > They're different types and should be incompatible, because, well, they
> > are. What does Ada have that allows for this that other languages doesn't?
> > Oh yeah! Types!
>
> If they're incompatible, you need an automatic way to convert between
> representations, since these are all views of the same thing (an abstract
> string type). You really don't want 35 versions of Open each taking a
> different string type.

i need not 35 versions of Open.
i need a version of Open with an Unicode string type (not Latin-1 -
preferably UTF-8), which will use Ada.Strings.UTF_Encoding.Conversions
as far as needed, regarding the underlying API.

>
> It's the fact that Ada can't do this that makes Unbounded_Strings unusable
> (well, barely usable).

knowing Ada, i find it acceptable.
i don't say the same about Ada.Strings.UTF_Encoding.UTF_8_String.

> Ada 202x fixes the literal problem at least, but we'd
> have to completely abandon Unbounded_Strings and use a different library
> design in order for for it to allow literals. And if you're going to do
> that, you might as well do something about UTF-8 as well -- but now you're
> going to need even more conversions. Yuck.

as i said to Vadim Godunko, i need to fill a string type with an UTF-8
litteral.
but i don't think this string type has to manage various conversions.

from my point of view, each library has to accept 1 kind of string type
(preferably UTF-8 everywhere),
and then, this library has to make needed conversions regarding the
underlying API. not the user.

>
> I think the only true solution here would be based on a proper abstract
> Root_String type. But that wouldn't work in Ada, since it would be
> incompatible with all of the existing code out there. Probably would have to
> wait for a follow-on language.

of course, it would be very nice to have a more thicker language with a
garbage collector, only 1 String type which allows all what we need, etc.

Thomas

unread,

Apr 3, 2022, 3:20:21 PM4/3/22

to

In article <lyfszm5...@pushface.org>,

Simon Wright <si...@pushface.org> wrote:

> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

if i understand, Eric Botcazou is a gnu admin who decided to reject your bug?
i find him very "low portability thinking"!

it is the responsability of compilers and other underlying tools, to manage various underlying OS and FS,
not of the user to avoid those that the compiler devs find too bad!
(or to use the right encoding. i heard that Windows uses UTF-16, do you know about it?)

clearly, To_Lower takes Latin-1.
and this kind of problems would be easier to avoid if string types were stronger ...

after:

package Ada.Strings.UTF_Encoding
...

type UTF_8_String is new String;

...
end Ada.Strings.UTF_Encoding;

i would have also made:

package Ada.Directories
...
type File_Name_String is new Ada.Strings.UTF_Encoding.UTF_8_String;
...
end Ada.Directories;

with probably a validity check and a Dynamic_Predicate which allows "".

then, i would use File_Name_String in all Ada.Directories and Ada.*_IO.

Vadim Godunko

unread,

Apr 4, 2022, 2:10:27 AM4/4/22

to

On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
>
> > But don't use unit names containing international characters, at any
> > rate if you're (interested in compiling on) Windows or macOS:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
>

> and this kind of problems would be easier to avoid if string types were stronger ...
>

Your suggestion is unable to resolve this issue on Mac OS X. Like case sensitivity, binary compare of two strings can't compare strings in different normalization forms. Right solution is to use right type to represent any paths, and even it doesn't resolve some issues, like relative paths and change of rules at mounting points.

Simon Wright

unread,

Apr 4, 2022, 10:19:20 AM4/4/22

to

I think that's a macOS problem that Apple aren't going to resolve* any
time soon! While banging my head against PR81114 recently, I found
(can't remember where) that (lower case a acute) and (lower case a,
combining acute) represent the same concept and it's up to
tools/operating systems etc to recognise that.

Emacs, too, has a problem: it doesn't recognise the 'combining' part of
(lower case a, combining acute), so what you see on your screen is "a'".

* I don't know how/whether clang addresses this.

Simon Wright

unread,

Apr 4, 2022, 10:33:24 AM4/4/22

to

Thomas <fantome.foru...@free.fr.invalid> writes:

> In article <lyfszm5...@pushface.org>,
> Simon Wright <si...@pushface.org> wrote:
>
>> But don't use unit names containing international characters, at any
>> rate if you're (interested in compiling on) Windows or macOS:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
>
> if i understand, Eric Botcazou is a gnu admin who decided to reject
> your bug? i find him very "low portability thinking"!

To be fair, he only suspended it - you can tell I didn't want to press
very far.

We could remove the part where the filename is smashed to lower-case as
if it were ASCII[1][2][3] (OK, perhaps Latin-1?) if the machine is
Windows or (Apple if not on aarch64!!!), but that still leaves the
filesystem name issue. Windows might be OK (code pages???)

[1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/adaint.c#L620
[2] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L812
[2] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L1490

Simon Wright

unread,

Apr 4, 2022, 11:11:26 AM4/4/22

to

Simon Wright <si...@pushface.org> writes:

> I think that's a macOS problem that Apple aren't going to resolve* any
> time soon! While banging my head against PR81114 recently, I found
> (can't remember where) that (lower case a acute) and (lower case a,
> combining acute) represent the same concept and it's up to
> tools/operating systems etc to recognise that.

[...]

> * I don't know how/whether clang addresses this.

It doesn't, so far as I can tell; has the exact same problem.

Randy Brukardt

unread,

Apr 4, 2022, 7:52:36 PM4/4/22

to

"Thomas" <fantome.foru...@free.fr.invalid> wrote in message
news:fantome.forums.tDeConte...@news.free.fr...
...

> as i said to Vadim Godunko, i need to fill a string type with an UTF-8

> litteral.but i don't think this string type has to manage various

> conversions.
>
> from my point of view, each library has to accept 1 kind of string type
> (preferably UTF-8 everywhere),
> and then, this library has to make needed conversions regarding the
> underlying API. not the user.

This certainly is a fine ivory tower solution, but it completely ignores two
practicalities in the case of Ada:

(1) You need to replace almost all of the existing Ada language defined
packages to make this work. Things that are deeply embedded in both
implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
have to change substantially. The result would essentially be a different
language, since the resulting libraries would not work with most existing
programs. They'd have to have different names (since if you used the same
names, you change the failures from compile-time to runtime -- or even
undetected -- which would be completely against the spirit of Ada), which
means that one would have to essentially start over learning and using the
resulting language. Calling it Ada would be rather silly, since it would be
practically incompatible (and it would make sense to use this point to
eliminate a lot of the cruft from the Ada design).

(2) One needs to be able to read and write data given whatever encoding the
project requires (that's often decided by outside forces, such as other
hardware or software that the project needs to interoperate with). That
means that completely hiding the encoding (or using a universal encoding)
doesn't fully solve the problems faced by Ada programmers. At a minimum, you
have to have a way to specify the encoding of files, streams, and hardware
interfaces (this sort of thing is not provided by any common target OS, so
it's not in any target API). That will greatly complicate the interface and
implementation of the libraries.

> ... of course, it would be very nice to have a more thicker language with
> a garbage collector ...

I doubt that you will ever see that in the Ada family, as analysis and
therefore determinism is a very important property for the language. Ada has
lots of mechanisms for managing storage without directly doing it yourself
(by calling Unchecked_Deallocation), yet none of them use any garbage
collection in a traditional sense. I could see more such mechanisms (an
ownership option on the line of Rust could easily manage storage at the same
time, since any object that could be orphaned could never be used again and
thus should be reclaimed), but standard garbage collection is too
non-deterministic for many of the uses Ada is put to.

Randy.

Vadim Godunko

unread,

Apr 5, 2022, 3:59:56 AM4/5/22

to

On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
> I think that's a macOS problem that Apple aren't going to resolve* any
> time soon! While banging my head against PR81114 recently, I found
> (can't remember where) that (lower case a acute) and (lower case a,
> combining acute) represent the same concept and it's up to
> tools/operating systems etc to recognise that.
>

And will not. It is application responsibility to convert file names to NFD to pass to OS. Also, application must compare any paths after conversion to NFD, it is important to handle more complicated cases when canonical reordering is applied.

J-P. Rosen

unread,

Apr 6, 2022, 2:57:01 PM4/6/22

to

Le 03/04/2022 à 21:04, Thomas a écrit :
>> They are not so different. For example, you may read the first line of a
>> file in a string, then discover that it starts with a BOM, and thus
>> decide it is UTF-8.
>
> could you give me an example of sth that you can do yet, and you could
> not do if UTF_8_String was private, please?
> (to discover that it starts with a BOM, you must look at it.)

Just what I said above, since a BOM is not a valid UTF-8 (otherwise, it
could not be recognized).

>>
>> BTW, the very first version of this AI had different types, but the ARG
>> felt that it would just complicate the interface for the sake of abusive
>> "purity".
>
> could you explain "abusive purity" please?
>

It was felt that in practice, being too strict in separating the types
would make things more difficult, without any practical gain. This has
been discussed - you may not agree with the outcome, but it was not made
out of pure lazyness

--
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52
https://www.adalog.fr

Randy Brukardt

unread,

Apr 6, 2022, 9:31:02 PM4/6/22

to

"J-P. Rosen" <ro...@adalog.fr> wrote in message

news:t2knpr$s26$1...@dont-email.me...
...

> It was felt that in practice, being too strict in separating the types
> would make things more difficult, without any practical gain. This has
> been discussed - you may not agree with the outcome, but it was not made
> out of pure lazyness

The problem with that, of course, is that it sends the wrong message
vis-a-vis strong typing and interfaces. If we abandon it at the first sign
of trouble, they we are saying that it isn't really that important.

In this particular case, the reason really came down to practicality: if you
want to do anything string-like with a UTF-8 string, making it a separate
type becomes painful. It wouldn't work with anything in Ada.Strings,
Ada.Text_IO, or Ada.Directories, even though most of the operations are
fine. And there was no political will to replace all of those things with
versions to use with proper universal strings.

Moreover, if you really want to do that, you have to hide much of the array
behavior of the Universal string. For instance, you can't allow willy-nilly
slicing or replacement: cutting a character representation in half or
setting an illegal representation has to be prohibited (operations that
would turn a valid string into an invalid string should always raise an
exception). That means you can't (directly) use built-in indexing and
slicing -- those have to go through some sort of functions. So you do pretty
much have to use a private type for universal strings (similar to
Ada.Strings.Bounded would be best, I think).

If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless operations
supported for array types (since things like slices are mainly useful for
string operations). So such a language should simplify the core
substantially by dropping many of those obsolete features (especially as
little of the library would be directly compatible anyway). So one should
end up with a new language that draws from Ada rather than something in Ada
itself. (It would be great if that language could make strings with
different capacities interoperable - a major annoyance with Ada. And
modernizing access types, generalizing resolution, and the like also would
be good improvements IMHO.)

Randy.

Simon Wright

unread,

Apr 8, 2022, 4:56:23 AM4/8/22

to

"Randy Brukardt" <ra...@rrsoftware.com> writes:

> If you had an Ada-like language that used a universal UTF-8 string
> internally, you then would have a lot of old and mostly useless
> operations supported for array types (since things like slices are
> mainly useful for string operations).

Just off the top of my head, wouldn't it be better to use UTF32-encoded
Wide_Wide_Character internally? (you would still have trouble with
e.g. national flag emojis :)

Dmitry A. Kazakov

unread,

Apr 8, 2022, 5:26:09 AM4/8/22

to

On 2022-04-08 10:56, Simon Wright wrote:
> "Randy Brukardt" <ra...@rrsoftware.com> writes:
>
>> If you had an Ada-like language that used a universal UTF-8 string
>> internally, you then would have a lot of old and mostly useless
>> operations supported for array types (since things like slices are
>> mainly useful for string operations).
>
> Just off the top of my head, wouldn't it be better to use UTF32-encoded
> Wide_Wide_Character internally?

Yep, that is the exactly the problem, a confusion between interface and
implementation.

Encoding /= interface, e.g. an interface of a string viewed as an array
of characters. That interface just same for ASCII, Latin-1, EBCDIC,
RADIX50, UTF-8 etc strings. Why do you care what is inside?

Ada type system's inability to implement this interface is another
issue. Usefulness of this interface is yet another. For immutable
strings it is quite useful. For mutable strings it might appear too
constrained, e.g. for packed encodings like UTF-8 and UTF-16.

Also this interface should have nothing to do with the interface of an
UTF-8 string as an array of octets or the interface of an UTF-16LE
string as an array of little endian words.

Since Ada cannot separate these interfaces, for practical purposes,
Strings are arrays of octets considered as UTF-8 encoding. The rest goes
into coding guidelines under the title "never ever do this."

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Simon Wright

unread,

Apr 8, 2022, 3:19:14 PM4/8/22

to

"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> writes:

> On 2022-04-08 10:56, Simon Wright wrote:
>> "Randy Brukardt" <ra...@rrsoftware.com> writes:
>>
>>> If you had an Ada-like language that used a universal UTF-8 string
>>> internally, you then would have a lot of old and mostly useless
>>> operations supported for array types (since things like slices are
>>> mainly useful for string operations).
>>
>> Just off the top of my head, wouldn't it be better to use
>> UTF32-encoded Wide_Wide_Character internally?
>
> Yep, that is the exactly the problem, a confusion between interface
> and implementation.

Don't understand. My point was that *when you are implementing this* it
mught be easier to deal with 32-bit charactrs/code points/whatever the
proper jargon is than with UTF8.

> Encoding /= interface, e.g. an interface of a string viewed as an
> array of characters. That interface just same for ASCII, Latin-1,
> EBCDIC, RADIX50, UTF-8 etc strings. Why do you care what is inside?

With a user's hat on, I don't. Implementers might have a different point
of view.

Simon Wright

unread,

Apr 8, 2022, 3:24:38 PM4/8/22

to

Isn't the compiler a tool? gnatmake? gprbuild? (gnatmake handles ACATS
c250002 provided you tell the compiler that the fs is case-sensitive,
gprbuild doesn't even manage that)

Dmitry A. Kazakov

unread,

Apr 8, 2022, 3:45:18 PM4/8/22

to

On 2022-04-08 21:19, Simon Wright wrote:
> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> writes:
>
>> On 2022-04-08 10:56, Simon Wright wrote:
>>> "Randy Brukardt" <ra...@rrsoftware.com> writes:
>>>
>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>> internally, you then would have a lot of old and mostly useless
>>>> operations supported for array types (since things like slices are
>>>> mainly useful for string operations).
>>>
>>> Just off the top of my head, wouldn't it be better to use
>>> UTF32-encoded Wide_Wide_Character internally?
>>
>> Yep, that is the exactly the problem, a confusion between interface
>> and implementation.
>
> Don't understand. My point was that *when you are implementing this* it
> mught be easier to deal with 32-bit charactrs/code points/whatever the
> proper jargon is than with UTF8.

I think it would be more difficult, because you will have to convert
from and to UTF-8 under the hood or explicitly. UTF-8 is de-facto
interface standard and I/O standard. That would be 60-70% of all cases
you need a string. Most string operations like search, comparison,
slicing are isomorphic between code points and octets. So you would win
nothing from keeping strings internally as arrays of code points.

The situation is comparable to Unbounded_Strings. The implementation is
relatively simple, but the user must carry the burden of calling
To_String and To_Unbounded_String all over the application and the
processor must suffer the overhead of copying arrays here and there.

>> Encoding /= interface, e.g. an interface of a string viewed as an
>> array of characters. That interface just same for ASCII, Latin-1,
>> EBCDIC, RADIX50, UTF-8 etc strings. Why do you care what is inside?
>
> With a user's hat on, I don't. Implementers might have a different point
> of view.

Sure, but in Ada philosophy their opinion should carry less weight,
than, say, in C.

Randy Brukardt

unread,

Apr 9, 2022, 12:05:42 AM4/9/22

to

"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:t2q3cb$bbt$1...@gioia.aioe.org...

> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" <ra...@rrsoftware.com> writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is the exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> mught be easier to deal with 32-bit charactrs/code points/whatever the
>> proper jargon is than with UTF8.
>
> I think it would be more difficult, because you will have to convert from
> and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
> standard and I/O standard. That would be 60-70% of all cases you need a
> string. Most string operations like search, comparison, slicing are
> isomorphic between code points and octets. So you would win nothing from
> keeping strings internally as arrays of code points.

I basically agree with Dmitry here. The internal representation is an
implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size
(even for languages using their own characters like Greek) and for most of
us, they'll be just a bit more than a quarter the size. The amount of bytes
you copy around matters; the number of operations where code points are
needed is fairly small.

The main problem with UTF-8 is representing the code point positions in a
way that they (a) aren't abused and (b) don't cost too much to calculate.
Just using character indexes is too expensive for UTF-8 and UTF-16
representations, and using octet indexes is unsafe (since the splitting a
character representation is a possibility). I'd probably use an abstract
character position type that was implemented with an octet index under the
covers.

I think that would work OK as doing math on those is suspicious with a UTF
representation. We're spoiled from using Latin-1 representations, of course,
but generally one is interested in 5 characters, not 5 octets. And the
number of octets in 5 characters depends on the string. So most of the sorts
of operations that I tend to do (for instance from some code I was fixing
earlier today):

if Fort'Length > 6 and then
Font(2..6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation -- you don't know how many octets is in the string literal so
you can't assume a number in the test string. So the slice is dangerous
(even though in this particular case it would be OK since the test string is
all Ascii characters -- but I wouldn't want users to get in the habit of
assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returned bounds that don't start at 1. So the
slice was usually out of range -- which is why I was looking at the code.
Another thing that we could do without. Slices are evil, since they *seem*
to be the right solution, yet rarely are in practice without a lot of
hoops.]

> The situation is comparable to Unbounded_Strings. The implementation is
> relatively simple, but the user must carry the burden of calling To_String
> and To_Unbounded_String all over the application and the processor must
> suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string abstraction,
so when you try to build one, you can't fully do the job. One presumes that
a new language with a universal UTF-8 string wouldn't have that problem. (As
previously noted, I don't see much point in trying to patch up Ada with a
bunch of UTF-8 string packages; you would need an entire new set of
Ada.Strings libraries and I/O libraries, and then you'd have all of the old
stuff messing up resolution, using the best names, and confusing everything.
A cleaner slate is needed.)

Randy.

Simon Wright

unread,

Apr 9, 2022, 3:43:38 AM4/9/22

to

Well, I don't have any skin in this game, so I'll shut up at this point.

DrPi

unread,

Apr 9, 2022, 6:27:08 AM4/9/22

to

In Python-2, there is the same kind of problem. A string is a byte
array. This is the programmer responsibility to encode/decode to/from
UTF8/Latin1/... and to manage everything correctly. Litteral strings can
be considered as encoded or decoded depending on the notation ("" or u"").

In Python-3, a string is a character(glyph ?) array. The internal
representation is hidden to the programmer.
UTF8/Latin1/... encoded "strings" are of type bytes (byte array).
Writing/reading to/from a file is done with bytes type.
When writing/reading to/from a file in text mode, you have to specify
the encoding to use. The encoding/decoding is then internally managed.
As a general rule, all "external communications" are done with bytes
(byte array). This is the programmer responsability to encode/decode
where needed to convert from/to strings.
The source files (.py) are considered to be UTF8 encoded by default but
one can declare the actual encoding at the top of the file in a special
comment tag. When a badly encoded character is found, an exception is
raised at parsing time. So, literal strings are real strings, not bytes.

I think the Python-3 way of doing things is much more understandable and
really usable.

On the Ada side, I've still not understood how to correctly deal with
all this stuff.

Note : In Python-3, bytes type is not reserved to encoded "strings". It
is a versatile type for what it's named : a byte array.

DrPi

unread,

Apr 9, 2022, 3:00:03 PM4/9/22

to

Le 09/04/2022 à 18:46, Dennis Lee Bieber a écrit :
> On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <3...@drpi.fr> declaimed the
> following:

>
>>
>> In Python-3, a string is a character(glyph ?) array. The internal
>> representation is hidden to the programmer.
>

> <SNIP>

>>
>> On the Ada side, I've still not understood how to correctly deal with
>> all this stuff.
>

> One thing to take into account is that Python strings are immutable.
> Changing the contents of a string requires constructing a new string from
> parts that incorporate the change.
>

Right. I forgot to mention it.

> That allows for the second aspect -- even if not visible to a
> programmer, Python (3) strings are not a fixed representation: If all
> characters in the string fit in the 8-bit UTF range, that string is stored
> using one byte per character. If any character uses a 16-bit UTF
> representation, the entire string is stored as 16-bit characters (and
> similar for 32-bit UTF points). Thus, indexing into the string is still
> fast -- just needing to scale the index by the character width of the
> entire string.
>

Thanks for clarifying.

Vadim Godunko

unread,

Apr 10, 2022, 1:58:50 AM4/10/22

to

On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
>
> On the Ada side, I've still not understood how to correctly deal with
> all this stuff.
>

Take a look at https://github.com/AdaCore/VSS

Ideas behind this library is close to ideas of types separation in Python3. String is a Virtual_String, byte sequence is Stream_Element_Vector. Need to convert byte stream to string or back - use Virtual_String_Encoder/Virtual_String_Decoder.

I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; more cleaner types and API is a requirement now. The only case when old character/string types is really makes value is low resources embedded systems; in other cases their use generates a lot of hidden issues, which is very hard to detect.

DrPi

unread,

Apr 10, 2022, 2:59:24 PM4/10/22

to

That's an interesting solution.

Randy Brukardt

unread,

Apr 12, 2022, 2:13:12 AM4/12/22

to

"Vadim Godunko" <vgod...@gmail.com> wrote in message
news:3962d55d-10e8-4dff...@googlegroups.com...
...

>I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and
>programming languages; more cleaner types and API is a requirement now.

...which essentially means Ada is obsolete in your view, as String in
particular is way too embedded in the definition and the language-defined
units to use anything else. You'd end up with a mass of conversions to get
anything done (the main problem with Ada.Strings.Unbounded).

Or I suppose you could replace pretty much the entire library with a new
one. But now you have two of everything to confuse newcomers and you still
have a mass of old nonsense weighing down the language and complicating
implementations.

>The only case when old character/string types is really makes value is low

>resources embedded systems; ...

...which of course is at least 50% of the use of Ada, and probably closer to
90% of the money. Any solution for Ada has to continue to meet the needs of
embedded programmers. For instance, it would need to support fixed, bounded,
and unbounded versions (solely having unbounded strings would not work for
many applications, and indeed not just embedded systems need to restrict
those -- any long-running server has to control dynamic allocation)

>...in other cases their use generates a lot of hidden issues, which is very
>hard to detect.

At least some of which occur because a string is not an array, and the
forcible mapping to them never worked very well. The Z-80 Pascals that we
used to implement the very earliest versions of Ada had more functional
strings than Ada does (by being bounded and using a library for most
operations) - they would have been way easier to extend (as the Python ones
were, as an example).

Randy.

Thomas

unread,

Apr 15, 2022, 10:32:23 PM4/15/22

to

In article <86mttuk...@stephe-leake.org>,
Stephen Leake <stephe...@stephe-leake.org> wrote:

> DrPi <3...@drpi.fr> writes:
>
> > Any way to use source code encoded in UTF-8 ?

> from the gnat user guide, 4.3.1 Alphabetical List of All Switches:
>
> `-gnati`c''
> Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w). For details
> of the possible selections for `c', see *note Character Set
> Control: 4e.
>
> This applies to identifiers in the source code
>
> `-gnatW`e''
> Wide character encoding method (`e'=n/h/u/s/e/8).
>
> This applies to string and character literals.

afaik, -gnati is deactivated when -gnatW is not n or h (from memory)

so you can't ask both to check that identifiers are in ASCII and to have
literals in UTF-8.

(if it's resolved in new versions it's a good news :-) )

Thomas

unread,

Mar 30, 2023, 7:35:50 PM3/30/23

to

sorry for the delay.

In article <48309745-aa2a-47bd...@googlegroups.com>,

Vadim Godunko <vgod...@gmail.com> wrote:

> On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
> >
> > > But don't use unit names containing international characters, at any
> > > rate if you're (interested in compiling on) Windows or macOS:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
> >
> > and this kind of problems would be easier to avoid if string types were
> > stronger ...
> >
>
> Your suggestion is unable to resolve this issue on Mac OS X.

i said "easier" not "easy".

don't forget that Unicode has 2 levels :
- octets <-> code points
- code points <-> characters/glyphs

and you can't expect the upper to work if the lower doesn't.

> Like case
> sensitivity, binary compare of two strings can't compare strings in different
> normalization forms. Right solution is to use right type to represent any
> paths,

what would be the "right type", according to you?

In fact, here the first question to ask is:
what's the expected encoding for Ada.Text_IO.Open.Name?
- is it Latin-1 because the type is String not UTF_8_String?
- is it undefined because it depends on the underling FS?

Thomas

unread,

Mar 30, 2023, 11:06:25 PM3/30/23

to

In article <t2g0c1$eou$1...@dont-email.me>,

"Randy Brukardt" <ra...@rrsoftware.com> wrote:

> "Thomas" <fantome.foru...@free.fr.invalid> wrote in message
> news:fantome.forums.tDeConte...@news.free.fr...
> ...
> > as i said to Vadim Godunko, i need to fill a string type with an UTF-8
> > litteral.but i don't think this string type has to manage various
> > conversions.
> >
> > from my point of view, each library has to accept 1 kind of string type
> > (preferably UTF-8 everywhere),
> > and then, this library has to make needed conversions regarding the
> > underlying API. not the user.
>
> This certainly is a fine ivory tower solution,

I like to think from an ivory tower,
and then look at the reality to see what's possible to do or not. :-)

> but it completely ignores two
> practicalities in the case of Ada:
>
> (1) You need to replace almost all of the existing Ada language defined
> packages to make this work. Things that are deeply embedded in both
> implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
> have to change substantially. The result would essentially be a different
> language, since the resulting libraries would not work with most existing
> programs.

- in Ada, of course we can't delete what's existing, and there are many
packages which are already in 3 versions (S/WS/WWS).
imho, it would be consistent to make a 4th version of them for a new
UTF_8_String type.

- in a new language close to Ada, it would not necessarily be a good
idea to remove some of them, depending on industrial needs, to keep them
with us.

> They'd have to have different names (since if you used the same
> names, you change the failures from compile-time to runtime -- or even
> undetected -- which would be completely against the spirit of Ada), which
> means that one would have to essentially start over learning and using the
> resulting language.

i think i don't understand.

> (and it would make sense to use this point to
> eliminate a lot of the cruft from the Ada design).

could you give an example of cruft from the Ada design, please? :-)

>
> (2) One needs to be able to read and write data given whatever encoding the
> project requires (that's often decided by outside forces, such as other
> hardware or software that the project needs to interoperate with).

> At a minimum, you
> have to have a way to specify the encoding of files, streams, and hardware
> interfaces

> That will greatly complicate the interface and
> implementation of the libraries.

i don't think so.
it's a matter of interfacing libraries, for the purpose of communicating
with the outside (neither of internal libraries nor of the choice of the
internal type for the implementation).

Ada.Text_IO.Open.Form already allows (a part of?) this (on the content
of the files, not on their name), see ARM A.10.2 (6-8).
(write i the reference to ARM correctly?)

>
> > ... of course, it would be very nice to have a more thicker language with
> > a garbage collector ...
>
> I doubt that you will ever see that in the Ada family,

> as analysis and
> therefore determinism is a very important property for the language.

I completely agree :-)

> Ada has
> lots of mechanisms for managing storage without directly doing it yourself
> (by calling Unchecked_Deallocation), yet none of them use any garbage
> collection in a traditional sense.

sorry, i meant "garbage collector" in a generic sense, not in a
traditional sense.
that is, as Ada users we could program with pointers and pool, without
memory leaks nor calling Unchecked_Deallocation.

for example Ada.Containers.Indefinite_Holders.

i already wrote one for constrained limited types.
do you know if it's possible to do it for unconstrained limited types,
like the class of a limited tagged type?

Randy Brukardt

unread,

Apr 1, 2023, 6:18:16 AM4/1/23

to

I'm not going to answer this point-by-point, as it would take very much too
long, and there is a similar thread going on the ARG's Github (which needs
my attention more than comp.lang.ada.

But my opinion is that Ada got strings completely wrong, and the best thing
to do with them is to completely nuke them and start over. But one cannot do
that in the context of Ada, one would have to at least leave way to use the
old mechanisms for compatibility with older code. That would leave a
hodge-podge of mechanisms that would make Ada very much harder (rather than
easier) to use.

As far as the cruft goes, I wrote up a 20+ page document on that during the
pandemic, but I could never interest anyone knowledgeable to review it, and
I don't plan to make it available without that. Most of the things are
caused by interactions -- mostly because of too much generality. And of
course there are features that Ada would be better off without (like
anonymous access types).

Randy.

"Thomas" <fantome.foru...@free.fr.invalid> wrote in message

news:64264e2f$0$25952$426a...@news.free.fr...

Thomas

unread,

Apr 3, 2023, 8:02:05 PM4/3/23

to

In article
<fantome.forums.tDeConte...@news.free.fr>,
Thomas <fantome.foru...@free.fr.invalid> wrote:

> In article <f9d91cb0-c9bb-4d42...@googlegroups.com>,
> Vadim Godunko <vgod...@gmail.com> wrote:
>
> > On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
>
> > > What's the way to manage Unicode correctly ?

> > Ada doesn't have good Unicode support. :( So, you need to find suitable set
> > of "workarounds".
> >
> > There are few different aspects of Unicode support need to be considered:
> >
> > 1. Representation of string literals. If you want to use non-ASCII
> > characters
> > in source code, you need to use -gnatW8 switch and it will require use of
> > Wide_Wide_String everywhere.
> > 2. Internal representation during application execution. You are forced to
> > use Wide_Wide_String at previous step, so it will be UCS4/UTF32.
>
> > It is hard to say that it is reasonable set of features for modern world.
>
> I don't think Ada would be lacking that much, for having good UTF-8
> support.
>
> the cardinal point is to be able to fill a
> Ada.Strings.UTF_Encoding.UTF_8_String with a litteral.
> (once you got it, when you'll try to fill a Standard.String with a
> non-Latin-1 character, it'll make an error, i think it's fine :-) )
>
> does Ada 202x allow it ?

hi !

I think I found a quite nice solution!
(reading <t3lj44$fh5$1...@dont-email.me> again)
(not tested yet)

it's not perfect as in the rules of the art,
but it is:

- Ada 2012 compatible
- better than writing UTF-8 Ada code and then telling gnat it is Latin-1
(in this way it would take UTF_8_String for what it is:
an array of octets, but it would not detect an invalid UTF-8 string,
and if someone tells it's really UTF-8 all goes wrong)
- better than being limited to ASCII in string literals
- never need to explicitely declare Wide_Wide_String:
it's always implicit, for very short time,
and AFAIK eligible for optimization

package UTF_Encoding is

subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

function "+" (A : in Wide_Wide_String) return UTF_8_String
renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode;

end UTF_Encoding;

then we can do:

package User is

use UTF_Encoding;

My_String : UTF_8_String := + "Greek characters + smileys";

end User;

if you want to avoid "use UTF_Encoding;",
i think "use type UTF_Encoding.UTF_8_String;" doesn't work,
but this should work:

package UTF_Encoding is

subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

type Literals_For_UTF_8_String is new Wide_Wide_String;

function "+" (A : in Literals_For_UTF_8_String) return UTF_8_String
renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode;

end UTF_Encoding;

package User is

use type UTF_Encoding.Literals_For_UTF_8_String;

My_String : UTF_Encoding.UTF_8_String
:= + "Greek characters + smileys";

end User;

what do you think about that ? good idea or not ? :-)