
Why no Ada.Wide_Directories?


Michael Rohan

unread,
Oct 14, 2011, 2:58:45 AM10/14/11
to
Hi,

I've been working a little on accessing files and directories using Ada.Directories and have been using a thin wrapper layer to convert from Wide_String to UTF-8 and back. It does, however, seem strange that there is no Wide_Directories version in the standard library. Was there a technical reason it wasn't included?

Take care,
Michael
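(The round trip such a thin wrapper performs can be sketched as follows; this is an illustrative Python model of the idea, not Michael's actual Ada code. A Wide_String is modeled as a sequence of code points, the UTF-8 side as a byte string.)

```python
# Sketch of a Wide_String <-> UTF-8 conversion layer, modeled in Python.
# A Wide_String is a sequence of code points; UTF-8 is a byte string.

def wide_to_utf8(code_points):
    """Encode a list of code points (the Wide_String model) as UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

def utf8_to_wide(data):
    """Decode UTF-8 bytes back into a list of code points."""
    return [ord(ch) for ch in data.decode("utf-8")]

# "été": the accented characters become two bytes each in UTF-8.
name = [0x00E9, 0x0074, 0x00E9]
encoded = wide_to_utf8(name)          # b'\xc3\xa9t\xc3\xa9'
assert utf8_to_wide(encoded) == name  # lossless round trip
```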

Yannick Duchêne (Hibou57)

unread,
Oct 14, 2011, 3:39:32 AM10/14/11
to
Le Fri, 14 Oct 2011 08:58:45 +0200, Michael Rohan
<michael...@gmail.com> a écrit:

> Hi,
>
> I've working a little on accessing files and directories using
> Ada.Directories and have been using a thin wrapper layer to convert from
> Wide_String to UTF8 and back.

Does it mean you pass UTF-8 encoded strings to Ada directory operations?


--
“Syntactic sugar causes cancer of the semi-colons.” [Epigrams on
Programming — Alan J. — P. Yale University]
“Structured Programming supports the law of the excluded muddle.” [Idem]
Java: Write once, Never revisit

Dmitry A. Kazakov

unread,
Oct 14, 2011, 5:07:20 AM10/14/11
to
On Fri, 14 Oct 2011 09:39:32 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Fri, 14 Oct 2011 08:58:45 +0200, Michael Rohan

> <michael...@gmail.com> a écrit:


>
>> I've working a little on accessing files and directories using
>> Ada.Directories and have been using a thin wrapper layer to convert from
>> Wide_String to UTF8 and back.
> Does it mean you pass UTF-8 encoded strings to Ada directory operations ?

In most cases this is how it works under Linux. Under Windows it would
depend on which kind of operations (xxxA, xxxW, etc.) the implementation
uses.

I would strongly recommend not using Ada.Directories until it gets fixed,
i.e. *at least* until all its calls take Wide_Wide_String or are explicitly
mandated to be UTF-8 encoded.

Until then, as an alternative, use the GIO binding of Gtk or an equivalent
in Qt.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Yannick Duchêne (Hibou57)

unread,
Oct 14, 2011, 8:48:41 AM10/14/11
to
Le Fri, 14 Oct 2011 11:07:20 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:

> On Fri, 14 Oct 2011 09:39:32 +0200, Yannick Duchêne (Hibou57) wrote:
>
>> Le Fri, 14 Oct 2011 08:58:45 +0200, Michael Rohan

>> <michael...@gmail.com> a écrit:


>>
>>> I've working a little on accessing files and directories using
>>> Ada.Directories and have been using a thin wrapper layer to convert
>>> from
>>> Wide_String to UTF8 and back.
>> Does it mean you pass UTF-8 encoded strings to Ada directory operations
>> ?
>
> In most cases this is how it works under Linux. Under Windows that would
> depend what kind of operations xA, xW etc the implementation uses.

It is indeed not safe to use. I have seen it raise exceptions when facing
file names that are perfectly valid for the OS but which it did not like
(names containing characters outside of ISO-8859). As a platform-independent
package, it should not impose its own conventions on a platform, so it is
indeed better to avoid it. It is not safe for a program to fail to scan the
contents of a directory sanely when it should be able to (unless it is OK
for the program to randomly miss some files).
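(The failure mode described above can be made concrete; a Python illustration, not from the thread: a file name that is perfectly valid for the OS, but contains characters outside ISO-8859-1, cannot be represented as a Latin-1 string at all.)

```python
# A valid OS file name whose characters fall outside ISO-8859-1
# cannot be held in a Latin-1 string: the conversion itself fails.
name = "データ.txt"              # a perfectly valid file name on the OS
try:
    name.encode("latin-1")       # the Latin-1 view of that name
    raise AssertionError("unreachable")
except UnicodeEncodeError:
    pass                         # outside ISO-8859: conversion fails
```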

Yannick Duchêne (Hibou57)

unread,
Oct 14, 2011, 8:54:42 AM10/14/11
to
Le Fri, 14 Oct 2011 11:07:20 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> In most cases this is how it works under Linux. Under Windows that would
> depend what kind of operations xA, xW etc the implementation uses.
It will not work at all if the implementation uses the xxxW calls, and if
the implementation uses the xxxA calls, it may randomly work or fail
depending on individual file names. So that's not safe (I guess GNAT on
Windows uses the xxxA calls). I guess this may work better on Linux, as it
uses UTF-8 internally.

ytomino

unread,
Oct 14, 2011, 9:06:05 PM10/14/11
to
Hello.
Per RM 3.5.2, Ada's Character/String types are Latin-1, not UTF-8
(except for Ada.Strings.UTF_Encoding).
I'm afraid it is a violation of the standard even if the
implementation accepts UTF-8.

Of course, I think the standard is impractical, too.
If we must keep to the standard, there is no way at all to access a file
(or other environment features) named with non-ASCII characters.
That is hard to bear... but that's another problem.

I do not know why the standard does not have Wide_Directories,
Text_IO.Wide_Open, Wide_Command_Line, Wide_Environment_Variables,
and so on.
Still, I hope for these (or for the standard to allow Character/String
to represent UTF-8).

Vadim Godunko

unread,
Oct 15, 2011, 2:55:39 AM10/15/11
to
On Oct 15, 5:06 am, ytomino <aghi...@gmail.com> wrote:
>
> Of course, I think that the standard is impractical, too.
> If we must keep the standard, there is no way to access a file (and
> other environment features) named with non-ASCII, at all.
> I'm unlikely to bear... But that's another problem.
>
It is always possible to use a non-standard library. For example you can
look at Matreshka http://forge.ada-ru.org/matreshka; it has its own string
type, which is equivalent to Wide_Wide_String but more space- and
performance-efficient. It provides access to command line switches and
environment variables in a platform- and encoding-independent way using
this string type. Unfortunately, directory operations are not implemented
yet, but we will implement them at some point.

Dmitry A. Kazakov

unread,
Oct 15, 2011, 4:38:23 AM10/15/11
to
On Fri, 14 Oct 2011 18:06:05 -0700 (PDT), ytomino wrote:

> In RM 3.5.2, Ada's Character/String types are not UTF-8 but Latin-1
> (except Ada.Strings.UTF_Encoding).
> I'm afraid that is violation of the standard even if the
> implementation accepts UTF-8.

The same applies to Wide_String, which is UCS-2 not UTF-16. Implementations
pretending otherwise are wrong. For that matter Windows xW calls are
UTF-16. Passing Wide_String there is wrong.

> Of course, I think that the standard is impractical, too.

There are two problems with the standard:

1. It does not define strings and characters in terms of a code point type
to be consistent with Unicode;

2. It does not provide automatic conversions between character/string
types, because of the problem #1, and because the Ada type system is too
weak for that.

Clearly file operations, directory operations, and character maps should be
defined using code points rather than characters. There should be only one
instance of each operation/package, independent of the encoding and the
combinations of encodings.
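(Dmitry's point that Wide_String is UCS-2, not UTF-16, can be illustrated outside Ada; a Python sketch, not from the thread: a code point beyond the BMP does not fit in one 16-bit UCS-2 cell, and UTF-16 must spend a surrogate pair on it, which is why passing fixed-width 16-bit strings to UTF-16 APIs is wrong.)

```python
# UCS-2 vs UTF-16: a code point outside the Basic Multilingual Plane.
clef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF
assert ord(clef) > 0xFFFF        # cannot fit in a single 16-bit UCS-2 cell

utf16 = clef.encode("utf-16-be")
assert len(utf16) == 4           # UTF-16 uses a surrogate pair: 2 units
assert utf16 == b"\xd8\x34\xdd\x1e"   # high surrogate D834, low DD1E
```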

ytomino

unread,
Oct 15, 2011, 8:34:57 AM10/15/11
to
On Oct 15, 3:55 pm, Vadim Godunko <vgodu...@gmail.com> wrote:
> It is always possible to use non-standard library. For example you can
> look at Matreshka http://forge.ada-ru.org/matreshka; it has own string
> type which is equivalent to Wide_Wide_String, but more space and
> performance efficient. It provides access to command line switches and
> environment variables in platform and encoding independent way using
> this string. Unfortunately, directory operations is not implemented
> now, but we will implement them in some point.

Matreshka seems well designed!

Anyway, we can do anything by using a non-standard library.
However, that is not really a reasonable approach in practice.

(By the way, I've been making another runtime, https://github.com/ytomino/drake,
like you. It intentionally violates the standard: Character/
String are just UTF-8, and Ada.Strings works on code points
in my runtime. This is a result of avoiding the inefficiency of linking
both the standard library and a non-standard library that do the same
things. Of course, it's illegal. I do not recommend using my runtime.)

Peter C. Chapin

unread,
Oct 15, 2011, 9:12:39 AM10/15/11
to
On 2011-10-15 04:38, Dmitry A. Kazakov wrote:

> There are two problems with the standard:
>
> 1. It does not define strings and characters in terms of a code point type
> to be consistent with Unicode;
>
> 2. It does not provide automatic conversions between character/string
> types, because of the problem #1, and because the Ada type system is too
> weak for that.
>
> Clearly file operations, directory operations, character maps should be
> defined using code points rather than characters. There should be only one
> instance of each operation/package independent on the encoding and the
> combinations of encodings.

Disclaimer: I haven't thought about this very much.

It seems like you are expecting too much from the standard. If a
standard program writes files with names that the standard understands
then a standard program can read those files back and manipulate them
via Ada.Directories. Yes?

The problem arises when you try to ask a standard program to delve into
system specific details such as reading arbitrary ("exotic") file names
supported by the system. That doesn't work, but I wouldn't expect it to
work so what's the problem?

C avoids this complexity by just not including directory manipulation in
the standard at all. Ada at least allows a standard program to
manipulate directories containing files written by another standard program.

I can understand that it might be nice to extend the standard to include
proper support for Unicode file names and such. But I don't think the
lack of that support can be interpreted as some kind of failure of the
standard.

Peter

Ludovic Brenta

unread,
Oct 15, 2011, 9:22:43 AM10/15/11
to
"Peter C. Chapin" <PCh...@vtc.vsc.edu> writes on comp.lang.ada:
> It seems like you are expecting too much from the standard. If a
> standard program writes files with names that the standard understands
> then a standard program can read those files back and manipulate them
> via Ada.Directories. Yes?
>
> The problem arises when you try to ask a standard program to delve
> into system specific details such as reading arbitrary ("exotic") file
> names supported by the system. That doesn't work, but I wouldn't
> expect it to work so what's the problem?
>
> C avoids this complexity by just not including directory manipulation
> in the standard at all. Ada at least allows a standard program to
> manipulate directories containing files written by another standard
> program.
>
> I can understand that it might be nice to extend the standard to
> include proper support for Unicode file names and such. But I don't
> think the lack of that support can be interpreted as some kind of
> failure of the standard.

+1

--
Ludovic Brenta.

Dmitry A. Kazakov

unread,
Oct 15, 2011, 10:47:55 AM10/15/11
to
On Sat, 15 Oct 2011 09:12:39 -0400, Peter C. Chapin wrote:

> On 2011-10-15 04:38, Dmitry A. Kazakov wrote:
>
>> There are two problems with the standard:
>>
>> 1. It does not define strings and characters in terms of a code point type
>> to be consistent with Unicode;
>>
>> 2. It does not provide automatic conversions between character/string
>> types, because of the problem #1, and because the Ada type system is too
>> weak for that.
>>
>> Clearly file operations, directory operations, character maps should be
>> defined using code points rather than characters. There should be only one
>> instance of each operation/package independent on the encoding and the
>> combinations of encodings.
>
> Disclaimer: I haven't thought about this very much.
>
> It seems like you are expecting too much from the standard. If a
> standard program writes files with names that the standard understands
> then a standard program can read those files back and manipulate them
> via Ada.Directories. Yes?

Maybe; it is difficult to guess.

Anyway, what you describe is a much bigger expectation than mine. I wished
for mere consistency of the Ada string types with Unicode (after all, Ada
adopted Unicode), which is entirely an internal language matter. What you
are expecting is certain behavior of the language environment, which Ada
cannot control at all.

> The problem arises when you try to ask a standard program to delve into
> system specific details such as reading arbitrary ("exotic") file names
> supported by the system.

There is no such thing. Unicode was introduced in order to support any
thinkable names.

> C avoids this complexity by just not including directory manipulation in
> the standard at all. Ada at least allows a standard program to
> manipulate directories containing files written by another standard program.

As a matter of fact, it does not. Consider an Ada program creating a file
in its current directory. Let the directory path contain Unicode characters
outside Latin-1. Then another Ada program running in a different directory
won't be able to find this file. In fact you cannot even walk the file
system tree/forest using Ada.Directories. You cannot do this either
portably or even in a system-dependent way.

And it is just silly to make the point that Ada programs should read/write
only files created by other Ada programs [compiled by the same compiler, I
guess]. However, even this is not guaranteed.

> I can understand that it might be nice to extend the standard to include
> proper support for Unicode file names and such.

Ada is Unicode.

> But I don't think the
> lack of that support can be interpreted as some kind of failure of the
> standard.

Maybe it was a success, but some people really wished Ada.Directory be
usable for developing portable GUI programs. Presently I am using GIO
instead of Ada.Directory, and it does not make me happy.

Yannick Duchêne (Hibou57)

unread,
Oct 16, 2011, 1:48:54 AM10/16/11
to
Le Sat, 15 Oct 2011 16:47:55 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> And it is just silly to make the point that Ada programs should
> read/write
> only files created by other Ada programs
+1
> Ada is Unicode.
+1

Yannick Duchêne (Hibou57)

unread,
Oct 16, 2011, 1:51:05 AM10/16/11
to
Le Sat, 15 Oct 2011 15:12:39 +0200, Peter C. Chapin <PCh...@vtc.vsc.edu>
a écrit:
> It seems like you are expecting too much from the standard.
I expect safe execution. The actual behavior is unsafe.

Peter C. Chapin

unread,
Oct 16, 2011, 8:15:11 PM10/16/11
to
On 2011-10-15 10:47, Dmitry A. Kazakov wrote:

> And it is just silly to make the point that Ada programs should read/write
> only files created by other Ada programs [compiled by the same compiler, I
> guess].

It's not that silly.

In order to talk sensibly about files the standard needs to define a
model of "file" and, in this case even "file system." This needs to be a
model that will be applicable to the widest range of platforms possible.
Such is the nature of a standard. Thus the standard model of "file" and
"file system" will be a simplified abstraction of the real thing on any
particular system. A portable program can only make use of that
simplified abstraction if it expects to remain portable.

If other files on the system also conform to that simplified model, that
is good. A portable program will be able to manipulate them. However, if
a program wishes to manipulate all files on a particular system, with
their full generality, system-specific techniques are going to be necessary.

For example, I don't believe the Ada standard allows one to access
information about a file's owner. Yet every file on my Linux system has
an owner. If I want to write a portable Ada program I have to live
without that information. If the Ada standard goes on to say that I
can't access files with names containing "exotic" characters, how is
that any different in principle?

I can appreciate that accessing files with Unicode names might be a
useful thing to do in a standard program. What happens when such a
program tries to create files with such names on a system that doesn't
support them? I suppose a solution could be found, but I can also see
how it would get ugly.

Peter

Yannick Duchêne (Hibou57)

unread,
Oct 16, 2011, 11:23:30 PM10/16/11
to
Le Mon, 17 Oct 2011 02:15:11 +0200, Peter C. Chapin <PCh...@vtc.vsc.edu>
a écrit:
> It's not that silly.
>
> In order to talk sensibly about files the standard needs to define a
> model of "file" and, in this case even "file system." This needs to be a
> model that will be applicable to the widest range of platforms possible.
> Such is the nature of a standard. Thus the standard model of "file" and
> "file system" will be a simplified abstraction of the real thing on any
> particular system. A portable program can only make use of that
> simplified abstraction if it expects to remain portable.
> [etc]
You are going too far, turning the matter into something it was not. From
there, your conclusions can simply be wrong, or apply to other matters but
not the actual one.

It was just about the character set, and that model exists in Ada (in a
non‑homogeneous way, which seems a failure).

> What happens when such a program tries to create files with such names
> on a system that doesn't support them?
What about Text I/O then? “Surprisingly”, nobody complained.

Simon Wright

unread,
Oct 17, 2011, 3:12:35 AM10/17/11
to
"Peter C. Chapin" <PCh...@vtc.vsc.edu> writes:

> I can appreciate that accessing files with Unicode names might be a
> useful thing to do in a standard program. What happens when such a
> program tries to create files with such names on a system that doesn't
> support them? I suppose a solution could be found, but I can also see
> how it would get ugly.

Exception Name_Error, I'd think!

Dmitry A. Kazakov

unread,
Oct 17, 2011, 3:59:37 AM10/17/11
to
On Sun, 16 Oct 2011 20:15:11 -0400, Peter C. Chapin wrote:

> In order to talk sensibly about files the standard needs to define a
> model of "file" and, in this case even "file system." This needs to be a
> model that will be applicable to the widest range of platforms possible.

Right

> Such is the nature of a standard.

Thank you for making my point. The standard disregards the above principle
by using Latin-1 encoding for file names.

> Thus the standard model of "file" and
> "file system" will be a simplified abstraction of the real thing on any
> particular system.

"Simplified" is in contradiction with "widest range". In order to support
the widest range it must be generalized and abstracted, rather than
simplified and degraded.

Ada adopted Unicode. Unicode is a generalized model capable of handling any
encoding the target platform might use. The programmer need not know the
actual encoding; it becomes irrelevant.

> If other files on the system also conform to that simplified model, that
> is good. A portable program will be able to manipulate them. However, if
> a program wishes to manipulate all files on a particular system, with
> their full generality, system-specific techniques are going to be necessary.

Wrong. All file systems share common features, which can and must be
properly abstracted. What is system-specific is the implementations, not
the package specifications.

> For example, I don't believe the Ada standard allows one to access
> information about a file's owner.

This has nothing to do with file names, but if the standard wished to
address access rights, it could do it as well.

> Yet every file on my Linux system has
> an owner. If I want to write a portable Ada program I have to live
> without that information. If the Ada standard goes on to say that I
> can't access files with names containing "exotic" characters, how is
> that any different in principle?

Because inability to spell the file name is not the same as lacking access
rights. Access rights are external to the program code. The file name,
coded as a string literal, is part of the program. Failure of the former
is not a bug. The latter is a bug, because the file exists, is accessible,
and has a proper name. A program bug which cannot be fixed is a language
design bug.

Randy Brukardt

unread,
Oct 17, 2011, 5:33:28 PM10/17/11
to
"ytomino" <agh...@gmail.com> wrote in message
news:418b8140-fafb-442f...@y22g2000pri.googlegroups.com...
> Hello.
> In RM 3.5.2, Ada's Character/String types are not UTF-8 but Latin-1
> (except Ada.Strings.UTF_Encoding).
> I'm afraid that is violation of the standard even if the
> implementation accepts UTF-8.

Say what?

Ada.Strings.UTF_Encoding (new in Ada 2012) uses a subtype of String to
store UTF-8 encoded strings. As such, I'd find it pretty surprising if
doing so was "a violation of the standard".

The intent has always been that Open, Ada.Directories, etc. take UTF-8
strings as an option. Presumably the implementation would use a Form to
specify that the file names are in UTF-8 form rather than Latin-1. (I
wasn't able to find a reference for this in a quick search, but I know it
has been talked about on several occasions.)

One of the primary reasons that Ada.Strings.UTF_Encoding uses a subtype of
String rather than a separate type is so that it can be passed to Open and
the like.

It's probably true that we should standardize on the Form needed to use
UTF-8 strings in these contexts, or at least come up with Implementation
Advice on that point.

Randy.


Randy Brukardt

unread,
Oct 17, 2011, 5:41:12 PM10/17/11
to
"Yannick Duchêne (Hibou57)" <yannick...@yahoo.fr> wrote in message
news:op.v3fjv...@index.ici...
>Le Sat, 15 Oct 2011 15:12:39 +0200, Peter C. Chapin <PCh...@vtc.vsc.edu>
>a écrit:
>> It seems like you are expecting too much from the standard.
>I feel to expect safe execution. The actual behavior is unsafe.

That's clearly an implementation problem rather than a language one.
Ada.Directories was designed with the intent that UTF-8 encoding could be
used throughout (as an option) and it would work. To the extent that that is
not true, there would be a bug, but I know of no such problems.

Now, if an implementation on Windows doesn't have a way to use UTF-8
encoding, that is an implementation problem, but not one that the Standard
can do much about.

Randy.




ytomino

unread,
Oct 17, 2011, 7:47:49 PM10/17/11
to
On Oct 18, 6:33 am, "Randy Brukardt" <ra...@rrsoftware.com> wrote:
>
> Say what?
>
> Ada.Strings.Encoding (new in Ada 2012) uses a subtype of String to store
> UTF-8 encoded strings. As such, I'd find it pretty surprising if doing so
> was "a violation of the standard".
>
> The intent has always been that Open, Ada.Directories, etc. take UTF-8
> strings as an option. Presumably the implementation would use a Form to
> specify that the file names in UTF-8 form rather than Latin-1. (I wasn't
> able to find a reference for this in a quick search, but I know it has been
> talked about on several occasions.)
>
> One of the primary reasons that Ada.Strings.Encoding uses a subtype of
> String rather than a separate type is so that it can be passed to Open and
> the like.
>
> It's probably true that we should standardize on the Form needed to use
> UTF-8 strings in these contexts, or at least come up with Implementation
> Advice on that point.
>
>                                        Randy.

Good news. Thanks for letting me know.
My worry is decreased a little.

However, even if that is right, Form parameters are missing from many
subprograms.
Probably all subprograms in Ada.Directories,
Ada.Directories.Hierarchical_File_Names, Ada.Command_Line, and
Ada.Environment_Variables, and any other subprograms taking a Name
parameter or returning a file name, should have a Form parameter.
(For example, if I do Open (X, Form => "UTF-8"), does Name (X)
return UTF-8 or Latin-1?)

Moreover, in the future, we will always use the I/O subprograms in UTF-8
mode if what you say is realized.
But other libraries in the standard are explicitly defined as Latin-1.
It's certain that Ada.Characters.Handling.To_Upper breaks UTF-8.
So we cannot use most subprograms in Ada.Characters and Ada.Strings
for handling file names.
(For example, Ada.Directories.Name_Case_Equivalence may return
Case_Insensitive, yet we cannot use Ada.Strings.Equal_Case_Insensitive to
compare two file names.)
It means the standard library is split between UTF-8 and Latin-1.
That's not reasonable.

I hope this gets solved.
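(The claim that Latin-1 To_Upper breaks UTF-8 is easy to demonstrate; a Python illustration, not from the thread. The euro sign's UTF-8 lead byte 0xE2 is 'â' in Latin-1, so byte-wise upper-casing rewrites it to 0xC2 and destroys the sequence.)

```python
# Latin-1 upper-casing applied byte-by-byte to UTF-8 data corrupts it.
euro = "€".encode("utf-8")                  # b'\xe2\x82\xac'
as_latin1 = euro.decode("latin-1")          # bytes misread as Latin-1 chars
uppered = as_latin1.upper().encode("latin-1")
assert uppered == b"\xc2\x82\xac"           # lead byte 0xE2 became 0xC2
try:
    uppered.decode("utf-8")                 # no longer valid UTF-8
    raise AssertionError("unreachable")
except UnicodeDecodeError:
    pass
```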

Adam Beneschan

unread,
Oct 17, 2011, 9:10:35 PM10/17/11
to
I have a feeling you're fundamentally confused about what UTF-8 is, as
compared to "Latin-1". Latin-1 is a character mapping. It defines,
for all integers in the range 0..255, what character that integer
represents (e.g. 77 represents 'M', etc.). Unicode is a character
mapping that defines characters for a much larger integer range. For
integers in the range 0..255, the character represented in Unicode is
the same as that in Latin-1; higher integers represent characters in
other alphabets, other symbols, etc. Those mappings just tell you
what symbols go with what numbers, and they don't say anything about
how the numbers are supposed to be stored.

UTF-8 is an encoding (representation). It defines, for each non-
negative integer up to a certain point, what bits are used to
represent that integer. The number of bits is not fixed. So even if
you're working with characters all in the 0..255 range, some of those
characters will be represented in 8 bits (one byte) and some will take
16 bits (two bytes).

Because of this, it is not feasible to work with strings or characters
in UTF-8 encoding. Suppose you declare a string

S : String (1 .. 100);

but you want it to be a UTF-8 string. How would that work? If you
want to look at S(50), the computer would have to start at the
beginning of the string and figure out whether each character is
represented as 1 or 2 bytes. Nobody wants that.

The only sane way to work with strings in memory is to use a format
where every character is the same size (String if all your characters
are in the 0..255 range, Wide_String for 0..65535, Wide_Wide_String
for 0..2**31-1). Then, if you have a string of bytes in UTF-8 format,
you convert it to a regular (Wide_)(Wide_)String with routines in
Ada.Strings.UTF_Encoding; and it also has routines for converting
regular strings to UTF-8 format. But you don't want to *keep* strings
in memory and work with them in UTF-8 format. That's why it doesn't
make sense to have string routines (like
Ada.Strings.Equal_Case_Insensitive or Ada.Characters.Handling.To_Upper)
that work with UTF-8.

Hope this solves your problem.

-- Adam
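(Adam's indexing problem in concrete form; a Python sketch, not from the thread: in UTF-8 the byte offset of the Nth character is not N, so fetching "S(50)" would require decoding from the start of the string.)

```python
# Variable-length UTF-8 breaks direct indexing.
text = "naïve"                     # five characters
data = text.encode("utf-8")
assert len(text) == 5
assert len(data) == 6              # 'ï' occupies two bytes

# The fifth character ('e') is NOT at byte offset 4; 'v' is there:
assert text[4] == "e"
assert chr(data[4]) == "v"
```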

ytomino

unread,
Oct 17, 2011, 10:32:04 PM10/17/11
to
I'm not confused; you're misreading.

Of course, if applications always hold file names as Wide_Wide_String,
and encode to UTF-8 only when calling I/O subprograms, as you say, then
it's very simple, and it is perhaps the intended method. I understand
that.

But where do these file names come from?
They are usually supplied on the command line or in a configuration file
(written by the user).
They are probably encoded in UTF-8 if the locale setting of the OS is
UTF-8.
So Form parameters for the subprograms in Ada.Command_Line are necessary,
and it's natural to keep them in UTF-8.

(Some file systems, like Linux's, accept broken byte sequences as correct
file names.
Applications must not (cannot?) decode/encode file names in this
case.
A broken file name may become a right file name if the user sets the LANG
variable.
The same thing happens with NTFS/HFS+: these file systems can accept
broken UTF-16. Strictly speaking, an application should never encode/
decode file names. But Ada has decided that file names are stored in
String (as Randy says), so we have to give up on UTF-16 file
systems.)

And it's common for text processing functions to keep strings encoded
in many other libraries and languages. I do not necessarily want to
deny the Ada way, but I feel your opinion is prejudiced. It is not as
difficult as you say, in fact.
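(The "broken file name" case raised above is real; a Python illustration, not from the thread: Linux file names are byte strings that need not be valid UTF-8, and Python's `surrogateescape` error handler exists precisely to round-trip such names through text without loss.)

```python
# Linux file names are bytes and need not be valid UTF-8.
raw = b"report-\xff.txt"            # a legal file name, invalid as UTF-8
try:
    raw.decode("utf-8")             # strict decoding rejects it
    raise AssertionError("unreachable")
except UnicodeDecodeError:
    pass

# 'surrogateescape' smuggles the bad bytes through str and back losslessly:
name = raw.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "surrogateescape") == raw
```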

Yannick Duchêne (Hibou57)

unread,
Oct 17, 2011, 10:59:28 PM10/17/11
to
Le Mon, 17 Oct 2011 23:33:28 +0200, Randy Brukardt <ra...@rrsoftware.com>
a écrit:
> Say what?
>
> Ada.Strings.Encoding (new in Ada 2012) uses a subtype of String to store
> UTF-8 encoded strings.

*Please note the following is just a personal opinion* (I just want to tell
what I feel, and don't mean to hurt anyone)

Everyone knows and has noticed that this still confuses “bytes and
characters” like C did. Eiffel had an implementation of a UTF-8 string
which was a different type from the default ASCII string, and you could
not access bytes from it; there was proper encapsulation and type checking.
It happens I used a similar abstraction in a tiny Ada application.

Unless it is required that there be a BOM at the beginning of each UTF-8
string, and that this BOM always be checked --- I will have to check the
new RM, but I feel the answer is No ---, conflating both types into a
single one is not that clean --- and even if the answer were Yes, this
would only be a dynamic check, not a static check. I feel it is more an
implementation trick (which was indeed intended by the design of UTF-8,
targeting a hardly solvable context) than a clean formalization.

Try to iterate over the elements of a value of type String. What do you
get if it is a proper ISO 8859-1 string? You get Characters. What do you
get if it is UTF-8? You get garbage and “random who-knows-what-it-is”, …
_and the type system does not catch it_ (*), while that is one of its
primary intents.

By the way, if ISO/ANSI strings and UTF-8 strings are the same, then what
is Wide_Character? The Unicode Basic Multilingual Plane, or UTF-16LE, or
UTF-16BE, or guess?

This will not break Ada's value in the eyes of most people (**), but I
believe these and some other people noticed the same.

(*) The two types are not even structurally compatible.

(**) That's a library design flaw, not a language flaw! The difference
between the two is that if a library part is not strongly tied into the
language definition, as I/O attributes or finalization behaviors are, one
always has the option to work around it using one's own library. But one
still loses the benefit of a standard library.

Yannick Duchêne (Hibou57)

unread,
Oct 17, 2011, 11:15:45 PM10/17/11
to
Le Tue, 18 Oct 2011 03:10:35 +0200, Adam Beneschan <ad...@irvine.com> a
écrit:
> That's why it doesn't
> make sense to have string routines (like
> Ada.Strings.Equal_Case_Insensitive or Ada.Character_Handling.To_Upper)
> that work with UTF-8.
That would make sense if String were an array container for _one_ Unicode
character subset and nothing else. What a mess if someone passes a UTF-8
string to such a case-mapping method (*)… don't expect to decode it
after that.

(*) Except if the string is restricted to US-ASCII, in which case you
will not get anything wrong, just still a pure US-ASCII string, which
is always valid UTF-8 by definition. Not the same story for ISO-8859-1
strings.

Michael Rohan

unread,
Oct 18, 2011, 12:07:35 AM10/18/11
to
Hi,

Just to confirm "ytomino"'s take on things: while I started this on Ada.Directories, I have fallen into the practice of simply doing From_UTF8 on anything coming from the environment (Ada.Command_Line, Ada.Environment_Variables, etc.) and To_UTF8 on the way out. This works for my Linux system (en-US, Latin-1, no surprise), but using Wide_String internally, the external/internal interface strings need to be converted somehow, and UTF-8 is a reasonable option.

As to the use of the Form parameter, additional standardization might be needed. With GNAT, the Form (for Open at least) can be used to define the encoding of the file contents, but not of the file name.

Take care,
Michael.

ytomino

unread,
Oct 18, 2011, 12:46:13 AM10/18/11
to
Well... if my supplement is allowed: in my honest opinion, ignoring the
existing way of Ada, a "File_Name_String" type would be better.
(In addition, it would be welcome for UTF_8_String and UTF_16_String to be
new types, as Yannick says.)

File_Name_String would be UTF-8 on OS X (if only using the POSIX API),
UTF-16 on Windows, and a localized string on BSD (the I18N of BSD is
unique!).
And Equal_File_Names/Less_File_Names are a necessity.
Ada.Directories.Name_Case_Equivalence is useless because the
case-insensitivity rules of NTFS and HFS+ are different.

(I once implemented the case-insensitivity rules of HFS+. It's difficult,
and required a table different from the UCD. NTFS is easier because the
CompareString API can be used for this purpose. CoreFoundation
possibly has a usable function, too, but I do not know. Anyway, these
are not portable. I want a standard library wrapping them.)

This is just a hypothetical story ignoring the existing way.
I may be satisfied if the Form parameter becomes usable.

ytomino

unread,
Oct 18, 2011, 12:54:22 AM10/18/11
to
Excuse my digression.

On Oct 18, 11:59 am, Yannick Duchêne (Hibou57)
<yannick_duch...@yahoo.fr> wrote:
> Eiffel had an implementation of UTF-8 string, which was different
> than the default ASCII string, and ***you could not access bytes from it***

(about ***)

Really? I'm a novice at Eiffel, but I think that accessing the bytes of
an encoded string is worthwhile. I could not believe it at once, so I
googled.
If you talk about UNICODE_STRING, it seems to be decoded from UTF-8 into
an array of code points (like Wide_Wide_String).
http://www.maths.tcd.ie/~odunlain/eiffel/html/base/UNICODE_STRING.html
If you talk about Eiffel.NET, it seems to have byte_count and
byte_item.
http://www.eiffelroom.org/blog/peter_gummer/utf_8_unicode_in_eiffel_for_net

Dmitry A. Kazakov

unread,
Oct 18, 2011, 3:29:30 AM10/18/11
to
On Mon, 17 Oct 2011 16:41:12 -0500, Randy Brukardt wrote:

> "Yannick Duchêne (Hibou57)" <yannick...@yahoo.fr> wrote in message
> news:op.v3fjv...@index.ici...
>>Le Sat, 15 Oct 2011 15:12:39 +0200, Peter C. Chapin <PCh...@vtc.vsc.edu>
>>a écrit:
>>> It seems like you are expecting too much from the standard.
>>I feel to expect safe execution. The actual behavior is unsafe.
>
> That's clearly an implementation problem rather than a language one.
> Ada.Directories was designed with the intent that UTF-8 encoding could be
> used throughout (as an option) and it would work.

How could it be an option? String is either Latin-1 or UTF-8. The standard
must explicitly require UTF-8 (breaking some existing programs).

> Now, if an implementation on Windows doesn't have a way to use UTF-8
> encoding, that is an implementation problem, but not one that the Standard
> can do much about.

It is the standard's problem so long as such Windows implementations
conform to the standard.

Implementations, which would recode String from UTF-8 to UTF-16 and pass
that to a xW Windows call, look illegal to me because String is proclaimed
Latin-1.

Dmitry A. Kazakov

unread,
Oct 18, 2011, 3:55:07 AM10/18/11
to
On Mon, 17 Oct 2011 18:10:35 -0700 (PDT), Adam Beneschan wrote:

> I have a feeling you're fundamentally confused about what UTF-8 is, as
> compared to "Latin-1". Latin-1 is a character mapping. It defines,
> for all integers in the range 0..255, what character that integer
> represents (e.g. 77 represents 'M', etc.). Unicode is a character
> mapping that defines characters for a much larger integer range.

No, Unicode is a standard that describes character mappings. Both UTF-8 and
Latin-1 are encodings. Latin-1 as an encoding has the property that there is
a 1-1 octet-to-code-point correspondence, at the cost that some (most)
code points cannot be represented by the encoding. UTF-8 lacks this
property, but is capable of representing all code points.

> Because of this, it is not feasible to work with strings or characters
> in UTF-8 encoding. Suppose you declare a string
>
> S : String (1 .. 100);
>
> but you want it to be a UTF-8 string. How would that work? If you
> want to look at S(50), the computer would have to start at the
> beginning of the string and figure out whether each character is
> represented as 1 or 2 bytes. Nobody wants that.

Nobody actually cares, because strings are not processed that way. String
indices are obtained in the course of operations which keep them at the
beginnings of properly encoded code points.

It is a language problem to distinguish an index (some index type) from a
position (a cardinal number). Ada does this, BTW.

When you write S(50), what is 50 here? The 50th character (code point) counting
from the beginning of the string, or the index 50 of a character whose
position is unknown without looking into the string? Considering the
declaration of String, it is not clear whether Positive is a position or a
proper index. For the latter, S(50) simply is not read as "the 50th character".
Furthermore, it is not guaranteed that if 50 is a valid index then 51 is
valid too.
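The index-versus-position distinction is easy to see on concrete bytes. A quick sketch in Python rather than Ada, treating a UTF-8 string as the raw octet array an Ada String would hold:

```python
# 7 characters, but 9 bytes once UTF-8 encoded: 'é' and 'à' take two bytes each.
data = "déjà vu".encode("utf-8")
assert len(data) == 9
assert len(data.decode("utf-8")) == 7

# A byte index is not a character position: cutting at byte 2 splits the
# two-byte sequence for 'é' and leaves invalid UTF-8 behind.
try:
    data[:2].decode("utf-8")       # b'd\xc3' -- a dangling lead byte
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid
```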

Dmitry A. Kazakov

unread,
Oct 18, 2011, 4:01:53 AM10/18/11
to
On Mon, 17 Oct 2011 16:47:49 -0700 (PDT), ytomino wrote:

> But other libraries in the standard are explicitly defined as Latin-1.
> It's certain that Ada.Character.Handling.To_Upper breaks UTF-8.
> So we cannot use most of the subprograms in Ada.Characters and Ada.Strings
> for handling file names.

Right, it is a lot more than just Ada.Directories. I have implemented UTF-8
versions of Ada.Characters.Handling and Ada.Strings.Maps: sets and maps of
characters, case conversions, character classification, superscript and
subscript integer I/O.

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 5:32:07 AM10/18/11
to
Le Tue, 18 Oct 2011 06:46:13 +0200, ytomino <agh...@gmail.com> a écrit:

> Well...If my supplement is allowed, in my honest opinion ignoring the
> existing way of Ada, "File_Name_String" is better.
> (In addition, It's welcome that UTF_8_String and UTF_16_String be new
> types like Yannick says.)
For personal and specific use cases, yes; however, for a standard, I would
be more in favor of a Unicode_String type. To be honest, my dream would
be to replace the Ada String type with that Unicode_String type (a dream…
I said). I once attempted to create packages where the String type was
redefined, but failed due to some scope trouble (I could never make up my
mind about whether or not this was a GNAT bug).

This is important, because UTF-8 vs. UTF-16LE, UTF-16BE, and even possibly
UTF-32BE and UTF-32LE, is only a matter of implementation and is not a
good candidate for an interface, unless participating in a specific use
case.

A Unicode_String implementation could be optionally encoded, or not, at the
sole discretion of the implementation. The implementation could use UTF-32 if
it wishes to be simple, or favor the same encoding as the target
platform. This Unicode_String type would have methods to return a
conversion into one of UTF-8, UTF-16 and UTF-32, and optionally (it may raise
a runtime error) into ISO-8859-1. For efficiency, this could also provide
primitives for common iterated compositions, such as concatenation, getting
a slice, and comparison (which can be implemented far more efficiently at the
implementation level than by getting and setting characters, which
involves encoding and decoding each time). I would also suggest a
Change_To_Upper_Case (Unicode_String, Index), and the same with
Change_To_Lower_Case, along with Remove_Slice and Insert_Slice
primitives. These primitives would cover most use cases and help preserve
efficiency.

This could also solve a glitch. Actually, if you want to store a UTF-8
string in an Ada source, you have to cheat the compiler: edit the file as
UTF-8, and compile as if it were ISO-8859-1 (*). Unfortunately, this is not
clean. If there were a real Unicode_String type (or the String type changed
into a Unicode one… in my dreams), this would not be a trouble any more.

On the other hand, if this would cause trouble for Ada, I prefer no
change, and to go on with personal methods.

(*) You can do the same for UTF-16, with some variation: use
Wide_Character for your strings, edit sources in UTF-16, and cheat the
compiler by telling it the sources are UCS-2 encoded (note: UCS-2 is another
no-encoding Unicode subset, the same way ISO-8859-1 is, except two bytes
wide instead of one byte wide).
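The compile-as-ISO-8859-1 trick works because interpreting arbitrary bytes as Latin-1 never fails and never alters them. A hypothetical sketch in Python of what the compiler effectively does with such a source file:

```python
# The source file really contains UTF-8 bytes...
utf8_bytes = "été".encode("utf-8")               # b'\xc3\xa9t\xc3\xa9'

# ...but the compiler is told it is ISO-8859-1, so each byte becomes one
# Latin-1 "character" (mojibake from a human point of view):
as_latin1 = utf8_bytes.decode("latin-1")
assert as_latin1 == "\u00c3\u00a9t\u00c3\u00a9"  # 'Ã©tÃ©'

# Re-encoding as Latin-1 gives back the original UTF-8 bytes unchanged,
# which is why the string literal still holds valid UTF-8 at run time:
assert as_latin1.encode("latin-1") == utf8_bytes
```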

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 5:41:14 AM10/18/11
to
Le Tue, 18 Oct 2011 09:55:07 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> No, Unicode is a standard that describes character mappings. Both UTF-8 and
> Latin-1 are encodings. Latin-1 as an encoding has the property that there
> is a 1-1 octet-to-code-point correspondence, at the cost that some (most)
> code points cannot be represented by the encoding. UTF-8 lacks this
> property,
To not mislead people: don't forget that UTF-8 also has this property with
regard to US-ASCII, which is also a Unicode subset.

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 5:54:27 AM10/18/11
to
“Really?” Yes, Ytomino ;) You obviously need to initialize it in some way,
but until it is initialized, you can't access individual bytes, and all
indexes you pass to UNICODE_STRING methods are character indexes, never byte
indexes.

In the former link, just do a search for “-- Get ”, to find the basic
accessors: there are two, one as a method and one as an operator (you will
easily guess, the syntax is Ada-inspired), and both expect a character
index, not a byte index. You obviously have initializers and converters to
and from other encodings, and methods to check whether the actual content
matches some restricted Unicode range to avoid runtime errors (defensive
programming), but the main interface accesses characters, not bytes at
all.

It's been a long time since I wrote any Eiffel. Thanks for the above
link; it was a pleasure to see :) (some days ago, someone else posted some
stuff with Eiffel inside too)

Dmitry A. Kazakov

unread,
Oct 18, 2011, 6:00:00 AM10/18/11
to
On Tue, 18 Oct 2011 11:32:07 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Tue, 18 Oct 2011 06:46:13 +0200, ytomino <agh...@gmail.com> a écrit:
>
>> Well...If my supplement is allowed, in my honest opinion ignoring the
>> existing way of Ada, "File_Name_String" is better.
>> (In addition, It's welcome that UTF_8_String and UTF_16_String be new
>> types like Yannick says.)
> For personal and specific use cases, yes, however, for a standard, I would
> be more in favor of an Unicode_String type. To be honest, my dream would
> be to replace the Ada String type with that Unicode_String type (a dream…

No need to replace anything, just fix the type system. It should be capable
of having String as a subtype of Wide_Wide_String, which is already Unicode.
UTF8_String should also be a subtype of Wide_Wide_String, being just an
alternative implementation of it.

All differences between string and character types are differences in their
implementations, not in the semantics. Semantically any string is a
sequence of code points (with various constraints applied to the set of
code points).
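The claim that the differences are in implementation, not semantics, can be illustrated with a sketch in Python: the same sequence of code points, stored three ways:

```python
# One sequence of three code points (U+00E9, U+0074, U+00E9), three layouts:
text = "été"
layouts = {
    "latin-1":   text.encode("latin-1"),    # 3 octets
    "utf-8":     text.encode("utf-8"),      # 5 octets
    "utf-32-le": text.encode("utf-32-le"),  # 12 octets
}
assert len(set(layouts.values())) == 3      # three distinct byte strings...
for enc, raw in layouts.items():
    assert raw.decode(enc) == text          # ...all denoting the same string
```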

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 6:06:17 AM10/18/11
to
Le Tue, 18 Oct 2011 12:00:00 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> No need to replace anything, just fix the type system. It should be
> capable
> to have String a subtype of Wide_Wide_String, which is already Unicode.
So you are dreaming of a universal_string type? ;)

J-P. Rosen

unread,
Oct 18, 2011, 6:10:05 AM10/18/11
to
Le 18/10/2011 04:59, Yannick Duchêne (Hibou57) a écrit :
> I feel it is more an
> implementation trick (which was indeed intended by the design of UTF-8
> targeting some hardly solvable context), than a clean formalization.
It is not (says the one who wrote the AI).

The issue of using String or a different type was carefully
investigated, and String was chosen mainly on the grounds of usability
(e.g., when you read from a text file, you don't know whether it is encoded
or not until you have read the BOM - would you read the BOM into a
String or an Encoded_String?)

This appears in the !discussion section of AI05-0137-1/05

Note that the AI carefully talks about "characters whose position
numbers correspond to the encoding".

An encoding is a way to store a character string in a more compact
manner, but it still represents a string of characters. Compare that
with packed arrays - that does not change the high level nature of the
array.
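The BOM argument above can be made concrete. A hypothetical helper, sketched in Python (the function name `sniff_bom` is an invention for illustration): until the first bytes are inspected, the encoding of a text file is simply unknown.

```python
# Hypothetical helper: map a leading byte-order mark to an encoding name.
def sniff_bom(prefix):
    boms = [                                 # longest BOMs first: the UTF-32 LE
        (b"\xff\xfe\x00\x00", "utf-32-le"),  # BOM starts with the UTF-16 LE one
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if prefix.startswith(bom):
            return name
    return None                              # no BOM: encoding must be assumed

assert sniff_bom("été".encode("utf-8-sig")) == "utf-8"
assert sniff_bom(b"no BOM here") is None
```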
--
---------------------------------------------------------
J-P. Rosen (ro...@adalog.fr)
Adalog a déménagé / Adalog has moved:
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00

J-P. Rosen

unread,
Oct 18, 2011, 6:25:17 AM10/18/11
to
Le 18/10/2011 09:55, Dmitry A. Kazakov a écrit :

> No, Unicode is a standard that describes character mappings.
True

> Both UTF-8 and
> Latin-1 are encodings.
Wrong. Latin-1 is the name of the lower left corner of the BMP (Basic
Multilingual Plane, or Plane 0 of ISO-10646)

ytomino

unread,
Oct 18, 2011, 6:52:01 AM10/18/11
to
On Oct 18, 6:54 pm, Yannick Duchêne (Hibou57)
OK, I've understood.
But UNICODE_STRING is usually not called a "UTF-8 string", because the
content is decoded.
UNICODE_STRING seems to be just an array of UCS-4 code points to me. That
is commonly called a "UTF-32 string".
(In the same way, Wide_Wide_String is not called a UTF-8 string.)

Peter C. Chapin

unread,
Oct 18, 2011, 6:55:54 AM10/18/11
to
On 2011-10-17 03:59, Dmitry A. Kazakov wrote:

> Wrong. All file systems share common features, which can and must be
> properly abstracted. System-specific are the implementations, not the
> package specifications.

Not all possible file system features, even common ones, are abstracted
by the standard. So the standard must pick and choose which ones to expose.

> Because inability to spell the file name is not same as lacking access
> rights. Access rights are external to the program code. The file name,
> coded as a string literal is a part of the program. Failure of the former
> is not a bug. The latter is a bug, because the file exists, is accessible
> and has proper name. A program bug which cannot be fixed is a language
> design bug.

I don't see it the same way. Extended attributes also exist, are
accessible (to the system), and have names. Yet the standard doesn't
allow you to access them.

Anyway it seems like this is drifting off the main topic as it sounds
like the standard does have a mechanism for accessing general Unicode
file names... or at least that's what I'm gathering from the discussion.

The issue of character set handling is slippery business, as you know.
Perhaps the fundamental problem is that Unicode text is essentially
binary data. For example when reading a Unicode file one needs to treat
it as a binary file and then decode the contents (into String,
Wide_String or Wide_Wide_String as desired) as it is read.

Personally the idea of holding on to encoded data in memory seems like a
bad idea. I know some programming languages store strings internally in
"UTF-8 format" but that never made sense to me. UTF-8 encoded data is
binary data. It should be put into an array of bytes or have a new type
for it. I definitely don't want to accidentally mix "normal" strings of
(decoded) characters with UTF-8 encoded strings. I have a feeling,
Dmitry, this is what you are also saying.

Peter

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 6:56:20 AM10/18/11
to
Le Tue, 18 Oct 2011 12:25:17 +0200, J-P. Rosen <ro...@adalog.fr> a écrit:

> Le 18/10/2011 09:55, Dmitry A. Kazakov a écrit :
>
>> No, Unicode is a standard that describes character mappings.
> True
>
>> Both UTF-8 and
>> Latin-1 are encodings.
> Wrong. Latin-1 is the name of the lower left corner of the BMP (Basic
> Multilingual Plan, or Plan 0 of ISO-10646)
May I add, to avoid confusion in readers' minds: what I named ISO 8859-1 in
this thread is the formal name of Latin-1. Both refer to the same thing;
Latin-1 is kind of its friendly name.

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 7:02:05 AM10/18/11
to
Le Tue, 18 Oct 2011 12:52:01 +0200, ytomino <agh...@gmail.com> a écrit:
> OK, I've understood.
> But, UNICODE_STRING is usually not called "UTF-8 string". Because the
> content is decoded.
> UNICODE_STRING seems just array of UCS-32 code points to me. It's
> called "UTF-32 string" commonly.
> (It's same as that Wide_Wide_String is not called UTF-8 string.)
If my mind still serves me right from the time I dug into the SmallEiffel
compiler's sources (back in 1999 and 2000), this was implemented with UTF-8
for memory efficiency. Maybe its successor, SmartEiffel, less memory
efficient, was different. As underlined by Dmitry, the best way is to see
it as a sequence of code points, as you first said, indeed (although
directly mappable to code points, UTF-32 still formally refers to an
encoding, although a straight and direct one… but never mind, that's
just a detail).

Dmitry A. Kazakov

unread,
Oct 18, 2011, 8:01:05 AM10/18/11
to
On Tue, 18 Oct 2011 12:06:17 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Tue, 18 Oct 2011 12:00:00 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>> No need to replace anything, just fix the type system. It should be capable
>> to have String a subtype of Wide_Wide_String, which is already Unicode.
> So you, you are dreaming of an universal_string type ? ;)

No, rather a cloud of string types with different implementations and the
same interface. The problem is that types which are semantically the same:

String
Wide_String
Wide_Wide_String
Unbounded_String
...

are not the same in the language. Adding UTF-8, UTF-16 etc. would multiply
that already grotesque mess.

Dmitry A. Kazakov

unread,
Oct 18, 2011, 8:27:34 AM10/18/11
to
On Tue, 18 Oct 2011 06:55:54 -0400, Peter C. Chapin wrote:

> On 2011-10-17 03:59, Dmitry A. Kazakov wrote:
>
>> Wrong. All file systems share common features, which can and must be
>> properly abstracted. System-specific are the implementations, not the
>> package specifications.
>
> Not all possible file system features, even common ones, are abstracted
> by the standard.

Maybe, but the code points of a file name are not that kind of feature. Each
file system in the end operates on Unicode code points, even if it does not
support Unicode.

>> Because inability to spell the file name is not same as lacking access
>> rights. Access rights are external to the program code. The file name,
>> coded as a string literal is a part of the program. Failure of the former
>> is not a bug. The latter is a bug, because the file exists, is accessible
>> and has proper name. A program bug which cannot be fixed is a language
>> design bug.
>
> I don't see it the same way. Extended attributes also exist, are
> accessible (to the system), and have names. Yet the standard doesn't
> allow you to access them.

It would be the same if the standard did not allow access to file names at
all. But it allows that, though inconsistently.

Not doing something is not a bug. A bug is when something is done wrong.

> The issue of character set handling is slippery business, as you know.
> Perhaps the fundamental problem is that Unicode text is essentially
> binary data.

No, Unicode text is a sequence of code points, which can be represented
using various encodings. That particular representation is binary data.

> For example when reading a Unicode file one needs to treat
> it as a binary file and then decode the contents (into String,
> Wide_String or Wide_Wide_String as desired) as it is read.

Well, that depends on the semantics of these types. If we consider them
character strings, then you are wrong. Character strings are not
representations; they are just chains of Unicode code points constrained to
some set of code points, as Wide_String is [*].

Reading lines of a *text* file as Wide_String or as Wide_Wide_String
assumes an appropriate decoding rather than mindless shuffling of chunks of
memory. Ideally, from an *Ada* implementation I would expect that when a
UTF-8 encoded text file is read as Wide_String, I would get exactly the same
sequences of code points as in the UTF-8, or Data_Error for those which
cannot be represented. I see no problem in implementing it this way and
requiring such implementations by the standard. For raw binary I/O there are
streams and direct I/O of Unsigned_8 or whatever octet/memory unit type.

> Personally the idea of holding on to encoded data in memory seems like a
> bad idea. I know some programming languages store strings internally in
> "UTF-8 format" but that never made sense to me. UTF-8 encoded data is
> binary data. It should be put into an array of bytes or have a new type
> for it. I definitely don't want to accidentally mix "normal" strings of
> (decoded) characters with UTF-8 encoded strings. I have a feeling,
> Dmitry, this is what you are also saying.

Yes, I too wished to have separate string types for UTF-8 and UTF-16. It is
IMO bad to mandate UTF-8 for Ada.Directories. Rather, it should be extended
with Wide_Wide_String versions, as should Ada.Text_IO and all other packages
where file names appear.

I would also have file paths, file names, file extensions etc properly
typed, i.e. not as raw strings, but that is another story for another day.

-----------------------
* An alternative interpretation could be that Wide_String is the UCS-2
(+ endianness specification) encoding. But that would be a bad idea for a
higher-level language like Ada.

Pascal Obry

unread,
Oct 18, 2011, 10:06:35 AM10/18/11
to Randy Brukardt
Randy,

> Now, if an implementation on Windows doesn't have a way to use UTF-8
> encoding, that is an implementation problem, but not one that the Standard
> can do much about.

But I can tell you that supporting UTF-8 on Windows is not trivial at
all as there is encoding/decoding needed in many places. Doing that is
not trivial and we had the need to invent the "encoding=[UTF8|8BITS]"
mode for Text_IO.Open for example. As you say, implementation details,
but can be easily defeated:

If in my file I have:

Filename : constant String := "été";

And this file is saved using UTF-8 encoding, then:

Text_IO.Open (Filename, ..., Mode => "encoding=8bits");

Will just fail. A programmer error? Ok...

Now:

Text_IO.Get (Filename, Last);
Text_IO.Open (Filename, ..., Mode => "encoding=8bits");

What if the console is UTF-8?

Pascal.

--

--|------------------------------------------------------
--| Pascal Obry Team-Ada Member
--| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE
--|------------------------------------------------------
--| http://www.obry.net - http://v2p.fr.eu.org
--| "The best way to travel is by means of imagination"
--|
--| gpg --keyserver keys.gnupg.net --recv-key F949BD3B

Pascal Obry

unread,
Oct 18, 2011, 10:08:25 AM10/18/11
to Randy Brukardt
Le 18/10/2011 16:06, Pascal Obry a écrit :
> But I can tell you that supporting UTF-8 on Windows is not trivial at
> all as there is encoding/decoding needed in many places. Doing that is
> not trivial and we had the need to invent the "encoding=[UTF8|8BITS]"
> mode for Text_IO.Open for example. As you say, implementation details,
^^^^
form
> but can be easily defeated:
>
> If in my file I have:
>
> Filename : constant String := "été";
>
> And this file is saved using UTF-8 encoding, then:
>
> Text_IO.Open (Filename, ..., Mode => "encoding=8bits");
^^^^
Form

Adam Beneschan

unread,
Oct 18, 2011, 11:02:31 AM10/18/11
to
On Oct 17, 7:32 pm, ytomino <aghi...@gmail.com> wrote:
>
> I'm not confused. You're misreading.

I think we have a terminology problem. To me, Latin-1 is a set of
characters (a subset of the full Unicode character set). So I get
confused when people talk about Latin-1 versus UTF-8 strings as if
they were mutually exclusive. They're not, the way I understand the
terms. You can have a string composed of Latin-1 characters that's
represented using UTF-8 encoding; and the bits in that string would be
different from a string of the same Latin-1 characters using the
"regular" encoding, if any character in the string is in the 16#80#..
16#FF# range.

However, everyone else seems to be using "Latin-1" to talk about the
*representation* in addition to the subset of characters that's being
represented---in particular, the representation in which each symbol
is represented as one 8-bit byte. And I guess we don't really have a
good term to describe that representation. I think UCS-1 is best, but
it doesn't seem to be commonly used. So I guess I'll have to learn to
live with the misuse of the term "Latin-1" to refer to a
representation (encoding)---just as we older programmers have learned
to live with the terms "Julian Date" and "Gregorian Date" to mean
dates in year/day-of-year form and in year/month/day form despite the
fact that this has nothing to do with the Julian or Gregorian
calendar. OK, then. I apologize for assuming that this was a sign of
your misunderstanding.

On the other hand, I was confused by your statement
"Ada.Character.Handling.To_Upper breaks UTF-8". I don't even see a
way for this to make sense. Ada.Characters.Handling works on
character types, and a character type is an enumeration type; but a
UTF-8 "character" can't be an enumeration type at all, since it's a
variable-length sequence of 8-bit bytes. I'm not quite sure what you
meant here.

As to having utilities such as versions of Ada.Strings.Unbounded or
Ada.Strings.Fixed that work directly on UTF-8-encoded strings (and
versions of Ada.Characters that operate on single UTF-8-encoded
characters): it's certainly possible to write a package like that, and
anyone is free to do so, but I just don't think they'd be widely used
enough to add to the Standard. I could be wrong.

-- Adam

Dmitry A. Kazakov

unread,
Oct 18, 2011, 11:16:02 AM10/18/11
to
On Tue, 18 Oct 2011 08:02:31 -0700 (PDT), Adam Beneschan wrote:

> On the other hand, I was confused by your statement
> "Ada.Character.Handling.To_Upper breaks UTF-8".

When a String X contains UTF-8 encoded text (meaning: Character'Pos = octet
value), then To_Upper (X) would yield garbage for some texts.
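A byte-for-byte illustration, sketched in Python. The mapping below mimics a Latin-1-only To_Upper applied to the octets of a String; it is an assumed model of the failure mode, not any compiler's actual code:

```python
def latin1_to_upper(octets):
    # Byte-wise Latin-1 upper-casing: 'a'..'z' and 'à'..'þ' (except '÷')
    # shift down by 16#20# to their upper-case partners.
    out = bytearray()
    for b in octets:
        if 0x61 <= b <= 0x7A or (0xE0 <= b <= 0xFE and b != 0xF7):
            out.append(b - 0x20)
        else:
            out.append(b)
    return bytes(out)

# On UTF-8 input the result is wrong at best: the octets of 'à' (C3 A0)
# fall outside the lower-case ranges, so the letter is silently left alone.
assert latin1_to_upper("à tout".encode("utf-8")).decode("utf-8") == "à TOUT"

# And it can be garbage: the lead octet E2 of '€' (E2 82 AC) is "upper-cased"
# to C2, leaving a byte sequence that is no longer valid UTF-8 at all.
mangled = latin1_to_upper("€10".encode("utf-8"))
try:
    mangled.decode("utf-8")
    still_utf8 = True
except UnicodeDecodeError:
    still_utf8 = False
assert not still_utf8
```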

Adam Beneschan

unread,
Oct 18, 2011, 11:34:14 AM10/18/11
to
On Oct 18, 12:55 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
wrote:
> On Mon, 17 Oct 2011 18:10:35 -0700 (PDT), Adam Beneschan wrote:
> > I have a feeling you're fundamentally confused about what UTF-8 is, as
> > compared to "Latin-1".  Latin-1 is a character mapping.  It defines,
> > for all integers in the range 0..255, what character that integer
> > represents (e.g. 77 represents 'M', etc.).  Unicode is a character
> > mapping that defines characters for a much larger integer range.
>
> No, Unicode is a standard that describes character mappings. Both UTF-8 and
> Latin-1 are encodings. Latin-1 as an encoding has the property that there is
> a 1-1 octet-to-code-point correspondence, at the cost that some (most)
> code points cannot be represented by the encoding. UTF-8 lacks this
> property, but is capable of representing all code points.

Sigh... I guess you're right about the term "Latin-1". It appears to
be *both* a character mapping *and* an encoding, based on a bit of
Wikipedia research. The problem for me is this: what does that make
Latin-2, Latin-3, KOI8-R, etc.? Those seem to describe the same
encoding mechanism as Latin-1 (each code represented as one 8-bit
byte), but with different meanings for the codes in the 16#A0#..16#FF#
range. So the same encoding scheme seems to have multiple different
names. That's very confusing to me.

I've tended to look at character-set issues as having two independent
parts: part 1 is how do we define the correspondence between integers
and the character symbols [or other "characters" with special meanings
like control characters]; and part 2 is, once we have a sequence of
integers that correspond to those characters, how do we represent that
sequence in memory, in a file, when sending bits over a wire, etc.
The two parts appear completely independent to me, which is why I get
confused when a term like "Latin-1" is used that straddles both
parts. (Unless we decree that Unicode is the only mapping in
existence, and things like Latin-2 or KOI8-R are encodings in which
bytes in the 16#A0#..16#FF# range represent integers which are totally
different and which are defined by the Unicode standard?)

I guess I'll have to learn what people mean by their terms. I had
some misimpressions.

And I think we could solve a lot by making String a more abstract type
defined by its operations rather than by its representation (array of
character). For a new language, as opposed to one in which we're
trying to maintain backward compatibility with a language designed in
the 1980s, that would be a great idea. (I *don't* think it was a good
idea to define UTF8_String as a subtype of String, and to decide that
a String could be used as a sequence of bytes that had no direct
correspondence to any characters from a character set. That seems
like a big compromise. On the other hand, doing it "right" would have
been a lot of work which I wouldn't have had to do, most of it
unpaid. So I'm hesitant to complain too much.)

-- Adam

J-P. Rosen

unread,
Oct 18, 2011, 1:27:37 PM10/18/11
to
Le 18/10/2011 17:34, Adam Beneschan a écrit :
> On Oct 18, 12:55 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
> wrote:
>> On Mon, 17 Oct 2011 18:10:35 -0700 (PDT), Adam Beneschan wrote:
>>> I have a feeling you're fundamentally confused about what UTF-8 is, as
>>> compared to "Latin-1". Latin-1 is a character mapping. It defines,
>>> for all integers in the range 0..255, what character that integer
>>> represents (e.g. 77 represents 'M', etc.). Unicode is a character
>>> mapping that defines characters for a much larger integer range.
>>
>> No, Unicode is a standard that describes character mappings. Both UTF-8 and
>> Latin-1 are encodings. Latin-1 as an encoding has the property that there is
>> a 1-1 octet-to-code-point correspondence, at the cost that some (most)
>> code points cannot be represented by the encoding. UTF-8 lacks this
>> property, but is capable of representing all code points.
>
> Sigh... I guess you're right about the term "Latin-1". It appears to
> be *both* a character mapping *and* an encoding, based on a bit of
> Wikipedia research. The problem for me is this: what does that make
> Latin-2, Latin-3, KOI8-R, etc.? Those seem to describe the same
> encoding mechanism as Latin-1 (each code represented as one 8-bit
> byte), but with different meanings for the codes in the 16#A0#..16#FF#
> range. So the same encoding scheme seems to have multiple different
> names. That's very confusing to me.
>
Not 100% sure, but I think here is the picture.
1) Code points are always 31 bits (or maybe 30).
2) Below is the lower left corner of BMP (use fixed fonts!):

|
|____________________
| | |
| Latin 1 | Latin 2 |
|_________|_________|_______

The lower halves of Latin-1 and Latin-2 are identical, i.e. the same
characters have two different code-points, differing by 256.

When you use Latin-1 with 8 bit bytes, you can view this as an encoding
with the 24 upper bits being 16#00_00_00#. When you use Latin-2 with 8
bit bytes, you can view this as an encoding with the 24 upper bits being
16#00_00_01#.

So in a sense, Latin-1 and Latin-2 are both character sets, and when
represented on only 8 bits, an encoding.

Does this make sense?

Adam Beneschan

unread,
Oct 18, 2011, 2:33:01 PM10/18/11
to
On Oct 18, 10:27 am, "J-P. Rosen" <ro...@adalog.fr> wrote:
Le 18/10/2011 17:34, Adam Beneschan a écrit :
No, I don't think so. In Latin-2 (ISO/IEC-8859-2), the code points
16#00#..16#A0# have the same meanings as in Latin-1 and Unicode. Past
that, though, the correspondence is all over the place. Thus, 16#A1#
in Latin-2 corresponds to 16#0104# in the Unicode BMP; 16#A2# ->
16#02D8#, 16#A3# -> 16#0141#, 16#A5# -> 16#013D#, etc.

-- Adam
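Adam's correspondences are straightforward to verify with any system that ships the standard 8859 codecs; a sketch in Python:

```python
# Latin-2 (ISO/IEC 8859-2) shares 16#00#..16#A0# with Latin-1 and Unicode,
# but beyond that its octets map to scattered code points in the BMP:
assert b"\xa1".decode("iso8859-2") == "\u0104"  # LATIN CAPITAL LETTER A WITH OGONEK
assert b"\xa2".decode("iso8859-2") == "\u02d8"  # BREVE
assert b"\xa3".decode("iso8859-2") == "\u0141"  # LATIN CAPITAL LETTER L WITH STROKE
assert b"\xa5".decode("iso8859-2") == "\u013d"  # LATIN CAPITAL LETTER L WITH CARON

# In Latin-1, by contrast, every octet maps to the code point of equal value:
assert all(bytes([b]).decode("iso8859-1") == chr(b) for b in range(256))
```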

Yannick Duchêne (Hibou57)

unread,
Oct 18, 2011, 3:54:24 PM10/18/11
to
Le Tue, 18 Oct 2011 19:27:37 +0200, J-P. Rosen <ro...@adalog.fr> a écrit:
> 1) Code points are always 31 bits (or maybe 30).
Less than that ;) The last valid code-point is actually 16#10FFFF#, which
is 21 bits wide. This is for valid code-points, only, because this one is
not even assigned to anything (belongs to the private-use-area, plan #16).
The last code-point with assigned semantic but without glyph, is 16#E0FFF#
and the last assigned code-point with assigned glyph and semantic is
16#2FFFF#. Well, beside these details, the last code-point will very-very
probably never go beyond 16#10FFFF#, and if an application does not expect
to define private code-point for internal use, then the last valid
code-point can be defined as 16#EOFFF# which is 20 bits wide. Counted in
bytes, this turn out to be 3 bytes in all cases, not 4.
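The 16#10FFFF# ceiling can be checked directly. Note the "3 bytes" above counts the code point as a raw integer; its UTF-8 *encoding* is a separate question and takes 4 octets. A sketch in Python:

```python
# U+10FFFF is the last valid code point; anything above is rejected.
assert ord(chr(0x10FFFF)) == 0x10FFFF
try:
    chr(0x110000)
    accepted = True
except ValueError:
    accepted = False
assert not accepted

# As a plain integer, 16#10FFFF# fits in 21 bits, hence 3 octets...
assert 0x10FFFF < 2 ** 24
# ...though its UTF-8 encoding needs 4 octets:
assert len(chr(0x10FFFF).encode("utf-8")) == 4
```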

ytomino

unread,
Oct 18, 2011, 5:18:37 PM10/18/11
to
On Oct 18, 8:02 pm, Yannick Duchêne (Hibou57)
<yannick_duch...@yahoo.fr> wrote:
Le Tue, 18 Oct 2011 12:52:01 +0200, ytomino <aghi...@gmail.com> a écrit:
> OK, I've understood.
Fuckin' great!

I downloaded SmartEiffel and looked through its UNICODE_STRING.e.
It holds two arrays of UTF-16 values.
UTF-16 array *A* holds UCS-2 characters or the first halves of surrogate
pairs.
UTF-16 array *B* holds the second halves of surrogate pairs.
*B* is never allocated unless it is actually required to hold at least
one surrogate pair.

It is certainly memory-efficient, and its computational complexity is
not increased.

(This string type is not to my liking, but it is interesting!)
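The scheme described above can be sketched roughly as follows — a reader's illustration in Python, not SmartEiffel's actual code, and building *B* eagerly where SmartEiffel allocates it lazily:

```python
def split_utf16(text):
    """Split a string into the two parallel arrays described above:
    *A* holds UCS-2 code units or high (first) surrogates, and *B*
    holds the matching low (second) surrogates (zero elsewhere)."""
    a, b = [], []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            a.append(cp)            # plain UCS-2 code unit
            b.append(0)
        else:
            cp -= 0x10000
            a.append(0xD800 | (cp >> 10))    # high surrogate
            b.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return a, b

# 'A' needs no surrogate; U+1F600 needs a pair:
a, b = split_utf16("A\U0001F600")
assert a == [0x0041, 0xD83D]
assert b == [0x0000, 0xDE00]
```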

ytomino

Oct 18, 2011, 6:54:10 PM
On Oct 19, 12:02 am, Adam Beneschan <a...@irvine.com> wrote:
> I think we have a terminology problem.
OK, sorry that my points were not laid out well. Let me confirm them.

> Latin-1 is a set of characters (a subset of the full Unicode character set).
Yes.
And it's also used as the name of an encoding (ISO 8859-1, as Yannick
calls it).

> So I get
> confused when people talk about Latin-1 versus UTF-8 strings as if
> they were mutually exclusive. They're not, the way I understand the
> terms. You can have a string composed of Latin-1 characters that's
> represented using UTF-8 encoding; and the bits in that string would be
> different from a string of the same Latin-1 characters using the
> "regular" encoding, if any character in the string is in the 16#80#..
> 16#FF# range.

Yes.
"Latin-1 as a character set" is not exclusive with Unicode (UCS-2 or
UCS-4).
"Latin-1 as an encoding" is exclusive with UTF-8.
And it was "Latin-1 as an encoding" that I (we?) were talking about.

> On the other hand, I was confused by your statement
> "Ada.Character.Handling.To_Upper breaks UTF-8". I don't even see a
> way for this to make sense. Ada.Characters.Handling works on
> character types, and a character type is an enumeration type; but a
> UTF-8 "character" can't be an enumeration type at all, since it's a
> variable-length sequence of 8-bit bytes. I'm not quite sure what you
> meant here.

Ada.Characters and Ada.Strings are defined to work with "Latin-1 as an
encoding" in the String type.
Some subprograms in them (like To_Upper) will replace upper-half
characters (16#80#..) with meaningless values in a String holding UTF-8,
if we invoke them on a UTF-8 String. (Equal_Case_Insensitive does not
replace characters, but it returns a meaningless result if its parameters
contain upper-half characters encoded as UTF-8.)

Of course, Ada.Wide_Wide_Characters.Handling.To_Upper
(UTF_Encoding.Wide_Wide_Strings.Decode (any UTF-8 encoded string))
works fine.
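The effect is easy to reproduce outside Ada. The sketch below (Python, approximating Latin-1 per-character upper-casing, which is roughly what To_Upper does on each Character of a String) shows how the pass destroys a UTF-8 sequence:

```python
# U+20AC (EURO SIGN) encodes in UTF-8 as the three octets E2 82 AC.
raw = "\u20ac".encode("utf-8")
assert raw == b"\xe2\x82\xac"

# Treat each octet as a Latin-1 character and upper-case it:
mangled = raw.decode("latin-1").upper().encode("latin-1")

# The lead octet E2 (a-circumflex in Latin-1) became C2, so the result
# is no longer the encoding of U+20AC -- or of anything valid at all:
assert mangled == b"\xc2\x82\xac"
try:
    mangled.decode("utf-8")
except UnicodeDecodeError:
    pass  # the trailing AC octet is now a stray continuation byte
```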

> As to having utilities such as versions of Ada.Strings.Unbounded or
> Ada.Strings.Fixed that work directly on UTF-8-encoded strings (and
> versions of Ada.Characters that operate on single UTF-8-encoded
> characters): it's certainly possible to write a package like that, and
> anyone is free to do so, but I just don't think they'd be widely used
> enough to add to the Standard. I could be wrong.

I thought the standard library was going to separate UTF-8 from Latin-1,
when I read about the UTF-8 mode of the Form parameter that Randy
mentioned. Latin-1 is not something I normally use, so I have wanted
UTF-8 versions of Ada.Characters. Sorry for mixing in my personal wish.
But it is certain that the standard library has some gaps in its handling
of non-ASCII file names.

By the way...

I will probably confuse you more :-)
Do you know that a single code-point is NOT a single letter for display?
Unicode has "composed characters": there are cases where several
code-points represent a single real letter.
(refer http://www.unicode.org/reports/tr15/tr15-33.html)
In addition, Unicode has the "variation selector", a decorator for the
preceding letter (which can be mixed with composed characters).
(refer http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html)

Therefore, the difficulty of handling Wide_Wide_String is in fact
similar to the difficulty of handling an encoded (UTF-8 or other format)
string.
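A small illustration of composed characters, using Python's `unicodedata` module as a stand-in for any Unicode normalization library:

```python
import unicodedata

# Two different code-point sequences, one displayed letter:
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

assert decomposed != composed                       # different code points
assert len(decomposed) == 2 and len(composed) == 1  # different lengths
# ...yet NFC normalization (UAX #15) maps one onto the other:
assert unicodedata.normalize("NFC", decomposed) == composed
```

So even an array of Wide_Wide_Character does not give "one element per letter".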

Adam Beneschan

Oct 18, 2011, 7:42:51 PM
On Oct 18, 8:16 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
wrote:
> On Tue, 18 Oct 2011 08:02:31 -0700 (PDT), Adam Beneschan wrote:
> > On the other hand, I was confused by your statement
> > "Ada.Character.Handling.To_Upper breaks UTF-8".
>
> When String X contains UTF-8 encoded text (means: Character'Pos = octet
> value), then To_Upper (X) would yield garbage for some texts.

Oh, I see. I thought he was actually talking about UTF-8 encoded
characters, not "characters" in a UTF-8 encoded string. My impression
(apparently wrong) was that when String X contained UTF-8 encoded
text, that the programmer would understand that the characters in it
weren't really *characters* and thus wouldn't dream of calling
To_Upper. But I suppose that somebody working on a part of a program
that takes a String parameter and doesn't realize that the String
parameter could be an array of not-really-characters could get it
wrong. Which I think is more evidence of why it was wrong to have the
String type, an array of characters, do double duty as an array-of-
encoded-bytes type.

-- Adam

Dmitry A. Kazakov

Oct 19, 2011, 4:12:16 AM
On Tue, 18 Oct 2011 16:42:51 -0700 (PDT), Adam Beneschan wrote:

> Which I think is more evidence of why it was wrong to have the
> String type, an array of characters, do double duty as an array-of-
> encoded-bytes type.

There is nothing wrong with a string having an array interface. What is
wrong is a language design that requires the implementation of that
interface in a certain way, one inconsistent with the type's semantics.

It should have been:

type String_Index is range ...;
type Octet_Index is range ...;

type UTF8_String is
private array (String_Index range <>) of Wide_Wide_Character
and
private array (Octet_Index range <>) of Unsigned_8;

private
type UTF8_String is array (Octet_Index range <>) of Unsigned_8;
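Dmitry's syntax above is hypothetical, not legal Ada; the dual view it asks for can be approximated today with an opaque type. A rough sketch — in Python, purely to illustrate the idea of one private representation behind two indexed views:

```python
class Utf8String:
    """An opaque UTF-8 buffer exposing both views from the sketch
    above: an octet view (Octet_Index) over the raw storage, and a
    code-point view (String_Index) over the decoded characters."""

    def __init__(self, text):
        self._octets = text.encode("utf-8")  # the private representation

    def octet(self, i):
        # Octet_Index view: raw storage access, cheap.
        return self._octets[i]

    def code_point(self, i):
        # String_Index view: decodes on demand (O(n) in this sketch;
        # a real implementation would index more cleverly).
        return self._octets.decode("utf-8")[i]

s = Utf8String("\u00e9t\u00e9")      # "été": 3 code points, 5 octets
assert len(s._octets) == 5
assert s.octet(0) == 0xC3            # first octet of UTF-8 "é"
assert s.code_point(0) == "\u00e9"   # first character
```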

Randy Brukardt

Oct 19, 2011, 5:32:38 PM
"Pascal Obry" <pas...@obry.net> wrote in message
news:4E9D87EB...@obry.net...
...
>> Now, if an implementation on Windows doesn't have a way to use UTF-8
>> encoding, that is an implementation problem, but not one that the
>> Standard
>> can do much about.
>
> But I can tell you that supporting UTF-8 on Windows is not trivial at all
> as there is encoding/decoding needed in many places. Doing that is not
> trivial and we had the need to invent the "encoding=[UTF8|8BITS]" mode for
> Text_IO.Open for example.

I would not claim that it is easy. I haven't done anything about it for
Janus/Ada, for example. (This falls into the "no one has complained, so
other things have priority" category).

> As you say, implementation details, but can be easily defeated:
>
> If in my file I have:
>
> Filename : constant String := "été";
>
> And this file is saved using UTF-8 encoding, then:
>
> Text_IO.Open (Filename, ..., Mode => "encoding=8bits");
>
> Will just fail. A programmer error? Ok...

Right. That's the problem with the weak typing that we've adopted for UTF-8
and other encodings. It really has nothing to do with Open; it's a general
problem with Ada.

The obvious solution (if this is a real problem in practice) would be to
layer a strongly-typed layer on top of the existing facilities. Easy enough
to do, but probably not something that will be in the Standard.

> Now:
>
> Text_IO.Get (Filename, Last);
> Text_IO.Open (Filename, ..., Mode => "encoding=8bits");
>
> What if the console is UTF-8?

If you're expecting to get Wide_Wide_Characters, you really ought to read a
Wide_Wide_Character string. But I'm well aware that this solution is
sub-optimal (especially in that it wastes huge amounts of space).

Short of completely abandoning the existing I/O system (not the worst idea,
IMHO, but unlikely), I don't think there is any practical way to "fix" Ada
to deal easily with the *rare* possibility of non-Latin-1 characters. If I
was doing this from scratch, I would simply decree that all I/O strings are
represented in UTF-8, and use a dedicated type for them so that they can't
be mixed with "String" or "Wide_String".

Randy.


Randy Brukardt

Oct 19, 2011, 5:43:08 PM
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:a3j4wzrhrj65$.bkkht9t97w84.dlg@40tude.net...
> On Tue, 18 Oct 2011 08:02:31 -0700 (PDT), Adam Beneschan wrote:
>
>> On the other hand, I was confused by your statement
>> "Ada.Character.Handling.To_Upper breaks UTF-8".
>
> When String X contains UTF-8 encoded text (means: Character'Pos = octet
> value), then To_Upper (X) would yield garbage for some texts.

You should have just said:
When String X contains UTF-8 encoded text (means: Character'Pos = octet
value), then virtually all existing string operations will yield garbage for
some texts.

The only way to safely use a UTF-8 string is opaquely, which means you can
store it whole, but any operation on it is performed after decoding it.
That's of course the best argument for having it be a separate type. The
problem is that Ada doesn't have any reasonable way to define conversions
for that type (and having long-winded conversion functions with long winded
names like "Ada.Strings.Unbounded.To_Unbounded_String" don't count in my
view). And there is just enough need to treat these things as
arrays-of-bytes (slicing is needed for storage of variable length UTF-8
strings in "plain Ada", for one example) that treating them as "opaque"
isn't ideal.
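The point about byte-level slicing is worth seeing concretely; a quick sketch (Python, as an illustration of the encoding fact, not of any Ada package):

```python
raw = "\u00e9t\u00e9".encode("utf-8")   # "été": 3 characters, 5 octets
assert len(raw) == 5

# Slicing at an arbitrary octet boundary can cut a character in half:
head = raw[:1]                          # just the lead byte of "é"
valid = True
try:
    head.decode("utf-8")
except UnicodeDecodeError:
    valid = False
assert not valid   # the slice is not valid UTF-8 on its own
```

So slices of a UTF-8 buffer are only meaningful when the slice bounds fall on character boundaries, which is exactly why treating the buffer as a plain array of characters is unsafe.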

Randy.




Dmitry A. Kazakov

Oct 20, 2011, 3:37:57 AM
On Wed, 19 Oct 2011 16:43:08 -0500, Randy Brukardt wrote:

> The only way to safely use a UTF-8 string is opaquely, which means you can
> store it whole, but any operation on it is performed after decoding it.
> That's of course the best argument for having it be a separate type.

Yes. It is worth to remember that Ada once was considered a strongly typed
language...

> The
> problem is that Ada doesn't have any reasonable way to define conversions
> for that type (and having long-winded conversion functions with long winded
> names like "Ada.Strings.Unbounded.To_Unbounded_String" don't count in my
> view).

This is a language type system problem, which must be fixed first.

> And there is just enough need to treat these things as
> arrays-of-bytes (slicing is needed for storage of variable length UTF-8
> strings in "plain Ada", for one example) that treating them as "opaque"
> isn't ideal.

That is an unrelated issue. Once the type system gets fixed, it would be no
problem to have an array view (or fancy "aspect", if you want) of encoded
strings.

Yannick Duchêne (Hibou57)

Oct 20, 2011, 7:04:43 AM
Le Thu, 20 Oct 2011 09:37:57 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:

> On Wed, 19 Oct 2011 16:43:08 -0500, Randy Brukardt wrote:
>
>> The only way to safely use a UTF-8 string is opaquely, which means you
>> can
>> store it whole, but any operation on it is performed after decoding it.
>> That's of course the best argument for having it be a separate type.
>
> Yes. It is worth to remember that Ada once was considered a strongly
> typed
> language...
It still is!, the trouble is at library level, not language level.

Dmitry A. Kazakov

Oct 20, 2011, 8:21:16 AM
On Thu, 20 Oct 2011 13:04:43 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Thu, 20 Oct 2011 09:37:57 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>
>> On Wed, 19 Oct 2011 16:43:08 -0500, Randy Brukardt wrote:
>>
>>> The only way to safely use a UTF-8 string is opaquely, which means you can
>>> store it whole, but any operation on it is performed after decoding it.
>>> That's of course the best argument for having it be a separate type.
>>
>> Yes. It is worth to remember that Ada once was considered a strongly
>> typed language...

> It still is!, the trouble is at library level, not language level.

No, the troubles at the library level are reflections of language problems.

The language ceased to evolve within its paradigm of a strongly typed
language. Instead of addressing new issues from the stand point of typed
approach, it tries solutions from the languages alien to its spirit.

Yannick Duchêne (Hibou57)

Oct 20, 2011, 8:38:27 AM
Le Thu, 20 Oct 2011 14:21:16 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
>> It still is!, the trouble is at library level, not language level.
>
> No, the troubles at the library level are reflections of language
> problems.
>
> The language ceased to evolve within its paradigm of a strongly typed
> language. Instead of addressing new issues from the stand point of typed
> approach, it tries solutions from the languages alien to its spirit.
Can you draw a short (or less short) formal model ? Do you have clear
ideas ? Are they ideas inspired from known formalisms (may be like in
specific or research languages) ? I am interested in typing and typing
models.

Dmitry A. Kazakov

Oct 20, 2011, 10:31:59 AM
On Thu, 20 Oct 2011 14:38:27 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Thu, 20 Oct 2011 14:21:16 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>>> It still is!, the trouble is at library level, not language level.
>>
>> No, the troubles at the library level are reflections of language
>> problems.
>>
>> The language ceased to evolve within its paradigm of a strongly typed
>> language. Instead of addressing new issues from the stand point of typed
>> approach, it tries solutions from the languages alien to its spirit.

> Can you draw a short (or less short) formal model ? Do you have clear
> ideas ? Are they ideas inspired from known formalisms (may be like in
> specific or research languages) ? I am interested in typing and typing
> models.

I am not a language designer. I have problems rather than solutions.

What I know is that the decomposition shall go along the types. Design
entities must be described as types. Their relationships should be as type
relationships. Substitutability should be decided on the basis of
manifested declarations, not the type structure. Interface must be clearly
separated from implementation. Implementation must be absolutely free to
choose. There shall be no procedures, only operations on types. All types
shall have classes. Any syntax sugar (prefix notation, infix operations,
assignments, indexing, member extraction, aggregates, entries, attributes)
shall be operations. Construction model must be type safe (in particular,
each type must have constructors, including class-wide types). The type
system shall support both specialization and generalization. The programmer
should be able to enforce static type and constraint checks, in particular,
to convert any potentially dynamic checks into compile-time errors. All
exceptions must be typed, contracted and statically checked.

Yannick Duchêne (Hibou57)

Oct 20, 2011, 11:54:28 AM
Le Thu, 20 Oct 2011 16:31:59 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> I am not a language designer. I have problems rather than solutions.
Like many of us here ;)

> What I know is that the decomposition shall go along the types.
You used to say you don't like FP much, but I swear, I am sure you would
enjoy some parts of it ;)

> Implementation must be absolutely free to
> choose. There shall be no procedures, only operations on types. All types
> shall have classes.
What's missing from Interface type introduced with Ada 2005 ? Doesn't it
fulfill the above expectations ? (Also keep in mind that efficiency is
sometimes required, and if you want to place formalism over efficiency,
then you have to sacrifice efficiency, deliberately.)

> Any syntax sugar (prefix notation, infix operations,
> assignments, indexing, member extraction, aggregates, entries,
> attributes) shall be operations.
Are you sure you are not confused between concrete syntax and abstract
syntax ? Otherwise, if I may reword you, perhaps you are complaining that
there are not enough user-redefinable operations. If not, I don't see the
relevance of turning syntactic sugar into operations; the two play
different roles and belong to orthogonal domains.

> Construction model must be type safe (in particular,
> each type must have constructors, including class-wide types). The type
> system shall support both specialization and generalization.
Could you provide an example case of generalization you have in mind ?

> The programmer
> should be able to enforce static type and constraint checks, in
> particular,
> to convert any potentially dynamic checks into compile-time errors. All
> exceptions must be typed, contracted and statically checked.
This is not a language topic but a technology-level topic. I feel a
runtime check is a reasonable fall-back for what cannot be statically
checked in the current state of the technology. If you really require static
check, then you must restrict yourself to what can be statically checked.
If Ada 2012 defines some Design by Contract checks as runtime checks, that
is not a language flaw but a pragmatic choice. Along with that, if a
compiler is able to statically check what Ada 2012 designates as a runtime
check, then nothing in the language definition prevents the compiler from
applying all the static checks it is able to.

Dmitry A. Kazakov

Oct 20, 2011, 1:35:21 PM
On Thu, 20 Oct 2011 17:54:28 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Thu, 20 Oct 2011 16:31:59 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:

>> What I know is that the decomposition shall go along the types.
> You used to say you don't like FP much, but I swear, I am sure you would
> enjoy some parts of it ;)

No, FP is just too low level: procedural decomposition. Type systems
correspond to the categories - a better and more capable mathematics =>
safer design. Another fundamental problem of FP is a wrong premise about
being stateless. Computing is solely about states. You run a program to
have its side effects, there is no other reason for doing that.

>> Implementation must be absolutely free too
>> choose. There shall be no procedures but operations on types. All types
>> shall have classes.

> What's missing from Interface type introduced with Ada 2005 ?

1. Most Ada types do not have interfaces
2. Ada interface cannot be inherited from a concrete type
3. Ada interface cannot have implementation
4. Ada interface does not support ad-hoc supertypes

> Doesn't it
> fulfill the above expectations ? (also keep in mind sometime efficiency is
> required, and if you want place formalism over efficiency, then you have
> to sacrifice efficiency, conscientiously).

Not an issue. Scalar types may have interfaces at zero time/space cost. You
don't need to embed a tag into by-value types.

>> Any syntax sugar (prefix notation, infix operations,
>> assignments, indexing, member extraction, aggregates, entries,
>> attributes) shall be operations.
> Are you sure you are not confused between concrete syntax and abstract
> syntax ?

I don't understand this. The problem is that, for example, for the record
type T and its member A, the ".A" is not the operation of T, because record
is not an interface. A'First is not an operation of array. ":=" is not an
operation (doubly dispatching) of its left and right sides etc.

>> Construction model must be type safe (in particular,
>> each type must have constructors, including class-wide types). The type
>> system shall support both specialization and generalization.

> Could you provide an example case of generalization you have in mind ?

Examples are:

1. Type extension (e.g. upon derivation, present in Ada)
2. Expansion of enumeration types
3. Cartesian product of types, e.g. Real x Real -> Complex
4. Lifting constraints, e.g. Float -> IEEE Float (number + NaN + +Inf ...)
5. Ad-hoc supertypes, e.g. String U Unbounded_String -> General_String,
creating new classes from existing ones by union.

>> The programmer should be able to enforce static type and constraint checks, in
>> particular, to convert any potentially dynamic checks into compile-time errors. All
>> exceptions must be typed, contracted and statically checked.

> This is not a language topic but a technology-level topic. I feel a
> runtime check is a reasonable fall-back for what cannot be statically
> checked in the current state of the technology.

No, it is inconsistent and unreasonable. Static checks are meant to detect
bugs. A bug is either there or not, independently of whether the program is
running, not running, or will ever run. It is just not a function of the
execution state. A bug is a property of the program and all its possible
states as a whole. A program cannot be both correct and incorrect. A program
checking itself as wrong is a Cretan Liar.

> If you really require static
> check, then you must restrict yourself to what can be statically checked.

Yes, and I want a firewall between static and dynamic checks. If some
proposition is declared statically true or false, while the compiler is
unable to prove it, that should make the program illegal. The programmer
must be forced to choose, and if he decides on a static check he must be
sure that the compiler has indeed verified his assumption, or else he has
to change the program.

> If Ada 2012 defines some Design by Contract checks as runtime check, this
> is not a language flaw, a pragmatic choice.

Yet another generator of arbitrary exceptions. Lessons from accessibility
checks not learned...

> Along with that, if a compiler
> is able to statically check what Ada 2012 designate as runtime check, then
> nothing in the language definition disallows the compiler to apply all
> static checks it is able to.

See above, it is the difference between an illegal program and a program
raising exceptions, nothing in common.

J-P. Rosen

Oct 20, 2011, 1:40:40 PM
Le 20/10/2011 09:37, Dmitry A. Kazakov a écrit :
> On Wed, 19 Oct 2011 16:43:08 -0500, Randy Brukardt wrote:
>
>> > The only way to safely use a UTF-8 string is opaquely, which means you can
>> > store it whole, but any operation on it is performed after decoding it.
>> > That's of course the best argument for having it be a separate type.
> Yes. It is worth to remember that Ada once was considered a strongly typed
> language...
>
Different types represent things that are of different nature. It is not
obvious that a difference in /encoding/ is sufficient to say that two
things are of different nature.

Consider also the problem with files. Is a UTF-8 file a text file? Do
you want a UTF8_IO package? Normally, a UTF-8 file starts with a BOM in
the first line, telling that the whole file is UTF8. How would you read
that? Excerpt from AI137:
---
When reading a file, a BOM can be expected as starting the first line
of the file, but not subsequent lines. The proposed handling of BOM
assumes the following pattern:

1) Read the first line. Call function Encoding on that line with an
appropriate default to use if the line does not start with a
BOM. Initialize the encoding scheme to the value returned by the
function.

2) Decode all lines (including the first one) with the chosen encoding
scheme. Since the BOM is ignored by Decode functions, it is not
necessary to slice the first line specially.
---
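The two-step pattern in the excerpt can be sketched as follows — in Python, with `codecs.BOM_UTF8` and a default fallback standing in for AI-137's Encoding function and its default parameter:

```python
import codecs

def read_text(data, default="latin-1"):
    """Step 1: inspect the start of the input for a BOM and choose the
    encoding scheme (falling back to a default if no BOM is present).
    Step 2: decode everything with that scheme, skipping the BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(default)

with_bom = codecs.BOM_UTF8 + "\u00e9t\u00e9".encode("utf-8")
assert read_text(with_bom) == "\u00e9t\u00e9"   # BOM detected and skipped
assert read_text(b"abc") == "abc"               # no BOM: default applies
```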

A possible alternative solution could be to make UTF_8_String a type
derived from String (rather than a subtype). With conversions allowed,
you would not lose Text_IO. I don't know if we'll have time to discuss
this in Denver, but if you are serious about it, by all means get in
touch with your standardization body and let them make a comment. There
is no point in saying "that's how it should have been", and taking no
action to that effect.

Dmitry A. Kazakov

Oct 20, 2011, 2:43:07 PM
On Thu, 20 Oct 2011 19:40:40 +0200, J-P. Rosen wrote:

> Le 20/10/2011 09:37, Dmitry A. Kazakov a écrit :
>> On Wed, 19 Oct 2011 16:43:08 -0500, Randy Brukardt wrote:
>>
>>> > The only way to safely use a UTF-8 string is opaquely, which means you can
>>> > store it whole, but any operation on it is performed after decoding it.
>>> > That's of course the best argument for having it be a separate type.
>> Yes. It is worth to remember that Ada once was considered a strongly typed
>> language...
>>
> Different types represent things that are of different nature.

Depends on the meaning "different":

1. Differently implemented types representing same entities from the
problem domain;

2. Incompatible types representing semantically different entities.

> It is not
> obvious that a difference in /encoding/ is sufficient to say that two
> things are of different nature.

#1 if encoding is not the problem domain, but an implementation detail,
which should be the case for most application programming;

#2 otherwise, e.g. in systems programming.

> Consider also the problem with files. Is a UTF-8 file a text file? Do
> you want a UTF8_IO package?

Not likely. For text files I would prefer a single Text_IO package
consistently applying an appropriate recoding from the file encoding to the
representation of the string type used in the operation. Of course, targets
that do not support identification of the file encoding will use the Form
parameter to specify it explicitly.

> A possible alternative solution could be to make UTF_8_String a type
> derived from String (rather than a subtype). With conversions allowed,
> you would not lose Text_IO. I don't know if we'll have time to discuss
> this in Denver, but if you are serious about it, by all means get in
> touch with your standardization body and let them make a comment. There
> is no point in saying "that's how it should have been", and taking no
> action to that effect.

Yes, String types must be kept different in the sense #1 and same in the
sense #2. That means that the type system should support classes (e.g.
Wide_Wide_String'Class) comprising types of *different* implementation,
which don't inherit representations from each other. This is not an issue
of strings. It is a general problem, which must be approached generally. So
far Ada has classes of shared representations for which upcasting and
downcasting are view conversions. Classes of different representation
should have physical conversions for T<->T'Class, T->S etc, creating new
objects. Yes, it is inefficient, but when efficiency is an issue the
type-specific operations could always be overridden rather than inherited
through conversion.

Vadim Godunko

Oct 21, 2011, 6:07:04 AM
On Oct 20, 9:40 pm, "J-P. Rosen" <ro...@adalog.fr> wrote:
>
> A possible alternative solution could be to make UTF_8_String a type
> derived from String (rather than a subtype).

Why does everyone stick with a concrete representation of textual
information? Let's define text as a logical sequence of Unicode code
points, regardless of external representation (that is, encoding); let's
define a new kind of "string" as a private type, provide useful 'syntax
sugar' to use it in the 'usual' way, and let String/Wide_String/
Wide_Wide_String die. I believe the true Ada way is to separate the
high-level concept from the low-level representation.

J-P. Rosen

Oct 21, 2011, 7:25:39 AM
Le 21/10/2011 12:07, Vadim Godunko a écrit :

> Why does everyone stick with a concrete representation of textual
> information? Let's define text as a logical sequence of Unicode code
> points, regardless of external representation (that is, encoding); let's
> define a new kind of "string" as a private type, provide useful 'syntax
> sugar' to use it in the 'usual' way, and let String/Wide_String/
> Wide_Wide_String die. I believe the true Ada way is to separate the
> high-level concept from the low-level representation.
But that is exactly what Wide_Wide_String is!

So you are proposing to drop Wide_Wide_String on the ground that it is
visibly an array, and then provide a private type with a lot of (costly)
machinery to allow it to be manipulated just as if it were an array?

Come on! That's ultra-purism that brings zero improvement in practice.

Yannick Duchêne (Hibou57)

Oct 21, 2011, 8:25:07 AM
Le Fri, 21 Oct 2011 13:25:39 +0200, J-P. Rosen <ro...@adalog.fr> a écrit:

> Le 21/10/2011 12:07, Vadim Godunko a écrit :
>
>> Why does everyone stick with a concrete representation of textual
>> information? Let's define text as a logical sequence of Unicode code
>> points, regardless of external representation (that is, encoding); let's
>> define a new kind of "string" as a private type, provide useful 'syntax
>> sugar' to use it in the 'usual' way, and let String/Wide_String/
>> Wide_Wide_String die. I believe the true Ada way is to separate the
>> high-level concept from the low-level representation.
> But that is exactly what Wide_Wide_String is!
>
> So you are proposing to drop Wide_Wide_String on the ground that it is
> visibly an array, and then provide a private type with a lot of (costly)
> machinery to allow it to be manipulated just as if it were an array?
>
> Come on! That's ultra-purism that brings zero improvement in practice.

I have to agree with that pragmatic point of view. We should stick to it.

Come on, boys and girls: if something is not good for you, design your own
stuff. Wide_Wide_String holds the same status as Text_IO: it is not meant
to be universally suited to everything, but to be a basic implementation
sufficient to put an application together quickly (either for pedagogical
purposes or for quick prototype delivery). Specific needs require specific
designs, and it's up to you to perform the art ;) Ada will never provide
everything for every purpose.

P.S. I still feel there is a problem with file names, by the way. That
should still be fixed, because it does not even fulfill the basic
expectations above.

Yannick Duchêne (Hibou57)

Oct 21, 2011, 8:53:11 AM
Le Thu, 20 Oct 2011 19:35:21 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> No, FP is just too low level: procedural decomposition. Type systems
> correspond to the categories - a better and more capable mathematics =>
> safer design. Another fundamental problem of FP is a wrong premise about
> being stateless. Computing is solely about states. You run a program to
> have its side effects, there is no other reason for doing that.
You should write down everything you know (your thoughts about Ada, FP,
and so on). It would be useful to you and to others.

>> What's missing from Interface type introduced with Ada 2005 ?
>
> 1. Most Ada types do not have interfaces
Eiffel has this, and it is 1) not perfect (it may lead to performance
issues) and 2) rarely used in practice.

> 2. Ada interface cannot be inherited from a concrete type
You can have a concrete implementation; why is that not enough ?

> 3. Ada interface cannot have implementation
Derived types can. Why is it a problem if one inheritance level is
purely abstract ?

> 4. Ada interface does not support ad-hoc supertypes
Can you say more, with an example ? (I don't know what ad-hoc supertypes are.)

It feels like you need an even higher-level language than Ada. There are
some, but most are interpreted languages and do not target safety (in the
broad sense) as much as Ada does.

> Not an issue. Scalar types may have interfaces at zero time/space cost.
> You
> don't need to embed a tag into by-value types.
This is indeed possible, but at the cost of separate compilation.
SmallEiffel did this, but it relied on whole-program analysis: programs
were compiled as a whole. Other Eiffel implementations, which use separate
compilation, could not apply that optimization. If you make this part of
the language standard, you impose implementation requirements beyond the
reasonable. Very big applications need separate compilation. Although
attempted and advocated by Bertrand Meyer, Eiffel applications never
scaled up well (except with global analysis; but re-compiling a whole
application whenever something changes, even if there may be tricks to
avoid truly recompiling everything, is not an acceptable option for Ada's
niches).

>>> Any syntax sugar (prefix notation, infix operations,
>>> assignments, indexing, member extraction, aggregates, entries,
>>> attributes) shall be operations.
>> Are you sure you are not confused between concrete syntax and abstract
>> syntax ?
>
> I don't understand this. The problem is that, for example, for the record
> type T and its member A, the ".A" is not the operation of T, because
> record
> is not an interface. A'First is not an operation of array. ":=" is not an
> operation (doubly dispatching) of its left and right sides etc.

Same feeling as above. It seems you are looking for something at a higher
level than Ada. There are some pleasant languages in this area, but they
just end up being cool toys (although still cool to play with ;) ). It may
be worth recalling that Ada is not a modeling language, but an
implementation language with features to enforce safety as much as
possible.

>> This is not a language topic, instead, a technology level topic. I feel
>> runtime check is a reasonable fall-back for what cannot be statically
>> checked in the actual state of the technology.
>
> No, it is inconsistent and unreasonable. Static checks are meant to
> detect bugs. Bug is either there or not, independently on whether the
> program is
> running, not running, will ever run.
Easy to say, harder to do. You did not demonstrate that this is unrelated
to the current state of the technology; you just complained that it is not
as you wish.


Sorry if I've not replied to each point, to keep it short. I often agree
with many of the points you have raised about Ada. Here, I feel you are
going too far beyond what Ada is intended for. You are not pointing out
inconsistencies in existing features; you are requiring new features.

Dmitry A. Kazakov

unread,
Oct 21, 2011, 9:13:54 AM10/21/11
to
On Fri, 21 Oct 2011 13:25:39 +0200, J-P. Rosen wrote:

> But that is exactly what Wide_Wide_String is!

Not really. Wide_Wide_String is one possible implementation of a logical
Unicode string. There can be other implementations, e.g. String,
Wide_String, UTF8_String, UTF16_String, EBCDIC_String, ASCII_String... All
these implementations must be interchangeable and implement the same
logical string interface. The same applies to unbounded and fixed-length
strings.

> Come on! That's ultra-purism that brings zero improvement in practice.

On the contrary:

1. It would reduce the number of packages by a factor of 10;

2. It would statically ensure that the encoding is handled correctly. (I
would bet that almost every Ada program is broken in that regard);

3. It would free the programmer from the burden of premature optimization;

4. It would make design of Ada bindings much simpler and safer. E.g.
C_String could be an implementation of logical Unicode string compatible
with null-terminated C strings.
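
For illustration, a minimal sketch of what such a logical string interface
could look like using Ada 2005 interfaces (all names here are hypothetical,
not part of any standard or concrete proposal):

```ada
--  Hypothetical sketch only: a logical Unicode string interface that
--  String, UTF8_String, UTF16_String, etc. could all implement.
package Logical_Strings is

   type Root_String is interface;

   function Length (S : Root_String) return Natural is abstract;

   function Element
     (S     : Root_String;
      Index : Positive) return Wide_Wide_Character is abstract;

   procedure Append
     (S    : in out Root_String;
      Item : Wide_Wide_Character) is abstract;

end Logical_Strings;
```

Each concrete encoding would then be one implementation of Root_String, and
code written against the interface would not care which one it gets.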

Dmitry A. Kazakov

unread,
Oct 21, 2011, 9:41:59 AM10/21/11
to
On Fri, 21 Oct 2011 14:53:11 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Thu, 20 Oct 2011 19:35:21 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:

>>> What's missing from Interface type introduced with Ada 2005 ?
>>
>> 1. Most Ada types do not have interfaces
> Eiffel has this, and this is 1) not perfect (may lead to performance
> issue) 2) rarely used in practice

There is no performance loss.

>> 2. Ada interface cannot be inherited from a concrete type
> You can have a concrete implementation, why is that not enough ?

Because it is not what is required: you have a concrete type and want to
name its interface in order to inherit from it, only the interface or a
part of it.

>> 3. Ada interface cannot have implementation
> Derived types can. Why is that a trouble is one inheritance level is
> purely abstract ?

Why am I forced to have it? If you have a reason, the implication is that
*each* type must have two declarations: the interface and the type itself.
Note that this does not solve the problem, because it would not give
partial interfaces.

The problem is fragile design: you don't know in advance all the interfaces
the users of the package might need later on. That is very bad for
large-system design.

>> 4. Ada interface does not support ad-hoc supertypes
> Can you tell more with an example ? (I don't know what supertypes are)

If A is a subtype of B, then B is a supertype of A. Subtype imports
operations, supertype exports them.

Ad-hoc means that you can hang supertypes on existing types, e.g. ones
coming from a library, which cannot be changed. Doing so, you could bring
otherwise unrelated types under one roof, e.g. to be able to put them into
a container, etc.
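
For comparison, the closest one can get in standard Ada 2005 is a
hand-written wrapper; a sketch with made-up names (Printable, Legacy,
Wrapped_Legacy are all hypothetical):

```ada
--  Sketch of the Ada 2005 workaround for a missing ad-hoc supertype:
--  wrap the existing library type in a new tagged type that implements
--  the desired interface.
with Ada.Text_IO;
package Wrappers is
   type Printable is interface;
   procedure Print (X : Printable) is abstract;

   type Legacy is range 0 .. 100;  --  stands for an unchangeable library type

   type Wrapped_Legacy is new Printable with record
      Value : Legacy;
   end record;
   overriding procedure Print (X : Wrapped_Legacy);
end Wrappers;

package body Wrappers is
   overriding procedure Print (X : Wrapped_Legacy) is
   begin
      Ada.Text_IO.Put_Line (Legacy'Image (X.Value));
   end Print;
end Wrappers;
```

The wrapper costs a conversion at every use site, which is exactly what an
ad-hoc supertype would avoid.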

>> Not an issue. Scalar types may have interfaces at zero time/space cost. You
>> don't need to embed tag into by-value types.
> This is possible indeed, but at the cost of separate compilation.

It is possible without that.

>>>> Any syntax sugar (prefix notation, infix operations,
>>>> assignments, indexing, member extraction, aggregates, entries,
>>>> attributes) shall be operations.
>>> Are you sure you are not confused between concrete syntax and abstract
>>> syntax ?
>>
>> I don't understand this. The problem is that, for example, for the record
>> type T and its member A, the ".A" is not the operation of T, because record
>> is not an interface. A'First is not an operation of array. ":=" is not an
>> operation (doubly dispatching) of its left and right sides etc.
>
> Same feeling as above. Seems you are looking for something which is higher
> level than Ada is.

It is not higher level; it is just a regular language. Ada 83 was designed
at a time when type systems were pretty fresh stuff. It bears the marks of
older languages, which had only built-in types.

> May
> be worth to recall Ada is not a modeling language, but an implementation
> language with features to enforce safety as much as possible.

You mean that lacking constructors, user-defined assignment and safe
finalization adds something to safety? That must be a very strange kind of
safety then...

>>> This is not a language topic, instead, a technology level topic. I feel
>>> runtime check is a reasonable fall-back for what cannot be statically
>>> checked in the actual state of the technology.
>>
>> No, it is inconsistent and unreasonable. Static checks are meant to
>> detect bugs. Bug is either there or not, independently on whether the
>> program is running, not running, will ever run.
> Easy to say, less to do. You did not demonstrate this is not related to
> actual technology, you just complained it is not as you wish.

No, I complained that self correctness check is inconsistent.

As for raising exceptions from run-time checks, that plague is well known
to anybody who ever used access types. ARG keeps on struggling to repair
the damage made in Ada 95, while breaching another, bigger hole in the
language...

Yannick Duchêne (Hibou57)

unread,
Oct 21, 2011, 12:03:03 PM10/21/11
to
Le Fri, 21 Oct 2011 15:13:54 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> Not really. Wide_Wide_String is one possible implementation of logical
> Unicode string.
And precisely, that implementation is sufficient (*). You can't expect
Ada to provide an implementation so abstract that it covers all possible
implementations.

By the way, nothing prevents a compiler from implementing an array of
Wide_Wide_Character with something other than an array of 32-bit items. As
long as the interface is preserved, it would be legal for a compiler to
use any implementation it could to provide a Wide_Wide_String.

As the purpose of Ada is to be a programming language, it would be more
relevant to focus on whether or not it is possible, in Ada, to design an
implementation, rather than on whether or not it provides a given
implementation embedded in the language. It's not a set of libraries, it's
a programming language (it's a common pitfall, I feel, when people start
confusing the libraries provided with languages and the languages on
their own).

(*) And that implementation is a clean view, unlike the one of String
holding UTF-8 data.

Dmitry A. Kazakov

unread,
Oct 21, 2011, 2:34:55 PM10/21/11
to
On Fri, 21 Oct 2011 18:03:03 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Fri, 21 Oct 2011 15:13:54 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:

>> Not really. Wide_Wide_String is one possible implementation of logical
>> Unicode string.
> And precisely, that implementation is sufficient (*).

Nope. Under Windows I rather need UTF-16 and ASCII. Under Linux it would be
UTF-8 and RADIX-50 for RSX-11.

> You can't expect Ada
> will provide a so much abstract implementation that it will cover all
> possible implementations.

Why not? Why should not a language provide abstractions for character
encoding?

> (*) And that implementation is a clean view, unlike the one of String
> holding UTF-8 data.

You are confusing interface and implementation. This is one of Ada's
problems: they are not clearly separated. Ada 83 pioneered the idea of
such separation for user-defined private types, but was not consistent in
supporting it for other types, especially arrays and records.

Vadim Godunko

unread,
Oct 21, 2011, 2:55:41 PM10/21/11
to
On Oct 21, 3:25 pm, "J-P. Rosen" <ro...@adalog.fr> wrote:
>
> But that is exactly what Wide_Wide_String is!
>
Wide_Wide_String is just another kind of representation: UCS-4/UTF-32.

> So you are proposing to drop Wide_Wide_String on the ground that it is
> visibly an array, and then provide a private type with a lot of (costly)
> machinery to allow it to be manipulated just as if it were an array?
>
All kinds of strings are still useful in my model (String for
ISO-8859-1, Wide_String for UCS-2 and Wide_Wide_String for UCS-4), and
they are required to represent string literals.

The internal representation of data in such a private type can be
optimized for use in a concrete domain, but source code which uses it
would still be portable.

Actually, next to nobody uses Wide_Wide_String in real applications.
Why?

> Come on! That's ultra-purism that brings zero improvement in practice.
>
Its done already. ;-)

J-P. Rosen

unread,
Oct 21, 2011, 3:18:49 PM10/21/11
to
Le 21/10/2011 20:55, Vadim Godunko a écrit :
> Actually, near to nobody use Wide_Wide_String in real applications.
> Why?
>
Because there is close to zero need, especially considering the kind of
domains where Ada is used. Wide_Wide_String was added only because it
was a requirement from JTC1.

And frankly, I prefer that implementers spend their precious time
improving the parts of the compiler that most users need, rather than
satisfying aesthetic views of abstract strings.

Yannick Duchêne (Hibou57)

unread,
Oct 21, 2011, 3:30:45 PM10/21/11
to
Le Fri, 21 Oct 2011 20:34:55 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
>>> Not really. Wide_Wide_String is one possible implementation of logical
>>> Unicode string.
>> And precisely, that implementation is sufficient (*).
>
> Nope. Under Windows I rather need UTF-16 and ASCII. Under Linux it would
> be
> UTF-8 and RADIX-50 for RSX-11.
This is implementation. The model, on either a UTF-8 or a UTF-16 system,
would still be the one of Wide_Wide_Character. Linux may be UTF-8
internally; I use Unicode on Linux, not UTF-8. Windows may be UTF-16
internally; I use Unicode on Windows, not UTF-16. On neither will one
access UTF-8 or UTF-16 low-level storage units; instead one will access
Unicode characters at a high level. Whether a given compiler implements
Wide_Wide_Character using one encoding or another is another story.

The error of using String in some areas of the standard packages does not
invalidate Wide_Wide_String.

>> You can't expect Ada
>> will provide a so much abstract implementation that it will cover all
>> possible implementations.
>
> Why not? Why should not a language provide abstractions for character
> encoding?
A language is not a library. It provides, most importantly, elementary
semantics with which you design more complex things, plus more or less
optional built-in models (which you can drop if you wish) for the most
important things, or things identified as such (which is a subjective
topic; you can only expect an average opinion), not for everything in the
world. Providing a model for Unicode is reasonable enough.

>> (*) And that implementation is a clean view, unlike the one of String
>> holding UTF-8 data.
>
> You are confusing interface and implementation. This is one of Ada's
> problems that they are not clearly separated. Ada 83 pioneered the idea
> of
> such separation for user-defined private types, but was not consequent to
> support it for other types, especially, for arrays and records.
Arrays and records are typically not to be publicly exposed. Most of the
time, when you define a record type, the record view appears in the
package's private part only; the same with arrays. The Ada standard
library doesn't expose records (or else I can't recall one), but it does
expose some arrays, which should be hidden in a clean design. However,
this may be justified as a naive but still valid implementation, as much
as a simple and efficient enough one. Arrays have an interface, even if it
cannot be tweaked from the programmer's point of view. Arrays and records
are basic bricks with which to implement types, not the core of the type
model. This does not disallow pure abstract data types.


There are non-perfect things in the library, but as long as Ada as a
language allows you to define what you need, opinions should be measured.

Yannick Duchêne (Hibou57)

unread,
Oct 21, 2011, 3:41:28 PM10/21/11
to
Le Fri, 21 Oct 2011 20:55:41 +0200, Vadim Godunko <vgod...@gmail.com> a
écrit:
> Actually, near to nobody use Wide_Wide_String in real applications.
> Why?
Lack of habit (*): too many people are used to US-ASCII, or Latin-1 at
best, depending on the application area. Also, as Jean-Pierre said, most
Ada niches don't have to deal with it. Some other areas do have to bother,
like UIs, web applications, authoring applications, ….


(*) That's not Ada specific; it is the same with C/C++ and some other
common languages, even including Python. Most application designers only
care about their own native language and don't bother about foreign
languages… troubles then come later ;)

Dmitry A. Kazakov

unread,
Oct 21, 2011, 4:02:55 PM10/21/11
to
On Fri, 21 Oct 2011 21:30:45 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Fri, 21 Oct 2011 20:34:55 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>>>> Not really. Wide_Wide_String is one possible implementation of logical
>>>> Unicode string.
>>> And precisely, that implementation is sufficient (*).
>>
>> Nope. Under Windows I rather need UTF-16 and ASCII. Under Linux it would
>> be UTF-8 and RADIX-50 for RSX-11.
> This is implementation.

You wrote about the implementation being sufficient, which is evidently
wrong.

The interface = an array of code points indexed by some cardinal number is
sufficient. The implementation = Wide_Wide_String is not.

Ada does not allow you multiple implementations for this interface forming
one class of types. Ada does not allow you constrained subtypes of the
interface, e.g. narrower sets of code points (String), narrower ranges of
the index (small embedded targets). Ada does not allow you alternative
implementations like unbounded strings in the same class.

This problem is a *fundamental* problem of the Ada type system. It must be
addressed if Ada wishes to stay a strongly typed language.

>>> You can't expect Ada
>>> will provide a so much abstract implementation that it will cover all
>>> possible implementations.
>>
>> Why not? Why should not a language provide abstractions for character
>> encoding?
> A language is not a library, it provides most importantly, elementary
> semantic with which you design more complex things, more or less
> optionally built-ins models (which you can drop if you wish) for most
> important things or things identified as such (which is a subjective
> topic, you can just expect an average opinion), not for everything in the
> world.

Exactly this is what I want from Ada.

>>> (*) And that implementation is a clean view, unlike the one of String
>>> holding UTF-8 data.
>>
>> You are confusing interface and implementation. This is one of Ada's
>> problems that they are not clearly separated. Ada 83 pioneered the idea of
>> such separation for user-defined private types, but was not consequent to
>> support it for other types, especially, for arrays and records.
> Array and records are typically not to be publicly exposed.

How so? They are the two most used public interfaces of composite types in Ada!

BTW the same applies to the numeric types, would you claim them used only
privately too?

Yannick Duchêne (Hibou57)

unread,
Oct 21, 2011, 4:36:03 PM10/21/11
to
Le Fri, 21 Oct 2011 22:02:55 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> You wrote about the implementation being sufficient, which is evidently
> wrong.
>
> The interface = an array of code points indexed by some cardinal number
> is
> sufficient. The implementation = Wide_Wide_String is not.
That's what I meant: the array interface, implemented via its most naive
implementation. The interface is good and is on Ada's side; the
implementation may vary and is on the compiler's side.

> Ada does not allow you multiple implementations for this interface
> forming
> one class of types. Ada does not allow you constrained subtypes of the
> interface, e.g. narrower sets of code points (String), narrower ranges of
> the index (small embedded targets). Ada does not allow you alternative
> implementations like unbounded strings in the same class.

That's not that easy. If you want to restrict the set of code points
allowed in a container, you must take care to preserve class properties. A
subtype T1 of a type T0 is supposed to be a valid element wherever a type
T0 is expected. However, if the actual is T1, the expected type is T0, and
the object is the target of some operation, then, as an example, appending
a code point outside of the restricted range, while valid with an actual
of type T0, would be illegal with an actual of type T1. On the contrary,
as a source, a T1 will always be valid where a T0 was expected.

Conclusion: the interface could not remain the same, and a different
interface means a different type. Unsolvable (a common pitfall known to
old-time Eiffel users). That's, by the way, one of the reasons why the
assertions introduced with Ada 2012 are expected to be checked at runtime:
to warrant them statically valid would lead to a real nightmare for the
language maintainers (I keep in mind you don't enjoy runtime checks, which
is OK, if you assume all of the consequences).

> This problem is a *fundamental* problem of the Ada type system. It must
> be
> addressed if the Ada wishes to stay a strongly typed language.
It addresses nicely enough what it provides (except in some areas like
String and some parts of access types). Consistency over a narrow range is
better than a wide range with inconsistencies, and to most people, a
narrow range which can be reasonably implemented is better than a perfect
thing which cannot be implemented. That was one of the errors Bertrand
Meyer made when he asserted that language designers should not bother
about whether or not a given language property is certain to be
implementable. In real life, language designers have to care that it is,
and that it is reasonably so.

As said in a prior message, such languages already exist, but as far as I
know, all the ones I played with were either interpreted or inefficient
(and all had names I cannot remember, sorry), which is not OK for Ada (for
me it's OK if it lacks some purity, as long as it is safe and efficient
enough).

After all, maybe what you need is not Ada! (That would not be a shame.)

> How so? They are two most used public interfaces of composite types in
> Ada!
The language does not enforce it; this only occurs in the standard
library, and you remain free to not follow this design and choose your own
if you wish. Just like the naming convention: I don't enjoy the one of the
standard packages, but that does not prevent me from using my own; the
language does not enforce anything there.

> BTW the same applies to the numeric types, would you claim them used only
> privately too?
Arguable in theory, not in practice. If it ever is, just use a language
better suited for your very specific area.

Michael Rohan

unread,
Oct 22, 2011, 2:32:44 AM10/22/11
to
Hi,

There seem to be two major issues being considered here:

* The handling of "string" data with within Ada applications, i.e., should String be opaque with, perhaps, class type interfaces giving views into this data as Latin-1, UTF8, UCS-2, etc.

* The more immediate issue I raised initially: what to do when you have a Wide_String and want to use it as a file name. I'm currently just converting such names to UTF8, which works well on Linux but would probably have issues on Windows if I were to use non-Latin-1 type strings.

While the first issue is relatively involved, the second issue could be handled by the run-time (with the possibility of exceptions if the name could not be mapped, but that would be up to the application to handle).
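
For what it's worth, such a thin wrapper can be written portably with Ada 2012's Ada.Strings.UTF_Encoding; a minimal sketch, where Wide_Directories is a hypothetical package name and Create_Directory stands in for the other operations, and which assumes the underlying Ada.Directories accepts UTF-8 encoded String values (true on a typical Linux system, but not guaranteed by the standard):

```ada
--  Sketch of a thin Wide_String wrapper over Ada.Directories using
--  Ada 2012's Ada.Strings.UTF_Encoding.Wide_Strings.Encode, which
--  converts a Wide_String to a UTF-8 encoded String.
with Ada.Directories;
with Ada.Strings.UTF_Encoding.Wide_Strings;

package Wide_Directories is
   procedure Create_Directory (New_Directory : Wide_String);
end Wide_Directories;

package body Wide_Directories is
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Strings;

   procedure Create_Directory (New_Directory : Wide_String) is
   begin
      --  Encode to UTF-8 and delegate to the standard package.
      Ada.Directories.Create_Directory (UTF.Encode (New_Directory));
   end Create_Directory;
end Wide_Directories;
```

On Windows the same interface could keep its body but encode differently, which is the point of hiding the conversion in a wrapper.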

My initial question suggested there should be Wide_* versions of the packages that interface with the OS (Directories, Command_Line, Environment_Variables, etc). Having implemented wrappers for these, it seems to me that extending the existing packages with additional routines for Wide_String/Wide_Wide_String would be cleaner.

This extension of the existing packages would be something that might be possible to consider for the next revision (but maybe too late?).

Take care,
Michael.

Yannick Duchêne (Hibou57)

unread,
Oct 22, 2011, 3:25:47 AM10/22/11
to
Le Sat, 22 Oct 2011 08:32:44 +0200, Michael Rohan <mic...@zanyblue.com> a
écrit:
> My initial question suggested there should be Wide_* versions of the
> packages that interface with the OS (Directories, Command_Line,
> Environment_Variables, etc). Having implemented wrappers for these it
> seems to me extending the existing packages to have additional routines
> for Wide_String/Wide_Wide_String would be cleaner.
I vote for a single additional Wide_Wide_String version (the String
version is required for compatibility, and a Wide_String version would be
useless, as a Wide_Wide_String version could do all it can and more).

Dmitry A. Kazakov

unread,
Oct 22, 2011, 3:54:07 AM10/22/11
to
On Fri, 21 Oct 2011 22:36:03 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Fri, 21 Oct 2011 22:02:55 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:

> A
> subtype T1 of a type T0, is supposed to be a valid element where a type T0
> is expected.

This is handled in Ada by contracting Constraint_Error in the interfaces.

> Conclusion: the interface could not remain the same, and different
> interface, means different type.

Here you confirm that LSP does work. But Ada does not base its type system
on LSP. There cannot be any usable LSP-conform type system.

> Unsolvable (a common pitfall known of ancient Eiffel users).

They should better understand LSP and its implications.

> That's by the way one of the reason why assertions
> introduced with Ada 2012,

Something non-substitutable remains non-substitutable independently of
any assertions. The solution is trivial and was known already in Ada 83:
add exception propagation *to* the postcondition.

>> This problem is a *fundamental* problem of the Ada type system. It must be
>> addressed if the Ada wishes to stay a strongly typed language.
> If address nicely enough (except with some area like String and some part
> of access types) what it provide. Consistency in a narrow range is better
> than a wide range with inconsistencies, and to most people, a narrow range
> which can be reasonably implemented, is better than a perfect thing which
> cannot be implemented. That was one of the error Bertrand Meyer did, when
> he asserted language designers should not bother about whether of not a
> given language property is certain to be implementable. In real life,
> language designers have to care it is, and have to care it is reasonably.

Sorry, I don't understand the above; it reads like C advocacy, but I am
not sure. What is your point? That strong typing is not necessary because
it is inefficient?

>> How so? They are two most used public interfaces of composite types in
>> Ada!
> The language does not enforce it, this only occur in the standard library,
> and you remain free to not follow this design and choose your own if you
> wish. Just like the naming convention,

I am not forced to use strings either. After all, there exist successful
languages without strings and arrays, e.g. C...

>> BTW the same applies to the numeric types, would you claim them used only
>> privately too?
> Arguable in theory, not in practice. If it ever is, just use a language
> better suited for your very specific area.

To summarize your point: for practical reasons, Ada better become C.

Yannick Duchêne (Hibou57)

unread,
Oct 22, 2011, 4:28:34 PM10/22/11
to
Le Sat, 22 Oct 2011 09:54:07 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> Sorry, I don't understand the above, it reads like C advocacy, but I am
> not
> sure. What is your point? Strong typing is not necessary because
> inefficient?
No. Not efficiency via weakness (which works against efficiency anyway, as
Python, JavaScript and others show well), but efficiency and safety via a
“world” narrowed to what we are able to handle automatically, as SPARK
does.

Yannick Duchêne (Hibou57)

unread,
Oct 22, 2011, 6:23:14 PM10/22/11
to
Le Sat, 22 Oct 2011 09:54:07 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
>> Conclusion: the interface could not remain the same, and different
>> interface, means different type.
>
> Here you confirm that LSP does work. But Ada does not base its type
> system
> on LSP. There cannot be any usable LSP-conform type system.
It actually does. But it never will with a broken design, obviously. With
the above example of subtyping a container for a subtype of the element
type, the trouble does not come from Ada, but from the merging of two
interfaces: the input interface and the output interface. It's a rule of
thumb for me to separate the two, because after some experience I've
learned that sooner or later you face trouble if you do not distinguish
them (*). If you have two interfaces, one for input and one for output,
there is no more trouble: you can subtype the input interface to follow an
element subtype. If you don't want to separate the two, then you just
can't subtype this way. You have to make a choice, and that's not Ada's
fault, that's the domain's “fault”.

Although Jean-Pierre underlined that with Ada some matters are, in
practice, above some others (which is true and OK to notice), nothing in
Ada prevents you from using the good design, even if that design does not
match Ada's niches and typical use cases (you may just have to not tell
anyone ;) ).

(*) I sometimes like to design with a read-only abstract T1 and a
concrete derived read/write T2.

If you really believe Ada subtypes, *as Ada allows them to be used*, do
not conform to the substitution principle, can you provide an example?
Personally, I don't see a problem if the language does not allow you to do
something it would not be able to handle. You know… better to run nothing
at all than to run an erroneous thing.

Please, Dmitry, could you write down, once and as a whole, all your
comments about Ada? Even do it with a funny title, something like “Ada
criticisms (and _proposals_)” if you feel like it. At least this would
help to follow the story, because it's easy to forget what was already
said and what was not, along with rationales and examples… I have a
strange feeling of repeating myself, sometimes, when I reply to you when
you complain. Also, this would be an opportunity for a better
formalization of your comments. I don't enjoy talks which seem too much to
be about taste when the subject is a rather formal thing (the language)
and also something into which many people have invested a lot. Your paper
could also be open for comments (just as Ada does with its definition). An
opportunity for better, longer and clearer clarifications in a single
reference place would be nice.

Dmitry A. Kazakov

unread,
Oct 23, 2011, 3:53:47 AM10/23/11
to
On Sun, 23 Oct 2011 00:23:14 +0200, Yannick Duchêne (Hibou57) wrote:

> If you really believe Ada subtypes, *as Ada allows to use it*, does not
> conform to the substitution principle, can you provide an example ?

Specialization (an Ada subtype is a specialization) breaks LSP in
out-operations (operations with out parameters and/or a result of the
subtype).

Generalization breaks in-operations.
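
A small illustration of the first point (a sketch; Get_Value is a made-up
name): an operation delivering a result of the base type is not
substitutable for the subtype, and Ada reports the failure only at run
time, as a Constraint_Error:

```ada
--  Sketch: an out-operation of Integer is not a valid out-operation of
--  its specialization Positive; the mismatch surfaces only at run time.
with Ada.Text_IO;

procedure Subtype_Demo is
   function Get_Value return Integer is
   begin
      return 0;  --  a perfectly fine Integer, but not a Positive
   end Get_Value;

   P : Positive;
begin
   P := Get_Value;  --  compiles, raises Constraint_Error at run time
   Ada.Text_IO.Put_Line (Integer'Image (P));
exception
   when Constraint_Error =>
      Ada.Text_IO.Put_Line ("substitution failed: Constraint_Error");
end Subtype_Demo;
```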

This does not mean that there is something wrong with specialization or
generalization, only that subtyping cannot be based on LSP. Which is the
reason why programming languages use so-called "subclassing" instead, read:
non-LSP subtyping. Ada 83 missed Newspeak and called subtyping "subtyping".

> Please Dmitry, could write down once a whole, all you comments about Ada.

Not necessary, you can skip my moans and get right to the response:

"The change .......(fill as appropriate)........ could break existing Ada
programs, which is unacceptable, unless the cases when it would make Ada
look more like Java, LISP, Perl, ....(put a disgusting language here).....,
but it does not."

(:-))

Randy Brukardt

unread,
Oct 25, 2011, 3:16:14 PM10/25/11
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:1l7zxjcrre04c.1...@40tude.net...
...
> Not necessary, you can skip my moans and get right to the response:
>
> "The change .......(fill as appropriate)........ could break existing Ada
> programs, which is unacceptable, unless the cases when it would make Ada
> look more like Java, LISP, Perl, ....(put a disgusting language
> here).....,
> but it does not."
>
> (:-))

This is correct :-), with the exception of the "unless". All of the changes
that make Ada look more like some "disgusting language" don't break any
existing programs. We wouldn't have made the change otherwise.

The few changes that could break existing programs are all about doing what
we believe Ada was meant to do (such as properly composing "="); none of
them have anything to do with looking like some other language. (I'm
presuming that you are talking about things like prefix calls and
conditional expressions here.)

In addition, the new "indexing" sugar is intended to get us closer to your
ideal of a fully abstract interface for arrays. It should make it possible
to define a strongly typed Unicode_String that could have alternate
implementations for different representations. (We don't yet have a good way
to get literals for private types, a problem that we've never been able to
solve although we haven't tried as hard as we should have.)
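
The sugar in question is Ada 2012's Constant_Indexing/Variable_Indexing aspects; a sketch of how a private string type could expose array-like indexing (Unicode_Strings, Unicode_String and Element are hypothetical names, and the representation behind the private part is deliberately naive):

```ada
--  Sketch of Ada 2012 user-defined indexing for a private string type.
--  A real implementation could hide UTF-8, UTF-16, etc. behind the
--  same indexing interface.
package Unicode_Strings is
   type Unicode_String is tagged private
      with Constant_Indexing => Element;

   function Element (S : Unicode_String; Index : Positive)
      return Wide_Wide_Character;
private
   type Unicode_String is tagged record
      Data : access Wide_Wide_String;  --  naive representation for the sketch
   end record;
end Unicode_Strings;

package body Unicode_Strings is
   function Element (S : Unicode_String; Index : Positive)
      return Wide_Wide_Character is
   begin
      return S.Data (Index);
   end Element;
end Unicode_Strings;
```

With this in place, client code can write `S (1)` as if Unicode_String were an array, whatever the hidden representation.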

Randy.


Randy Brukardt

unread,
Oct 25, 2011, 3:22:27 PM10/25/11
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:5279agttaub8.1pl7pt496l1am$.dlg@40tude.net...
> On Fri, 21 Oct 2011 14:53:11 +0200, Yannick Duchêne (Hibou57) wrote:
>
>> Le Thu, 20 Oct 2011 19:35:21 +0200, Dmitry A. Kazakov
>> <mai...@dmitry-kazakov.de> a écrit:
>
>>>> What's missing from Interface type introduced with Ada 2005 ?
>>>
>>> 1. Most Ada types do not have interfaces
>> Eiffel has this, and this is 1) not perfect (may lead to performance
>> issue) 2) rarely used in practice
>
> There is no performance loss.

Anytime you have a construct that allows multiple inheritance, there is a
large performance loss (whether or not you use the multiple inheritance).
You can move the performance loss from one construct to another (i.e.
dispatching calls, access types, etc.) but you can't get rid of it. Keep in
mind that "performance loss" means not just run-time but also space
efficiency (which is important in a language used mainly in embedded
systems).

Randy.




Randy Brukardt

unread,
Oct 25, 2011, 3:26:23 PM10/25/11
to
"Michael Rohan" <mic...@zanyblue.com> wrote in message
news:20586225.484.1319265164765.JavaMail.geo-discussion-forums@prgt10...
...
>This extension of the existing packages would be something that might be
>possible to
>consider for the next revision (but maybe too late?).

Ada 2012 is essentially finished (it will be really finished as soon as I
finish fixing the latest batch of editorial comments). So it is way too late
for anything but the most trivial changes.

At this point, all suggestions are going into Ada 2020 (the provisional name
for the following revision - where we'll have perfect vision of what Ada
should be ;-).

Randy.


Dmitry A. Kazakov

unread,
Oct 25, 2011, 3:35:42 PM10/25/11
to
On Tue, 25 Oct 2011 14:22:27 -0500, Randy Brukardt wrote:

> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
> news:5279agttaub8.1pl7pt496l1am$.dlg@40tude.net...
>> On Fri, 21 Oct 2011 14:53:11 +0200, Yannick Duchêne (Hibou57) wrote:
>>
>>> Le Thu, 20 Oct 2011 19:35:21 +0200, Dmitry A. Kazakov
>>> <mai...@dmitry-kazakov.de> a écrit:
>>
>>>>> What's missing from Interface type introduced with Ada 2005 ?
>>>>
>>>> 1. Most Ada types do not have interfaces
>>> Eiffel has this, and this is 1) not perfect (may lead to performance
>>> issues) 2) rarely used in practice
>>
>> There is no performance loss.
>
> Anytime you have a construct that allows multiple inheritance, there is a
> large performance loss (whether or not you use the multiple inheritance).
> You can move the performance loss from one construct to another (i.e.
> dispatching calls, access types, etc.) but you can't get rid of it.

There is no time/memory loss at all. For the types in question, any legal
Ada 2005 program would generate exactly the same code after the change.

The performance argument is bogus because it considers programs that are
presently impossible to write.

Randy Brukardt

unread,
Oct 26, 2011, 6:41:30 PM10/26/11
to
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
news:ci96gr5yzmpp$.1mwky141c6e78$.dlg@40tude.net...
> On Tue, 25 Oct 2011 14:22:27 -0500, Randy Brukardt wrote:
>
>> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> wrote in message
>> news:5279agttaub8.1pl7pt496l1am$.dlg@40tude.net...
>>> On Fri, 21 Oct 2011 14:53:11 +0200, Yannick Duchêne (Hibou57) wrote:
>>>
>>>> Le Thu, 20 Oct 2011 19:35:21 +0200, Dmitry A. Kazakov
>>>> <mai...@dmitry-kazakov.de> a écrit:
>>>
>>>>>> What's missing from Interface type introduced with Ada 2005 ?
>>>>>
>>>>> 1. Most Ada types do not have interfaces
>>>> Eiffel has this, and this is 1) not perfect (may lead to performance
>>>> issues) 2) rarely used in practice
>>>
>>> There is no performance loss.
>>
>> Anytime you have a construct that allows multiple inheritance, there is a
>> large performance loss (whether or not you use the multiple inheritance).
>> You can move the performance loss from one construct to another (i.e.
>> dispatching calls, access types, etc.) but you can't get rid of it.
>
> There is no time/memory loss at all. For the types in question, any legal
> Ada 2005 program would generate exactly the same code after the change.

First of all, I was including Ada 2005 interfaces in this complaint -- so
"Ada 2005" is irrelevant (you've already gone over the edge at that point).
You *might* be right about Ada 95 programs, but it would require a
substantial increase in compiler complexity in order to support that. But
Ada compilers are already very complex - fairly close to the point where the
complexity would overwhelm the ability to get them correct. It's much more
likely that a much simpler design would be used for a pervasively multiple
inheriting language where everything is much more expensive.

You might think that such a compiler's output could be optimized to a more
efficient version. Indeed, that was the original premise behind Janus/Ada
(optimization could eliminate the cost of generic sharing, pervasive heap
allocation of objects, etc.). But it didn't work: the optimizations were too
complex to be practical other than in the simplest of circumstances.
Ultimately, we bit the bullet and supported multiple representations for
arrays, records, and the like, because that got rid of a lot of the expense
at the source. But it also added a whole lot of complexity to the compiler.

It's possible that a from-scratch compiler design could do better, but I
doubt it. And it seems unlikely that anyone will be doing one of those for
Ada anytime soon.

Randy.


Dmitry A. Kazakov

unread,
Oct 27, 2011, 3:43:22 AM10/27/11
to
Because the language is in a mess. That surely makes compilers complex.
Without an overhaul it will collapse in the not-so-distant future anyway
under the weight of arbitrary language patches. You wanted it complex; here
you are!

> It's much more
> likely that a much simpler design would be used for a pervasively multiple
> inheriting language where everything is much more expensive.

Note that it was not about multiple inheritance. Yannick suggested that
making types like Boolean, String, and Integer have classes and
primitive operations would mean a performance loss. That is wrong.
Introducing classes and primitive operations would cost strictly zero in
*all* use cases that are legal now. Other use cases (e.g. using
class-wide objects and dispatching) are presently illegal, so the whole
argument is bogus.

As for MI, I doubt very much that MI for *tagged* types would imply any
overhead in *comparable* cases. But this is another discussion. Again, any
such comparison should be correct. I don't care what cost MI inflicts on
record members inherited through it, because that is not legal now and thus
irrelevant. Would inheritance from interfaces become more expensive? (a
comparable case) I don't believe it.

Yannick Duchêne (Hibou57)

unread,
Oct 27, 2011, 11:13:28 AM10/27/11
to
Le Thu, 27 Oct 2011 09:43:22 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:

> Note that it was not about multiple inheritance. Yannick suggested that
> making types like Boolean, String, and Integer have classes and
> primitive operations would mean a performance loss. That is wrong.
I said exactly that this would require whole-program analysis, at the cost
of separate compilation, and thus also at the cost of dropping any kind of
library, either shared or static. If any type is potentially the root of a
class, then you have to avoid dynamic dispatching everywhere possible,
and to do so, you need global analysis. If you don't, you get the direct
performance issues typical of interpreted languages.

But I may be wrong if I am not replying to what you had in mind (not sure
anymore I understand the topic).

an...@att.net

unread,
Oct 27, 2011, 1:40:30 PM10/27/11
to
Here is a reason from a link at Unicode.org:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

"...An ASCII or Latin-1 file can be transformed into a UCS-2 file by
simply inserting a 0x00 byte in front of every ASCII byte. If we
want to have a UCS-4 file, we have to insert three 0x00 bytes instead
before every ASCII byte.

Using UCS-2 (or UCS-4) under Unix would lead to very severe problems.
Strings with these encodings can contain as parts of many wide
characters bytes like "\0" or "/" which have a special meaning in
filenames and other C library function parameters. In addition, the
majority of UNIX tools expects ASCII files and cannot read 16-bit
words as characters without major modifications. For these reasons,
UCS-2 is not a suitable external encoding of Unicode in filenames,
text files, environment variables, etc."
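The hazard in the quoted passage is easy to check: in UCS-2/UTF-16, even ordinary non-ASCII characters can contain the bytes 0x00 (NUL) and 0x2F ('/'). A small sketch, in Python purely for illustration:

```python
# 'A' is U+0041; 'i with ogonek' is U+012F.
text = "A\u012f"

# UTF-16LE bytes: 0x41 0x00 for 'A', 0x2F 0x01 for U+012F.
utf16 = text.encode("utf-16-le")
assert utf16 == b"A\x00/\x01"
assert b"\x00" in utf16   # NUL, the C string terminator
assert b"/" in utf16      # the Unix path separator

# UTF-8 never embeds NUL or '/' inside a multi-byte sequence.
utf8 = text.encode("utf-8")
assert b"\x00" not in utf8
assert b"/" not in utf8
```

This is why UTF-8 strings can pass through byte-oriented filename APIs unchanged, while UCS-2/UCS-4 strings cannot.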


So Wide_Character could cause problems in other parts of the OS
or Ada/C libraries. And Ada does have "Safety and Security"
concerns, like paragraph 4 in Annex H:

4 Restricting language constructs whose usage might complicate the
demonstration of program correctness

Plus, the goal of "reliability, maintainability, and efficiency" could
not be kept if Ada.Directories used Wide_Character, because the storage
of Wide_Character, whether 16-bit or 32-bit, is not as efficient as 8 bits
for filenames. Just think about the old simple 8 by 3 character file
names. In Wide_Characters that would minimally be 16 by 6 bytes (UCS-2)
or even 32 by 12 bytes (UCS-4). That means searching and comparing names
could take 2 to 4 times longer and need 2 or 4 times more storage for the
name, which is less efficient. A quick note on maintainability: how many
systems will be using (16/32-bit) Unicode for their filenames?

So, to preserve reliability and efficiency, Wide_Characters should be kept
to the routines and data that require the additional storage to be
accurate, not to files, which are already hurt because they are normally
on slower access media. Taking more time defeats the purpose of a
timely, reliable program.
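The size arithmetic above can be verified directly; this sketch (Python, for illustration only) encodes a classic 8.3 name under the three encodings:

```python
name = "FILENAME.EXT"  # classic 8.3 name: 12 ASCII characters

sizes = {
    "UTF-8": len(name.encode("utf-8")),      # 1 byte per ASCII char
    "UCS-2": len(name.encode("utf-16-le")),  # 2 bytes per char
    "UCS-4": len(name.encode("utf-32-le")),  # 4 bytes per char
}
# For pure-ASCII names, the wide encodings cost exactly 2x and 4x.
assert sizes == {"UTF-8": 12, "UCS-2": 24, "UCS-4": 48}
```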


In <9937871.172.1318575525468.JavaMail.geo-discussion-forums@prib32>, Michael Rohan <michael...@gmail.com> writes:
>Hi,
>
>I've working a little on accessing files and directories using Ada.Directories
>and have been using a thin wrapper layer to convert from Wide_String to
>UTF8 and back. It does, however, seem strange there is no Wide_Directories
>version in the std library. Was there a technical reason it wasn't included?
>
>Take care,
>Michael

Robert A Duff

unread,
Oct 27, 2011, 3:39:31 PM10/27/11
to
"Yannick Duchêne (Hibou57)" <yannick...@yahoo.fr> writes:

> Le Thu, 27 Oct 2011 09:43:22 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>
>> Note that it was not about multiple inheritance. Yannick suggested that
>> making types like Boolean, String, and Integer have classes and
>> primitive operations would mean a performance loss. That is wrong.

> I exactly said this would require program analysis as a whole, at the
> cost of separate compilation, and thus also at the cost of dropping any
> kind of library, either shared or static. If any type is potentially
> the root of a class, then you have to avoid dynamic dispatching
> everywhere possible, and to do so, you need global analysis. If you don't,
> you get the direct performance issues, typical of interpreted languages.

I'm not sure what whole-program analysis you're thinking of.

In Ada, you can tell whether a procedure is dispatching at compile time
of the declaration of that procedure. And you can tell whether a given
call is dispatching at compile time of that call. No whole-program
analysis needed.

There would be some overhead when converting a Boolean to Boolean'Class.
A Boolean should fit in 1 byte (or 1 bit if packed). So you don't want
to store a Tag with every Boolean. Instead, you want to gin up the
Tag on conversion to class-wide. But this overhead is not DISTRIBUTED
overhead, so it doesn't matter.

- Bob

Yannick Duchêne (Hibou57)

unread,
Oct 27, 2011, 5:09:42 PM10/27/11
to
Le Thu, 27 Oct 2011 21:39:31 +0200, Robert A Duff
<bob...@shell01.theworld.com> a écrit:
> In Ada, you can tell whether a procedure is dispatching at compile time
> of the declaration of that procedure. And you can tell whether a given
> call is dispatching at compile time of that call. No whole-program
> analysis needed.
If a call is not dispatching, it may be anything you want, it will be
the same: a deterministic call. If you want classes, it is for
dispatching calls, I suppose (*); if you do not expect to use dispatching
calls, you may not need classes. Finally, I suppose if one wants classes,
that means he/she wants dispatching calls. If someone wants dispatching
calls on some high-level custom types, that's OK; if someone wants
dispatching calls on general-purpose, basic types like Boolean, that's not
OK; that costs too much, unless optimized.

To know whether a call is dispatching or static does not make dispatching
calls less costly. If you want dispatching calls not to cost too much, you
have to optimize these calls, which requires the mentioned global
analysis. If you do not think about dispatching calls, then you may not
need classes.

Or else, what kind of classes was this all about?

I may just have misunderstood the topic; I keep that in mind too.

(*) Not necessarily, but I feel this is really how things typically go.

Dmitry A. Kazakov

unread,
Oct 28, 2011, 3:50:03 AM10/28/11
to
On Thu, 27 Oct 2011 23:09:42 +0200, Yannick Duchêne (Hibou57) wrote:

> if you do not expect to use dispatching
> calls, you may not need classes.

No. If you don't expect to use the number 123, that does not imply that
integer numbers shall not have that value.

> Finally, I suppose if one wants classes,
> that means he/she wants dispatching calls. If someone wants dispatching
> calls on some high-level custom types, that's OK; if someone wants
> dispatching calls on general-purpose, basic types like Boolean, that's not OK,

1. You should explain the difference. Why are some types more types than
others?

2. Dispatching never happens on *a* type, it does on a *set* of types. In
Ada you simply cannot have a dispatching call on a specific type. Ada is a
typed language.

> this cost too much, unless optimized.

Nope, it does not cost anything. You are comparing costs of using something
with the costs of not writing (and thus not executing) the program at all.
Non-existing programs consume no resources.

Again, in order to make comparison meaningful you have to consider
comparable cases. For example, having a class you can put class-wide
instances into a container. Without the class you have to write some
variant record wrapper type with alternatives of different types and the
discriminant playing the role of a tag. Now you could compare the
performance of this poor man's class implementation and one of the proper
class.

> To know whether a call is dispatching or static does not make dispatching
> calls less costly. If you want dispatching calls not to cost too much, you
> have to optimize these calls,

Not in Ada, where specific and class-wide types are distinct. If you
statically know the type, you declare the object of that type. If you don't
know the type, then presently, for types which are not tagged, you cannot
write the program at all.

> Or else, what kind of classes was this all about ?

Class = set of types closed upon inheritance.

Yannick Duchêne (Hibou57)

unread,
Oct 28, 2011, 4:45:07 AM10/28/11
to
Le Fri, 28 Oct 2011 09:50:03 +0200, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> a écrit:
> 2. Dispatching never happens on *a* type, it does on a *set* of types.
I meant on the class-wide view of a type (ok, sorry for the dirty wording).

Dmitry A. Kazakov

unread,
Oct 28, 2011, 10:59:31 AM10/28/11
to
On Fri, 28 Oct 2011 10:45:07 +0200, Yannick Duchêne (Hibou57) wrote:

> Le Fri, 28 Oct 2011 09:50:03 +0200, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> a écrit:
>> 2. Dispatching never happens on *a* type, it does on a *set* of types.
> I meant on the class-wide view of a type (ok, sorry for the dirty wording).

For non-tagged types there will be no class-wide view at all. Conversion to
class-wide will create a new object.

This is also the scheme for ad-hoc supertypes and for subtypes which do not
inherit the representation. For them, type conversions must be "physical".