Code gen - calling sequences


James Harris

Aug 24, 2021, 1:25:43 PM
It's all your fault. You guys keep bringing up in other discussions
topics which are so interesting and important that they deserve their
own threads. :-)

I've been taking the top off a chimney stack (as one does!, horrible job
for someone who doesn't like heights!) and I have quite a few messages
to read and some to respond to but I did see the discussion on calling
conventions and, like some of you, I have doubts about the standards.

In particular, IMO a program should go through whole-program
optimisation, including subroutine calls. There are significant savings
to be had from either embedding the callee in the caller or getting the
callee to save only the registers which it needs to or in tweaking which
registers a callee will use, or in stripping out unneeded bracketing, etc.

My preferred distribution format is object files or ASTs, certainly not
linked files. IMO object or AST distribution is the best way to (1)
obfuscate and/or lock in source code (for those who wish to do so) and
yet (2) allow important decisions about the execution image to be made
for the specific target machine.

Failing that, if one cannot treat an entire program as optimisable,
another option is to have each subroutine /publish/ which registers it
uses and a caller to maintain a note of which registers are live at each
call. Then the link step can add code for the caller to save only those
registers which are in the intersection of those sets and which cannot
be easily reloaded.
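
The save-set rule described above can be sketched in C as a bitmask intersection. This is only an illustration: the register numbering, the `regset_t` type and the function names are assumptions made for the example, not part of any real linker.

```c
#include <stdint.h>

/* One bit per general-purpose register; the numbering is hypothetical. */
typedef uint32_t regset_t;

enum { R0 = 1u << 0, R1 = 1u << 1, R2 = 1u << 2,
       R3 = 1u << 3, R4 = 1u << 4, R5 = 1u << 5 };

/* Registers the caller must save around one call site: those that are
   live across the call AND used by the callee, minus any whose value
   can be cheaply reloaded or recomputed afterwards. */
static regset_t regs_to_save(regset_t live_at_call,
                             regset_t callee_uses,
                             regset_t cheap_to_reload)
{
    return (live_at_call & callee_uses) & ~cheap_to_reload;
}

/* Number of save/restore pairs that would be emitted for a set. */
static int count_regs(regset_t s)
{
    int n = 0;
    while (s) {
        s &= s - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With R0, R1 and R4 live, a callee that uses R1, R2 and R4, and R4 marked as cheap to reload, only R1 would need saving.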

These days why use calling conventions at all? Perhaps they are only
needed for when there's complete ignorance of the callee. The
traditional concept of calling conventions may be pass\acute/e. ;-)


--
James Harris

Bart

Aug 24, 2021, 3:07:03 PM
On 24/08/2021 18:25, James Harris wrote:

> These days why use calling conventions at all? Perhaps they are only
> needed for when there's complete ignorance of the callee. The
> traditional concept of calling conventions may be pass\acute/e. ;-)

Funnily enough I never paid much attention until last autumn when I
wanted to generate more optimised code.

But the reasons to [mostly] use the official ABI on Win64 are as follows:

* To be able to call across FFIs (eg. call into existing, separately
compiled code in .dll or .so files, or, if you're writing the library,
allow other programs to call your code).

* Be able to have callbacks (OS and library functions calling a function
in your program) or, in general, be able to pass a function pointer to
separately compiled code.

* If you have your own scheme for allocations of volatile, non-volatile
and parameter-passing registers, then it greatly simplifies things at
those cross-program boundaries if they use the same scheme.

* In my case, having no idea about register allocation since I'd never
bothered before, and needing to sometimes deal with the ABI anyway, as
well as for the last reason, I thought I might as well use the
tried-and-tested scheme used in the Win64 ABI.

Note that I'm talking about 64-bit processors where languages do
generally use the same ABI. And the Win64 ABI, perhaps SYS V too,
requires a 16-byte-aligned stack [just before CALL] which causes grief
if you're doing your own thing but need to call via FFI.
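
For a code generator that tracks its own stack depth, the alignment rule can be handled with a small padding calculation before each FFI call. The sketch below is an assumption-laden illustration: `call_padding`, `depth` and `arg_bytes` are made-up names, and on Win64 the 32 bytes of shadow space would have to be counted by the caller as part of `arg_bytes`.

```c
#include <stdint.h>

#define STACK_ALIGN 16u

/* Bytes of padding to reserve so that, once `arg_bytes` of stack
   arguments have been pushed, the stack pointer is a multiple of 16
   at the CALL instruction.
   `depth` is the number of bytes already pushed since the last point
   at which the stack was known to be 16-byte aligned. */
static uint32_t call_padding(uint32_t depth, uint32_t arg_bytes)
{
    uint32_t misalign = (depth + arg_bytes) % STACK_ALIGN;
    return misalign ? STACK_ALIGN - misalign : 0;
}
```

For example, with 8 bytes already pushed and no stack arguments, 8 bytes of padding restore the alignment; with 8 bytes pushed and one 8-byte stack argument, no padding is needed.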

I understand it's still a free-for-all on 32-bit systems so there,
there's more of a choice still.


David Brown

Aug 24, 2021, 4:56:04 PM
On 24/08/2021 21:06, Bart wrote:
> On 24/08/2021 18:25, James Harris wrote:
>
>> These days why use calling conventions at all? Perhaps they are only
>> needed for when there's complete ignorance of the callee. The
>> traditional concept of calling conventions may be pass\acute/e. ;-)

James, aren't you using Linux? The compose key makes it easy to write
letters like é - it's just compose, ´, e - "passé". (It's even easier
if you have a non-English keyboard layout, in Windows or Linux, as these
usually have "dead keys" for accents.)

>
> Funnily enough I never paid much attention until last autumn when I
> wanted to generate more optimised code.
>
> But the reasons to [mostly] use the official ABI on Win64 are as follows:

Of course, you would only want to use the Win64 ABI on Win64... But
your reasons apply to using the standard ABI on whatever target you are
working with at the time.

>
> * To be able to call across FFIs (eg. call into existing, separately
> compiled code in .dll or .so files, or, if you're writing the library,
> allow other programs to call your code).
>

Yes.

> * Be able to have callbacks (OS and library functions calling a function
> in your program) or, in general, be able to pass a function pointer to
> separately compiled code.
>

Yes.

> * If you have your own scheme for allocations of volatile, non-volatile
> and parameter-passing registers, then it greatly simplifies things at
> those cross-program boundaries if they use the same scheme.
>

Yes.

> * In my case, having no idea about register allocation since I'd never
> bothered before, and needing to sometimes deal with the ABI anyway, as
> well as for the last reason, I thought I might as well use the
> tried-and-tested scheme used in the Win64 ABI.
>

No single scheme is going to be ideal in all circumstances. But
standard ABIs for a given platform are likely to be solid
general-purpose choices, and you'd have to do a lot of work to get
something significantly better. Remember, good artists copy - great
artists steal!

> Note that I'm talking about 64-bit processors where languages do
> generally use the same ABI. And the Win64 ABI, perhaps SYS V too,
> requires a 16-byte-aligned stack [just before CALL] which causes grief
> if you're doing your own thing but need to call via FFI.
>

As long as you know about these things in advance, they shouldn't be too
hard to handle. There are a few SIMD instructions in x86-64 that have
very strict alignment requirements, which could be the reason for these
restrictions.

> I understand it's still a free-for-all on 32-bit systems so there,
> there's more of a choice still.
>

It's chaos in the 32-bit /Windows/ world. Every other 32-bit system has
solid, fixed, documented ABIs followed by all toolchains (at least for
their FFIs).


But there is a lot to be said for having more advanced object code
formats, as James suggested, for code within the toolchain. Host
computers are a lot more powerful than when the traditional compile,
assemble, link arrangement was developed. Modern compilers like
clang/llvm and gcc can do "link-time optimisation" - C (or C++, Fortran,
Ada, whatever) files are compiled and optimised to an internal
representation format for the object files, not assembly or machine
code. Then there is a first round of linking that joins these together,
and then the combination is optimised (function inlining, outlining,
cloning, partial inlining, constant propagation, etc.), before
generating assembly and machine code.

The results can be quite significantly more efficient. It also lets you
program in a different way, and lets you design your language in a
different way - "interface" files no longer need implementation details
for efficiency.

But it is also a good deal more complicated to work with, and very much
more difficult to implement in a scalable fashion. Early versions of
gcc's LTO did the linking and LTO optimisation and code generation in a
serial fashion - pretty horrible for compiling Firefox or LibreOffice.
Getting a good system for partitioning the problem to scale over
multiple cores, and to avoid needing absurd amounts of memory to do so,
has taken time and is still a work in progress. LTO gives good results,
but it is not for the faint of heart implementer.

It also plays buggery with your debugging. Such is life.

Bart

Aug 24, 2021, 6:54:31 PM
On 24/08/2021 21:56, David Brown wrote:

> But there is a lot to be said for having more advanced object code
> formats, as James suggested, for code within the toolchain. Host
> computers are a lot more powerful than when the traditional compile,
> assemble, link arrangement was developed.

Yes, that's why I'm exploring different approaches like whole-program
compiling.

If I had a better optimiser, that would allow some interesting things.
But as I have it, it wouldn't be able to cross boundaries between
executables and shared libraries. (I'm not sure gcc/lto can either.)

On the other hand, if the DLL library (lib.dll) was also written in my
language, I would import that as a shared library using 'importx lib' in
the application. But by simply changing that to 'import lib', it would
compile its sources as part of the application, so becoming part of the
same whole program.

> Modern compilers like
> clang/llvm and gcc can do "link-time optimisation" - C (or C++, Fortran,
> Ada, whatever) files are compiled and optimised to an internal
> representation format for the object files, not assembly or machine
> code. Then there is a first round of linking that joins these together,
> and then the combination is optimised (function inlining, outlining,
> cloning, partial inlining, constant propagation, etc.), before
> generating assembly and machine code.
>
> The results can be quite significantly more efficient. It also lets you
> program in a different way, and lets you design your language in a
> different way - "interface" files no longer need implementation details
> for efficiency.

I've long forgotten what interface files are when all code is in my
language.

> But it is also a good deal more complicated to work with, and very much
> more difficult to implement in a scalable fashion. Early versions of
> gcc's LTO did the linking and LTO optimisation and code generation in a
> serial fashion - pretty horrible for compiling Firefox or LibreOffice.

I used to get whole-program optimisation a different way with C: when my
tools used to generate C, it would be a single C source file (the kind
you despised). But since it represented the whole program, it would
allow gcc to do whole-program optimisations without involving the linker.

David Brown

Aug 25, 2021, 4:47:39 AM
On 25/08/2021 00:54, Bart wrote:
> On 24/08/2021 21:56, David Brown wrote:
>
>> But there is a lot to be said for having more advanced object code
>> formats, as James suggested, for code within the toolchain.  Host
>> computers are a lot more powerful than when the traditional compile,
>> assemble, link arrangement was developed.
>
> Yes, that's why I'm exploring different approaches like whole-program
> compiling.
>
> If I had a better optimiser, that would allow some interesting things.
> But as I have it, it wouldn't be able to cross boundaries between
> executables and shared libraries. (I'm not sure gcc/lto can either.)
>

Executables and shared libraries are in formats defined by the OS and
target, and are almost always machine code. That severely limits your
optimisation options.

gcc/LTO can optimise builds using static libraries with IR code, just as
it can with any object files it has compiled - a static library is
nothing more than a combined collection of object files. But it doesn't
make sense to talk about optimising code in shared libraries that is
already compiled to machine code.

It's possible to have shared files in bytecode for a VM, and then use a
JIT compiler to optimise. But you won't have that for the lowest level
libraries.

> On the other hand, if the DLL library (lib.dll) was also written in my
> language, I would import that as a shared library using 'importx lib' in
> the application. But by simply changing that to 'import lib', it would
> compile its sources as part of the application, so becoming part of the
> same whole program.

That's not a shared library, that's a static library used at compile
time (or "link" time, or "build" time if you prefer). A "fat" DLL
containing both shared object code for use at run-time and code (in
machine code, or in an IR code) is perfectly reasonable, and could be a
good way to distribute libraries for flexible and efficient use. To
work well, however, there would need to be an agreement on the IR that
any toolchain could use for any language.

>
>> Modern compilers like
>> clang/llvm and gcc can do "link-time optimisation" - C (or C++, Fortran,
>> Ada, whatever) files are compiled and optimised to an internal
>> representation format for the object files, not assembly or machine
>> code.  Then there is a first round of linking that joins these together,
>> and then the combination is optimised (function inlining, outlining,
>> cloning, partial inlining, constant propagation, etc.), before
>> generating assembly and machine code.
>>
>> The results can be quite significantly more efficient.  It also lets you
>> program in a different way, and lets you design your language in a
>> different way - "interface" files no longer need implementation details
>> for efficiency.
>
> I've long forgotten what interface files are when all code is in my
> language.

Does your language not make a distinction between the public interface
for a module/unit/package/whatever, and the implementation?

>
>> But it is also a good deal more complicated to work with, and very much
>> more difficult to implement in a scalable fashion.  Early versions of
>> gcc's LTO did the linking and LTO optimisation and code generation in a
>> serial fashion - pretty horrible for compiling Firefox or LibreOffice.
>
> I used to get whole-program optimisation a different way with C: when my
> tools used to generate C, it would be a single C source file (the kind
> you despised). But since it represented the whole program, it would
> allow gcc to do whole-program optimisations without involving the linker.
>

I disapprove of single massive files as a way of writing source code.
Any kind of development (not just programming) should be modularised and
split into manageable pieces.

Generated C as an intermediary step in compiling a language is not
source code, and does not have to follow the same kinds of rules.
Collecting all your /generated/ C into one big file and compiling it is
not unreasonable, and is a technique used by a number of language tools.

Of course, this does not scale well for large projects, but usually
large projects would be done in more mainstream languages with more
mature tools.

Bart

Aug 25, 2021, 6:23:05 AM
On 25/08/2021 09:47, David Brown wrote:
> On 25/08/2021 00:54, Bart wrote:

>> I've long forgotten what interface files are when all code is in my
>> language.
>
> Does your language not make a distinction between the public interface
> for a module/unit/package/whatever, and the implementation?

Between the modules of a program? I export entities in a module by
marking them with a 'global' attribute.

Using that module requires compiling it so the source is always needed.
Exports can be summarised in one file, eg. for docs, by extra options
(see below).

A separate interface is only used when I create a discrete library, in
the form of a DLL file. In such cases (remember the library may comprise
multiple modules with shared functions), names exported from the library
are marked with an 'export' attribute rather than 'global':

C:\mapps>mm -dll jpeg
Compiling jpeg.m-------- to jpeg.dll
Writing exports file to jpeg.exp

It automatically writes an interface file called jpeg.exp:

C:\mapps>type jpeg.exp
importlib $jpeg =
mlang function loadjpeg(ref char file,ref i64 width,height) => ref u8
mlang proc freejpeg(ref u8 p)
end importlib

But this is for use from my language:

importx jpeg # incorporates jpeg.exp, and automatically
# adds jpeg.dll to build files

Other languages need to manually build bindings in that language to use
this library, although generating C headers in the same way would be a
simple matter.
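
As a rough idea of what such a generated C header might contain for the two exports in jpeg.exp, here is a hand-written sketch. The type mapping (ref char to `char *`, ref i64 to `int64_t *`, ref u8 to `uint8_t *`) and the reading that `height` shares the type of `width` are assumptions; the calling convention and symbol naming of the real jpeg.dll are not covered here.

```c
#include <stdint.h>

/* Hypothetical C binding for the exports listed in jpeg.exp.
   Assumed mapping: ref char -> char *, ref i64 -> int64_t *,
   ref u8 -> uint8_t *. */
uint8_t *loadjpeg(char *file, int64_t *width, int64_t *height);
void freejpeg(uint8_t *p);
```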

(There is another option "-docs", but this is intended for functions
with doc-strings. All those functions with associated doc-strings are
listed in a separate file. It is not meant to provide full information
for other languages to base bindings on.)

I'm also finding a need for a 'package', which is a group of related
modules used as a library, but compiled with the other modules rather
than be an independent DLL. I've been experimenting with ways of
expressing and dealing with that.

To summarise how I'd import those various interfaces in my static language:

import A # import module A.m directly; all 'global'
# names from A become visible

importx B # import names from B.exp
# B.dll will be attached to the build

importd C # (proposed) same as importx but the
# necessary info is inside C.dll

import* D # Will 'broadcast' whatever imports are
# used in D.M, as though all those modules
# were imported here

importdll E = # Used for FFI imports from E.dll, followed
# by list of FFI declarations. This is for
# external libraries not in my language.
# Usually such a block is wrapped in a regular
# module, so I just do, eg: 'import clib'
# E.dll is added to the build
....
end

importdll $F = # The $ signifies a dummy DLL not added to
# the build. Other ways are used to inform the
# compiler of which DLLs are needed.
# This is used eg when the FFI functions come
# from various DLLs

For each of the import statements, the module name becomes a namespace.


>> I used to get whole-program optimisation a different way with C: when my
>> tools used to generate C, it would be a single C source file (the kind
>> you despised). But since it represented the whole program, it would
>> allow gcc to do whole-program optimisations without involving the linker.
>>
>
> I disapprove of single massive files as a way of writing source code.

As I keep saying, I never do that.

Large single files are /always/ generated; here these single output
files always represent the entire program, but they are generated from
all the true modules of the project:

.asm Single ASM file
.c (On some products) Single C source file
.dll Single shared library file
.exe Single executable file
.ma Single file 'amalgamated' version of M sources
.obj Single object file (this one is unusual)
.pcl (New) Single IL source file
.qa Single file 'amalgamated' version of Q sources

It's just what I do. EXE/DLL are universally used as single-file
representations of a program or library; the others are less common.

Rod Pemberton

Aug 26, 2021, 9:22:09 PM
On Tue, 24 Aug 2021 18:25:40 +0100
James Harris <james.h...@gmail.com> wrote:

> It's all your fault.

Never isn't.

> You guys keep bringing up in other discussions topics which are so
> interesting and important that they deserve their own threads. :-)

So, you've admitted to becoming as dull and boring as the rest of us?
Oh crud, the end is nigh ... ;-)

> I've been taking the top off a chimney stack

Why?

And, you say that as if it's actually something normal to do:

> (as one does!,

No, no, no. No one does that. Ever. I've even had the opportunity to
remove unused chimneys in houses **twice**, e.g., for more closet
space. Never did it. Not once. The chimneys are still standing.

People pay "professionals" to do that, i.e., brick layers and chimney
builders. E.g., removing an unused chimney in a rental property is
dangerous because of the weight of the bricks, i.e., the wood floor
can't support the weight of the removed bricks of an entire chimney.
The chimney weight was supported by the foundation, not the wood floor.
Nobody tensions their own garage springs either. Too dangerous.
That's even if they understand the process, safety measures, have the
tools, and in theory could do it themselves. Does anyone here do their
own dental work? ... No. I'm all for DIY, but some things can be done
better or more safely by someone else, for which you'll have to pay up.

> horrible job for someone who doesn't like heights!)

Don't look down. Block your down-view below a virtual "floor level".
Think of a horse blinder, but horizontal.

(If you aren't ROFL saying OMFG! right now, you probably should be ...
Afterward, you'll probably delve into the psychology of why you're
actually afraid of heights. It ain't pretty. BTDT.)

> In particular, IMO a program should go through whole-program
> optimisation, including subroutine calls.

Why? Does the entire program need optimized? Or, does just the
portion which does the majority of the work or consumes the most
processor time need to be optimized?

> There are significant savings to be had from either embedding the
> callee in the caller or getting the callee to save only the registers
> which it needs to or in tweaking which registers a callee will use, or
> in stripping out unneeded bracketing, etc.

Do your programs follow the same basic design patterns?

E.g., quite a few utilities of mine, mostly very simple programs,
follow the same format: gather some setup info, call a function which
loops through the work, return to do cleanup and exit. Of course,
there are a bunch of helper functions which are only called a few
times. I.e., the program is really one main loop which does the
majority of the work.

If your programs follow similar patterns of design, then does the
entire program need to be optimized? E.g., for the utilities I
mentioned above, only the central loop and related functions really
need to be optimized. The amount of time in set up, clean up, and
ancillary functions is minimal.

--
"There was never a good time to withdraw U.S. forces," said President
Joe Biden. But, there were less terrible times ...

Rod Pemberton

Aug 26, 2021, 9:25:13 PM
On Wed, 25 Aug 2021 10:47:37 +0200
David Brown <david...@hesbynett.no> wrote:

> On 25/08/2021 00:54, Bart wrote:
> > On 24/08/2021 21:56, David Brown wrote:
> >

> >> But it is also a good deal more complicated to work with, and very
> >> much more difficult to implement in a scalable fashion.  Early
> >> versions of gcc's LTO did the linking and LTO optimisation and
> >> code generation in a serial fashion - pretty horrible for
> >> compiling Firefox or LibreOffice.
> >
> > I used to get whole-program optimisation a different way with C:
> > when my tools used to generate C, it would be a single C source
> > file (the kind you despised). But since it represented the whole
> > program, it would allow gcc to do whole-program optimisations without
> > involving the linker.
> >
>
> I disapprove of single massive files as a way of writing source code.
> Any kind of development (not just programming) should be modularised
> and split into manageable pieces.

Well, I'd say that disrupts numerous optimizations which could be done
by the programmer ... "Why, you can't see the forest for them there
trees in the way!" And, if you're not looking at all of the trees,
because they're off hiding somewhere else, you'll never see the forest
at all. How does being "scatterbrained" about programming help? ...
(Meaning everything you need being somewhere else than where you need
it so you can't make any real sense of it. Obfuscation.)

James Harris

Aug 27, 2021, 11:04:06 AM
On 27/08/2021 03:23, Rod Pemberton wrote:
> On Tue, 24 Aug 2021 18:25:40 +0100
> James Harris <james.h...@gmail.com> wrote:
>
>> It's all your fault.
>
> Never isn't.
>
>> You guys keep bringing up in other discussions topics which are so
>> interesting and important that they deserve their own threads. :-)
>
> So, you've admitted to becoming as dull and boring as the rest of us?

"becoming"???

...

>> I've been taking the top off a chimney stack
>
> Why?

It's letting water in.

I was thinking to remove the whole chimney stack as it is no longer used
but then I wondered whether the local authority might tell me to
reinstate it. That would not be good, especially as the bricks I've
removed so far are soft and are breaking. So current idea is to remove
the dodgy top layer or two and cap it while we have some dry weather.
(Though I am very wary about being able to lift a 2' square concrete cap
up the ladder without putting so much sideways pressure on the stack
such that it falls over!)

...

>
> No, no, no. No one does that. Ever. I've even had the opportunity to
> remove unused chimneys in houses **twice**, e.g., for more closet
> space. Never did it. Not once. The chimneys are still standing.
>
> People pay "professionals" to do that, i.e., brick layers and chimney
> builders. E.g., removing an unused chimney in a rental property is
> dangerous because of the weight of the bricks, i.e., the wood floor
> can't support the weight of the removed bricks of an entire chimney.
> The chimney weight was supported by the foundation, not the wood floor.
> Nobody tensions their own garage springs either. Too dangerous.
> That's even if they understand the process, safety measures, have the
> tools, and in theory could do it themselves. Does anyone here do their
> own dental work? ... No. I'm all for DIY, but some things can be done
> better or more safely by someone else, for which you'll have to pay up.

I quite like the idea of doing it myself - partly for the cost and
partly for the experience - though it's not my line of work and I am
fearful of the chimney breaking in two due to the sideways pressure from
the ladder so this is not without risk. I do have a way of bracing the
stack on the other side so it /should/ be OK. :-)

Let's just say that if I don't post for a long time I'm probably
recovering in hospital and my house is in ruins. ;-)

...

>> In particular, IMO a program should go through whole-program
>> optimisation, including subroutine calls.
>
> Why? Does the entire program need optimized? Or, does just the
> portion which does the majority of the work or consumes the most
> processor time need to be optimized?

That's because function calls can require a lot of work that isn't
really necessary. For example, consider a simple function to return the
product of the sum and difference of two numbers:

int psd(int a, int b) {
    return (a + b) * (a - b);
}

Under simple calling conventions an executable's invocation of it would

* save any in-use caller-save registers
* push the two arguments
* push the return address
* start the function
* save any callee-save registers
* adjust the base register and the stack pointer
* compute the result
* adjust the base register and the stack pointer again
* restore any callee-save registers
* return
* adjust the stack pointer to remove any pushed arguments
* reload caller-save registers (or allow that to happen organically)

and that's without a cost for checking for stack overflow (because it
could be avoided when calling a leaf function).

Essentially, with whole-program optimisation many of the above steps can
be removed, even if the callee is kept separate rather than being inlined.

For example, the above could be reduced to

* save any registers the callee would trash
* push the return address
* start the function
* compute the result
* return

The shorter processing would be possible if the callee published which
registers it expected the parameters to be passed in and which registers
it trashed, and the caller used that info as part of its
register-allocation constraints. That would be a rather cool way to do it!
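
A minimal sketch of what "publishing" might look like as data, and how a caller's register allocator could use it as a constraint. The `call_info` layout, the register numbering and `pick_surviving_reg` are all made up for illustration; no existing object format is implied.

```c
#include <stdint.h>

typedef uint32_t regset_t;   /* one bit per register, hypothetical numbering */

/* What a callee would publish about itself. */
struct call_info {
    uint8_t  param_reg[4];   /* register number expected for each parameter */
    uint8_t  nparams;        /* how many entries of param_reg are valid */
    regset_t trashed;        /* registers the callee overwrites */
};

/* Pick a register for a value that must survive the call: the
   lowest-numbered free register that the callee does not trash.
   Returns -1 if there is none, in which case the caller would have
   to save the value around the call instead. */
static int pick_surviving_reg(const struct call_info *callee,
                              regset_t free_regs, int nregs)
{
    regset_t ok = free_regs & ~callee->trashed;
    for (int r = 0; r < nregs; r++)
        if (ok & (1u << r))
            return r;
    return -1;
}
```

If psd() published that it takes its parameters in registers 0 and 1 and trashes registers 0 to 2, a caller with registers 0 to 7 free would keep its live value in register 3 and need no save at all.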

Interestingly, asm programmers have been writing subroutines that way
for decades.

Of course, there are downsides to the shorter version. For instance, if
the callee were to be updated and recompiled such that it used different
registers then it may be that all callers would also need to be recompiled.


>
>> There are significant savings to be had from either embedding the
>> callee in the caller or getting the callee to save only the registers
>> which it needs to or in tweaking which registers a callee will or or
>> in stripping out unneeded bracketing, etc.
>
> Do your programs follow the same basic design patterns?
>
> E.g., quite a few of utilities of mine, mostly very simple programs,
> follow the same format: gather some setup info, call a function which
> loops through the work, return to do cleanup and exit. Of course,
> there are a bunch of helper functions which are only called a few
> times. I.e., the program is really one main loop which does the
> majority of the work.

That's very interesting. I don't know if my programs to date follow a
common pattern but I am getting more and more into a certain way of
thinking about processing - and that includes a main loop, as you mention.

>
> If your programs follow similar patterns of design, then does the
> entire program need to be optimized? E.g., for the utilities I
> mentioned above, only the central loop and related functions really
> need to be optimized. The amount of time in set up, clean up, and
> ancillary functions is minimal.
>

Agreed. Code inside and outside of loops would normally be in scope for
a conventional optimiser but whole-program optimisation is, AISI,
largely about calls and returns.


--
James Harris

Bart

Aug 27, 2021, 4:07:39 PM
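> and that's without a cost for checking for stack overflow (because it
> could be avoided when calling a leaf function).

How? Calling a leaf function can still consume stack space, perhaps a
lot of it. It would need some extra analysis to show that the call-depth
would never get that deep.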

> Essentially, with whole-program optimisation many of the above steps can
> be removed, even if the callee is kept separate rather than being inlined.

Whole program optimisation is secondary here. The above is just routine
optimisation.

My compiler barely does any (it stores a handful of locals into
non-volatile registers, and that's pretty much it; the rest is some
tidying up), yet it converts your psd() into only 5 x64 instructions.

A call to psd(10,20) is done in 3 instructions.

Whole-program compilers (which make whole-program optimisation easier)
would just ensure that all function bodies within the program are
visible at any call-site.

So for my example call, that would allow it to be inlined and reduced.

(Mine doesn't do inlining; I'd have to write psd() as a macro - a proper
one not what C has - then it reduces psd(10,20) to 'mov rax, -300'.)
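
The same effect can be sketched in C by making the body visible at the call site. This is only an illustration of the idea; whether a particular compiler actually folds the call depends on its optimisation settings, and `psd_10_20` is a made-up wrapper name.

```c
/* Same function as earlier in the thread, visible at the call site. */
static inline int psd(int a, int b)
{
    return (a + b) * (a - b);
}

/* With the body visible and constant arguments, an optimising
   compiler is free to fold this call to a single constant. */
static int psd_10_20(void)
{
    return psd(10, 20);   /* (10 + 20) * (10 - 20) = 30 * -10 = -300 */
}
```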


> For example, the above could be reduced to
>
>   * save any registers the callee would trash
>   * push the return address
>   * start the function
>   * compute the result
>   * return
>
> The shorter processing would be possible if the callee published which
> registers it expected the parameters to be passed in and which registers
> it trashed, and the caller used that info as part of its
> register-allocation constraints. That would be a rather cool way to do it!

A first pass generating code would result in a set of registers used by
each function. This could result in simpler call sequences, and in turn
perhaps even fewer registers required on calls. Then a subsequent
iteration could improve it further.
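
One way to picture the iteration is as a fixed-point computation over the call graph: each function's set of affected registers is its own set plus those of everything it calls, repeated until nothing changes. The sketch below assumes a tiny fixed-size table; `struct func`, `MAX_FUNCS` and `compute_clobbers` are hypothetical names, not anything from an existing compiler.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_FUNCS 8

typedef uint32_t regset_t;        /* one bit per register */

struct func {
    regset_t own_regs;            /* registers the body itself writes */
    int      ncallees;            /* number of valid entries in callee[] */
    int      callee[MAX_FUNCS];   /* indices of the functions it calls */
    regset_t clobbers;            /* result: own_regs plus all callees' sets */
};

/* Repeat until no function's set changes. Returns the number of passes
   made, including the final pass that found nothing to change. */
static int compute_clobbers(struct func *f, int n)
{
    int passes = 0;
    bool changed = true;

    for (int i = 0; i < n; i++)
        f[i].clobbers = f[i].own_regs;

    while (changed) {
        changed = false;
        passes++;
        for (int i = 0; i < n; i++) {
            regset_t c = f[i].clobbers;
            for (int k = 0; k < f[i].ncallees; k++)
                c |= f[f[i].callee[k]].clobbers;
            if (c != f[i].clobbers) {
                f[i].clobbers = c;
                changed = true;
            }
        }
    }
    return passes;
}
```

Because the sets only ever grow and are bounded by the number of registers, the loop terminates even when the call graph contains recursion.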

So no reason for the callee to publish anything (if you mean declare
things in source code).

However, for separately compiled code (eg. existing binary code inside
a DLL), such extra metadata could be added by the compiler. This
wouldn't affect normal use, but a compiler (it would need to know the
DLLs used) could look for that info to garner useful hints.


> Agreed. Code inside and outside of loops would normally be in scope for
> a conventional optimiser but whole-program optimisation is, AISI,
> largely about calls and returns.

It can do a bit more. For example, work out that a function [one not
exported from the program] is never used, or only used in one place, or
only used with constant arguments. It can do a similar analysis of
variables and other entities.
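
As a toy illustration of that kind of analysis, the following C sketch tallies call sites per function once every call in the program is visible. The `call_site` and `func_facts` structures are invented for the example; a real compiler would derive them from its own IR and would exclude exported functions and the entry point from the "never used" conclusion.

```c
#include <stdbool.h>

struct call_site {
    int  caller;           /* index of the calling function */
    int  callee;           /* index of the called function */
    bool all_const_args;   /* every argument is a compile-time constant */
};

struct func_facts {
    int  ncalls;           /* number of call sites that target the function */
    bool only_const_args;  /* called at least once, always with constants */
};

static void analyse_calls(const struct call_site *sites, int nsites,
                          struct func_facts *facts, int nfuncs)
{
    for (int i = 0; i < nfuncs; i++) {
        facts[i].ncalls = 0;
        facts[i].only_const_args = true;
    }
    for (int s = 0; s < nsites; s++) {
        struct func_facts *f = &facts[sites[s].callee];
        f->ncalls++;
        if (!sites[s].all_const_args)
            f->only_const_args = false;
    }
    /* A function that is never called has no constant-argument calls. */
    for (int i = 0; i < nfuncs; i++)
        if (facts[i].ncalls == 0)
            facts[i].only_const_args = false;
}
```

A function with `ncalls == 0` is a candidate for removal, one with `ncalls == 1` for inlining into its only caller, and one with `only_const_args` set for specialisation or folding.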

But it's tricky to apply to current languages which still seem to be
dominated by separate compilation of modules, so we're seeing those
complex schemes that David Brown mentioned of specialised object file
formats, and special kinds of linkers.

(The linker that comes with LLVM is 63MB, 1300 times bigger than the
smallest discrete Windows linker I used. My own linker, not discrete,
adds perhaps 10KB to my assembler.)

James Harris

Aug 28, 2021, 10:22:46 AM
I have (or had, it was some time ago) some ideas on that. If you'd like
to start a new topic I'll look out my notes and respond.

>
>> Essentially, with whole-program optimisation many of the above steps
>> can be removed, even if the callee is kept separate rather than being
>> inlined.
>
> Whole program optimisation is secondary here. The above is just routine
> optimisation.
>
> My compiler barely does any (it stores a handful of locals into
> non-volatile registers, and that's pretty much it; the rest is some
> tidying up), yet it converts your psd() into only 5 x64 instructions.
>
> A call to psd(10,20) is done in 3 instructions.
>
> Whole-program compilers (which make whole-program optimisation easier)
> would just ensure that all function bodies within the program are
> visible at any call-site.
>
> So for my example call, that would allow it to be inlined and reduced.
>
> (Mine doesn't do inlining; I'd have to write psd() as a macro - a proper
> one not what C has - then it reduces psd(10,20) to 'mov rax, -300'.)

Constants are easy. What's your full call, execute, return sequence for

psd(x, y)

and what convention are you using?

>
>
>> For example, the above could be reduced to
>>
>>    * save any registers the callee would trash
>>    * push the return address
>>    * start the function
>>    * compute the result
>>    * return
>>
>> The shorter processing would be possible if the callee published which
>> registers it expected the parameters to be passed in and which
>> registers it trashed, and the caller used that info as part of its
>> register-allocation constraints. That would be a rather cool way to do
>> it!
>
> A first pass generating code would result in a set of registers used by
> each function. This could result in simpler call sequences, and in turn
> perhaps even fewer registers required on calls. Then a subsequent
> iteration could improve it further.
>
> So no reason for the callee to publish anything (if you mean declare
> things in source code).

Ah, no. In the current context I mean that the /compiled/ code would
publish its register usage (inputs, outputs and trashed).

You could think of this as akin to a mul instruction which has
constraints on which registers can be used. The compiler would treat the
subroutine as an instruction which had register constraints.
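
A minimal C sketch of what such a published record might look like, and how a caller could use it: the caller saves only the registers that are both live across the call and trashed by the callee. The RegUsage struct, the REG macro and must_save are names invented for the illustration; bit n of a mask stands for register Dn.

```c
#include <stdint.h>

typedef uint16_t RegMask;                    /* bit n = register Dn */
#define REG(n) ((RegMask)(1u << (n)))

/* Hypothetical record that compiled code could publish per subroutine. */
typedef struct {
    RegMask inputs;    /* registers in which parameters are expected */
    RegMask outputs;   /* registers holding results on return */
    RegMask trashed;   /* registers whose old contents are lost */
} RegUsage;

/* Registers the caller has to preserve around one particular call:
   the intersection of "holds a value needed after the call" and
   "callee overwrites it". */
RegMask must_save(const RegUsage *callee, RegMask live_across_call)
{
    return (RegMask)(live_across_call & callee->trashed);
}
```

For a routine that takes its inputs in D10 and D11, returns in D0 and also trashes D1, a caller holding live values in D1 and D3 would need to save only D1.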

>
> However, for separately compiled code (eg. existing binary code inside a
> DLL), such extra metadata could be added by the compiler. This wouldn't
> affect normal use, but a compiler (it would need to know the DLLs used)
> could look for that info to garner useful hints.
>
>
>> Agreed. Code inside and outside of loops would normally be in scope
>> for a conventional optimiser but whole-program optimisation is, AISI,
>> largely about calls and returns.
>
> It can do a bit more. For example, work out that a function [one not
> exported from the program] is never used, or only used in one place, or
> only used with constant arguments. It can do a similar analysis of
> variables and other entities.

Agreed.

>
> But it's tricky to apply to current languages which still seem to be
> dominated by separate compilation of modules, so we're seeing those
> complex schemes that David Brown mentioned of specialised object file
> formats, and special kinds of linkers.

Why not just stop compilation at an earlier stage?


--
James Harris

James Harris

Aug 28, 2021, 10:35:30 AM
On 24/08/2021 21:56, David Brown wrote:
> On 24/08/2021 21:06, Bart wrote:
>> On 24/08/2021 18:25, James Harris wrote:
>>
>>> These days why use calling conventions at all? Perhaps they are only
>>> needed for when there's complete ignorance of the callee. The
>>> traditional concept of calling conventions may be pass\acute/e. ;-)
>
> James, aren't you using Linux? The compose key makes it easy to write
> letters like é - it's just compose, ´, e - "passé". (It's even easier
> if you have a non-English keyboard layout, in Windows or Linux, as these
> usually have "dead keys" for accents.)

Thanks, I've now enabled the compose key. I wrote passé the way I did
because it's the notation I am thinking of for my language; as that was
unfamiliar to others, I added the smiley.


--
James Harris

Bart

Aug 28, 2021, 12:14:24 PM
On 28/08/2021 15:22, James Harris wrote:
> On 27/08/2021 21:07, Bart wrote:
>> On 27/08/2021 16:04, James Harris wrote:

> I have (or had, it was some time ago) some ideas on that. If you'd like
> to start a new topic I'll look out my notes and respond.

It's OK, I'm not planning to do any stack overflow checks. Just wondered
if you had something in mind.


>> So for my example call, that would allow it to be inlined and reduced.
>>
>> (Mine doesn't do inlining; I'd have to write psd() as a macro - a
>> proper one not what C has - then it reduces psd(10,20) to 'mov rax,
>> -300'.)
>
> Constants are easy. What's your full call, execute, return sequence for
>
>   psd(x, y)
>
> and what convention are you using?

For the function itself the code is:

t.psd:
R.a = D10
R.b = D11

lea D0, [R.a+R.b]
mov D1, R.a
sub D1, R.b
imul2 D0, D1

ret

And for the call x:=psd(10,20) it's:

mov D10, 10
mov D11, 20
call t.psd
mov R.x, D0

(Just noticed you asked for psd(x,y); that's not much different: mov
D10, R.x etc, when x and y fit into registers in the caller, otherwise
each is a memory load.)

The call convention is Win64 ABI for x64. The register numbering is
non-standard; here D registers are 64 bits, and organised as:

D0-D2 Volatile
D3-D9 Non-volatiles (must be preserved by callee)
D10..13 1st 4 arguments, also volatile
D14, D15 Frame and stack pointers (aka Dframe, Dstack)

Since PSD is a leaf function, its parameters can be left in the
parameter-passing registers. No non-volatile registers are used, so no
saving is needed. And no local variables, so no stack frame needs to be
created.

In the caller, the 32-byte stack shadow space required by the ABI is
allocated as a 32-byte extension to the stack frame, by entry/exit code
not shown, as this is shared by other calls.


If I turn off the peephole optimiser, which usually does nothing at all
for speeding things up, just makes the code tidier, then the psd body
becomes:

R.a = D10
R.b = D11
mov D0, R.a
add D0, R.b
mov D1, R.a
sub D1, R.b
imul2 D0, D1
ret

And if I disable the optimiser completely (which mainly consists of
allocating some variables to registers) the code turns into:

t.psd:
psd.a = 16
psd.b = 24
push Dframe
mov Dframe, Dstack
sub Dstack, 32
mov [Dframe+16], D10
mov [Dframe+24], D11

mov D0, [Dframe+psd.a]
add D0, [Dframe+psd.b]
mov D1, [Dframe+psd.a]
sub D1, [Dframe+psd.b]
imul2 D0, D1

add Dstack, 32
pop Dframe
ret

This is pretty much the code generated by non-optimising C compilers for the
same target, although C will use a 32-bit 'int' type.
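
For reference, the function behind these listings can be written in C roughly as follows. This is reconstructed from the assembly above (a sum, a difference and a multiply), and it uses 'long long' rather than 'int' so the operands are 64 bits wide like the D registers.

```c
/* psd as reconstructed from the listings: (a + b) * (a - b).
   'long long' is at least 64 bits, matching the 64-bit D registers. */
long long psd(long long a, long long b)
{
    return (a + b) * (a - b);
}
```

With this definition psd(10, 20) is 30 * -10, i.e. the -300 constant mentioned earlier.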


>> But it's tricky to apply to current languages which still seem to be
>> dominated by separate compilation of modules, so we're seeing those
>> complex schemes that David Brown mentioned of specialised object file
>> formats, and special kinds of linkers.
>
> Why not just stop compilation at an earlier stage?


I don't get you. How can you attempt optimising a call to a function in
a module that you haven't yet compiled?

Without that extra info, all you might have is:

extern int psd(int,int);

This is effectively what you have for routines in binary DLLs (which on
Windows, have the extra overhead of being called indirectly: psd(10,20)
calls a local stub function inside a table, which consists of an
indirect jump to the external routine, fixed up when the executable is
loaded).
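
The indirection can be mimicked in plain C with a function pointer, as a rough model only. Everything here (import_psd, fixup_imports, psd_in_dll) is a made-up stand-in: on a real system the table slot is filled in by the loader, not by program code.

```c
typedef int (*psd_fn)(int, int);

/* Stand-in for the routine that really lives in the DLL. */
static int psd_in_dll(int a, int b)
{
    return (a + b) * (a - b);
}

/* One slot of a hypothetical import table; empty until fix-up. */
static psd_fn import_psd = 0;

/* Models what the loader does when the executable is loaded. */
void fixup_imports(void)
{
    import_psd = psd_in_dll;
}

/* The local stub: callers call this directly, and it forwards
   through the table slot to the external routine. */
int psd(int a, int b)
{
    return import_psd(a, b);
}
```

The caller sees only the stub's signature, so nothing about the external routine's register usage is available for optimising the call.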


David Brown

Aug 28, 2021, 12:21:49 PM
Perhaps he means in cases where the leaf function is now inlined?

>
>> Essentially, with whole-program optimisation many of the above steps
>> can be removed, even if the callee is kept separate rather than being
>> inlined.
>
> Whole program optimisation is secondary here. The above is just routine
> optimisation.

Agreed. Compilers routinely skip much of that for small functions,
because they don't need it. They don't save callee-save registers if
they don't need to use any. They don't use a base register at all,
unless there are special requirements (nested functions or closures,
alloca/VLA dynamic stack allocations, etc.).

>
> My compiler barely does any (it stores a handful of locals into
> non-volatile registers, and that's pretty much it; the rest is some
> tidying up), yet it converts your psd() into only 5 x64 instructions.
>
> A call to psd(10,20) is done in 3 instructions.
>
> Whole-program compilers (which make whole-program optimisation easier)
> would just ensure that all function bodies within the program are
> visible at any call-site.
>
> So for my example call, that would allow it to be inlined and reduced.
>

Whole-program optimisation allows more than just inlining. Basically,
it allows all kinds of inter-procedural optimisations to be done across
units. Those include inlining, but also out-lining (when similar bits
of code are combined into a single function to save space), function
cloning (multiple specialisations of a function), constant propagation,
parameter simplification (when all calls to a function have the same
value for a particular parameter, it can be removed from the parameter
list and turned into a local constant), and many other cases.
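
A hand-written before/after pair in C may make parameter simplification clearer. The function names and the value 4 are invented for the example; a real optimiser would do this on its internal representation rather than on source.

```c
/* Before: suppose every call site in the whole program passes scale == 4. */
int scaled_sum(int a, int b, int scale)
{
    return (a + b) * scale;
}

/* After: the parameter is dropped from the list and becomes a local
   constant, so callers no longer need to load a register for it. */
int scaled_sum_simplified(int a, int b)
{
    enum { scale = 4 };
    return (a + b) * scale;
}
```

The transformation is only valid when the compiler can see every call, which is what the whole-program view provides.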

It also allows better static error checking and analysis across units.
For example, it can catch inconsistent definitions of types across the
program.

> (Mine doesn't do inlining; I'd have to write psd() as a macro - a proper
> one not what C has - then it reduces psd(10,20) to 'mov rax, -300'.)
>

If you can implement it (I know it is not easy), inlining is an
optimisation that can greatly improve code efficiency. And it also lets
you write significantly better source code, because you can split
functions more without worrying about the cost.

>
>> Agreed. Code inside and outside of loops would normally be in scope
>> for a conventional optimiser but whole-program optimisation is, AISI,
>> largely about calls and returns.
>
> It can do a bit more. For example, work out that a function [one not
> exported from the program] is never used, or only used in one place, or
> only used with constant arguments. It can do a similar analysis of
> variables and other entities.
>
> But it's tricky to apply to current languages which still seem to be
> dominated by separate compilation of modules, so we're seeing those
> complex schemes that David Brown mentioned of specialised object file
> formats, and special kinds of linkers.
>

As James suggested, the object files are basically just the internal
representation of the compilation before code generation.

It would be possible to make a somewhat simpler linker here that just
combined these object files, and then passed it back to the next stage
of the compiler that handles the inter-procedural optimisations and code
generation.

However, that is not scalable. Much of the complication comes in the
partitioning process to let you divide the task amongst multiple
processes. Even for a single process, naïve IPO algorithms are commonly
quadratic or more (maybe even exponential) in their scaling. And you
have a somewhat iterative process - partitioning the code, linking those
bits, bringing the results together again for more linking, until you
have parts that you can run through code generation and then the final
"traditional" link.

Single-threaded whole-program optimisers have been around for decades
for small-systems embedded targets, handling programs of tens of
thousands of lines. It's only in the last decade that there have been
compilers that will handle tens of millions of lines in sensible time
frames.


David Brown

Aug 28, 2021, 12:21:50 PM
On 28/08/2021 16:22, James Harris wrote:
> On 27/08/2021 21:07, Bart wrote:
>> On 27/08/2021 16:04, James Harris wrote:

>>> and that's without a cost for checking for stack overflow (because it
>>> could be avoided when calling a leaf function).
>>
>> How? Calling a leaf function can still consume stack space, perhaps a
>> lot of it. It would need some extra analysis to show that the
>> call-depth would never get that deep.
>
> I have (or had, it was some time ago) some ideas on that. If you'd like
> to start a new topic I'll look out my notes and respond.
>

I too am curious about your ideas here.

>>
>> But it's tricky to apply to current languages which still seem to be
>> dominated by separate compilation of modules, so we're seeing those
>> complex schemes that David Brown mentioned of specialised object file
>> formats, and special kinds of linkers.
>
> Why not just stop compilation at an earlier stage?
>

That is exactly what the compilers do. They do whatever analysis and
optimisation can be handled locally within the compilation unit, then
dump the entire internal representation to the object file.

Bart

Aug 28, 2021, 2:45:36 PM
On 28/08/2021 17:14, Bart wrote:
> On 28/08/2021 15:22, James Harris wrote:
>> On 27/08/2021 21:07, Bart wrote:
>>> On 27/08/2021 16:04, James Harris wrote:
>
>> I have (or had, it was some time ago) some ideas on that. If you'd
>> like to start a new topic I'll look out my notes and respond.
>
> It's OK, I'm not planning to do any stack overflow checks. Just wondered
> if you had something in mind.
>
>
>>> So for my example call, that would allow it to be inlined and reduced.
>>>
>>> (Mine doesn't do inlining; I'd have to write psd() as a macro - a
>>> proper one not what C has - then it reduces psd(10,20) to 'mov rax,
>>> -300'.)
>>
>> Constants are easy. What's your full call, execute, return sequence for
>>
>>    psd(x, y)
>>
>> and what convention are you using?
>
> For the function itself the code is:
>
>     t.psd:
>           R.a = D10
>           R.b = D11
>
>           lea       D0, [R.a+R.b]
>           mov       D1, R.a
>           sub       D1, R.b
>           imul2     D0, D1
>
>           ret

Here's the code produced by gcc -O3, working with C with 'long long
int', for a function by itself in a module so the compiler doesn't have
any other info:

psd:
lea rax, [rcx+rdx]
sub rcx, rdx
imul rax, rcx
ret

It's one instruction smaller than mine, because it knows that 'b' (R.b)
will not be subsequently needed so no need to copy it to another
register first.

If I have code calling it in the same module, then it will inline it,
but psd stays, unless it's marked 'static', then it disappears.

Andy Walker

Aug 28, 2021, 7:03:36 PM
On 28/08/2021 17:21, David Brown wrote:
> [...] It's only in the last decade that there have been
> compilers that will handle tens of millions of lines in sensible time
> frames.

I'm not sure that compilers /should/ handle tens of millions
of lines, whether or not in sensible time. Not human-written lines,
anyway. Perhaps they should stop somewhere around 10K lines and say
"Error found"; nothing further*. You can be tolerably sure that
indeed there is one.

A project much bigger than that -- say an OS, or a browser,
or a large compiler -- would surely benefit from being broken up
into a steering program together with a number of separate modules
that it invokes.

_____
* Think of it as the computing equivalent of the professor or
politician who says that if something can't be written on
one side of A4, it's not worth reading. Doesn't apply to
novels, but they're fiction. Long reports come with an
executive summary [== steering program].

--
Andy Walker, Nottingham.
Andy's music pages: www.cuboid.me.uk/andy/Music
Composer of the day: www.cuboid.me.uk/andy/Music/Composers/Hause

Bart

Aug 28, 2021, 8:01:28 PM
On 28/08/2021 17:21, David Brown wrote:
> On 27/08/2021 22:07, Bart wrote:

> As James suggested, the object files are basically just the internal
> representation of the compilation before code generation.

Then 'object file' is a complete misnomer. It'll be some sort of
intermediate representation. Eventually you will get to what most will
think of as an object file, containing binary, relocatable machine code
for one module, if you don't go straight to executable.

(The project I'm working on now generates such an intermediate
representation, in my case somewhat further advanced in the process as
it is a form of linear, portable bytecode. In my case also, the file
represents a whole program.)

> It would be possible to make a somewhat simpler linker here that just
> combined these object files, and then passed it back to the next stage
> of the compiler that handles the inter-procedural optimisations and code
> generation.

Then, you might as well not bother writing it out; keep it in memory,
and you have a whole-project compiler.

> However, that is not scalable. Much of the complication comes in the
> partitioning process to let you divide the task amongst multiple
> processes. Even for a single process, naïve IPO algorithms are commonly
> quadratic or more (maybe even exponential) in their scaling. And you
> have a somewhat iterative process - partitioning the code, linking those
> bits, bringing the results together again for more linking, until you
> have parts that you can run through code generation and then the final
> "traditional" link.
>
> Single-threaded whole-program optimisers have been around for decades
> for small-systems embedded targets, handling programs of tens of
> thousands of lines. It's only in the last decade that there have been
> compilers that will handle tens of millions of lines in sensible time
> frames.

A rule of thumb I've sometimes observed is that, for x64 anyway, 1 line
of source code maps to about 10 bytes of binary machine code.

So 10 million lines of code represents a single 100MB program,
approximately.

On my Windows machine very few programs (EXE or DLL files) are that big;
99% of them are under 10MB, and 85% under 1MB, which latter would be
approx 100K lines of code.
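
The arithmetic behind those figures, as a tiny C sketch. The 10 bytes per line is only the rule of thumb quoted above, and the function names are made up for the example.

```c
/* Rule-of-thumb ratio from the text: about 10 bytes of x64 code per line. */
#define BYTES_PER_LINE 10LL

/* Estimated machine-code size for a given number of source lines. */
long long est_code_bytes(long long source_lines)
{
    return source_lines * BYTES_PER_LINE;
}

/* Estimated source size for a given amount of machine code. */
long long est_source_lines(long long code_bytes)
{
    return code_bytes / BYTES_PER_LINE;
}
```

So 10 million lines gives 100,000,000 bytes (about 100MB), and a 1MB binary corresponds to roughly 100K lines.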

But let's go with that 100MB/10Mloc program; it's very unlikely that
something that big will be completely unstructured, just 1000s of
functions each of which can be called from any other.

For a start, it should consist of separate modules. Each module can only
see its local functions, plus whatever ones are visible from imported
modules.

Even an exported function may only be visible to a handful of modules
out of 100s, if the module hierarchy is done properly.

Likely the program will consist of groups of self-contained packages -
groups of modules - with limited interfaces to the rest of the program.

That's a long way of saying that the whole program optimisation problem
isn't as daunting as it might appear.

And it might be faster than you think: on a decent machine, unoptimised
code (or mildly optimised like mine) can probably be generated at
5-10MB/second, using a single core. So there is plenty of capacity to do
interprocedural optimisation without it taking forever.

Further, this is something you might do for production code, after you
already have a working set of source files, so you don't need all those
other analyses.

David Brown

Aug 29, 2021, 5:16:55 AM
On 29/08/2021 01:03, Andy Walker wrote:
> On 28/08/2021 17:21, David Brown wrote:
>> [...]  It's only in the last decade that there have been
>> compilers that will handle tens of millions of lines in sensible time
>> frames.
>
>     I'm not sure that compilers /should/ handle tens of millions
> of lines, whether or not in sensible time.  Not human-written lines,
> anyway.  Perhaps they should stop somewhere around 10K lines and say
> "Error found";  nothing further*.  You can be tolerably sure that
> indeed there is one.

While it is not true of all languages, in C compilations you can
typically have many thousands of lines of code passed to the compiler
before you even get to a single "real" line of the C code - all the
include files are compiled. In an embedded project I am working on, the
headers defining the hardware registers for the microcontroller
constitute some 50,000 lines. Combined with other headers for drivers,
OS, and so on, most of the compiler runs when building the project will
be between perhaps 20,000 and 80,000 lines - even though none of the
modules I write are above a thousand or so lines long.

In C++, it is common practice now to have header-only libraries.
Compiles of many tens of thousands of lines are normal for compilation
units that use some of the bigger parts of the library. Add in, say,
headers for a graphics library like Qt or GTK, and hundred-thousand-line
compiles are quite reasonable.

Take a browser, or a big game, or a productivity suite (like
LibreOffice), with thousands of C++ (or other languages) files, and you
can see the scale of the number of lines compiled in a build.


Yes, there will be bugs - statistically, in a big enough system it is
pretty much guaranteed, regardless of the programming language, or the
way it is compiled or modularised.

There are situations where bugs of any kind are not tolerated (I have
worked with such systems), and situations where limited bugs are
acceptable because the result still does a useful job that is better
than nothing. A glitch in the regulation of an aeroplane flight
controller is unacceptable - a glitch in the movement of a character in
a game is not nearly as critical.


>
>     A project much bigger than that -- say an OS, or a browser,
> or a large compiler -- would surely benefit from being broken up
> into a steering program together with a number of separate modules
> that it invokes.
>

Bigger projects /are/ broken up into modules that are specified,
written, tested and managed separately. It does not matter in the
slightest whether you are talking about parts compiled separately,
linked separately, distributed separately. All that matters is that the
development processes are kept in manageable chunks.

> _____
>   * Think of it as the computing equivalent of the professor or
>     politician who says that if something can't be written on
>     one side of A4, it's not worth reading.  Doesn't apply to
>     novels, but they're fiction.  Long reports come with an
>     executive summary [== steering program].
>

And do you think the details within the long reports have more or fewer
errors if they are printed on separate printers, or on the same printer?

David Brown

Aug 29, 2021, 5:36:21 AM
On 29/08/2021 02:01, Bart wrote:
> On 28/08/2021 17:21, David Brown wrote:
>> On 27/08/2021 22:07, Bart wrote:
>
>> As James suggested, the object files are basically just the internal
>> representation of the compilation before code generation.
>
> Then 'object file' is a complete misnomer.

Yes, that's a fair comment. "Linking" is also a misnomer in link-time
optimisation. The names are historical, rather than technically accurate.

> It'll be some sort of
> intermediate representation. Eventually you will get to what most will
> think of as an object file, containing binary, relocatable machine code
> for one module, if you don't go straight to executable.
>
> (The project I'm working on now generates such an intermediate
> representation, in my case somewhat further advanced in the process as
> it is a form of linear, portable bytecode. In my case also, the file
> represents a whole program.)
>
>> It would be possible to make a somewhat simpler linker here that just
>> combined these object files, and then passed it back to the next stage
>> of the compiler that handles the inter-procedural optimisations and code
>> generation.
>
> Then, you might as well not bother writing it out; keep it in memory,
> and you have a whole-project compiler.

Writing them to files lets you scale better, and split up the work, both
across multiple processes (or even multiple computers), and across time
with incremental builds.

>
>> However, that is not scalable.  Much of the complication comes in the
>> partitioning process to let you divide the task amongst multiple
>> processes.  Even for a single process, naïve IPO algorithms are commonly
>> quadratic or more (maybe even exponential) in their scaling.  And you
>> have a somewhat iterative process - partitioning the code, linking those
>> bits, bringing the results together again for more linking, until you
>> have parts that you can run through code generation and then the final
>> "traditional" link.
>>
>> Single-threaded whole-program optimisers have been around for decades
>> for small-systems embedded targets, handling programs of tens of
>> thousands of lines.  It's only in the last decade that there have been
>> compilers that will handle tens of millions of lines in sensible time
>> frames.
>
> A rule of thumb I've sometimes observed is that, for x64 anyway, 1 line
> of source code maps to about 10 bytes of binary machine code.
>

That would be excluding headers, which often make up the bulk of the
lines compiled in a run of the compiler. And while it might be a rough
guide for C, it is no guide at all for C++.

> So 10 million lines of code represents a single 100MB program,
> approximately.

The biggest single executable I see on my machine (without digging too
hard) is 25 MB. I have also found a shared library at 125 MB. (This is
Linux, however - unlike Windows, the OS does not load entire binaries
into memory. Pages are loaded in when they are needed, or predicted to
be needed.)

I also did not mean to imply that these big builds result in a single
binary - they are often split into multiple "shared" libraries. (I put
"shared" in quotations, because the libraries are typically dedicated to
the program rather than shared by other applications.) This can be
convenient during development, building and testing.

>
> On my Windows machine very few programs (EXE or DLL files) are that big;
> 99% of them are under 10MB, and 85% under 1MB, which latter would be
> approx 100K lines of code.
>
> But let's go with that 100MB/10Mloc program; it's very unlikely that
> something that big will be completely unstructured, just 1000s of
> functions each of which can be called from any other.
>
> For a start, it should consist of separate modules. Each module can only
> see its local functions, plus whatever ones are visible from imported
> modules.
>
> Even an exported function may only be visible to a handful of modules
> out of 100s, if the module hierarchy is done properly.
>
> Likely the program will consist of groups of self-contained packages -
> groups of modules - with limited interfaces to the rest of the program.
>
> That's a long way of saying that the whole program optimisation problem
> isn't as daunting as it might appear.
>

I agree, at least somewhat. But the problem of finding out how to split
things up for optimisation is still daunting.

You always have to find balances with optimisation. Yes, you have the
development of the huge project split up into modules, sub-modules,
etc., in a hierarchy. But one part deep in one branch can still end up
calling code deep in another branch. Doing it via the hierarchy can
mean multiple layers of wrappers, interfaces, cross-DLL boundaries, etc.
LTO could mean inlining it or at least making a single direct function
call. The number of ways each part of the code can interact with other
parts starts quadratic and quickly gets worse when you want to find out
the best way to combine them - optimisation is an NP problem. So you
are always looking for an approximate solution, not an exact one, and
scaling and trade-offs are always difficult.

Dmitry A. Kazakov

Aug 29, 2021, 6:50:20 AM
On 2021-08-29 11:36, David Brown wrote:
> On 29/08/2021 02:01, Bart wrote:

>> So 10 million lines of code represents a single 100MB program,
>> approximately.
>
> The biggest single executable I see on my machine (without digging too
> hard) is 25 MB. I have also found a shared library at 125 MB.

If you use GCC and generic instances put in a shared library, you easily
come to such numbers. GCC generates lots of stuff.

Funny thing, you cannot even build some of such shared libraries under
Windows because the number of exported symbols easily exceeds 2**16-1
(Windows limit). You must split the library into parts...

> I also did not mean to imply that these big builds result in a single
> binary - they are often split into multiple "shared" libraries. (I put
> "shared" in quotations, because the libraries are typically dedicated to
> the program rather than shared by other applications.) This can be
> convenient during development, building and testing.

100-200MB is a medium-sized production application: peripheral devices,
HTTP server, database, cloud connectivity, user management, things start
to explode quickly.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Bart

Aug 29, 2021, 7:04:32 AM
The largest EXE/DLL on my Windows machine is about 170MB. It's a DLL for
a web browser.

Probably it's unlikely to be a monolithic program, as both EXE/DLL can
be used to package multiple components, including data, into one file.

And as I said, 99% of such files on my machine are under 10MB; the vast
majority well under.

BTW what peripheral device needs 200MB of code?

Dmitry A. Kazakov

Aug 29, 2021, 7:34:14 AM
On 2021-08-29 13:04, Bart wrote:

> BTW what peripheral device needs 200MB of code?

Modern protocols are extremely complicated, as are the end devices.
Consider a radiator thermostat. It is a very simple device. Yet it has
a hundred parameters, a dozen modes, a weekly schedule you must be able
to query and program. So you can imagine the complexity of its protocol.
If you are very lucky that would be a vendor-specific protocol. If it is
a "standard" protocol you are in deep trouble. The standard protocols
are gigantic piles of cra*p. You can take a look at AMQP or any of the
ASN.1-based protocols to get an impression. The ASN.1 description of
certificate files is almost comical, if you do not need to implement it.

Worse, you could not throw the useless stuff out, because you must
certify your implementation of the protocol.

On top of that comes the configuration stuff you must address in the GUI
and in the persistent storage. The on-line data you have to handle and
log, and so on. Procedures to replace a defective device, flash the
device's firmware.

Then you have not just one device, you have an array of them, e.g. several
radiator thermostats and a dozen other device types, e.g. shutter
contacts, wall panels, sensors etc.

anti...@math.uni.wroc.pl

Aug 29, 2021, 8:24:02 AM
Bart <b...@freeuk.com> wrote:
> On 28/08/2021 17:21, David Brown wrote:
> > On 27/08/2021 22:07, Bart wrote:
>
> > As James suggested, the object files are basically just the internal
> > representation of the compilation before code generation.
>
> Then 'object file' is a complete misnomer. It'll be some sort of
> intermediate representation. Eventually you will get to what most will
> think of as an object file, containing binary, relocatable machine code
> for one module, if you don't go straight to executable.
>
> (The project I'm working on now generates such an intermediate
> representation, in my case somewhat further advanced in the process as
> it is a form of linear, portable bytecode. In my case also, the file
> represents a whole program.)
>
> > It would be possible to make a somewhat simpler linker here that just
> > combined these object files, and then passed it back to the next stage
> > of the compiler that handles the inter-procedural optimisations and code
> > generation.
>
> Then, you might as well not bother writing it out; keep it in memory,
> and you have a whole-project compiler.
>
> > However, that is not scalable. Much of the complication comes in the
> > partitioning process to let you divide the task amongst multiple
> > processes. Even for a single process, naïve IPO algorithms are commonly
> > quadratic or more (maybe even exponential) in their scaling. And you
> > have a somewhat iterative process - partitioning the code, linking those
> > bits, bringing the results together again for more linking, until you
> > have parts that you can run through code generators and then the final
> > "traditional" link.
> >
> > Single-threaded whole-program optimisers have been around for decades
> > for small-systems embedded targets, handling programs of tens of
> > thousands of lines. It's only in the last decade that there have been
> > compilers that will handle tens of millions of lines in sensible time
> > frames.
>
> A rule of thumb I've sometimes observed is that, for x64 anyway, 1 line
> of source code maps to about 10 bytes of binary machine code.

Depends on the language. For C it may be lower, for some other
languages much higher.

> So 10 million lines of code represents a single 100MB program,
> approximately.

I work on a program whose executable is 64 M. However, a significant
part of the executable code is in loadable modules that take another
64 M. Guess how big the source is?

> On my Windows machine very few programs (EXE or DLL files) are that big;
> 99% of them are under 10MB, and 85% under 1MB, which latter would be
> approx 100K lines of code.
>
> But let's go with that 100MB/10Mloc program; it's very unlikely that
> something that big will be completely unstructured, just 1000s of
> functions each of which can be called from any other.
>
> For a start, it should consist of separate modules. Each module can only
> see its local functions, plus whatever ones are visible from imported
> modules.

Well, "most" higher-level languages are object-oriented. You have
methods for a specific class, but those can be used from places where
the class itself is not visible -- for a method call it is enough that
the interface is visible, which may happen via inheritance from a parent.
Once you inline a method it may call other methods. So you quickly
get loads of code which is there only due to optimization.

> And it might be faster than you think: on a decent machine, unoptimised
> code (or mildly optimised like mine) can probably be generated at
> 5-10MB/second, using a single core. So there is plenty of capacity to do
> interprocedural optimisation without it taking forever.

Well, there is also the issue of memory size. SmartEiffel used (uses???)
whole-program optimization and compiled very fast. But for really
large programs it used to run out of memory. I am not sure if this is
still a problem on modern machines, but a reasonable estimate is that,
keeping all needed info in memory, you may need 1000 times as much memory
as the source takes. So you need to carefully optimize space use...

--
Waldek Hebisch

Bart

Aug 29, 2021, 8:49:57 AM
By my measure, 200MB would equate to (very roughly) 20M lines of code.
If you were to print it out at 60 lines/page, on 80gsm paper, you would
get a printout 100 feet high (30m).

Exactly how complicated is that thermostat again? How tall was the pile
of documents that constitutes the datasheet?

Bart

Aug 29, 2021, 9:06:31 AM
On 29/08/2021 13:24, anti...@math.uni.wroc.pl wrote:
> Bart <b...@freeuk.com> wrote:

>> A rule of thumb I've sometimes observed is that, for x64 anyway, 1 line
>> of source code maps to about 10 bytes of binary machine code.
>
> Depends on the language. For C it may be lower, for some other
> languages much higher.
>
>> So 10 million lines of code represents a single 100MB program,
>> approximately.
>
> I work on a program when executable is 64 M. However, significant
> part of executable code is in loadable modules that take another
> 64 M. Guess how big is the source?

By my metric it would be about 6M lines of source code, if most of the
64MB was executable x64 code (rather than initialised data, embedded
data files, or other exe overheads).

That assumes a certain proportion of declaration lines to lines of
executable code.

Now you're going to tell me it's either a lot fewer or a lot more.

If the language is C, then I guess that could be anything: you can have
macros that expand to many times their size, and are instantiated at
multiple sites; include files that can do the same trick. Or a lot of
boilerplate code that reduces to nothing.

Or there is lots of inlining that pushes the size the other way again.


>> And it might be faster than you think: on a decent machine, unoptimised
>> code (or mildly optimised like mine) can probably be generated at
>> 5-10MB/second, using a single core. So there is plenty of capacity to do
>> interprocedural optimisation without it taking forever.
>
> Well, there is also issue of memory size. SmartEiffel used (uses???)
> whole-program optimization and compiled very fast. But for really
> large program it used to run out of memory. I am not sure if this is
> still problem on modern machines, but reasonable estimate is that keeping
> all needed info in memory you may need 1000 times of memory as for source.
> So you need to carefully optimize space use...

3 compilers of mine I've just tested use memory equivalent to 15x (C
compiler), 20x (Interpreter), and 80x (my systems language) the source size.

But they all use persistent data structures, especially the last which
creates arrays of tokens, a bad idea I've since dropped. All those
include the source itself.

All the memory is recovered on program termination. If it becomes an
issue, then unneeded data structures can be destroyed earlier.

But if we say 40x source size, then a capacity of 8GB means /currently/
being able to deal with source code of something over 10M lines,
depending on code density.

It just means being more resourceful, and reintroducing long-forgotten
techniques of working with memory-limited hardware.

ATM, 10M lines is 200 times the size of my typical projects.
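
As a rough check of that arithmetic, here is a minimal Python sketch. The
40x and 1000x multipliers and the 8GB figure are the ones mentioned in this
thread; the 20 bytes per source line is purely an assumption for
illustration, and the helper name max_source_lines is made up here.

```python
# Back-of-envelope estimate: how many source lines fit if the compiler
# needs `memory_multiplier` bytes of RAM per byte of source text.
# ASSUMPTION (not from the thread): an average source line is ~20 bytes.
GIB = 1024 ** 3


def max_source_lines(ram_bytes, memory_multiplier, bytes_per_source_line):
    """Largest program, in lines, whose compiler data fits in ram_bytes."""
    source_bytes = ram_bytes / memory_multiplier
    return int(source_bytes / bytes_per_source_line)


lines_at_40x = max_source_lines(8 * GIB, 40, 20)      # the 40x estimate
lines_at_1000x = max_source_lines(8 * GIB, 1000, 20)  # the 1000x estimate

print(f"{lines_at_40x:,} lines at 40x")
print(f"{lines_at_1000x:,} lines at 1000x")
```

With denser or sparser source lines the totals shift proportionally, which
is the "depending on code density" caveat.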

Dmitry A. Kazakov

Aug 29, 2021, 9:12:14 AM
On 2021-08-29 14:49, Bart wrote:
> On 29/08/2021 12:34, Dmitry A. Kazakov wrote:
>> On 2021-08-29 13:04, Bart wrote:
>>
>>> BTW what peripheral device needs 200MB of code?
>>
>> Modern protocols are extremely complicated as well as the end devices.
>> Consider a radiator thermostat. It is a very simple device. Yet it has
>> hundred parameters, a dozen of modes, a weekly schedule you must be
>> able to query and program. So you can imagine the complexity of its
>> protocol. If you are very lucky that would be a vendor-specific
>> protocol. If it is a "standard" protocol you are in a deep trouble.
>> The standard protocols are gigantic piles of cra*p. You can take a
>> look on AMQP or any of ASN.1 based protocols  to get an impression.
>> ASN.1 description of certificate files is almost comical, if you do
>> not need to implement it.
>>
>> Worse, you could not throw the useless stuff out, because you must
>> certify your implementation of the protocol.
>>
>> On top of that come configuration stuff you must address in the GUI,
>> in the persistent storage. The on-line data you have to handle and log
>> and so on. Procedures to replace defective device, flash the device's
>> firmware.
>>
>> Then you have not just one device, you have an array of, e.g. several
>> radiator thermostats and a dozen of other device types, e.g. shutter
>> contacts, wall panels, sensors etc.
>>
>
> By my measure, 200MB would equate to (very roughly) 20M lines of code

You must count the language run-time and other system libraries. E.g.
libc is 1.6MB, SQLite3 is 1.3MB, GTK is about 25MB and so on.

> Exactly how complicated is that thermostat again? How tall was the pile
> of documents that constitutes the datasheet?

A datasheet has nothing to do with the technical documentation. Typically
the latter, if it exists, runs to many thousands of pages.

Bart

Aug 29, 2021, 9:38:32 AM
GTK would be statically linked into an application (which I thought you
said was to do with peripherals)?

That doesn't make any sense. So if 50 apps all needed GTK, each would
carry their own copies. And if several are running at the same time,
there will be multiple copies of the code in memory.

(I've just downloaded the GTK runtime, which was rather hard to find.

There are about 100 DLLs totalling 55MB, out of a total installation of
9000 files. So even if statically incorporated into an application, it
would still need a home directory with all the other junk.)

However, suppose 50MB of that 200MB /was/ GTK. It seems GTK itself
already is logically divided into dozens of separate libraries.

This is the point I made some posts ago.

Dmitry A. Kazakov

Aug 29, 2021, 9:57:39 AM
GTK cannot be linked statically.

> That doesn't make any sense. So if 50 apps all needed GTK, each would
> carry their own copies. And if several are running at the same time,
> there will be multiple copies of the code in memory.

You run 50 GUIs at a time? But no, GTK is linked dynamically due to some
licensing decisions, I believe. I do not remember.

> However, suppose 50MB of that 200MB /was/ GTK.

No, it is not only GTK. It was an example showing that 200MB is very
modest given the number of protocols a typical application uses. Each
protocol comes with several libraries, each of which might be 1MB or so.
And as I said, on top of that there are layers of application code
necessary to run the protocol stack, to configure, to store/restore
configurations, to visualize etc.

It seems that you think that a typical application reads from the
keyboard and prints on a printer. That has not been so for many decades,
actually.

> It seems GTK itself
> already is logically divided into dozens of separate libraries.

Yes, it is.

> This is the point I made some posts ago.

Maybe. My comment was that 200MB of code is not that much.

David Brown

Aug 29, 2021, 10:54:34 AM
On 29/08/2021 12:50, Dmitry A. Kazakov wrote:
> On 2021-08-29 11:36, David Brown wrote:
>> On 29/08/2021 02:01, Bart wrote:
>
>>> So 10 million lines of code represents a single 100MB program,
>>> approximately.
>>
>> The biggest single executable I see on my machine (without digging too
>> hard) is 25 MB.  I have also found a shared library at 125 MB.
>
> If you use GCC and generic instances put in a shared library, you easily
> come to such numbers. GCC generates lots of stuff.
>
> Funny thing, you cannot even build some of such shared libraries under
> Windows because the number of exported symbols easily exceeds 2**16-1
> (Windows limit). You must split the library into parts...
>

I didn't know of that limit. I did know that Windows was still limited
by its 16-bit ancestry, but not that specific one.

anti...@math.uni.wroc.pl

Aug 29, 2021, 12:19:16 PM
Bart <b...@freeuk.com> wrote:
> On 29/08/2021 13:24, anti...@math.uni.wroc.pl wrote:
> > Bart <b...@freeuk.com> wrote:
>
> >> A rule of thumb I've sometimes observed is that, for x64 anyway, 1 line
> >> of source code maps to about 10 bytes of binary machine code.
> >
> > Depends on the language. For C it may be lower, for some other
> > languages much higher.
> >
> >> So 10 million lines of code represents a single 100MB program,
> >> approximately.
> >
> > I work on a program when executable is 64 M. However, significant
> > part of executable code is in loadable modules that take another
> > 64 M. Guess how big is the source?
>
> By my metric it would be about 6M lines of source code, if most of the
> 64MB was executable x64 code (rather than initialised data, embedded
> data files, or other exe overheads).
>
> That assumes a certain proportion of declaration lines to lines of
> executable code.
>
> Now you're going to tell me it's either a lot fewer or a lot more.

40 M in the executable is "statically" linked code from outside, probably
corresponding to 0.5M lines of source. 24 M corresponds to about 80 K
lines. 64 M in loadable modules corresponds to 210 K lines (the actual
code is closer to 120 K lines; the rest is comments and empty lines).

It is hard to distinguish between executable code and data. Due
to the semantics, initialized data needs executable code to perform
initialization. There are dispatch tables, and all data and code is
tagged (has identifying headers). There is runtime type info.
OTOH, there is a lot of code due to the compiler aggressively optimizing
for speed at the cost of code size. There is exception handling code
inserted by the compiler.

> If the language is C, then I guess that could be anything: you can have
> macros that expand to many times their size, and instantiated at
> multiple sites; include files that can do the same trick. Or lot of
> boilerplate code that reduces to nothing.
>
> Or there is lots of inlining that pushes the size the other way again.

The compiler may compile the same code multiple times, each time with
different assumptions about types (effectively producing several
specialized variants from the same code).

> > Well, there is also issue of memory size. SmartEiffel used (uses???)
> > whole-program optimization and compiled very fast. But for really
> > large program it used to run out of memory. I am not sure if this is
> > still problem on modern machines, but reasonable estimate is that keeping
> > all needed info in memory you may need 1000 times of memory as for source.
> > So you need to carefully optimize space use...
>
> 3 compilers of mine I've just tested use memory equivalent to 15x (C
> compiler), 20x (Interpreter), and 80x (my systems language) the source size.
>
> But they all use persistent data structures, especially the last which
> creates arrays of tokens, a bad idea I've since dropped. All those
> include the source itself.

ATM I have to keep the parse tree of a large part of the program in memory.
The parse tree is about 8 times larger than the corresponding source.
The representation of the parse tree is unoptimized and in principle
a packed representation could be smaller. OTOH this is just the parse
tree, without any extra data like types or source locations.
Once the compiler collects enough data to do interesting optimizations,
the data structures may be much larger...

> All the memory is recovered on program termination. If it becomes an
> issue, then unneeded data structures can be destroyed earlier.
>
> But if we say 40x source size, then capacity of 8GB means /currently/
> being able to deal with source code of something over 10M lines,
> depending on code density.
>
> It just means being more resourceful, and reintroducing long-forgotten
> techniques of working with memory-limited hardware.
>
> ATM, 10M lines is 200 times the size of my typical projects.

I deal with code written by other folks. And I like generating
code. You may easily end up with quite a large amount of code
to compile.

--
Waldek Hebisch

Bart

Aug 29, 2021, 2:12:20 PM
On 29/08/2021 17:19, anti...@math.uni.wroc.pl wrote:
> Bart <b...@freeuk.com> wrote:

>>> I work on a program when executable is 64 M. However, significant
>>> part of executable code is in loadable modules that take another
>>> 64 M. Guess how big is the source?
>>
>> By my metric it would be about 6M lines of source code, if most of the
>> 64MB was executable x64 code (rather than initialised data, embedded
>> data files, or other exe overheads).
>>
>> That assumes a certain proportion of declaration lines to lines of
>> executable code.
>>
>> Now you're going to tell me it's either a lot fewer or a lot more.
>
> 40 M in executable is "statically" linked code from outside, probably
> corresponding to 0.5M lines of source. 24 M corresponds to about 80 K
> lines. 64 M in loadable modules corresponds to 210 K lines (actual
> code lines is closer to 120 K, rest is comments and empty lines).

Those are some very large ratios between code lines and bytes of output,
some 80:1, 300:1 and (assuming 150K for /some/ blank lines and
comments), about 400:1.

The largest I've come across is 2500:1, for a program (not mine) with
some very deeply nested macros.

It makes it harder to get an idea of the true complexity of a 1MB
program for example; would it be 100K lines (my 10:1 code), or 2.5K
lines (your 400:1 code), or something between the two?

But I think that even C code is typically more like mine than yours. If
I take the 230Kloc file sqlite3.c, which is very comment-heavy, and
strip the comments but leaving blank lines, then I get 170Kloc.

I compile that to a 1.1MB object file, which is between 6:1 and 7:1
bytes per line of source.

If I take one of my 740KLoc benchmark programs (fannkuch() repeated
10,000 times), I get executables of 6MB to 8MB, so bytes:lines ratios of
8:1 to 11:1 (optimising on/off).

If you applied that 400:1 ratio to the 10Mloc programs David was talking
about, then you'd end up with 4GB of code per 10Mloc. My 40Kloc compiler
would be 16MB in size instead of 0.4MB!

So I'd say that your programs are rather atypical.
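
The ratios quoted in this post can be reproduced with a few lines of
Python. This is only a sketch: it assumes 1MB means 10^6 bytes, uses the
150K line count assumed above for the loadable modules, and the labels and
the helper name bytes_per_line are just descriptive choices made here.

```python
# Bytes of generated code per line of source, using figures from the thread.
# ASSUMPTION: "MB" / "M" is taken as 10**6 bytes throughout.
MB = 1_000_000


def bytes_per_line(output_bytes, source_lines):
    """Ratio of output size to source line count."""
    return output_bytes / source_lines


samples = {
    "sqlite3.c, comments stripped": (1.1 * MB, 170_000),
    "fannkuch benchmark, low end": (6 * MB, 740_000),
    "fannkuch benchmark, high end": (8 * MB, 740_000),
    "statically linked outside code": (40 * MB, 500_000),
    "main executable": (24 * MB, 80_000),
    "loadable modules": (64 * MB, 150_000),
}

ratios = {name: bytes_per_line(size, lines)
          for name, (size, lines) in samples.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} bytes/line")
```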


> It is hard to distinguish between executable code and data. Due
> to semantics initialized data needs executable code to perform
> initialization. There are dispatch tables, all data and code is
> tagged (has identifying headers).

That sounds more like my interpreted languages. If I take that same
740Kloc benchmark, which is 670Kloc in this language, it uses 30MB of
64-bit bytecode, so 45:1 here, ignoring all other requirements.

> ATM I have to keep parse tree of large part of program in memory.
> The parse tree is about 8 times larger than corresponding source.

I think only 8 times larger is pretty good. Although it does depend on
whether you like long or short identifiers...

David Brown

Aug 30, 2021, 4:13:34 AM
I don't imagine anyone is going to want to write "pass\acute/e" as an
identifier in any language. And the last thing anyone needs is another
way to write that kind of thing.

There are, I think, only two sensible options here:

1. Disallow any identifier letters outside of ASCII.
2. Make everything UTF-8.


If you desperately want to allow some way to write non-ASCII characters
without UTF-8, then please do not invent your own new way to do it.
There are more than enough standards here already - use HTML/XML names,
or Unicode descriptions.

Dmitry A. Kazakov

Aug 30, 2021, 4:38:21 AM
On 2021-08-30 10:13, David Brown wrote:

> There are, I think, only two sensible options here:
>
> 1. Disallow any identifier letters outside of ASCII.
> 2. Make everything UTF-8.

Yes. Though people preferring #2 are usually English speakers who are
not really aware of the consequences. Like having E, Ε, Е three
different identifiers. One could try to maintain language-defined
homographs in order to prevent the mess, introducing an even bigger mess...
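
To make the homograph point concrete, a small Python sketch using the
standard unicodedata module. It only looks at the three letters mentioned
above and shows that they are distinct code points which compatibility
normalisation does not merge; a real language would need some additional
confusables policy, which is not attempted here.

```python
import unicodedata

# Three visually near-identical capital letters from different scripts.
lookalikes = ["E", "\u0395", "\u0415"]  # Latin, Greek, Cyrillic

for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFKC normalisation leaves them as three separate characters, so a
# compiler comparing normalised identifiers would still see three names.
normalised = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalised), "distinct identifiers after NFKC")
```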

David Brown

Aug 30, 2021, 5:50:50 AM
On 30/08/2021 10:38, Dmitry A. Kazakov wrote:
> On 2021-08-30 10:13, David Brown wrote:
>
>> There are, I think, only two sensible options here:
>>
>> 1. Disallow any identifier letters outside of ASCII.
>> 2. Make everything UTF-8.
>
> Yes. Though people preferring #2 are usually English speakers who are
> not really aware of the consequences. Like having E, Ε, Е three
> different identifiers. One could try to maintain language-defined
> homographs in order to prevent mess, introducing even bigger mess...
>

I'm an English speaker, and a Norwegian speaker (we have three extra
letters, åøæ). And I am well aware of the potential complication of
different Unicode code points with very similar (or even identical) glyphs.

It can also be difficult for people to type, which can quickly be a pain
for collaboration. How would you type "bøk", for example? That's
"book" in Norwegian, and I have a key labelled "ø". James, on Linux,
can use compose + / + o to get the letter. But for you on Windows, with
a German keyboard layout (I'm guessing from your email address), I
expect you are stuck with copy-and-paste from my post, or using the
"character map" utility, or typing "alt+0248".

Then there is the question of displaying the characters. I have a font
that includes vast numbers of obscure symbols, so I could use ↀ for the
Roman numeral for 1000 (using the traditional symbol, rather than the
modern replacement of M). Other people reading this might not see it.

All in all, non-ASCII letters in identifiers can pose a lot of
challenges. But they are nonetheless important for people around the
world, and despite the disadvantages, UTF-8 is far and away the best
choice. You simply have to trust programmers to be sensible in their
usage. (You need to do that anyway, even with ASCII - in many fonts, l,
1 and I can be hard to distinguish, as can O and 0.)
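
On the typing point earlier in this post: the alt+0248 sequence works
because ø is byte 248 in the Windows-1252 code page, which happens to
coincide with its Unicode code point U+00F8. A small Python check,
assuming a Western-European code page (other ANSI code pages map 248
differently):

```python
ch = "ø"

code_point = ord(ch)           # Unicode code point of the letter
cp1252 = ch.encode("cp1252")   # single byte in the Windows-1252 code page
utf8 = ch.encode("utf-8")      # two bytes in UTF-8

print(code_point, cp1252, utf8)
```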

Dmitry A. Kazakov

Aug 30, 2021, 7:37:07 AM
On 2021-08-30 11:50, David Brown wrote:
> On 30/08/2021 10:38, Dmitry A. Kazakov wrote:
>> On 2021-08-30 10:13, David Brown wrote:
>>
>>> There are, I think, only two sensible options here:
>>>
>>> 1. Disallow any identifier letters outside of ASCII.
>>> 2. Make everything UTF-8.
>>
>> Yes. Though people preferring #2 are usually English speakers who are
>> not really aware of the consequences. Like having E, Ε, Е three
>> different identifiers. One could try to maintain language-defined
>> homographs in order to prevent mess, introducing even bigger mess...
>
> I'm an English speaker, and a Norwegian speaker (we have three extra
> letters, åøæ). And I am well aware of the potential complication of
> different Unicode code points with very similar (or even identical) glyphs.
>
> It can also be difficult for people to type, which can quickly be a pain
> for collaboration. How would you type "bøk", for example? That's
> "book" in Norwegian, and I have a key labelled "ø". James, on Linux,
> can use compose + / + o to get the letter. But for you on Windows, with
> a German keyboard layout (I'm guessing from your email address), I
> expect you are stuck with copy-and-paste from my post, or using the
> "character map" utility, or typing "alt+0248".

Right, character map is what I use.

Germans have it easy: you can drop the diacritical marks (ä=ae, ö=oe,
ü=ue) and the ligature SZ (ß=ss).
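
That substitution is mechanical enough to sketch in a few lines of Python.
The lower-case mappings are the ones listed above; the upper-case ones and
the function name transliterate_german are assumptions added here for
illustration.

```python
# ASCII transliteration of German umlauts and sharp s.
# Lower-case rules are from the post; upper-case rules are an assumption.
GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text):
    """Replace German non-ASCII letters by their usual ASCII spellings."""
    return text.translate(GERMAN_ASCII)


print(transliterate_german("Größe"))
```

Note that the mapping is not reversible: "ss" and "ue" also occur in words
that never had ß or ü.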

> Then there is the question of displaying the characters. I have a font
> that includes vast numbers of obscure symbols, so I could use ↀ for the
> Roman numeral for 1000 (using the traditional symbol, rather than the
> modern replacement of M). Other people reading this might not see it.

It is a lesser problem now than it was before. I remember the time when
Windows was unable to display most special symbols.

> All in all, non-ASCII letters in identifiers can pose a lot of
> challenges. But they are nonetheless important for people around the
> world, and despite the disadvantages, UTF-8 is far and away the best
> choice. You simply have to trust programmers to be sensible in their
> usage. (You need to do that anyway, even with ASCII - in many fonts, l,
> 1 and I can be hard to distinguish, as can O and 0.)

Actually, this is again sort of Europocentric POV. In reality, if you
have a truly international team with speakers outside Western Europe,
you must agree on some strict rules regarding comments and identifiers.

You might be able to remember a German or even a Czech word. Cyrillic
would be rather more challenging. But what would you do with Armenian or
Chinese?

And the least common denominator is English.

James Harris

Aug 30, 2021, 8:04:30 AM
On 30/08/2021 09:13, David Brown wrote:
> On 28/08/2021 16:35, James Harris wrote:
>> On 24/08/2021 21:56, David Brown wrote:
>>> On 24/08/2021 21:06, Bart wrote:
>>>> On 24/08/2021 18:25, James Harris wrote:
>>>>
>>>>> These days why use calling conventions at all? Perhaps they are only
>>>>> needed for when there's complete ignorance of the callee. The
>>>>> traditional concept of calling conventions may be pass\acute/e. ;-)
>>>
>>> James, aren't you using Linux?  The compose key makes it easy to write
>>> letters like é - it's just compose, ´, e - "passé".  (It's even easier
>>> if you have a non-English keyboard layout, in Windows or Linux, as these
>>> usually have "dead keys" for accents.)
>>
>> Thanks, I've now enabled the compose key though I wrote passé in the way
>> I did as it's the way I am thinking of for my language - which, as it
>> was unfamiliar to others was why I added the smiley.
>>
>
> I don't imagine anyone is going to want to write "pass\acute/e" as an
> identifier in any language.

It's for string literals!

IMO programs and identifiers should use ascii, even in non-English
languages.


--
James Harris

David Brown

Aug 30, 2021, 2:13:12 PM
You can do that too in Norwegian (though people are not always
consistent about their choices of transliteration), if you can't use the
proper letters (you can also substitute the Swedish versions). But the
preference is to use the correct letters.

>> Then there is the question of displaying the characters.  I have a font
>> that includes vast numbers of obscure symbols, so I could use ↀ for the
>> Roman numeral for 1000 (using the traditional symbol, rather than the
>> modern replacement of M).  Other people reading this might not see it.
>
> It is a lesser problem now than it was before. I remember the time
> Windows was unable to display most of special symbols.

Slowly, in some ways, Windows has been catching up with the *nix world.

>
>> All in all, non-ASCII letters in identifiers can pose a lot of
>> challenges.  But they are nonetheless important for people around the
>> world, and despite the disadvantages, UTF-8 is far and away the best
>> choice.  You simply have to trust programmers to be sensible in their
>> usage.  (You need to do that anyway, even with ASCII - in many fonts, l,
>> 1 and I can be hard to distinguish, as can O and 0.)
>
> Actually, this is again sort of Europocentric POV. In reality, if you
> have a truly international team with speakers outside Western Europe,
> you must agree on some strict rules regarding comments and identifiers.
>

If you have an international team, then it is standard practice to keep
everything in English. But most teams are not international. Why
should a group of Greek or Japanese programmers be forced to write
everything in a foreign language? You can view the keywords as fixed -
almost like symbols, rather than words - but they may prefer to have
other parts written in their own language.

> You might be able to remember a German or even a Czech word. Cyrillic
> would be rather more challenging. But what would you do with Armenian or
> Chinese?
>
> And the least common denominator is English.
>

It is the least common denominator for most international groups, but
not for most national teams.

David Brown

Aug 30, 2021, 2:16:22 PM
See the rest of the thread for a discussion on non-ASCII identifiers.
(I am not suggesting that you implement them, or don't implement them -
that's your choice. Some languages go one way, others go the other way.)

But don't make up your own language for special characters in strings or
comments. Again, UTF-8 is far and away the best option. If you feel
that is a problem, then at least stick to an existing standard -
HTML/XML character entities would almost certainly be the most
convenient choice: "pass&eacute;".
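
For what it's worth, decoders for that notation already exist in most
standard libraries, which is part of the appeal. A minimal Python sketch
using the standard html module (how a compiler for a new language would
hook this into its string-literal lexer is left open):

```python
import html

# Named entity and numeric character reference for the same letter.
named = html.unescape("pass&eacute;")
numeric = html.unescape("pass&#233;")

print(named, numeric)
```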

Dmitry A. Kazakov

Aug 30, 2021, 3:10:53 PM
On 2021-08-30 20:13, David Brown wrote:
> On 30/08/2021 13:37, Dmitry A. Kazakov wrote:

>> It is a lesser problem now than it was before. I remember the time
>> Windows was unable to display most of special symbols.
>
> Slowly, in some ways, Windows has been catching up with the *nix world.

I must defend Windows. Linux adopted UTF-8 very late. I well remember
the mess it had with 8-bit code pages.

BTW, there still exist file utilities to check filenames in Linux. I had
an old filesystem with some file names in German encoded in Latin-1. It
was connected to a FreeNAS (BSD-based). These files caused mysterious
FreeNAS crashes when a remote host tried to browse files over a network
share. Once I fixed the names it almost stopped crashing. I ditched
FreeNAS anyway in favor of Ubuntu.
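
The kind of check meant here can be sketched in Python. This is not any
particular utility, just an illustration: the function names are made up,
and it assumes the legacy names really are Latin-1. On a real system the
raw names could come from os.listdir(b"."), which returns bytes when
given a bytes path.

```python
def non_utf8_names(raw_names):
    """Return the byte-string file names that are not valid UTF-8."""
    bad = []
    for raw in raw_names:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


def latin1_to_utf8(raw):
    """Re-encode a name, ASSUMING its bytes are Latin-1."""
    return raw.decode("latin-1").encode("utf-8")


# Sample names: one Latin-1, one UTF-8, one plain ASCII.
names = ["Größe.txt".encode("latin-1"),
         "Größe.txt".encode("utf-8"),
         b"plain.txt"]

suspect = non_utf8_names(names)
fixed = [latin1_to_utf8(raw) for raw in suspect]
print(suspect, fixed)
```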

David Brown

Aug 30, 2021, 3:18:15 PM
On 30/08/2021 21:10, Dmitry A. Kazakov wrote:
> On 2021-08-30 20:13, David Brown wrote:
>> On 30/08/2021 13:37, Dmitry A. Kazakov wrote:
>
>>> It is a lesser problem now than it was before. I remember the time
>>> Windows was unable to display most of special symbols.
>>
>> Slowly, in some ways, Windows has been catching up with the *nix world.
>
> I must defend Windows. Linux adopted UTF-8 very late. I well remember
> the mess it had with 8-bit code pages.

Windows also had a mess with 8-bit code pages.

Windows /was/ earlier with Unicode, that's true - unfortunately, they
picked UCS-2 and then got stuck with that instead of UTF-8. Linux
picked UTF-8 by laziness, as pretty much everything involving strings
(except displaying them) just works as before. There is no need to
re-invent everything in a 16-bit manner, as Windows did, and there are
no problems when it turns out 16 bits are not enough.
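
Both halves of that claim are easy to demonstrate. A short Python sketch
(the musical clef is just an arbitrary example of a character outside the
first 65536 code points):

```python
# 1. ASCII text is byte-for-byte identical in UTF-8, so byte-oriented
#    string code keeps working unchanged.
ascii_text = "hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. A character above U+FFFF does not fit in one 16-bit unit: UTF-16
#    needs a surrogate pair, while UTF-8 simply uses a longer sequence.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
utf8_bytes = len(clef.encode("utf-8"))

print(same_bytes, utf16_units, utf8_bytes)
```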

>
> BTW, there still exist file utilities to check filenames in Linux. I had
> an old filesystem with some file names in German encoded in Latin-1. It
> was connected to a FreeNAS (BSD-based). These files caused mysterious
> FreeNAS crashes when a remote host tried to browse files over a network
> share. Once I fixed the names it almost stopped crashing. I ditched
> FreeNAS anyway in favor of Ubuntu.
>

FreeNAS is BSD, which is not Linux. Not that BSD has any problems with
non-ASCII filenames either. An application might be made ASCII only,
however, regardless of the system.

Bart

Aug 30, 2021, 3:36:33 PM
If they are using a mainstream language, then it's about more than using
Unicode in identifiers:

* Keywords are likely to be in English still

* Standard type names will be English-based (and, in C, codes like %ll
and -LL and INT_MAX)

* The function names in the standard library will probably be English-based

* Compiler option names may be English based (eg. --version)

* Error messages from the compiler may be in English (I don't know how
internationalised such programs are)

* Most of the exported functions and enums of general-purpose libraries
are likely to be in English (eg. SDL_BUTTON_LEFT)

So I'd say it's hard to get away from English even if they wanted.

But string literals and comments in source code: they can be anything;
the language just needs to allow UTF8.

Dmitry A. Kazakov

Aug 30, 2021, 3:37:29 PM
On 2021-08-30 21:18, David Brown wrote:
> On 30/08/2021 21:10, Dmitry A. Kazakov wrote:
>> On 2021-08-30 20:13, David Brown wrote:
>>> On 30/08/2021 13:37, Dmitry A. Kazakov wrote:
>>
>>>> It is a lesser problem now than it was before. I remember the time
>>>> Windows was unable to display most of special symbols.
>>>
>>> Slowly, in some ways, Windows has been catching up with the *nix world.
>>
>> I must defend Windows. Linux adopted UTF-8 very late. I well remember
>> the mess it had with 8-bit code pages.
>
> Windows also had a mess with 8-bit code pages.

Oh, yes.

If I correctly remember, you needed "professional" rather than "home" in
order to switch the system default.

> Windows /was/ earlier with Unicode, that's true - unfortunately, they
> picked UCS-2 and then got stuck with that instead of UTF-8.

Worse, later they changed UCS-2 to UTF-16 under the rug. All system
calls that take strings are duplicated: one ANSI (code page) A-call,
another UTF-16 W-call.

> Linux
> picked UTF-8 by laziness, as pretty much everything involving strings
> (except displaying them) just works as before. There is no need to
> re-invent everything in a 16-bit manner, as Windows did, and there are
> no problems when it turns out 16 bits are not enough.

It is UTF-16 now. But of course, UTF-16 is a monstrosity compared with
UTF-8. Fortunately third party libraries ignore the mess. E.g. GTK port
for Windows converts all filenames to UTF-8.

James Harris

Aug 30, 2021, 3:53:01 PM
On 30/08/2021 19:16, David Brown wrote:
> On 30/08/2021 14:04, James Harris wrote:
>> On 30/08/2021 09:13, David Brown wrote:

...

>>> I don't imagine anyone is going to want to write "pass\acute/e" as an
>>> identifier in any language.
>>
>> It's for string literals!
>>
>> IMO programs and identifiers should use ascii, even in non-English
>> languages.
>>
>
> See the rest of the thread for a discussion on non-ASCII identifiers.
> (I am not suggesting that you implement them, or don't implement them -
> that's your choice. Some languages go one way, others go the other way.)
>
> But don't make up your own language for special characters in strings or
> comments. Again, UTF-8 is far and away the best option. If you feel
> that is a problem, then at least stick to an existing standard -
> HTML/XML character entities would almost certainly be the most
> convenient choice: "pass&eacute;".

Any UTF is no good for source code - e.g. for reasons Dmitry mentioned.
In addition, characters which people cannot identify or recognise should
not be part of source code because they make it unreadable.

I am considering allowing external identifier names to include unusual
characters so as to link with routines which use such characters - but
the programmer would have to write the identifiers in ascii characters.

I doubt I'd use HTML entities as they are a mess (e.g. having multiple
names for the same character) but I would need the names to come from an
online database.


--
James Harris

James Harris

Aug 30, 2021, 3:59:07 PM
On 29/08/2021 10:36, David Brown wrote:
> On 29/08/2021 02:01, Bart wrote:
>> On 28/08/2021 17:21, David Brown wrote:
>>> On 27/08/2021 22:07, Bart wrote:
>>
>>> As James suggested, the object files are basically just the internal
>>> representation of the compilation before code generation.
>>
>> Then 'object file' is a complete misnomer.
>
> Yes, that's a fair comment. "Linking" is also a misnomer in link-time
> optimisation. The names are historical, rather than technically accurate.

This is a first: three of us in agreement!

In my outline design the IR does a lot of the heavy lifting, including
being the preferred form for distributing software.


--
James Harris

David Brown

Aug 31, 2021, 3:36:53 AM
My understanding (which may be wrong, as I don't do much Windows
programming) is that there is a gradual move to UTF-8 support in
Windows. These things take time of course, and while there is no doubt
that Microsoft backed the wrong horse here with 16-bit encodings, they
made the right choice at the time. I blame MS for a lot of bad things,
but not this one! And they are not alone - Java, QT and Python are
other big players that picked UCS-2, leading to much regret and slow
progress towards a changeover to UTF-8.

Dmitry A. Kazakov

Aug 31, 2021, 5:33:57 AM
On 2021-08-31 09:36, David Brown wrote:

> My understanding (which may be wrong, as I don't do much Windows
> programming) is that there is a gradual move to UTF-8 support in
> Windows.

I think you are right. Actually they could proclaim A-calls UTF-8 as
they did with W-calls. That would break some legacy code, only French
will be annoyed. Germans will be apathetic, small European countries
resigned, I guess...

> These things take time of course, and while there is no doubt
> that Microsoft backed the wrong horse here with 16-bit encodings, they
> made the right choice at the time.

> I blame MS for a lot of bad things,
> but not this one! And they are not alone - Java, QT and Python are
> other big players that picked UCS-2, leading to much regret and slow
> progress towards a changeover to UTF-8.

I believe that UTF-8 was introduced later. It is impossible that
everybody was wrong. E.g. Ada also adopted UCS-2 in 1995. Later on Ada
added UCS-4. Just the same mess as with Windows, alas. But most Ada
programmers ignore UCS-2/4 and use UTF-8 where the standard mandates
Latin-1.

David Brown

Aug 31, 2021, 7:06:20 AM
On 31/08/2021 11:33, Dmitry A. Kazakov wrote:
> On 2021-08-31 09:36, David Brown wrote:
>
>> My understanding (which may be wrong, as I don't do much Windows
>> programming) is that there is a gradual move to UTF-8 support in
>> Windows.
>
> I think you are right. Actually they could proclaim A-calls UTF-8 as
> they did with W-calls. That would break some legacy code, only French
> will be annoyed. Germans will be apathetic, small European countries
> resigned, I guess...

You are just listing the advantages :-)

>
>> These things take time of course, and while there is no doubt
>> that Microsoft backed the wrong horse here with 16-bit encodings, they
>> made the right choice at the time.
>
>> I blame MS for a lot of bad things,
>> but not this one!  And they are not alone - Java, QT and Python are
>> other big players that picked UCS-2, leading to much regret and slow
>> progress towards a changeover to UTF-8.
>
> I believe that UTF-8 was introduced later.

Yes. Unicode was first conceived as 16-bit, with UCS-2. Then they
started extending it beyond 16-bit, and had to make UCS-4. UTF-16 was
developed as a way to access the rest of the characters with 16-bit code
units, and then I think UTF-8 came after that. (UTF-32 is the same as
UCS-4.)

> It is impossible that
> everybody was wrong.

They were not wrong at the time - it was later changes that made them
wrong. It is a sometimes unfortunate fact of life that backwards
compatibility is king, and it's hard to undo decisions even when we know
things could have been better. (That's why x86 is popular, despite
being an appallingly bad architecture, it's why we have Windows, it's
why we have qwerty keyboards, it's why we all use English with its silly
inconsistent spelling.)

James Harris

Aug 31, 2021, 2:37:37 PM
On 30/08/2021 19:13, David Brown wrote:
> On 30/08/2021 13:37, Dmitry A. Kazakov wrote:
>> On 2021-08-30 11:50, David Brown wrote:

...

>>> All in all, non-ASCII letters in identifiers can pose a lot of
>>> challenges.  But they are nonetheless important for people around the
>>> world, and despite the disadvantages, UTF-8 is far and away the best
>>> choice.  You simply have to trust programmers to be sensible in their
>>> usage.  (You need to do that anyway, even with ASCII - in many fonts, l,
>>> 1 and I can be hard to distinguish, as can O and 0.)
>>
>> Actually, this is again sort of Europocentric POV. In reality, if you
>> have a truly international team with speakers outside Western Europe,
>> you must agree on some strict rules regarding comments and identifiers.
>>
>
> If you have an international team, then it is standard practice to keep
> everything in English. But most teams are not international. Why
> should a group of Greek or Japanese programmers be forced to write
> everything in a foreign language? You can view the keywords as fixed -
> almost like symbols, rather than words - but they may prefer to have
> other parts written in their own language.

AISI: Have the master copy of /all/ programs in American English, and
support translation of identifier names, comments, string literals etc
to other languages.


--
James Harris

David Brown

Sep 1, 2021, 4:16:15 AM
Why would anyone choose the dialect of one particular ex-colony, rather
than using /real/ English?

I know that in the USA it is common to think that America is the only
country, or at least the only one worth considering, but the rest of the
world begs to differ.

anti...@math.uni.wroc.pl

Sep 1, 2021, 11:18:26 AM
David Brown <david...@hesbynett.no> wrote:
> On 31/08/2021 11:33, Dmitry A. Kazakov wrote:
> > On 2021-08-31 09:36, David Brown wrote:
> >
> >> My understanding (which may be wrong, as I don't do much Windows
> >> programming) is that there is a gradual move to UTF-8 support in
> >> Windows.
> >
> > I think you are right. Actually they could proclaim A-calls UTF-8 as
> > they did with W-calls. That would break some legacy code, only French
> > will be annoyed. Germans will be apathetic, small European countries
> > resigned, I guess...
>
> You are just listing the advantages :-)
>
> >
> >> These things take time of course, and while there is no doubt
> >> that Microsoft backed the wrong horse here with 16-bit encodings, they
> >> made the right choice at the time.
> >
> >> I blame MS for a lot of bad things,
> >> but not this one! And they are not alone - Java, QT and Python are
> >> other big players that picked UCS-2, leading to much regret and slow
> >> progress towards a changeover to UTF-8.
> >
> > I believe that UTF-8 was introduced later.
>
> Yes. Unicode was first conceived as 16-bit, with UCS-2. Then they
> started extending it beyond 16-bit, and had to make UCS-4. UTF-16 was
> developed as a way to access the rest of the characters with 16-bit code
> units, and then I think UTF-8 came after that. (UTF-32 is the same as
> UCS-4.)

Well, there was an insane ISO proposal, which was then partially
unified with Unicode: ISO had 31-bit characters, with the first
2^16 codes (BMP) identical to Unicode. At that time ISO proposed
their own 8-bit transportation format. Around this time UTF-8 was
born, as a simpler alternative to the ISO format. Later, ISO
agreed to limit characters to about 20 bits, Unicode agreed to expand
to match and UTF-16 was born. So, in fact UTF-8 came first
and UTF-16 later. Of course, 16-bit Unicode was before UTF-8.
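
To make the "about 20 bits" point concrete, here is a small Python sketch of how UTF-16 reaches code points beyond the 16-bit BMP: the value above 0x10000 is split into two 10-bit halves and carried in a pair of reserved 16-bit units (a surrogate pair). The function name utf16_code_units is made up for the example, and the three sample code points are arbitrary.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # Inside the BMP: one unit, identical to the old UCS-2 value.
        return [code_point]
    # Beyond the BMP: 20 bits of offset, split into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

for cp in (0x00F8, 0x20AC, 0x1F600):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

Software that still treats each 16-bit unit as a whole character, as UCS-2 code did, sees the last case as two separate values.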

--
Waldek Hebisch

James Harris

Sep 6, 2021, 2:30:15 PM
On 30/08/2021 10:50, David Brown wrote:
> On 30/08/2021 10:38, Dmitry A. Kazakov wrote:
>> On 2021-08-30 10:13, David Brown wrote:
>>
>>> There are, I think, only two sensible options here:
>>>
>>> 1. Disallow any identifier letters outside of ASCII.
>>> 2. Make everything UTF-8.

IMO there's a better option:

3. Use purely ASCII but allow escape sequences to be coded in ASCII.

...

> It can also be difficult for people to type, which can quickly be a pain
> for collaboration. How would you type "bøk", for example?

I could allow that to be used in string literals with something like

"b\slash:o/k"

As well as string literals it is unlikely but possible that a program
written in my language would have to call a function from another
language which has been written in Norwegian where the function name
included a non-ASCII character. For that, I am considering allowing

\slash:o/

and similar to appear in the name of external functions. It would be
ugly but clear. And programmers could limit the ugliness to one place by
defining an alias as in

namedef book = b\slash:o/k

book()

Wouldn't that be better than either pure ASCII or allowing Unicode?

Have I got all bases covered? I hope so!
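
For what it is worth, one way a compiler front end might expand such escapes is sketched below in Python. Everything in it is an assumption made for illustration: the ESCAPES table, its two entries, the expand function, and the choice to accept only the \name:c/ form shown in this post. A real implementation would presumably take its names from a published character database rather than a hand-written table.

```python
import re

# Hypothetical table: (modifier name, base character) -> Unicode character.
ESCAPES = {
    ("slash", "o"): "\u00f8",   # o with stroke
    ("acute", "e"): "\u00e9",   # e with acute accent
}

# Matches a backslash, a lower-case name, a colon, one character, a slash.
_ESCAPE = re.compile(r"\\([a-z]+):(.)/")

def expand(text):
    """Replace every \\name:c/ escape in text with the character it names."""
    def replace(match):
        key = (match.group(1), match.group(2))
        if key not in ESCAPES:
            raise ValueError(f"unknown escape: {match.group(0)}")
        return ESCAPES[key]
    return _ESCAPE.sub(replace, text)

# The source text stays pure ASCII; only the expanded result does not.
source = r"b\slash:o/k"
assert source.isascii()
print(expand(source))
```

Rejecting unknown names outright, rather than passing them through, keeps a mistyped escape from silently ending up in a linker symbol.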


--
James Harris

James Harris

Sep 6, 2021, 2:34:11 PM
On 30/08/2021 19:13, David Brown wrote:
> On 30/08/2021 13:37, Dmitry A. Kazakov wrote:

...

>> Actually, this is again sort of Europocentric POV. In reality, if you
>> have a truly international team with speakers outside Western Europe,
>> you must agree on some strict rules regarding comments and identifiers.

...

>> And the least common denominator is English.
>>
>
> It is the least common denominator for most international groups, but
> not for most national teams.

A program whose master copy was in a well-known language - such as
American English - would be a lot easier to translate to other languages
than normal prose.


--
James Harris

David Brown

Sep 6, 2021, 5:12:35 PM
All the bases except for the ones concerning what people writing other
languages would actually see as usable. If this is your "solution", you
are better off saying "pure 7-bit ASCII only" and be done with it,
because no one would /ever/ want to use that.

James Harris

Sep 7, 2021, 3:13:07 AM