I am a little less strict on this topic, but my general opinion is that
things like strict aliasing should be "opt in" rather than "opt out".
But, yeah, the guy who wrote that does in general have a point, but IMO
he also tends to be a bit overly alarmist about these issues.
For example, consider the case of caching struct loads, where there are
multiple possible approaches:
1, Always do the load (my compiler did this originally);
2, Cache, but only within a single basic-block, flush on store (current
rule);
3, Cache, invalidate entries on store if matching types (strict aliasing);
4, Assume that no aliasing occurs whatsoever (playing with fire).
I have noted experimentally that option 2 can gain a fair bit of
speedup over option 1, and is pretty much invisible within a single
thread.
Option 3 is closer to what GCC and Clang do by default, but I opt
against it by default given that it is unsafe. Doom and Quake and
similar still work with this option. It also seems to be only modestly
faster than option 2.
Option 4 is unsafe; it is a little faster than option 3 at Dhrystone,
but Doom and Quake are prone to crash if one uses it.
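To make the hazard concrete, here is a minimal C sketch (assuming
32-bit 'unsigned' and IEEE floats) of the kind of type punning that
Doom/Quake-era code leans on:

#include <stdio.h>

/* Classic pointer type-punning. Under option 3 (strict aliasing), the
   compiler may assume an 'unsigned' store cannot alias a 'float',
   keep 'f' cached across the store, and return stale bits. Options 1
   and 2 give the "expected" result, since any store flushes the
   cached load. */
float negate_via_bits(float f)
{
    unsigned *p = (unsigned *)&f;  /* aliasing violation in ISO C */
    *p ^= 0x80000000u;             /* flip the IEEE-754 sign bit */
    return f;                      /* option 3/4 may return the old f */
}

int main(void)
{
    printf("%f\n", negate_via_bits(1.0f));  /* expected: -1.000000 */
    return 0;
}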
A variation of this optimization has also been applied to array loads.
My preference is also that things like signed integer operations remain
modulo in the signed sense (wrapping on overflow), with unsigned
operations likewise being unsigned modulo, and that signed right shift
behaves in the usually-expected way (arithmetic shift), ...
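As a small illustration (ISO C calls some of this undefined or
implementation-defined; the point is what a compiler should do with it
anyway):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* signed overflow is UB in ISO C; the preference here is
       two's-complement wrapping instead */
    int32_t x = INT32_MAX;
    printf("%d\n", (int32_t)(x + 1));  /* wrapping: -2147483648 */

    /* right-shifting a negative value is implementation-defined in
       ISO C; the expected behavior here is an arithmetic shift */
    printf("%d\n", -8 >> 1);           /* arithmetic: -4 */
    return 0;
}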
So, ideally, one can try to optimize things in ways that don't turn
into semantic quicksand, even if this means skipping over some possible
types of optimization.
Similarly, corner cutting is acceptable in cases where the expected
results can be reasonably well-defined, or are "basically invisible"
within reasonable use-cases.
For example, there can be a fairly significant cost difference between
"FPU that implements IEEE-754 correctly", and one that gives answers
that are "more or less correct".
Expensive FPU:
Separate FPU registers (arguably less register pressure);
FADD, FSUB, FMUL, FDIV, FSQRT, FMAC, ...
Always +/- 0.5 ULP;
Implements Denormals;
...
Cheaper FPU:
Reuse integer registers;
FADD, FSUB, FMUL;
Relaxed rounding;
Denormal as Zero;
...
For most programs, the difference between the cheap and expensive FPU is
not noticeable; and the sorts of workloads which "actually care" are
unlikely to be run on a processor that is too cheap to be able to afford
the fancier FPU (and where the relative slowness of doing everything
with software emulation is undesirable).
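As one concrete example, denormal-as-zero is observable in principle
but nearly invisible in practice (a sketch; assumes the comparison
actually runs on the FPU rather than being constant-folded):

#include <stdio.h>

int main(void)
{
    volatile float tiny = 1.0e-40f;  /* subnormal, below FLT_MIN (~1.18e-38) */
    /* an IEEE-754-correct FPU preserves the value, so tiny > 0 holds;
       a denormal-as-zero FPU flushes it to 0.0f and the test fails.
       Most programs never produce values this small. */
    printf("tiny > 0 : %d\n", tiny > 0.0f ? 1 : 0);
    return 0;
}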
One might find, though, that the more useful property is not so much
that the FPU is fully accurate, but rather that things like
"Single->Double->Single" conversions keep the original value intact.
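A quick sketch of that round-trip property:

#include <stdio.h>
#include <string.h>

/* Single -> Double -> Single should be value-preserving: every float
   is exactly representable as a double, so narrowing back recovers
   the original value. */
int roundtrip_ok(float x)
{
    double d = (double)x;  /* widening conversion: exact */
    float  y = (float)d;   /* narrowing back: recovers x */
    return memcmp(&x, &y, sizeof(x)) == 0;
}

int main(void)
{
    printf("%d\n", roundtrip_ok(0.1f));  /* expect 1 */
    return 0;
}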
Compiler optimization is an ongoing challenge though.
Mostly I have been fixing various issues; recent notable bug fixes:
Binary operators needlessly converting arguments to the destination type;
Type conversion operators doing operations via unnecessary temporaries;
Unsigned loads using redundant zero-extension ops;
...
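For illustration, the zero-extension case shows up with source like
the following (illustrative only; the redundancy was in the generated
code, not the C):

/* a byte load already zero-extends to register width, so emitting a
   second zero-extension op for 'p[i]' before the add was pure waste */
unsigned sum_bytes(const unsigned char *p, int n)
{
    unsigned s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}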
And, some sorts of partially-addressed issues:
Temporaries always spilling to stack even when the value is no longer
needed (I tried to add 'phi' operators to the IR, but this was a fail;
I did add logic though for "don't bother storing a temporary if it is
never reloaded", which also helps, but is less fine-grained than a
'phi' would have been).
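As a sketch of the kind of case where a 'phi' would have been
finer-grained than the store-elision heuristic:

/* 't' has two reaching definitions that merge at the return. With a
   'phi' in the IR, the merge could stay in registers; without one,
   the simple approach spills 't' to the stack at each assignment.
   The "never reloaded" heuristic doesn't fire here, since 't' is in
   fact reloaded at the return. */
int select_add(int c, int a, int b)
{
    int t;
    if (c) t = a + 1;  /* definition 1 */
    else   t = b - 1;  /* definition 2 */
    return t;          /* t = phi(t1, t2) in SSA terms */
}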
Had recently (briefly) gotten Dhrystone up to ~65k, but it has since
dropped to ~63k; it seems very sensitive to code-generation options.
This is on my custom ISA at 50MHz, so ~0.74 DMIPS/MHz.
While, conceivably, one can push it a little closer to ~0.78 via strict
aliasing, I don't consider this worthwhile ATM.
This seems to be in the sort of intermediate area between "slow" and
"fast" cores (though I have not yet figured out the nature of the
dividing line). Having looked briefly at the SweRV core, though, I have
noted that they deal with the pipeline in a very different way (for
example, they seem to handle load/store with a queue and interlocks,
rather than running the L1 D$ in lockstep with the pipeline and
stalling the pipeline on each L1 miss).
It does seem like this could be a possible way to improve the speed of a
core, but would introduce its own complexities. I am not sure if this
accounts for a lot of the differences I am seeing (but, in
testing/models, the core is dumping a lot of its clock-cycles into cache
misses, and in theory this could at least help here, if it could be done
cheaply enough).
Though, it can also be noted that in my case, my CPU core lacks an
integer divider, and only provides a 32-bit integer multiplier (a full
64-bit multiplier would be slow and/or expensive). Generally, faking
64-bit multiply as an instruction-blob has been "fast enough".
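For reference, roughly the shape of what such a blob computes,
composing the low 64 bits of a multiply from 32x32->64 partial
products (shown in C; the real thing would be ASM):

#include <stdint.h>

/* low 64 bits of a 64x64 multiply from three 32x32->64 partial
   products; the ah*bh term only affects bits >= 64 and is dropped */
uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)al * bl;
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl;
    return lo + (mid << 32);  /* only the low 32 bits of mid survive */
}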
Pretending to have instructions, but faking them with traps, is the
worst of both worlds (pointless and slow); at least a runtime call to
an ASM function can be reasonable.
I did recently fiddle around with seeing if I could make the ASM code
for integer divide faster, where the options are:
Binary Long Division (current mainline option);
Lookup table of reciprocals (sorta works);
Lookup table + shifted-reciprocal (didn't win).
I had recently tried switching from solely using long division as the
default, to using a lookup table with long division as a fallback. This
is a little faster for small divisors, but a little slower for divisors
that fall outside the range of the table. It also adds the cost of the
table lookups (trading logic for cache misses) and uses up space, which
favors a fairly small lookup table (under 1K), such that for
divide-intensive code the table will mostly fit in the L1 cache.
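A much-simplified sketch of the idea (in C, with hypothetical names;
the actual routine would be hand-written ASM):

#include <stdint.h>

#define RECIP_N 256                 /* 256 x 4 bytes: table stays ~1K */
static uint32_t recip_tbl[RECIP_N]; /* recip_tbl[d] = floor(2^32 / d) */

void recip_init(void)
{
    for (uint32_t d = 2; d < RECIP_N; d++)
        recip_tbl[d] = (uint32_t)(0x100000000ull / d);
}

uint32_t udiv32(uint32_t n, uint32_t d)
{
    if (d == 1)
        return n;
    if (d >= 2 && d < RECIP_N) {
        /* q lands on floor(n/d) or floor(n/d)-1; one conditional
           fixup makes it integer exact for any 32-bit n */
        uint32_t q = (uint32_t)(((uint64_t)n * recip_tbl[d]) >> 32);
        if ((uint64_t)(q + 1) * d <= n)
            q++;
        return q;
    }
    /* fallback: binary long division for divisors past the table */
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);
        if (r >= d) { r -= d; q |= 1u << i; }
    }
    return q;
}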
Shifted reciprocal can extend the range of such a small lookup table to
cover the entire integer range, but it comes with a big issue: the
result is not integer exact; in effect, the error grows in proportion
to the size of the input values.
So, it is fast for things like 3D rendering calculations where one
doesn't really care if the result is integer exact, but for C's "a/b"
operator, being integer exact is mandatory.
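A sketch of the shifted-reciprocal variant, reusing the hypothetical
recip_tbl above (approximate by design; assumes d >= 2, with in-table
divisors taking the exact path):

/* normalize a large divisor down into table range, then compensate
   with an extra shift. Since (d >> s) << s <= d, the quotient tends
   to run high, with the error growing with the input magnitudes. */
uint32_t udiv32_approx(uint32_t n, uint32_t d)
{
    int s = 0;
    while ((d >> s) >= RECIP_N)  /* shift d into the table's range */
        s++;
    return (uint32_t)((((uint64_t)n * recip_tbl[d >> s]) >> 32) >> s);
}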
The problem then becomes that, in many cases, trying to fix this ends
up costing more than if one had just used the long division loop.
Most of the workarounds tend to be "maybe helps, maybe makes it worse".
However, most other code doesn't lean quite so heavily on integer
divide performance, so it usually isn't quite as relevant.
I am hesitant about adding an instruction for IDIV, as for years even
fairly high-end x86 CPU lines were plagued with an IDIV instruction
that was slower than doing it in software, and where performance would
likely have been better had the C compilers just called a runtime
function and done the binary long division in the C library.
Including the divide instruction in an ISA carries a non-trivial risk of
such a thing happening again (if compilers decide to use it by default
because "hey, there is an instruction for it, and dedicated instruction
implies fast, right?...").
But, in some sense, it is easier to use special-case mechanisms to make
a runtime call faster than it is to avoid a potentially significant
overhead from emulating instructions which don't exist in hardware.
Granted, a possible intermediate option could be to have an instruction
whose dedicated purpose is to call into a location in a table while
also doing two register moves, ..., which then branches to the
appropriate runtime function (sorta like a special GOT, mostly for
stuff that was too expensive to add to the ISA directly, but where one
kinda wants to pretend that a CPU instruction exists).
Then one can just sort of declare that all the scratch registers will be
left in an undefined state following the operation.
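Modeled roughly in C (all names here hypothetical; no such instruction
currently exists):

#include <stdint.h>

typedef uint64_t (*rt_fn)(uint64_t, uint64_t);

static uint64_t rt_udiv64(uint64_t a, uint64_t b) { return a / b; }
static uint64_t rt_umod64(uint64_t a, uint64_t b) { return a % b; }

enum { RT_UDIV64, RT_UMOD64 };  /* fixed indices into the table */
static const rt_fn rt_call_tbl[] = { rt_udiv64, rt_umod64 };

/* "RTCALL #idx, Rs, Rt" would: move Rs/Rt into the argument
   registers, branch through rt_call_tbl[idx], and leave the result in
   the return register, with the scratch registers undefined
   afterward. A compiler could then lower, e.g., 64-bit '/' to
   rtcall(RT_UDIV64, a, b). */
static inline uint64_t rtcall(int idx, uint64_t rs, uint64_t rt)
{
    return rt_call_tbl[idx](rs, rt);
}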
...