
Introducing ForwardCom: An open ISA with variable-length vector registers


Agner Fog

Jan 30, 2023, 6:03:36 AM
I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.

ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:

A register used as loop counter is initialized to the array length. This counter is decremented by the vector register length and repeats as long as it is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.

A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
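To make the mechanism concrete, here is a minimal sketch in plain C (not ForwardCom assembly) of what the counter, the length clamp, and the end-minus-counter addressing compute. MAXBYTES and the function name are invented for illustration, and the inner loop stands in for a single vector instruction:

#include <stddef.h>

#define MAXBYTES 32   /* hypothetical maximum vector length in bytes */

/* add 1.0 to every element of a[0..n-1], one "vector" at a time */
void add_one(double *a, size_t n)
{
    char *end = (char *)(a + n);                        /* end of the array */
    ptrdiff_t count = (ptrdiff_t)(n * sizeof(double));  /* counter = array length in bytes */
    while (count > 0) {
        size_t len = (size_t)count < MAXBYTES ? (size_t)count : MAXBYTES;
        double *p = (double *)(end - count);            /* address = end minus counter */
        for (size_t i = 0; i < len / sizeof(double); i++)
            p[i] += 1.0;                                /* stands in for one vector operation */
        count -= MAXBYTES;                              /* repeat while positive */
    }
}

The last pass automatically shrinks len to the remaining elements, which is why no separate loop tail is needed.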

This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.

There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.

It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.

ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.

The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.

All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.

The flexible instruction format allows the CPU to do more work per instruction. Various option bits add extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit. Multiplication, division, and floating point arithmetic have longer latencies.

The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
for (int r0 = 0; r0 < 100; r0++) { }

There are many other features, including strong security, efficient memory management, an efficient library system, efficient global register allocation, and more.

See www.forwardcom.info for all the details.

The current status of the ForwardCom project is:

* Instruction set and ABI standards are defined in detail.
* The following binary tools have been developed: high-level assembler, disassembler, linker, library manager, emulator, and debugger. A compiler is not available yet.
* A hardware implementation in an FPGA soft core is available with full documentation. The current version supports integer instructions only.
* Everything is free and open source

MitchAlsup

Jan 30, 2023, 2:07:31 PM
On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
> I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.
>
> ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:
>
> A register used as loop counter is initialized to the array length. This counter is decremented by the vector register length and repeats as long as it is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.
>
> A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
>
> This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.
>
> There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.
>
> It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.
>
> ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.
<
My 66000 is mostly RISC with a few of the better parts of CISC thrown in. I also
think that something between RISC and CISC has potential for being better than
either. And in particular, Brian's LLVM compiler produces My 66000 ASM for
benchmarks (with available source code) that averages 75%-79% of the instruction
count of the RISC-V LLVM compiler (with the same optimization settings).
>
> The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.
<
When I designed the My 66000 ISA I developed the instruction set so that every
instruction could have 1 immediate, or 1 displacement, or both; and that
the immediate could be in either operand position. This enables my ISA to express
<
SLL R7,#1,R9
STD #3.1415926535897932,[Rbase+Rindex<<scale+Displacement]
FMAC R9,R7,#3.2768D5,R9
FMAC R10,R6,R5,#1.7D0
<
as single instructions. There is no way to deliver an operand into calculation
that is of lower power than a constant.
>
> All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.
<
Same.
>
> The flexible instruction format allows the CPU to do more work per instruction.
<
Agreed.
<
> Various option bits adds extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit.
<
I worked in sign control, too; so::
<
ADD R7,R9,-R14 // is the subtract version
ADD R7,-R9,R14 // is the reverse subtract version
<
> Multiplication, division, and floating point arithmetic have longer latencies.
<
Of course they do.

MitchAlsup

Jan 30, 2023, 6:06:57 PM
On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
>
> The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
> for (int r0 = 0; r0 < 100; r0++) { }
>
This reminds me of CDC 6600 assembly, where essentially all calculations are of the form
set x7 = x3 + x2
<
How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
And what code would be produced ??
<

Agner Fog

Jan 31, 2023, 2:00:46 AM
On January 31, 2023 at 00:06:57 UTC+1, MitchAlsup wrote:
> How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
> And what code would be produced ??

The ForwardCom instruction format has one field specifying operand type and precision, so you cannot mix operand sizes in a single instruction. Unused parts of a register are always set to zero, so a 32-bit operation will zero-extend the result to 64 bits in a 64-bit register.
int32 r7 = r6 + r19
will do a 32-bit addition and zero-extend the result to 64 bits in r7. This will work if r19 is known to contain an 8-bit unsigned integer. If the bits 8-31 of R19 are not known to be zero then you have to cut them off first:
int8 r19 = r19.
There is no implicit sign extension. If you want to sign-extend an integer to a larger size you must use an extra instruction for doing so.
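As a sketch of what that two-instruction sequence computes for Mitch's example (plain C, not ForwardCom assembly; note that ForwardCom zero-extends the 32-bit result as described above, whereas the literal C expression would sign-extend a negative sum):

#include <stdint.h>

uint64_t add_mixed(uint64_t r6, uint64_t r19)
{
    uint64_t t = r19 & 0xFF;                    /* int8 r19 = r19: clear bits 8-63 */
    uint32_t sum = (uint32_t)r6 + (uint32_t)t;  /* int32 r7 = r6 + r19 */
    return (uint64_t)sum;                       /* upper 32 bits of r7 are zero */
}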

Thomas Koenig

Feb 1, 2023, 9:03:45 AM
Agner Fog <ag...@dtu.dk> schrieb:

> The ForwardCom assembly language is simple and immediately
> intelligible. Adding two integers is as simple as: int r1 =
> r2 + r3. Branches and loops are written just like in C or Java,
> for example:

> for (int r0 = 0; r0 < 100; r0++) { }

Quite interesting, thanks a lot for sharing!

One question: What would be the best way to handle loop-carried
dependencies (let's say a memmove, where operands can overlap,
or C's

void add (const int *a, int *b, int *c, int n)
{
    for (int i=0; i<n; i++)
        a[i] = b[i] + c[i];
}

in your ISA?

Quadibloc

Feb 1, 2023, 9:48:06 AM
On Monday, January 30, 2023 at 12:07:31 PM UTC-7, MitchAlsup wrote:

> My 66000 is mostly RISC with a few of the better parts of CISC thrown
> in.

The only thing I'm not happy about in your design is the fact that
some actions, treated as single instructions by the microarchitecture
for efficiency, are specified as multiple instructions in the code.

To me, this appears to massively complicate the decoding of
instructions, making the architecture more difficult to implement.

Of course, though, it doesn't make the processor less efficient,
since instruction decoding can be done well ahead of execution,
so this need not have any impact on latencies.

John Savard

Terje Mathisen

Feb 1, 2023, 11:26:50 AM
Does that make sense? Adding a 'const' to the target array?

I could see how noalias would be useful for the compiler but what is
const supposed to do here?

BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
loop, effectively something like:

while (i < n) {
    i += vector_add(a, b, c, n-i);
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

Feb 1, 2023, 12:19:32 PM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Agner Fog <ag...@dtu.dk> schrieb:
>>
>>> The ForwardCom assembly language is simple and immediately
>>> intelligible. Adding two integers is as simple as: int r1 =
>>> r2 + r3. Branches and loops are written just like in C or Java,
>>> for example:
>>
>>> for (int r0 = 0; r0 < 100; r0++) { }
>>
>> Quite interesting, thanks a lot for sharing!
>>
>> One question: What would be the best way to handle loop-carried
>> dependencies (let's say a memmove, where operands can overlap,
>> or C's
>>
>> void add (const int *a, int *b, int *c, int n)
>> {
>>
>> for (int i=0; i<n; i++)
>> a[i] = b[i] + c[i];
>> }
>
> Does that make sense? Adding a 'const' to the target array?

Not at all, I had not thought that through, const should be on
the other two arguments.

> I could see how noalias would be useful for the compiler but what is
> const supposed to do here?

> BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
> loop, effectively something like:
>
> while (i < n) {
> i += vector_add(a,b,c,n-i);
> }

Hm.

Let's assume C semantics without noalias or restrict, and assume that
the function (without the nonsenscial const) is called as

int *arr;
add (arr+1, arr, arr+2, n);

Then, the store to a[0] would overwrite arr[1], and in the next
iteration b[1] would be referenced, which would be arr[1], so the
store to a[0] would have to be carried over to the next iteration
of the loop. Classic aliasing (and I made the case especially
obnoxious because not even loop reversal would help).
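A small self-contained C illustration of that case (with const moved to the two input pointers, as corrected above): each store writes the element that the next iteration loads, so a vectorized version that loads whole chunks of b[] before storing would produce a different result than this scalar loop.

#include <stdio.h>

static void add(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

int main(void)
{
    int arr[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    add(arr + 1, arr, arr + 2, 4);      /* a overlaps both b and c */
    for (int i = 0; i < 8; i++)
        printf("%d ", arr[i]);          /* scalar semantics: 1 2 3 4 5 1 1 1 */
    printf("\n");
    return 0;
}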

Mitch's My66000 would detect the interdependency at runtime and
drop down to scalar mode.

For Fortran, there is a related problem. While arguments cannot
overlap, the language proscribes that the right-hand side of
an assignment is evaluated before the actual assignment.
This leads to generation of temporary arrays for code like

a(n:m) = a(n-1:m-1) + a(n+1:m+1)

or when overlap / non-overlap cannot be detected at compile time.

My question would be: Would this kind of thing be left to the
programmer (or the language definition, or the compiler) or are
there special provisions in the ISA to deal with this kind of
thing one way or another?

Mitch's VVM drops down to scalar if an interdependency is found,
via a hardware check.

MitchAlsup

Feb 1, 2023, 2:05:09 PM
On Wednesday, February 1, 2023 at 8:03:45 AM UTC-6, Thomas Koenig wrote:
> Agner Fog <ag...@dtu.dk> schrieb:
> > The ForwardCom assembly language is simple and immediately
> > intelligible. Adding two integers is as simple as: int r1 =
> > r2 + r3. Branches and loops are written just like in C or Java,
> > for example:
>
> > for (int r0 = 0; r0 < 100; r0++) { }
> Quite interesting, thanks a lot for sharing!
>
> One question: What would be the best way to handle loop-carried
> dependencies (let's say a memmove, where operands can overlap,
> or C's
<
MM Rto,Rfrom,Rcnt
>
> void add (const int *a, int *b, int *c, int n)
> {
> for (int i=0; i<n; i++)
> a[i] = b[i] + c[i];
> }
<
I fail to see a loop carried dependence. Nothing loaded or
calculated in the loop is used in the subsequent loop. Do I
have a bad interpretation of "loop carried" ?
>
> in your ISA?

Thomas Koenig

Feb 1, 2023, 2:20:41 PM
MitchAlsup <Mitch...@aol.com> schrieb:
It's not obvious, but (see my reply to Terje) the pointers can
point to overlapping parts of an array. In this case, a non-
obvious loop carried dependence can be introduced (note the
lack of restrict in the argument list), apart from the
const which is wrong.

Tim Rentsch

Feb 1, 2023, 2:39:13 PM
Thomas Koenig <tko...@netcologne.de> writes:

[...]

> For Fortran, there is a related problem. While arguments cannot
> overlap, the language proscribes that the right-hand side of
> an assignment is evaluated before the actual assignment.

I expect you mean prescribes, not proscribes.

MitchAlsup

Feb 1, 2023, 3:25:25 PM
I can envision an implementation where there are no loop
carried dependences.
>
But if your point was to illustrate that C has bad semantics
wrt array aliasing, I have to agree. The Fortran version has
no loop carried dependence.

Quadibloc

Feb 1, 2023, 4:19:58 PM
On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:

> ForwardCom can vectorize array loops in a new way that automatically
> adjusts to the maximum vector length supported by the CPU. It works
> in the following way:

> A register used as loop counter is initialized to the array length. This
> counter is decremented by the vector register length and repeats as
> long as it is positive. The counter register also specifies the vector
> register length. The vector registers used in the loop will have the
> maximum length as long as the counter exceeds this value. There
> is no loop tail because the vector length is automatically adjusted
> in the last iteration of the loop to fit the remaining number of array
> elements.

This isn't completely new. The IBM System/370 at one point added a
vector feature which was largely modelled on that of the Cray I. But
unlike the Cray I, it was designed so that one model of the 370 might
have vector registers with 64 elements, and another one might have
vector registers with 256 elements.

To allow code to be written that was independent of the hardware
vector size, a 16-bit field in the vector control register was used as
the loop counter.

This is not the same as the vector feature recently added to the
current IBM z/Architecture mainframes, which is just a SIMD facility
comparable to AVX on x86 microprocessors.

John Savard

Thomas Koenig

Feb 1, 2023, 5:04:46 PM
The point I was trying to make (but which probably didn't come
across very well) is that loop-carried dependencies which are
allowed by the language can carry large performance penalties,
and that they impede any sort of vectorization - leading to wrong
code, or to performance degradation, or both.

And Fortran has its own issues in that respect, not with arguments,
but with array expressions. Conceptually, everything on the
right-hand side of an assignment is evaluated before assignment
to the left-hand side. This is no problem if there are no
dependencies between the two sides. If there are simple
dependencies, like

a = sin(a)

that is not a problem, because the compiler can easily prove
that this can be scalarized into a simple loop. For
something like

a (1:n) = a(0:n-1)

(array slice notation, where after the assignment a(1) has the
value that a(0) had previously, a(2) previously a(1), etc.) this
is conceptually identical to

tmp(1:n) = a(0:n-1)
a(1:n) = tmp

but the compiler cannot use a forward loop. It could either
actually create a temporary, or reverse the loop.
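For instance, the reversed-loop option looks like this in C (0-based indexing, function name made up for illustration); each destination element depends only on the element below it, so a top-down copy reads every source value before it is overwritten and no temporary is needed:

void shift_up(double *a, int n)   /* effect of a(1:n) = a(0:n-1) */
{
    for (int i = n; i >= 1; i--)
        a[i] = a[i - 1];
}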

And in doubt, the compiler has to err on the side of correct code,
rather than speed.

Terje Mathisen

Feb 1, 2023, 5:25:03 PM
IMHO, this is exactly like the IBM RLL-decoding using intentionally
overlapping moves (MVC?): You really, really should not do it, and if
you do, then you deserve whatever ills befall you.

MitchAlsup

Feb 1, 2023, 6:40:20 PM
On Wednesday, February 1, 2023 at 4:04:46 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> > But if your point was to illustrate that C has bad semantics
> > wrt array aliasing, I have to agree. The Fortran version has
> > no loop carried dependence.
<
> The point I was trying to make (but which probably didn't come
> across very well) is that loop-carried dependencies which are
> allowed by the language can carry large performance penalties,
> and that they impede any sort of vectorization - with wrong code
> or with performance degradation, or both.
<
A subtle point you may have missed with VVM is that one CAN
vectorize such loops, and the ones with such a dependency simply
run slower (at the speed the dependency resolves) rather than the
speed the pipeline is capable of running if the dependence were
not present. In both cases, the correct results are obtained.
<
Loop carried dependencies do not prevent VVM vectorization.

Thomas Koenig

Feb 2, 2023, 1:30:43 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
I concur. The question is who is "you" in your sentence above...

Compiler writers have to get it right, according to the language
specification, so wrong code is not an option. If the language
specification results in pessimized code to cater for a case that
hardly anybody uses intentionally, well... what is a compiler
writer to do?

Thomas Koenig

Feb 2, 2023, 1:45:30 AM
MitchAlsup <Mitch...@aol.com> schrieb:
I know that, but didn't explicitly mention it :-)
><
> Loop carried dependencies do not prevent VVM vectorization.

The _really_ good thing is that it only happens when detected at
runtime, so there is no performance penalty when it is not needed.

But if loop carried dependencies occur, they will have a large
performance impact.

Also, VVM faithfully reproduces a special ordering at the loop,
the one that it was written in. This is obviously ideal for
languages which write out explicit loops, like C, but does not
offer improvement for languages (like Fortran :-) which operate
more on vectors or arrays.

Consider the Fortran statement

a(n:m) = a(k:k+m-n) + 2.0

which can be efficiently implemented with either a forward or a backward
loop, depending on the sign of n - k. The compiler has two choices:
Either copy the intermediate result to a temporary, or perform a
run-time test and switch the loop ordering, depending on the result.
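One possible shape of that run-time test, sketched in C with 0-based indexing (the function name is invented; a real compiler would emit the same check inline): when the source slice starts below the destination, the backward loop avoids overwriting elements that have not been read yet, otherwise the forward loop is safe.

void slice_add2(double *a, int n, int m, int k)   /* a(n:m) = a(k:k+m-n) + 2.0 */
{
    if (k < n) {
        for (int i = m; i >= n; i--)     /* source below destination: go backward */
            a[i] = a[k + (i - n)] + 2.0;
    } else {
        for (int i = n; i <= m; i++)     /* source at or above destination: go forward */
            a[i] = a[k + (i - n)] + 2.0;
    }
}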

Depending on the length of the loop, there is then always the question
of which is faster, because the run-time test adds to the overhead of
the loop startup.

This is one case where a genuine vector ISA could offer advantages,
it could do the loop reversal check in hardware, and fast. I simply
don't know if ForwardCom can do that, or not.

Agner Fog

Feb 2, 2023, 1:57:11 AM

Thomas Koenig wrote:
What would be the best way to handle loop-carried dependencies (let's say a memmove, where operands can overlap)

The memmove function allows overlap, unlike memcpy. The memmove library function is using a backwards loop when necessary. The ForwardCom library function will do the same, of course. A backwards loop is not as elegant as a forward loop, but still possible.
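A sketch of that direction choice in C (illustrative only; comparing unrelated pointers this way is technically undefined in standard C, though it is what library implementations effectively do):

#include <stddef.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d > s && d < s + n) {            /* destination overlaps source from above */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];         /* backward copy */
    } else {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];                 /* forward copy */
    }
    return dst;
}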

An optimizing compiler will not vectorize a loop automatically unless it can prove that there is no aliasing problem. The programmer may help by explicit vectorization, or a __restrict keyword, or a no-aliasing compiler option. ForwardCom does not differ from other SIMD processors in this respect.


Quadibloc wrote:
some actions, treated as single instructions by the microarchitecture for efficiency, are specified as multiple instructions in the code.

The assembler understands the high-level construct:
for (int r1=0; r1<n; r1++) { }
The loop prolog (r1=0) is a single instruction (or two instructions if n is a register and it is not certain that n > 0).
The loop epilog (r1++, jump if r1<n) is a single instruction. It is the high-level assembler that does the translation. The microprocessor sees the epilog as a single instruction and treats it like any other combined ALU-branch instruction, so it has no problem decoding this instruction.
The compiler is free to pass the 'for' loop as a high-level expression to the assembler or translate it to low-level instruction mnemonics.

Thomas Koenig wrote:
This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.

It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big, with branches and function calls and nested loops, etc.? My66000 cannot do this either. It can only handle small loops.

Agner Fog

Feb 2, 2023, 2:02:04 AM

> Thomas Koenig wrote:
> This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.
> It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big with branches and function calls and nested loops etc. My66000 cannot do this either. It can only handle small loops.

What I meant was, an out-of-order My66000 might do the equivalent of loop reversal in small, simple cases, but not for big or complicated loops.

Thomas Koenig

Feb 2, 2023, 12:26:04 PM
Thomas Koenig <tko...@netcologne.de> schrieb:

> Also, VVM faithfully reproduces a special ordering at the loop,
> the one that it was written in. This is obviously ideal for
> languages which write out explicit loops, like C, but does not
> offer improvment for languages (like Fortran :-) which operate
> more on vectors or arrays.

Actually, that should read

"does not offer additional improvement". Having VVM for Fortran would
already speed up a lot of code.

Stephen Fuld

Feb 2, 2023, 12:56:55 PM
I get what you are saying, but I am not sure it is true. Specifically,
the CPU can do only one add at a time (due to the dependency), so
it doesn't offer the advantage of operating on multiple elements
simultaneously, as it does on loops without loop-carried dependencies.

But it may be, and Mitch is the expert here, that the other advantages
of VVM, specifically, the large "prefill" buffers (I'm sorry, I forget
what they are officially called) offer an advantage over pure scalar code.

BTW, whether you call what VVM does for this loop "vectorizing" is, I
think, a matter of one's definition of vectorizing, specifically whether
the definition must include operating on more than one element
simultaneously or not.




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Feb 2, 2023, 3:52:27 PM
On Thursday, February 2, 2023 at 11:56:55 AM UTC-6, Stephen Fuld wrote:
> On 2/2/2023 9:26 AM, Thomas Koenig wrote:
> > Thomas Koenig <tko...@netcologne.de> schrieb:
> >
> >> Also, VVM faithfully reproduces a special ordering at the loop,
> >> the one that it was written in. This is obviously ideal for
> >> languages which write out explicit loops, like C, but does not
> >> offer improvment for languages (like Fortran :-) which operate
> >> more on vectors or arrays.
> >
> > Actually, that should read
> >
> > "does not offer additional improvement". Having VVM for Fortran would
> > already speed up a lot of code.
<
> I get what you are saying, but I am not sure it is true. Specifically,
> while the CPU can do only one add at a time (due to the dependency),
> so it doesn't offer the advantage of operating on multiple elements
> simultaneously, as it does on non loop carried dependency loops.
<
If we are talking about the following loop (snatched from above)::
<
void add (const int *a, int *b, int *c, int n)
{
    for (int i=0; i<n; i++)
        a[i] = b[i] + c[i];
}
<
The only thing preventing performing 2-4-8 adds, at the same time,
is whether (or not) a[] actually aliases b[] or c[]. If/when there is no
aliasing, this can be performed as wide as the implementation has
resources. If aliasing is actually taking place then one is limited by
the actual data-flow.
<
Note: That in the case above::
<
add( a, a, @a[37]);
<
Does not perform any ACTUAL aliasing up to execution widths of
32 per cycle (8 per cycle FP due to latency).
>
> But it may be, and Mitch is the expert here, that the other advantages
> of VVM, specifically, the large "prefill" buffers (I'm sorry, I forget
> what they are officially called) offer an advantage over pure scaler code.
<
Consider that a processor is comprised of many function units, each
having a characteristic latency:: integer is 1 cycle, LDs are 3, STs are 1
cycle, FP is 4 cycles. Now consider that there is 1 FU of each kind and
that the LOOP instruction executes in 1 cycle. Finally consider that the
underlying machine is 1-wide and in-order (classic 1st generation RISC).
Cache width is 128-bits--not 1st generation RISC, but suitable for mis-
aligned DoubleWord access at full speed.
>
Here, loops are bound by the number of times a FU is used within the
loop. DAXPY has 2 LDs, 1 ST, and 1 FMAC per iteration. LDs and STs
are being performed 128 bits per access, or 2 iterations per access.
<
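For reference, the DAXPY kernel being counted here, written out in plain C (a sketch; the loop body is the two loads, one FMAC, and one store, plus the loop control that a single LOOP or ADD-CMP-BLT covers):

void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* 2 LDs, 1 FMAC, 1 ST per iteration */
}
<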
On my 1-wide in-order machine, we are reading 128 bits per access for the
2 LDs, and writing 128 bits for the ST every 3 (2*) cycles, so with a
single FMAC unit we can perform 2 LDs, 1 FMAC, 1 ST and the equivalent
of 1×(ADD-CMP-BLT) every 3 cycles (7 instructions in 3 cycles
= 2+1/3 IPC). 2.3 IPC is not bad for a 1-wide machine, being about
3.5× the performance of the scalar code all by itself.
<
(*) The nature of cache lines having 1 tag and 4 quadword columns
enables the 2 LDs and the ST to both be performed in 2 cycles since
the cache line is 4 accesses wide and we need 3 tag accesses every
4 cycles. {This may require an additional QuadWord of buffering for
STs.} In order to do this, the column being stored cannot be a column
being read by either of the 2 LDs. 7 effective instructions in 2 cycles
is 3.5 IPC and 5× what scalar code could perform.
<
Also note: The C (or Fortran) source has manually unrolled this loop 4
times. This just multiplies the number of instructions in the loop by
4 while leaving a single LOOP instruction for iteration control.
<
> BTW, whether you call what VVM does for this loop "vectorizing" is, I
> think a matter of ones definition of vectorizing, specifically whether
> the definition must include operating on more than one element
> simultaneously or not.
>
In the same way a CRAY-like vector machine does not need fetch
bandwidth while its vector instructions are being performed, a VVM
implementation does not need the Fetch-Parse-Decode stages of
the pipeline to be active while the loop is being performed.

Anton Ertl

Feb 2, 2023, 6:03:22 PM
Agner Fog <ag...@dtu.dk> writes:
>ForwardCom can vectorize array loops in a new way that automatically
>adjusts to the maximum vector length supported by the CPU. It works in
>the following way:
>
>A register used as loop counter is initialized to the array length.
>This counter is decremented by the vector register length and repeats
>as long as it is positive. The counter register also specifies the
>vector register length. The vector registers used in the loop will have
>the maximum length as long as the counter exceeds this value. There is
>no loop tail because the vector length is automatically adjusted in the
>last iteration of the loop to fit the remaining number of array elements.
...
>There is no global vector length register. You can have different
>vectors with different lengths at the same time. The length is stored
>in the vector register itself. When you save and restore a vector
>register, it will only save the part of the register that is actually
>used.

In recent time we have had the problem that CPU manufacturers combine
different cores on the same CPU, and want to migrate threads between
these cores. So they use the same SIMD lengths on all cores, even, in
the case of Intel, disabling AVX-512 on CPUs that do not have the
smaller cores, i.e., where all cores could do AVX-512 if it was not
disabled.

It seems to me that you don't have an answer for that. And I guess
that as long as there are architectural (programmer-visible) SIMD
registers, the problem will persist. The SIMD registers must not be
an architectural feature, as in VVM (or maybe Helium?), to make
migration from cores with longer to cores with shorter vectors
possible.

The other option would be for Intel to implement AVX-512 in the small
cores without 512-bit units, the same way that AMD has been using time
and again: XMM=2x64bits on K8, YMM=2x128 bits on later heavy machinery
and Zen1, and ZMM is 2x256 bits on Zen4.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Agner Fog

Feb 3, 2023, 12:47:28 AM
Anton Ertl wrote:
> In recent time we have had the problem that CPU manufacturers combine
> different cores on the same CPU, and want to migrate threads between
> these cores. So they use the same SIMD lengths on all cores, even, in
> the case of Intel, disabling AVX-512 on CPUs that do not have the
> smaller cores, i.e., where all cores could do AVX-512 if it was not
> disabled.
Yes, this was a big mistake in my opinion. The first version of Intel's Alder Lake had access to different cores with and without AVX512 support. Your program will crash if the software detects that it can use AVX512 and later jumps to another core without AVX512. So they had to disable the AVX512 completely, including the new half-precision floating point instructions. Did Intel expect the OS to move a thread back to a capable core in this case? You cannot expect an OS to give special treatment to every new processor on the market with new quirks. I have discussed this problem at https://www.agner.org/forum/viewtopic.php?f=1&t=79
Unfortunately, ARM does the same in their big.LITTLE architecture.

Regarding pointer aliasing:
I think it is the responsibility of the programmer to consider pointer aliasing problems and do a loop backwards if necessary. If you want the hardware to fix all kinds of bad programming, you will soon find yourself down a very deep rabbit hole.

Terje Mathisen

Feb 3, 2023, 1:54:28 AM
Generate correct code, so for max optimization give a warning, but
produce code that tests and branches to alternate implementations
depending upon the parameters, i.e. in your example both b[] and c[] must
be checked for overlap with a[]. The fallback can be fast, as in
memmove(), if zero or one input overlap, then fall back on scalar for
really bad cases?

robf...@gmail.com

Feb 3, 2023, 3:45:42 AM
Re: memory management,

I skimmed through the manual and was somewhat surprised by the memory
management section. It sounds somewhat like a segmented system.
Is it possible to get a little more detail? If I have got this right, there is a table(s)
of base block addresses and block lengths and a number of comparators are
used to map a virtual address to a physical one. Can an object span across
multiple blocks? Is the block table swapped on a context switch, or are there
entries for multiple contexts?

While many systems contain block address translations, they also typically
include TLBs and paged memory. That style of memory management seems
to have won out.

Anton Ertl

Feb 3, 2023, 6:35:44 AM
Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:
>> In recent time we have had the problem that CPU manufacturers combine
>> different cores on the same CPU, and want to migrate threads between
>> these cores. So they use the same SIMD lengths on all cores, even, in
>> the case of Intel, disabling AVX-512 on CPUs that do not have the
>> smaller cores, i.e., where all cores could do AVX-512 if it was not
>> disabled.
>Yes, this was a big mistake in my opinion. The first version of Intel's
>Alder Lake had access to different cores with and without AVX512 support.

What I read is that you could have a BIOS setting where you could
enable AVX-512 but you could only do that if you disabled the E-cores.
Intel soon provided BIOS updates that eliminated this option.

>Your program will crash if the software detects that it can use AVX512
>and later jumps to another core without AVX512. So they had to disable
>the AVX512 completely, including the new half-precision floating point
>instructions. Did Intel expect the OS to move a thread back to a capable
>core in this case? You cannot expect an OS to give special treatment to
>every new processor on the market with new quirks.

One way to deal with this would be to not report the availability of
AVX-512 (and maybe even disable them) unless the OS tells the CPU in
through a model-specific register that it is prepared for
heterogeneous AVX-512 support.

An OS that is unaware of the issue would just work, without AVX-512,
as is currently the case with Alder Lake; but the option I outlined
would open the possibility for other approaches:

One way an OS could handle the issue is to only tell the CPU that it
should report AVX-512 for processes that are limited to P-cores (with
thread-affinity or somesuch).

Another way would be to always tell the CPU that, and, as you wrote,
to migrate the process back to a P-Core if it executes an AVX-512
instruction.

Intel was not very coordinated when designing Alder Lake. They had
two teams designing cores with different instruction sets, then did
not provide OS workarounds for that, then dealt with the resulting
fallout by first restricting AVX-512 severely, but for whatever reason
even that was not enough for them, and finally they disabled AVX-512
completely.

>Unfortunately, ARM does the same in their big.LITTLE architecture.

A little more coordinated, it seems: All their cores are limited to
128-bit SIMD in SVE, which however eliminates the USP of SVE:
scalability to wider SIMD units.

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary.

I think it's something the programming language design should take
care of. If the language has vectors as a value type, this eliminates
the problem. The Fortran array sub-language is not quite there, but
still pretty good.

Michael S

Feb 3, 2023, 7:41:59 AM
SIMD registers are an architectural feature of Helium/MVE.
They overlap with the scalar FP registers in a manner different from
what is common in the x86 world, but consistent with the way in which
ARMv7-M FPv5 double-precision registers overlap with single-precision.

If anything, the actual execution width (1, 2, or 4 beats per architecture tick)
is more exposed to SW in Helium than in most other SIMD environments.

"The SIMD registers must not be an architectural feature" appears
to be a fashion of comp.arch newsgroup, but it's not a popular
view in wider world.

>
> The other option would be for Intel to implement AVX-512 in the small
> cores without 512-bit units, the same way that AMD has been using time
> and again: XMM=2x64bits on K8, YMM=2x128 bits on later heavy machinery
> and Zen1, and ZMM is 2x256 bits on Zen4.
>

Intel did it many times in the past. All SSEx implementations before Core had
64-bit floating-point execution units. They did it later, too.
Even Silvermont, a direct ancestor of Alder Lake E-cores, had a 64-bit FP multiplier.
Alder Lake was/is just a mess, as you correctly state in other post.
Although somewhat less messy than its predecessor Lakefield.

Michael S

Feb 3, 2023, 7:49:53 AM
Yes, Arm Inc. manages their big.LITTLE combos well.
It was Samsung that did bad things like combining Cortex-A53
as LITTLE with their own Mongoose cores as big, despite the
mismatch in cache line width, which is certainly visible to
application software.
Mongoose is dead now, so happiness returned.

Scott Lurndal

Feb 3, 2023, 8:46:01 AM
Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary. If you want the
>hardware to fix all kinds of bad programming, you will soon find
>yourself down a very deep rabbit hole.

Which is why memmove(3) was standardized decades ago.

Anton Ertl

Feb 3, 2023, 9:49:46 AM
Michael S <already...@yahoo.com> writes:
>"The SIMD registers must not be an architectural feature" appears
>to be a fashion of comp.arch newsgroup, but it's not a popular
>view in wider world.

Yet? Certainly there have been complaints about SIMD width as an
instruction set feature (as in SSE, AVX, AVX-512), and they have
resulted in SVE, and AFAIK the RISC-V vector extension. But with that
the SIMD width is still architectural, which prevents migration to
implementations with a different SIMD width; that's very clear in
Alder Lake, but is also a problem when migrating VMs to different
computers, e.g., migrating an SVE program from a Fujitsu supercomputer
to some ARMv9 server.

Keeping things that you want to be able to change as a
microarchitectural feature rather than turning them into architectural
features has been used successfully in the past. If we could use it
for SIMD/vectors, that would be great. I am not sure how realistic
VVM is, but if it can be realized, the fact that the SIMD size is a
microarchitectural rather than architectural feature is a major
benefit as far as I am concerned (I am not a believer in
auto-vectorization, even if it happens in the hardware and avoids the
aliasing problems).

>Intel did it many times in the past. All SSEx implementations before Core had
>64-bit floating-point execution units. They did it later, too.
>Even Silvermont, a direct ancestor of Alder Lake E-cores, had 64-bit FP multiplier.
>Alder Lake was/is just a mess, as you correctly state in other post.

So the fact that Gracemont (and therefore Alder Lake) does not support
AVX-512 is probably a result of Intel trying to perform market
segmentation with instruction set extensions. If so, apparently Intel
marketing does not understand the market very well.

Anton Ertl

Feb 3, 2023, 10:25:41 AM
Thomas Koenig <tko...@netcologne.de> writes:
>For Fortran, there is a related problem. While arguments cannot
>overlap, the language proscribes that the right-hand side of
>an assignment is evaluated before the actual assignment.
>This leads to generation of temporary arrays for code like
>
> a(n:m) = a(n-1:m-1) + a(n+1:m+1)
>
>or when overlap / non-overlap cannot be detected at compile time.

And what is the problem? How would you define it instead?

I think that Fortran defines this well. I think that it's even better
to have vectors as value types not just for intermediate results of
expressions, but also in variables, but what Fortran does here is a
good intermediate step.

Having the result in different memory than the source is a fine
solution (and, if all of a is overwritten, and a had a value type, it
could even be done without extra copying). But in the present case,
there is another solution (which might be found by your compiler):

if (n<=m) {
    int i;
    double aim1 = a[n-1] + a[n+1];
    for (i=n+1; i<=m; i++) {
        double ai = a[i-1] + a[i+1];
        a[i-1] = aim1;
        aim1 = ai;
    }
    a[i-1] = aim1;
}

You can unroll the loop by a factor of 2 and perform some renaming to
eliminate "aim1 = ai;". You can use SIMD instead of double, with the
same logic otherwise to make use of SIMD.

Your example looks similar to a convolution filter, but those usually
have more than one dimension, which makes the whole optimization
harder (e.g., for a 2D convolution filter you would have to store more
than one line in intermediate storage before copying it back to the
original location). Just having the result in a new location looks
much more attractive then.

BGB

Feb 3, 2023, 1:40:30 PM
IMO:
This isn't a huge loss, as IME the general utility of SIMD drops off
quickly as one goes wider.

So, 4-element vectors, 128-bit vectors, ... can almost be considered as
magic numbers.

Likewise, if one does go to 8-element vectors, it is often "more useful"
to handle them more as "2x 4-elements" than as a homogeneous 8.


Main use-case for 256-bit would be 4x Binary64, but not as much else.
But, if one has a CPU where this has no advantage over 128-bit (because
the CPU is just faking it), well then, 128-bit remains on top.



And, ironically, the main alternate use case for SIMD is memory
copies, which are then "however wide the CPU can natively handle"
(multiplied by a scale of the pipeline latency).

So, for BJX2 in my current cores, this leads to copying 512-bits at a
time being the "most efficient" case for general memcpy. Though larger
blocks can still see an advantage by reducing loop overhead.

Copying 2048 or 4096 bits is enough to basically "hide" looping related
overheads, but is only really "useful" for bulk copying (say, one needs
to copy a buffer of 16K or more before it makes much difference).


For something like an LZ77 decoder, it may make sense to ignore
bulk-copy cases (the majority of matches being relatively short).

In this case, it makes more sense to try to be "most efficient" for
copying in the 4-16 byte range. Copies much over 256 bytes being
relatively uncommon.


Otherwise, I have skepticism about compiler autovectorization being able
to make effective use of wider SIMD types, when (as a programmer) it is
also difficult to make effective use of them.


>> Regarding pointer aliasing:
>> I think it is the responsibility of the programmer to consider pointer alia=
>> sing problems and do a loop backwards if necessary.
>
> I think it's something the programming language design should take
> care of. If the language has vectors as a value type, this eliminates
> the problem. The Fortran array sub-language is not quite there, but
> still pretty good.
>

Would be useful if C gave a little more here.


Ideally, I would like both keywords for aliasing (and unaligned
load/store, etc); along with maybe a way to specify vector types and
vector ops that worked across compilers and platforms, and whether or
not the target actually supports them natively.

Though, GCC's vectors can give at least one of these points.
* Works across targets.
But:
* Not really portable outside of GCC (and Clang)
* Still seems to depend on native SIMD support AFAIK.


For BGBCC, I had defined my own system, with partial compatibility with
GCC's notation. Operations are defined relative to the vector size/type,
rather than whether or not it maps to a native HW vector (though, with
the caveat that trying to use vectors when no SIMD is enabled, will
generate runtime calls...).

...

Thomas Koenig

Feb 3, 2023, 3:52:33 PM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:
>
>> ForwardCom can vectorize array loops in a new way that automatically
>> adjusts to the maximum vector length supported by the CPU. It works
>> in the following way:
>
>> A register used as loop counter is initialized to the array length. This
>> counter is decremented by the vector register length and repeats as
>> long as it is positive. The counter register also specifies the vector
>> register length. The vector registers used in the loop will have the
>> maximum length as long as the counter exceeds this value. There
>> is no loop tail because the vector length is automatically adjusted
>> in the last iteration of the loop to fit the remaining number of array
>> elements.
>
> This isn't completely new. The IBM System/370 at one point added a
> vector feature which was largely modelled on that of the Cray I. But
> unlike the Cray I, it was designed so that one model of the 370 might
> have vector registers with 64 elements, and another one might have
> vector registers with 256 elements.

A colleague used it on an IBM 3090. Compared to the Fujitsu VP
they also had at the computer center, it was as slow as molasses.

[...]

Agner Fog

Feb 4, 2023, 2:34:37 AM
Re: memory management
The main purpose of the memory map in ForwardCom is not virtual address translation, but memory protection. Virtual address translation is not needed in simple well-behaved cases.

> Can an object span across multiple blocks?
Yes, but there would be no point in doing so because blocks can have arbitrary sizes. For example, if a data object spans two blocks of 3 kB and 4 kB, you can join them into one block of 7 kB. You only need to subdivide a data structure if different parts have different access permissions. A typical application has initialized static data, uninitialized static data, stack, and heap. These all have the same access permissions (read and write) so they can be joined into a single block large enough to serve all the needs of the running process for read/write data. A typical application program has three memory blocks with different access permissions: 1. execute only, 2. read only, 3. read and write. If there is a child thread with thread-private memory, then this thread will have a memory map with one extra entry for private data that is not accessible to parent and sibling threads. Thread-private memory is a ForwardCom feature.

ForwardCom has features to predict the maximum size of stack and heap and reserve the calculated size of memory when a program is loaded. If this prediction fails, then the memory becomes fragmented. Then you will need an extra entry in the memory map to access a vacant physical memory block that is not contiguous with the rest, and possibly use virtual address translation to make it appear to be contiguous.

ForwardCom is using various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel has up to 5 levels of page tables.

The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.

Thomas Koenig

Feb 4, 2023, 6:03:26 AM
Agner Fog <ag...@dtu.dk> schrieb:

> ForwardCom has features to predict the maximum size of stack
> and heap and reserve the calculated size of memory when a program
> is loaded.

That sounds like an extremely hard thing to do. What do these
heuristics look like, and how would they estimate (for example)
the heap size of the program

program main
   real, dimension(:,:), allocatable :: a, b, c
   read (*,*) n
   allocate (a(n,n), b(n,n))
   read (*,*) a, b
   c = matmul(a,b)
   print *,c
end

? Do they rely on statistics gathered in earlier runs (sort
of profile-guided feedback, which is not very reliable), or are
they functions of the program itself? What about self-compiled
programs?

> If this prediction fails, then the memory becomes
> fragmented. Then you will need an extra entry in the memory map
> to access a vacant physical memory block that is not contiguous
> with the rest, and possibly use virtual address translation to
> make it appear to be contiguous.

> ForwardCom is using various features to avoid memory
> fragmentation, including an innovative system for function
> libraries. This should keep the number of entries in the memory
> map of each process or thread small enough to fit on the CPU chip
> without the need for TLB and multi-level page tables. A small memory
> map with few variable-size entries is implemented most efficiently
> with comparators. A large memory map for a highly fragmented memory
> cannot use this method, but must use fixed-size pages and large
> page tables. Intel has up to 5 levels of page tables.

It is also a question of the size of the page tables...

Do you envision having page tables as a backup, or will your method
converge to something equivalent to page tables when the application
manages to beat all of your heuristics?

And is the current performance of page tables expected to be better
or worse than what you propose, when your heuristics don't work well?

Really, how well your heuristics work is a central question. I guess
this is not possible without extensive profiling of typical workloads,
where the question of what "typical" is is always problematic.

Very probably, profiling existing applications on other processors
can help a lot there.

> The memory map needs to be saved and restored on context
> switches. It will typically be smaller than the register file
> so that it can be saved efficiently. The CPU may or may not have
> multiple memory maps on-chip to switch between.

So, if the memory map exceeds the (probably) fixed number of
entries that are on the chip, what happens?

Thomas Koenig

Feb 4, 2023, 7:18:02 AM
MitchAlsup <Mitch...@aol.com> schrieb:

> Also note: The C (or Fortran) source has manually unrolled this loop 4
> times. This just multiplies the number of instructions in the loop by
> 4 while leaving a single LOOP instruction for iteration control.

Is such unrolling actually (likely to be) helpful in a VVM
implementation for speed?

Agner Fog

Feb 4, 2023, 7:20:41 AM
Thomas Koenig wrote:
> how would they estimate (for example) heap size of the program?
>Do they rely on statistics gathered in earlier runs

The maximum stack size is calculated by the compiler and the linker. In case of deeply recursive functions, the programmer may specify an estimated maximum recursion level.

And yes, the maximum heap size relies on statistics gathered in earlier runs. The programmer may add a suggested value.

If a heap grows beyond the estimated maximum, then the OS may add a large extra block of memory, perhaps the double of the previous size. Assume, for example, that a program is using 4 memory map entries when it is loaded, and the hardware has space for 256 entries, then the heap can grow to 2^252 times the initial size before we run out of memory map entries. This will never happen even if we use disk space as virtual memory.

There can be other causes of filling the memory map, though. The most likely is demand-paged memory. The use of demand-paged memory is strongly discouraged in ForwardCom. Another possibility is starting and ending so many processes that the entire memory space gradually becomes totally fragmented.

> So, if the memory map exceeds the (probably) fixed number of entries that are on the chip, what happens?
The best solution will probably be to start a garbage collector that can suspend running processes for a short time while rearranging the fragmented memory it is using. A fallback to fixed-size page tables is possible if the hardware supports it, but that is the expensive solution we are trying to avoid.

The performance problem of garbage collection exists in current systems as well. The programmer of a performance-critical memory-hungry program will take care to avoid garbage collection. This can be done by allocating a large memory block when starting, and by recycling allocated memory. Another solution is to start a new thread when a large block of memory is needed, for example in a video editing program. The main thread will not need an extra memory map entry if the new memory block is private to the new thread.

A memory map with space for, for example, 256 entries will require at most 512 comparators. This will be much more efficient than a traditional TLB with millions of pages in multiple levels, and we will have no TLB misses.
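A rough sketch of what such a comparator-based lookup does, in C (the structure and field names are invented for illustration; in hardware all entries are compared in parallel rather than in a loop):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base, limit;   /* virtual address range [base, limit) of one block */
    int64_t  offset;        /* virtual-to-physical displacement for that block */
    uint32_t perms;         /* read/write/execute permission bits */
} MapEntry;

bool translate(const MapEntry *map, int n, uint64_t va,
               uint64_t *pa, uint32_t *perms)
{
    for (int i = 0; i < n; i++) {            /* hardware: all comparisons at once */
        if (va >= map[i].base && va < map[i].limit) {
            *pa    = va + (uint64_t)map[i].offset;
            *perms = map[i].perms;
            return true;
        }
    }
    return false;                            /* no matching entry: access fault */
}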



Quadibloc

Feb 4, 2023, 8:03:57 AM
It will be interesting to hear Mitch's answer to this.

I would be (initially) inclined to say yes. Why?

Mitch's VVM instructions _look_ like ordinary loop instructions. But
the processor treats them as vector instructions instead.

To me, that implies they *take the place* of Cray-like vector
instructions, removing complexity from the instruction set.
That would mean that they have _less_ overhead than a regular
loop, but not that they have _zero_ overhead. And zero
overhead is what an unrolled loop has, right?

And that's where I see why I'm wrong. Compared to an ideal
vector instruction, an unrolled loop certainly does have
"overhead" - all those additional times you fetch and decode
the instructions for the later copies of the loop. (The execute,
of course, you have to do anyways.)

But as long as even a loop with VVM has some nonzero
amount of overhead, it's certainly possible that unrolling it
a very _small_ number of times might still be an optimization.
However, that wouldn't be true if the small number was less
than 2, and given Mitch's description of how he intends
VVM to be implemented, it does seem as though
this would be quite possible.

John Savard

EricP

Feb 4, 2023, 11:19:22 AM
Agner Fog wrote:
> Re: memory management
>
> ForwardCom is using various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel has up to 5 levels of page tables.
>
> The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.

A hardware table of comparators is more than double the cost per entry
of a TLB binary CAM. A binary CAM equality compare per entry is a bunch
of XORs and an AND gate. A binary range comparator (low <= addr <= upr)
per entry is two bunches of XORs plus a priority selector (find-last-set),
so double the wire loading on the compare value bus, more than
double the power consumption per entry, plus longer propagation delay.

Real TLBs for radix page tables can have multiple CAMs to cache
table interior nodes, allowing a reverse page table walk to speed
TLB miss resolution, which adds to the whole TLB cost.

Also each EXE or DLL has its own set of virtual memory section
requirements, plus all the DLLs referenced by other DLLs,
and any of these virtual sections may be relocated when created.
A single instance of a running program can have dozens to hundreds
of memory sections, each with its range of addresses and attributes.

A HW range table would have a limited number of entries, so only a
subset of a program's memory section entries can be loaded at once.
How would a range table miss work?

How does virtual to physical relocation work for this range table?
A radix page table TLB does relocation by substitution so there is
zero cost beyond the TLB lookup for a table hit.
A range table can't use substitution and must use arithmetic relocation,
which costs an N-bit addition propagation delay above the table lookup.
For a 64-bit address this could add a clock to every memory access.
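
A minimal C sketch of the two translation styles being contrasted here, with invented names; it only illustrates the arithmetic, not any particular hardware:

#include <stdint.h>

/* Radix page table / TLB style: the physical frame number is substituted
   for the virtual page number, so translation is just bit concatenation. */
static uint64_t translate_paged(uint64_t vaddr, uint64_t phys_frame, unsigned page_bits) {
    uint64_t offset = vaddr & ((1ULL << page_bits) - 1);
    return (phys_frame << page_bits) | offset;   /* no carry chain */
}

/* Range/segment style: the physical address is the physical base plus the
   offset inside the range, i.e. a full-width addition whose carry
   propagation sits on the memory access path. */
static uint64_t translate_range(uint64_t vaddr, uint64_t virt_base, uint64_t phys_base) {
    return phys_base + (vaddr - virt_base);      /* N-bit add */
}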

How would paging work for the range table? Hardware needs to track which
pages in each memory section are present, or trigger a page fault if not.
Where are these page Present bits stored for each memory section,
or is this going to use whole-segment faulting/swapping?


Agner Fog

unread,
Feb 4, 2023, 11:39:16 AM2/4/23
to
On 4 February 2023 at 17:19:22 UTC+1, EricP wrote:
> Agner Fog wrote:
> > Re: memory management
> A hardware table of comparators is more than double the cost per-entry
> as a TLB binary CAM.

Yes, but there are fewer entries

> Real TLBs for radix page tables can have multiple CAMs to cache
> table interior nodes, allowing a reverse page table walk to speed
> TLB miss resolution, which adds to the whole TLB cost.

No page tables = no table walk

> Also each EXE or DLL has its own set of virtual memory section
> requirements, plus all the DLLs referenced by other DLLs,
> and any of these virtual sections may be relocated when created.
> A single instance of a running program can have dozens to hundreds
> of memory sections, each with its range of addresses and attributes.

Yes, exactly. ForwardCom has no DLLs, but a new library system with a relink feature. This removes the biggest source of memory fragmentation and avoids a lot of waste of partially used pages. See https://www.forwardcom.info/forwardcom_libraries.php

> How would paging work for the range table? Hardware needs to track which
> pages in each memory section are present, or trigger a page fault if not.

Usually, all memory used by a process will be loaded when the process is loaded. If a memory block is too big for loading it all at once, it needs one entry in the memory map for the present part and one entry for the absent part.

MitchAlsup

unread,
Feb 4, 2023, 12:53:09 PM2/4/23
to
I guess I failed to actually address the point I was trying to make::
<
Many of the BLAS subroutines were hand unrolled (typically 4 times).
<
Is this something the programmer should be doing in this day and age,
or is this something that is better left to the compiler ???
<
Many of the BLAS matrix subroutines are partitioned into 4 loops,
one for each kind of transpose, to optimize memory performance.
<
Is this something the programmer should be doing in this day and age,
or is this something that is better left to the compiler ???
<
-------------------------------------------------------------------------------------------------------------
<
As far as loop unrolling and VVM goes, unrolled loops provide no
impediment to VVM as long as the unrolled loop still fits in the
VVM buffering (which is implementation defined.)

Stephen Fuld

unread,
Feb 4, 2023, 1:55:07 PM2/4/23
to
On 2/4/2023 9:53 AM, MitchAlsup wrote:
> On Saturday, February 4, 2023 at 6:18:02 AM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
> <
>>> Also note: The C (or Fortran) source has manually unrolled this loop 4
>>> times. This just multiplies the number of instructions in the loop by
>>> 4 while leaving a single LOOP instruction for iteration control.
> <
>> Is such unrolling actually (likely to be) helpful in a VVM
>> implementation for speed?
> <
> I guess I failed to actually address the point I was trying to make::
> <
> Many of the BLAS subroutines were hand unrolled (typically 4 times).
> <
> Is this something the programmer should be doing in this day and age,
> or is this something that is better left to the compiler ???

I guess probably the compiler.


> Many of the BLAS matrix subroutines are partitioned into 4 loops,
> one for each kind of transpose, to optimize memory performance.

> <
> Is this something the programmer should be doing in this day and age,
> or is this something that is better left to the compiler ???
> <

I am less sure about this. Loops are pretty general, so the question of
loop unrolling has wide applicability. Matrix stuff is less general, so
the effort in the compiler would have less of a payback.


-------------------------------------------------------------------------------------------------------------
> <
> As far as loop unrolling and VVM goes, unrolled loops provide no
> impediment to VVM as long as the unrolled loop still fits in the
> VVM buffering (which is implementation defined.)

While I agree with this, I think that VVM reduces the benefits of loop
unrolling. i.e. the loop overhead is only one instruction, and there is
no cost for the non-sequential control transfer.

mac

unread,
Feb 4, 2023, 2:45:50 PM2/4/23
to
> But if your point was to illustrate that C has bad semantics
> wrt array aliasing, I have to agree. The Fortran version has
> no loop carried dependence.

Strictly speaking, the Fortran version *ignores* any loop-carried
dependence

MitchAlsup

unread,
Feb 4, 2023, 2:48:34 PM2/4/23
to
Only because the language explicitly puts memory aliasing on the
shoulders of the programmer. The compiler has been freed of the
burden of detecting memory aliasing explicitly by the language in
certain circumstances.

Thomas Koenig

unread,
Feb 4, 2023, 4:15:59 PM2/4/23
to
Agner Fog <ag...@dtu.dk> wrote:
> Thomas Koenig wrote:
>> how would they estimate (for example) heap size of the program?
>>Do they rely on statistics gathered in earlier runs
>
> The maximum stack size is calculated by the compiler and the
> linker.

Hmm... "calculated" is a strong word. Is it then necessary to make
a complete call graph, and to estimate the recursion depth?
This sounds very hard to do reliably, as compiler estimates
of levels of recursion are likely to be as wrong as programmer's
estimates :-)

>In case of deeply recursive functions, the programmer may
>specify an estimated maximum recursion level.

Ah, the days of the REGION parameter of JCL... (but slurm isn't
better, it is just the syntax that is different).

There is actually another reason to use stack instead of dynamic
memory: Speed. Allocating arrays on the stack instead of using
dynamic memory has several advantages: No overhead for malloc/free(),
and unused values on the stack are much more likely to be overwritten
by new values, so they no longer take up stack space.

This is why gfortran, for example, also puts arrays on the stack
up to a certain limit at -Ofast.

Your scheme is actually better than a hard error on an unreasonably
low value as a stack size, which some operating systems impose.
MacOS is particularly egregious in this respect, especially
regarding pthreads...

> And yes, the maximum heap size relies on statistics gathered in
> earlier runs. The programmer may add a suggested value.
> If a heap grows beyond the estimated maximum, then the OS may add
> a large extra block of memory, perhaps the double of the previous
> size.

OK...

>Assume, for example, that a program is using 4 memory map
> entries when it is loaded, and the hardware has space for 256
> entries,

256 entries for each process, each consisting of a 64-bit pointer
and a 64-bit offset, would be quite large (4 kB), also compared to
a reasonable register set. However, you obviously don't need that
(I assume that you use 64-bit addresses).

> then the heap can grow to 2^252 times the initial size
> before we run out of memory map entries. This will never happen
> even if we use disk space as virtual memory.

... and if your architecture can even address that much memory.

>
> There can be other causes of filling the memory map, though. The
> most likely is demand-paged memory. The use of demand-paged memory
> is strongly discouraged in ForwardCom.

What is then your strategy for when virtual memory exceeds physical
memory? Swapping out whole processes? That has been replaced by
paging for interactive systems for good reasons.

What is the strategy if the size of a single process exceeds total physical
memory, but (as is often the case) only a small part of that memory is in use?
(Having recently done a Debug build of clang, which took 20 GB on my
16GB home machine, I had reason to observe that the behavior, while not
great, was not too bad for responsiveness).

What is your strategy for memory-mapped I/O?

> Another possibility is
> starting and ending so many processes that the entire memory space
> gradually becomes totally fragmented.

Would you run into trouble over time if the processes finish again,
or would that set things right?

>> So, if the memory map exceeds the (probably) fixed number of entries that are on the chip, what happens?
> The best solution will probably be to start a garbage collector
> that can suspend running processes for a short time while
> rearranging the fragmented memory it is using. A fallback to
> fixed-size page tables is possible if the hardware supports it,
> but that is the expensive solution we are trying to avoid.

Sounds like a thing better to avoid, indeed...

> The performance problem of garbage collection exists in current
> systems as well. The programmer of a performance-critical
> memory-hungry program will take care to avoid garbage
> collection. This can be done by allocating a large memory block
> when starting, and by recycling allocated memory. Another solution
> is to start a new thread when a large block of memory is needed,
> for example in a video editing program. The main thread will not
> need an extra memory map entry if the new memory block is private
> to the new thread.

> A memory map with space for, for example,
> 256 entries will require at most 512 comparators. This will be much
> more efficient than a traditional TLB with millions of pages in
> multiple levels, and we will have no TLB misses.

Sounds like an interesting concept, but I still have doubts that
it will actually work. How far along are you in demonstrating
this? Given enough resources (probably a few PhDs would not be
enough) it might be possible to put such a system in a current
operating system (Linux kernel?) and try out how it works. Or,
maybe more reasonably, it might be possible to modify the kernel
to gather enough data so that a simulator of your system can be fed.

How far along are you in verifying that your system works?

MitchAlsup

unread,
Feb 4, 2023, 4:32:04 PM2/4/23
to
Consider a web browser which is intended to be initialized and then run for years without ever being turned off. Are you handling pages in the browser as unique processes, or as additions to the current process ??
<
Then consider a database engine with 10K users and supporting a 100GB database::
Each user having to be totally isolated from every other user:: Are 256 mapping entries anywhere close to what is needed for the <common> DB engine
<
Consider the context switching overhead of migrating 256 quadword entries from process to process. Are you still capable of hard real time performance ?
<
Have you considered that each mapping entry might want its own ASID so that shared memory is accessed through a common ASID ??

Agner Fog

unread,
Feb 5, 2023, 2:55:52 AM2/5/23
to
Thomas Koenig wrote:
>> The maximum stack size is calculated by the compiler and the linker.

>Hmm... "calculated" is a strong word. Is it then necessary to make
>a complete call graph, and to estimate the recursion depth?

If library function A calls functions B and C, then the object file for A includes sufficient information to calculate the difference between the stack pointer at B and at A, and between the stack pointer at C and at A. This includes, of course, fixed-size arrays on the stack. The linker sums this information and finds the call path with the deepest stack use.
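
A minimal C sketch of that summation, assuming an acyclic call graph where each function's own frame size is known. The names and data layout are invented; recursive calls would need the programmer-supplied depth estimate mentioned earlier:

#include <stddef.h>

typedef struct func {
    size_t frame_size;            /* stack used by this function itself */
    const struct func **callees;  /* NULL-terminated list of callees */
} func;

/* Linker-style pass: own frame plus the deepest callee path. */
static size_t max_stack(const func *f) {
    size_t deepest = 0;
    if (f->callees)
        for (const func **c = f->callees; *c; c++) {
            size_t d = max_stack(*c);
            if (d > deepest) deepest = d;
        }
    return f->frame_size + deepest;
}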

alloca in C makes variable-size arrays on the stack. This is rarely used. The programmer may specify an estimated maximum size when using alloca, or the system can use the same strategy as for the heap.

>This is why gfortran, for example, also puts arrays on the stack up to a certain limit at -Ofast.

Most programming languages put fixed-size arrays on the stack.

>256 entries for each process, each consisting of a 64-bit pointer
>and a 64-bit offset, would be quite large (4 kB), also compared to
>a reasonable register set. However, you obviously don't need that

Most processes would use only 3 entries. The rest is reserved for the worst case. You don't need to save the unused entries.

>What is the strategy if the size of a single process exceeds total physical
>memory, but (as is often the case) only a small part of that memory is in use?

We would have to divide the process's memory into a few blocks, some of which are swapped out.

>What is your strategy for memory-mapped I/O?

The user process is calling a system function. The system call instruction has a feature for sharing a block of memory. The system function does not have access to everything, as current systems have, but only to the memory block shared by the application. The system function may use DMA or whatever the system designer sees fit.

>> Another possibility is starting and ending so many processes that the entire memory space
>> gradually becomes totally fragmented.

>Would you run into trouble over time if the processes finish again,
>or would that set things right?

The OS will collect as much garbage as it can without disturbing running processes.

>Sounds like an interesting concept, but I still have doubts that it will actually work.

You probably cannot avoid TLB and fixed-size pages in large multi-user servers, but you can in small systems for dedicated purposes. There is so much to gain by avoiding TLB and multilevel page tables that this is something worth pursuing. ForwardCom does not ban TLB, but many niche applications can certainly work without it. I don't see ForwardCom replacing x86 or ARM in the near future because users need backward compatibility. The first uses will probably be research, embedded FPGA systems, single-purpose niche products, and large vector processors. Some wasteful programming habits have to go if you want top performance. At least you should avoid memory mapped files.

>How far along are you in verifying that your system works?

The library system and relinking mechanism works already. This removes the biggest source of memory fragmentation. The rest is playground for whoever needs an interesting university project.


MitchAlsup wrote:
>Consider a web browser which is intended to be initialized and then run for years without every
>being turned off. Are you handling pages on the browser as unique processes,
>or as additions to the current process ??

Each thread has its own memory map. You can put each browser page in a separate thread.

>Then consider a database engine with 10K users and supporting a 100GB database::
>Each user having to be totally isolated from every other user::
>Are 256 mapping entries anywhere close to what is needed for the <common> DB engine

You probably wouldn't use this for a many-user server, as I explained above.
But even if you have multiple users, you would still service each user in a separate process or a separate thread with its own small memory map.

>Consider the context switching overhead of migrating 256 quadword entries from process to process.
>Are you still capable of hard real time performance ?

99% of processes will have 2-5 memory map entries. You don't save unused entries. The hardware may still have space for many entries to deal with rare worst-case events.

>Have you considered that each mapping entry might want its own ASID so that shared memory is
>accessed through a common ASID ??

Each process and each thread has its own little memory map. The map of which process has access to what is not stored inside the CPU. Transfer of large data from one process to another or between system and a user process goes through a memory block shared between the two.

Thomas Koenig

unread,
Feb 5, 2023, 6:46:09 AM2/5/23
to
Agner Fog <ag...@dtu.dk> wrote:
> Thomas Koenig wrote:
>>> The maximum stack size is calculated by the compiler and the linker.
>
>>Hmm... "calculated" is a strong word. Is it then necessary to make
>>a complete call graph, and to estimate the recursion depth?
>
> If library function A calls functions B and C, then the object file for A includes sufficient information to calculate the difference between the stack pointer at B and A, and between C and A. This includes, of course, fixed-size arrays on the stack. The linker is summing this information and finds the branch with the deepest stack use.
>
> alloca in C makes variable-size arrays on the stack. This is
> rarely used.

Well, alloca is not part of the standardized C language, it is an
extension.

>The programmer may specify an estimated maximum size
> when using alloca, or the system can use the same strategy as for
> the heap.

alloca is not part of the standardized C language proper (and systems
have IMHO unreasonable restrictions on stack size). Plus, a stack
overflow is usually not handled gracefully.

>>This is why gfortran, for example, also puts arrays on the stack up to a certain limit at -Ofast.
>
> Most programming languages put fixed-size arrays on the stack.

I was not talking about fixed-size arrays only.

In modern Fortran (since 1991) you can use

subroutine foo(a,n)
real, dimension(n), intent(inout) :: a
real, dimension(n) :: b

which declare a variable-sized array, which the compiler can put
on the heap or the stack. Since Fortran 2008, there is also

n = ...
block
real, dimension(n) :: a
end block

C99 later had something like this as variable-length arrays, but
they were declared optional in subsequent revisions of the C standard,
and if I remember correctly, they are not supported in Microsoft C.

>>256 entries for each process, each consisting of a 64-bit pointer
>>and a 64-bit offset, would be quite large (4 kB), also compared to
>>a reasonable register set. However, you obviously don't need that
>
> Most processes would use only 3 entries. The rest is reserve
> for worst case. You don't need to save the unused entries.

OK.

>>What is the strategy if the size of a single process exceeds total physical
>>memory, but (as is often the case) only a small part of that memory is in use?
>
> We would have to divide the process's memory into a few blocks,
> some of which are swapped out.

So, this would then somewhat look like a system with huge pages.

Hm, then your system would be a bit like having huge pages (1 GB
or something like that) with an additional size limitation?

>>What is your strategy for memory-mapped I/O?
>
> The user process is calling a system function. The system call
> instruction has a feature for sharing a block of memory. The system
> function does not have access to everything, as current systems
> have, but only to the memory block shared by the application. The
> system function may use DMA or whatever the system designer

So, each shared memory block would then have its own entry in the
table for each of the processes, correct?

Do I see it correctly that an application which, for whatever reason,
opened 1000 files via memory-mapped I/O at the same time would run
out of descriptors?

Same thing could apply for shared memory between processes.

>>> Another possibility is starting and ending so many processes that the entire memory space
>>> gradually becomes totally fragmented.
>
>>Would you run into trouble over time if the processes finish again,
>>or would that set things right?
>
> The OS will collect as much garbage as it can without disturbing running processes.

OK.

>
>>Sounds like an interesting concept, but I have till have doubts that it will actually work.
>
> You probably cannot avoid TLB and fixed-size pages in large
> multi-user servers, but you can in small systems for dedicated
> purposes. There is so much to gain by avoiding TLB and multilevel
> page tables that this is something worth pursuing. ForwardCom does
> not ban TLB, but many niche applications can certainly work without
> it. I don't see ForwardCom replacing x86 or ARM in any near future
> because users need backward compatibility.

Source code compatibility would be the first step; for this you
need a compiler. I read that there is currently none available;
a port to either gcc or LLVM would be a logical first step.
Is work ongoing on that?

> The first uses will
> probably be research, embedded FPGA systems, single-purpose niche
> products, and large vector processors. Some wasteful programming
> habits have to go if you want top performance. At least you should
> avoid memory mapped files.

Memory mapped files: They would be wasteful on your architecture,
but I'm not aware that they are wasteful on what is in use today
(or are they?).

Research, sure. FPGA/embedded systems: Possible, you also don't
necessarily need an operating system for that. Large vector
processors: For a cluster, you would need access to a pretty
full operating system including networked file access, shells etc.
Job submission and cross-compilation could then be done on special
servers, and people using HPC are used to specifying maximum memory
requirements via slurm (or whatever system they are using) anyway.

So, what about a port of a reasonably fully-featured OS? Linux
or one of the BSD variants comes to mind.

Regarding special applications: I think today's situation of binary
compatibility between largely different classes of computers,
from laptops to supercomputers, has shown that a general-purpose
architecture has the greatest potential.

>>How far along are you to verifying that your system works?
>
> The library system and relinking mechanism works already. This
> removes the biggest source of memory fragmentation. The rest is
> playground for whoever needs an interesting university project.

In other words, you have removed some causes of memory
fragmentation, but there is no (semi-)hard data on the effectiveness
of your solutions for real-world applications. This will make it
hard to convince sceptics that your model would actually work,
I'm afraid.

Quadibloc

unread,
Feb 5, 2023, 9:23:25 AM2/5/23
to
On Saturday, February 4, 2023 at 9:19:22 AM UTC-7, EricP wrote:

> A single instance of a running program can have dozens to hundreds
> of memory sections, each with its range of addresses and attributes.

I see I misread this when I first saw it, quoted in another post.

This is simply a description of how memory management is carried out
on a modern CPU.

But when I was under my misapprehension, I was visualizing some
mechanism, which I could not understand how to achieve, whereby
a program, written in re-entrant code, could behave as 64-bit code
when called from 64-bit code, and as 32-bit code when called from
32-bit code.

I mean, _perhaps_ it could be achieved if there were machine
instructions that worked on "current-address length integers"
in addition to the existing ones that explicitly work on 32-bit
or 64-bit integers...

John Savard

Quadibloc

unread,
Feb 5, 2023, 9:35:32 AM2/5/23
to
On Sunday, February 5, 2023 at 12:55:52 AM UTC-7, Agner Fog wrote:

> You probably cannot avoid TLB and fixed-size pages in large
> multi-user servers, but you can in small systems for dedicated
> purposes. There is so much to gain by avoiding TLB and
> multilevel page tables that this is something worth pursuing.

Well, _that_ tells me why it is *not* being pursued.

Or, rather, why it is only being pursued on small, special-purpose
chips intended for embedded applications.

Because a _general-purpose_ microprocessor, intended to compete
with x86, ARM, RISC-V and other systems like 680x0 and PowerPC,
and SPARC and z/Architecture and Alpha... is, _of course_ supposed
to be able to be used as the CPU in a large multi-user server.

While it might seem perverse to impose that as a requirement for
the processor that goes into a *smartphone*, we do it because we
can, whether or not we should, and we reap the benefit that our
smartphones can run cooler applications than the other guy's.

Or, in other words, seeking a technical solution to a market
problem... can still work, but only if you know that's what you're
trying to do.

John Savard

Scott Lurndal

unread,
Feb 5, 2023, 11:31:37 AM2/5/23
to
Agner Fog <ag...@dtu.dk> writes:
>Thomas Koenig wrote:
>>> The maximum stack size is calculated by the compiler and the linker.
>
>>Hmm... "calculated" is a strong word. Is it then necessary to make
>a complete call graph, and to estimate the recursion depth?
>

>alloca in C makes variable-size arrays on the stack. This is rarely used.

While alloca() isn't particularly common, I wouldn't call it rare. VLAs,
on the other hand, are quite often used, via extensions on gcc, and
standard with C++.

In any case, you seem to expect that all the code for your ISA
will be de novo, which pretty much makes your new ISA irrelevant.

Anton Ertl

unread,
Feb 5, 2023, 12:57:48 PM2/5/23
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Sunday, February 5, 2023 at 12:55:52 AM UTC-7, Agner Fog wrote:
>
>> You probably cannot avoid TLB and fixed-size pages in large
>> multi-user servers, but you can in small systems for dedicated
>> purposes. There is so much to gain by avoiding TLB and
>> multilevel page tables that this is something worth pursuing.

Can you elaborate on and quantify the gains of avoiding TLBs and page
tables, compared to your scheme?

>Because a _general-purpose_ microprocessor, intended to compete
>with x86, ARM, RISC-V and other systems like 680x0 and PowerPC,
>and SPARC and z/Architecture and Alpha... is, _of course_ supposed
>to be able to be used as the CPU in a large multi-user server.
>
>While it might seem perverse to impose that as a requirement for
>the processor that goes into a *smartphone*, we do it because we
>can, whether or not we should, and we reap the benefit that our
>smartphones can run cooler applications than the other guy's.

It seems to me that the segmented approach suggested by Agner Fog will
not be able to accommodate single-user PCs, either, because they have a
large number of processes, and not all of them fit the simple
three-segment model he assumes is enough for most processes.

Looking at my Laptop running Ubuntu 22.04 with 140 days of uptime,

ps aux|wc -l

outputs 460, telling me that it has 459 processes running; however,

pstree -hup|wc -l

outputs 1757 (lines, with there being at least one process per line).

Another number is the result of

I am not sufficiently familiar with modern Linux stuff to explain
these differences, but my guess is that they have to do with
containers.

Next, let's look at the VMAs of the processes (in /proc/[0-9]*, which
gives the same number as "ps aux").

wc -l /proc/[0-9]*/maps|sort -n|tail -2

tells me that there are 96021 VMAs (virtual memory areas, which may be
mapped to Agner Fog's memory management units) across all processes,
and there is one process (a Firefox process) with 5334 VMAs. There
are 264 processes with 0 VMAs reported in maps (probably kernel
processes), the rest start at 23 VMAs, and there are 104 processes
with more than 256 VMAs.

I would not expect that smartphones are much closer to the kind of
usage that fits Agner Fog's model than my laptop.

There have certainly been some changes since paging and page tables
were introduced, and some requirements have mostly vanished (in
particular, paging to swap space), but additional uses have been
introduced (e.g., mmaping files), and the ability to deal with
fragmentation is as relevant as ever.

A major advantage of paging over Agner Fog's approach is that
anonymous pages that have not been written to just cost a page table
entry, but otherwise not physical memory. This results in significant
differences between the virtual memory size and the resident set size
(actual memory used):

ps aux|grep -v ^USER|awk '{print $5" v +! "$6" r +!"} END {print "v @ 1024 / . r @ 1024 / . cr bye"}'|gforth -e "variable v variable r 0 v ! 0 r !"

On my laptop this shows a total virtual memory size (across all
processes reported by ps) of 218971MB (this machine has 40GB RAM), and
a resident set size of 9918MB. "free" reports that 17MB are used;
maybe the rest is used in containers. With Agner Fog's approach,
>200GB RAM would be necessary, or some of these processes would have
to be swapped out (none are), or the software would have to be
rewritten to consume less virtual memory (but that has its own costs).

MitchAlsup

unread,
Feb 5, 2023, 4:09:07 PM2/5/23
to
On Sunday, February 5, 2023 at 1:55:52 AM UTC-6, Agner Fog wrote:
> Thomas Koenig wrote:
> >> The maximum stack size is calculated by the compiler and the linker.
>
> >Hmm... "calculated" is a strong word. Is it then necessary to make
> >a complete call graph, and to estimate the recursion depth?
> If library function A calls functions B and C, then the object file for A includes sufficient information to calculate the difference between the stack pointer at B and A, and between C and A. This includes, of course, fixed-size arrays on the stack. The linker is summing this information and finds the branch with the deepest stack use.
>
> alloca in C makes variable-size arrays on the stack. This is rarely used. The programmer may specify an estimated maximum size when using alloca, or the system can use the same strategy as for the heap.
> >This is why gfortran, for example, also puts arrays on the stack up to a certain limit at -Ofast.
<
> Most programming languages put fixed-size arrays on the stack.
<
> >256 entries for each process, each consisting of a 64-bit pointer
> >and a 64-bit offset, would be quite large (4 kB), also compared to
> >a reasonable register set. However, you obviously don't need that
<
> Most processes would use only 3 entries. The rest is reserve for worst case. You don't need to save the unused entries.
<
I don't see it like that
<
I see a text section (execute only); a rodata section that is read/execute only to this
thread but writable by the dynamic linker; bss, which is initialized to zero but read
and write to this thread; data, which is initialized and read and write; a stack that is read and
write; and a heap that is read and write. Some systems treat the stack as "read-zero
if not initialized" and afterwards read and write.
<
Many systems include a global offset table,....
<
So, {text, rodata, bss, data, stack, heap, GOT} is certainly bigger than 3 segments.
<
> >What is the strategy if the size of a single process exceeds total physical
> >memory, but (as is often the case) only a small part of that memory is in use?
> We would have to divide the process's memory into a few blocks, some of which are swapped out.
> >What is your strategy for memory-mapped I/O?
<
> The user process is calling a system function. The system call instruction has a feature for sharing a block of memory. The system function does not have access to everything, as current systems have, but only to the memory block shared by the application. The system function may use DMA or whatever the system designer sees fit.
<
Memory Mapped I/O space is strongly ordered in My 66000
<
> >> Another possibility is starting and ending so many processes that the entire memory space
> >> gradually becomes totally fragmented.
>
> >Would you run into trouble over time if the processes finish again,
> >or would that set things right?
> The OS will collect as much garbage as it can without disturbing running processes.
> >Sounds like an interesting concept, but I have till have doubts that it will actually work.
> You probably cannot avoid TLB and fixed-size pages in large multi-user servers, but you can in small systems for dedicated purposes. There is so much to gain by avoiding TLB and multilevel page tables that this is something worth pursuing.
<
Page tables where one can skip arbitrary levels address most of the multi-level
page table problems.
<
> ForwardCom does not ban TLB, but many niche applications can certainly work without it. I don't see ForwardCom replacing x86 or ARM in any near future because users need backward compatibility. The first uses will probably be research, embedded FPGA systems, single-purpose niche products, and large vector processors. Some wasteful programming habits have to go if you want top performance. At least you should avoid memory mapped files.
<
You should design the system so that the consumer of said system has as little
problem porting SW as possible. This argues for multi-level paging and memory
mapped files.
<
>

Josh Vanderhoof

unread,
Feb 5, 2023, 5:08:14 PM2/5/23
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Quadibloc <jsa...@ecn.ab.ca> writes:
>>On Sunday, February 5, 2023 at 12:55:52 AM UTC-7, Agner Fog wrote:
>>
>>> You probably cannot avoid TLB and fixed-size pages in large
>>> multi-user servers, but you can in small systems for dedicated
>>> purposes. There is so much to gain by avoiding TLB and
>>> multilevel page tables that this is something worth pursuing.
>
> Can you elaborate on and quantify the gains of avoiding TLBs and page
> tables, compared to your scheme?

Back in the 90's I wrote a dos extender that ran with paging off. It
was a little faster than Linux and Windows running a Doom clone I wrote.
Around 5-10%. Measurable but not earth shattering.

Agner Fog

unread,
Feb 6, 2023, 2:21:36 AM2/6/23
to
Thomas Koenig wrote:
>>>What is your strategy for memory-mapped I/O?

>> The user process is calling a system function. The system call
>> instruction has a feature for sharing a block of memory. The system
>> function does not have access to everything, as current systems
>> have, but only to the memory block shared by the application. The
>> system function may use DMA or whatever the system designer

>So, each shared memory block would then have its own entry in the
>table for each of the processes, correct?

There is no "the table". Each thread has its own memory map.
You only add one entry to the map of each thread.

>Do I see it correctly that an application which, for whatever reason,
>opened 1000 files via memory-mapped I/O at the same time would run
>out of descriptors?

The shared memory block belongs to the application process. The process needs no
extra map entry. When you call the system function and enter system mode,
you switch to a system map with one entry for the memory block shared by the process,
and one entry for the DMA I/O area.
You cannot call multiple I/O functions simultaneously unless they are in separate
threads. Each thread has its own memory map.

>Same thing could apply for shared memory between processes.
This case is discussed in chapter 9.4 p. 123 in the ForwardCom manual.
Several possible solutions are discussed there.


Scott Lurndal wrote:
>In any case, you seem to expect that all the code for your ISA
>will be de novo, which pretty much makes your new ISA irrelevent.

forwardcom.info:
"Starting from scratch and making a complete vertical redesign allows us to learn
from the history of past mistakes and get rid of the heritage of old quirks that
hamper contemporary computer systems."

This is an academic exercise. What would an optimal computer system look like if
we can get rid of all old garbage? Today's computers suffer from designs that were
made in the old days where priorities were very different from what they are today,
as well as from the consequences of short-sighted patches and decisions made for
short-time marketing priorities.

Right now, it looks like a ForwardCom without TLB will be optimal for many applications
in dedicated systems, while a ForwardCom with TLB and page tables will be optimal for
the most demanding server applications.


Anton Ertl wrote:
>Can you elaborate on and quantify the gains of avoiding TLBs and page
>tables, compared to your scheme?

Look at how much silicon space is used by the TLB, the size of the
multi-level page tables, the time it takes to walk through these tables,
the costs of TLB misses, and the costs of cache misses due to data being
scattered all over the memory space.

Memory access is the limiting factor for the performance in most systems today.
The methods for making memory contiguous in ForwardCom can lead to very
significant gains whether or not you have a TLB and page tables.


>tells me that there are 96021 VMAs (virtual memory areas, which may be
>mapped to Agner Fog's memory management units) across all processes,
>and there is one process (a Firefox process) with 5334 VMAs.

I suspect that most of these VMAs are shared objects, which ForwardCom
doesn't have, and memory mapped files, which are also avoided in ForwardCom.

A firefox process with 5334 VMAs is obviously wasteful and suboptimal.
I guess it puts every object and every image in a separate memory mapped file.

Your example shows indeed how wasteful current systems are. Firefox has
probably put every little 1kB icon on its own 4kB memory mapped page (or 2 MB?),
scattered all around the entire memory space. This gives you very poor caching.
Set 0 in the cache will be overloaded because every page starts at a round address,
unless the cache is using randomization.
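
A small C illustration of the set-0 effect, assuming for the sake of the example 64-byte lines and 64 sets; the numbers are made up, not measured:

#include <stdint.h>
#include <stdio.h>

/* With 64-byte lines and 64 sets, the set index repeats every 4 kB,
   so any object placed at the start of its own 4 kB page lands in set 0. */
static unsigned cache_set(uint64_t addr, unsigned line_bytes, unsigned sets) {
    return (unsigned)((addr / line_bytes) % sets);
}

int main(void) {
    for (int i = 0; i < 4; i++) {
        uint64_t obj = 0x10000000ULL + (uint64_t)i * 4096;  /* one object per page */
        printf("object %d at %#llx -> set %u\n",
               i, (unsigned long long)obj, cache_set(obj, 64, 64));
    }
    return 0;   /* all four objects map to set 0 */
}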

I think it is time to take a fresh look at current systems and see if the biggest
performance killers can be redesigned. Making data contiguous will obviously
help a lot. If my proposal to get rid of the TLB and 5-level page table can be a
catalyst towards a more efficient design, then I will be happy.


MitchAlsup wrote:
>Many systems include a global offset table,....
>So, {text, rodata, bss, data, stack, heap, GOT} is certainly bigger than 3 segments.

bss, data, stack, and heap all have read and write access. They are merged into one big
memory block. The stack is at the bottom growing backwards, while the heap is at the top
growing forwards.

ForwardCom has a different library system with no GOT, no PLT, no import tables. Getting rid
of these tables and the inefficient symbol interposition was an important reason why I
invented a different library system.


Josh Vanderhoof wrote:
>Back in the 90's I wrote a dos extender that ran with paging off.
>It was a little faster than Linux and Windows running a Doom clone I wrote.
>Around 5-10%. Measurable but not earth shattering.

Interesting example. That was in the old days where memory access was rarely a limiting factor and where 640 kB was "enough for everybody".
(This famous quote has never been verified, but in fact the IBM PC design was limited to a maximum of 640 kB RAM. Most PCs had less. Extensions beyond this required memory bank swapping.)

Today, we have more than 10,000 times as much RAM, 3 levels of cache, and much more
memory fragmentation.

Quadibloc

unread,
Feb 8, 2023, 7:34:09 PM2/8/23
to
On Monday, February 6, 2023 at 12:21:36 AM UTC-7, Agner Fog wrote:

> Interesting example. That was in the old days where memory access was
> rarely a limiting factor and where 640 kB was "enough for everybody".
> (This famous quote has never been verified, but in fact the IBM PC design
> was limited to a maximum of 640 kB RAM. Most PCs had less. Extensions
> beyond this required memory bank swapping.)

I vaguely recall a quote about a certain amount of memory being enough
for everyone, but I can't recall if it was 640K.

32K words of memory - 36-bit long words - was the memory of the giant
IBM 7090 mainframe, which performed amazing calculations of things like
the orbit the Apollo spacecraft would have to take to go to the Moon. So
I can definitely see people claiming that 128K (32K words of 32 bits) would
be enough for anyone.

John Savard

Quadibloc

unread,
Feb 8, 2023, 7:39:11 PM2/8/23
to
On Wednesday, February 8, 2023 at 5:34:09 PM UTC-7, Quadibloc wrote:
> On Monday, February 6, 2023 at 12:21:36 AM UTC-7, Agner Fog wrote:
>
> > Interesting example. That was in the old days where memory access was
> > rarely a limiting factor and where 640 kB was "enough for everybody".
> > (This famous quote has never been verified, but in fact the IBM PC design
> > was limited to a maximum of 640 kB RAM. Most PCs had less. Extensions
> > beyond this required memory bank swapping.)

> I vaguely recall a quote about a certain amount of memory being enough
> for everyone, but I can't recall if it was 640K.

A little web search has cleared up this mystery. What Bill Gates actually
said, after the fact, when he already knew more memory was needed, was
this:

“When we set the upper limit of PC-DOS at 640K, we thought nobody
would ever need that much memory.”

This was in 1985, in the pages of InfoWorld.

John Savard

MitchAlsup

unread,
Feb 8, 2023, 8:07:41 PM2/8/23
to
I used a Sigma 5 system with 16 KB of memory and a 5 KB operating system where we
had to take a 4 K sample (16-bit A/D converter), then perform a correlation
(FFT->conjugate multiplication->inverse FFT), and then draw the spectrum on the
Tektronix display tube, without ever exceeding the 16 KB limit of the machine.
<
We used overlays and occasionally had to re-think our strategy when the
overlay we had figured out failed.
<
Lack of memory is never a reason a computer cannot solve a given problem !!
<
Back in the 1990s, CRAY corporation sold more machines than NEC even while
NEC had higher in-memory performance and somewhat larger memory, because
the CRAY had a higher-performing I/O system and could keep the CPUs busy while
the I/O was bringing in new data and sending out crunched data simultaneously.
<
Those were the days............
>
> John Savard

Thomas Koenig

unread,
Feb 9, 2023, 5:55:07 AM2/9/23
to
MitchAlsup <Mitch...@aol.com> wrote:

> Back in the 1990s, CRAY corporation sold more machines than NEC even while
> NEC had higher in-memory performance and somewhat larger memory, because
> the CRAY had a higher-performing I/O system and could keep the CPUs busy while
> the I/O was bringing in new data and sending out crunched data simultaneously.

I once heard (at the time) that a university in Germany bought a Cray
rather than an IBM mainframe mainly because of its I/O capacity, which
was higher than what IBM could offer at a comparable price.

Not sure what model or time it was, must have been in the mid-1990s.

MitchAlsup

unread,
Feb 9, 2023, 11:34:19 AM2/9/23
to
A modern CRAY was either the XMP or the YMP in the 1990s.

JohnG

unread,
Feb 9, 2023, 4:09:27 PM2/9/23
to
On Thursday, February 9, 2023 at 8:34:19 AM UTC-8, MitchAlsup wrote:
> A modern CRAY was either the XMP or the YMP in the 1990s.

Don't forget the T3D/E line. And technically the C90/T90 lines aren't YMPs. There were also the Celerity/FPS machines. Cray had a lot (too many) of irons in the fire in the mid-90's.

I'm guessing OP might be thinking of the T90 Jülich bought.

Also, buying supercomputers back in the day was _always_ as much political as it was technical. There was no way in hell a US institution was going to buy a NEC with NSF/ONR/DOD/NASA/lmnop money. Similarly, foreign purchases were always heavily lobbied by the Department of Commerce and other agencies.

MitchAlsup

unread,
Feb 9, 2023, 7:53:39 PM2/9/23
to
Agner::

There are questions I have that are unanswered over at forwardcom.info::

One process contain multiple threads under a common memory address space,
but each thread has its own stack and its own local data (a place to put errno).
So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.

Now, over in the operating system, kernels like to have a unique area for each process
(that the kernel manages on behalf of the process), a unique area for each thread
under the process (that kernel manages); while the kernel manages processes and
threads which can access I/O devices; and a unique area for each core (where the
kernel keeps information concerning context switching,...)...

So we have at least 8 unique areas accessible to any given thread, and at least 8 more
where the kernel is managing data on behalf of user or other threads. and we still
have plenty of areas we have not considered in the OS (and we have not even considered
any HyperVisor.)

Your web site has negligible information on this topic:: care to elaborate on your plan
of attack ??

Agner Fog

unread,
Feb 10, 2023, 2:01:23 AM2/10/23
to
MitchAlsup wrote:

>One process contain multiple threads under a common memory address space,
>but each thread has its own stack and its own local data (a place to put errno).
>So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.

The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.

>Now, over in the operating system, kernels like to have a unique area for each process
>(that the kernel manages on behalf of the process), a unique area for each thread
>under the process (that kernel manages); while the kernel manages processes and
>threads which can access I/O devices; and a unique area for each core (where the
>kernel keeps information concerning context switching,...)...

The OS does not need access to the memory of an application process while it is running -
in fact, it should rather not have access to it. The application shares a limited block
of memory with a device driver when it calls an I/O system function.

In case of heap overflow you call a system function.
This makes the thread of the process switch to system mode and get a new system-mode
memory map. It then allocates more memory and returns to the process with a memory map
that has been extended to include the newly allocated memory.

Stack overflow works in a similar way, except that the system function call takes the form of an interrupt.

This is how I imagine it could work.

Scott Lurndal

unread,
Feb 10, 2023, 10:36:21 AM2/10/23
to
Agner Fog <ag...@dtu.dk> writes:
>MitchAlsup wrote:
>

>
>The OS does not need access to the memory of an application process while it is running -
>in fact, it should rather not have access to it. The application shares a limited block
>of memory with a device driver when it calls an I/O system function.

How do two applications share memory with each other? Or a dozen applications
all sharing a dozen different shared memory regions?

Can all active threads share the same text (readonly instruction space) for e.g. libc?

>
>In case of heap overflow you call a system function.
>This makes the thread of the process switch to system mode and gets a new system mode
>memory map. It then allocates more memory and returns to the process with a memory map
>that has been extended to include the newly allocated memory.

A classic pitfall of segmentation schemes is memory fragmentation, which leads
to background tasks moving memory around asynchronously to make enough space for a large
allocation - performance killing and error prone code. How does your system
avoid this pitfall?

EricP

unread,
Feb 10, 2023, 12:29:28 PM2/10/23
to
Agner Fog wrote:
> MitchAlsup wrote:
>
>> One process contain multiple threads under a common memory address space,
>> but each thread has its own stack and its own local data (a place to put errno).
>> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
>
> The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
> A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.

Most MMU's do not support Execute-Only protection, just Read-Execute,
Read-only, Read-Write, and usually Read-Write-Execute.
If an MMU does not support RWE then it can be simulated by the
OS trapping and temporarily switching from RE to RW and back to RE.

RW virtual memory sections can be initialized or zero-init.
Initialized sections are mapped file sections with initial values that are
usually marked Copy-On-Write (COW) so the original value is retained.
(There is no technical reason an RW file section must be marked as COW,
and if it is not, then any modifications are paged back to the source file.)
Zero-init section pages are just that.
Both inited-COW and zero-init pages may be swapped out to the page file.

Virtual sections may also have various attributes,
like a reserved and committed sizes, regular or stack,
and if stack then grow up or grow down
(tells OS how to handle stack page faults).

There may also be virtual sections for HW devices, like the on-device
memory of a graphics card made available to privileged processes like
a window manager, or device control registers to a user mode device driver.
These virtual sections may have special platform specific HW controls
for caching that must be passed to the page tables.

The above are virtual sections typically statically defined by
EXE & DLL files of a program. However DLL's are *dynamic* and
can be loaded and unloaded on the fly.

User mode threads each have a thread header, some amount
of Thread Local Store (TLS) defined by each EXE/DLL,
and an initial stack size that has a default value
but can also be set by each thread creator.
An OS may keep pools of previously allocated thread headers,
TLS, and stacks to speed up user mode thread creation
(but it still needs to create kernel structures for each thread).

And for *nix style OS there is support for forking which on most
modern systems just copies certain page table entries to the child
and marks all RW pages as COW. As each page is written the child
and parent co-ownership splits and each gets its own page copy.

>> Now, over in the operating system, kernels like to have a unique area for each process
>> (that the kernel manages on behalf of the process), a unique area for each thread
>> under the process (that kernel manages); while the kernel manages processes and
>> threads which can access I/O devices; and a unique area for each core (where the
>> kernel keeps information concerning context switching,...)...
>
> The OS does not need access to the memory of an application process while it is running -
> in fact, it should rather not have access to it. The application shares a limited block
> of memory with a device driver when it calls an I/O system function.

OS needs to initialize certain user mode data structures like thread headers.
OS needs to edit user space such as thread stack to deliver signals.
OS needs to copy system service routine arguments to/from kernel space.
OS needs RO and RW access to IO buffers in user space.

One _could_ build an OS that only enables access windows to specific
user mode memory areas on demand and deletes the window after use.
But it is a lot of work and not much benefit so few do this
(if a virus is already inside the kernel, modifications to
application user mode memory is the least of your problems).

> In case of heap overflow you call a system function.
> This makes the thread of the process switch to system mode and gets a new system mode
> memory map. It then allocates more memory and returns to the process with a memory map
> that has been extended to include the newly allocated memory.
>
> Stack overflow works in a similar way, except that the system function call takes the form of an interrupt.
>
> This is how I imagine it could work.

One way to handle stack overflow is to allocate a dead page
above and below the stack, and in the user mode thread header
the OS stashes the stack low-water and absolute limits.
If a subroutine allocates more than 1 page-size stack space then
it checks the new stack bottom against its current absolute limits
in the thread header. The low-water limit tracks which pages
in the stack have already been touched and committed.
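
A minimal C sketch of such a probe, with invented field and function names; it is only meant to illustrate the idea, not any particular OS or ABI:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uintptr_t stack_low_water;   /* lowest stack address already touched/committed */
    uintptr_t stack_abs_limit;   /* absolute bottom of the reserved stack */
} thread_header;

/* Called from the prologue of a function whose frame exceeds one page. */
static int stack_probe(thread_header *th, uintptr_t sp, size_t frame) {
    uintptr_t new_sp = sp - frame;
    if (new_sp < th->stack_abs_limit)
        return -1;                                  /* genuine stack overflow */
    while (th->stack_low_water > new_sp) {
        th->stack_low_water -= 4096;                /* step down one page at a time */
        *(volatile char *)th->stack_low_water = 0;  /* touch it so the OS commits it */
    }
    return 0;
}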

All of the above mechanisms are based on page faults and their handling.
For a virtual-range-based MMU, many algorithms would have to be redesigned.
For example, Copy-On-Write must allocate and copy the whole virtual
section, not just a touched page.


Scott Lurndal

unread,
Feb 10, 2023, 12:44:30 PM2/10/23
to
EricP <ThatWould...@thevillage.com> writes:
>Agner Fog wrote:
>> MitchAlsup wrote:
>>
>>> One process contain multiple threads under a common memory address space,
>>> but each thread has its own stack and its own local data (a place to put errno).
>>> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>>> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
>>
>> The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
>> A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.
>
>Most MMU's do not support Execute-Only protection,

ARM's Aarch64 does. See table D8-42 in DDI0487I_a.

ARM also supports readable but _not_ executable pages/blocks.


MitchAlsup

unread,
Feb 10, 2023, 1:45:25 PM2/10/23
to
On Friday, February 10, 2023 at 11:29:28 AM UTC-6, EricP wrote:
> Agner Fog wrote:
> > MitchAlsup wrote:
> >
> >> One process contain multiple threads under a common memory address space,
> >> but each thread has its own stack and its own local data (a place to put errno).
> >> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
> >> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
> >
> > The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
> > A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.
<
> Most MMU's do not support Execute-Only protection, just Read-Execute,
<
Providing a HUGE hole for attackers ; any RWE page can be attacked
by changing the code at the return point--taking over the application.
JIT languages are especially vulnerable.
<
> Read-only, Read-Write, and usually Read-Write-Execute.
> If an MMU does not support RWE then it can be simulated by the
> OS trapping and temporarily switching from RE to RW and back to RE.
>
> RW virtual memory sections can be initialized or zero-init.
> Initialized are mapped file sections with initial values that are
> usually marked Copy-On-Write (COW) so the original value is retained.
<
COW works under paging but less well under segmentation. COW
under paging copies 1 page at a time, COW under segmentation has
to copy the entire 1MB segment when 28 words are written.
<
I also fail to see how memory mapped files will work when each file
can be extended by simply writing after the EOF point. To me, this implies
that each MM File has to have its own segment.
The parent and child end up sharing all pages that are not written,
lowering the memory footprint of the conglomeration.
<
> >> Now, over in the operating system, kernels like to have a unique area for each process
> >> (that the kernel manages on behalf of the process), a unique area for each thread
> >> under the process (that kernel manages); while the kernel manages processes and
> >> threads which can access I/O devices; and a unique area for each core (where the
> >> kernel keeps information concerning context switching,...)...
> >
> > The OS does not need access to the memory of an application process while it is running -
<
read( file, buffer, size ); ???
<
Somebody certainly needs access !
<
> > in fact, it should rather not have access to it. The application shares a limited block
> > of memory with a device driver when it calls an I/O system function.
<
> OS needs to initialize certain user mode data structures like thread headers.
> OS needs to edit user space such as thread stack to deliver signals.
> OS needs to copy system service routine arguments to/from kernel space.
> OS needs RO and RW access to IO buffers in user space.
<
OS needs write access to execute-only pages for (well) paging !

Scott Lurndal

unread,
Feb 10, 2023, 1:51:36 PM2/10/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, February 10, 2023 at 11:29:28 AM UTC-6, EricP wrote:
>> Agner Fog wrote:
>> > MitchAlsup wrote:
>> >
>> >> One process contain multiple threads under a common memory address space,
>> >> but each thread has its own stack and its own local data (a place to put errno).
>> >> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>> >> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
>> >
>> > The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
>> > A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.
><
>> Most MMU's do not support Execute-Only protection, just Read-Execute,
><
>Providing a HUGE hole for attackers ; any RWE page can be attacked
>by changing the code at the return point--taking over the application.
>JIT languages are especially vulnerable.

All major extant processors (intel, amd, arm) provide controls
to prevent execution of instructions from pages with write access
enabled so far as I am aware.

The ability to deny read and write access while allowing instruction
execution is not universal (albeit supported by ARM64).
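
This is also why a JIT on a W^X system typically has to flip its code
buffer between writable and executable rather than keeping it RWX. A
simplified POSIX-style sketch (error handling omitted, one page assumed):

    #include <sys/mman.h>
    #include <string.h>

    /* The buffer is never writable and executable at the same time. */
    void *emit_code(const void *code, size_t len)
    {
        size_t sz = 4096;                          /* assume len fits one page */
        void *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, code, len);                    /* write while RW */
        mprotect(buf, sz, PROT_READ | PROT_EXEC);  /* then flip to RX */
        return buf;
    }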

MitchAlsup

unread,
Feb 10, 2023, 3:17:04 PM2/10/23
to
But should be !! You certainly do not want the JIT application to find where
its code is positioned and "modify" it itself; while at the same time you do
want the JIT compiler to be able to modify it.
<
One does not want the application to be able to modify the tables used
to perform flow control in packed switch statements or method calling
linkage pointers.
<
And with the plethora of new attack vectors, one does not want the control
stack (return address and preserved registers) accessible to attackers
{Buffer overflow, Return-Oriented-Programming,...}.
<
Old architectures may be forgiven since they were developed prior to these
attack vectors; newer ones cannot be forgiven for not providing means to
withstand such attacks.

Scott Lurndal

unread,
Feb 10, 2023, 3:28:06 PM2/10/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, February 10, 2023 at 12:51:36 PM UTC-6, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Friday, February 10, 2023 at 11:29:28 AM UTC-6, EricP wrote:
>> >> Agner Fog wrote:
>> >> > MitchAlsup wrote:
>> >> >
>> >> >> One process contain multiple threads under a common memory address space,
>> >> >> but each thread has its own stack and its own local data (a place to put errno).
>> >> >> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>> >> >> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
>> >> >
>> >> > The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
>> >> > A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.
>> ><
>> >> Most MMU's do not support Execute-Only protection, just Read-Execute,
>> ><
>> >Providing a HUGE hole for attackers ; any RWE page can be attacked
>> >by changing the code at the return point--taking over the application.
>> >JIT languages are especially vulnerable.
>> All major extant processors (intel, amd, arm) provide controls
>> to prevent execution of instructions from pages with write access
>> enabled so far as I am aware.
>>
>> The ability to deny read and write access while allowing instruction
>> execution is not universal (albeit supported by ARM64).
><
>But should be !! You certainly do not want the JIT application to find where
>its code is positioned and "modify" it itself; while at the same time you do
>ant the JIT compiler to be able to modify it.

Certainly, but as you pointed out below, old architectures (intel/amd)
don't necessarily have the capability to support that without expanding
the size of a page table entry or otherwise fundamentally
redesigning the translation table subsystem, for example.

Yet all of that is not enough; an architecture should also support
a means to securely isolate a guest operating system from the
hypervisor, such that the hypervisor cannot view or alter physical memory
owned by the guest. Such as intel secure enclaves (SGX) or arm realms.

https://developer.arm.com/documentation/den0126/0100/Overview

EricP

unread,
Feb 10, 2023, 4:23:25 PM2/10/23
to
My ARMv8 manual from 2020 only mentions the AP (access protection) and
XN (eXecute Never) PTE bits which define R, RW, RE, RWE access protection.
Though there is a bunch of other features for "protection tables"
in there that I have not read about.


MitchAlsup

unread,
Feb 10, 2023, 4:30:25 PM2/10/23
to
Yes, understood.
<
I have such a means, but I am not in a position to talk about it openly (without NDA).
Under my model, one can have "paranoid" ×{ applications, OSs, HyperVisors} that
can interact without exposing memory they have been allotted.
<
Paranoid means that the lesser privileged owner can ask for normal services
{read, write, socket buffers,...} to pass through the OS without letting the OS
look at the data/buffer, but allowing {real or virtual} devices to perform the
requested service. The service provider can see the MMU map and verify the
correctness of the request without being able to "look" into said buffer.
<
And I did not need to add a mode to support that.

Scott Lurndal

unread,
Feb 10, 2023, 4:46:53 PM2/10/23
to
The latest version is here, with the table noted above:

https://developer.arm.com/documentation/ddi0487/latest

Agner Fog

unread,
Feb 11, 2023, 2:05:37 AM2/11/23
to
Scott Lurndal wrote:
>How do two applications share memory with each other? Or a dozen applications
>all sharing a dozen different shared memory regions?

ForwardCom has special system calls for inter-process communications. One process shares a limited part of its memory with one or more other processes. This may be done in a separate thread.

>Can all active threads share the same text (readonly instruction space) for e.g. libc?

Multiple instances of the same application share the same execution and readonly memory.

Different applications using the same library function each have their own instance, due to the
unique library system of ForwardCom.

>A classic pitfall of segmentation schemes is memory fragmentation, which leads
>to background tasks moving memory around asynchronously to make enough space for a large
>allocation - performance killing and error prone code. How does your system
>avoid this pitfall?

There are no fixed-size segments. ForwardCom has many features to avoid memory fragmentation,
as I have explained elsewhere. Extending the heap or stack of a running process is a last resort
that should happen only rarely.


EricP wrote:
>Most MMU's do not support Execute-Only protection, just Read-Execute,
>Read-only, Read-Write, and usually Read-Write-Execute.

ForwardCom supports execute-only protection as the default for executable memory.
It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.

>OS needs to initialize certain user mode data structures like thread headers.
The OS can initialize structures before leaving them to the running process.

>OS needs to edit user space such as thread stack to deliver signals.
Signals, mutexes, and other inter-process communication goes through a memory
block shared for this purpose.

>OS needs to copy system service routine arguments to/from kernel space
>OS needs RO and RW access to IO buffers in user space.
Small arguments go through registers when calling a system function.
Large data go through a memory block that the calling process shares with
a device driver when calling it. This access lasts only until the call returns.
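
To make that concrete, such a call might look roughly like the sketch below.
The names and signatures are invented for illustration; they are not the
actual ForwardCom ABI:

    #include <stddef.h>

    /* Hypothetical illustration only -- not the real ForwardCom interface. */
    struct shared_block {
        void  *base;    /* start of the memory the caller exposes */
        size_t size;    /* length of the exposed region */
    };

    /* Imagined system call: the driver may touch only 'blk', and only
       until sys_read_file() returns. */
    extern long sys_read_file(int file, struct shared_block *blk);

    long read_into(int file, void *buffer, size_t size)
    {
        struct shared_block blk = { buffer, size };
        return sys_read_file(file, &blk);   /* sharing ends when this returns */
    }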

>One _could_ build an OS that only enables access windows to specific
>user mode memory areas on demand and deletes the window after use
This is exactly what I want for ForwardCom. For two reasons: for security and
for avoiding memory fragmentation.

>One way to handle stack overflow is to allocate a dead page
>above and below the stack,...
I want no fixed-size pages. There is unused virtual space below the stack
and above the heap. If all attempts to avoid stack overflow fail, then
allocate some more memory. These memory blocks grow exponentially in size
so that the total number of blocks is kept small.

>For a virtual range based MMU many algorithms would have to be redesigned.
Yes, definitely. I am trying to explore if it is possible to get rid of the TLB
and multi-level page tables by redesigning everything.

MitchAlsup wrote:
>>> The OS does not need access to the memory of an application process while it is running -
>read( file, buffer, size ); ???
>Somebody certainly needs access !

As I have explained before, the application shares a limited part of its memory with a
device driver when calling it. This access lasts only until the call returns.

>OS needs write access to execute-only pages for (well) paging !
And I want to avoid paging. The OS needs to initialize executable memory only
before it is left to the application. There is never a need for simultaneous
write access and execute access. Self-modifying code should be avoided.

The need for lazy loading of code sections is mainly a thing of the past.
Compiled programs today typically have a few hundred kB of executable code
while we have many GB of RAM available. In other words, executable code
typically accounts for a very small fraction of total memory use.


MitchAlsup wrote:
>One does not want the application to be able to modify the tables used
>to perform flow control in packed switch statements or method calling
>linkage pointers.

This is why ForwardCom puts such tables in read-only memory.


Scott Lurndal wrote:
>an architecture should also support
>a means to securely isolate a guest operating system from the
>hypervisor, such that the hypervisor cannot view or alter physical memory
>owned by the guest.

Yes, indeed. BTW, virtual machines and hypervisors are mainly for dealing with
incompatibilities between software written for different platforms. ForwardCom
is standardizing all ABI details and major system functions in order to avoid
such incompatibilities. A ForwardCom application should be able to run on
different ForwardCom platforms by using the relinking feature to replace some
interface libraries. Thus, there should be little need for a hypervisor except
when emulating legacy architectures.

Thomas Koenig

unread,
Feb 11, 2023, 5:50:27 AM2/11/23
to
Agner Fog <ag...@dtu.dk> schrieb:
> Thomas Koenig wrote:
>>>>What is your strategy for memory-mapped I/O?
>
>>> The user process is calling a system function. The system call
>>> instruction has a feature for sharing a block of memory. The system
>>> function does not have access to everything, as current systems
>>> have, but only to the memory block shared by the application. The
>>> system function may use DMA or whatever the system designer
>
>>So, each shared memory block would then have its own entry in the
>>table for each of the processes, correct?
>
> There is no "the table". Each thread has its own memory map.
> You only add one entry to the map of each thread.
>
>>Do I see it correctly that an application which, for whatever reason,
>>opened 1000 files via memory-mapped I/O at the same time would run
>>out of descriptors?
>
> The shared memory block belongs to the application process. The process needs no
> extra map entry. When you call the system function and enter system mode,
> you switch to a system map with one entry for the memory block shared by the process,
> and one entry for the DMA I/O area.
> You cannot call multiple I/O functions simultaneously unless they are in separate
> threads. Each thread has its own memory map.
>
>>Same thing could apply for shared memory between processes.
> This case is discussed in chapter 9.4 p. 123 in the ForwardCom manual.
> Several possible solutions are discussed there.

I've read this chapter, and I have to say I find it less than
convincing.

Let me comment on a few selected paragraphs.

# The memory space may become fragmented despite the use of these
# techniques. Problems that can result in memory fragmentation are
# listed below

# Recursive functions can use unlimited stack space. We may require
# that the programmer specifies a maximum recursion level.

Programmers are likely to reject that - this feels a bit like the
old days of mainframes.

# Allocation of variable-size arrays on the stack using the alloca
# function in C. We may require that the programmer specifies a
# maximum size.

Already discussed upthread - JCL REGION parameter, here we come.

# Script languages and byte code languages. It is difficult to predict
# the required size of stack and heap when running interpreted or
# emulated code. It is recommended to use a just-in-time compiler
# instead.

[...]

Reads like a recommendation to not use Emacs, Perl or awk, at least
not on anything serious.

# Unpredictable number of threads without protection. The required
# stack size for a thread may be computed in advance,

Are there papers or other resources which show how far this is,
in fact, possible?

# but in some
# cases it may be difficult to predict the number of threads that
# a program will generate.

Probably, yes.

[...]

# Shared memory for inter-process communication. This requires extra
# entries in the memory map as explained below.

And it's not a problem that is solved, I believe.

Skipping down a bit, the manual has

# If one program needs to communicate with a large number of other
# programs then we can use one of these solutions: (1) let the program
# that needs many connections own the shared memory and give each
# of its clients access to one part of it,

This implicitly assumes client-server, and would not, for example,
address a PGAS language. And PGAS is used for high performance
applications.

# (2) run multiple threads in (or multiple instances of) the program
# that needs many connections so that each thread has access to only
# one shared memory block,

That is quite wasteful - create a thread to avoid a memory map entry,
so you have additional task switching overhead.

# (3) let multiple communication channels use the same shared memory
# block or parts of it,

If you have (let's say) a web server application, you will want
protection against buffer overruns as much as possible.

# (4) communicate through function calls

Quite some overhead.

# (5) communicate through network sockets

Even more.

# (6) communicate through files.

... which should not be memory mapped, so again quite a high overhead.

Jumping up again...

# Memory mapping of files and devices. This practice should be
# avoided if possible.

I believe most databases use memory-mapped files, in order to
combine speed and persistence. This would pretty much exclude
that application from your architecture.

More generally, memory-mapped files are very fast in current
architectures. If you penalize software (and programmers) who
have used this for speed, you will lose a large part of them.

More generally, I am (as I wrote previously) concerned about
swapping out whole processes vs. only unused pages, and about
large processes. ps shows me, on my 16GB home system on which
I type this, /usr/lib/xorg/Xorg with a virtual memory size of
24.3 GB and a resident size of 94MB. And chromium has several
threads with a virtual memory size of 1.1TB. Both of these are
no problem if only the active pages are in memory.

Plus, there is the question of what to do with a process whose
virtual memory size exceeds total memory.

You can argue that both programs are incredibly wasteful, but
they work under the current memory management, where yours
would not.

And swapping takes much more time than paging out (or never
paging in) irrelevant pages.

So, if your architecture without TLB is implemented, my current
conclusion is that it would not work well for

- databases
- web servers using lots of shared memory
- computers used interactively (laptops, desktops)
- high-performance computing using shared memory to communicate
(at least for some PGAS variants)

In this sense, I mostly share your conclusion that

[big snip]

> Right now, it looks like a ForwardCom without TLB will be optimal for many applications
> in dedicated systems, while a ForwardCom with TLB and page tables will be optimal for
> the most demanding server applications.

except for a subset of them (databases and web servers might
have to be redesigned at least, and might not perform as well
even if they are).

However, in this way, you would introduce a dependency on
microarchitecture: Code which assumes pages and heavily uses
memory-mapped I/O and shared memory will perform badly or not at
all on systems lacking paging and a TLB. Would it be wise
to expose that microarchitectural detail to the ISA? Hmm...

Thomas Koenig

unread,
Feb 11, 2023, 7:48:38 AM2/11/23
to
JohnG <gomij...@gmail.com> schrieb:
> On Thursday, February 9, 2023 at 8:34:19 AM UTC-8, MitchAlsup wrote:
>> A modern CRAY was ether the XMP or the YMP in the 1990s.
>
> Don't forget the T3D/E line. And technically the C90/T90 lines
> aren't YMPs. There were also the Celerity/FPS machines. Cray had
> a lot (too many) of irons in the fire in the mid-90's.

>
> I'm guessing OP might be thinking of the T90 Jülich bought.

It was the University of Stuttgart, actually (I was at Karlsruhe
at the time; we had a Siemens/Fujitsu VP there).

> Also, buying supercomputers back in the day was _always_ as
> much political as it was technical. There was no way in hell a US
> institution was going to buy a NEC with NSF/ONR/DOD/NASA/lmnop
> money. Similarly, foreign purchases were always heavily lobbied
> by the Department of Commerce and other agencies.

The company I went on to join had bought a Cray. They used it
for chemical plant simulations, which is a bit strange because,
unless you solve for huge Jacobians, you cannot effectively use a
Cray at all. I think that, by the time I joined, it had already
been decommissioned; I certainly never used it (but then, I worked
on different things, in a different department).

Hmm... looking around for a bit, I actually found an article that
describes what they did at the time: "Chemical Plant Simulation
on Cray Research Supercomputers: An Industrial Success Story".
It also shows how they had to tweak the code to make it run
successfully.

MitchAlsup

unread,
Feb 11, 2023, 1:54:24 PM2/11/23
to
On Saturday, February 11, 2023 at 1:05:37 AM UTC-6, Agner Fog wrote:
> Scott Lurndal wrote:
> >How do two applications share memory with each other? Or a dozen applications
> >all sharing a dozen different shared memory regions?
<
Let us postulate a single process with K threads and K memory mapped shared files
such that thread[j] is sharing its file with thread[j+1] and no other thread {thread[K]
is sharing its file with thread[0]}.
<
Now:: suppose K = 1,000, 10,000, 100,000
<
How does ForwardCom perform these maneuvers ??
<<snip>
> MitchAlsup wrote:
> >>> The OS does not need access to the memory of an application process while it is running -
> >read( file, buffer, size ); ???
> >Somebody certainly needs access !
> As I have explained before, the application shares a limited part of its memory with a
> device driver when calling it. This access lasts only until the call returns.
<
The device driver is inside the untrusted OS, and the application does not want the
untrusted OS to see its data. The application is accepting that the actual device
needs to access the specified memory. It does not matter that permission is
removed after the access is complete; the application does not trust the device
driver but begrudgingly accepts that the SATA device may need to read/write the
specified memory.
<
> >OS needs write access to execute-only pages for (well) paging !
> And I want to avoid paging. The OS needs to initialize executable memory only
> before it is left to the application. These is never a need for simultaneous
> write access and execute access. Self-modifying code should be avoided.
>
> The need for lazy loading of code sections is mainly a thing of the past.
> Compiled programs today have typically a few hundred kB of executable code
> while we have many GB of RAM available. In other words, executable code
> typically accounts for a very small fraction of total memory use.
<
The reason that Gould (S.E.L) minicomputers went to paging after being
base-bounds for 15 years is that someone computed the time delay of
loading a 2GB application after storing another 2GB application en masse.
{Back in the day that time was calculated to be 4 seconds--4 seconds
where the CPU was idle; whereas paging was a reliable 60ms.}
<
What would ForwardCom do with a 20GB segment if the OS needed to swap
one program out to bring another one in ?
<
> MitchAlsup wrote:
> >One does not want the application to be able to modify the tables used
> >to perform flow control in packed switch statements or method calling
> >linkage pointers.
> This is why ForwardCom puts such tables in read-only memory.
> Scott Lurndal wrote:
> >an architecture should also support
> >a means to securely isolate a guest operating system from the
> >hypervisor, such that the hypervisor cannot view or alter physical memory
> >owned by the guest.
> Yes, indeed. BTW, virtual machines and hypervisors are mainly for dealing with
> incompatibilities between software written for different platforms. ForwardCom
> is standardizing all ABI details and major system functions in order to avoid
> such incompatibilities. A ForwardCom application should be able to run on
> different ForwardCom platforms by using the relinking feature to replace some
> interface libraries. Thus, there should be little need for a hypervisor except
> when emulating legacy architectures.
<
How does the "computer system" survive the crash of the operating system
without a hypervisor being there to pick up the pieces ??
<
Then, how does the hypervisor swap lesser used portions of the operating system
to load balance between several GuestOSs ??

MitchAlsup

unread,
Feb 11, 2023, 2:07:31 PM2/11/23
to
On Saturday, February 11, 2023 at 4:50:27 AM UTC-6, Thomas Koenig wrote:
> Agner Fog <ag...@dtu.dk> schrieb:
> > Thomas Koenig wrote:
>
>
> # Shared memory for inter-process communication. This requires extra
> # entries in the memory map as explained below.
<
Is there a way to set up this memory sharing so that thread[j] has write-only
access while thread[i!=j] has read-only access ??
>
> And it's not a problem that is solved, I belive.
>
<snip>
>
> # Memory mapping of files and devices. This practice should be
> # avoided if possible.
<
Yet this has become the de rigueur way of programming access to
such files.
>
> I belive most databases use memory-mapped files, in order to
> combine speed and persistence. This would pretty much exclude
> that application from your architecture.
<
Most databases share a single chunk of memory across all
applications (typically ½ of all installed memory).
>

EricP

unread,
Feb 12, 2023, 11:14:01 AM2/12/23
to
Oh right, it does have user-mode (aka EL0) execute-only.
I misunderstood the table (they have too many control options).
But it's only for user mode, not for higher protection rings (EL1, EL2).

Execute-only is odd, especially for a risc-ish ISA like ARMv8,
because normally executable code needs to load constants from
RIP-relative tables in the code (text) memory section,
which requires Read-Execute access.

Even if an MMU supports EO protection it is of limited use.



EricP

unread,
Feb 12, 2023, 11:14:01 AM2/12/23
to
MitchAlsup wrote:
> On Friday, February 10, 2023 at 11:29:28 AM UTC-6, EricP wrote:
>> Agner Fog wrote:
>>> MitchAlsup wrote:
>>>
>>>> One process contain multiple threads under a common memory address space,
>>>> but each thread has its own stack and its own local data (a place to put errno).
>>>> So, to a first order, one needs a set of common areas {text, bss, data, rodata, stack,
>>>> heap,...} available to all threads of a process, 2 areas for each thread {local, stack}.
>>> The main thread of a process typically has three memory map entries: 1. execute only, 2. read only, and 3. read/write containing stack, bss, data, heap.
>>> A child thread with private memory has one extra entry for child-private read/write data. This includes the child's stack and static thread-local data located at the bottom of this stack.
> <
>> Most MMU's do not support Execute-Only protection, just Read-Execute,
> <
> Providing a HUGE hole for attackers ; any RWE page can be attacked
> by changing the code at the return point--taking over the application.
> JIT languages are especially vulnerable.

ARMv8 has user-mode Read-Execute protection.

> <
>> Read-only, Read-Write, and usually Read-Write-Execute.
>> If an MMU does not support RWE then it can be simulated by the
>> OS trapping and temporarily switching from RE to RW and back to RE.
>>
>> RW virtual memory sections can be initialized or zero-init.
>> Initialized are mapped file sections with initial values that are
>> usually marked Copy-On-Write (COW) so the original value is retained.
> <
> COW works under paging but less well under segmentation. COW
> under paging copies 1 page at a time, COW under segmentation has
> to copy the entire 1MB segment when 28 words are written.
> <
> I also fail to see how memory mapped files will work when each file
> can be extended by simply writing after the EOF point. To me, this implies
> that each MM File has to have its own segment.
> <

EXE and DLL files are memory mapped files with multiple
memory sections whose size is statically defined in the EXE/DLL.
I gather Agner wants to aggregate all the EXE/DLL file memory sections
in the closure set of a program into a smaller set.

A MM data file is a single *mapping window* onto a file,
the mapping window size being defined when the section is created
and sized <= file size. Even if an OS allows a MM file to be extended,
the existing mapping windows won't change size because each had to
reserve a range of virtual addresses when the MM section was created.
Yes, for segmentation COW must copy a whole memory section if a
single byte is written, so it becomes a lot more expensive as
physical memory for the whole section must be allocated and copied.

With paging, OS only has to reserve but not allocate physical memory,
which is a lot easier as it just requires diddling some counters.
This skips physical frame allocation, and leaves the frames in the free pool
for all to use until each page is actually touched.

> <
>>>> Now, over in the operating system, kernels like to have a unique area for each process
>>>> (that the kernel manages on behalf of the process), a unique area for each thread
>>>> under the process (that kernel manages); while the kernel manages processes and
>>>> threads which can access I/O devices; and a unique area for each core (where the
>>>> kernel keeps information concerning context switching,...)...
>>> The OS does not need access to the memory of an application process while it is running -
> <
> read( file, buffer, size ); ???
> <
> Somebody certainly needs access !
> <
>>> in fact, it should rather not have access to it. The application shares a limited block
>>> of memory with a device driver when it calls an I/O system function.
> <
>> OS needs to initialize certain user mode data structures like thread headers.
>> OS needs to edit user space such as thread stack to deliver signals.
>> OS needs to copy system service routine arguments to/from kernel space.
>> OS needs RO and RW access to IO buffers in user space.
> <
> OS needs write access to execute-only pages for (well) paging !

OS should handle paging user mode executable code by

1) allocate a free dirty physical frame
2) set up DMA scatter-write to faulting frames, queue IO file read, wait
3) IF code/data pages contains relocatable memory values
3.1) set PTE page protection to super:RW, user:NA (no access)
3.2) write patch relocations
4) set PTE page protection to super:R, user:RE

so at no instant would the executable code page be
RW accessible to user mode.
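
Rendered as C-style pseudocode (every helper name below is made up for
illustration, not a real kernel API), the sequence is:

    typedef unsigned long frame_t;
    struct pte;
    struct file;

    extern frame_t alloc_free_frame(void);
    extern void    dma_read(struct file *f, long off, frame_t frm, long len);
    extern int     page_has_relocations(struct file *f, long off);
    extern void    apply_relocations(frame_t frm, struct file *f, long off);
    extern void    set_protection(struct pte *p, frame_t frm, int sup, int usr);

    enum { SUPER_RW, SUPER_R, USER_NA, USER_RE, PAGE_SIZE = 4096 };

    void page_in_code(struct pte *pte, struct file *image, long offset)
    {
        frame_t f = alloc_free_frame();              /* 1: free dirty frame  */
        dma_read(image, offset, f, PAGE_SIZE);       /* 2: DMA in, wait      */
        if (page_has_relocations(image, offset)) {   /* 3: patch if needed   */
            set_protection(pte, f, SUPER_RW, USER_NA);
            apply_relocations(f, image, offset);
        }
        set_protection(pte, f, SUPER_R, USER_RE);    /* 4: user sees RE only */
    }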


EricP

unread,
Feb 12, 2023, 11:38:06 AM2/12/23
to
Separating *untrusted* HV access protection is a different matter.

In the past for MMU access protection the view for most designs
was that no lesser security level should have greater access than
higher security levels. For example, if user-mode had RW access
then super-mode had RW access.

This requires fewer PTE bits to define combined super and user access
as it doesn't need to define all combinations of super:RWE and user:RWE.
For example, instead of requiring 6 PTE bits for access control
it might only need 3.

The untrusted hypervisor OS is a recent development which seems to
come about from the growth of cloud time-share: customers don't
necessarily trust their providers, and providers don't want to
have to get security clearances for all their operators.

Since the HV page table entries are separate from the guest PTE's
the existing access control bits can be redefined to give HV
less access than the guest.




EricP

unread,
Feb 12, 2023, 12:06:45 PM2/12/23
to
Agner Fog wrote:
>
> EricP wrote:
>> Most MMU's do not support Execute-Only protection, just Read-Execute,
>> Read-only, Read-Write, and usually Read-Write-Execute.
>
> ForwardCom supports execute-only protection as the default for executable memory.
> It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.

Execute-only protection doesn't actually accomplish anything
because code still needs to load program constants.
It moves RIP-relative load of a constant to a different offset.

>> One way to handle stack overflow is to allocate a dead page
>> above and below the stack,...
> I want no fixed-size pages. There is unused virtual space below the stack
> and above the heap. If all attempts to avoid stack overflow fail, then
> allocate some more memory. These memory blocks grow exponentailly in size
> so that the total number of blocks is kept small.

Ok, for each thread stack OS can reserve a larger range of virtual
addresses, but only allocate a smaller amount of physical memory.
Run-time stack probes compare stack allocations against the
thread's low water mark and its stack limits, and reallocate
physical memory and possibly move a physical segment if needed.



EricP

unread,
Feb 12, 2023, 12:44:21 PM2/12/23
to
EricP wrote:
>
> In the past for MMU access protection the view for most designs
> was that no lesser security level should have greater access than
> higher security levels. For example, if user-mode had RW access
> then super-mode had RW access.
>
> This requires fewer PTE bits to define combined super and user access
> as it doesn't need to define all combinations of super:RWE and user:RWE.
> For example, instead of requiring 6 PTE bits for access control
> it might only need 3.

I like the method Intel eventually converged on: because PTE space is
precious, each PTE contains 3 access control bits which form an index
into an 8-entry MMU HW table containing the 6 access control bits.
The OS decides at boot which 8 of the 64 possible combinations
it requires for user and super mode access and configures the lookup table.
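
In C terms the indirection is tiny. A sketch (the bit position and table
layout here are made up for illustration, not Intel's actual encoding):

    #include <stdint.h>

    /* 3 PTE bits select one of 8 OS-configured {super, user} permission pairs. */
    struct access_pair { uint8_t super_rwe; uint8_t user_rwe; };  /* 3 bits each */

    static struct access_pair perm_table[8];   /* filled in by the OS at boot */

    static inline struct access_pair pte_access(uint64_t pte)
    {
        unsigned idx = (unsigned)(pte >> 1) & 0x7;   /* bit position is an assumption */
        return perm_table[idx];
    }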


MitchAlsup

unread,
Feb 12, 2023, 3:03:06 PM2/12/23
to
On Sunday, February 12, 2023 at 11:06:45 AM UTC-6, EricP wrote:
> Agner Fog wrote:
> >
> > EricP wrote:
> >> Most MMU's do not support Execute-Only protection, just Read-Execute,
> >> Read-only, Read-Write, and usually Read-Write-Execute.
> >
> > ForwardCom supports execute-only protection as the default for executable memory.
> > It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.
>
> Execute-only protection doesn't actually accomplish anything
> because code still needs to load program constants.
> It moves RIP-relative load of a constant to a different offset.
<
My 66000 has no loading of constants, the constants traverse the memory
hierarchy as if they were instructions (even going through the ICache). Thus
these constants are Execute-only. Switch table offsets are also accessed off
the IP, and the entries in the table are PIC......and are execute-only.

MitchAlsup

unread,
Feb 12, 2023, 3:04:51 PM2/12/23
to
So, all we need now is a story of how he manages 1,000-1,000,000 of these
MM files.

Thomas Koenig

unread,
Feb 12, 2023, 4:36:32 PM2/12/23
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Sunday, February 12, 2023 at 11:06:45 AM UTC-6, EricP wrote:
>> Agner Fog wrote:
>> >
>> > EricP wrote:
>> >> Most MMU's do not support Execute-Only protection, just Read-Execute,
>> >> Read-only, Read-Write, and usually Read-Write-Execute.
>> >
>> > ForwardCom supports execute-only protection as the default for executable memory.
>> > It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.
>>
>> Execute-only protection doesn't actually accomplish anything
>> because code still needs to load program constants.
>> It moves RIP-relative load of a constant to a different offset.
><
> My 66000 has no loading of constants, the constants traverse the memory
> hierarchy as if they were instructions (even going through the ICache). Thus
> these constants are Execute-only.

There are exceptions. An example:

static double const c[] = {1.0, -0.5, 0.3333333333333333, -0.25, 0.2};

double poly(double x)
{
  double res;
  res = c[4];
  for (int i=3; i>=0; i--) {
    res = res * x + c[i];
  }
  return res;
}

which is translated to

LBB0_2:                                 ; %for.body
                                        ; =>This Inner Loop Header: Depth=1
        vec     r4,{r1}
        mov     r5,r3
        ldd     r3,[ip,r5<<3,c]
        fmac    r1,r1,r2,r3
        add     r3,r5,#-1
        loop    eq,r5,#0,#0
                                        ; %bb.1: ; %for.cond.cleanup
        ret
.Lfunc_end0:
        .size   poly, .Lfunc_end0-poly
                                        ; -- End function
        .section .rodata,"a",@progbits
        .p2align 3
        .type   c,@object
        .size   c, 40
c:
        .dword  0x3ff0000000000000      ; double 1
        .dword  0xbfe0000000000000      ; double -0.5
        .dword  0x3fd5555555555555      ; double 0.33333333333333331
        .dword  0xbfd0000000000000      ; double -0.25
        .dword  0x3fc999999999999a      ; double 0.20000000000000001

An alternative would be if the loop were unrolled, which would
lead to

        mov     r2,#0x3FC999999999999A
        fmac    r2,r1,r2,#0xBE800000
        fmac    r2,r2,r1,#0x3FD5555555555555
        fmac    r2,r2,r1,#0xBF000000
        fmac    r1,r2,r1,#1
        ret

which would be shorter and (presumably) faster.

MitchAlsup

unread,
Feb 12, 2023, 6:04:22 PM2/12/23
to
On Sunday, February 12, 2023 at 3:36:32 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Sunday, February 12, 2023 at 11:06:45 AM UTC-6, EricP wrote:
> >> Agner Fog wrote:
> >> >
> >> > EricP wrote:
> >> >> Most MMU's do not support Execute-Only protection, just Read-Execute,
> >> >> Read-only, Read-Write, and usually Read-Write-Execute.
> >> >
> >> > ForwardCom supports execute-only protection as the default for executable memory.
> >> > It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.
> >>
> >> Execute-only protection doesn't actually accomplish anything
> >> because code still needs to load program constants.
> >> It moves RIP-relative load of a constant to a different offset.
> ><
> > My 66000 has no loading of constants, the constants traverse the memory
> > hierarchy as if they were instructions (even going through the ICache). Thus
> > these constants are Execute-only.
> There are exceptions. An example:
>
> static double const c[] = {1.0, -0.5, 0.3333333333333333, -0.25, 0.2};
<
You did ask for these constants to be in memory with "static". Thus, while
these containers do not change their value, they are not constants in the
sense I am using the word.
>
> double poly(double x)
> {
> double res;
> res = c[4];
> for (int i=3; i>=0; i--) {
> res = res * x + c[i];
<
I am sure at this point the compiler knows that c[i] has a value
that never changes (on constant i); because I have seen code
{probably in EMBench} where the compiler unrolled a loop and
used constants (much like below) instead of indexing data.
Obviously faster than the vec-loop preceding

Terje Mathisen

unread,
Feb 13, 2023, 4:28:28 AM2/13/23
to
No, this requires something like Mitch's My66000 which keeps all
constants inline as code, including branch tables. Possibly excluding OO
code that's using VMT method launching? Even that should be OK if the
compiler keeps the actual tables in the code segment?
>
> Even if an MMU supports EO protection it is of limited use.

Yeah, unless the rest of the architecture actually makes it useful.

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

unread,
Feb 13, 2023, 8:00:30 AM2/13/23
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Sunday, February 12, 2023 at 3:36:32 PM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> > On Sunday, February 12, 2023 at 11:06:45 AM UTC-6, EricP wrote:
>> >> Agner Fog wrote:
>> >> >
>> >> > EricP wrote:
>> >> >> Most MMU's do not support Execute-Only protection, just Read-Execute,
>> >> >> Read-only, Read-Write, and usually Read-Write-Execute.
>> >> >
>> >> > ForwardCom supports execute-only protection as the default for executable memory.
>> >> > It is not allowed to put data in execute memory. Jump tables etc. are placed in read-only memory.
>> >>
>> >> Execute-only protection doesn't actually accomplish anything
>> >> because code still needs to load program constants.
>> >> It moves RIP-relative load of a constant to a different offset.
>> ><
>> > My 66000 has no loading of constants, the constants traverse the memory
>> > hierarchy as if they were instructions (even going through the ICache). Thus
>> > these constants are Execute-only.
>> There are exceptions. An example:
>>
>> static double const c[] = {1.0, -0.5, 0.3333333333333333, -0.25, 0.2};
><
> You did ask for these constant to be in memory with "static". Thus, while
> these containers do not change their value, they are not constants in the
> sense I am using the word.

It is the closest C equivalent to

integer, parameter :: wp = real64
real(kind=wp), parameter :: c(5) = [1.0_wp, -0.5_wp, 0.3333333333333333_wp, &
-0.25_wp, 0.2_wp]

that I could find :-)

>> double poly(double x)
>> {
>> double res;
>> res = c[4];
>> for (int i=3; i>=0; i--) {
>> res = res * x + c[i];
><
> I am am sure at this point the compiler knows that c[i] has a value
> that never changes (on constant i); because I have seen code
> {probably in EMBench} where the compiler unrolled a loop and
> used constants (much like below) instead of indexing data.

Sure. If you let a good compiler unroll as it would for a "normal"
architecture, it will produce fmac code from the loop.
Unfortunately, this will also destroy many VVM opportunities,
increase code size and use more registers...

Good strategies for combining VVM with unrolling are quite
difficult, I think.

EricP

unread,
Feb 13, 2023, 10:59:46 AM2/13/23
to
MitchAlsup wrote:
> On Sunday, February 12, 2023 at 10:14:01 AM UTC-6, EricP wrote:
>> MitchAlsup wrote:
>>> <
>>> I also fail to see how memory mapped files will work when each file
>>> can be extended by simply writing after the EOF point. To me, this implies
>>> that each MM File has to have its own segment.
>>> <
>> EXE and DLL files are memory mapped files with multiple
>> memory sections whose size is statically defined in the EXE/DLL.
>> I gather Agner wants to aggregate all the EXE/DLL file memory sections
>> in the closure set of a program into a smaller set.
>>
>> A MM data file is a single *mapping window* onto a file,
>> the mapping window size being defined when the section is created
>> and sized <= file size. Even if an OS allows a MM file to be extended
>> the existing mapping windows won't change size because each had to
>> reserved a range of virtual address when the MM section was created.
> <
> So, all we need now is a story of how he manages 1,000-1,000,000 of these
> MM files.

I think it can be made to work, just not quite the way Agner describes.
I'm not sure how big Agner thinks his HW range-mapping TLB is.
He said he does not want TLB misses so I suspect he wants hundreds
of entries. But I don't think that will fly.

Such a large HW table is problematic.
I assume that if an Intel/Amd x64 TLB has 32 entries,
and a range compare entry is twice the line load,
then a Range-TLB would only have, say, 16 entries.

Even if a RNG-TLB with hundreds of entries could be built,
it assumes that the whole table is re-loaded at process switch
which is expensive and wasteful because only a small fraction
of the entries would be in use at once.

So this RNG-TLB needs to have table miss exceptions and
is loaded from some larger in-memory Section Range Table (SRT)
(functional equivalent to a radix Page Table tree).

The HW RNG-TLB would be software managed so the exact design of
this SRT is left to OS people (eg. red-black or AVL tree).
And it avoids designing a HW table walker and having a fixed
format for the SRT.

Each RNG entry requires that a descriptor with three 64-bit values
be loaded to define the virtual lower and upper addresses,
the physical section start address, and control flags.
Special privileged instructions could move descriptor
values from/to R0,R1,R2 and the RNG-TLB.Entry[index].

FIFO loading of entries by the table miss handler is good enough
so RNG-TLB HW doesn't need to track which entries are LRU (KISS).

RNG-TLB would also need housekeeping instructions to
invalidate entries in a certain virtual address range,
and invalidate entries at a specific index.
Each RNG-TLB entry could be marked as global or process,
and process entries tagged with an ASID, say 12-bits,
if one can afford the extra CAM bits per entry.

When the RNG-TLB gets a hit, it has to do arithmetic VA->PA
relocation which puts a 64-bit adder into the translation data path.
And that relocation add only begins after there is a RNG-TLB hit
(unless one can afford a 64-bit adder per table entry,
so in this example a bank of 16 64-bit adders working
in parallel to the RNG-TLB lookup and burning power).

The relocation data path could add a clock to every memory access.
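
Put as C, one entry and the hit path might look something like the sketch
below (a software model of the description above, not a hardware spec;
a linear scan stands in for the CAM, and the flag bits are omitted):

    #include <stdint.h>
    #include <stdbool.h>

    /* One range-TLB entry: three 64-bit words. */
    struct rng_entry {
        uint64_t va_lo, va_hi;   /* inclusive virtual bounds of the section */
        uint64_t pa_base;        /* physical start of the section */
    };

    static bool rng_translate(const struct rng_entry *tlb, int n,
                              uint64_t va, uint64_t *pa)
    {
        for (int i = 0; i < n; i++) {
            if (va >= tlb[i].va_lo && va <= tlb[i].va_hi) {
                *pa = tlb[i].pa_base + (va - tlb[i].va_lo);  /* arithmetic relocation */
                return true;
            }
        }
        return false;            /* miss: software refills from the SRT */
    }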



Scott Lurndal

unread,
Feb 13, 2023, 11:22:59 AM2/13/23
to
EricP <ThatWould...@thevillage.com> writes:
>MitchAlsup wrote:
>> On Sunday, February 12, 2023 at 10:14:01 AM UTC-6, EricP wrote:
>>> MitchAlsup wrote:
>>>> <
>>>> I also fail to see how memory mapped files will work when each file
>>>> can be extended by simply writing after the EOF point. To me, this implies
>>>> that each MM File has to have its own segment.
>>>> <
>>> EXE and DLL files are memory mapped files with multiple
>>> memory sections whose size is statically defined in the EXE/DLL.
>>> I gather Agner wants to aggregate all the EXE/DLL file memory sections
>>> in the closure set of a program into a smaller set.
>>>
>>> A MM data file is a single *mapping window* onto a file,
>>> the mapping window size being defined when the section is created
>>> and sized <= file size. Even if an OS allows a MM file to be extended
>>> the existing mapping windows won't change size because each had to
>>> reserved a range of virtual address when the MM section was created.
>> <
>> So, all we need now is a story of how he manages 1,000-1,000,000 of these
>> MM files.
>
>I think it can be made to work, just not quite the way Agner describes.
>I'm not sure how big Agner thinks his HW range-mapping TLB is.
>He said he does not want TLB misses so I suspect he wants hundreds
>of entries. But I don't think that will fly.
>
>Such a large HW table is problematic.
>I assume that if an Intel/Amd x64 x64 TLB has 32 entries,

Just the CAM itself will be expensive. And note that the
intel/amd L1 TLB entries are backed by the L2 TLB which
has at least 1K entries (some ARM64 CPUs have 2K entries).

EricP

unread,
Feb 13, 2023, 1:49:58 PM2/13/23
to
An L2 TLB requires a HW table walker and a fixed format for
the Section Range Table, both of which are unnecessary for
a software managed TLB. Having a table format defined by
software allows one to experiment with different layouts.
Once the range table format is chosen then later versions
can consider HW walker and interior table entry caching.

For example, it looks like each Range Table Entry (RTE) descriptor
requires ~24 bytes, so a 64-byte cache line can hold 2.66 RTE's.
If it can get the RTE size down to 20 bytes then we get 3 per line.

It also needs some index structure for those RTE's,
perhaps optimally implemented as a cache line aware B+ tree,
but using physical addresses to link the tree nodes so the
table walk is not a recursive translation.
(However VAX/VMS stored its virtual address translation tables
inside virtual space, so recursive translation is possible.)



MitchAlsup

unread,
Feb 13, 2023, 4:46:08 PM2/13/23
to
One can get a cache line aligned _base_ and a cache line aligned
_bounds_, and still have 12-14-bits left over for describing rights
into 2 DoubleWords. It does not go down to the byte level, but
it is 64× more efficient (smaller) than page tables. {And you should
probably NOT be sharing sub-lines across processes that should
not be seeing each others memory}
<
>
> It also the needs some index structure for those RTE's,
> perhaps optimally implemented as a cache line aware B+ tree,
> but using physical addresses to link the tree nodes so the
> table walk is not a recursive translation.
<
Any literature indicating the nested tables (recursive translation)
are a performance hindrance ?

EricP

unread,
Feb 14, 2023, 12:41:49 PM2/14/23
to
MitchAlsup wrote:
> On Monday, February 13, 2023 at 12:49:58 PM UTC-6, EricP wrote:
>> Scott Lurndal wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> MitchAlsup wrote:
>>>>> <
>>>>> So, all we need now is a story of how he manages 1,000-1,000,000 of these
>>>>> MM files.
>>>> I think it can be made to work, just not quite the way Agner describes..
>>>> I'm not sure how big Agner thinks his HW range-mapping TLB is.
>>>> He said he does not want TLB misses so I suspect he wants hundreds
>>>> of entries. But I don't think that will fly.
>>>>
>>>> Such a large HW table is problematic.
>>>> I assume that if an Intel/Amd x64 x64 TLB has 32 entries,
>>> Just the CAM itself will be expensive. And note that the
>>> intel/amd L1 TLB entries are backed by the L2 TLB which
>>> has at least 1K entries (some ARM64 CPUs have 2K entries).
> <
>> An L2 TLB requires a HW table walker and a fixed format for
>> the Section Range Table, both of which are unnecessary for
>> a software managed TLB. Having a table format defined by
>> software allows one to experiment with different layouts.
>> Once the range table format is chosen then later versions
>> can consider HW walker and interior table entry caching.
>>
>> For example, it looks like each Range Table Entry (RTE) descriptor
>> requires ~24 bytes, so a 64-byte cache line can hold 2.66 RTE's.
>> If it can get the RTE size down to 20 bytes then we get 3 per line.
> <
> One can get a cache line aligned _base_ and a cache line aligned
> _bounds_, and still have 12-14-bits left over for describing rights
> into 2 DoubleWords. It does not go down to the byte level, but
> it is 64× more efficient (smaller) than page tables. {And you should
> probably NOT be sharing sub-lines across processes that should
> not be seeing each others memory}
> <

Here I was pointing out the RTE's will have an odd shape and
don't pack nicely into 64B cache lines. A 512-bit cache line
can hold three 170-bit RTE's, 3*170=510 bits with 2 spare bits.

If the RTE layout is bit aligned inside a cache line,
OS memory management software is going to be doing a lot of
double-wide bit field extract and inserts to edit them.
So the ISA needs good bit-wise instructions.

Section range limits and relocation offsets don't need byte resolution,
4kB is sufficient, and the lower and upper section limits
Virtual Page Number (VPN) size would be 64-12=52-bits.

The maximum physical address size could be anything from 52 to 64 bits,
giving a Physical Frame Number (PFN) size of 40 to 52 bits.

If those 3 RTE fields each use 52-bits then that leaves 14 bits for
all other fields, Present, access protection, cache control, etc.
It could get more bits by cutting back maximum physical address space
size from 64 to 60 or 56 bits.
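
Just to show what the bit-aligned extraction costs in plain C (the field
order and offsets below are illustrative, not a proposal):

    #include <stdint.h>

    /* Extract 'len' bits (len <= 64, pos+len <= 512) starting at bit 'pos'
       from a 64-byte line held as eight 64-bit words. */
    static uint64_t get_bits(const uint64_t line[8], unsigned pos, unsigned len)
    {
        unsigned word = pos / 64, off = pos % 64;
        uint64_t v = line[word] >> off;
        if (off + len > 64)                       /* field straddles two words */
            v |= line[word + 1] << (64 - off);
        return (len == 64) ? v : (v & ((1ULL << len) - 1));
    }

    /* RTE i (0..2) starts at bit 170*i: e.g. 52-bit lower VPN, 52-bit upper
       VPN, 52-bit PFN, then 14 flag bits. */
    #define RTE_LO_VPN(line, i)  get_bits((line), 170u*(i),       52)
    #define RTE_HI_VPN(line, i)  get_bits((line), 170u*(i) + 52,  52)
    #define RTE_PFN(line, i)     get_bits((line), 170u*(i) + 104, 52)
    #define RTE_FLAGS(line, i)   get_bits((line), 170u*(i) + 156, 14)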

>> It also the needs some index structure for those RTE's,
>> perhaps optimally implemented as a cache line aware B+ tree,
>> but using physical addresses to link the tree nodes so the
>> table walk is not a recursive translation.
> <
> Any literature indicating the nested-tables (recursive translation}
> are a performance hindrance ?
> <
>> (However VAX/VMS stored its virtual address translation tables
>> inside virtual space, so recursive translation is possible.)

They should be the same performance though I doubt it has been measured.
Intel x86/x64 MMU saves page table interior PTE's in special caches as
it walks down the page table from the root (level 4) to leaf (level 1).
The VAX MMU also caches interior page table PTE's but does so using TLB
entries. In the end they both effect a bottom-up table walk, Intel x86
with special purpose interior caches, VAX by using extra TLB entries.

Intel x86/x64 MMU caches page table interior PTE's as it walks down the
page table from the root (level 4) to leaf (level 1).
If there is a TLB miss at level 1 then it uses those cached PTE's
to walk backwards up the page table from level 1 to level 4 looking
for interior PTE hits to save walking from the root every time.
The interior cached PTE's are indexed using the portion of the
virtual address that was used to translate them in the first place.

VAX MMU accomplishes effectively the same thing. The page table
base register holds a *virtual* address for an array of PTE's.
As each virtual address to a PTE is resolved, an entry is added
to cache that translation in the main TLB.

The net result is both cache interior page table entries.
It is just that x86 uses special purpose TLB caches to do this
and requires separate lookaside tables to hold levels 2,3,4.
VAX saves interior PTE entries in the TLB and so requires a
larger single TLB CAM to accomplish the same reverse table walk.

The reason the VAX approach is interesting is because the
page tables levels have *virtual* base addresses.
For a range table index tree, the OS software needs to be able to
balance it, so it wants to work with virtual addresses pointers,
whereas hardware tables walkers want physical addresses pointers.

(Just musing out loud...)
Some kind of VAX style nested recursive translation might be the answer.
The level-1 RTE's are stored in an array at a *virtual* base address.
The level-2 and up indexes, the B+ tree, to those RTE's is an array of
cache line buckets which give the *virtual* address of level-1 RTE's,
which also have a virtual base address. To access the level-2 indexes
their virtual addresses must be translated, which requires extra
level-1 RTE's that map them, and so on recursively up the tree.
(I'm not really sure what that means either :-) )



MitchAlsup

unread,
Feb 14, 2023, 1:03:23 PM2/14/23
to
Interior page entries are called page table pointers (PTPs) {
<
> Intel x86/x64 MMU caches page table interior PTE's as it walks down the
> page table from the root (level 4) to leaf (level 1).
> If there is a TLB miss at level 1 then it uses those cached PTE's
> to walk backwards up the page table from level 1 to level 4 looking
> for interior PTE hits to save walking from the root every time.
> The interior cached PTE's are indexed using the portion of the
> virtual address that was used to translate them in the first place.
>
> VAX MMU accomplishes effectively the same thing. The page table
> base register holds a *virtual* address for an array of PTE's.
> As each virtual address to a PTE is resolved, an entry is added
> to cache that translation in the the main TLB.
>
> The net result is both cache interior page table entries.
> It is just that x86 uses special purpose TLB caches to do this
> and requires separate lookaside tables to hold levels 2,3,4.
> VAX saves interior PTE entries in the TLB and so requires a
> larger single TLB CAM to accomplishes the same reverse table walk.
>
<
} Interior page entries are called page table pointers (PTPs)

Brian G. Lucas

unread,
Feb 14, 2023, 2:06:15 PM2/14/23
to
I agree. Sometimes I wish we could generate code from a basic block several
different ways and pick the "best".

Quadibloc

unread,
Feb 14, 2023, 4:08:41 PM2/14/23
to
On Monday, February 13, 2023 at 6:00:30 AM UTC-7, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:

> > I am am sure at this point the compiler knows that c[i] has a value
> > that never changes (on constant i); because I have seen code
> > {probably in EMBench} where the compiler unrolled a loop and
> > used constants (much like below) instead of indexing data.

> Sure. If you let a good compiler unroll like for a "normal"
> architecture, it will produce fmac code from the loop.
> Unfortunately, this will also destroy many VVM opportunities,
> increase code size and use more registers...

> Good strategies for combining VVM with unrolling are quite
> difficult, I think.

Even if it is true, it may not be a problem. If VVM provides the
same benefits as loop unrolling, then it won't matter if one can't
unroll a loop to which VVM has been applied.

My limited understanding of VVM is that it is intended to do the
following - without adding many new instructions to the instruction set,
effectively convert loops into vector instructions.

Unlike the CRAY vector machines, though, VVM doesn't have vector
registers, which were what allowed them to work so much better than
the memory-to-memory vector machines that preceded the CRAY-1.
How does VVM avoid this problem?

It needs to be applied to loops which have multiple arithmetic operations in
them, not just one operation at a time. This way, explicit vector registers
aren't needed because intermediate results are handled by operand forwarding.
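
For concreteness, the kind of loop being described is nothing exotic at the
source level; something like the plain C below, where (as in the listing
earlier in the thread) the compiler brackets the body with VEC/LOOP and the
intermediate never needs an architectural vector register:

    /* Two dependent operations per element; the intermediate 't' lives
       only in the forwarding network when the loop is vectorized. */
    void scale_clamp(float *dst, const float *src, float a, float b, int n)
    {
        for (int i = 0; i < n; i++) {
            float t = src[i] * a + b;          /* multiply-add */
            dst[i] = (t > 1.0f) ? 1.0f : t;    /* clamp */
        }
    }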

This sounds like the desired operations are going to be converted into
a near-ideal sequence of micro-operations, which should mean that
loop unrolling is not necessary.

When loops are too short to benefit from VVM, then they should be
unrolled, at least if they're short enough to unroll as well.

With ordinary loops, it makes sense to, say, unroll four iterations of the
loop, and then iterate the loop 1/4 as many times. It cuts the loop overhead
down to a quarter of its value. What would that do with VVM?

There isn't really any loop overhead to speak of. Depending on the
specific implementation, it _could_ have a benefit, for example if
the explicit use of more temporary registers in the unrolled loop allows
more operand forwarding chains to have something to hang on - for
some strange reason, those chains are limited in length, but you can
have more of them if you want. To me, that sounds like a bizarre situation
that is unlikely to occur in practice, and it's hard for me to imagine how
else there would be an opportunity to benefit.

But it might still be done with VVM if we're talking about an ISA that
has a range of implementations, some that just execute VVM code
in a compatible manner, but as ordinary loops, because vector hardware
is lacking on the smaller implementations. So all that's needed in that
situation is to ensure that, except for enlarging the memory footprint of
the code, loop unrolling doesn't have negative implications.

Otherwise, one ends up with a situation where code has to be distributed
in a form that gets compiled on every machine on which it is used, so as
to be optimum for different implementations. Which means that the ISA
isn't really the ISA.

John Savard

MitchAlsup

unread,
Feb 14, 2023, 4:51:12 PM2/14/23
to
On Tuesday, February 14, 2023 at 3:08:41 PM UTC-6, Quadibloc wrote:
> On Monday, February 13, 2023 at 6:00:30 AM UTC-7, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> > > I am sure at this point the compiler knows that c[i] has a value
> > > that never changes (on constant i); because I have seen code
> > > {probably in EMBench} where the compiler unrolled a loop and
> > > used constants (much like below) instead of indexing data.
>
> > Sure. If you let a good compiler unroll as it would for a "normal"
> > architecture, it will produce fmac code from the loop.
> > Unfortunately, this will also destroy many VVM opportunities,
> > increase code size and use more registers...
>
> > Good strategies for combining VVM with unrolling are quite
> > difficult, I think.
<
> Even if it is true, it may not be a problem. If VVM provides the
> same benefits as loop unrolling, then it won't matter if one can't
> unroll a loop to which VVM has been applied.
>
> My limited understanding of VVM is that it is intended to do the
> following - without adding many new instructions to the instruction set,
> effectively convert loops into vector instructions.
>
> Unlike the CRAY vector machines, though, VVM doesn't have vector
> registers, the feature that allowed them to work so much better than
> the memory-to-memory vector machines that preceded the CRAY-1.
> How does VVM avoid this problem?
<
The first V in VVM is Virtual. VVM does not add a vector register file; that
virtual file ends up being memory buffers. A CRAY-1-style vector register file
would raise the context switch overhead from 10 cycles to 132 cycles, simply
from the size (number of bits) of the file. So, I did not want that.
<
Not having a vector register file means the kernel can use VVM for its
various loops without saving the VRF beforehand (unlike RISC-V vectors
or any SIMD stuff). Compiling for the kernel is no different from compiling
for applications -- BECAUSE of the lack of state !!
>
> It needs to be applied to loops which have multiple arithmetic operations in
> them, not just one operation at a time. This way, explicit vector registers
> aren't needed because intermediate results are handled by operand forwarding.
<
The original CRAY-1 was a 1-calculation per cycle per function unit design.
That is 1-lane wide per function unit.
Modern cores have enough resources to have multiple lanes of function units,
so one can perform {2, 4, 8} × {FADDs, FMULs, FMACs} simultaneously.
>
> This sounds like the desired operations are going to be converted into
> a near-ideal sequence of micro-operations, which should mean that
> loop unrolling is not necessary.
>
> When loops are too short to benefit from VVM, then they should be
> unrolled, at least if they're short enough to unroll as well.
<
Even:
<
for( i = 0; i < n; i++ )
a[i] = 0;
<
is worthy of VVMing. VVM should be able to perform this at cache access
width (128 bits) even on the smallest conceived My 66000 design point.
This loop consists of 1 instruction-modifier and 2 instructions--the actual
looping contains only 2 instructions.
<
Although:: memset( a, 0, n*sizeof(a[0]) ); // is the preferred embodiment.
>
> With ordinary loops, it makes sense to, say, unroll four iterations of the
> loop, and then iterate the loop 1/4 as many times. It cuts the loop overhead
> down to a quarter of its value. What would that do with VVM?
<
Note: under VVM vectorization, the programmer retains his "linear"
view of the vectorized calculation. Should an exception transpire, the
IP is pointing at the instruction that caused the exception, and the loop
iterator and all other registers have the values appropriate for IP to
be pointing at that instruction in that iteration. The poor programmer
does not have to understand vectorization in order to debug his
program.
>
> There isn't really any loop overhead to speak of. Depending on the
> specific implementation, it _could_ have a benefit, for example if
> the explicit use of more temporary registers in the unrolled loop allows
> more operand forwarding chains to have something to hang on - for
> some strange reason, those chains are limited in length, but you can
> have more of them if you want. To me, that sounds like a bizarre situation
> that is unlikely to occur in practice, and it's hard for me to imagine how
> else there would be an opportunity to benefit.
<
My guess is that VVMing a loop to strcpy 3-5 characters is the break-even
point. Copying 3-4 characters is nearing break-even; a strcpy of 1-2 characters
will lose by a small cycle count.
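The kind of byte-copy loop that estimate refers to is just the ordinary C idiom (nothing VVM-specific is expressed in the source; the name copy_str is illustrative):

char *copy_str(char *restrict dst, const char *restrict src)
{
    char *d = dst;
    /* One load, one store, and one compare against NUL per character;
       for a 1-2 character string even a tiny vector start-up cost loses. */
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}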
>
> But it might still be done with VVM if we're talking about an ISA that
> has a range of implementations, some that just execute VVM code
> in a compatible manner, but as ordinary loops, because vector hardware
> is lacking on the smaller implementations. So all that's needed in that
> situation is to ensure that, except for enlarging the memory footprint of
> the code, loop unrolling doesn't have negative implications.
<
It is possible to make a My 66000 implementation that supports VVM
strictly as a sequencer in the DECODE stage with no VVM extras
whatsoever. {As I have said in the past:: that would be a shame, but it is
possible.}
>
> Otherwise, one ends up with a situation where code has to be distributed
> in a form that gets compiled on every machine on which it is used, so as
> to be optimum for different implementations. Which means that the ISA
> isn't really the ISA.
<
Otherwise, there is no preservation of the software programming effort from
generation to generation. With VVM, there is preservation of the software
investment.
<
{Vector machines preserve the software investment significantly better than
SIMD machines preserve (ha-ha) the software investment. The mere fact that
you deliver a software package and have to specialize it for the target machine
illustrates the lack of such preservation.}
<
VVM should be able to port an a.out from machine to machine without
recompilation or relinking:: just load it, and it runs just fine from one end of
the machine spectrum to the other.
<
VVM is a mechanism that gets the ISA out of implementation decisions.
So, with 2 instructions My 66000 gains access to something like 1200
instructions. At some point the R in RISC should stand for "Reduced".
>
> John Savard