
Introducing ForwardCom: An open ISA with variable-length vector registers


Agner Fog

Jan 30, 2023, 6:03:36 AM
I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.

ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:

A register used as loop counter is initialized to the array length. This counter is decremented by the vector register length and repeats as long as it is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.

A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
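In scalar C terms, the mechanism can be sketched roughly like this (MAXVEC and add_one are illustrative names standing in for the hardware behavior, not part of ForwardCom):

```c
#define MAXVEC 8  /* stands in for the CPU's maximum vector length, in elements */

/* Scalar model of the ForwardCom array loop: the counter starts at the
   array length, each pass handles min(counter, MAXVEC) elements, and the
   data is addressed as (end of array - counter). There is no tail loop:
   the last pass simply shrinks to the remaining element count. */
void add_one(int *a, int n) {
    for (int cnt = n; cnt > 0; cnt -= MAXVEC) {
        int len = cnt < MAXVEC ? cnt : MAXVEC;  /* vector length this pass */
        int *p = a + (n - cnt);                 /* end-minus-counter addressing */
        for (int i = 0; i < len; i++)
            p[i] += 1;
    }
}
```

The same source works unchanged for any MAXVEC, which is the point of the forward-compatibility claim.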

This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.

There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.

It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.

ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.

The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.

All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.

The flexible instruction format allows the CPU to do more work per instruction. Various option bits add extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit. Multiplication, division, and floating point arithmetic have longer latencies.

The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
for (int r0 = 0; r0 < 100; r0++) { }

There are many other features, including strong security, efficient memory management, an efficient library system, efficient global register allocation, and more.

See www.forwardcom.info for all the details.

The current status of the ForwardCom project is:

* Instruction set and ABI standards are defined in detail.
* The following binary tools have been developed: high-level assembler, disassembler, linker, library manager, emulator, and debugger. A compiler is not available yet.
* A hardware implementation in an FPGA soft core is available with full documentation. The current version supports integer instructions only.
* Everything is free and open source.

MitchAlsup

Jan 30, 2023, 2:07:31 PM
On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
> I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.
>
> ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:
>
> A register used as loop counter is initialized to the array length. This counter is decremented by the vector register length and repeats as long as it is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.
>
> A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
>
> This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.
>
> There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.
>
> It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.
>
> ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.
<
My 66000 is mostly RISC with a few of the better parts of CISC thrown in. I also
think that something between RISC and CISC has potential for being better than
either. And in particular, Brian's LLVM compiler produces My 66000 ASM for
benchmarks (with available source code) that averages 75%-79% of the instruction
count of the RISC-V LLVM compiler (with the same optimization settings).
>
> The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.
<
When I designed My 66000 ISA I developed the instruction set so that every
instruction could have 1 immediate, or 1 displacement, or both; and that
the immediate could be in either operand position. This enables my ISA to
<
SLL R7,#1,R9
STD #3.1415926535897932,[Rbase+Rindex<<scale+Displacement]
FMAC R9,R7,#3.2768D5,R9
FMAC R10,R6,R5,#1.7D0
<
as single instructions. There is no lower-power way to deliver an operand into
a calculation than as a constant.
>
> All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.
<
Same.
>
> The flexible instruction format allows the CPU to do more work per instruction.
<
Agreed.
<
> Various option bits adds extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit.
<
I worked in sign control, too; so::
<
ADD R7,R9,-R14 // is the subtract version
ADD R7,-R9,R14 // is the reverse subtract version
<
> Multiplication, division, and floating point arithmetic have longer latencies.
<
Of course they do.

MitchAlsup

Jan 30, 2023, 6:06:57 PM
On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
>
> The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
> for (int r0 = 0; r0 < 100; r0++) { }
>
This reminds me of CDC 6600 assembly, where essentially all calculations are written like
set x7 = x3 + x2
<
How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
And what code would be produced ??
<

Agner Fog

Jan 31, 2023, 2:00:46 AM
31. januar 2023 kl. 00.06.57 UTC+1 MitchAlsup wrote:
> How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
> And what code would be produced ??

The ForwardCom instruction format has one field specifying operand type and precision, so you cannot mix operand sizes in a single instruction. Unused parts of a register are always set to zero, so a 32-bit operation will zero-extend the result to 64 bits in a 64-bit register.
int32 r7 = r6 + r19
will do a 32-bit addition and zero-extend the result to 64 bits in r7. This will work if r19 is known to contain an 8-bit unsigned integer. If bits 8-31 of r19 are not known to be zero, then you have to cut them off first:
int8 r19 = r19
There is no implicit sign extension. If you want to sign-extend an integer to a larger size, you must use an extra instruction to do so.
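The zero-extension rule can be modelled in plain C like this (an illustrative model with hypothetical names, not ForwardCom code):

```c
#include <stdint.h>

/* Illustrative C model of the rule that a 32-bit operation zero-extends
   its result into the 64-bit destination register. */
uint64_t add32(uint64_t r6, uint64_t r19) {
    uint32_t sum = (uint32_t)r6 + (uint32_t)r19;  /* 32-bit addition */
    return (uint64_t)sum;                         /* bits 32-63 become zero */
}

/* Models "int8 r19 = r19": keep bits 0-7, clear the rest. */
uint64_t trunc8(uint64_t r) {
    return (uint64_t)(uint8_t)r;
}
```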

Thomas Koenig

Feb 1, 2023, 9:03:45 AM
Agner Fog <ag...@dtu.dk> schrieb:

> The ForwardCom assembly language is simple and immediately
> intelligible. Adding two integers is as simple as: int r1 =
> r2 + r3. Branches and loops are written just like in C or Java,
> for example:

> for (int r0 = 0; r0 < 100; r0++) { }

Quite interesting, thanks a lot for sharing!

One question: What would be the best way to handle loop-carried
dependencies (let's say a memmove, where operands can overlap,
or C's

void add (const int *a, int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

in your ISA?

Quadibloc

Feb 1, 2023, 9:48:06 AM
On Monday, January 30, 2023 at 12:07:31 PM UTC-7, MitchAlsup wrote:

> My 66000 is mostly RISC with a few of the better parts of CISC thrown in.

The only thing I'm not happy about in your design is the fact that
some actions, treated as single instructions by the microarchitecture
for efficiency, are specified as multiple instructions in the code.

To me, this appears to massively complicate the decoding of
instructions, making the architecture more difficult to implement.

Of course, though, it doesn't make the processor less efficient,
since instruction decoding can be done well ahead of execution,
so this need not have any impact on latencies.

John Savard

Terje Mathisen

Feb 1, 2023, 11:26:50 AM
Does that make sense? Adding a 'const' to the target array?

I could see how noalias would be useful for the compiler but what is
const supposed to do here?

BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
loop, effectively something like:

while (i < n) {
    i += vector_add(a, b, c, n-i);
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

Feb 1, 2023, 12:19:32 PM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Agner Fog <ag...@dtu.dk> schrieb:
>>
>>> The ForwardCom assembly language is simple and immediately
>>> intelligible. Adding two integers is as simple as: int r1 =
>>> r2 + r3. Branches and loops are written just like in C or Java,
>>> for example:
>>
>>> for (int r0 = 0; r0 < 100; r0++) { }
>>
>> Quite interesting, thanks a lot for sharing!
>>
>> One question: What would be the best way to handle loop-carried
>> dependencies (let's say a memmove, where operands can overlap,
>> or C's
>>
>> void add (const int *a, int *b, int *c, int n)
>> {
>>     for (int i = 0; i < n; i++)
>>         a[i] = b[i] + c[i];
>> }
>
> Does that make sense? Adding a 'const' to the target array?

Not at all; I had not thought that through. The const should be on
the other two arguments.

> I could see how noalias would be useful for the compiler but what is
> const supposed to do here?

> BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
> loop, effectively something like:
>
> while (i < n) {
> i += vector_add(a,b,c,n-i);
> }

Hm.

Let's assume C semantics without noalias or restrict, and assume that
the function (without the nonsensical const) is called as

int *arr;
add (arr+1, arr, arr+2, n);

Then, the store to a[0] would overwrite arr[1], and in the next
iteration b[1] would be referenced, which would be arr[1], so the
store to a[0] would have to be carried over to the next iteration
of the loop. Classic aliasing (and I made the case especially
obnoxious because not even loop reversal would help).
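To make the hazard concrete, here is a small C demonstration (the helper names are illustrative): the scalar loop and a naively batched "vector" version give different results for exactly this kind of call.

```c
#include <string.h>

/* Scalar semantics: the store to a[i] is visible to later loads of b/c. */
void add_scalar(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* A naive "wide" version: load all inputs first, then store.
   Correct only when a[] does not overlap b[] or c[] (demo assumes n <= 16). */
void add_batched(int *a, const int *b, const int *c, int n) {
    int tb[16], tc[16];
    memcpy(tb, b, (size_t)n * sizeof(int));
    memcpy(tc, c, (size_t)n * sizeof(int));
    for (int i = 0; i < n; i++)
        a[i] = tb[i] + tc[i];
}
```

Calling both with a = arr+1, b = arr, c = arr+2 on an array of ones yields {1,2,3,4,...} from the scalar loop but {1,2,2,2,...} from the batched one, because the batched loads miss the stores from earlier iterations.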

Mitch's My66000 would detect the interdependency at runtime and
drop down to scalar mode.

For Fortran, there is a related problem. While arguments cannot
overlap, the language proscribes that the right-hand side of
an assignment is evaluated before the actual assignment.
This leads to generation of temporary arrays for code like

a(n:m) = a(n-1:m-1) + a(n+1:m+1)

or when overlap / non-overlap cannot be detected at compile time.

My question would be: Would this kind of thing be left to the
programmer (or the language definition, or the compiler) or are
there special provisions in the ISA to deal with this kind of
thing one way or another?

Mitch's VVM drops down to scalar if an interdependency is found,
via a hardware check.

MitchAlsup

Feb 1, 2023, 2:05:09 PM
On Wednesday, February 1, 2023 at 8:03:45 AM UTC-6, Thomas Koenig wrote:
> Agner Fog <ag...@dtu.dk> schrieb:
> > The ForwardCom assembly language is simple and immediately
> > intelligible. Adding two integers is as simple as: int r1 =
> > r2 + r3. Branches and loops are written just like in C or Java,
> > for example:
>
> > for (int r0 = 0; r0 < 100; r0++) { }
> Quite interesting, thanks a lot for sharing!
>
> One question: What would be the best way to handle loop-carried
> dependencies (let's say a memmove, where operands can overlap,
> or C's
<
MM Rto,Rfrom,Rcnt
>
> void add (const int *a, int *b, int *c, int n)
> {
>     for (int i = 0; i < n; i++)
>         a[i] = b[i] + c[i];
> }
<
I fail to see a loop carried dependence. Nothing loaded or
calculated in one iteration is used in a subsequent iteration.
Do I have a bad interpretation of "loop carried" ?
>
> in your ISA?

Thomas Koenig

Feb 1, 2023, 2:20:41 PM
MitchAlsup <Mitch...@aol.com> schrieb:
It's not obvious, but (see my reply to Terje) the pointers can
point to overlapping parts of an array. In this case, a non-obvious
loop carried dependence can be introduced (note the lack of
restrict in the argument list), apart from the const, which
is wrong.

Tim Rentsch

Feb 1, 2023, 2:39:13 PM
Thomas Koenig <tko...@netcologne.de> writes:

[...]

> For Fortran, there is a related problem. While arguments cannot
> overlap, the language proscribes that the right-hand side of
> an assignment is evaluated before the actual assignment.

I expect you mean prescribes, not proscribes.

MitchAlsup

Feb 1, 2023, 3:25:25 PM
I can envision an implementation where there are no loop
carried dependences.
>
But if your point was to illustrate that C has bad semantics
wrt array aliasing, I have to agree. The Fortran version has
no loop carried dependence.

Quadibloc

Feb 1, 2023, 4:19:58 PM
On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:

> ForwardCom can vectorize array loops in a new way that automatically
> adjusts to the maximum vector length supported by the CPU. It works
> in the following way:

> A register used as loop counter is initialized to the array length. This
> counter is decremented by the vector register length and repeats as
> long as it is positive. The counter register also specifies the vector
> register length. The vector registers used in the loop will have the
> maximum length as long as the counter exceeds this value. There
> is no loop tail because the vector length is automatically adjusted
> in the last iteration of the loop to fit the remaining number of array
> elements.

This isn't completely new. The IBM System/370 at one point added a
vector feature which was largely modelled on that of the Cray I. But
unlike the Cray I, it was designed so that one model of the 370 might
have vector registers with 64 elements, and another one might have
vector registers with 256 elements.

To allow code to be written that was independent of the hardware
vector size, a 16-bit field in the vector control register was used as
the loop counter.

This is not the same as the vector feature recently added to the
current IBM z/Architecture mainframes, which is more comparable
to AVX on x86 microprocessors.

John Savard

Thomas Koenig

Feb 1, 2023, 5:04:46 PM
The point I was trying to make (but which probably didn't come
across very well) is that loop-carried dependencies which are
allowed by the language can carry large performance penalties,
and that they impede any sort of vectorization - risking wrong
code, performance degradation, or both.

And Fortran has its own issues in that respect, not with arguments,
but with array expressions. Conceptually, everything on the
right-hand side of an assignment is evaluated before assignment
to the left-hand side. This is no problem if there are no
dependencies between the two sides. If there are simple
dependencies, like

a = sin(a)

that is not a problem, because the compiler can easily prove
that this can be scalarized into a simple loop. For
something like

a(1:n) = a(0:n-1)

(array slice notation, where after the assignment a(1) has the
value that a(0) had previously, a(2) previously a(1), etc.) this
is conceptually identical to

tmp(1:n) = a(0:n-1)
a(1:n) = tmp

but the compiler cannot use a forward loop. It could either
actually create a temporary, or reverse the loop.

And in doubt, the compiler has to err on the side of correct code,
rather than speed.
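In C terms, the loop-reversal option for the shift example above can be sketched like this (an illustrative helper, not generated compiler code):

```c
/* C sketch of the overlapping slice assignment a(1:n) = a(0:n-1):
   a forward loop would propagate a[0] into every element, so either
   materialize a temporary or, as here, run the loop backwards so every
   element is read before it is overwritten. */
void shift_up(int *a, int n) {
    for (int i = n; i >= 1; i--)
        a[i] = a[i-1];
}
```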

Terje Mathisen

Feb 1, 2023, 5:25:03 PM
IMHO, this is exactly like the IBM RLL-decoding using intentionally
overlapping moves (MVC?): You really, really should not do it, and if
you do, then you deserve whatever ills befall you.

MitchAlsup

Feb 1, 2023, 6:40:20 PM
On Wednesday, February 1, 2023 at 4:04:46 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> > But if your point was to illustrate that C has bad semantics
> > wrt array aliasing, I have to agree. The Fortran version has
> > no loop carried dependence.
<
> The point I was trying to make (but which probably didn't come
> across very well) is that loop-carried dependencies which are
> allowed by the language can carry large performance penalties,
> and that they impede any sort of vectorization - with wrong code
> or with performance degradation, or both.
<
A subtle point you may have missed with VVM is that one CAN
vectorize such loops, and the ones with such a dependency simply
run slower (at the speed the dependency resolves) rather than at the
speed the pipeline is capable of running if the dependence were
not present. In both cases, the correct results are obtained.
<
Loop carried dependencies do not prevent VVM vectorization.

Thomas Koenig

Feb 2, 2023, 1:30:43 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
I concur. The question is who is "you" in your sentence above...

Compiler writers have to get it right, according to the language
specification, so wrong code is not an option. If the language
specification results in pessimized code to cater for a case that
hardly anybody uses intentionally, well... what is a compiler
writer to do?

Thomas Koenig

Feb 2, 2023, 1:45:30 AM
MitchAlsup <Mitch...@aol.com> schrieb:
I know that, but didn't explicitly mention it :-)
><
> Loop carried dependencies do not prevent VVM vectorization.

The _really_ good thing is that it only happens when detected at
runtime, so there is no performance penalty when it is not needed.

But if loop carried dependencies occur, they will have a large
performance impact.

Also, VVM faithfully reproduces a specific ordering of the loop:
the one it was written in. This is obviously ideal for
languages which write out explicit loops, like C, but does not
offer improvement for languages (like Fortran :-) which operate
more on vectors or arrays.

Consider the Fortran statement

a(n:m) = a(k:k+m-n) + 2.0

which can be efficiently implemented with either a forward or a backward
loop, depending on the sign of n - k. The compiler has two choices:
Either copy the intermediate result to a temporary, or perform a
run-time test and switch the loop ordering, depending on the result.

Depending on the length of the loop, there is then always the question
of which is faster, because the run-time test adds to the overhead of
the loop startup.

This is one case where a genuine vector ISA could offer advantages,
it could do the loop reversal check in hardware, and fast. I simply
don't know if ForwardCom can do that, or not.

Agner Fog

Feb 2, 2023, 1:57:11 AM

Thomas Koenig wrote:
What would be the best way to handle loop-carried dependencies (let's say a memmove, where operands can overlap)

The memmove function allows overlap, unlike memcpy. The memmove library function uses a backwards loop when necessary. The ForwardCom library function will do the same, of course. A backwards loop is not as elegant as a forward loop, but it is still possible.

An optimizing compiler will not vectorize a loop automatically unless it can prove that there is no aliasing problem. The programmer may help by explicit vectorization, or a __restrict keyword, or a no-aliasing compiler option. ForwardCom does not differ from other SIMD processors in this respect.
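A minimal C sketch of the direction test such a memmove-style loop makes (illustrative names, not the actual ForwardCom library code):

```c
/* Copy n ints, choosing the loop direction so that overlapping
   source and destination still give memmove semantics. */
void move_ints(int *dst, const int *src, int n) {
    if (dst <= src || dst >= src + n) {
        for (int i = 0; i < n; i++)       /* no harmful overlap: forward */
            dst[i] = src[i];
    } else {
        for (int i = n - 1; i >= 0; i--)  /* dst overlaps ahead of src: backward */
            dst[i] = src[i];
    }
}
```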


Quadibloc wrote:
some actions, treated as single instructions by the microarchitecture for efficiency, are specified as multiple instructions in the code.

The assembler understands the high-level construct:
for (int r1=0; r1<n; r1++) { }
The loop prolog (r1=0) is a single instruction (or two instructions if n is a register and it is not certain that n > 0).
The loop epilog (r1++, jump if r1<n) is a single instruction. It is the high-level assembler that does the translation. The microprocessor sees the epilog as a single instruction and treats it like any other combined ALU-branch instruction, so it has no problem decoding this instruction.
The compiler is free to pass the 'for' loop as a high-level expression to the assembler or translate it to low-level instruction mnemonics.

Thomas Koenig wrote:
This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.

It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big, with branches and function calls and nested loops, etc.? My66000 cannot do this either. It can only handle small loops.

Agner Fog

Feb 2, 2023, 2:02:04 AM

> Thomas Koenig wrote:
> This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.
> It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big with branches and function calls and nested loops etc. My66000 cannot do this either. It can only handle small loops.

What I meant was, an out-of-order My66000 might do the equivalent of loop reversal in small, simple cases, but not for big or complicated loops.

Thomas Koenig

Feb 2, 2023, 12:26:04 PM
Thomas Koenig <tko...@netcologne.de> schrieb:

> Also, VVM faithfully reproduces a special ordering at the loop,
> the one that it was written in. This is obviously ideal for
> languages which write out explicit loops, like C, but does not
> offer improvment for languages (like Fortran :-) which operate
> more on vectors or arrays.

Actually, that should read

"does not offer additional improvement". Having VVM for Fortran would
already speed up a lot of code.

Stephen Fuld

Feb 2, 2023, 12:56:55 PM
I get what you are saying, but I am not sure it is true. Specifically,
the CPU can do only one add at a time (due to the dependency), so
it doesn't offer the advantage of operating on multiple elements
simultaneously, as it does on loops without loop-carried dependencies.

But it may be, and Mitch is the expert here, that the other advantages
of VVM, specifically the large "prefill" buffers (I'm sorry, I forget
what they are officially called), offer an advantage over pure scalar code.

BTW, whether you call what VVM does for this loop "vectorizing" is, I
think, a matter of one's definition of vectorizing, specifically whether
the definition must include operating on more than one element
simultaneously or not.




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Feb 2, 2023, 3:52:27 PM
On Thursday, February 2, 2023 at 11:56:55 AM UTC-6, Stephen Fuld wrote:
> On 2/2/2023 9:26 AM, Thomas Koenig wrote:
> > Thomas Koenig <tko...@netcologne.de> schrieb:
> >
> >> Also, VVM faithfully reproduces a special ordering at the loop,
> >> the one that it was written in. This is obviously ideal for
> >> languages which write out explicit loops, like C, but does not
> >> offer improvment for languages (like Fortran :-) which operate
> >> more on vectors or arrays.
> >
> > Actually, that should read
> >
> > "does not offer additional improvement". Having VVM for Fortran would
> > already speed up a lot of code.
<
> I get what you are saying, but I am not sure it is true. Specifically,
> while the CPU can do only one add at a time (due to the dependency),
> so it doesn't offer the advantage of operating on multiple elements
> simultaneously, as it does on non loop carried dependency loops.
<
If we are talking about the following loop (snatched from above)::
<
void add (const int *a, int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
<
The only thing preventing performing 2-4-8 adds at the same time
is whether (or not) a[] actually aliases b[] or c[]. If/when there is no
aliasing, this can be performed as wide as the implementation has
resources. If aliasing is actually taking place, then one is limited by
the actual data-flow.
<
Note that in the case above::
<
add( a, a, @a[37]);
<
does not perform any ACTUAL aliasing up to execution widths of
32 per cycle (8 per cycle FP due to latency).
>
> But it may be, and Mitch is the expert here, that the other advantages
> of VVM, specifically, the large "prefill" buffers (I'm sorry, I forget
> what they are officially called) offer an advantage over pure scaler code.
<
Consider that a processor is composed of many function units, each
having a characteristic latency:: integer is 1 cycle, LDs are 3, STs are 1
cycle, FP is 4 cycles. Now consider that there is 1 FU of each kind and
that the LOOP instruction executes in 1 cycle. Finally consider that the
underlying machine is 1-wide and in-order (classic 1st generation RISC).
Cache width is 128 bits--not 1st generation RISC, but suitable for
misaligned DoubleWord access at full speed.
>
Here, loops are bound by the number of times a FU is used within the
loop. DAXPY has 2 LDs, 1 ST, and 1 FMAC per iteration. LDs and STs
are being performed 128 bits per access, or 2 iterations per access.
<
On my 1-wide in-order machine, we are reading 128 bits per access for
the 2 LDs, and writing 128 bits for the ST, every 3 (2*) cycles, so with a
single FMAC unit we can perform 2 LDs, 1 FMAC, 1 ST and the equivalent
of 1×(ADD-CMP-BLT) every 3 cycles (7 instructions in 3 cycles
= 2+1/3 I/C). 2.3 IPC is not bad for a 1-wide machine, being about
3.5× the performance of the scalar code all by itself.
<
(*) The nature of cache lines having 1 tag and 4 quadword columns
enables the 2 LDs and the ST to all be performed in 2 cycles, since
the cache line is 4 accesses wide and we need 3 tag accesses every
4 cycles. {This may require an additional QuadWord of buffering for
STs.} In order to do this, the column being stored cannot be a column
being read by either of the 2 LDs. 7 effective instructions in 2 cycles
is 3.5 IPC and 5× what scalar code could perform.
<
Also note: Suppose the C (or Fortran) source has manually unrolled this
loop 4 times. That just multiplies the number of instructions in the loop
by 4 while leaving a single LOOP instruction for iteration control.
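For reference, the loop being analyzed above is the standard DAXPY kernel:

```c
/* DAXPY: y = a*x + y. Per element this is 2 loads (x[i], y[i]),
   1 fused multiply-add, and 1 store, as counted above. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```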
<
> BTW, whether you call what VVM does for this loop "vectorizing" is, I
> think a matter of ones definition of vectorizing, specifically whether
> the definition must include operating on more than one element
> simultaneously or not.
>
In the same way a CRAY-like vector machine does not need fetch
bandwidth while its vector instructions are being performed, a VVM
implementation does not need the Fetch-Parse-Decode stages of
the pipeline to be active while the loop is being performed.

Anton Ertl

Feb 2, 2023, 6:03:22 PM
Agner Fog <ag...@dtu.dk> writes:
>ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:
>
>A register used as loop counter is initialized to the array length. This counter is decremented by the vector register length and repeats as long as it is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.
...
>There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.

In recent times we have had the problem that CPU manufacturers combine
different cores on the same CPU, and want to migrate threads between
these cores. So they use the same SIMD lengths on all cores, even, in
the case of Intel, disabling AVX-512 on CPUs that do not have the
smaller cores, i.e., where all cores could do AVX-512 if it was not
disabled.

It seems to me that you don't have an answer for that. And I guess
that as long as there are architectural (programmer-visible) SIMD
registers, the problem will persist. The SIMD registers must not be
an architectural feature, as in VVM (or maybe Helium?), to make
migration from cores with longer to cores with shorter vectors
possible.

The other option would be for Intel to implement AVX-512 in the small
cores without 512-bit units, the same way that AMD has done time
and again: XMM = 2×64 bits on K8, YMM = 2×128 bits on later heavy machinery
and Zen1, and ZMM = 2×256 bits on Zen4.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Agner Fog

unread,
Feb 3, 2023, 12:47:28 AM2/3/23
to
Anton Ertl wrote:
> In recent time we have had the problem that CPU manufacturers combine
> different cores on the same CPU, and want to migrate threads between
> these cores. So they use the same SIMD lengths on all cores, even, in
> the case of Intel, disabling AVX-512 on CPUs that do not have the
> smaller cores, i.e., where all cores could do AVX-512 if it was not
> disabled.
Yes, this was a big mistake in my opinion. The first version of Intel's Alder Lake combined cores with and without AVX-512 support. Your program will crash if the software detects that it can use AVX-512 and is later moved to a core without AVX-512. So Intel had to disable AVX-512 completely, including the new half-precision floating point instructions. Did Intel expect the OS to move a thread back to a capable core in this case? You cannot expect an OS to give special treatment to every new processor on the market with new quirks. I have discussed this problem at https://www.agner.org/forum/viewtopic.php?f=1&t=79
Unfortunately, ARM does the same in their big.LITTLE architecture.
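
The fragile detect-then-dispatch pattern at issue can be sketched like this (assumptions: __builtin_cpu_supports is the GCC/Clang builtin for x86 feature detection; the kernel names are illustrative). The feature test runs once, but nothing stops the thread from later migrating to a core where the answer would have been different:

```c
#include <stdio.h>

static void kernel_avx512(void)  { puts("AVX-512 path"); }
static void kernel_generic(void) { puts("generic path"); }

/* Returns a pointer to the kernel chosen at startup. On a hybrid CPU
   with per-core feature differences, the AVX-512 choice can become
   invalid after a migration to a core without AVX-512 units. */
void (*select_kernel(void))(void)
{
    if (__builtin_cpu_supports("avx512f"))
        return kernel_avx512;
    return kernel_generic;
}
```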

Regarding pointer aliasing:
I think it is the responsibility of the programmer to consider pointer aliasing problems and do a loop backwards if necessary. If you want the hardware to fix all kinds of bad programming, you will soon find yourself down a very deep rabbit hole.
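
The "loop backwards if necessary" policy is essentially what memmove implements; a minimal sketch (the uintptr_t comparison is used because C does not define relational comparison of pointers into different objects):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst, correct even when the ranges overlap:
   forward when dst precedes src, backward when dst follows src. */
void copy_overlap(char *dst, const char *src, size_t n)
{
    if ((uintptr_t)dst < (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)   /* forward copy is safe */
            dst[i] = src[i];
    } else if ((uintptr_t)dst > (uintptr_t)src) {
        for (size_t i = n; i-- > 0; )    /* backward copy is safe */
            dst[i] = src[i];
    }                                    /* dst == src: nothing to do */
}
```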

Terje Mathisen

unread,
Feb 3, 2023, 1:54:28 AM2/3/23
to
Generate correct code, so for max optimization give a warning, but
produce code that tests and branches to alternate implementations
depending upon the parameters, i.e. in your example both b[] and c[]
must be checked for overlap with a[]. The fallback can be fast, as in
memmove(), if zero or one input overlaps; then fall back on scalar for
the really bad cases?
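
One way to realize this test-and-branch idea for a[i] = b[i] + c[i] (a sketch; the function names are illustrative): the fast path uses restrict so the compiler may vectorize, and the overlap test selects it only when neither input aliases the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Conservative byte-range overlap test via integer addresses. */
static int overlaps(const double *p, const double *q, size_t n)
{
    uintptr_t ps = (uintptr_t)p, pe = ps + n * sizeof *p;
    uintptr_t qs = (uintptr_t)q, qe = qs + n * sizeof *q;
    return ps < qe && qs < pe;
}

/* restrict promises no aliasing, so the compiler is free to vectorize. */
static void add_noalias(double *restrict a, const double *restrict b,
                        const double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

void add_arrays(double *a, const double *b, const double *c, size_t n)
{
    if (!overlaps(a, b, n) && !overlaps(a, c, n))
        add_noalias(a, b, c, n);          /* vectorizable path */
    else
        for (size_t i = 0; i < n; i++)    /* scalar, defined element order */
            a[i] = b[i] + c[i];
}
```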

robf...@gmail.com

unread,
Feb 3, 2023, 3:45:42 AM2/3/23
to
Re: memory management,

I skimmed through the manual and was somewhat surprised by the memory
management section. It sounds almost somewhat like a segmented system.
Is it possible to get a little more detail? If I have got this right, there is a table(s)
of base block addresses and block lengths and a number of comparators are
used to map a virtual address to a physical one. Can an object span across
multiple blocks? Is the block table swapped on a context switch, or are there
entries for multiple contexts?

While many systems contain block address translations, they also typically
include TLB's and paged memory. That style of memory management seems
to have won-out.

Anton Ertl

unread,
Feb 3, 2023, 6:35:44 AM2/3/23
to
Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:
>> In recent time we have had the problem that CPU manufacturers combine
>> different cores on the same CPU, and want to migrate threads between
>> these cores. So they use the same SIMD lengths on all cores, even, in
>> the case of Intel, disabling AVX-512 on CPUs that do not have the
>> smaller cores, i.e., where all cores could do AVX-512 if it was not
>> disabled.
>Yes, this was a big mistake in my opinion. The first version of Intel's Alder Lake had access to different cores with and without AVX512 support.

What I read is that you could have a BIOS setting where you could
enable AVX-512 but you could only do that if you disabled the E-cores.
Intel soon provided BIOS updates that eliminated this option.

>Your program will crash if the software detects that it can use AVX512 and later jumps to another core without AVX512. So they had to disable the AVX512 completely, including the new half-precision floating point instructions. Did Intel expect the OS to move a thread back to a capable core in this case? You cannot expect an OS to give special treatment to every new processor on the market with new quirks.

One way to deal with this would be to not report the availability of
AVX-512 (and maybe even disable them) unless the OS tells the CPU in
through a model-specific register that it is prepared for
heterogeneous AVX-512 support.

An OS that is unaware of the issue would just work, without AVX-512,
as is currently the case with Alder Lake; but the option I outlined
would open the possibility for other approaches:

One way an OS could handle the issue is to only tell the CPU that it
should report AVX-512 for processes that are limited to P-cores (with
thread-affinity or somesuch).

Another way would be to always tell the CPU that, and, as you wrote,
to migrate the process back to a P-Core if it executes an AVX-512
instruction.

Intel was not very coordinated when designing Alder Lake. They had
two teams designing cores with different instruction sets, then did
not provide OS workarounds for that, then dealt with the resulting
fallout by first restricting AVX-512 severely, but for whatever reason
even that was not enough for them, and finally they disabled AVX-512
completely.

>Unfortunately, ARM does the same in their big.LITTLE architecture.

A little more coordinated, it seems: All their cores are limited to
128-bit SIMD in SVE, which however eliminates the USP of SVE:
scalability to wider SIMD units.

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer aliasing problems and do a loop backwards if necessary.

I think it's something the programming language design should take
care of. If the language has vectors as a value type, this eliminates
the problem. The Fortran array sub-language is not quite there, but
still pretty good.

Michael S

unread,
Feb 3, 2023, 7:41:59 AM2/3/23
to
SIMD registers are an architectural feature of Helium/MVE.
They overlap with scalar FP registers in a manner, different from
what is common in x86 world, but consistent with a way in which
ARMv7-M FPv5 double precision register overlap with single-precision.

If anything, actual execution width (1, 2 or 4 beats per architecture tick)
is more exposed to SW in Helium than in most other SIMD environments.

"The SIMD registers must not be an architectural feature" appears
to be a fashion of comp.arch newsgroup, but it's not a popular
view in wider world.

>
> The other option would be for Intel to implement AVX-512 in the small
> cores without 512-bit units, the same way that AMD has been using time
> and again: XMM=2x64bits on K8, YMM=2x128 bits on later heavy machinery
> and Zen1, and ZMM is 2x256 bits on Zen4.
>

Intel did it many times in the past. All SSEx implementations before Core had
64-bit floating-point execution units. They did it later, too.
Even Silvermont, a direct ancestor of Alder Lake E-cores, had a 64-bit FP multiplier.
Alder Lake was/is just a mess, as you correctly state in other post.
Although somewhat less messy than its predecessor Lakefield.

Michael S

unread,
Feb 3, 2023, 7:49:53 AM2/3/23
to
Yes, Arm Inc. manages their big.LITTLE combos well.
It was Samsung that did bad things like combining Cortex-A53
as LITTLE with their own Mongoose cores as big, despite the
mismatch in cache line width, which is certainly visible to
application software.
Mongoose is dead now, so happiness returned.

Scott Lurndal

unread,
Feb 3, 2023, 8:46:01 AM2/3/23
to
Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer aliasing problems and do a loop backwards if necessary. If you want the hardware to fix all kinds of bad programming, you will soon find yourself down a very deep rabbit hole.

Which is why memmove(3) was standardized decades ago.

Anton Ertl

unread,
Feb 3, 2023, 9:49:46 AM2/3/23
to
Michael S <already...@yahoo.com> writes:
>"The SIMD registers must not be an architectural feature" appears
>to be a fashion of comp.arch newsgroup, but it's not a popular
>view in wider world.

Yet? Certainly there have been complaints about SIMD width as
instruction set feature (as in SSE, AVX, AVX-512), and they have
resulted in SVE, and AFAIK the RISC-V vector extension. But with that
the SIMD width is still architectural, which prevents migration to
implementations with a different SIMD width; that's very clear in
Alder Lake, but is also a problem when migrating VMs to different
computers, e.g., migrating an SVE program from a Fujitsu supercomputer
to some ARMv9 server.

Keeping things that you want to be able to change as
microarchitectural feature rather than turning them into architectural
features has been used successfully in the past. If we could use it
for SIMD/vectors, that would be great. I am not sure how realistic
VVM is, but if it can be realized, the fact that the SIMD size is a
microarchitectural rather than architectural feature is a major
benefit as far as I am concerned (I am not a believer in
auto-vectorization, even if it happens in the hardware and avoids the
aliasing problems).

>Intel did it many times in the past. All SSEx implementations before Core had
>64-bit floating-point execution units. They did it later, too.
>Even Silvermont, a direct ancestor of Alder Lake E-cores, had 64-bit FP multiplier.
>Alder Lake was/is just a mess, as you correctly state in other post.

So the fact that Gracemont (and therefore Alder Lake) does not support
AVX-512 is probably a result of Intel trying to perform market
segmentation with instruction set extensions. If so, apparently Intel
marketing does not understand the market very well.

Anton Ertl

unread,
Feb 3, 2023, 10:25:41 AM2/3/23
to
Thomas Koenig <tko...@netcologne.de> writes:
>For Fortran, there is a related problem. While arguments cannot
>overlap, the language prescribes that the right-hand side of
>an assignment is evaluated before the actual assignment.
>This leads to generation of temporary arrays for code like
>
> a(n:m) = a(n-1:m-1) + a(n+1:m+1)
>
>or when overlap / non-overlap cannot be detected at compile time.

And what is the problem? How would you define it instead?

I think that Fortran defines this well. I think that it's even better
to have vectors as value types not just for intermediate results of
expressions, but also in variables, but what Fortran does here is a
good intermediate step.

Having the result in different memory than the source is a fine
solution (and, if all of a is overwritten, and a had a value type, it
could even be done without extra copying). But in the present case,
there is another solution (which might be found by your compiler):

if (n <= m) {
  int i;
  double aim1 = a[n-1] + a[n+1];
  for (i = n+1; i <= m; i++) {
    double ai = a[i-1] + a[i+1];
    a[i-1] = aim1;
    aim1 = ai;
  }
  a[i-1] = aim1;
}

You can unroll the loop by a factor of 2 and perform some renaming to
eliminate "aim1 = ai;". You can use SIMD instead of double, with the
same logic otherwise to make use of SIMD.
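
The unroll-by-2 with renaming might look like this sketch (the function name is illustrative; each new value is carried in a register for one iteration so the old neighbor is still live when it is read, and no "aim1 = ai;" copy survives in the steady state):

```c
/* In-place stencil a[i] = old_a[i-1] + old_a[i+1] for i in [n, m],
   unrolled by a factor of 2 with register renaming of the carry. */
void stencil2(double *a, int n, int m)
{
    if (n > m)
        return;
    int i = n + 1;
    double t0 = a[n-1] + a[n+1];          /* new a[n] */
    for (; i + 1 <= m; i += 2) {
        double t1 = a[i-1] + a[i+1];      /* new a[i]; old a[i-1] still live */
        a[i-1] = t0;                      /* commit new a[i-1] */
        t0 = a[i] + a[i+2];               /* new a[i+1]; old a[i] still live */
        a[i] = t1;                        /* commit new a[i] */
    }
    if (i <= m) {                         /* odd trip count: one leftover */
        double t1 = a[i-1] + a[i+1];
        a[i-1] = t0;
        t0 = t1;
        i++;
    }
    a[i-1] = t0;                          /* final commit */
}
```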

Your example looks similar to a convolution filter, but those usually
have more than one dimension, which makes the whole optimization
harder (e.g., for a 2D convolution filter you would have to store more
than one line in intermediate storage before copying it back to the
original location). Just having the result in a new location looks
much more attractive then.

BGB

unread,
Feb 3, 2023, 1:40:30 PM2/3/23
to
IMO:
This isn't a huge loss, as IME the general utility of SIMD drops off
quickly as one goes wider.

So, 4-element vectors, 128-bit vectors, ... can almost be considered as
magic numbers.

Likewise, if one does go to 8-element vectors, it is often "more useful"
to handle them more as "2x 4-elements" than as a homogeneous 8.


Main use-case for 256-bit would be 4x Binary64, but not as much else.
But, if one has a CPU where this has no advantage over 128-bit (because
the CPU is just faking it), well then, 128-bit remains on top.



And, ironically, the main alternate use case for SIMD is memory
copies, which are then "however wide the CPU can natively handle"
(multiplied by a factor related to pipeline latency).

So, for BJX2 in my current cores, this leads to copying 512-bits at a
time being the "most efficient" case for general memcpy. Though larger
blocks can still see an advantage by reducing loop overhead.

Copying 2048 or 4096 bits is enough to basically "hide" looping related
overheads, but is only really "useful" for bulk copying (say, one needs
to copy a buffer of 16K or more before it makes much difference).


For something like an LZ77 decoder, it may make sense to ignore
bulk-copy cases (the majority of matches being relatively short).

In this case, it makes more sense to try to be "most efficient" for
copying in the 4-16 byte range. Copies much over 256 bytes being
relatively uncommon.


Otherwise, I have skepticism about compiler autovectorization being able
to make effective use of wider SIMD types, when (as a programmer) it is
also difficult to make effective use of them.


>> Regarding pointer aliasing:
>> I think it is the responsibility of the programmer to consider pointer alia=
>> sing problems and do a loop backwards if necessary.
>
> I think it's something the programming language design should take
> care of. If the language has vectors as a value type, this eliminates
> the problem. The Fortran array sub-language is not quite there, but
> still pretty good.
>

Would be useful if C gave a little more here.


Ideally, I would like both keywords for aliasing (and unaligned
load/store, etc); along with maybe a way to specify vector types and
vector ops that worked across compilers and platforms, and whether or
not the target actually supports them natively.

Though, GCC's vectors can give at least one of these points.
* Works across targets.
But:
* Not really portable outside of GCC (and Clang)
* Still seems to depend on native SIMD support AFAIK.
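
The GCC/Clang vector extension mentioned above looks like this (a minimal sketch): operations are defined on the vector type itself, and the compiler lowers them to native SIMD where available, scalar code otherwise.

```c
/* Fixed-size vector type with element-wise operators (GCC/Clang
   __attribute__((vector_size)) extension). */
typedef double v4df __attribute__((vector_size(32)));  /* 4 x double */

/* Element-wise multiply-add, written once, lowered per target. */
v4df madd(v4df a, v4df b, v4df c)
{
    return a * b + c;
}
```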


For BGBCC, I had defined my own system, with partial compatibility with
GCC's notation. Operations are defined relative to the vector size/type,
rather than whether or not it maps to a native HW vector (though, with
the caveat that trying to use vectors when no SIMD is enabled, will
generate runtime calls...).

...

Thomas Koenig

unread,
Feb 3, 2023, 3:52:33 PM2/3/23
to
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:
>
>> ForwardCom can vectorize array loops in a new way that automatically
>> adjusts to the maximum vector length supported by the CPU. It works
>> in the following way:
>
>> A register used as loop counter is initialized to the array length. This
>> counter is decremented by the vector register length and repeats as
>> long as it is positive. The counter register also specifies the vector
>> register length. The vector registers used in the loop will have the
>> maximum length as long as the counter exceeds this value. There
>> is no loop tail because the vector length is automatically adjusted
>> in the last iteration of the loop to fit the remaining number of array
>> elements.
>
> This isn't completely new. The IBM System/370 at one point added a
> vector feature which was largely modelled on that of the Cray I. But
> unlike the Cray I, it was designed so that one model of the 370 might
> have vector registers with 64 elements, and another one might have
> vector registers with 256 elements.

A colleague used it on an IBM 3090. Compared to the Fujitsu VP
they also had at the computer center, it was as slow as molasses.

[...]

Agner Fog

unread,
Feb 4, 2023, 2:34:37 AM2/4/23
to
Re: memory management
The main purpose of the memory map in ForwardCom is not virtual address translation, but memory protection. Virtual address translation is not needed in simple well-behaved cases.

> Can an object span across multiple blocks?
Yes, but there would be no point in doing so because blocks can have arbitrary sizes. For example, if a data object spans two blocks of 3 kB and 4 kB, you can join them into one block of 7 kB. You only need to subdivide a data structure if different parts have different access permissions. A typical application has initialized static data, uninitialized static data, stack, and heap. These all have the same access permissions (read and write) so they can be joined into a single block large enough to serve all the needs of the running process for read/write data. A typical application program has three memory blocks with different access permissions: 1. execute only, 2. read only, 3. read and write. If there is a child thread with thread-private memory, then this thread will have a memory map with one extra entry for private data that is not accessible to parent and sibling threads. Thread-private memory is a ForwardCom feature.

ForwardCom has features to predict the maximum size of stack and heap and reserve the calculated size of memory when a program is loaded. If this prediction fails, then the memory becomes fragmented. Then you will need an extra entry in the memory map to access a vacant physical memory block that is not contiguous with the rest, and possibly use virtual address translation to make it appear to be contiguous.

ForwardCom is using various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel has up to 5 levels of page tables.

The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.

Thomas Koenig

unread,
Feb 4, 2023, 6:03:26 AM2/4/23
to
Agner Fog <ag...@dtu.dk> schrieb:

> ForwardCom has features to predict the maximum size of stack
> and heap and reserve the calculated size of memory when a program
> is loaded.

That sounds like an extremely hard thing to do. What do these
heuristics look like, and how would they estimate (for example)
the heap size of the program

program main
  real, dimension(:,:), allocatable :: a, b, c
  read (*,*) n
  allocate (a(n,n), b(n,n))
  read (*,*) a, b
  c = matmul(a,b)
  print *,c
end

? Do they rely on statistics gathered in earlier runs (sort
of profile-guided feedback, which is not very reliable), or are
they functions of the program itself? What about self-compiled
programs?

> If this prediction fails, then the memory becomes
> fragmented. Then you will need an extra entry in the memory map
> to access a vacant physical memory block that is not contiguous
> with the rest, and possibly use virtual address translation to
> make it appear to be contiguous.

> ForwardCom is using various features to avoid memory
> fragmentation, including an innovative system for function
> libraries. This should keep the number of entries in the memory
> map of each process or thread small enough to fit on the CPU chip
> without the need for TLB and multi-level page tables. A small memory
> map with few variable-size entries is implemented most efficiently
> with comparators. A large memory map for a highly fragmented memory
> cannot use this method, but must use fixed-size pages and large
> page tables. Intel has up to 5 levels of page tables.

It is also a question of the size of the page tables...

Do you envision having page tables as a backup, or will your method
converge to something equivalent to page tables when the application
manages to beat all of your heuristics?

And is the current performance of page tables expected to be better
or worse than what you propose, when your heuristics don't work well?

Really, how well your heuristics work is a central question. I guess
this is not possible without extensive profiling of typical workloads,
where the question of what "typical" is is always problematic.

Very probably, profiling existing applications on other processors
can help a lot there.

> The memory map needs to be saved and restored on context
> switches. It will typically be smaller than the register file
> so that it can be saved efficiently. The CPU may or may not have
> multiple memory maps on-chip to switch between.

So, if the memory map exceeds the (probably) fixed number of
entries that are on the chip, what happens?

Thomas Koenig

unread,
Feb 4, 2023, 7:18:02 AM2/4/23
to
MitchAlsup <Mitch...@aol.com> schrieb:

> Also note: The C (or Fortran) source has manually unrolled this loop 4
> times. This just multiplies the number of instructions in the loop by
> 4 while leaving a single LOOP instruction for iteration control.

Is such unrolling actually (likely to be) helpful in a VVM
implementation for speed?

Agner Fog

unread,
Feb 4, 2023, 7:20:41 AM2/4/23
to
Thomas Koenig wrote:
> how would they estimate (for example) heap size of the program?
>Do they rely on statistics gathered in earlier runs

The maximum stack size is calculated by the compiler and the linker. In case of deeply recursive functions, the programmer may specify an estimated maximum recursion level.

And yes, the maximum heap size relies on statistics gathered in earlier runs. The programmer may add a suggested value.

If a heap grows beyond the estimated maximum, then the OS may add a large extra block of memory, perhaps the double of the previous size. Assume, for example, that a program is using 4 memory map entries when it is loaded, and the hardware has space for 256 entries, then the heap can grow to 2^252 times the initial size before we run out of memory map entries. This will never happen even if we use disk space as virtual memory.

There can be other causes of filling the memory map, though. The most likely is demand-paged memory. The use of demand-paged memory is strongly discouraged in ForwardCom. Another possibility is starting and ending so many processes that the entire memory space gradually becomes totally fragmented.

> So, if the memory map exceeds the (probably) fixed number of entries that are on the chip, what happens?
The best solution will probably be to start a garbage collector that can suspend running processes for a short time while rearranging the fragmented memory it is using. A fallback to fixed-size page tables is possible if the hardware supports it, but that is the expensive solution we are trying to avoid.

The performance problem of garbage collection exists in current systems as well. The programmer of a performance-critical memory-hungry program will take care to avoid garbage collection. This can be done by allocating a large memory block when starting, and by recycling allocated memory. Another solution is to start a new thread when a large block of memory is needed, for example in a video editing program. The main thread will not need an extra memory map entry if the new memory block is private to the new thread.

A memory map with space for, for example, 256 entries will require at most 512 comparators. This will be much more efficient than a traditional TLB with millions of pages in multiple levels, and we will have no TLB misses.
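
A comparator-based lookup of this kind can be modelled roughly as follows (a sketch only; the struct and field names are illustrative, not from the ForwardCom specification). Hardware would check all entries in parallel, which is where the two-comparators-per-entry count comes from; here the entries are checked sequentially.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base;    /* first virtual address of block */
    uint64_t limit;   /* last virtual address of block */
    uint64_t phys;    /* physical address of block start */
    unsigned perm;    /* permission bits, e.g. R/W/X */
} map_entry;

/* Returns true and sets *pa if va falls in some block that grants the
   required permission; each entry needs two range comparisons, i.e.
   at most 512 comparators for 256 entries. Translation is arithmetic
   relocation: physical base plus offset within the block. */
bool map_lookup(const map_entry *map, int n, uint64_t va,
                unsigned need, uint64_t *pa)
{
    for (int i = 0; i < n; i++) {
        if (va >= map[i].base && va <= map[i].limit &&
            (map[i].perm & need) == need) {
            *pa = map[i].phys + (va - map[i].base);
            return true;
        }
    }
    return false;  /* no match: protection fault */
}
```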



Quadibloc

unread,
Feb 4, 2023, 8:03:57 AM2/4/23
to
It will be interesting to hear Mitch's answer to this.

I would be (initially) inclined to say yes. Why?

Mitch's VVM instructions _look_ like ordinary loop instructions. But
the processor treats them as vector instructions instead.

To me, that implies they *take the place* of Cray-like vector
instructions, removing complexity from the instruction set.
That would mean that they have _less_ overhead than a regular
loop, but not that they have _zero_ overhead. And zero
overhead is what an unrolled loop has, right?

And that's where I see why I'm wrong. Compared to an ideal
vector instruction, an unrolled loop certainly does have
"overhead" - all those additional times you fetch and decode
the instructions for the later copies of the loop. (The execute,
of course, you have to do anyways.)

But as long as even a loop with VVM has some nonzero
amount of overhead, it's certainly possible that unrolling it
a very _small_ number of times might still be an optimization.
However, that wouldn't be true if the small number was less
than 2, and given Mitch's description of how he intends
VVM to be implemented, it does seem as though
this would be quite possible.

John Savard

EricP

unread,
Feb 4, 2023, 11:19:22 AM2/4/23
to
Agner Fog wrote:
> Re: memory management
>
> ForwardCom is using various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel has up to 5 levels of page tables.
>
> The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.

A hardware table of comparators is more than double the cost per entry
of a TLB binary CAM. A binary CAM equality-compare entry is a bunch of
XORs and an AND gate. A binary range comparator (low <= addr <= upr) entry is
two bunches of XOR's plus a priority selector (find-last-set),
so double the wire loading on the compare value bus, and more than
double the power consumption per entry, plus longer propagation delay.

Real TLBs for radix page tables can have multiple CAMs to cache
table interior nodes, allowing a reverse page table walk to speed
TLB miss resolution, which adds to the whole TLB cost.

Also each EXE or DLL has its own set of virtual memory section
requirements, plus all the DLLs referenced by other DLLs,
and any of these virtual sections may be relocated when created.
A single instance of a running program can have dozens to hundreds
of memory sections, each with its range of addresses and attributes.

A HW range table would have a limited number of entries so only a
subset of a program's memory section entries can be loaded at once.
How would a range table miss work?

How does virtual to physical relocation work for this range table?
A radix page table TLB does relocation by substitution so there is
zero cost beyond the TLB lookup for a table hit.
A range table can't use substitution and must use arithmetic relocation,
which costs an N-bit addition propagation delay above the table lookup.
For a 64-bit address this could add a clock to every memory access.

How would paging work for the range table? Hardware needs to track which
pages in each memory section are present, or trigger a page fault if not.
Where are these page Present bits stored for each memory section,
or is this going to use whole-segment faulting/swapping?


Agner Fog

unread,
Feb 4, 2023, 11:39:16 AM2/4/23
to
4. februar 2023 kl. 17.19.22 UTC+1 EricP wrote:
> Agner Fog wrote:
> > Re: memory management
> A hardware table of comparators is more than double the cost per-entry
> as a TLB binary CAM.

Yes, but there are fewer entries.

> Real TLBs for radix page tables can have multiple CAMs to cache
> table interior nodes, allowing a reverse page table walk to speed
> TLB miss resolution, which adds to the whole TLB cost.

No page tables = no table walk

> Also each EXE or DLL has its own set of virtual memory section
> requirements, plus all the DLLs referenced by other DLLs,
> and any of these virtual sections may be relocated when created.
> A single instance of a running program can have dozens to hundreds
> of memory sections, each with its range of addresses and attributes.

Yes, exactly. ForwardCom has no DLLs, but a new library system with a relink feature. This removes the biggest source of memory fragmentation and avoids a lot of waste of partially used pages. See https://www.forwardcom.info/forwardcom_libraries.php

> How would paging work for the range table? Hardware needs to track which
> pages in each memory section are present, or trigger a page fault if not.

Usually, all memory used by a process will be loaded when the process is loaded. If a memory block is too big for loading it all at once, it needs one entry in the memory map for the present part and one entry for the absent part.

MitchAlsup

unread,
Feb 4, 2023, 12:53:09 PM2/4/23