Introducing ForwardCom: An open ISA with variable-length vector registers


Agner Fog
Jan 30, 2023, 6:03:36 AM

I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.

ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:

A register used as a loop counter is initialized to the array length. The counter is decremented by the vector register length on each iteration, and the loop repeats as long as the counter is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail, because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.

A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
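To illustrate, here is the loop mechanism in plain C. This is a sketch of the semantics, not ForwardCom code: MAXLEN is a placeholder for the hardware maximum vector length, and the inner scalar loop stands in for a single vector instruction.

    #include <stdint.h>

    #define MAXLEN 64   /* example: 64-byte maximum vector length (placeholder) */

    /* Sketch of the ForwardCom array loop: 'remaining' plays the role of the
       down-counting register, and the operand addresses are computed as
       'end of array minus counter', as described above. */
    void add_two(float *a, const float *b, int64_t n)
    {
        int64_t total = n * (int64_t)sizeof(float);   /* counter = array length in bytes */
        for (int64_t remaining = total; remaining > 0; ) {
            int64_t len = remaining < MAXLEN ? remaining : MAXLEN; /* vector length this iteration */
            const float *bp = (const float *)((const char *)b + total - remaining);
            float *ap       = (float *)((char *)a + total - remaining);
            for (int64_t j = 0; j < (int64_t)(len / sizeof(float)); j++)
                ap[j] = bp[j] + 2.0f;    /* stands in for one vector operation */
            remaining -= len;            /* decrement counter by the vector length used */
        }
    }

In the last iteration, len shrinks to the leftover byte count, so no scalar tail loop is needed.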

This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.

There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.

It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.

ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.
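As a toy illustration of that decoding rule in C (the field position, bits 30-31 of the first word, is my assumption for illustration; the real encoding is in the ForwardCom manual):

    #include <stdint.h>

    /* Toy decoder: derive the instruction length in 32-bit words from a
       two-bit IL field, assumed here to sit in bits 30-31 of the first word. */
    static int insn_words(uint32_t first_word)
    {
        unsigned il = first_word >> 30;   /* two most significant bits */
        return il <= 1 ? 1 : (int)il;     /* 0,1 -> 1 word; 2 -> 2; 3 -> 3 */
    }

Because the length is available after inspecting two bits of the first word, a decoder can find instruction boundaries without parsing the rest of the instruction, which keeps decoding cheap despite the variable length.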

The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.

All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.

The flexible instruction format allows the CPU to do more work per instruction. Various option bits add extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit. Multiplication, division, and floating point arithmetic have longer latencies.

The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
for (int r0 = 0; r0 < 100; r0++) { }

There are many other features, including strong security, efficient memory management, an efficient library system, efficient global register allocation, and more.

See www.forwardcom.info for all the details.

The current status of the ForwardCom project is:

* Instruction set and ABI standards are defined in detail.
* The following binary tools have been developed: high-level assembler, disassembler, linker, library manager, emulator, and debugger. A compiler is not available yet.
* A hardware implementation in an FPGA soft core is available with full documentation. The current version supports integer instructions only.
* Everything is free and open source.

MitchAlsup
Jan 30, 2023, 2:07:31 PM

On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
> I have developed ForwardCom with help and suggestions from many people. This is a project to improve not only the ISA, but the entire ecosystem of software, ABI standard, development tools, etc.
>
> ForwardCom can vectorize array loops in a new way that automatically adjusts to the maximum vector length supported by the CPU. It works in the following way:
>
> A register used as a loop counter is initialized to the array length. The counter is decremented by the vector register length on each iteration, and the loop repeats as long as the counter is positive. The counter register also specifies the vector register length. The vector registers used in the loop will have the maximum length as long as the counter exceeds this value. There is no loop tail, because the vector length is automatically adjusted in the last iteration of the loop to fit the remaining number of array elements.
>
> A special addressing mode can load and store vector registers at an address calculated as the end of the array minus the loop counter.
>
> This array loop mechanism makes the same software code run optimally on different CPUs with different maximum vector lengths. This is what I call forward compatibility: You don't have to update or recompile the software when you get access to a new CPU with larger vectors.
>
> There is no global vector length register. You can have different vectors with different lengths at the same time. The length is stored in the vector register itself. When you save and restore a vector register, it will only save the part of the register that is actually used.
>
> It is straightforward to call mathematical functions inside a vector loop. A library function takes vector registers as inputs and outputs. The same function can handle scalars and vectors of any length because the length is stored in each register.
>
> ForwardCom is neither RISC nor CISC. It is a combination that gets the best of both worlds. There are few instructions, but many variants of each instruction. The instruction length can be one, two, or three 32-bit words. Decoding is still efficient because the length is specified by just two bits in the first code word.
<
My 66000 is mostly RISC with a few of the better parts of CISC thrown in. I also
think that something between RISC and CISC has potential for being better than
either. And in particular, Brian's LLVM compiler produces My 66000 ASM for
benchmarks (with available source code) that averages 75%-79% of the instruction
count of the RISC-V LLVM compiler (with the same optimization settings).
>
> The instruction set is orthogonal: Operands can be general purpose registers, vector registers, immediate constants, or memory operands with different addressing modes. The same instruction can handle integers of different sizes and floating point numbers of different precisions.
<
When I designed My 66000 ISA I developed the instruction set so that every
instruction could have 1 immediate, or 1 displacement, or both; and that
the immediate could be in either operand position. This enables my ISA to
encode things like::
<
SLL R7,#1,R9
STD #3.1415926535897932,[Rbase+Rindex<<scale+Displacement]
FMAC R9,R7,#3.2768D5,R9
FMAC R10,R6,R5,#1.7D0
<
as single instructions. There is no way to deliver an operand into a calculation
that is of lower power than a constant.
>
> All instructions are coded according to a flexible and consistent template system. You can use a short single-word form of an instruction if the operands are simple, or a longer instruction format if you need extra registers, large immediate constants, large memory addresses, complex addressing modes, or extra option bits.
<
Same.
>
> The flexible instruction format allows the CPU to do more work per instruction.
<
Agreed.
<
> Various option bits add extra functionality to many instructions, but only as long as it fits into a smooth pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock per execution unit.
<
I worked in sign control, too; so::
<
ADD R7,R9,-R14 // is the subtract version
ADD R7,-R9,R14 // is the reverse subtract version
<
> Multiplication, division, and floating point arithmetic have longer latencies.
<
Of course they do.

MitchAlsup
Jan 30, 2023, 6:06:57 PM

On Monday, January 30, 2023 at 5:03:36 AM UTC-6, Agner Fog wrote:
>
> The ForwardCom assembly language is simple and immediately intelligible. Adding two integers is as simple as: int r1 = r2 + r3. Branches and loops are written just like in C or Java, for example:
> for (int r0 = 0; r0 < 100; r0++) { }
>
This reminds me of CDC 6600 assembly, where essentially all calculations are written like:
set x7 = x3 + x2
<
How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
And what code would be produced ??
<

Agner Fog
Jan 31, 2023, 2:00:46 AM

On January 31, 2023 at 00:06:57 UTC+1, MitchAlsup wrote:
> How would one write:: uint64_t R7 = (int32_t)R6 + (uint8_t)R19; ??
> And what code would be produced ??

The ForwardCom instruction format has one field specifying operand type and precision, so you cannot mix operand sizes in a single instruction. Unused parts of a register are always set to zero, so a 32-bit operation will zero-extend the result to 64 bits in a 64-bit register.

int32 r7 = r6 + r19

will do a 32-bit addition and zero-extend the result to 64 bits in r7. This will work if r19 is known to contain an 8-bit unsigned integer. If bits 8-31 of r19 are not known to be zero, then you have to cut them off first:

int8 r19 = r19

There is no implicit sign extension. If you want to sign-extend an integer to a larger size, you must use an extra instruction to do so.
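Combining the two lines above, the semantics can be rendered in C (a sketch of what the register contents do, not ForwardCom code):

    #include <stdint.h>

    /* Models:  int8 r19 = r19;  int32 r7 = r6 + r19  (zero-extension throughout) */
    uint64_t add_sketch(uint64_t r6, uint64_t r19)
    {
        r19 = (uint8_t)r19;                                      /* clear bits 8-63 of r19 */
        uint64_t r7 = (uint32_t)((uint32_t)r6 + (uint32_t)r19);  /* 32-bit add, zero-extended */
        /* Mitch's (int32_t) cast implies sign-extending the 32-bit sum to 64
           bits, which per the above takes one more explicit instruction. */
        return r7;
    }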

Thomas Koenig
Feb 1, 2023, 9:03:45 AM

Agner Fog <ag...@dtu.dk> wrote:

> The ForwardCom assembly language is simple and immediately
> intelligible. Adding two integers is as simple as: int r1 =
> r2 + r3. Branches and loops are written just like in C or Java,
> for example:

> for (int r0 = 0; r0 < 100; r0++) { }

Quite interesting, thanks a lot for sharing!

One question: What would be the best way to handle loop-carried
dependencies (let's say a memmove, where operands can overlap,
or C's

void add (const int *a, int *b, int *c, int n)
{
    for (int i=0; i<n; i++)
        a[i] = b[i] + c[i];
}

in your ISA?

Quadibloc
Feb 1, 2023, 9:48:06 AM

On Monday, January 30, 2023 at 12:07:31 PM UTC-7, MitchAlsup wrote:

> My 66000 is mostly RISC with a few of the better parts of CISC thrown in.

The only thing I'm not happy about in your design is the fact that
some actions, treated as single instructions by the microarchitecture
for efficiency, are specified as multiple instructions in the code.

To me, this appears to massively complicate the decoding of
instructions, making the architecture more difficult to implement.

Of course, though, it doesn't make the processor less efficient,
since instruction decoding can be done well ahead of execution,
so this need not have any impact on latencies.

John Savard

Terje Mathisen
Feb 1, 2023, 11:26:50 AM

Does that make sense? Adding a 'const' to the target array?

I could see how noalias would be useful for the compiler but what is
const supposed to do here?

BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
loop, effectively something like:

while (i < n) {
    i += vector_add(a, b, c, n-i);
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig
Feb 1, 2023, 12:19:32 PM

Terje Mathisen <terje.m...@tmsw.no> wrote:
> Thomas Koenig wrote:
>> Agner Fog <ag...@dtu.dk> wrote:
>>
>>> The ForwardCom assembly language is simple and immediately
>>> intelligible. Adding two integers is as simple as: int r1 =
>>> r2 + r3. Branches and loops are written just like in C or Java,
>>> for example:
>>
>>> for (int r0 = 0; r0 < 100; r0++) { }
>>
>> Quite interesting, thanks a lot for sharing!
>>
>> One question: What would be the best way to handle loop-carried
>> dependencies (let's say a memmove, where operands can overlap,
>> or C's
>>
>> void add (const int *a, int *b, int *c, int n)
>> {
>>     for (int i=0; i<n; i++)
>>         a[i] = b[i] + c[i];
>> }
>
> Does that make sense? Adding a 'const' to the target array?

Not at all, I had not thought that through, const should be on
the other two arguments.

> I could see how noalias would be useful for the compiler but what is
> const supposed to do here?

> BTW, I would expect ForwardCom to compile this as a size-agnostic SIMD
> loop, effectively something like:
>
> while (i < n) {
>     i += vector_add(a, b, c, n-i);
> }

Hm.

Let's assume C semantics without noalias or restrict, and assume that
the function (without the nonsensical const) is called as

int *arr;
add (arr+1, arr, arr+2, n);

Then, the store to a[0] would overwrite arr[1], and in the next
iteration b[1] would be referenced, which would be arr[1], so the
store to a[0] would have to be carried over to the next iteration
of the loop. Classic aliasing (and I made the case especially
obnoxious because not even loop reversal would help).

Mitch's My66000 would detect the interdependency at runtime and
drop down to scalar mode.

For Fortran, there is a related problem. While arguments cannot
overlap, the language proscribes that the right-hand side of
an assignment is evaluated before the actual assignment.
This leads to generation of temporary arrays for code like

a(n:m) = a(n-1:m-1) + a(n+1:m+1)

or when overlap / non-overlap cannot be detected at compile time.

My question would be: Would this kind of thing be left to the
programmer (or the language definition, or the compiler) or are
there special provisions in the ISA to deal with this kind of
thing one way or another?

Mitch's VVM drops down to scalar if an interdependency is found,
via a hardware check.

MitchAlsup
Feb 1, 2023, 2:05:09 PM

On Wednesday, February 1, 2023 at 8:03:45 AM UTC-6, Thomas Koenig wrote:
> Agner Fog <ag...@dtu.dk> wrote:
> > The ForwardCom assembly language is simple and immediately
> > intelligible. Adding two integers is as simple as: int r1 =
> > r2 + r3. Branches and loops are written just like in C or Java,
> > for example:
>
> > for (int r0 = 0; r0 < 100; r0++) { }
> Quite interesting, thanks a lot for sharing!
>
> One question: What would be the best way to handle loop-carried
> dependencies (let's say a memmove, where operands can overlap,
> or C's
<
MM Rto,Rfrom,Rcnt
>
> void add (const int *a, int *b, int *c, int n)
> {
>     for (int i=0; i<n; i++)
>         a[i] = b[i] + c[i];
> }
<
I fail to see a loop carried dependence. Nothing loaded or
calculated in one iteration is used in a subsequent iteration. Do I
have a bad interpretation of "loop carried"?
>
> in your ISA?

Thomas Koenig
Feb 1, 2023, 2:20:41 PM

MitchAlsup <Mitch...@aol.com> wrote:
It's not obvious, but (see my reply to Terje) the pointers can
point to overlapping parts of an array. In this case, a non-
obvious loop carried dependence can be introduced (note the
lack of restrict in the argument list), apart from the
const which is wrong.

Tim Rentsch
Feb 1, 2023, 2:39:13 PM

Thomas Koenig <tko...@netcologne.de> writes:

[...]

> For Fortran, there is a related problem. While arguments cannot
> overlap, the language proscribes that the right-hand side of
> an assignment is evaluated before the actual assignment.

I expect you mean prescribes, not proscribes.

MitchAlsup
Feb 1, 2023, 3:25:25 PM

I can envision an implementation where there are no loop
carried dependences.
>
But if your point was to illustrate that C has bad semantics
wrt array aliasing, I have to agree. The Fortran version has
no loop carried dependence.

Quadibloc
Feb 1, 2023, 4:19:58 PM

On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:

> ForwardCom can vectorize array loops in a new way that automatically
> adjusts to the maximum vector length supported by the CPU. It works
> in the following way:

> A register used as a loop counter is initialized to the array length.
> The counter is decremented by the vector register length on each
> iteration, and the loop repeats as long as the counter is positive.
> The counter register also specifies the vector register length. The
> vector registers used in the loop will have the maximum length as
> long as the counter exceeds this value. There is no loop tail,
> because the vector length is automatically adjusted in the last
> iteration of the loop to fit the remaining number of array elements.

This isn't completely new. The IBM System/370 at one point added a
vector feature which was largely modelled on that of the Cray I. But
unlike the Cray I, it was designed so that one model of the 370 might
have vector registers with 64 elements, and another one might have
vector registers with 256 elements.

To allow code to be written that was independent of the hardware
vector size, a 16-bit field in the vector control register was used as
the loop counter.

This is not the same as the vector feature recently added to the
current IBM z/Architecture mainframes, which is simply comparable
to AVX on x86 microprocessors.

John Savard

Thomas Koenig
Feb 1, 2023, 5:04:46 PM

The point I was trying to make (but which probably didn't come
across very well) is that loop-carried dependencies which are
allowed by the language can carry large performance penalties,
and that they impede any sort of vectorization - with wrong code
or with performance degradation, or both.

And Fortran has its own issues in that respect, not with arguments,
but with array expressions. Conceptually, everything on the
right-hand side of an assignment is evaluated before assignment
to the left-hand side. This is no problem if there are no
dependencies between the two sides. If there are simple
dependencies, like

a = sin(a)

that is not a problem, because the compiler can easily prove
that this can be scalarized into a simple loop. For
something like

a (1:n) = a(0:n-1)

(array slice notation, where after the assignment a(1) has the
value that a(0) had previously, a(2) previously a(1), etc.) this
is conceptually identical to

tmp(1:n) = a(0:n-1)
a(1:n) = tmp

but the compiler cannot use a forward loop. It could either
actually create a temporary, or reverse the loop.

And in doubt, the compiler has to err on the side of correct code,
rather than speed.
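In C terms (0-based indexing), the two correct scalarizations of a(1:n) = a(0:n-1) look like this sketch:

    /* Option 1: materialize the right-hand side in a temporary first.
       'a' must have room for n+1 elements; 'tmp' for n. */
    void shift_with_temp(double *a, double *tmp, int n)
    {
        for (int i = 0; i < n; i++) tmp[i] = a[i];
        for (int i = 0; i < n; i++) a[i+1] = tmp[i];
    }

    /* Option 2: reverse the loop so each element is read before it is
       overwritten; no temporary needed. */
    void shift_reversed(double *a, int n)
    {
        for (int i = n; i >= 1; i--) a[i] = a[i-1];
    }

A naive forward loop (a[i] = a[i-1] for i = 1..n) would replicate a[0] across the whole slice, which is exactly the wrong-code hazard in question.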

Terje Mathisen
Feb 1, 2023, 5:25:03 PM

IMHO, this is exactly like the IBM RLL-decoding using intentionally
overlapping moves (MVC?): You really, really should not do it, and if
you do, then you deserve whatever ills befall you.

MitchAlsup
Feb 1, 2023, 6:40:20 PM

On Wednesday, February 1, 2023 at 4:04:46 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
>
> > But if your point was to illustrate that C has bad semantics
> > wrt array aliasing, I have to agree. The Fortran version has
> > no loop carried dependence.
<
> The point I was trying to make (but which probably didn't come
> across very well) is that loop-carried dependencies which are
> allowed by the language can carry large performance penalties,
> and that they impede any sort of vectorization - with wrong code
> or with performance degradation, or both.
<
A subtle point you may have missed with VVM is that one CAN
vectorize such loops, and the ones with such a dependency simply
run slower (at the speed the dependency resolves) rather than at the
speed the pipeline is capable of running if the dependence were
not present. In both cases, the correct results are obtained.
<
Loop carried dependencies do not prevent VVM vectorization.

Thomas Koenig
Feb 2, 2023, 1:30:43 AM

Terje Mathisen <terje.m...@tmsw.no> wrote:
I concur. The question is who is "you" in your sentence above...

Compiler writers have to get it right, according to the language
specification, so wrong code is not an option. If the language
specification results in pessimized code to cater for a case that
hardly anybody uses intentionally, well... what is a compiler
writer to do?

Thomas Koenig
Feb 2, 2023, 1:45:30 AM

MitchAlsup <Mitch...@aol.com> wrote:
I know that, but didn't explicitly mention it :-)
><
> Loop carried dependencies do not prevent VVM vectorization.

The _really_ good thing is that it only happens when detected at
runtime, so there is no performance penalty when it is not needed.

But if loop carried dependencies occur, they will have a large
performance impact.

Also, VVM faithfully reproduces a specific ordering of the loop,
the one that it was written in. This is obviously ideal for
languages which write out explicit loops, like C, but does not
offer improvement for languages (like Fortran :-) which operate
more on vectors or arrays.

Consider the Fortran statement

a(n:m) = a(k:k+m-n) + 2.0

which can be efficiently implemented with either a forward or a backward
loop, depending on the sign of n - k. The compiler has two choices:
Either copy the intermediate result to a temporary, or perform a
run-time test and switch the loop ordering, depending on the result.

Depending on the length of the loop, there is then always the question
of which is faster, because the run-time test adds to the overhead of
the loop startup.

This is one case where a genuine vector ISA could offer advantages,
it could do the loop reversal check in hardware, and fast. I simply
don't know if ForwardCom can do that, or not.
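A sketch of the run-time test in C (the function name and signature are illustrative, not from any of the ISAs under discussion): pick the safe loop direction for a(n:m) = a(k:k+m-n) + 2.0.

    /* Illustrative only: choose the loop direction at run time. With
       0-based C indexing, dest is a[n..m], source is a[k..k+m-n]. */
    void shifted_add(double *a, long n, long k, long m)
    {
        long len = m - n + 1;
        if (k >= n) {                      /* source at or after dest: forward is safe */
            for (long i = 0; i < len; i++)
                a[n + i] = a[k + i] + 2.0;
        } else {                           /* source before dest: go backwards */
            for (long i = len - 1; i >= 0; i--)
                a[n + i] = a[k + i] + 2.0;
        }
    }

The single comparison k >= n is the whole test, but as noted it still adds to the loop startup overhead, which matters for short loops.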

Agner Fog
Feb 2, 2023, 1:57:11 AM

Thomas Koenig wrote:
> What would be the best way to handle loop-carried dependencies (let's say a memmove, where operands can overlap)?

The memmove function allows overlap, unlike memcpy. The memmove library function uses a backwards loop when necessary. The ForwardCom library function will do the same, of course. A backwards loop is not as elegant as a forward loop, but still possible.

An optimizing compiler will not vectorize a loop automatically unless it can prove that there is no aliasing problem. The programmer may help by explicit vectorization, or a __restrict keyword, or a no-aliasing compiler option. ForwardCom does not differ from other SIMD processors in this respect.


Quadibloc wrote:
> some actions, treated as single instructions by the microarchitecture for efficiency, are specified as multiple instructions in the code.

The assembler understands the high-level construct:
for (int r1=0; r1<n; r1++) { }
The loop prolog (r1=0) is a single instruction (or two instructions if n is a register and it is not certain that n > 0).
The loop epilog (r1++, jump if r1<n) is a single instruction. It is the high-level assembler that does the translation. The microprocessor sees the epilog as a single instruction and treats it like any other combined ALU-branch instruction, so it has no problem decoding this instruction.
The compiler is free to pass the 'for' loop as a high-level expression to the assembler or translate it to low-level instruction mnemonics.

Thomas Koenig wrote:
> This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.

It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big, with branches, function calls, nested loops, etc.? My66000 cannot do this either. It can only handle small loops.

Agner Fog
Feb 2, 2023, 2:02:04 AM

> Thomas Koenig wrote:
> This is one case where a genuine vector ISA could offer advantages, it could do the loop reversal check in hardware, and fast. I simply don't know if ForwardCom can do that, or not.
> It cannot. ForwardCom treats the code as it is written. A loop reversal check in hardware would require that the entire loop is decoded and stored in a micro-op cache and all addresses calculated first. What if the loop is big with branches and function calls and nested loops etc. My66000 cannot do this either. It can only handle small loops.

What I meant was, an out-of-order My66000 might do the equivalent of loop reversal in small, simple cases, but not for big or complicated loops.

Thomas Koenig
Feb 2, 2023, 12:26:04 PM

Thomas Koenig <tko...@netcologne.de> wrote:

> Also, VVM faithfully reproduces a specific ordering of the loop,
> the one that it was written in. This is obviously ideal for
> languages which write out explicit loops, like C, but does not
> offer improvement for languages (like Fortran :-) which operate
> more on vectors or arrays.

Actually, that should read

"does not offer additional improvement". Having VVM for Fortran would
already speed up a lot of code.

Stephen Fuld
Feb 2, 2023, 12:56:55 PM

I get what you are saying, but I am not sure it is true. Specifically,
the CPU can do only one add at a time (due to the dependency), so
it doesn't offer the advantage of operating on multiple elements
simultaneously, as it does on loops without loop-carried dependencies.

But it may be, and Mitch is the expert here, that the other advantages
of VVM, specifically the large "prefill" buffers (I'm sorry, I forget
what they are officially called), offer an advantage over pure scalar code.

BTW, whether you call what VVM does for this loop "vectorizing" is, I
think, a matter of one's definition of vectorizing, specifically whether
the definition must include operating on more than one element
simultaneously or not.




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup
Feb 2, 2023, 3:52:27 PM

On Thursday, February 2, 2023 at 11:56:55 AM UTC-6, Stephen Fuld wrote:
> On 2/2/2023 9:26 AM, Thomas Koenig wrote:
> > Thomas Koenig <tko...@netcologne.de> wrote:
> >
> >> Also, VVM faithfully reproduces a specific ordering of the loop,
> >> the one that it was written in. This is obviously ideal for
> >> languages which write out explicit loops, like C, but does not
> >> offer improvement for languages (like Fortran :-) which operate
> >> more on vectors or arrays.
> >
> > Actually, that should read
> >
> > "does not offer additional improvement". Having VVM for Fortran would
> > already speed up a lot of code.
<
> I get what you are saying, but I am not sure it is true. Specifically,
> the CPU can do only one add at a time (due to the dependency),
> so it doesn't offer the advantage of operating on multiple elements
> simultaneously, as it does on loops without loop-carried dependencies.
<
If we are talking about the following loop (snatched from above)::
<
void add (const int *a, int *b, int *c, int n)
{
    for (int i=0; i<n; i++)
        a[i] = b[i] + c[i];
}
<
The only thing preventing performing 2-4-8 adds, at the same time,
is whether (or not) a[] actually aliases b[] or c[]. If/when there is no
aliasing, this can be performed as wide as the implementation has
resources. If aliasing is actually taking place then one is limited by
the actual data-flow.
<
Note: That in the case above::
<
add( a, a, @a[37]);
<
does not perform any ACTUAL aliasing up to execution widths of
32 per cycle (8 per cycle FP due to latency).
>
> But it may be, and Mitch is the expert here, that the other advantages
> of VVM, specifically, the large "prefill" buffers (I'm sorry, I forget
> what they are officially called) offer an advantage over pure scaler code.
<
Consider that a processor is comprised of many function units, each
having a characteristic latency:: integer is 1 cycle, LDs are 3, STs are 1
cycle, FP is 4 cycles. Now consider that there is 1 FU of each kind and
that the LOOP instruction executes in 1 cycle. Finally consider that the
underlying machine is 1-wide and in-order (classic 1st generation RISC).
Cache width is 128 bits--not 1st generation RISC, but suitable for
misaligned DoubleWord access at full speed.
>
Here, loops are bound by the number of times a FU is used within the
loop. DAXPY has 2 LDs, 1 ST, and 1 FMAC per iteration. LDs and STs
are performed 128 bits per access, or 2 iterations per access.
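(For reference, the DAXPY kernel under discussion, in C:)

    /* DAXPY: y = a*x + y. Per iteration: 2 loads (x[i], y[i]),
       1 FMAC, 1 store. */
    void daxpy(long n, double a, const double *x, double *y)
    {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }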
<
On my 1-wide in-order machine, we are reading 128 bits per access for the
2 LDs, and writing 128 bits for the ST, every 3 (2*) cycles, so with a
single FMAC unit we can perform 2 LDs, 1 FMAC, 1 ST and the equivalent
of 1×(ADD-CMP-BLT) every 3 cycles (7 instructions in 3 cycles
= 2+1/3 IPC). 2.3 IPC is not bad for a 1-wide machine, being about
3.5× the performance of the scalar code all by itself.
<
(*) The nature of cache lines having 1 tag and 4 quadword columns
enables the 2 LDs and the ST to both be performed in 2 cycles since
the cache line is 4 accesses wide and we need 3 tag accesses every
4 cycles. {This may require an additional QuadWord of buffering for
STs.} In order to do this, the column being stored cannot be a column
being read by either of the 2 LDs. 7 effective instructions in 2 cycles
is 3.5 IPC and 5× what scalar code could perform.
<
Also note: The C (or Fortran) source has manually unrolled this loop 4
times. This just multiplies the number of instructions in the loop by
4 while leaving a single LOOP instruction for iteration control.
<
> BTW, whether you call what VVM does for this loop "vectorizing" is, I
> think a matter of ones definition of vectorizing, specifically whether
> the definition must include operating on more than one element
> simultaneously or not.
>
In the same way a CRAY-like vector machine does not need fetch
bandwidth while its vector instructions are being performed, a VVM
implementation does not need the Fetch-Parse-Decode stages of
the pipeline to be active while the loop is being performed.

Anton Ertl
Feb 2, 2023, 6:03:22 PM

Agner Fog <ag...@dtu.dk> writes:
>ForwardCom can vectorize array loops in a new way that automatically
>adjusts to the maximum vector length supported by the CPU. It works in
>the following way:
>
>A register used as a loop counter is initialized to the array length.
>The counter is decremented by the vector register length on each
>iteration, and the loop repeats as long as the counter is positive.
>The counter register also specifies the vector register length. The
>vector registers used in the loop will have the maximum length as long
>as the counter exceeds this value. There is no loop tail, because the
>vector length is automatically adjusted in the last iteration of the
>loop to fit the remaining number of array elements.
...
>There is no global vector length register. You can have different
>vectors with different lengths at the same time. The length is stored
>in the vector register itself. When you save and restore a vector
>register, it will only save the part of the register that is actually
>used.

In recent times we have had the problem that CPU manufacturers combine
different cores on the same CPU, and want to migrate threads between
these cores. So they use the same SIMD lengths on all cores, even, in
the case of Intel, disabling AVX-512 on CPUs that do not have the
smaller cores, i.e., where all cores could do AVX-512 if it was not
disabled.

It seems to me that you don't have an answer for that. And I guess
that as long as there are architectural (programmer-visible) SIMD
registers, the problem will persist. The SIMD registers must not be
an architectural feature, as in VVM (or maybe Helium?), to make
migration from cores with longer to cores with shorter vectors
possible.

The other option would be for Intel to implement AVX-512 in the small
cores without 512-bit units, the same way that AMD has been using time
and again: XMM = 2×64 bits on K8, YMM = 2×128 bits on later heavy
machinery and Zen 1, and ZMM = 2×256 bits on Zen 4.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Agner Fog
Feb 3, 2023, 12:47:28 AM

Anton Ertl wrote:
> In recent times we have had the problem that CPU manufacturers combine
> different cores on the same CPU, and want to migrate threads between
> these cores. So they use the same SIMD lengths on all cores, even, in
> the case of Intel, disabling AVX-512 on CPUs that do not have the
> smaller cores, i.e., where all cores could do AVX-512 if it was not
> disabled.
Yes, this was a big mistake in my opinion. The first version of Intel's Alder Lake combined cores with and without AVX512 support. A program will crash if it detects that it can use AVX512 and is later moved to a core without AVX512. So Intel had to disable AVX512 completely, including the new half-precision floating point instructions. Did Intel expect the OS to move a thread back to a capable core in this case? You cannot expect an OS to give special treatment to every new processor on the market with new quirks. I have discussed this problem at https://www.agner.org/forum/viewtopic.php?f=1&t=79
Unfortunately, ARM does the same in their big.LITTLE architecture.

Regarding pointer aliasing:
I think it is the responsibility of the programmer to consider pointer aliasing problems and do a loop backwards if necessary. If you want the hardware to fix all kinds of bad programming, you will soon find yourself down a very deep rabbit hole.

Terje Mathisen
Feb 3, 2023, 1:54:28 AM

Generate correct code, so for max optimization give a warning, but
produce code that tests and branches to alternate implementations
depending upon the parameters, i.e. in your example both b[] and c[] must
be checked for overlap with a[]. The fallback can be fast, as in
memmove(), if zero or one input overlaps; then fall back on scalar for
really bad cases?
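Something like this sketch in C (add_vectorized is hypothetical, standing in for the SIMD version; the overlap test is the standard interval check):

    #include <stdint.h>

    void add_vectorized(int *a, const int *b, const int *c, int n); /* hypothetical */

    /* Returns 1 if [src, src+n) overlaps [dst, dst+n). Addresses are
       compared as integers, since comparing unrelated pointers directly
       is undefined behavior in C. */
    static int overlaps(const int *dst, const int *src, int n)
    {
        uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
        return d < s + n * sizeof(int) && s < d + n * sizeof(int);
    }

    void add(int *a, const int *b, const int *c, int n)
    {
        if (!overlaps(a, b, n) && !overlaps(a, c, n))
            add_vectorized(a, b, c, n);    /* fast path: no aliasing */
        else
            for (int i = 0; i < n; i++)    /* scalar fallback for overlap */
                a[i] = b[i] + c[i];
    }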

robf...@gmail.com
Feb 3, 2023, 3:45:42 AM

Re: memory management,

I skimmed through the manual and was somewhat surprised by the memory
management section. It sounds almost like a segmented system.
Is it possible to get a little more detail? If I have got this right, there is a table(s)
of base block addresses and block lengths, and a number of comparators are
used to map a virtual address to a physical one. Can an object span
multiple blocks? Is the block table swapped on a context switch, or are there
entries for multiple contexts?

While many systems contain block address translations, they also typically
include TLBs and paged memory. That style of memory management seems
to have won out.

Anton Ertl
Feb 3, 2023, 6:35:44 AM

Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:
>> In recent times we have had the problem that CPU manufacturers combine
>> different cores on the same CPU, and want to migrate threads between
>> these cores. So they use the same SIMD lengths on all cores, even, in
>> the case of Intel, disabling AVX-512 on CPUs that do not have the
>> smaller cores, i.e., where all cores could do AVX-512 if it was not
>> disabled.
>Yes, this was a big mistake in my opinion. The first version of Intel's
>Alder Lake combined cores with and without AVX512 support.

What I read is that you could have a BIOS setting where you could
enable AVX-512 but you could only do that if you disabled the E-cores.
Intel soon provided BIOS updates that eliminated this option.

>A program will crash if it detects that it can use AVX512 and is later
>moved to a core without AVX512. So Intel had to disable AVX512
>completely, including the new half-precision floating point
>instructions. Did Intel expect the OS to move a thread back to a
>capable core in this case? You cannot expect an OS to give special
>treatment to every new processor on the market with new quirks.

One way to deal with this would be to not report the availability of
AVX-512 (and maybe even disable it) unless the OS tells the CPU
through a model-specific register that it is prepared for
heterogeneous AVX-512 support.

An OS that is unaware of the issue would just work, without AVX-512,
as is currently the case with Alder Lake; but the option I outlined
would open the possibility for other approaches:

One way an OS could handle the issue is to only tell the CPU that it
should report AVX-512 for processes that are limited to P-cores (with
thread-affinity or somesuch).

Another way would be to always tell the CPU that, and, as you wrote,
to migrate the process back to a P-Core if it executes an AVX-512
instruction.

Intel was not very coordinated when designing Alder Lake. They had
two teams designing cores with different instruction sets, then did
not provide OS workarounds for that, then dealt with the resulting
fallout by first restricting AVX-512 severely, but for whatever reason
even that was not enough for them, and finally they disabled AVX-512
completely.

>Unfortunately, ARM does the same in their big.LITTLE architecture.

A little more coordinated, it seems: All their cores are limited to
128-bit SIMD in SVE, which however eliminates the USP of SVE:
scalability to wider SIMD units.

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary.

I think it's something the programming language design should take
care of. If the language has vectors as a value type, this eliminates
the problem. The Fortran array sub-language is not quite there, but
still pretty good.

Michael S
Feb 3, 2023, 7:41:59 AM

SIMD registers are an architectural feature of Helium/MVE.
They overlap with the scalar FP registers in a manner different from
what is common in the x86 world, but consistent with the way in which
ARMv7-M FPv5 double precision registers overlap with single-precision.

If anything, the actual execution width (1, 2, or 4 beats per architecture
tick) is more exposed to SW in Helium than in most other SIMD environments.

"The SIMD registers must not be an architectural feature" appears
to be a fashion of comp.arch newsgroup, but it's not a popular
view in wider world.

>
> The other option would be for Intel to implement AVX-512 in the small
> cores without 512-bit units, the same way that AMD has been using time
> and again: XMM=2x64bits on K8, YMM=2x128 bits on later heavy machinery
> and Zen1, and ZMM is 2x256 bits on Zen4.
>

Intel did it many times in the past. All SSEx implementations before Core had
64-bit floating-point execution units. They did it later, too.
Even Silvermont, a direct ancestor of the Alder Lake E-cores, had a 64-bit FP multiplier.
Alder Lake was/is just a mess, as you correctly state in another post.
Although somewhat less messy than its predecessor Lakefield.

Michael S
Feb 3, 2023, 7:49:53 AM

Yes, Arm Inc. manages their big.LITTLE combos well.
It was Samsung that did bad things, like combining Cortex-A53
as LITTLE with their own Mongoose cores as big, despite the
mismatch in cache line width, which is certainly visible to
application software.
Mongoose is dead now, so happiness returned.

Scott Lurndal
Feb 3, 2023, 8:46:01 AM

Agner Fog <ag...@dtu.dk> writes:
>Anton Ertl wrote:

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary. If you want the
>hardware to fix all kinds of bad programming, you will soon find
>yourself down a very deep rabbit hole.

Which is why memmove(3) was standardized decades ago.

Anton Ertl
Feb 3, 2023, 9:49:46 AM