Benefits of a write once constraint on register values in an instruction loop


JimBrakefield

Jul 28, 2021, 5:53:23 PM
Consider a portion of a RISC register file used in a write-once mode per loop execution on a nominally OO design: Register renaming is not needed.
Loop instructions can be issued in any order, even simultaneously.
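
As a concrete sketch (Livermore kernel 1, the hydro fragment, written in plain C; the temporary names are arbitrary), every value produced in the body is produced exactly once per iteration, so each could occupy one write-once register for that iteration:

    /* Livermore kernel 1 (hydro fragment): each temporary is assigned
       exactly once per iteration, i.e. the body is already in SSA form. */
    void hydro(int n, double q, double r, double t,
               double *x, const double *y, const double *z)
    {
        for (int k = 0; k < n; k++) {
            double z0 = z[k + 10];        /* write-once value */
            double z1 = z[k + 11];        /* write-once value */
            double s  = r * z0 + t * z1;  /* write-once value */
            x[k] = q + y[k] * s;          /* result goes to memory */
        }
    }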

Issues: what are the savings from not having register renaming?
Are there enough registers (e.g. is a larger register file needed)?
For short loops mechanism needed to identify a register's loop number so multiple loop executions can exist simultaneously?
Loading of loop constant values from elsewhere in the register file into the functional units prior to loop execution?
A burst mechanism for initializing the function unit input buffers or reservation stations?
This is a somewhat novel approach? to high performance loops based on the type of iterations found in the Livermore Loops. As such the best ISA for
it is undetermined? There is considerable room for innovation?

Jim Brakefield

Ivan Godard

Jul 28, 2021, 6:29:34 PM
On 7/28/2021 2:53 PM, JimBrakefield wrote:
> Consider a portion of a RISC register file used in a write-once mode per loop execution on a nominally OO design: Register renaming is not needed.
> Loop instructions can be issued in any order, even simultaneously.

Write-once is complicated to do in a genreg machine, because the
encoding needs some way to distinguish write-once from write-many.

A belt is entirely write-once, so while I can't comment on their use in
genregs, our experience with the belt may be useful to you. YMMV.

> Issues: what are the savings from not having register renaming?

Large, and increases super-linearly as the number of ports increases.
The belt doesn't have ports in the sense of a regfile, but it does have
multiple concurrent reads, just as multiported regfiles do. IANAHG, but
the bigger win is reported to be the absence of write port equivalents.

> Are there enough registers (e.g. is a larger register file needed)?

No, although you do need cheap spill/fill

> For short loops mechanism needed to identify a register's loop number so multiple loop executions can exist simultaneously?

You need some way to distinguish values by iteration number. In a belt
this falls out, because they are at different (and statically known)
belt positions. While putting an iteration number in the register name
seems plausible, you might get encoding problems.

> Loading of loop constant values from elsewhere in the register file into the functional units prior to loop execution?

Can't say, we don't do it. Our decode is enough of a firehose that it's
simple enough to just drop them on the belt each iteration. Yes, that
contributes to belt pressure (reg pressure for you); it doesn't seem to
be a problem on our belt sizes, but I can see where it might be with regs.

> A burst mechanism for initializing the function unit input buffers or reservation stations?

Don't know.

> This is a somewhat novel approach? to high performance loops based on the type of iterations found in the Livermore Loops. As such the best ISA for
> it is undetermined?

Certainly novel, at least to me. If you push it with more thinking you
might end up with something not much like a regular ISA at all, but more
like a cellular automaton or stream machine.

> There is considerable room for innovation?

Yes :-)


MitchAlsup

Jul 28, 2021, 6:39:47 PM
On Wednesday, July 28, 2021 at 4:53:23 PM UTC-5, JimBrakefield wrote:
> Consider a portion of a RISC register file used in a write-once mode per loop execution on a nominally OO design: Register renaming is not needed.
<
OO is Object Oriented
OoO is Out of Order
<
> Loop instructions can be issued in any order, even simultaneously.
>
> Issues: what are the savings from not having register renaming?
<
Probably not enough to make this a key microarchitectural assumption. If you
have resourced the machine to run at peak performance for multiple milli-
seconds in a row, you already have enough rename read ports. If not, you
are already up sh!t creek.
<
But note: The VEC-LOOP construct in My 66000 uses a single rename over
the entire loop (body) and concatenates a loop iteration count to this name,
making it loop-unique without eating lots of rename space (and allowing
the front end to be quiescent during loop execution).
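
Roughly, as a toy C sketch (not the actual hardware representation), the tag a
station sees is the single loop rename with the iteration count concatenated
onto it:

    /* Toy sketch only: one rename covers the whole loop body, and the
       iteration counter makes each iteration's values unique. */
    struct loop_tag {
        unsigned rename;     /* single rename allocated at loop install */
        unsigned iteration;  /* concatenated loop iteration count       */
    };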
<
> Are there enough registers (e.g. is a larger register file needed)?
<
The general trick is that (logical) register names are used to setup the
data-flow dependencies. I might note that nothing in Livermore loops (*.c)
required anything more than the 32 GPRs in My 66000.
<
> For short loops mechanism needed to identify a register's loop number so multiple loop executions can exist simultaneously?
<
You need to solve the question of whether memory is dense and independent,
and at that point rewriting into SIMD is fairly easy. The second problem is to
distinguish values produced in this iteration and consumed in this iteration from
values produced in a previous iteration and consumed in this one. I call the
former vector data and the latter loop-carried data.
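
For example (a trivial C illustration, not from any particular benchmark):

    /* Two kinds of values in one loop body. */
    double dot_and_copy(int n, const double *a, const double *b, double *p)
    {
        double acc = 0.0;                 /* loop-carried: produced in one
                                             iteration, consumed in the next */
        for (int i = 0; i < n; i++) {
            double t = a[i] * b[i];       /* vector data: produced and
                                             consumed in the same iteration */
            p[i] = t;
            acc  = acc + t;               /* the loop-carried update */
        }
        return acc;
    }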
<
> Loading of loop constant values from elsewhere in the register file into the functional units prior to loop execution?
<
My 66000 loads scalar register values into station entries during loop installation.
Subsequently, these values do not need to be read from the RF on each loop
iteration. Each station will await its vector or loop-carried dependencies just
like any reservation station entry would.
<
> A burst mechanism for initializing the function unit input buffers or reservation stations?
<
I did not find this necessary as it speeds up only the first iteration.
<
> This is a somewhat novel approach?
<
Sounds essentially like what VVM does (or enables).........
<
> to high performance loops based on the type of iterations found in the Livermore Loops. As such the best ISA for
> it is undetermined? There is considerable room for innovation?
<
Having read the code for LL spit out from Brian's compiler, I can't see room
for a lot of improvement within the realm of RISC architectures, except perhaps
in code density.
>
> Jim Brakefield

JimBrakefield

Jul 28, 2021, 10:19:32 PM
Although the belt is write-once, it is logically ordered by instruction sequence, which the compiler arranges.
So there does not seem to be a simple mapping between the belt and this write once register block idea.
I'll skip the details and state that the register numbers in the function unit reservation stations are not used to write to the registers;
instead, the register numbers describe the dependency graph of results and thus control the order of evaluation.
(I am assuming function units wait for both operands, which are obtained individually as function units broadcast their results.)

So the idea maps most easily to an OoO RISC architecture. There is a mapping to a stack/accumulator machine (the writes are to the data stack).
At the end of the loop the stack pointer is reset. Does that require in-order instruction issue?

> might end up with something not much like a regular ISA at all, but more
> like a cellular automaton or stream machine.
Here the goal is to keep as many pipelined function units busy as possible.
So in that sense it is a stream machine. The "draw" is that the OoO RISC hardware, without renaming, is repurposed into a stream machine.

JimBrakefield

Jul 28, 2021, 10:28:15 PM
On Wednesday, July 28, 2021 at 5:39:47 PM UTC-5, MitchAlsup wrote:
> On Wednesday, July 28, 2021 at 4:53:23 PM UTC-5, JimBrakefield wrote:
> > Consider a portion of a RISC register file used in a write-once mode per loop execution on a nominally OO design: Register renaming is not needed.
> <
> OO is Object Oriented
> OoO is Out of Order
> <
> > Loop instructions can be issued in any order, even simultaneously.
> >
> > Issues: what are the savings from not having register renaming?
> <
> Probably not enough to make this a key microarchitectural assumption. If you
> have resourced the machine to run at peak performance for multiple milli-
> seconds in a row, you already have enough rename read ports. If not, you
> are already up sh!t creek.
> <
> But note: The VEC-LOOP construct in My 66000 uses a single rename over
> the entire loop (body) and concatenates a loop iteration count to this name,
> making it loop-unique without eating lots of rename space (and allowing
> the front end to be quiescent during loop execution).

Ah, now I can understand the My 66000 looping implementation.

> <
> > Are there enough registers (e.g. is a larger register file needed)?
> <
> The general trick is that (logical) register names are used to setup the
> data-flow dependencies. I might note that nothing in Livermore loops (*.c)
> required anything more than the 32 GPRs in My 66000.
> <
> > For short loops mechanism needed to identify a register's loop number so multiple loop executions can exist simultaneously?
> <
> You need to solve the question of whether memory is dense and independent,
> and at that point rewriting into SIMD is fairly easy. The second problem is to
> distinguish values produced in this iteration and consumed in this iteration from
> values produced in a previous iteration and consumed in this one. I call the
> former vector data and the latter loop-carried data.
> <
> > Loading of loop constant values from elsewhere in the register file into the functional units prior to loop execution?
> <
> My 66000 loads scalar register values into station entries during loop installation.
> Subsequently, these values do not need to be read from the RF on each loop
> iteration. Each station will await its vector or loop-carried dependencies just
> like any reservation station entry would.
> <
> > A burst mechanism for initializing the function unit input buffers or reservation stations?
> <
> I did not find this necessary as it speeds up only the first iteration.
> <
> > This is a somewhat novel approach?
> <
> Sounds essentially like what VVM does (or enables).........
> <
> > to high performance loops based on the type of iterations found in the Livermore Loops. As such the best ISA for
> > it is undetermined? There is considerable room for innovation?
> <
> Having read the code for LL spit out from Brian's compiler, I can't see room
> for a lot of improvement within the realm of RISC architectures, except perhaps
> in code density.

Livermore Loops code density would appear to benefit from load/store register indirect auto increment.
And from an accumulator or stack ISA (one operand source and result destination implied).
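
For instance (an illustrative C form of such a loop), with auto-increment load/store every pointer bump below folds into its memory instruction:

    /* Hydro-fragment-style loop written with pointer post-increments. */
    void hydro_ptr(int n, double q, double r, double t,
                   double *x, const double *y, const double *z)
    {
        const double *zp = z + 10;
        while (n-- > 0) {
            double z0 = *zp++;                    /* load with post-increment */
            double z1 = *zp;                      /* next element, no bump    */
            *x++ = q + *y++ * (r * z0 + t * z1);  /* loads/store, all bumped  */
        }
    }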

> >
> > Jim Brakefield

Thomas Koenig

Jul 29, 2021, 6:19:04 AM
JimBrakefield <jim.bra...@ieee.org> schrieb:

> Livermore Loops code density would appear to benefit from
> load/store register indirect auto increment.

Why is there such a big concern with Livermore Loops, especially
for code density? I would assume that any CPU used for scientific
work would run these out of its L1 cache without problems.

Apart from that, any auto increment / decrement will result in two
stores, so presumably it will be cracked into two microops anyway.

If you want to do something for instruction density where it
helps many of today's users, look towards browsers, Java VMs,
JavaScript engines and video codecs.

Plus, of course, games, but I don't know of any benchmark
there.

Anton Ertl

Jul 29, 2021, 10:03:27 AM
JimBrakefield <jim.bra...@ieee.org> writes:
>Consider a portion of a RISC register file used in a write-once mode per loop execution on a nominally OO design: Register renaming is not needed.
>Loop instructions can be issued in any order, even simultaneously.
>
>Issues: what are the savings from not having register renaming?

None on a modern OoO design. Register renaming is essential not only
for OoO execution of straight-line code, but also for recovery from
branch mispredictions and for precise exceptions. That's why we see
neither OoO without register renaming nor in-order with register renaming.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Jul 29, 2021, 2:28:01 PM
Not as much as you might think. The auto-increment part is fulfilled by the LOOP
instruction (which also does CMP and BC) and the efficient [Rbase+Rindex<<scale+Disp]
memory address generation.
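
In C terms (sketch only), the indexed form keeps a single counter and no pointer bumps:

    /* One index i drives every access via [Rbase+Rindex<<scale+Disp];
       the LOOP instruction supplies the increment, compare and branch. */
    void hydro_idx(int n, double q, double r, double t,
                   double *x, const double *y, const double *z)
    {
        for (int i = 0; i < n; i++)
            x[i] = q + y[i] * (r * z[i + 10] + t * z[i + 11]);
    }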
>
> > >
> > > Jim Brakefield

MitchAlsup

Jul 29, 2021, 2:29:32 PM
On Thursday, July 29, 2021 at 5:19:04 AM UTC-5, Thomas Koenig wrote:
> JimBrakefield <jim.bra...@ieee.org> schrieb:
> > Livermore Loops code density would appear to benefit from
> > load/store register indirect auto increment.
> Why is there such a big concern with Livermore Loops, especially
> for code density? I would assume that any CPU used for scientific
> work would run these out of its L1 cache without problems.
>
> Apart from that, any auto increment / decrement will result in two
> stores, so presumably it will be cracked into two microops anyway.
<
Two register file accesses: LD-w/AI is 2 writes; ST-w/AI is 1 write and 1 read.

Marcus

Jul 29, 2021, 2:40:16 PM
Browsers (C++) and JavaScript (JIT), yes. Java VM, probably not so much
these days?

>
> Plus, of course, games, but I don't know of any benchmark
> there.

Games are tricky because they typically require lots of hardware
and platform support (like OpenGL/Vulkan, audio, various forms of I/O,
threading, etc), which may not be available in a lab environment that
you may have when evaluating an ISA. They may also be developed in a
more-complex-than-C language (e.g. C++17).

That said, there are some modern open source game engines that may be
interesting to dive into to find performance sensitive code, e.g. the
Godot Engine [1].

I have personally only gotten as far as Quake (written in C, in the
1990s), and I've found that it uses lots of Pentium/CISC-isms and
solutions that are not quite as relevant today as they were when the
game was created. So while it is an easy target, I'm very careful when
using it as a benchmark.

/Marcus

[1] https://github.com/godotengine/godot

JimBrakefield

Jul 29, 2021, 6:33:28 PM
On Thursday, July 29, 2021 at 5:19:04 AM UTC-5, Thomas Koenig wrote:
> Why is there such a big concern with Livermore Loops, especially
> for code density?

Livermore Loops are short and simple, and presumably often used in
physics code. I am manic about code density. For an accumulator/stack machine
an instruction fits in under 16 bits: an ALU operation, a register reference and an
indirect auto-increment flag.
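
A purely hypothetical sketch of such an encoding (field widths invented for illustration only):

    /* Hypothetical sub-16-bit accumulator/stack instruction layout. */
    struct acc_insn {
        unsigned opcode   : 6;   /* ALU operation                 */
        unsigned reg      : 5;   /* register reference            */
        unsigned auto_inc : 1;   /* indirect auto-increment flag  */
    };                           /* 12 bits, comfortably under 16 */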

Need to look at game engines.

JimBrakefield

Jul 29, 2021, 6:43:59 PM
There is a remaining degree of freedom in the RISC mapping: within the write-once register block the registers can be allocated sequentially.
So if the block starts at zero, the destination register fields are sequential, which helps with multiple issue. It likewise helps with mapping to
a belt machine or an accumulator/stack machine.

A bigger problem is that not all loops want a clean slate of the write-once register block. Reusing previous values can reduce memory traffic
(such as in FIR filters, etc). The tags associated with the write-once registers get more complicated.
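
E.g. a 3-tap FIR in C (illustrative): x1 and x2 are carried across iterations instead of being reloaded, so they no longer fit a clean write-once-per-iteration tag:

    /* Rotating the two previous samples saves two loads per iteration,
       but those values now live across iterations. */
    void fir3(int n, const double c[3], const double *x, double *y)
    {
        double x1 = x[1], x2 = x[0];
        for (int i = 2; i < n; i++) {
            double x0 = x[i];                 /* only new load per trip */
            y[i] = c[0] * x0 + c[1] * x1 + c[2] * x2;
            x2 = x1;                          /* value reused next time */
            x1 = x0;
        }
    }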

MitchAlsup

Jul 29, 2021, 7:15:20 PM
You must arrange the situation where the subroutine calling convention is integrated
with the in-block register ordering such that you do not require ANY added data
movement in order to enter a loop.

JimBrakefield

Jul 29, 2021, 7:28:29 PM
Would expect a RISC enter-loop instruction to indicate the write-once register block by, say, two register fields.
My 66000 uses a bit mask, which is fine?

MitchAlsup

Jul 29, 2021, 8:30:05 PM
But this bit mask specifies the registers which carry loop-to-loop dependencies,
so that during installation of the loop the registers can be classified into scalar
(read only in the loop), vector (written in the loop), and carry (written in one iteration,
read in a subsequent iteration). The carry registers are delivered to the register file
when the loop terminates. It ends up that the vast majority of loops do not
have any carry dependencies.
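
A sketch of that classification on a plain DAXPY body (illustrative C):

    /* scalar : a            (read-only across the loop)
       vector : x[i], y[i]   (produced/consumed per iteration)
       carry  : none         (the common case)                  */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }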

JimBrakefield

Aug 16, 2021, 2:42:52 PM
Good point: auto-increment and auto-decrement addressing modes will require many register updates for Livermore loops,
which in turn makes PDP-11-style addressing modes problematical for high performance.
The first-order solution is a dedicated index register and addressing modes that use it.
E.g. (R++) becomes (R + ix<<size), where R is the base address of the data array (or derived from it),
and (--R) becomes (R - ix<<size) for use with a descending-index (count down to zero) loop.
If one has larger/longer instructions, a register field for the index register is possible.
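
In C terms (illustrative only), the same copy loop both ways:

    void copy_autoinc(int n, int *dst, const int *src)
    {
        while (n-- > 0)
            *dst++ = *src++;      /* (R++) style: two address updates per trip */
    }

    void copy_indexed(int n, int *dst, const int *src)
    {
        for (int ix = 0; ix < n; ix++)
            dst[ix] = src[ix];    /* (R + ix<<size) style: one shared index */
    }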

MitchAlsup

Aug 16, 2021, 9:00:47 PM
On Monday, August 16, 2021 at 1:42:52 PM UTC-5, JimBrakefield wrote:
> On Thursday, July 29, 2021 at 1:29:32 PM UTC-5, MitchAlsup wrote:
> > On Thursday, July 29, 2021 at 5:19:04 AM UTC-5, Thomas Koenig wrote:
> > > JimBrakefield <jim.bra...@ieee.org> schrieb:
> > > > Livermore Loops code density would appear to benefit from
> > > > load/store register indirect auto increment.
> > > Why is there such a big concern with Livermore Loops, especially
> > > for code density? I would assume that any CPU used for scientific
> > > work would run these out of its L1 cache without problems.
> > >
> > > Apart from that, any auto increment / decrement will result in two
> > > stores, so presumably it will be cracked into two microops anyway.
> > <
> > Two register file accesses: LD-w/AI 2 writes ST-w/AI 1write 1 read.
> > >
> > > If you want to do something for instruction density where it
> > > helps many of today's users, look towards browsers, Java VMs,
> > > JavaScript engines and video codecs.
> > >
> > > Plus, of course, games, but I don't know of any benchmark
> > > there.
> Good point, auto-increment and auto-decrement addressing modes will require many register updates for Livermore loops.
> Which in turn, makes PDP11 style addressing modes problematical for high performance.
<
As if it were not already so.........
<
> The first order solution is a dedicated index register and addressing modes that use it.
> E.g. (R++) becomes (R + ix<<size) where R is the base address of the data array (or derived from it)
> And (--R) becomes (R - ix<<size) for use with descending index (count down to zero) loop.
<
In K9 (my x86-64 machine) we had a HW peephole optimizer that would take a
series of stack pushes and pops and replace them with a series of STs/LDs terminated
with a single increment, which decoupled the AGEN instructions from the arithmetic
instructions.