
VLIW Vectorisation Instruction format


lkcl

Jun 20, 2019, 11:39:10 AM
https://libre-riscv.org/simple_v_extension/specification/#vliw-format

Write-up (preliminary spec / context), above

I am developing a hybrid processor based on extending RISCV with hidden state that does not change the instruction format: it merely specifies that if certain registers are used, the instruction issue phase goes into a hardware "for loop" for a bit, issuing multiple contiguous instructions with sequentially-increasing register numbers.
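A minimal sketch of that hardware "for loop", written as issue-phase pseudocode in C. Everything here (`insn_t`, `vector_issue`, the 32-entry flag table) is an illustrative assumption, not the actual SV specification:

```c
/* Sketch only: how an issue phase MIGHT expand one scalar-format
 * instruction into VL sequential issues when a register is flagged
 * as "vectorised".  Names are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int opcode; int rd, rs1, rs2; } insn_t;

/* per-register "vector" flag -- the hidden state set up beforehand */
static bool reg_is_vector[32];

/* expand one instruction; returns the number of micro-ops issued */
size_t vector_issue(insn_t in, int vl, insn_t out[], size_t max)
{
    bool vec = reg_is_vector[in.rd] ||
               reg_is_vector[in.rs1] || reg_is_vector[in.rs2];
    size_t n = vec ? (size_t)vl : 1;
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++) {
        insn_t u = in;
        /* registers marked "vector" step sequentially; scalars repeat */
        if (reg_is_vector[u.rd])  u.rd  += (int)i;
        if (reg_is_vector[u.rs1]) u.rs1 += (int)i;
        if (reg_is_vector[u.rs2]) u.rs2 += (int)i;
        out[i] = u;
    }
    return n;
}
```

A scalar instruction touching no flagged registers falls straight through as a single issue, which is what keeps the encoding unchanged.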

This task is made a lot easier if a good multi issue execution engine is available, hence the interest in the 6600 which Mitch has been very kindly helping me to understand and extend.

The initial design of the Vectorisation system used CSRs to create a kind of CAM that took the register to be "overridden" and what to do if that register was ever encountered in any given "standard" RISCV instruction.

The problem is, it takes 16 bits to spec that out in full (per register), such that if 4 registers need to be spec'd out, it requires *four* instructions, one of which is a 64 bit LD, just to get the darn information!

This was so ridiculously long that an alternative scheme was envisaged: a 48 bit format that embedded the 32 bit RISCV format and specified some minimalist subset of SV Vectorisation.

The only problem with that scheme is that it extends the instruction size by 50%, and the highly compact 16-bit RVC opcodes cannot be used.

So there is yet another, different, penalty, if using the SVPrefix format.

A solution has to exist, and it occurred to me a couple days ago, well why not create a VLIW format?

This would start by first specifying how long the VLIW Group was, so that standard RISCV 16 and 32 bit ops would fit into it.

It would also specify how many registers were to be treated as "vectorised", and how many to be "predicates" (standard scalar RISCV has no concept of predication).

In addition it would set the Vector Length fields, so that any instructions that were marked as "vectorised if used", the execution engine would know how many contiguous sequential ops to issue.

The only thing that is bugging me is, I have never done anything like this, or envisaged the possibility of designing a VLIW execution engine. So I have no idea what to expect, and am concerned about missing something crucial, hence the reason for this post.

The one thing that I can immediately see as problematic is branches and exceptions.

Both could result in execution jumping or returning into the middle of a VLIW Group, and unless the right context information was supplied, the processor would need to desperately try to work out where the start of the VLIW opcode was, in order to establish its Vectorisation Context.

Clearly that way lies madness, so I thought of the idea of creating a sub-PC - a VLIW PC that begins at the start of the VLIW Group.

It would increment INSTEAD of the PC, which would leave the PC pointing at the START of the group, for the full duration of sub-execution.

The sub-PC would also need to be copied to a CSR on an exception. What is nice about that is that RISCV already has the concept of mepc (Machine Exception Program Counter); this sub-PC would need a corresponding CSR, and the code modifications to support it are quite reasonable and minimal.

Context switches would also save / restore this sub-exception PC.
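The sub-PC rule above can be sketched as follows. `msubepc` is a hypothetical CSR name chosen here purely by analogy with mepc; it is not anything from the RISC-V spec:

```c
/* Sketch of the proposed sub-PC behaviour: while executing inside a
 * VLIW group, only the sub-PC advances; the main PC stays parked at
 * the group start.  "msubepc" is an INVENTED companion CSR name. */
typedef struct {
    unsigned long pc;      /* points at start of current VLIW group  */
    unsigned long subpc;   /* offset of current op within the group  */
    unsigned long mepc;    /* standard RISC-V exception PC CSR       */
    unsigned long msubepc; /* hypothetical companion CSR for sub-PC  */
} hart_t;

void step_in_group(hart_t *h, unsigned op_bytes)
{
    h->subpc += op_bytes;        /* the PC itself does NOT move */
}

void take_exception(hart_t *h, unsigned long handler)
{
    h->mepc    = h->pc;          /* group start, as with plain mepc   */
    h->msubepc = h->subpc;       /* enough context to resume mid-group */
    h->pc    = handler;
    h->subpc = 0;
}
```

On return, restoring (mepc, msubepc) pinpoints the exact sub-instruction without any scanning backwards for the group header.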

As for branches, I am not sure it is a good idea to add ways to let programs jump in and out of different VLIW Groups. As there is not much time, I figured that the simplest thing to do would be to change the meaning of branches to be relative to the start of the VLIW Group.

Yes, this is quite short: only about 10 16-bit opcodes or 5 32-bit ones can fit into a VLIW Group. Still, it is enough to do some very tight inner loops.

Where long distance branches are needed, I figured it is always possible to just end the VLIW Group, and to require that branches be done as "standard" RISCV operations.

Can anyone advise if there is anything that might have been overlooked here? Anything that could throw a spanner in the works?

I love the idea of a VLIW format backed up by a proper multi-issue execution engine to preserve instruction order.

I believe that, with no PC-related complexity in the branches or exceptions, a pre-processing phase would be all it would take: buffer the VLIW Groups as they are identified, then throw the sub-instructions into the same queue that non-VLIW opcodes would normally go into directly.

Each subinstruction, unlike a "normal" one, would have the Vectorisation State carried along with it (predication, Vector Length), for further unpacking at the right time.

It honestly does not sound very hard to implement, hence my concern that I may have overlooked something.

Feedback appreciated.

L.

Ivan Godard

Jun 20, 2019, 12:07:28 PM
On 6/20/2019 8:39 AM, lkcl wrote:

<snip>

> Can anyone advise if there is anything that might have been overlooked here? Anything that could throw a spanner in the works?

Not a spanner, just a dumb question: if you are going to the trouble of
having a VLIW engine, why not use it for everything and just do a normal
VLIW?

MitchAlsup

Jun 20, 2019, 12:31:07 PM
On Thursday, June 20, 2019 at 10:39:10 AM UTC-5, lkcl wrote:
> https://libre-riscv.org/simple_v_extension/specification/#vliw-format
>
> Write-up (preliminary spec / context), above
>
> I am developing a hybrid processor based on extending RISCV with hidden state that does not change the instruction format, it merely specifies that if certain registers are used, the instruction issue phase is to go into a hardware "for loop" for a bit, issuing multiple contiguous instructions with sequentially-increasing register numbers.
>
> This task is made a lot easier if a good multi issue execution engine is available, hence the interest in the 6600 which Mitch has been very kindly helping me to understand and extend.
>
> The initial design of the Vectorisation system used CSRs to create a kind of CAM that took the register to be "overridden" and what to do if that register was ever encountered in any given "standard" RISCV instruction.

CSR = Control Status Register ????
>
> The problem is, it takes 16 bits to spec that out in full (per register), such that if 4 registers need to be spec'd out, it requires *four* instructions, one of which is a 64 bit LD, just to get the darn information!
>
> This was so ridiculously long that an alternative scheme was envisaged: a 48 bit format that embedded the 32 bit RISCV format and specified some minimalist subset of SV Vectorisation.
>
> The only problem with that scheme is: it extends the instruction size by 50%, and the highly compact RVC opcodes (RVC) cannot be used.
>
> So there is yet another, different, penalty, if using the SVPrefix format.
>
> A solution has to exist, and it occurred to me a couple days ago, well why not create a VLIW format?

The solution I found was to "blow" an instruction prior to entering the loop.
This instruction provides the information regular instructions do not carry.

For my Virtual Vector Method, the instruction supplies a bit field, and the
bits in the field are used to denote that a "register" in the "loop" is either
Scalar (0) or Vector (1).

So if we have::

LOOP:
LDD R7,[Ra+Ri<<3] // A[i]
LDD R8,[Rb+Ri<<3] // B[i]
FMAC R9,R8,R7,R9 // sum += A[i]*B[i]
ADD Ri,Ri,1
CMP Rt,Ri,Rmax
BLT Rt,LOOP

Ra and Rb are Scalar and can be obtained once.
Ri is the loop indexing variable.
Rt is a phantom, used-once value carrier.
R7, R8, R9 are Vector registers--they have a different value each loop.

So I have an instruction that provides the Scalar/Vector information,
and I have an instruction that provides the ADD/CMP/BC triplet semantic.
Now I can encode the loop as::

VEC {{VSV}{VSV}{VVVV}{VS}}
LOOP:
LDD R7,[Ra+Ri<<3] // A[i]
LDD R8,[Rb+Ri<<3] // B[i]
FMAC R9,R8,R7,R9 // sum += A[i]*B[i]
LOOP Ri,Rmax,LT

In this case I have VECTORIZED loops not vectorized instructions; although
the execution of the loops will appear to be the same at the data path level;
It is not the same at the register storage level. My method vectorizes loops
without any vector register storage!

Also note: the register writes do not need to transpire while the loop is in
progress unless an exception or interrupt is raised.
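A rough C model of the loop semantics described above (an assumption-laden paraphrase for illustration, not Mitch's actual hardware): scalar registers are captured once, while "Vector" registers simply take a fresh value on every trip around the loop, with no vector register file anywhere:

```c
/* Sketch of Virtual Vector Method semantics applied to the example:
 * the VEC bitmask marks which loop registers are Scalar (read once)
 * and which are Vector (a new value each iteration).  "Vectorised"
 * values live only in flight -- there is no vector register file. */
#include <stddef.h>

double vvm_dot(const double *a, const double *b, size_t rmax)
{
    /* Ra, Rb are Scalar: the base addresses, obtained once.        */
    double sum = 0.0;                  /* R9: carried loop-to-loop  */
    for (size_t i = 0; i < rmax; i++) {/* ADD/CMP/BLT folded into   */
                                       /* the single LOOP op        */
        double r7 = a[i];              /* R7: Vector, new each trip */
        double r8 = b[i];              /* R8: Vector, new each trip */
        sum += r7 * r8;                /* FMAC R9,R8,R7,R9          */
    }
    return sum;
}
```

The point being illustrated: the per-iteration values r7/r8 never need architectural storage, so the hardware is free to run many iterations' worth of them in parallel lanes.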
>
> This would start by first specifying hoe long the VLIW Group was, so that standard RISCV 16 and 32 bit ops would fit into it.

Above
>
> It would also specify how many registers were to be treated as "vectorised", and how many to be "predicates" (standard scalar RISCV has no concept of predication).

Above
>
> In addition it would set the Vector Length fields, so that any instructions that were marked as "vectorised if used", the execution engine would know how many contiguous sequential ops to issue.

Unnecessary.
>
> The only thing that is bugging me is, I have never done anything like this, or envisaged the possibility of designing a VLIW execution engine. So I have no idea what to expect, and am concerned about missing something crucial, hence the reason for this post.
>
> The one thing that I can immediately see as problematic is branches and exceptions.
>
> Both could result in execution jumping or returning into the middle of a VLIW Group, and unless the right context information was supplied, the processor would need to desperately try to work out where the start of the VLIW opcode was, in order to establish its Vectorisation Context.

Why not "back the machine" up to the top of the loop and set Rvl to the
calculations not yet performed; or for a vector register file machine
back up to the beginning of the instruction raising the exception/interrupt.

lkcl

Jun 20, 2019, 12:33:41 PM
Not dumb. We need RISCV scalar interoperability; otherwise the entire design is isolated and cut off from binary interoperability with the RISCV world.

Clearly that would be bad.

We can just about tolerate, as a tiny team, the creation of a special port of LLVM that focusses on the IR (without even clang frontend support).

The Vulkan 3D engine that we are working on does SPIRV IR to LLVM IR compilation, and that means that (using the special Vectorised opcodes) the LLVM JIT compiler translates that LLVM IR into Vectorised assembly.

I envisaged a separate pass, inside the JIT Engine, that would take standard Vectorised opcodes and, by identifying those which had the exact same Vectorisation Prefix, batch them up instead into a much shorter VLIW Group.

This would be much simpler than trying to get LLVM to understand VLIW Grouping.

Bottom line, minimising the amount of compiler work is a crucial strategic move intended to decrease development time and increase the chances of success.

A full-on VLIW design, isolated from the RISCV community, goes directly against that. It would easily add a decade to the completion time.

L.

EricP

Jun 20, 2019, 12:42:07 PM
I'm not sure but I think Luke may be using the term VLIW to
mean RISC-V Variable Length Instruction Word and not the
traditional usage Very Long Instruction Word

https://en.wikipedia.org/wiki/Very_long_instruction_word

lkcl

Jun 20, 2019, 12:58:10 PM
On Friday, June 21, 2019 at 12:31:07 AM UTC+8, MitchAlsup wrote:

>
> CSR = Control Status Register ????

Yes.


> > A solution has to exist, and it occurred to me a couple days ago, well why not create a VLIW format?
>
> The solution I found was to "blow" an instruction prior to entering the loop.
> This instruction provides the information regular instructions do not carry.

Similar principle. Have to be careful about not carrying too much state, or context switching gets very burdensome.

> For my Virtual Vector Method, the instructions supplies a bit field and the
> bits in the field are used to denote that a "register" in the "loop" is either
> Scalar (0) or Vector (1).


Sigh. I would love to just be able to use a bitfield! It would be 32 bits, because RISCV has 32 registers.

The issue with our design is, as it is also a hybrid GPU / VPU, we have to extend the regfile to 128 entries. In addition, the hidden state can specify a "width override" that sets the operation to 8, 16 or 32 bit (or default).

The state information table per register is therefore not 1 bit, it's 16 bits:

* register to change if it is used (5 bits)
* register to ACTUALLY use (7 bits, because it is 0 to 127 now)
* register is scalar or vector (1 bit)
* register is FP or INT (1 bit)
* width override (2 bits)

It is a LOT of info, hence my concern about both setup time as well as context switching.
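That 16-bit per-register entry can be sketched as a C bitfield. The field names and the bit ordering here are purely illustrative guesses at a layout, not the actual SV CSR format:

```c
/* Sketch of the ~16-bit per-register override entry listed above.
 * Field names and bit positions are INVENTED for illustration. */
#include <stdint.h>

typedef struct {
    uint16_t key     : 5;  /* register number to intercept if used */
    uint16_t regidx  : 7;  /* register to ACTUALLY use (0..127)    */
    uint16_t isvec   : 1;  /* scalar (0) or vector (1)             */
    uint16_t isfp    : 1;  /* INT (0) or FP (1) register file      */
    uint16_t elwidth : 2;  /* width override: default/8/16/32 bit  */
} sv_entry_t;              /* 5 + 7 + 1 + 1 + 2 = 16 bits          */

/* pack into the raw 16-bit form a CSR write would carry */
uint16_t sv_pack(sv_entry_t e)
{
    return (uint16_t)(e.key | (e.regidx << 5) | (e.isvec << 12)
                      | (e.isfp << 13) | (e.elwidth << 14));
}
```

Four such entries are 64 bits of raw state, which is exactly why loading them through CSR instructions alone gets so expensive.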

> So if we have::
>
> LOOP:
> LDD R7,[Ra+Ri<<3] // A[i]
> LDD R8,[Rb+Ri<<3] // B[i]
> FMAC R9,R8,R7,R9 // sum += A[i]*B[i]
> ADD Ri,Ri,1
> CMP Rt,Ri,Rmax
> BLT Rt,LOOP
>
> Ra and Rb are Scalar and can be obtained once.
> Ri is the loop indexing variable.
> Rt is a phantom used once value carrier
> R7, R8, R9 are Vector registers--they have a different value each loop.
>
> So I have an instruction that provides the Scalar/Vector information,
> and I have an instruction that provides the ADD/CMP/BC triplett semantic.
> Now I can encode the loop as::
>
> VEC {{VSV}{VSV}{VVVV}{VS}}
> LOOP:
> LDD R7,[Ra+Ri<<3] // A[i]
> LDD R8,[Rb+Ri<<3] // B[i]
> FMAC R9,R8,R7,R9 // sum += A[i]*B[i]
> LOOP Ri,Rmax,LT
>
> In this case I have VECTORIZED loops not vectorized instructions; although
> the execution of the loops will appear to be the same at the data path level;

How is a context switch handled?

> It is not the same at the register storage level. My method vectorizes loops
> without any vector register storage!

This aspect of the 88000 still melts my brain :)

> >
> > Both could result in execution jumping or returning into the middle of a VLIW Group, and unless the right context information was supplied, the processor would need to desperately try to work out where the start of the VLIW opcode was, in order to establish its Vectorisation Context.
>
> Why not "back the machine" up to the top of the loop and set Rvl to the
> calculations not yet performed; or for a vector register file machine
> back up to the beginning of the instruction raising the exception/interrupt.

Hmmm, it would work if the interrupt was external (a different privilege level). And with shadowing it is quite easy to say that the instruction group be atomic by casting a commit shadow across the entire group. Only when the last instruction in the group finished would commit be allowed.

There are a couple of problems I can foresee. One is LD/ST exceptions within the group, particularly illegal memory accesses: the exception PC would point to the beginning of the group, while the actual fault would be half way through.

Second, much more important, if there is a loop that modifies large areas of memory, there is not going to be enough space in the engine to buffer all the commits waiting for exit from the group.

Hence why my first reaction was, don't go there, just create a sub-group PC.

L.

lkcl

Jun 20, 2019, 1:07:56 PM
Kinda, yes. RISCV happens to have defined an opcode format that is very long.

I am using that to embed *variable length* instruction groups, a la VLIW, as a way to add "Vectorisation Context" to a specific group of "standard 16/32/48 bit RISCV opcodes".

The prior (pre-VLIW-esque) version was so cumbersome at specifying the vectorisation context that it could not be left undealt with.

One strong advantage this idea has over more traditional VLIW: normally it is left to the compiler to sort out the dog's dinner morass of opcodes and potential WaR, RaW and WaW hazards.

In this design, all the hazards are taken care of by the 6600-style augmented precise multi-issue scoreboard, making it relatively straightforward to get LLVM JIT to understand the format, instead of a massive multi-year redesign.

L.

lkcl

Jun 20, 2019, 1:29:04 PM
On Friday, June 21, 2019 at 12:31:07 AM UTC+8, MitchAlsup wrote:

> Why not "back the machine" up to the top of the loop and set Rvl to the
> calculations not yet performed; or for a vector register file machine
> back up to the beginning of the instruction raising the exception/interrupt.

Forgot to say, our design, like the 88000, Mitch, is not a vector processor per se. There is no vector regfile. The registers are marked as "vector-like" and a hardware for-loop starts from that regnum, pauses the PC temporarily, and issues multiple identical instructions, identical in every way *except* that the reg# is sequentially increased until it reaches N plus Vector Length minus 1.

That's the simple version.

The complex version is that the increments can be done as 8, 16, 32 or 64 bit steps, as if the regfile were a straight byte-addressable SRAM. A c union of a byte array, short array, int array and long array expresses it best.
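That union might look like this (a sketch; `regfile_t` is an invented name, but the 128-register size follows the figure given earlier in the thread):

```c
/* The register file viewed as flat byte-addressable SRAM, so the
 * "element width" override just changes the stride (and view) of
 * the hardware for-loop. */
#include <stdint.h>

#define NREGS 128                /* extended regfile, per the thread */

typedef union {
    uint8_t  b[NREGS * 8];       /* 8-bit element view               */
    uint16_t s[NREGS * 4];       /* 16-bit element view              */
    uint32_t i[NREGS * 2];       /* 32-bit element view              */
    uint64_t l[NREGS];           /* 64-bit (default) element view    */
} regfile_t;

/* read element idx at the given width override (0=8b .. 3=64b) */
uint64_t rf_read(const regfile_t *rf, unsigned elwidth, unsigned idx)
{
    switch (elwidth) {
    case 0:  return rf->b[idx];
    case 1:  return rf->s[idx];
    case 2:  return rf->i[idx];
    default: return rf->l[idx];
    }
}
```

The for-loop then just walks `idx` from the starting element for Vector Length steps, whatever the width.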

In this way we kinda express SIMD in a variable-length vectorised format.

Btw, this loop has *nothing* to do with the new VLIW idea! When the two are combined, it is actually the sub VLIW PC that needs pausing whilst the Vector for-loop is in progress, just as the PC is paused whilst the sub VLIW PC is in effect.

Two nested hardware for-loops. Yuck! :)

L.

Stephen Fuld

Jun 20, 2019, 1:29:28 PM
Have you made any progress, at least that you can talk about, about how
to handle the ending to "collect" the values in the multiple function
units into the single R9?


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard

Jun 20, 2019, 6:44:53 PM
On 6/20/2019 9:33 AM, lkcl wrote:
> On Friday, June 21, 2019 at 12:07:28 AM UTC+8, Ivan Godard wrote:
>> On 6/20/2019 8:39 AM, lkcl wrote:
>>
>> <snip>
>>
>>> Can anyone advise if there is anything that might have been overlooked here? Anything that could throw a spanner in the works?
>>
>> Not a spanner, just a dumb question: if you are going to the trouble of
>> having a VLIW engine, why not use it for everything and just do a normal
>> VLIW?
>
> Not dumb. We need RISCV scalar interoperability otherwise the entire design is completely isolated and cut off from binary interoperability with the RISCV world.
>
> Clearly that would be bad.

Well - it gives you something to design around instead of design over.

Half-measures give half the results - and usually at twice the cost.

FWIW, there are existing VLIW backends for LLVM. Because LLVM gives you
SSA to work with it's not at all hard to generate VLIW code; the
dependencies are already done for you. Yes, it took us a long time, but
we had to generate for temporal addressing and exposed pipeline; you'd
have it much easier assuming you went for spatial addressing and hidden
pipe, like sensible VLIWs.

lkcl

Jun 20, 2019, 7:36:27 PM
On Friday, June 21, 2019 at 6:44:53 AM UTC+8, Ivan Godard wrote:

> >
> > Not dumb. We need RISCV scalar interoperability otherwise the entire design is completely isolated and cut off from binary interoperability with the RISCV world.
> >
> > Clearly that would be bad.
>
> Well - it gives you something to design around instead of design over.

:)

> Half-measures give half the results - and usually at twice the cost.

Kinda running out of time to design, need to start acting and implementing. You know how that goes: forever stuck in design mode - no product, no progress.

> FWIW, there are existing VLIW backends for LLVM.

Oh *good*. That will make life a hell of a lot easier.

> Because LLVM gives you
> SSA to work with it's not at all hard to generate VLIW code; the
> dependencies are already done for you. Yes, it took us a long time, but
> we had to generate for temporal addressing and exposed pipeline; you'd
> have it much easier assuming you went for spatial addressing and hidden
> pipe, like sensible VLIWs.

It is fascinating how completely and utterly useless a google search is for both "VLIW hidden pipe" and spatial addressing.

I think I would have got better results searching for "VLIW hidden pie". Google asked if I meant "geospatial view addressing" on the other one. I said "no" and it still did not help.

Is there any known way to find *useful* information on those terms that will allow me to know what you are referring to?

L.

BGB

Jun 20, 2019, 8:29:47 PM
My vote would be to try to bolt on something akin to a "Wide-Execute Hint
Mask" instruction, which is otherwise treated like a NOP, but could use
an immediate as a bitmask to indicate which of the following
instructions may be executed in parallel.

Maybe, 'LUI X0, imm20' or similar is re-purposed as such a bit mask?...

Becomes:
WEXHM mode4, mask16
Mode 1: Next 16 ops, 0=Scalar, 1=Parallel
Mode 2: Next 8 ops, 00=Scalar, 01=Parallel, 10=True, 11=False
...

A scalar implementation could be made to merely ignore this instruction.


Likely, this could be a special internal register local to the decoder,
which shifts right as instructions are executed.

Or such...
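That decoder-local shift register could be modelled like this (a sketch under the "mode 1" interpretation above; all names are invented):

```c
/* Sketch of the suggested decoder-local hint mask: it shifts right
 * one bit per instruction executed, and bit 0 says whether the
 * current op may run in parallel with its neighbour. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t mask; unsigned remaining; } wex_state_t;

/* the WEXHM hint itself: mode 1, covering the next 16 ops */
void wexhm(wex_state_t *w, uint16_t mask16)
{
    w->mask = mask16;
    w->remaining = 16;
}

/* called once per subsequent instruction by the decoder */
bool wex_parallel(wex_state_t *w)
{
    if (w->remaining == 0)
        return false;             /* hint exhausted: back to scalar */
    bool par = (w->mask & 1) != 0;
    w->mask >>= 1;                /* shift right as ops retire */
    w->remaining--;
    return par;
}
```

A purely scalar core would simply never consult `wex_parallel`, matching the "treated like a NOP" property.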

Ivan Godard

Jun 20, 2019, 9:44:27 PM
Maybe I coined the term, in contrast with the Mill's "exposed pipeline" :-)

In exposed pipeline, the execution latency of individual operations is
statically known and divorced from their issue time, yet ops are fully
pipelined in the FU so each wide instruction can start a new op in each
slot every cycle. There are no "early outs", and any (statically)
unexpected latency stalls the whole issue process, not just that slot or
FU. The Mill works like this. The Philips Trimedia 5-wide VLIW was
mostly exposed.

In hidden pipeline (hereby defined) all ops of a wide instruction issue
together and retire together, regardless of their actual latency. Early
out is possible, and may accelerate the retire of the whole instruction.
FUs may be pipelined, issuing a later wide instruction before an earlier
one retires if there are no inter-instruction dependencies;
alternatively wide instructions may wait for the retire of the previous
instruction before issuing. No particular latency is expected. There may
be special provisions for operations that have a natural latency longer
than one cycle. Classic VLIW and EPIC work like this.

There are hybrids.

lkcl

Jun 21, 2019, 2:29:23 AM
On Friday, June 21, 2019 at 1:29:47 AM UTC+1, BGB wrote:
> On 6/20/2019 11:07 AM, Ivan Godard wrote:
> > Not a spanner, just a dumb question: if you are going to the trouble of
> > having a VLIW engine, why not use it for everything and just do a normal
> > VLIW?
>
> My vote would be to try to bolt on something akin a "Wide-Execute Hint
> Mask" instruction, which is otherwise treated like a NOP, but could use
> an immediate as a bitmask to indicate which of the following
> instructions may be executed in parallel.

and, given that the very prefixing itself is non-standard, it would be
ok to do so, as scalar (standard) operation would remain unchanged.

> Maybe, 'LUI X0, imm20' or similar is re-purposed as such a bit mask?...

this being one of the many available "hints" - instructions that have
no effect on execution. if it's not suitable, there are several other
potential "hint" candidates.


> Becomes:
> WEXHM mode4, mask16
> Mode 1: Next 16 ops, 0=Scalar, 1=Parallel
> Mode 2: Next 8 ops, 00=Scalar, 01=Parallel, 10=True, 11=False
> ...
>
> A scalar implementation could be made to merely ignore this instruction.
>
>
> Likely, this could be a special internal register local to the decoder,
> which shifts right as instructions are executed.

so that the engine does not have to do all of the elements, in full, in
parallel, all at once. implementations could then be compatible with
a VLIW instruction width that is (potentially much) wider than their
actual internal parallel capability.

i like it. it's a really nice idea that would simplify designs.

the only thing is, with the OoO multi-issue engine being in place already,
and having a "shadow" system that is used to (fully, at all times) preserve
instruction commit order (even on read operations), it's not needed :)

that having been said, if genuine interest is shown i'd take some time
to consider how to support simpler designs.

interestingly, the RISC-V variable-width format for up to 192 bits would
support (192-16)/16 = up to a maximum of 11 instructions (16 bits for the
header), which means that a hint that uses the "12 bit immediate" would
fit fine.

l.

lkcl

Jun 21, 2019, 2:50:08 AM
On Friday, June 21, 2019 at 2:44:27 AM UTC+1, Ivan Godard wrote:

> Maybe I coined the term, in contrast with the Mill's "exposed pipeline" :-)

:)

> In exposed pipeline, the execution latency of individual operations is
> statically known and divorced from their issue time, yet ops are fully
> pipelined in the FU so each wide instruction can start a new op in each
> slot every cycle. There are no "early outs", and any (statically)
> unexpected latency stalls the whole issue process, not just that slot or
> FU. The Mill works like this. The Philips Trimedia 5-wide VLIW was
> mostly exposed.

ok, and the TI TMS series of DSPs is like this, i believe, too. very
simple design: you have an odd-reg FPU and an even-reg FPU, with FMAC
results that appear 2 cycles after issue time.

up to 14 instructions can go into one "block", one of which can be a
Zero-overhead-loop. FFT can therefore be done in a single instruction.

quite neat, and a pig to program. best done in assembler.


> In hidden pipeline (hereby defined) all ops of a wide instruction issue
> together and retire together, regardless of their actual latency. Early
> out is possible, and may accelerate the retire of the whole instruction.
> FUs may be pipelined, issuing a later wide instruction before an earlier
> one retires if there are no inter-instruction dependencies;
> alternatively wide instructions may wait for the retire of the previous
> instruction before issuing. No particular latency is expected. There may
> be special provisions for operations that have a natural latency longer
> than one cycle. Classic VLIW and EPIC work like this.

okaay. so the end of the VLIW instruction constitutes a progress barrier,
beyond which no further execution is permitted to continue until all
ops in the current group have completed.

again (luckily), with there being an OoO mult-issue engine in place,
with "shadowing" that preserves full commit order (it has the side
effect of providing full multi-issue WaW by also covering the RaR
case), and the "transitive" trick Mitch mentioned last week covering
multi-issue, i do not perceive there to be a need for inter-VLIW-op
progression barriers.

that said, as with BGB's idea, simpler implementations may be *really
interested* in such a division.

and, also, it may turn out that although we *can* begin feeding
sub-instructions from the next VLIW word (with its new Vector context)
into the instruction issue phase whilst the FUs are in the process
of munching on the previous one, there may be a design cost of doing
so that i haven't yet encountered.

it's not reeeallly a VLIW, in essence: it's just an excuse to give
standard RV scalar opcodes a "Group Context" that would otherwise
cost a lot of instruction cycles to set up:

* the VLIW-like format uses a 16-bit "packet" to define the vector length
where it could require up to a whopping NINETY SIX bits to do the same
thing (3 consecutive 32-bit CSR-setting instructions)

* setting 4 Register-vector "states" can now take only 64-bit or even
just 32-bit (if the reduced, less-capable 8-bit state-spec is used),
for the same 16-bit VLIW-like overhead

if CSRs are utilised to set the same 4 "states" it's a whopping 128
bits (and requires a LOAD operation) for the 16-bit "state" format
or 64 bits if the reduced 8-bit state-format is used.

which emphasises that the purpose here is not so much the actual parallelism
that you normally get from VLIW, it's about *massively* reducing the contextual
overhead associated with the "Vectorisation".

mitch has it easy - no element width overrides, only 32 registers. *grumble* :)

i am veery tempted to go full VLIW, to support simpler designs, though :)

l.

Ivan Godard

Jun 21, 2019, 3:08:18 AM
On 6/20/2019 11:50 PM, lkcl wrote:
> On Friday, June 21, 2019 at 2:44:27 AM UTC+1, Ivan Godard wrote:
>
>> Maybe I coined the term, in contrast with the Mill's "exposed pipeline" :-)
>
> :)
>
>> In exposed pipeline, the execution latency of individual operations is
>> statically known and divorced from their issue time, yet ops are fully
>> pipelined in the FU so each wide instruction can start a new op in each
>> slot every cycle. There are no "early outs", and any (statically)
>> unexpected latency stalls the whole issue process, not just that slot or
>> FU. The Mill works like this. The Philips Trimedia 5-wide VLIW was
>> mostly exposed.
>
> ok, and the TI CMS Series of DSPs is like this, i believe, too. very
> simple design, you have an odd-reg FPU and an even-reg FPU, with FMACs
> that get created 2 cycles after issue time.
>
> up to 14 instructions can go into one "block", one of which can be a
> Zero-overhead-loop. FFT can therefore be done in a single instruction.
>
> quite neat, and a pig to program. best done in assembler.

Asm for spatial addressing is possible if you want to put in the work;
manual asm for temporal addressing (as in the Mill) is hopeless.

>> In hidden pipeline (hereby defined) all ops of a wide instruction issue
>> together and retire together, regardless of their actual latency. Early
>> out is possible, and may accelerate the retire of the whole instruction.
>> FUs may be pipelined, issuing a later wide instruction before an earlier
>> one retires if there are no inter-instruction dependencies;
>> alternatively wide instructions may wait for the retire of the previous
>> instruction before issuing. No particular latency is expected. There may
>> be special provisions for operations that have a natural latency longer
>> than one cycle. Classic VLIW and EPIC work like this.
>
> okaay. so the end of the VLIW instruction constitutes a progress barrier,
> beyond which no further execution is permitted to continue until all
> ops in the current group have completed.
>
> again (luckily), with there being an OoO mult-issue engine in place,
> with "shadowing" that preserves full commit order (it has the side
> effect of providing full multi-issue WaW by also covering the RaR
> case), and the "transitive" trick Mitch mentioned last week covering
> multi-issue, i do not perceive there to be a need for inter-VLIW-op
> progression barriers.

Yes - but then the "VLIW-ishness" would be really only a matter of
encoding; you might as well go down Mitch's path and use a linear
address and blocking dispatch because the static grouping of
dependencies provided by VLIW isn't buying you anything.

Of course, that would divert you from the pure RISC-V that you are so
desperately clinging to :-)

> that said, as with BGB's idea, simpler implementations may be *really
> interested* in such a division.
>
> and, also, it may turn out that although we *can* begin feeding
> sub-instructions from the next VLIW word (with its new Vector context)
> into the instruction issue phase whilst the FUs are in the process
> of munching on the previous one, there may be a design cost of doing
> so that i haven't yet encountered.
>
> it's not reeeallly a VLIW, in essence: it's just an excuse to give
> standard RV scalar opcodes a "Group Context" that would otherwise
> cost a lot of instruction cycles to set up:
>
> * the VLIW-like format uses a 16-bit "packet" to define the vector length
> where it could require up to a whopping NINETY SIX bits to do the same
> thing (3 consecutive 32-bit CSR-setting instructions)

Only if you are heart-set on CSRs. RISC-V is not a bad conventional
scalar ISA, but you are trying to force it into what it's not.

> * setting 4 Register-vector "states" can now take only 64-bit or even
> just 32-bit (if the reduced, less-capable 8-bit state-spec is used),
> for the same 16-bit VLIW-like overhead
>
> if CSRs are utilised to set the same 4 "states" it's a whopping 128
> bits (and requires a LOAD operation) for the 16-bit "state" format
> or 64 bits if the reduced 8-bit state-format is used.
>
> which emphasises that the purpose here is not so much the actual parallelism
> that you normally get from VLIW, it's about *massively* reducing the contextual
> overhead associated with the "Vectorisation".
>
> mitch has it easy - no element width overrides, only 32 registers. *grumble* :)
>
> i am veery tempted to go full VLIW, to support simpler designs, though :)

You have imagination; I'd be very interested in what you come up with if
only you can get the RISC-V monkey off your back.

Ivan Godard

Jun 21, 2019, 3:11:25 AM
On 6/20/2019 11:29 PM, lkcl wrote:

> interestingly, the RISC-V variable-width format for up to 192 bits would
> support (192-16)/16 = up to a maximum of 11 instructions (16 bits for the
> header), which means that a hint that uses the "12 bit immediate" would
> fit fine.

Once you adopt wide instructions there's no reason why the contained ops
have to be byte sized and aligned.

lkcl

Jun 21, 2019, 4:13:25 AM
On Friday, June 21, 2019 at 8:08:18 AM UTC+1, Ivan Godard wrote:

> > again (luckily), with there being an OoO multi-issue engine in place,
> > with "shadowing" that preserves full commit order (it has the side
> > effect of providing full multi-issue WaW by also covering the RaR
> > case), and the "transitive" trick Mitch mentioned last week covering
> > multi-issue, i do not perceive there to be a need for inter-VLIW-op
> > progression barriers.
>
> Yes - but then the "VLIW-ishness" would be really only a matter of
> encoding; you might as well go down Mitch's path and use a linear
> address and blocking dispatch because the static grouping of
> dependencies provided by VLIW isn't buying you anything.
>
> Of course, that would divert you from the pure RISC-V that you are so
> desperately clinging to :-)

:) well... we might as well pick something. MIPS, RISC-V, it doesn't
honestly matter which, as long as it has an existing compiler toolchain
with which we can knock 5-10 years off the software development.

> > * the VLIW-like format uses a 16-bit "packet" to define the vector length
> > where it could require up to a whopping NINETY SIX bits to do the same
> > thing (3 consecutive 32-bit CSR-setting instructions)
>
> Only if you are heart-set on CSRs. RISC-V is not a bad conventional
> scalar ISA, but you are trying to force it into what it's not.

found that out by trying to use CSRs *sigh*. the one other "rule" that's
been set is: no new instructions.

the reason is: the moment any new instruction (even one) is added, that
requires compiler toolchain development. even just binutils would need a
special fork that added new assembler.

by saying "nope" to that, and sticking to "external hardware for-loops
are permitted but new opcodes are not", the toolchain modifications are
greatly reduced, and isolated to one specific area.

the hardware-for-loop _used_ to be just on the VLen: now there's a second
one on the VLIW word-length.

> > i am veery tempted to go full VLIW, to support simpler designs, though :)
>
> You have imagination; I'd be very interested in what you come up with if
> only you can get the RISC-V monkey off your back.

in 1990, when i was studying at imperial, i came up with an extreme escape-sequence ISA, after hearing that ARM had something that extended smaller
ops to larger ones. also, the Transputer only had like 3 registers, and
Sparc had "register windows". and, i'd heard of something called "tagged"
registers.

so i carried that through and thought, "hmmm, what if the escape-sequence
and the basic unit of the ISA were only 2 bits: what could you do in 8 bits
that could be extended, sort-of like pushing and popping "extender"
context on and off of a stack?"

you'd start off with only accessing 4 registers, however if you wanted to
gain access to registers numbered 8-11 you'd "push" an "extender" opcode
that said "now instead of accessing registers 0b00NN, you're going to
access registers 0b01NN until further notice". another "extender" would
push *another* two bits on the end, for accessing up to 64 registers.

similarly, actual instructions themselves would be escape-sequenced, again
using only 2 bits at a time.

with no escape-sequence, the operator would be ADD, SUB, MUL, DIV.
with one of the escape-sequences, the same opcode would be interpreted
as boolean operations AND OR XOR NOT.

the element width (8/16/32/64) would be a "tag" on the register, once again,
more context-information, again, only 2 bits.

automatic (polymorphic) type conversion (and width conversion) takes care
of not needing to jump in and out of contexts too much, although now having
attempted to implement that, it may be simpler to do explicitly (polymorphic
auto-type-conversion between FP and INT needs micro-code, really, where this
hypothetical instruction set *is* pretty much micro-code).

it all ended up that, basically, instructions are, stunningly, limited to
8 bits in "official" size, with the 8-bit opcode subdivided into 4 2-bit
groups, each with a specific purpose; you get the general idea.
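the register-extender part of that thought-experiment can be sketched in a few lines of Python. all the names here are invented for illustration; this is a guess at the mechanism as described above, not the actual 1990 design:

```python
class RegContext:
    """Tracks the 2-bit register prefix pushed by "extender" opcodes."""
    def __init__(self):
        self.prefix = 0            # accumulated high bits

    def push_extender(self, two_bits):
        # "instead of 0b00NN you now access 0bXXNN until further notice"
        self.prefix = (self.prefix << 2) | two_bits

    def pop(self):
        # "cancel" restores the previous, narrower register window
        self.prefix >>= 2

    def resolve(self, nn):
        # nn is the 2-bit register field inside the instruction itself
        return (self.prefix << 2) | nn

ctx = RegContext()
assert ctx.resolve(0b11) == 3        # only r0..r3 reachable at first
ctx.push_extender(0b01)
assert ctx.resolve(0b10) == 6        # registers 4..7 now in reach
ctx.push_extender(0b10)
assert ctx.resolve(0b00) == 24       # a second push reaches toward 64 regs
```

the point being that the instruction itself never grows: all the widening lives in decoder state.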

i think at the time i was kinda looking to it being a VLIW style engine,
with context being in "stripes" down each lane, although now having seen
how VLIW isn't so popular i'm reluctant to consider it.

the fact that everything was 2 bits only meant that the decode phase was
ridiculously simple, and i'd expect pipelining and thus performance to
absolutely scream, with very little in the way of gate latency and not a
lot of logic, either.

what i didn't do - haven't done - is think through how context-switching
would work. potentially, it's a *lot* of state information.

however, funnily enough, this VLIW-like "context" concept would actually
be a perfect vehicle to provide that context, without needing huge state
to be stored on the stack.

that's one nice thing - an accidental side-effect - of using the VLIW-like
format for Vectorisation Context: a context-switch doesn't need to save
*any* of the register / predication context on the stack... because all
of that is restored/accessed via the *instruction*.

and, with the PC pointing directly *to* that instruction (the sub-VLIW-PC
being incremented temporarily whilst the "main" PC remains pointing at
the beginning of the VLIW instruction), the only additional context that
needs be saved/restored is: that sub-VLIW-PC.

that saves a not insignificant amount of stuff that needs to be pushed/popped
on the stack.
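that claim can be sketched directly (hypothetical names throughout; the assumption is that the group header remains readable at the main PC, so the vector/predicate context is re-derivable rather than saved):

```python
def save_context(core):
    # only the PC pair goes on the stack; the vector context does not
    return (core["pc"], core["sub_pc"])

def restore_context(core, saved, fetch_group):
    core["pc"], core["sub_pc"] = saved
    # re-read the group header at the main PC: the vectorisation /
    # predication context comes back "for free" from the instruction
    core["vctx"] = fetch_group(core["pc"])["context"]

mem = {0x100: {"context": "VL=4, pred=x9"}}
core = {"pc": 0x100, "sub_pc": 2, "vctx": "VL=4, pred=x9"}
saved = save_context(core)
core.update(pc=0, sub_pc=0, vctx=None)     # run something else entirely
restore_context(core, saved, lambda pc: mem[pc])
assert (core["pc"], core["sub_pc"], core["vctx"]) == (0x100, 2, "VL=4, pred=x9")
```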

l.

lkcl

Jun 21, 2019, 4:16:06 AM
true... i am trying to save on development time. hypothetically though,
in a future version, a suitable hardware-compression algorithm could even
be deployed!

that would be pretty cool, and, more than that, because the execution has
its own sub-VLIW-PC, the "normal" problems associated with dropping back
into the middle of a group of instructions, where you have no idea where
the beginning of the group is, all go away.

cooool.

l.

BGB

Jun 21, 2019, 5:45:06 AM
On 6/21/2019 1:29 AM, lkcl wrote:
> On Friday, June 21, 2019 at 1:29:47 AM UTC+1, BGB wrote:
>> On 6/20/2019 11:07 AM, Ivan Godard wrote:
>>> Not a spanner, just a dumb question: if you are going to the trouble of
>>> having a VLIW engine, why not use it for everything and just do a normal
>>> VLIW?
>>
>> My vote would be to try to bolt on something akin a "Wide-Execute Hint
>> Mask" instruction, which is otherwise treated like a NOP, but could use
>> an immediate as a bitmask to indicate which of the following
>> instructions may be executed in parallel.
>
> and, given that the very prefixing itself is non-standard, it would be
> ok to do so, as scalar (standard) operation would remain unchanged.
>

Basically the idea.

A very simple scalar core can ignore the hint, and run the code as
scalar code.

An intermediate core can interpret it in a VLIW-like way.
A full superscalar or OoO core could ignore the hint and use dedicated
hardware.


>> Maybe, 'LUI X0, imm20' or similar is re-purposed as such a bit mask?...
>
> this being one of the many available "hints" - instructions that have
> no effect on execution. if it's not suitable, there are several other
> potential "hint" candidates.
>

I have done something vaguely similar in my own ISA design, just adapted
to something which could (potentially) fit in with the existing RISC-V
instruction encoding.


In my case, it was done via dedicated encoding ranges, and the
restriction that wide-execute code only use 32-bit instructions (to make
it simpler/cheaper).

Prefixes could potentially add a little overhead, but it seems like a
reasonable tradeoff (in the name of binary compatibility with the
existing ISA).


>
>> Becomes:
>> WEXHM mode4, mask16
>> Mode 1: Next 16 ops, 0=Scalar, 1=Parallel
>> Mode 2: Next 8 ops, 00=Scalar, 01=Parallel, 10=True, 11=False
>> ...
>>
>> A scalar implementation could be made to merely ignore this instruction.
>>
>>
>> Likely, this could be a special internal register local to the decoder,
>> which shifts right as instructions are executed.
>
> so that the engine does not have to do all of the elements, in full, in
> parallel, all at once. implementations could then be compatible with
> a VLIW instruction width that is (potentially much) wider than their
> actual internal parallel capability.
>
> i like it. it's a really nice idea that would simplify designs.
>

Yeah.

An important detail is whether the code "fits" with what the core can do
in wide-execute mode. I had some other instructions which signaled a
"profile" which said which features the code expects to use, and allows
the wide-execute mechanism to temporarily disable itself (falling back
to scalar operation) if there is a problem with this.

A goal was to minimize the "cleverness" needed by the core (the C
compiler and/or ASM programmer would be responsible for all this).


> the only thing is, with the OoO multi-issue engine being in place already,
> and having a "shadow" system that is used to (fully, at all times) preserve
> instruction commit order (even on read operations), it's not needed :)
>
> that having been said, if genuine interest is shown i'd take some time
> to consider how to support simpler designs.
>
> interestingly, the RISC-V variable-width format for up to 192 bits would
> support (192-16)/16 = up to a maximum of 11 instructions (16 bits for the
> header), which means that a hint that uses the "12 bit immediate" would
> fit fine.
>

OK.

lkcl

Jun 21, 2019, 7:53:19 AM
On Friday, June 21, 2019 at 5:45:06 PM UTC+8, BGB wrote:

> > and, given that the very prefixing itself is non-standard, it would be
> > ok to do so, as scalar (standard) operation would remain unchanged.
> >
>
> Basically the idea.
>
> A very simple scalar core can ignore the hint, and run the code as
> scalar code.
>
> An intermediate core can interpret it in a VLIW-like way.
> A full superscalar or OoO core could ignore the hint and use dedicated
> hardware.

This was one of my concerns: that the hint would become a mandatory part of the spec, i.e. that without the hint, an OoO engine with its own full detection system would work out an execution order that was right... but did *NOT* match what was declared in the hint.

So, as long as the hint is just a hint to the architecture, saying which operations are independent and can be parallelised, all is well.

That being the case, the hint is perfectly reasonable to include.

I think we might almost be back to that architecture that tagged registers to say that "this reg is to be a src in a following instruction, no need to write it to the regfile". At least, that's what the hint idea is reminding me of.


>
> Prefixes could potentially add a little overhead, but it seems like a
> reasonable tradeoff (in the name of binary compatibility with the
> existing ISA).

yehyeh.

> > so that the engine does not have to do all of the elements, in full, in
> > parallel, all at once. implementations could then be compatible with
> > a VLIW instruction width that is (potentially much) wider than their
> > actual internal parallel capability.
> >
> > i like it. it's a really nice idea that would simplify designs.
> >
>
> Yeah.
>
> An important detail is whether the code "fits" with what the core can do
> in wide-execute mode. I had some other instructions which signaled a
> "profile" which said which features the code expects to use, and allows
> the wide-execute mechanism to temporarily disable itself (falling back
> to scalar operation) if there is a problem with this.

The nice thing about the embedding of opcodes within the longer variable-length opcode is that a 2-step instruction issue phase can be envisioned which, if the non-long format is detected, is just a queue copy operation (with no extra Vectorisation State / context added).

And when the long format *is* detected, the 1st stage starts throwing sub-ops into that same 2nd stage queue (no decoding done yet), where each sub-op is now given the Vectorisation Context from its parent VLIW Group, plus of course incrementing the SUB PC, not the main PC.

The *third* phase does the actual decode, taking into account the context info.

Between the 2nd and 1st stages, you could perhaps incorporate compressed to uncompressed instruction format conversion, so that the 3rd phase didn't have to worry about varying opcode lengths.

RVC has the nice feature that all the RVC space maps to an equivalent 32 bit opcode.
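that staged-issue idea can be sketched roughly as follows (all structures and mnemonics here are invented for illustration, standing in for real encodings):

```python
def stage1_split(group):
    # stage 1: split the fetched VLIW-like group into sub-ops, each
    # carrying the group's Vectorisation Context and its own sub-PC
    for sub_pc, raw in enumerate(group["subops"]):
        yield {"raw": raw, "ctx": group["context"], "sub_pc": sub_pc}

def stage2_expand(op, rvc_map):
    # stage 2: expand compressed encodings to their 32-bit equivalents,
    # relying on the RVC property that every RVC op has a 32-bit twin,
    # so stage 3 (decode) never sees varying opcode lengths
    op["raw"] = rvc_map.get(op["raw"], op["raw"])
    return op

# mnemonic strings stand in for actual bit-level encodings
rvc_map = {"c.li a0,5": "addi a0,x0,5"}
group = {"context": "VL=4", "subops": ["c.li a0,5", "add a0,a0,a1"]}
ops = [stage2_expand(o, rvc_map) for o in stage1_split(group)]
assert [o["raw"] for o in ops] == ["addi a0,x0,5", "add a0,a0,a1"]
assert [(o["ctx"], o["sub_pc"]) for o in ops] == [("VL=4", 0), ("VL=4", 1)]
```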

> A goal was to minimize the "cleverness" needed by the core (the C
> compiler and/or ASM programmer would be responsible for all this).

Am not a fan of compiler cleverness :) CEDAR Audio had to code in assembler (1993) because the TI DSP compiler just could not cope. With only 50 MFLOPs to play with (12.5 MHz, 2x pipeline, 2x FPUs, one for odd regs, one for even, FMAC) the budget was only 1000 cycles per audio sample, and compiler inefficiency just could not be tolerated.

Have been wary of VLIW compiler "cleverness" ever since.

L.

lkcl

Jun 21, 2019, 3:36:06 PM
On Friday, June 21, 2019 at 4:13:25 PM UTC+8, lkcl wrote:

> similarly, actual instructions themselves would be escape-sequenced, again
> using only 2 bits at a time.
>
> with no escape-sequence, the operator would be ADD, SUB, MUL, DIV.
> with one of the escape-sequences, the same opcode would be interpreted
> as boolean operations AND OR XOR NOT.

Comparison can be LE GT EQ NE and if you swap the operands it gives GE LT EQ NE so no need for 6 compare ops, only 4 again.

Load immediate: the size of the immediate is implicit from the current tag on the register, so no need to specify the imm size in the instruction; this again saves hugely on bits in the instruction format.

etc etc it was stunningly compact, amazing just how far escape sequencing can really be taken.

Also, fascinatingly, scaling from 8 bit to 16 bit to 32 to 64 bit microprocessor level, well, err it is all based around 8 bit in the first place, so scaling is like... what's the nonproblem again?

The scary bit (the time sink) is the compiler, and going from there to bootstrap up an OS. Have seen how long it took OR1K to get up and running, and how much effort Karsten Merkler and others for RISCV, and Wookie for ARM64, put into cross compiles and bootstrap ports of debian.

By comparison, the actual hardware design is easy!

L.

MitchAlsup

Jun 21, 2019, 6:00:41 PM
On Friday, June 21, 2019 at 2:36:06 PM UTC-5, lkcl wrote:
> On Friday, June 21, 2019 at 4:13:25 PM UTC+8, lkcl wrote:
>
> > similarly, actual instructions themselves would be escape-sequenced, again
> > using only 2 bits at a time.
> >
> > with no escape-sequence, the operator would be ADD, SUB, MUL, DIV.
> > with one of the escape-sequences, the same opcode would be interpreted
> > as boolean operations AND OR XOR NOT.
>
> Comparison can be LE GT EQ NE and if you swap the operands it gives GE LT EQ NE so no need for 6 compare ops, only 4 again.

You have to be very careful so that FP compares put the NaNs in the proper
then/else clause.

if( x > y )
{
// no NaNs here
}
else
{
// all NaNs here
}

compared to::

if( x <= y )
{
// no NaNs here
}
else
{
// all NaNs here
}

So, because NaNs are not comparable, FP compares are not invertible.
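this is easy to check numerically; Python's float compares follow the IEEE-754 rule that any comparison involving a NaN is false:

```python
import math

x, y = math.nan, 1.0
# with a NaN operand BOTH compares report False, so the NaN lands in
# the else clause of BOTH forms: the two forms are therefore NOT
# inversions of each other
assert (x > y) == False
assert (x <= y) == False
# inverting the predicate instead of swapping it gives a different
# partition: "not (x > y)" is True for a NaN, while "x <= y" is False
assert (not (x > y)) != (x <= y)
```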
>
> Load immediate, the size of the immediate is implicit from the current tag on the register so no need to specify the imm size in the instruction, this again saves hugely on bits in the instruction format
>
> etc etc it was stunningly compact, amazing just how far escape sequencing can really be taken.
>
> Also, fascinatingly, scaling from 8 bit to 16 bit to 32 to 64 bit microprocessor level, well, err it is all based around 8 bit in the first place, so scaling is like... what's the nonproblem again?

If one needs to append 20 bits to a given instruction, do you append 16
and punt 4, or do you add 32 and waste 12?

lkcl

Jun 21, 2019, 6:41:51 PM
On Saturday, June 22, 2019 at 6:00:41 AM UTC+8, MitchAlsup wrote:
> On Friday, June 21, 2019 at 2:36:06 PM UTC-5, lkcl wrote:
> > On Friday, June 21, 2019 at 4:13:25 PM UTC+8, lkcl wrote:
> >
> > > similarly, actual instructions themselves would be escape-sequenced, again
> > > using only 2 bits at a time.
> > >
> > > with no escape-sequence, the operator would be ADD, SUB, MUL, DIV.
> > > with one of the escape-sequences, the same opcode would be interpreted
> > > as boolean operations AND OR XOR NOT.
> >
> > Comparison can be LE GT EQ NE and if you swap the operands it gives GE LT EQ NE so no need for 6 compare ops, only 4 again.
>
> You have to be very careful so that FP compares put the NaNs in the proper
> then/else clause.
>
> if( x > y )
> {
> // no NaNs here
> }
> else
> {
> // all NaNs here
> }
>
> compared to::
>
> if( x <= y )
> {
> // no NaNs here
> }
> else
> {
> // all NaNs here
> }
>
> So, because NaNs are not comparable, FP compares are not invertible.

There are some other quirks like that, related to FP. In 1990 I didn't think through FP at all, I knew very little about it.

> >
> >
> > Also, fascinatingly, scaling from 8 bit to 16 bit to 32 to 64 bit microprocessor level, well, err it is all based around 8 bit in the first place, so scaling is like... what's the nonproblem again?
>
> If one needs to append 20 bits to a given instruction, do you append 16
> and punt 4 or do you add 32 and waste 12.

I was 20, it was a thought experiment, I hadn't got that far!

Hypothetically though, as long as there is room for an escape-extended immediate, it *may* be possible to set how many bits of immediate are to be subsequently read.

I would need to redo the full suite of opcodes to see if there was room and if such an immediate opcode could be added to one of the "pages".

Now in 2019 I would be inclined to do some statistical analysis of most commonly used ops and prioritise those in the quickest (shortest to reach with the least number of 8 bit escape sequences).

However, unlike traditional CISC escape sequencing (a la x86), where the decode starts all over again with each new opcode, this hypothetical arch *leaves* the instruction decoder in the *last used escape mode*, just like a stack.

So the traditional statistical analysis techniques just wouldn't work, because if you issue the escape sequences to get to FP ops, the *next* ops are *also going to be FP* [unless you issue the "cancel" (pop) context opcode.]

This is done because if you need one FP op, chances are higher that the 2nd and 3rd ops are going to be FP as well.

One of the reasons for having a VLIW-like engine here (the normal VLIW, not the VLIW of this post) was so that 2 or 3 of the 8-bit "lanes" could be left in FP Op mode, one in INT arith, one in INT bool, one in "control" mode; thus it would save instructions by not having to jump too frequently between contexts.
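the sticky-escape behaviour is the key difference from x86-style prefixes, and a toy decoder makes it concrete (pages and mnemonics invented for illustration, following the ADD/SUB/MUL/DIV vs AND/OR/XOR/NOT example upthread):

```python
# two opcode "pages" selected by the current escape context; unlike an
# x86-style prefix, the page is sticky until explicitly popped
PAGES = {0: ["ADD", "SUB", "MUL", "DIV"],
         1: ["AND", "OR", "XOR", "NOT"]}
stack = [0]                       # context stack, arithmetic page on top

def execute(op2):
    # interpret a 2-bit opcode in whatever page is currently active
    return PAGES[stack[-1]][op2]

assert execute(0b01) == "SUB"
stack.append(1)                   # escape to the boolean page
assert execute(0b01) == "OR"      # same 2 bits, new meaning...
assert execute(0b10) == "XOR"     # ...and the decoder STAYS there
stack.pop()                       # the "cancel" (pop) context opcode
assert execute(0b01) == "SUB"
```

which is also why per-instruction opcode frequency counts are the wrong statistic here: the cost of an op depends on the page the decoder was already left in.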

Oh, I remember! I'd wanted to do the PC increment and branches etc as "just another lane". So one lane would be set up to always add 1 to the PC in most cases, branch would set PC from an immed instead of add 1, and so on. Don't know how long the legs are on that idea though.

L.

BGB

Jun 21, 2019, 9:15:28 PM
On 6/21/2019 6:53 AM, lkcl wrote:
> On Friday, June 21, 2019 at 5:45:06 PM UTC+8, BGB wrote:
>
>>> and, given that the very prefixing itself is non-standard, it would be
>>> ok to do so, as scalar (standard) operation would remain unchanged.
>>>
>>
>> Basically the idea.
>>
>> A very simple scalar core can ignore the hint, and run the code as
>> scalar code.
>>
>> An intermediate core can interpret it in a VLIW-like way.
>> A full superscalar or OoO core could ignore the hint and use dedicated
>> hardware.
>
> This was one of my concerns, that the hint would become a mandatory part of the spec, ie that without the hint, an OoO engine with its own full detection system would work out an execution order that was right... but did *NOT* match with what was declared in the hint.
>
> So, as long as the hint is just a hint to the architecture, saying which operations are independent and can be parallelised, all is well.
>

The main options are:
Things happen in scalar order;
Some ops can be optionally be parallelized.

If scalar order remains canonical, then little is different from before.
If an OoO engine reorders stuff, but the visible side effects are still
the same as they are with the canonical (scalar) ordering, all is well.


Side note: This is why I did not add wide-execute predicated operations
to my ISA; there wasn't a good way to do this which would not adversely
affect a scalar implementation. Though, with my ISA design (and a
simplistic core), this does mean predicated sections would fall back to
scalar execution. This was seen as a reasonable trade-off though.


> That being the case, the hint is perfectly reasonable to include.
>
> I think we might almost be back to that architecture that tagged registers to say that "this reg is to be a src in a following instruction, no need to write it to the regfile". At least, that's what the hint idea is reminding me of
>

The function and intent is a little different though.


In my case, it was because, generally:
How far to advance PC is determined in the IF stage;
Dependencies between ops can't be determined until D2 or E1;
...

IOW:
IF D1 D2 E1 E2 WB
Where:
IF: Fetch, also advances PC
D1: Decode (Primary)
D2: Register Fetch
E1: Execute 1 / AGEN / ALU
E2: Execute 2 / Memory Access / ALU (for 2-stage ops)
WB: Write Back


If it is possible to determine early (IF stage) whether or not the ops will
execute in parallel, then PC can be advanced as such, and the output
from multiple decoders can be used. Otherwise the output from the other
decoders is ignored (treated as if they decoded NOPs).

With superscalar, it is necessary to use a more complicated
organization (no longer a simple scalar pipeline, e.g. needing a more
obvious split between the decode and execute stages).


If we know in advance, the pipeline can look just like a normal scalar
pipeline, just with a few extra parts glued on.

Main costs are mostly the alternate execute unit, additional register
ports, and the added pipeline machinery.

To reduce cost, as can be noted:
It can be limited to 32-bit ops or similar (2x 32b decoders);
Second lane might only be able to do basic ALU ops or similar;
Ops like memory store might take both lanes (if doing 4R/2W);
...
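the hint-mask gating described upthread (mode 1: bit set means "may run in parallel") can be sketched as a grouping function; this is an assumed reading of the WEXHM proposal, not a spec:

```python
def issue_groups(ops, mask):
    # bit k of mask says op k may issue in parallel with its
    # predecessor; a scalar core simply ignores the mask and runs
    # every op in its own group, one at a time
    groups = []
    for k, op in enumerate(ops):
        if groups and (mask >> k) & 1:
            groups[-1].append(op)   # joins the previous issue group
        else:
            groups.append([op])     # starts a new (scalar) group
    return groups

ops = ["add", "mul", "ld", "st"]
# mask of 0 degrades gracefully to pure scalar order
assert issue_groups(ops, 0b0000) == [["add"], ["mul"], ["ld"], ["st"]]
# bit 1 set: mul may pair with add; ld and st stay scalar
assert issue_groups(ops, 0b0010) == [["add", "mul"], ["ld"], ["st"]]
```

note the scalar order stays canonical either way, which is the property that keeps the hint a hint.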


>
>>
>> Prefixes could potentially add a little overhead, but it seems like a
>> reasonable tradeoff (in the name of binary compatibility with the
>> existing ISA).
>
> yehyeh.
>
>>> so that the engine does not have to do all of the elements, in full, in
>>> parallel, all at once. implementations could then be compatible with
>>> a VLIW instruction width that is (potentially much) wider than their
>>> actual internal parallel capability.
>>>
>>> i like it. it's a really nice idea that would simplify designs.
>>>
>>
>> Yeah.
>>
>> An important detail is whether the code "fits" with what the core can do
>> in wide-execute mode. I had some other instructions which signaled a
>> "profile" which said which features the code expects to use, and allows
>> the wide-execute mechanism to temporarily disable itself (falling back
>> to scalar operation) if there is a problem with this.
>
> The nice thing about the embedding of opcodes within the longer variable length opcode, a 2 step instruction issue phase can be envisioned, which, if the non-long format is detected, that is just a queue copy operation (with no extra Vectorisation State / context added).
>
> And when the long format *is* detected, the 1st stage starts throwing sub-ops into that same 2nd stage queue (no decoding done yet), where each sub-op is now given the Vectorisation Context from its parent VLIW Group, plus of course incrementing the SUB PC, not the main PC.
>
> The *third* phase does the actual decode, taking into account the context info.
>
> Between the 2nd and 1st stages, you could perhaps incorporate compressed to uncompressed instruction format conversion, so that the 3rd phase didn't have to worry about varying opcode lengths.
>
> RVC has the nice feature that all the RVC space maps to an equivalent 32 bit opcode.
>

OK.


As noted, in my design, the idea was that things would execute in a
mostly VLIW-like way when WEX was used (and enabled):
Instruction fetch fetches 2x or 3x 32-bit instruction words;
PC is advanced by 2x or 3x 32-bit words;
Decode happens in parallel;
Execute also happens in parallel.

Unlike superscalar, no machinery is needed to detect dependencies
between instructions, nor is there a need to deviate from a simplistic
scalar-like pipeline.

The cost, as noted, is that only things valid for the core can be
accepted. If something is specified which isn't doable, there is little
option besides either raising a fault or the core behaving in an
undefined way.


I had also previously considered the possibility of gluing on a VLIW
mode (in a manner similar to how Thumb works in ARM), but ended up
leaning against this as it would carry a lot of the drawbacks of VLIW
(eg: lots of code bloat due to NOP padding) and also effectively require
compiling two separate versions of the program if one wants to also
support a scalar-only implementation.


>> A goal was to minimize the "cleverness" needed by the core (the C
>> compiler and/or ASM programmer would be responsible for all this).
>
> Am not a fan of compiler cleverness :) CEDAR Audio had to code in assembler (1993) because the TI DSP compiler just could not cope. With only 50 MFLOPs to play with (12.5 mhz, 2x pipeline, 2x FPUs, one for odd regs one for even, FMAC) the budget was only 1000 cycles per audio sample, and compiler inefficiency just could not be tolerated.
>
> Have been wary of VLIW compiler "cleverness" ever since.
>


Either the compiler does it, or the processor needs to be able to do so.

For a core where detecting inter-instruction dependencies (as-needed for
an in-order superscalar) is too expensive, it can be a viable
alternative to scalar.

The compiler still needs to order the instructions for an in-order
superscalar to do its thing. For explicit tagging, it does this, but
also explicitly flags which instructions match the criteria to be able
to execute in parallel.



Otherwise, kinda burnt out right now, and my day job has now gone over a
6x10 schedule (6-day work weeks with 10 hour days, mandatory overtime;
from the 5x10 and 5x12 weeks it was previously). I haven't been getting
much sleep recently (been going weeks on ~5 hours/night; with a little
more on weekends, but now I will only have 1 day of weekend...).

I like my free time, and time not spent loading/unloading/running a
waterjet (and cutting sometimes several hundred parts in a day, nearly
all of which need to be deburred by hand).

Or times which aren't running the machine, are spent throwing ~8 or so
55lb bags of garnet sand one after another into a hopper, or having to
fix the high-pressure water line if/when it next decides to explode, or
having to unclog the nozzle whenever it collides with material which is
being floaty (IOW: using a paperclip stuck in a power-drill to try to
drill out the garnet sand so I can get back to cutting parts, need to
move quickly when trying to keep up with a queue with stacks of
work-orders, ...), basically because thin sheet metal doesn't like to
lay perfectly flat on ragged metal slats.

( Granted, do get paid a bit more for dealing with this, but grr... )

Kinda sucks right now...

MitchAlsup

Jun 21, 2019, 10:12:18 PM
For simple versions (i.e. 1-wide and small superscalar) of My 66000 ISA
I have come around to preferring::

IF/PA/DE/EX/CA/AL/wa/WB --integer and memory
IF/PA/DE/E1/E2/E3/E4/WB --FP and long ops

This aligns the 4-cycle FMAC unit with the 1-cycle integer and 3-cycle
load units, and does not write to RF or Caches prior to WB; in effect,
making Cache coherent with the pipeline (along with eliminating
Spectré-like attacks).

IF: Instruction Fetch at least Quad-word access
PA: instruction parse--determine instruction boundaries
DE: decode instruction / register fetch
EX: execute
CA: cache
AL: Align
wa: wait
WB: Write back

I lobbed integer multiply and divide over in the FMAC unit as it was already
doing FMUL and FDIV.

You can argue that CA/AL takes too much time, but this seems to be the way of
faster processors anyway. This gives the Cache SRAMs 1 entire cycle to be
accessed, and allows plenty of time for align and wire delay on the way back.

The wa (wait) cycle is present only to align FMAC with the other pipeline.
A DP FMAC at 4-cycles remains doable as long as wire delay does not eat up
the wire routes in the multiplier tree.
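the alignment property can be written out trivially; the stage names are taken directly from the listing above:

```python
# the two pipes as stage lists: both deliver WB in the same cycle from
# issue, so integer/memory and FP results never race at the register
# file (the "wa" slot exists purely to pad the short pipe)
INT_MEM = ["IF", "PA", "DE", "EX", "CA", "AL", "wa", "WB"]
FP_LONG = ["IF", "PA", "DE", "E1", "E2", "E3", "E4", "WB"]
assert len(INT_MEM) == len(FP_LONG) == 8
assert INT_MEM.index("WB") == FP_LONG.index("WB") == 7
```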

>
> If it is possible to determine early (IF stage) whether or not the ops will
> execute in parallel, then PC can be advanced as such, and the output
> from multiple decoders can be used. Otherwise the output from the other
> decoders is ignored (treated as if they decoded NOPs).

This is one reason why I fetch 4 instructions at a time, so that every 32-bit
item in the instruction buffer (320-gates/entry) can expend 30-gates/entry
to predecode various things concerning that slot in the IB. Instruction
parse takes these predecoded items and performs find-next-one instruction
parsing to the extent necessary. {Remember the ISA has 1-to-5 word instructions
to provide any and all needed constants to the instruction stream.}
>
> With superscalar, it is necessary either to use a more complicated
> organization (no longer a simple scalar pipeline, eg, needs a more
> obvious split between the decode and execute stages).
>
>
> If we know in advance, the pipeline can look just like a normal scalar
> pipeline, just with a few extra parts glued on.

Yes.
>
> Main costs are mostly the alternate execute unit, additional register
> ports, and the added pipeline machinery.

One should note:
Branch unit probably does not need a RF Write port 97% of the time.
Store does not ever need an RF write port--and here is where I generally
steal a RF Read port when an instruction either consumes a constant or
is a 1-operand instruction (branch for example).

So one should be able to get to ~1.3 IPC with a 3R1W RF; even before
considering forwarding of results back as operands.
If you do this::
Be aware that you will want to implement the following tokens::

This result forwards and writes
This result forwards but write was elided [phantom]
This operand comes from slot[k]

and depending on how aggressive branch prediction is going to be:

This operand comes from slot[k] when taken or slot[j] otherwise.
>
> The compiler still needs to order the instructions for an in-order
> superscalar to do its thing. For explicit tagging, it does this, but
> also explicitly flags which instructions match the criteria to be able
> to execute in parallel.
>
>
>
> Otherwise, kinda burnt out right now, and my day job has now gone over a
> 6x10 schedule (6-day work weeks with 10 hour days, mandatory overtime;
> from the 5x10 and 5x12 weeks it was previously). I haven't been getting
> much sleep recently (been going weeks on ~5 hours/night; with a little
> more on weekends, but now I will only have 1 day of weekend...).

You know there is a reason "they" invented alcohol.....

Terje Mathisen (Jun 22, 2019, 4:08:59 AM):

MitchAlsup wrote:
> On Friday, June 21, 2019 at 2:36:06 PM UTC-5, lkcl wrote:
>> On Friday, June 21, 2019 at 4:13:25 PM UTC+8, lkcl wrote:
>>
>>> similarly, actual instructions themselves would be escape-sequenced, again
>>> using only 2 bits at a time.
>>>
>>> with no escape-sequence, the operator would be ADD, SUB, MUL, DIV.
>>> with one of the escape-sequences, the same opcode would be interpreted
>>> as boolean operations AND OR XOR NOT.
>>
>> Comparison can be LE GT EQ NE and if you swap the operands it gives GE LT EQ NE so no need for 6 compare ops, only 4 again.

Not really, as Mitch notes:
>
> You have to be very careful so that FP compares put the NaNs in the proper
> then/else clause.
>
> if( x > y )
> {
> // no NaNs here
> }
> else
> {
> // all NaNs here
> }
>
> compared to::
>
> if( x <= y )
> {
> // no NaNs here
> }
> else
> {
> // all NaNs here
> }
>
> So, because NaNs are not comparable, FP compares are not invertible.
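The quoted point is easy to demonstrate in a few lines of C (a minimal illustration, not anyone's actual hardware):

```c
#include <math.h>
#include <stdbool.h>

/* Ordered comparisons, as a compiler would emit them for the
   two if-statements above. */
bool fp_gt(double x, double y) { return x > y;  }  /* then-clause excludes NaNs */
bool fp_le(double x, double y) { return x <= y; }  /* then-clause ALSO excludes NaNs */
```

For `x = NAN`, both return false, so `!fp_gt(x, y)` is true while `fp_le(x, y)` is false: inverting the branch moves the NaNs from one clause to the other, which is why the two compares are not interchangeable.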

I think that having a pair classifier is a very good option:

I.e. for each of the inputs you return bits indicating
NaN/Inf/Normal/Denormal/Zero, as well as bits for the compares: Less
than/Equal/Greather than/Unordered.

If you don't need a special Denormal marker as a sw helper, then you get
just 4 classes for each input plus 4 compare results, so this would be
just 6 bits in total.

Add in a third input and it would still be just 8 bits, or 10 bits if
you include a full 3-way ordering, i.e. it all fits very easily in a
register.
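A software model of such a pair classifier might look like this (the field layout and enum values are my assumptions, not Terje's):

```c
#include <math.h>
#include <stdint.h>

/* 2-bit class per input (Denormal folded into Normal, as suggested)
   plus a 2-bit compare result: 6 bits in total. */
enum fclass { C_ZERO = 0, C_NORMAL = 1, C_INF = 2, C_NAN = 3 };
enum fcmp   { CMP_LT = 0, CMP_EQ = 1, CMP_GT = 2, CMP_UNORD = 3 };

static unsigned classify1(double v) {
    if (isnan(v)) return C_NAN;
    if (isinf(v)) return C_INF;
    if (v == 0.0) return C_ZERO;
    return C_NORMAL;
}

/* Returns class(x):class(y):compare, packed into the low 6 bits. */
unsigned pair_classify(double x, double y) {
    unsigned cmp = (isnan(x) || isnan(y)) ? CMP_UNORD
                 : (x <  y) ? CMP_LT
                 : (x == y) ? CMP_EQ
                 :            CMP_GT;
    return (classify1(x) << 4) | (classify1(y) << 2) | cmp;
}
```

Note that the NaN case lands in the explicit `CMP_UNORD` code, so software gets the unordered result directly instead of having to reason about inverted branches.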

I believe you do something like this for your architecture Mitch, at
least for single inputs?

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen (Jun 22, 2019, 4:16:02 AM):

lkcl wrote:
> I was 20, it was a thought experiment, I hadn't got that far!

:-)
>
> Hypothetically though as long as there is room for an escape extended
> immediate it *may* be possible to set how bits of immediate are to be
> subsequently read.
>
> I would need to redo the full suite of opcodes to see if there was
> room and if such an immediate opcode could be added to one of the
> "pages".
>
> Now in 2019 I would be inclined to do some statistical analysis of
> most commonly used ops and prioritise those in the quickest (shortest
> to reach with the least number of 8 bit escape sequences).
>
> However given that unlike traditional CISC escape sequencing (a la
> x86) where the decode starts all over again with each new opcode,
> this hypothetical arch *leaves* the instruction decoder in the *last
> used escape mode* just like in a stack.
>
> So the traditional statistical analysis techniques just wouldn't
> work, because if you issue the escape sequences to get to FP ops, the
> *next* ops are *also going to be FP* [unless you issue the "cancel"
> (pop) context opcode.]

This is exactly the same logic you need to write a near-optimal
compressor for LZ4 data, i.e. you always have multiple future options
for how long the current literal stream should be, followed by a
back-referenced copy previous output stream.
>
> This done because if you need one FP op, chances are higher that the
> 2nd and 3rd ops are going to be FP as well.

Yes! This is teletype/Baudot 5-bit Shift-In/Shift-Out all over again. :-)

I really like how this could make it possible to write a single function
that starts with a FP shift, and then branch to it just past this point
so that it can be reused for Int data. :-)

Shades of "Story of Mel"...
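The "sticky" escape-mode decode under discussion can be sketched as a small state machine (names and stack depth are illustrative only):

```c
enum page { PG_INT, PG_FP, PG_BOOL };

/* Unlike an x86 prefix, which modifies exactly one instruction,
   an escape here switches the decoder's page and it STAYS there
   until the "cancel" (pop) opcode restores the previous page. */
typedef struct {
    enum page cur;
    enum page stack[8];
    int sp;
} decoder;

void push_page(decoder *d, enum page p) {
    d->stack[d->sp++] = d->cur;  /* remember the enclosing mode */
    d->cur = p;                  /* all following opcodes decode as p */
}

void pop_page(decoder *d) {      /* the "cancel" (pop) context opcode */
    d->cur = d->stack[--d->sp];
}
```

This is exactly why one FP escape tends to pay for itself: the second and third FP ops that usually follow cost nothing extra.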

lkcl (Jun 22, 2019, 8:30:49 AM):

On Saturday, June 22, 2019 at 9:15:28 AM UTC+8, BGB wrote:

>
> Side note: This is why I did not add wide-execute predicated operations
> to my ISA; there wasn't a good way to do this which would not adversely
> effect a scalar implementation. Though, with my ISA design (and a
> simplistic core), this does mean predicated sections would fall back to
> scalar execution. This was seen as a reasonable trade-off though.

Sigh, I do love OoO multi issue 6600 style augmented scoreboards.

Predication may be done by throwing individual elements at Function Units, and having an extra predication FU similar to a branch FU that casts a "shadow" across all the FUs with elements that need a predicate bit.

Once the Predication FU has read the predicate register (remember, that can be an immediate in which case no delay, or it could be from the regfile, this is no problem), all that the Predicate FU needs to do is:

Shove the bits of the predicate into the "success" or "fail" wires of the shadow mechanism.

Ta-daaa.

Elements that were to be predicated out, these will die (remember, shadows stop them from committing). Elements allowed to proceed, the "success" signal drops the prohibition against commit.

In each case, the elements could actually be underway (already in progress), already have gotten their operands. This is fine because the shadow stops them writing.

At no time do any of the FUs need to "stall". There are no conflicts. It is incredibly elegant.
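In software terms the mechanism is roughly this (all structure names are mine; a sketch of the idea, not of real hardware):

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_FU = 8 };

/* One entry per in-flight element sitting under the predicate shadow. */
typedef struct {
    bool shadowed;   /* shadow up: result may not commit yet */
    bool cancelled;  /* "fail" wire fired: result dies, never written */
} fu_state;

/* Once the Predicate FU has the predicate value, it just routes each
   bit to the success/fail wires of the corresponding shadowed element. */
void resolve_predicate(fu_state fus[NUM_FU], uint8_t pred) {
    for (int i = 0; i < NUM_FU; i++) {
        if (!fus[i].shadowed)
            continue;
        fus[i].shadowed = false;          /* the shadow drops either way */
        if (!(pred & (1u << i)))
            fus[i].cancelled = true;      /* predicated out: discard result */
    }
}
```

The elements may already be executing when this runs; that is fine, because the `shadowed` flag is what gates the register-file write, not the execution itself.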

In-order systems have constant compromises, and a really simple logical analysis shows that in-order systems have to deal with *exactly the same types of conflicts as an OoO one*.

It's just that the in-order one, hit a problem? Stall. Got a hazard? Stall. Need to service an interrupt? Stall it, sorry.

Stall, stall, stall. It gets really old.

And yet the exact same detection logic to even determine these hazards in the first place *still has to be there*, because the designs are (usually) pipelined.

An FSM architecture of course can only do one thing (one step of a pipeline) at a time, so it avoids the problem completely, but it has terrible performance as a result. Whoops :)

L.

lkcl (Jun 22, 2019, 8:37:35 AM):

On Saturday, June 22, 2019 at 4:16:02 PM UTC+8, Terje Mathisen wrote:

> >
> > So the traditional statistical analysis techniques just wouldn't
> > work, because if you issue the escape sequences to get to FP ops, the
> > *next* ops are *also going to be FP* [unless you issue the "cancel"
> > (pop) context opcode.]
>
> This is exactly the same logic you need to write a near-optimal
> compressor for LZ4 data, i.e. you always have multiple future options
> for how long the current literal stream should be, followed by a
> back-referenced copy previous output stream.

Ha, that's funny. An ISA that turns out to be based on a compression algorithm. By accident.

> >
> > This done because if you need one FP op, chances are higher that the
> > 2nd and 3rd ops are going to be FP as well.
>
> > Yes! This is teletype/Baudot 5-bit Shift-In/Shift-Out all over again. :-)
>
> I really like how this could make it possible to write a single function
> that starts with a FP shift, and then branch to it just past this point
> so that it can be reused for Int data. :-)

Yuk! :)

Object-orientated ISAs. I blame the lecturer at Imperial, myself. I wish I could remember his name. He was such a superb teacher - old enough to have worked with GEORGE III, the CDC Cyber 200 Series, and the Cambridge Capability System.

L.

Ivan Godard (Jun 22, 2019, 12:48:42 PM):

On 6/22/2019 1:16 AM, Terje Mathisen wrote:

> Yes! This is teletype/Baudot 5-bit Shift-In/Shift-Out all over again. :-)
>
> I really like how this could make it possible to write a single function
> that starts with a FP shift, and then branch to it just past this point
> so that it can be reused for Int data. :-)

It's not that easy to do polymorphic code, even with a polymorphic ISA.
The Mill width-tags operands in the metadata, so ADD (for example) works
for byte, half, word, double and quad and SIMD of same. We thought that
it would be possible to have a single function that would work for both
single and double, for example.

While in principle polymorphism would be possible, say in math
libraries, in practice bitwise duplicate functions of different argument
data types were vanishingly rare. Part of the problem was that load and
stores are not width-agnostic. The compiler could handle store by
allocating the maximal operand size and only using the needed part of
it, but load ops need to be told the actual width they were to drop.

However, the biggest problem was constants. Not only did literal ops
like con() know the drop size, the actual data had different values for
different widths - think the coefficients for a numeric series. And
sometimes even the algorithm was different for different widths, to deal
with exact rounding.

We still may be able to find binary duplicates between scalar and SIMD
functions when we have SIMD working, but I'm not hopeful.

Ivan Godard (Jun 22, 2019, 12:57:57 PM):

On 6/22/2019 5:30 AM, lkcl wrote:
> On Saturday, June 22, 2019 at 9:15:28 AM UTC+8, BGB wrote:
>
>>
>> Side note: This is why I did not add wide-execute predicated operations
>> to my ISA; there wasn't a good way to do this which would not adversely
>> effect a scalar implementation. Though, with my ISA design (and a
>> simplistic core), this does mean predicated sections would fall back to
>> scalar execution. This was seen as a reasonable trade-off though.
>
> Sigh, I do love OoO multi issue 6600 style augmented scoreboards.
>
> Predication may be done by throwing individual elements at Function Units, and having an extra predication FU similar to a branch FU that casts a "shadow" across all the FUs with elements that need a predicate bit.
>
> Once the Predication FU has read the predicate register (remember, that can be an immediate in which case no delay, or it could be from the regfile, this is no problem), all that the Predicate FU needs to do is:
>
> Shove the bits of the predicate into the "success" or "fail" wires of the shadow mechanism.
>
> Ta-daaa.

Be careful - it's not quite that easy except with truly idempotent
operations. If there is any way that the total electronic state changes
even though a result is shadowed, then you have a Spectre-family
exfiltration path.

Opaque state changes in the caches caused by failing loads were the
original Spectre. But it's potentially any op that has differential
impact depending on the success vs fail test. For possible example: is
bypass availability for following ops the same whether or not the op failed?

lkcl (Jun 22, 2019, 1:24:53 PM):

On Sunday, June 23, 2019 at 12:48:42 AM UTC+8, Ivan Godard wrote:
.
>
> However, the biggest problem was constants. Not only did literal ops
> like con() know the drop size, the actual data had different values for
> different widths - think the coefficients for a numeric series. And
> sometimes even the algorithm was different for different widths, to deal
> with exact rounding.

Weeeelll... hypothetically that could be dealt with by having a new type of state register, like PC and SP, except called DP. Data Pointer.

Immediates, instead of being loaded from the PC, are loaded from the DP. The DP is implicitly walked forward by the size of the element loaded.

The fly in the ointment is loops, which would require DP to be wound back an explicit amount. All DP changes would be relative.

A new (polymorphic) subroutine could then simply set a new DP.
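A minimal model of the hypothetical DP (the struct, byte order, and names are assumptions; a real design would pin all of these down):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *dp;   /* Data Pointer: walks forward implicitly */
} cpu_state;

/* Fetch an immediate of 'size' bytes from the constant stream at DP
   and advance DP past it (little-endian host layout assumed). */
uint64_t fetch_imm(cpu_state *s, size_t size) {
    uint64_t v = 0;
    memcpy(&v, s->dp, size);
    s->dp += size;
    return v;
}

/* Loops must rewind DP by an explicit, relative amount. */
void dp_rewind(cpu_state *s, ptrdiff_t amount) {
    s->dp -= amount;
}
```

The rewind is the awkward part: the compiler has to know exactly how many immediate bytes one loop iteration consumed, in the same way it knows the loop's code size.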

>
> We still may be able to find binary duplicates between scalar and SIMD
> functions when we have SIMD working, but I'm not hopeful.



SIMD baaad. Don't be seduced by its powaaaah https://www.sigarch.org/simd-instructions-considered-harmful/

But seriously, the opcode proliferation is O(N^6), including some of the combinations seen in audio-processing DSPs such as AndesStar: HI to LO and LO to HI, permutations, different src and dest bitwidths, clipping modes on the same, and the cleanup and setup gets ugly, real fast.

*internal* SIMD ALUs with predication that at the end of a loop is used to cover the corner case, *great*.

i.e. Vector ISA front end, SIMD engine backed with predication: great.

SIMD at the ISA level, except for 3D vectors and other scenarios where there is a guaranteed lack of corner case handling, baaaad.

lkcl (Jun 22, 2019, 1:38:11 PM):

On Sunday, June 23, 2019 at 12:57:57 AM UTC+8, Ivan Godard wrote:

> Be careful - it's not quite that easy except with truly idempotent
> operations. If there is any way that the total electronic state changes
> even though a result is shadowed then you you have a Spectre-family
> exfiltration path.

I was hoping that wouldn't happen. Sigh.

A couple of solutions present themselves.

The ultimate paranoia one is, instead of casting the shadow across write, cast it across read instead. Prevent the ops from even *reading* the regfile until the predicate has been read and decoded.

A slightly less paranoid version is to at least let the Computation Units read and latch their operands, but not proceed with execution.

Variations on this theme could allow e.g. the LD/ST units to at least compute the addresses, but not proceed to hit the caches until the predicate is known and its bits distributed to the LD/ST CUs.

Each step allowed has advantages and disadvantages: the longer you let operations steam ahead without the predicate, the more resources are potentially wasted throwing away results; worse, those CUs could have been allocated ops that were not predicated or had no hazards, and could have been doing useful work.

I do not know what workloads to expect, yet, so cannot say what would be the best point at which to freeze the CUs.

L.

Ivan Godard (Jun 22, 2019, 3:04:08 PM):

On 6/22/2019 10:24 AM, lkcl wrote:
> On Sunday, June 23, 2019 at 12:48:42 AM UTC+8, Ivan Godard wrote:
> .
>>
>> However, the biggest problem was constants. Not only did literal ops
>> like con() know the drop size, the actual data had different values for
>> different widths - think the coefficients for a numeric series. And
>> sometimes even the algorithm was different for different widths, to deal
>> with exact rounding.
>
> Weeeelll... hypothetically that could be dealt with by having a new type of state register, like PC and SP, except called DP. Data Pointer.
>
> Immediates, instead of being loaded from the PC, are loaded from the DP. The DP is implicitly walked forward by the size of the element loaded.
>
> The fly in the ointment is loops, which would require DP to be wound back an explicit amount. All DP changes would be relative.
>
> A new (polymorphic) subroutine could then simply set a new DP.

That's an interesting approach. The DP would have to be set by the
caller of course. It would have problems if the function-body were mixed
width - think of a function that does a series eval in double but has
args and results in single. And it wouldn't help when the algorithm
differed by width - more series stages in wider sizes, for example. But
it still might enable some cases that are otherwise unworkable. Needs
thought.

>>
>> We still may be able to find binary duplicates between scalar and SIMD
>> functions when we have SIMD working, but I'm not hopeful.
>
>
> SIMD baaad. Don't be seduced by its powaaaah https://www.sigarch.org/simd-instructions-considered-harmful/
>
> But seriously, the opcode proliferation is O(N^6), including some of the combinations seen in audio-processing DSPs such as AndesStar: HI to LO and LO to HI, permutations, different src and dest bitwidths, clipping modes on the same, and the cleanup and setup gets ugly, real fast.
>
> *internal* SIMD ALUs with predication that at the end of a loop is used to cover the corner case, *great*.
>
> i.e. Vector ISA front end, SIMD engine backed with predication: great.
>
> SIMD at the ISA level, except for 3D vectors and other scenarios where there is a guaranteed lack of corner case handling, baaaad.
>

You've been drinking the Kool-Aid and generalizing :-)

I grant you that SIMD as usually done (i.e. X86) is bad. For the FP
loops that Mitch is designing for it just gets in the way. But it is
valuable for sub-operand-size data; you really don't want to do strcpy
one byte at a time. One ALU with a split carry tree is a lot cheaper
than eight ALUs.
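The strcpy case rests on the classic SWAR trick: one wide ALU with lane-split carries scans eight bytes for the terminating zero at once. A well-known C rendering of it:

```c
#include <stdint.h>

/* True iff any byte of w is 0x00: subtract 1 from every byte lane and
   look for a borrow landing in a lane whose high bit was originally
   clear. This is the work a split carry tree does in one 64-bit op. */
int has_zero_byte(uint64_t w) {
    return ((w - 0x0101010101010101ULL)
            & ~w
            & 0x8080808080808080ULL) != 0;
}
```

Cross-lane borrows only start at a genuinely zero byte, so the test has no false positives; without SIMD-ish lane splitting you would be testing one byte per iteration.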

And opcode proliferation is a red herring - you are blaming the data
path (SIMD) for a mistake in the code path (width-sensitive instruction
encoding). Mill has one opcode for each function, for all widths and
scalar/vector of any count. Any ISA could do the same if it weren't for
legacy to the errors of the past. It's not like the idea was unknown
when the legacy was designed, either - major machines of the early
1960's used a single opcode, not only for single vs. double but also for
integer vs FP.

lkcl

unread,
Jun 22, 2019, 3:49:06 PM6/22/19
to
Apologies for removing context, phone quite awkward.

If function parameters are in registers, and the registers have been tagged by the *caller* (width and even type), and operations are polymorphic, we're good to go for a fully polymorphic function.

Caveats are more fundamental. The first is if (as I wanted to do with this VLIW style format) the context can be LOST or reset within the function, ignored or due to it being part of the (large) opcodes.

Second, The "tags" need to be context switched along with the register contents themselves.

2 bits for width (8/16/32/64), maybe 2 for type (INT/FP/other/other). Not a lot, then, really. In the case of SimpleV Vectorisation the context is just too large to tolerate; however, a simpler ISA (fewer registers, less option-itis) should be OK.
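As a sketch, the per-register tag could be as small as this (the encoding is entirely my assumption):

```c
#include <stdint.h>

enum width { W8 = 0, W16 = 1, W32 = 2, W64 = 3 };  /* 2 bits */
enum rtype { T_INT = 0, T_FP = 1 };                /* 2 bits, two codes spare */

typedef uint8_t reg_tag;  /* only the low 4 bits are used */

static inline reg_tag make_tag(enum width w, enum rtype t) {
    return (reg_tag)(((unsigned)t << 2) | (unsigned)w);
}
static inline enum width tag_width(reg_tag tg) { return (enum width)(tg & 3); }
static inline enum rtype tag_type(reg_tag tg)  { return (enum rtype)((tg >> 2) & 3); }
```

At 4 bits per register, a 32-entry register file needs only 128 bits of extra state saved and restored on a context switch.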

Funnily enough this ties in with the historical SIMD proliferation solution you mention.

Tags and polymorphism basically knock 2 to 4 dimensions off the O(N^6) SIMD proliferation space. If Mill has done that already, SIMD should be tolerable and basically sounds exactly like the "subvector" concept that got added to RVV and SimpleV.

subvector basically is a sort of variable-length SIMD (1 to 4 is enough to support 3D vectors and Quads); audio channels could benefit from subvectors, RGB888 pixel data likewise would fit a subvector size of 3, and so on.

The thing is, again, just as you say Ivan, unlike the x86 approach, subvector length is *state*, not hardcoded into the ISA. That's what I meant by SIMD being harmful.

L.

Ivan Godard (Jun 22, 2019, 5:05:02 PM):

On 6/22/2019 12:49 PM, lkcl wrote:
> Apologies for removing context, phone quite awkward.
>
> If function parameters are in registers, and the registers have been tagged by the *caller* (width and even type), and operations are polymorphic, we're good to go for a fully polymorphic function.

We thought so too. Found very few functions that could be poly that
weren't also so small that they should be inlined rather than called.
Good luck :-)

> Caveats are more fundamental. The first is if (as I wanted to do with this VLIW style format) the context can be LOST or reset within the function, ignored or due to it being part of the (large) opcodes.
>
> Second, The "tags" need to be context switched along with the register contents themselves.
>
> 2 bits for width (8/16/32/64), maybe 2 for type (INT/FP/other/other). Not a lot, then, really. In the case of SimpleV Vectorisation, the context is just too large to tolerate, however a simpler ISA (less registers, less option-itis) should be ok.

Hmm - type bits? So you have to look at the data to decide which FU to
route the data (and the op) to? Nah.

The system that did mixed int/FP (I did a compiler for it) treated FP as
integer overflow; only one FU, really just a FP unit, but preserved
integer values as such during normalization when possible, and
early-outed them when no post-normalize was needed.

EricP (Jun 22, 2019, 5:12:57 PM):

Ivan Godard wrote:
>
> Be careful - it's not quite that easy except with truly idempotent
> operations. If there is any way that the total electronic state changes
> even though a result is shadowed then you you have a Spectre-family
> exfiltration path.
>
> Opaque state changes in the caches caused by failing loads were the
> original Spectre. But it's potentially any op that has differential
> impact depending on the success vs fail test. For possible example: is
> bypass availability for following ops the same whether or not the op
> failed?

Predicate instructions that bypass certain execute phases have
side channel leaks only to the same extent as conditional branches.
A predicated load or store or MUL or DIV is no different than
one guarded by a conditional branch.
If you don't want it conditionally executed, don't predicate it.

And there is nothing in predicated instructions that inherently
allows access to secrets in another privilege domain.



BGB (Jun 22, 2019, 7:20:01 PM):

OK.

In my case, IF figures out instruction lengths for all positions in an
I$ block, and returns the length for the position fetched for.

In the WEX case, current prototypes assume 64 bits (2x 32).

Decode is split into 2 stages, as register fetch seems to be kinda
problematic in terms of timing.

Accessing memory works by sending a request in E1 and waiting for it to
complete in E2 (E2 stalls the pipeline until it completes). Currently
accessing the L1 D$ takes longer than this, so memory access always
stalls a few cycles.

Similarly, the approach for both the multiplier and FPU was "stall until
done". As with memory access, EX2 will stall until the operation completes.



> I lobbed integer multiply and divide over in the FMAC unit as it was already
> doing FMUL and FDIV.
>

I half-considered this, but there is the problem in my case that integer
and DP multiply produce/consume different ranges of bits (the DP
multiplier would effectively need to produce a full-width output for
this to work, rather than discarding the low-order results).

Similarly, the width of the results from the 32-bit integer multiplier
are wider than those from the DP multiplier.


> You can argue that CA/AL takes too much time, but this seems to be the way of
> faster processors anyway. This gives the Cache SRAMs 1 entire cycle to be
> accessed, and allows plenty of time for align and wire delay on the way back.
>
> The wa (wait) cycle is present only to align FMAC with the other pipeline.
> A DP FMAC at 4-cycles remains doable as long as wire delay does not eat up
> the wire routes in the multiplier tree.
>

Main difference is I didn't really try to fit these entirely into the
pipeline.

The 2 EX stages are partly so it can do 64-bit ALU ops without needing 2
cycles (if the results are not used immediately). However, there is an
interlock mechanism to deal with some of this (there will be a stall if
reading from a register which will be written by the EX2 stage).


>>
>> If it is possible to determine early (IF stage) whether or not the ops will
>> execute in parallel, then PC can be advanced as such, and the output
>> from multiple decoders can be used. Otherwise the output from the other
>> decoders is ignored (treated as if they decoded NOPs).
>
> This is one reason why I fetch 4 instructions at a time, so that every 32-bit
> item in the instruction buffer (320-gates/entry) can expend 30-gates/entry
> to predecode various things concerning that slot in the IB. Instruction
> parse takes these predecoded items and performs find-next-one instruction
> parsing to the extent necessary. {Remember the ISA has 1-to-5 word instructions
> to provide any and all needed constants to the instruction stream.}

My prototypes thus far were:
1 op if 16/32/48 bits (or 16/32 for Lite profile).
2 ops if both are 32 bits (if WEX is available).


>>
>> With superscalar, it is necessary either to use a more complicated
>> organization (no longer a simple scalar pipeline, eg, needs a more
>> obvious split between the decode and execute stages).
>>
>>
>> If we know in advance, the pipeline can look just like a normal scalar
>> pipeline, just with a few extra parts glued on.
>
> YEs.
>>
>> Main costs are mostly the alternate execute unit, additional register
>> ports, and the added pipeline machinery.
>
> One should note:
> Branch unit probably does not need a RF Write port 97% of the time.
> Store does not ever need an RF write port--and here is where I generally
> steal a RF Read port when an instruction either consumes a constant or
> is a 1-operand instruction (branch for example).
>
> So one should be able to get to ~1.3 IPC with a 3R1W RF; even before
> considering forwarding of results back as operands.

A 2-wide WEX was presumed to use 4R2W:
2 read ports per lane, 1 write port per lane.
Ops which need 3R would effectively disallow the second lane.

The idea before was, eg:
F4yy_yyyy F0xx_xxxx
yy: Lane 2
xx: Lane 1 (Scalar Lane)

Memory load/store, branches, FPU ops, ... Could only exist in lane 1.

Lane 2 can contain various ALU ops and F2-block (Imm10u/Imm10n) constant
loads.

Memory store and FPU ops and similar would eat up both lanes:
GPR Ports: Rs, Rt, Ru, Rv
Store: MOV.L Rm, (Rn, Ri)
Rn=Rs, Ri=Rt, Rm=Ru

Otherwise, (Rs,Rt) would map to Lane 1, and (Ru,Rv) to Lane 2 if a WEX
encoding is used. Otherwise, Lane 1 is mapped to (Rs,Rt,Ru).
While things have to be valid in scalar order, they also need to be
valid if executed in parallel.


Dependencies between results in different lanes within a block will be
effectively undefined.

Eg:
ADD R5, R7, R9 | ADD R9, R6, R10

Would effectively be disallowed as its results are undefined.


Things like register forwarding between lanes within adjacent cycles is
more ambiguous. Intuitively, it "should" be required to work like it
would if it were scalar.

But a case could be made for being cheap, and saying that cross-lane
forwarding and interlocks involving Lane 2 are undefined (in the name of
shaving off implementation cost).

Effectively, this would mean that results of operations in Lane 2 might
not necessarily be visible until 2 clock cycles after the fact.

Lane 1 forwarding and interlocks would be required though for typical
scalar code to work.


My initial prototypes for WEX were based on the assumption of full
support for forwarding and interlocks.

Note though that since the pipeline was a 2-wide version of the scalar
pipeline, any stalls or bubbles in the pipeline would apply
simultaneously to both lanes.



>>
>> The compiler still needs to order the instructions for an in-order
>> superscalar to do its thing. For explicit tagging, it does this, but
>> also explicitly flags which instructions match the criteria to be able
>> to execute in parallel.
>>
>>
>>
>> Otherwise, kinda burnt out right now, and my day job has now gone over a
>> 6x10 schedule (6-day work weeks with 10 hour days, mandatory overtime;
>> from the 5x10 and 5x12 weeks it was previously). I haven't been getting
>> much sleep recently (been going weeks on ~5 hours/night; with a little
>> more on weekends, but now I will only have 1 day of weekend...).
>
> You know there is a reason "they" invented alcohol.....

Turns out it is 5x10+6, rather than 6x10, so one works 6AM to 12PM on
Saturdays rather than 6AM to 5PM (as on weekdays).

On a 5x12 schedule, it was working 6AM to 7PM.

On 5x10 and 5x12 weeks, was generally sleeping from ~ 11PM to 4AM (eg:
wake up at ~ 4AM, drink coffee and get ready, leave house a little after
5AM).


I do OK on 6 hours sleep, but start to "wear down" on 5 hours (on my
own, normally I sleep ~ 6-7 hours; for the most part I am unable to
sleep the theoretical '8 hours'; I interpret it more like "an 8 hour
block is to leave time to probably sleep ~7 hours?...").

Not as sure about other people, but these sorts of schedules don't leave
a lot of free time for working on personal projects or similar.


Similarly, while I might intend to get to sleep at ~ 10PM, I am also
rather prone to overshoot this target (hence, ~11PM).



As for sleep amount:
<=4: Generally feel terrible.
5: Tired, generally manage, gradually "wear down" during the week.
6: OK.
7: OK.
8+: Uncommon for me (usually follows being overly tired).

Present state is following 6 days of 5 hours of sleep, generally feeling
tired.


If left on my own (eg, no school or job), sleep time tends to drift to ~
3AM to 9-10AM...

Not sure how all this compares with a "typical person" though...

MitchAlsup (Jun 22, 2019, 11:26:04 PM):

Yes, I do several things like this::

for the compare instruction, there are two varients
for the integer compare instructions I produce a 10-bit vector encompassing
==,!=,>,>=,<,<=,HI,HS,LO,LS
I append to this a range comparison
0 < X < Y
0 <= X < Y
0 < X <= Y
0 <= X <= Y
I append to this:
any 8-bit field == 0
which 8-bit field == 0
any 16-bit field == 0
which 16-bit field == 0
any 32-bit field = 0
which 32-bit field = 0
and finally, we specially detect NEGMAX and POSMAX
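The base 10-bit vector can be modelled directly (the bit positions here are my assumption, not Mitch's encoding):

```c
#include <stdint.h>

/* ==,!=,>,>=,<,<= are signed; HI,HS,LO,LS are the unsigned forms. */
uint16_t int_compare(int64_t x, int64_t y) {
    uint64_t ux = (uint64_t)x, uy = (uint64_t)y;
    uint16_t v = 0;
    v |= (uint16_t)(x  == y)  << 0;  /* ==  */
    v |= (uint16_t)(x  != y)  << 1;  /* !=  */
    v |= (uint16_t)(x  >  y)  << 2;  /* >   */
    v |= (uint16_t)(x  >= y)  << 3;  /* >=  */
    v |= (uint16_t)(x  <  y)  << 4;  /* <   */
    v |= (uint16_t)(x  <= y)  << 5;  /* <=  */
    v |= (uint16_t)(ux >  uy) << 6;  /* HI  */
    v |= (uint16_t)(ux >= uy) << 7;  /* HS  */
    v |= (uint16_t)(ux <  uy) << 8;  /* LO  */
    v |= (uint16_t)(ux <= uy) << 9;  /* LS  */
    return v;
}
```

The signed/unsigned pairs really can disagree: -1 < 1 signed, yet 0xFFFF...FFFF is HI relative to 1 unsigned, which is exactly why both families are in the vector.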


In the FP set we have the 6 useful compares:
==,!=,>,>=,<,<=
and we have their inverts
!(==),!(!=),...
along with the range comparisons
0 < X < Y
0 <= X < Y
0 < X <= Y
0 <= X <= Y
and then we classify FP operand 1 into 10 categories (a la OpenGL):
minus infinity, minus normal, minus denorm, minus zero, plus zero, plus denorm, plus normal, plus infinity, NaN
So basically anything one want to sample about a FP is available.

MitchAlsup (Jun 22, 2019, 11:48:58 PM):

Basically, it sounds like you wrapped 1.5 stages of my pipeline per stage of your pipeline.
>
> In the WEX case, current prototypes assume 64 bits (2x 32).
>
> Decode is split into 2 stages, as register fetch seems to be kinda
> problematic in terms of timing.
>
> Accessing memory works by sending a request in E1 and waiting for it to
> complete in E2 (E2 stalls the pipeline until it completes). Currently
> accessing the L1 D$ takes longer than this, so memory access always
> stalls a few cycles.
>
> Similarly, the approach for both the multiplier and FPU was "stall until
> done". As with memory access, EX2 will stall until the operation completes.
>
>
>
> > I lobbed integer multiply and divide over in the FMAC unit as it was already
> > doing FMUL and FDIV.
> >
>
> I half-considered this, but there is the problem in my case that integer
> and DP multiply produce/consume different ranges of bits (the DP
> multiplier would effectively need to produce a full-width output for
> this to work, rather than discarding the low-order results).

Also note:
I aligned the int and FP pipes so they can share a RF write port!
Now one needs one result write port for every 1.3 instructions issued
and in the simplest machines there are no conflicts for write ports for several variants of 1-wide to 2-wide designs.

>
> Similarly, the width of the results from the 32-bit integer multiplier
> are wider than those from the DP multiplier.

It is an interesting set of trade-offs: 53×53 and only perform FMUL and FMAC, OR
57×57 with FDIV and FSQRT OR 64×64 with IMUL and IDIV and REM.
We can all agree that if we have a int multiplier 64×64 we can perform all of the calculations
We can all agree that 53×53 is just over 50% of 64×64
We should be able to agree that there are external factors pushing a design in either direction.
Even if that lane was a 1-op
Even if that lane was a 1-op plus constant?
I always allowed this to mean last use as operand R9 and write of R9

lkcl (Jun 23, 2019, 1:22:11 AM):

On Sunday, June 23, 2019 at 5:05:02 AM UTC+8, Ivan Godard wrote:
> On 6/22/2019 12:49 PM, lkcl wrote:

> >
> > If function parameters are in registers, and the registers have been tagged by the *caller* (width and even type), and operations are polymorphic, we're good to go for a fully polymorphic function.
>
> We thought so too. Found very few functions that could be poly that
> weren't also so small that they should be inlined rather than called.
> Good luck :-)

:)


> > 2 bits for width (8/16/32/64), maybe 2 for type (INT/FP/other/other). Not a lot, then, really. In the case of SimpleV Vectorisation, the context is just too large to tolerate, however a simpler ISA (less registers, less option-itis) should be ok.
>
> Hmm - type bits? So you have to look at the data to decide which FU to
> route the data (and the op) to?

No, CSRs. In effect the bits that make the decision on what to tell the ALU to do are moved from the instruction into CSRs aka "hidden instruction bits".


> The system that did mixed int/FP (I did a compiler for it) treated FP as
> integer overflow; only one FU, really just a FP unit, but preserved
> integer values as such during normalization when possible, and
> early-outed them when no post-normalize was needed.

Cool. Neat trick.

lkcl

unread,
Jun 23, 2019, 9:00:40 AM6/23/19
to
On Sunday, June 23, 2019 at 1:22:11 PM UTC+8, lkcl wrote:


> No, CSRs. In effect the bits that make the decision on what to tell the ALU to do are moved from the instruction into CSRs aka "hidden instruction bits".

Architectures that have done this very trick, for decades, are PowerPC and MIPS, with the big-endian/little-endian dynamic mode switch.

And you mentioned that older archs used to do the same thing all the time.

No, definitely not store the mode in the *data*, though.

L.

EricP

unread,
Jun 23, 2019, 11:18:45 AM6/23/19
to
You are programming a Vector Co-Processor (VCP) using the existing
CSR instructions rather than adding explicit instructions to do the same.

I see in the spec that you mention re-entrancy but don't go into detail.
The VCP needs to be interruptible by external devices part way through
a vector op, as well as exceptions (page faults in particular, as you note),
and restartable afterwards, user or kernel mode.
The kernel must be able to switch threads using the VCP at any point.
Kernel can deliver a user mode *nix signal that interrupts the VCP,
and have it call a routine that uses the VCP, and restores it on return.
(This is not as straightforward as it might seem at first look
because it is the return to the interrupted routine that re-enables
the VCP so the interrupted VCP operation restarts in its original
program context and not the nested *nix signal one.
Maybe a branch-and-enable-VCP CSR instruction.)

The VCP CSR registers need to be readable and writable so their
state can be saved to the stack and restored afterward, e.g.:

oldCsr1 = CsrRead1();
oldCsr2 = CsrRead2();
oldCsr3 = CsrRead3();

CsrWrite1(newCsr1);
CsrWrite2(newCsr2);
CsrWrite3(newCsr3);
nextPC = &Next;
CsrWriteBEVCP (nextPC); // Branch and Enable VCP

Next:
// Restore and restart VCP
CsrWrite1(oldCsr1);
CsrWrite2(oldCsr2);
CsrWrite3(oldCsr3);
nextPC = retPC;
CsrWriteBEVCP (nextPC); // Branch and Enable VCP

or some such thing.




BGB

unread,
Jun 23, 2019, 11:46:25 AM6/23/19
to
Could be; if I made the pipeline a little longer, I could possibly make
timing a little easier.

Costs:
More branch latency (if I extend earlier stages);
More interlocks and forwarding (if I extend later stages).

Figuring out length in IF made it easier to advance PC than if I had
done it in D1, since there is a 1c delay between giving it a PC address,
and getting the results for said PC address.


As noted, with my ISA (BJX2):
0zzz_zzzz: 16b
10zz_zzzz: 16b
110z_zzzz: 16b
1110_0zzz: 32b (OP?T / OP?F)
1110_10zz: 32b (OP?T / OP?F)
1111_00zz: 32b / 32b (Scalar / WEX2)
1111_01zz: 32b / 64b (Scalar / WEX2)
1111_100z: 32b / 32b (Scalar / WEX2)
1111_101z: 32b / 64b (Scalar / WEX2)
111z_11zz: 48b
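As a rough sketch, the length rule above can be expressed as a lookup on the leading bits. This is my reconstruction, not BJX2's actual decode logic, and it assumes the patterns describe the high byte of the first 16-bit word (that bit-ordering assumption may be wrong); `wex` selects the second length where the table lists "32b / 64b":

```python
def bjx2_insn_length(first_byte, wex=False):
    """Instruction length in bits, decoded from the leading 8 bits.

    Reconstruction of the table above.  'wex' picks the WEX2 length
    for the "32b / 64b" rows.
    """
    b = first_byte & 0xFF
    if (b & 0x80) == 0x00: return 16   # 0zzz_zzzz
    if (b & 0xC0) == 0x80: return 16   # 10zz_zzzz
    if (b & 0xE0) == 0xC0: return 16   # 110z_zzzz
    if (b & 0xEC) == 0xEC: return 48   # 111z_11zz (test before the 32b forms)
    if (b & 0xF8) == 0xE0: return 32   # 1110_0zzz
    if (b & 0xFC) == 0xE8: return 32   # 1110_10zz
    if (b & 0xFC) == 0xF0: return 32   # 1111_00zz
    if (b & 0xFC) == 0xF4: return 64 if wex else 32   # 1111_01zz
    if (b & 0xFE) == 0xF8: return 32   # 1111_100z
    return 64 if wex else 32           # 1111_101z (only pattern left)
```

The Lite rule below is then just the first three entries, with everything starting `111` treated as 32b.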

BJX2 Lite uses a simpler rule:
0zz: 16b
10z: 16b
110: 16b
111: 32b

For Lite, 48-bit encodings are disallowed, and WEX is ignored (WEX
encodings always operate in scalar mode).

Length with WEX2 in wider configurations is more complicated. Could
theoretically support up to 4 wide with the current cache-system design
though (but, me doing 3 or 4 wide is unlikely at present given its cost).


>>
>> In the WEX case, current prototypes assume 64 bits (2x 32).
>>
>> Decode is split into 2 stages, as register fetch seems to be kinda
>> problematic in terms of timing.
>>
>> Accessing memory works by sending a request in E1 and waiting for it to
>> complete in E2 (E2 stalls the pipeline until it completes). Currently
>> accessing the L1 D$ takes longer than this, so memory access always
>> stalls a few cycles.
>>
>> Similarly, the approach for both the multiplier and FPU was "stall until
>> done". As with memory access, EX2 will stall until the operation completes.
>>
>>
>>
>>> I lobbed integer multiply and divide over in the FMAC unit as it was already
>>> doing FMUL and FDIV.
>>>
>>
>> I half-considered this, but there is the problem in my case that integer
>> and DP multiply produce/consume different ranges of bits (the DP
>> multiplier would effectively need to produce a full-width output for
>> this to work, rather than discarding the low-order results).
>
> Also note:
> I aligned the int and FP pipes so they can share a RF write port!
> Now one needs one result write port for every 1.3 instructions issued
> and in the simplest machines there are no conflicts for write ports for several variants of 1-wide, to 2-wide designs.
>

OK. The current ISA still uses separate GPRs and FPRs.

Currently, the FPU's execute stages operate in-parallel with the main
integer unit, with FPR fetch and writeback also occurring in parallel.

I had evaluated possibilities for merging them, but noted it wouldn't
save all that much ISA-wise; could either save resources or make timing
harder. Changing this would be a pretty big breaking-change though.

In the Lite profile, FPU is optional and FP arguments would be passed in
GPRs in the C ABI either way.


>>
>> Similarly, the width of the results from the 32-bit integer multiplier
>> are wider than those from the DP multiplier.
>
> It is an interesting trade offs. 53×53 and only perform FMUL and FMAC OR
> 57×57 with FDIV and FSQRT OR 64×64 with IMUL and IDIV and REM.
> We can all agree that if we have a int multiplier 64×64 we can perform all of the calculations
> We can all agree that 53×53 is just over 50% of 64×64
> We should be able to agree that there are external factors pushing a design in either direction.

Yeah.

54*54->54: 6 DSPs (high result, discard low bits of result)
32*32->64: 4 DSPs (full-width, 64b output)
64*64->128: 16 DSPs

A direct 64*64->128 multiplier is a bit steep (both expensive and slow).


Probably, one would need to do a 64*64->64 multiplier which can produce
either the low or high results (could be done with 10 DSPs).

Then an aligned high-result could potentially also be used to implement
a DP multiply.


(A3,A2,A1,A0)*(B3,B2,B1,B0)

Low Result (63:0):
A3B0
A0B3
A2B1
A1B2
A2B0
A0B2
A1B1
A1B0
A0B1
A0B0

High result (127:64):
A3B3
A3B2
A2B3
A2B2
A3B1
A1B3
A3B0
A0B3
A2B1
A1B2

So, effectively, they are partly mirrors.

So, the sub-multipliers could be fed different inputs, and the adder
chains would be assembled differently.

DSP cost would be 10 DSPs (so is the same here), however the cost in
terms of LUTs and latency would likely be a little worse (due mostly to
the more complex adder chains needed).
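To check the partial-product bookkeeping above, here is a toy model (mine, not the actual hardware) that forms a 64×64 product from 16-bit limbs. The low 64 bits involve only the terms AiBj with i+j <= 3 and the high 64 bits the terms with i+j >= 3, overlapping at i+j == 3, which is the "partly mirrors" observation:

```python
MASK64 = (1 << 64) - 1

def mul64_limbs(a, b):
    """64x64 -> (low 64, high 64) via 16-bit limbs (A3..A0)*(B3..B0).

    Each limb product Ai*Bj lands at shift 16*(i+j); the groups listed
    in the post correspond to i+j <= 3 (low word) and i+j >= 3 (high word).
    """
    A = [(a >> (16 * i)) & 0xFFFF for i in range(4)]
    B = [(b >> (16 * i)) & 0xFFFF for i in range(4)]
    acc = 0
    for i in range(4):
        for j in range(4):
            acc += (A[i] * B[j]) << (16 * (i + j))
    return acc & MASK64, (acc >> 64) & MASK64
```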


The amount of additional plumbing between the integer unit and FPU could
possibly also be problematic (vs the FPU using its own multiplier).


>>>>
>>>> With superscalar, it is necessary either to use a more complicated
>>>> organization (no longer a simple scalar pipeline, eg, needs a more
>>>> obvious split between the decode and execute stages).
>>>>
>>>>
>>>> If we know in advance, the pipeline can look just like a normal scalar
>>>> pipeline, just with a few extra parts glued on.
>>>
>>> YEs.
>>>>
>>>> Main costs are mostly the alternate execute unit, additional register
>>>> ports, and the added pipeline machinery.
>>>
>>> One should note:
>>> Branch unit probably does not need a RF Write port 97% of the time.
>>> Store does not ever need an RF write port--and here is where I generally
>>> steal a RF Read port when an instruction either consumes a constant or
>>> is a 1-operand instruction (branch for example).
>>>
>>> So one should be able to get to ~1.3 IPC with a 3R1W RF; even before
>>> considering forwarding of results back as operands.
>>
>> A 2-wide WEX was presumed to use 4R2W:
>> 2 read ports per lane, 1 write port per lane.
>> Ops which need 3R would effectively disallow the second lane.
>
> Even if that lane was a 1-op
> Even if that lane was a 1-op plus constant?
>

Yes, generally.

It is possible it could be made to work, but would require being more
clever, and only work for certain ops (depending on which resources they
use), ...


Many ops expand out to be wider internally, eg, 1R ops might internally
decode as 2R ops, ... Similarly, immediate values are fed through the GPR
read ports (immediate values are treated internally as a special
register; allowing "Reg, Reg, Reg" and "Reg, Imm, Reg" cases and similar
to use the same logic in the EX stages).

Simple case is to make 3R ops effectively disallow Lane 2.


Less clear how this would work with a 3-wide WEX, eg:
ADD R2, R3, R4 | MOV.L R6, (R9, R7)

The MOV.L would use 3R, thus not allow the ADD in Lane 2, but it could
fit in Lane 3 (if it exists).

Either the core would need to be "smart" and detect this case (using
Lane 3 rather than Lane 2, giving the MOV.L both Lanes 1 and 2), or the
assembler would need to do something silly, like:
ADD R2, R3, R4 | NOP3 R6 | MOV.L R6, (R9, R7)

In this case, the 'NOP3' in Lane 2 being simply a placeholder to give
Lane 1 its 3rd source register.

An alternative is always giving Lane 1 all 3 read ports, but this would
require 5R2W for a 2-wide WEX (vs 4R2W).
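The "3R disallows the next lane" rule can be sketched as a simple bundle check. This is my own toy formulation, not BJX2's actual decode logic: each lane owns 2 read ports, and an op needing 3 reads borrows the following lane's ports, which forces that lane to stay empty:

```python
def bundle_legal(read_counts, ports_per_lane=2):
    """read_counts[i] = source-register reads needed by the op in lane i.

    Toy rule: an op that exceeds its lane's read ports borrows from the
    next lane, so nothing may occupy that next lane.  A 3R op is thus
    only legal in the last (occupied) lane of the bundle.
    """
    for i, need in enumerate(read_counts):
        if need > ports_per_lane and i + 1 < len(read_counts):
            return False   # would steal ports from an occupied lane
    return True
```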


To what extent to allow (or not allow) FPU ops with WEX is TBD, things like:
ADD R7, R8, R9 | FMUL FR3, FR4, FR5

Or similar are probably OK, as long as the FPU op stays in Lane 1.


>>>>>> A goal was to minimize the "cleverness" needed by the core (the C
>>>>>> compiler and/or ASM programmer would be responsible for all this).
>>>>>
>>>>> Am not a fan of compiler cleverness :) CEDAR Audio had to code in assembler (1993) because the TI DSP compiler just could not cope. With only 50 MFLOPs to play with (12.5 mhz, 2x pipeline, 2x FPUs, one for odd regs one for even, FMAC) the budget was only 1000 cycles per audio sample, and compiler inefficiency just could not be tolerated.
>>>>>
>>>>> Have been wary of VLIW compiler "cleverness" ever since.
>>>>>
>>>>
>>>>
>>>> Either the compiler does it, or the processor needs to be able to do so.
>>>>
>>>> For a core where detecting inter-instruction dependencies (as-needed for
>>>> an in-order superscalar) is too expensive, it can be a viable
>>>> alternative to scalar.
>>>
>>> If you do this::
>>> Be aware that you will want to implement the following tokens::
>>>
>>> This result forwards and writes
>>> This result forwards but write was elided [phantom]
>>> This operand comes from slot[k]
>>>
>>> and depending on how aggressive branch prediction is going to be:
>>>
>>> This operand comes from slot[k] when taken or slot[j] otherwise.
>>
>> While things have to be valid in scalar order, they also need to be
>> valid if executed in parallel.
>>
>>
>> Dependencies between results in different lanes within a block will be
>> effectively undefined.
>>
>> Eg:
>> ADD R5, R7, R9 | ADD R9, R6, R10
>
> I always allowed this to mean last use as operand R9 and write of R9
>

I had ended up going with AT&T ordering for BJX2 ASM, where in this
case, the instructions would give different results with scalar vs
parallel execution, and is thus disallowed.

In WEX, the registers as-read are typically those from the prior cycle;
with a register write within the same cycle not being visible until the
following cycle.


As noted, it does currently retain the use of forwarding/interlocks, and
behaves the same as scalar code regarding control flow.

While an argument could possibly be made for improving performance with
WEX by adding delay-slot branch instructions, these would have the
severe drawback of exhibiting different behavior in scalar mode (and
disallowing WEX in the delay slot would eliminate its main use-case).

Similar would apply to allowing ops with "delayed writeback" (where the
result of the operation would not be required to be visible until
several clock cycles later, but would otherwise allow execution to
proceed as if it were a single-cycle op).

IOW: doing some stuff similar to the TMS320C6x family...

lkcl

unread,
Jun 23, 2019, 12:23:16 PM6/23/19
to
On Sunday, June 23, 2019 at 11:18:45 PM UTC+8, EricP wrote:
> lkcl wrote:
> > On Sunday, June 23, 2019 at 1:22:11 PM UTC+8, lkcl wrote:
> >
> >
> >> No, CSRs. In effect the bits that make the decision on what to tell the ALU to do are moved from the instruction into CSRs aka "hidden instruction bits".
> >
> > Architectures that have done this very trick, for decades, are PowerPC and MIPS, with the bigendian/littleendian dynamic modeswitch.
> >
> > And you mentioned that older archs used to do the same thing all the time.
> >
> > No, definitely not store the mode in the *data*, though.
> >
> > L.
>
> You are programming a Vector Co-Processor (VCP)

mmmm not quite: the Vector Engine is permanently on, all the time. It is just that when VL is set to 1, all operations are scalar, because the hardware for loop is now "for opindex in range(1 to 1)".

Kinda weird, I know :)

> using the existing
> CSR instructions rather than adding explicit instructions to do the same.

Hidden state, on steroids, yes. Where previously the register state lived in eight 64-bit CSRs, 16 bits per entry, those are now gone and are part of the VLIW-like format.

> I see in the spec that you mention re-entrantcy but don't go into detail.

http://git.libre-riscv.org/?p=riscv-isa-sim.git;a=blob;f=riscv/insn_template_sv.cc;h=c73e3fde29e0cc4be104bc3235befac54ccb7581;hb=refs/heads/sv

The simulator code is a bitch to explain; it does, however, implement a re-entrant loop.

In effect, only one actual instruction is ever executed per cycle, and the index of what would normally be a C for-loop is instead incremented on each cycle, and that index shoved into a CSR (a global data structure in the simulator, see sv.h).

When that index reaches VL, that index is reset to zero and the PC ONLY THEN is allowed to increment.

This allows exceptions (context switches) to get a word in edgeways.

Hence I termed it "re-entrant". Think of how you would manually implement timeslicing for loops in a single-threaded application, you would need a bunch of state (one per timeslice) with the for loop indices. This situation is no different.

Explaining single-threaded timeslicing... well, it's a bitch. It's an FSM on steroids, basically, where the state machine is based on all the nested loop indices.
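The re-entrant FSM can be sketched like this (a minimal model of the behaviour described, not the actual spike code): one element issues per step, the element index lives in STATE, and the PC only advances when the index wraps, so a trap between any two steps can save {PC, STATE} and later resume mid-vector:

```python
class SVState:
    """The STATE CSR, reduced to the two fields that matter here."""
    def __init__(self, vl):
        self.vl = vl        # vector length; VL=1 degenerates to scalar
        self.srcoffs = 0    # element index (survives a context switch)

def step(state, pc, issue):
    """One issue 'cycle': execute element state.srcoffs of the op at pc.

    Returns the next PC.  The PC is held until the whole vector is done,
    so {pc, state} is sufficient context to restart after a trap.
    """
    issue(pc, state.srcoffs)
    state.srcoffs += 1
    if state.srcoffs == state.vl:
        state.srcoffs = 0   # loop finished: reset the index...
        return pc + 4       # ...and only now let the PC move on
    return pc               # mid-vector: same instruction next cycle
```

An interrupt taken between two calls to `step` simply context-switches `state` along with `pc`; on return the loop picks up at the saved element.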

> The VCP needs to be interruptible by external devices part way through
> a vector op, as well as exceptions (page faults in particular, as you note),
> and restartable afterwards, user or kernel mode.

Yes. Hence why that STATE CSR exists, and keeps track of the element index position, in two fields named srcoffs and destoffs. Don't ask me to explain twin predication btw :)

> Kernel must be able switch threads using the VCP at any point.

In supervisor mode, normal RV64GC operation switches the PC and the regfile. Here, for SV, you just add the STATE CSR to the list of stuff that needs context-switching.

As there are separate STATE CSRs for Supervisor Mode, User Mode and Machine Mode, you can even use vector operations *inside* each mode to save on instruction count, they do not affect each other.

> Kernel can deliver a user mode *nix signal that interrupts the VCP,
> and have it call a routine that uses the VCP, and restores it on return.
> (This is not as straight forward as it might seem at first look
> because it is the return to the interrupted routine that re-enables
> the VCP so the interrupted VCP operation restarts in its original
> program context and not the nested *nix signal one.

RISCV has sepc and uepc and mepc for the purposes of being able to flip to a new PC in each mode. Modifying uepc in Supervisor Mode will result in the PC being changed to uepc ONLY when the Supervisor trap exits back to User Mode.

SV STATE CSRs sSTATE, mSTATE and uSTATE operate a little differently but have the same result. Supervisor Mode can restore uSTATE and on exit from Supervisor Trap the uSTATE becomes "active".

So, yes, the context you mention *has* to include the STATE CSR. It represents the FSM which now no longer is just the PC, it is the PC PLUS the element index offsets.

> Maybe a branch-and-enable-VCP CSR instruction.)

Great care has to be taken when modifying the STATE CSR. It can completely alter the behaviour of the subsequent instruction. One use for that is to deliberately skip operations at the start of a vector. It is quite hair-raising, so not really recommended.


> The VCP CSR registers need to be readable and writable so their
> state can be save to the stack and restored afterward. E.G.


Yes absolutely.

> oldCsr1 = CsrRead1();
> oldCsr2 = CsrRead2();
> oldCsr3 = CsrRead3();
>
> CsrWrite1(newCsr1);
> CsrWrite2(newCsr2);
> CsrWrite3(newCsr3);
> nextPC = &Next;
> CsrWriteBEVCP (nextPC); // Branch and Enable VCP
>
> Next:
> // Restore and restart VCP
> CsrWrite1(oldCsr1);
> CsrWrite2(oldCsr2);
> CsrWrite3(oldCsr3);
> nextPC = retPC;
> CsrWriteBEVCP (nextPC); // Branch and Enable VCP
>
> or some such thing.

I am very reluctant to add new instructions to SV, or to alter existing RV behaviour. However I am interested to hear where you are going with this idea.

I *think* what you are in effect doing is merging two opcodes, Branch and CSRRW, into a single atomic instruction.

Normally that would in effect be exactly what would be done by a trap, through the RISCV uepc being set in Supervisor Mode.

Actually.... hang on.... argh now I know why uepc exists.

User mode traps can be triggered from user mode.

Thank you.

User mode traps are the direct equivalent of what you call BEVCP. Argh. I am going to have to think this through and come up with something.

I think it might be as simple as mirroring pc, uepc, sepc and mepc, with STATE, uSTATE, sSTATE and mSTATE, but I need time to think it through.

Thank you!

EricP

unread,
Jun 23, 2019, 12:39:43 PM6/23/19
to
lkcl wrote:
> On Saturday, June 22, 2019 at 9:15:28 AM UTC+8, BGB wrote:
>
>> Side note: This is why I did not add wide-execute predicated operations
>> to my ISA; there wasn't a good way to do this which would not adversely
>> effect a scalar implementation. Though, with my ISA design (and a
>> simplistic core), this does mean predicated sections would fall back to
>> scalar execution. This was seen as a reasonable trade-off though.
>
> Sigh, I do love OoO multi issue 6600 style augmented scoreboards.
>
> Predication may be done by throwing individual elements at Function Units, and having an extra predication FU similar to a branch FU that casts a "shadow" across all the FUs with elements that need a predicate bit.
>
> Once the Predication FU has read the predicate register (remember, that can be an immediate in which case no delay, or it could be from the regfile, this is no problem), all that the Predicate FU needs to do is:
>
> Shove the bits of the predicate into the "success" or "fail" wires of the shadow mechanism.
>
> Ta-daaa.

Not quite. These need to be interruptible.

In the 6600 style of predicate instruction, as in others,
the affected instructions (the "shadow") are specified by a mask on
the PRED instruction, which also calculates the predicate bit value.

That PRED instruction can retire but if an interrupt or exception
occurs inside the shadow, in both OoO and InO the set of enabled
and disabled non-retired shadow instructions must be restored upon REI
(but this time with the predicate bit value has been resolved).

Remember that the original PRED instruction retired but its
effects on the shadow live on. This restored state must be
communicated to the decoder so that it marks the previously
enabled shadow instructions as AlwaysExecute and previously
disabled ones as AlwaysSkip.

Also remember that predicated instructions that are skipped
are just logically skipped - they still have house keeping to do,
such as fiddling the scoreboard to wake up dependent instructions
stalled for various potential RAW etc conditions.

Also see below.

> Elements that were to be predicated out, these will die (remember, shadows stop them from committing). Elements allowed to proceed, the "success" signal drops the prohibition against commit.

I think you'll find this is an untenable position.
I think you have to deal with interruptible shadows.

For InO pipelines it would have to hold all shadow
instructions back from committing until the whole shadow finished,
which kinda defeats the purpose of the InO simplicity.
It has VAX like CISC effects like requiring the TLB have a
minimum size to handle translates for the worst case shadow.
In both InO and OoO cores it would force exceptions and
interrupts to be imprecise.
And affects interrupt latencies.

> In each case, the elements could actually be underway (already in progress), already have gotten their operands. This is fine because the shadow stops them writing.

This is the dependency pruning I referred to earlier,
and it depends on how one does this.

Lets take a simple example:

DIV r3 = r1/r2
LD r4 = [r5]
ADD r6 = r7 + r8
PRED r6, GT, b11, b10
MUL r9 = r3 * 1 // enable if r6 > 0
MUL r10 = r4 * 1 // enable if r6 <= 0

we can see that the PRED instruction and therefore the shadow
MUL instructions are dependent on r6 resolving. Once it does,
we know whether we need to read r3 or r4. Lets say PRED is True.
Then we can skip the r4 read and MUL, and are not dependent on r4's load.
In an InO pipeline we can't really take advantage of this
but in OoO we can.
If PRED is False, in both InO and OoO then we can skip r3 read,
MUL op and write back, and proceed to retire the first MUL,
which diddles the scoreboard/scheduler and may release an
instruction WAW stalled on r9.

So at a minimum we save register reads and MUL execute power.
At best we avoid some stalls.


lkcl

unread,
Jun 23, 2019, 1:29:06 PM6/23/19
to
On Monday, June 24, 2019 at 12:39:43 AM UTC+8, EricP wrote:


> Not quite. These need to be interruptible.
>
> In the 6600, as well as others, style predicate instruction,
> the affected instructions (the "shadow") are specified by a mask on
> the PRED instruction, which also calculates the predicate bit value.
>
> That PRED instruction can retire but if an interrupt or exception
> occurs inside the shadow, in both OoO and InO the set of enabled
> and disabled non-retired shadow instructions must be restored upon REI
> (but this time with the predicate bit value has been resolved).

Yes, I saw e.g. how x86 AVX512 does it. Element ops with no exception that also complete are allowed to actually modify the predicate, such that at the end of the operation, all bits of the predicate are zero.

This requires the predicate reg to have both a read *and* write dependency, which is where my tiny mind rebelled. It also wastes a full register as that reg has to be copied, copied, copied, because each friggin predicated op will DESTROY the damn thing.

Every vector op is now interleaved with a predicate mask register copy operation!

Noooo :)

Or you have to have special CSR predicate registers... just.. too messy, particularly in RV as they forgot to provide opcodes that treat CSRs as peers of the integer regfile.

In the previous post discussing STATE, it interacts very closely with predication.

I covered the method by which context switching occurs, but did not cover how predication interacts with it.

Let us say that an exception has occurred, in the middle of a predicated Vectorised operation. STATE CSR gets restored to where srcoffs=3 and VL=8, say.

All that happens is: the predicate register is re-read, however because srcoffs is now 3 (not zero), ONLY elements 3, 4, 5, 6 and 7 are issued to FUs this time, *NOT* the full 0 thru 7.

The predication FU is ALSO notified of this fact, and AUTOMATICALLY creates a transparent hidden mask that DROPS bits 0 thru 2 of the predicate register on the floor.

It ONLY sends success/fail shadow signals to FUs containing elements 3 to 7.

So this basically is how it is not necessary to use the predicate register as exposed state to handle exceptions. Instead, the srcoffs is used to create a hidden mask.

Actually it would need to involve a shift of the predicate, down by srcoffs. A little inconvenient as it is quite expensive to do a full 64 bit shift. There's probably ways round that.
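The hidden restart mask can be sketched as follows (my formulation of the behaviour described; the field names follow the spec). Note that if element k is always matched against predicate bit k, rather than bit k-srcoffs, a plain AND suffices and the expensive 64-bit shift goes away:

```python
def restart_predicate(pred, srcoffs, vl):
    """Sketch of the hidden mask applied on restart after a trap.

    Elements below srcoffs already retired before the trap, so their
    predicate bits are dropped on the floor; only bits srcoffs..vl-1
    remain live.  Mask-only, no shift.
    """
    live = ((1 << vl) - 1) & ~((1 << srcoffs) - 1)
    return pred & live
```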

L.

EricP

unread,
Jun 23, 2019, 2:14:18 PM6/23/19
to
lkcl wrote:
>
> Yes, I saw eg how x86 AVX512 does it. Element Ops with no exception that also complete are allowed to actually modify the predicate, such that at the end of the operation, all bits of the predicate are zero.

FYI, did you look at the ARM Scalable Vector Extension?
I just quickly scanned it but basically it looks like
variable length SIMD, from 128 to 2048 bits in 128 bit increments
(they use the term Vector to mean a SIMD element).
It supports predicates too.

ARM Scalable Vector Extension, 2018
https://arxiv.org/abs/1803.06185

That paper is from 2017 and ARM is on to SVE2 already
(I've not looked for info on SVE2).



lkcl

unread,
Jun 23, 2019, 2:47:31 PM6/23/19
to
On Monday, June 24, 2019 at 2:14:18 AM UTC+8, EricP wrote:
> lkcl wrote:
> >
> > Yes, I saw eg how x86 AVX512 does it. Element Ops with no exception that also complete are allowed to actually modify the predicate, such that at the end of the operation, all bits of the predicate are zero.
>
> FYI, did you look at the ARM Scalable Vector Extension?

Some interaction I've witnessed on the LLVM dev lists, as of course we crossed paths there.

> I just quickly scanned it but basically it looks like
> variable length SIMD, from 128 to 2048 bits in 128 bit increments
> (they use the term Vector to mean a SIMD element).
> It supports predicates too.
>
> ARM Scalable Vector Extension, 2018
> https://arxiv.org/abs/1803.06185
>
> That paper is from 2017 and ARM is on to SVE2 already
> (I've not looked for info on SVE2).

The first-fault system looks to be a really bad hack due to not properly thinking through how OoO works. If they're trying to drop speculative execution onto an in-order design without a proper precise scoreboard or a multi-issue variant of Tomasulo, they are in for a world of pain.

How come people working for ARM get offered six figure salaries and come up with substandard designs? Oh wait, it's proprietary, they're not permitted to interact with competent people in order to get good peer review feedback, whoops...

MitchAlsup

unread,
Jun 23, 2019, 7:10:38 PM6/23/19
to
1-wide machine::
Since My 66000 has an instruction buffer (16 instructions) and since it takes 320 gates to capture a single 32-bit instruction, adding 40 gates
to decode the length, and where the immediate an displacements happen to
be in the IB, is such a low expense; I perform these decodes and also
scan ahead of the current decode position looking for branches. If I find
a branch, I will fetch it from I$ and place it in the IB so that if the
branch is taken I can decode instructions on the taken path in the following
cycle. I am willing to prefetch 2 branches in order not to lose too much
of the IB while minimizing branch taken latency.

2-3-wide machine
The IB will be bigger (24 entry) and a predictor will guide target prefetch
and issue (subject to mispredict recovery)

> More interlocks and forwarding (if I extend later stages).

Absolutely unavoidable.
>
> Figuring out length in IF made it easier to advance PC than if had I
> done it in D1, since there is a 1c delay between giving it a PC address,
> and getting the results for said PC address.

Here is one of those areas where keeping the IP in unary as it accesses
instructions from IB saves logic and mental cruelty.
So you have an instruction arrive that you have not seen before.
you decode it, but you have to be reading the RF or FRF as you are
deciding if the inst is INT or FP, so, in effect, you are wasting
2 RF ports and you still need another write port at the end of the pipe.
>
> I had evaluated possibilities for merging them, but noted it wouldn't
> save all that much ISA-wise; could either save resources or make timing
> harder. Changing this would be a pretty big breaking-change though.

Where it saves is MemoryRef instructions.
Another place it saves if call/return when you cross type punning boundaries or happen to use *printf facilities.
>
> In the Lite profile, FPU is optional and FP arguments would be passed in
> GPRs in the C ABI either way.

Two different meanings, based on what is configured.........
>
>
> >>
> >> Similarly, the width of the results from the 32-bit integer multiplier
> >> are wider than those from the DP multiplier.
> >
> > It is an interesting trade offs. 53×53 and only perform FMUL and FMAC OR
> > 57×57 with FDIV and FSQRT OR 64×64 with IMUL and IDIV and REM.
> > We can all agree that if we have a int multiplier 64×64 we can perform all of the calculations
> > We can all agree that 53×53 is just over 50% of 64×64
> > We should be able to agree that there are external factors pushing a design in either direction.
>
> Yeah.
>
> 54*54->54: 6 DSPs (high result, discard low bits of result)
> 32*32->64: 4 DSPs (full-width, 64b output)
> 64*64->128: 16 DSPs
>
> A direct 64*64->128 multiplier is a bit steep (both expensive and slow).

Technically, it should have the same fall-through gate count as a 53×53
multiplier: Ceil(ln2(bits))-1.
Instead of that, I would make a 16×64 multiplier and pump it 2× per cycle
(ala 360/91)
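The pumped-array idea can be modelled like this (my sketch of the scheme, not the 360/91's or My 66000's actual datapath): one 16×64 partial product per pass, four passes, accumulated with a shift; at two passes per cycle the full 64×64 product takes roughly two cycles of array use:

```python
def imul64_pumped(a, b):
    """64x64 -> 128-bit product via a single 16x64 array used four times.

    Each loop iteration is one pass: multiply one 16-bit limb of 'a'
    by all 64 bits of 'b', shift into place, accumulate.
    """
    acc = 0
    for i in range(4):
        limb = (a >> (16 * i)) & 0xFFFF
        acc += (limb * b) << (16 * i)   # the 16x64 array + accumulator
    return acc
```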
>
> So, the sub-multipliers could be fed different inputs, and the adder
> chains would be assembled differently.
>
> DSP cost would be 10 DSPs (so is the same here), however the cost in
> terms of LUTs and latency would likely be a little worse (due mostly to
> the more complex adder chains needed).
>
>
> The amount of additional plumbing between the integer unit and FPU could
> possibly also be problematic (vs the FPU using its own multiplier).

On the other hand, having a 3 (or 4) cycle integer multiply gets rid of
much of the need to perform loop induction stuff leading to memory
addressing.
>
>
> >>>>
> >>>> With superscalar, it is necessary either to use a more complicated
> >>>> organization (no longer a simple scalar pipeline, eg, needs a more
> >>>> obvious split between the decode and execute stages).
> >>>>
> >>>>
> >>>> If we know in advance, the pipeline can look just like a normal scalar
> >>>> pipeline, just with a few extra parts glued on.
> >>>
> >>> YEs.
> >>>>
> >>>> Main costs are mostly the alternate execute unit, additional register
> >>>> ports, and the added pipeline machinery.
> >>>
> >>> One should note:
> >>> Branch unit probably does not need a RF Write port 97% of the time.
> >>> Store does not ever need an RF write port--and here is where I generally
> >>> steal a RF Read port when an instruction either consumes a constant or
> >>> is a 1-operand instruction (branch for example).
> >>>
> >>> So one should be able to get to ~1.3 IPC with a 3R1W RF; even before
> >>> considering forwarding of results back as operands.
> >>
> >> A 2-wide WEX was presumed to use 4R2W:
> >> 2 read ports per lane, 1 write port per lane.
> >> Ops which need 3R would effectively disallow the second lane.
> >
> > Even if that lane was a 1-op plus constant?
> >
>
> Yes, generally.
>
> It is possible it could be made to work, but would require being more
> clever, and only work for certain ops (depending on which resources they
> use), ...

If your IF/D1/D2 stages are properly proportioned, the logic will add no
delay.
Note: I was describing instructions residing in some kind of cache where
the parallelism is never lost and sequentiality is never imposed. If you
lose either, you don't do this.

Bruce Hoult

Jun 24, 2019, 4:53:49 PM
SVE2 already?

The only people known to have implemented SVE are Fujitsu. ARM hasn't in any of their cores, not even the new A55 and A76.

EricP

Jun 24, 2019, 11:06:18 PM
Yeah, I saw a reference in May on RealWorld to
SVE2 and ARM's Transactional Memory Extension (TME)

https://www.realworldtech.com/forum/?threadid=184919&curpostid=184919

It was TME I was interested in but the scuttlebutt has it
that it is pretty much the same as Intel HLE/RTM.
So... probably not much use.

ARM doesn't seem to have any info available on either,
but LLVM and GCC were apparently updated for SVE2 and TME
so someone has docs.

lkcl

Jun 25, 2019, 1:46:59 AM
On Sunday, June 23, 2019 at 11:18:45 PM UTC+8, EricP wrote:

> Maybe a branch-and-enable-VCP CSR instruction.)
>
> The VCP CSR registers need to be readable and writable so their
> state can be saved to the stack and restored afterward. E.G.
>
> oldCsr1 = CsrRead1();
> oldCsr2 = CsrRead2();
> oldCsr3 = CsrRead3();
>
> CsrWrite1(newCsr1);
> CsrWrite2(newCsr2);
> CsrWrite3(newCsr3);
> nextPC = &Next;
> CsrWriteBEVCP (nextPC); // Branch and Enable VCP

https://libre-riscv.org/simple_v_extension/specification/#csrs

U Mode traps have the atomic effect, on entry and exit, of doing a branch and swap of the PC and some of the CSRs.

That prompted me to do a walkthrough of the Vectorisation Context STATE CSRs and I realised that a swap is needed on entry and exit to and from each privilege level.

So a User Mode trap swaps state with ueSTATE, trap is happy, doing its thing.
Next thing you know, Supervisor Mode kicks in and swaps STATE with seSTATE.

S Mode does its thing (possibly even context switching by writing a new seSTATE as well as the other goodies like the regfile and sepc) and on exit, seSTATE gets swapped with STATE again and the new context begins where it left off.
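A toy Python model of the swap semantics described above (CSR names follow the SV spec; the dict-of-CSRs representation is purely illustrative):

```python
def trap_swap(csrs, priv):
    """Swap STATE with <priv>eSTATE on trap entry; because a swap is its
    own inverse, the identical operation runs again on trap exit."""
    key = priv + 'eSTATE'            # 'u' -> ueSTATE, 's' -> seSTATE
    csrs['STATE'], csrs[key] = csrs[key], csrs['STATE']
```

Running the entry swaps in one order and the exit swaps in reverse order restores the original context, which is the nested-hierarchical property being claimed.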

Everything is hunky dory, even if three nested hierarchical swaps occur in the middle of different priv levels.

That's the theory, at least.

L.

lkcl

Jun 25, 2019, 2:23:40 AM
On Monday, June 24, 2019 at 12:23:16 AM UTC+8, lkcl wrote:

> Don't ask me to explain twin predication btw :)

Aw Eric, you didn't notice and ask "so what's twin predication then?" :)

Where most vector systems have separate instructions for VINSERT, VSPLAT, VGATHER, VINDEX, VSCATTER and so on, these are all gone and replaced by twin predication.

It works by having separate element indices and separate predicate masks on *both* the src and destination elements, set to "skip" mode (i.e. both must be set not to perform zeroing of masked-out elements).

Also, important to note: each of src and dest may be set to scalar or vector.

If scalar-src vector-dest is set we have VSPLAT.

If a single bit on the dest vector is set we have VINSERT.

If vector-src scalar-dest is set, the first bit of the src predicate ends the process and we have VSELECT.

If both are vector and both have predicate masks we have vector variants of bit-scatter/gather including being able to copy part-vectors from any arbitrary start position to any arbitrary offset in the destination.

It is almost comically powerful and covers a mode I have never seen in published literature on vector systems.

All this only applies to ops with 1 src and 1 dest, so a MV operation would be the usual candidate, however it works just as well on LD/ST and on FCVT (convert of FP32 to FP64 for example) and on INT sign-extend operations.
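For a MV-style op (1 src, 1 dest), the behaviour described above can be sketched in Python roughly as follows (a simplified reading of the scheme, not the normative SV pseudocode; element widths and offsets are omitted):

```python
def twin_pred_mv(src, dst, psrc, pdst, src_vec, dst_vec, vl):
    # "skip" (non-zeroing) mode: masked-out elements on EITHER side are
    # stepped over rather than written as zero.
    s = d = 0
    for _ in range(vl):
        while src_vec and s < vl and not psrc[s]:
            s += 1                   # skip masked-out source elements
        while dst_vec and d < vl and not pdst[d]:
            d += 1                   # skip masked-out dest elements
        if s >= vl or d >= vl:
            break
        dst[d] = src[s]
        if not dst_vec:
            break                    # scalar dest: first hit ends it (VSELECT)
        s += src_vec                 # scalar src never advances (VSPLAT)
        d += 1
    return dst
```

With src_vec=False you get VSPLAT, with dst_vec=False you get VSELECT, and with both set to vector plus sparse masks you get the part-vector scatter/gather-style copies.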

All this by using *standard scalar operations and saying hey you get to be vectorised using hidden state*.

*literally* not a single new opcode is added and yet 98% of RVV's opcodes are covered!

The only ones not provided are where functionality was not added to standard RISCV, either left out or because it did not make sense.

Dot product for example.

MV.X which is rfile(rd) = rfile(rfile(rs)) is not in standard RISCV so we have to add it first, before we can get the functionality of VINDEX.

However the equivalent indirection is in LD/ST so we *can* mark the srcbase as a vector and get LD.X and ST.X including twin predication.

FCLIP doesn't exist in RISCV scalar so we would have to add it first.

We therefore have the whole of the RVV opcode space free to use for operations that just cannot be done for Video and 3D GPUs by traditional Vector Engines, as Nyuzi (and Larrabee) showed.

Texturisation, Z buffering, YUV to ARGB conversion, 4x FP32 QUAD to ARGB8888 which in Vulkan is *not* a straight linear conversion, it needs interpolation, extremely expensive if using straight computational general purpose Vector Processing.

L.


EricP

Jun 25, 2019, 12:46:15 PM
lkcl wrote:
> On Sunday, June 23, 2019 at 11:18:45 PM UTC+8, EricP wrote:
>
>> The VCP CSR registers need to be readable and writable so their
>> state can be saved to the stack and restored afterward. E.G.
>
>
> Yes absolutely.
>
>> oldCsr1 = CsrRead1();
>> oldCsr2 = CsrRead2();
>> oldCsr3 = CsrRead3();
>>
>> CsrWrite1(newCsr1);
>> CsrWrite2(newCsr2);
>> CsrWrite3(newCsr3);
>> nextPC = &Next;
>> CsrWriteBEVCP (nextPC); // Branch and Enable VCP
>>
>> Next:
>> // Restore and restart VCP
>> CsrWrite1(oldCsr1);
>> CsrWrite2(oldCsr2);
>> CsrWrite3(oldCsr3);
>> nextPC = retPC;
>> CsrWriteBEVCP (nextPC); // Branch and Enable VCP
>>
>> or some such thing.
>
> I am very reluctant to add new instructions to SV, or to alter existing RV behaviour. However I am interested to hear where you are going with this idea.

I understand. This was mostly me thinking about co-processors.

I was thinking you were describing an asynchronous co-processor,
kinda like the 8087 used to be for long running transcendentals,
except it deals with vectors and as such would have to be able
to halt and restart on a single vector element.

And since I think the future is co-processors of all shapes and sizes,
image processors, speech, vectors, etc, I was musing out loud
how one might tie in such a brand X CP in a generic manner.

And one doesn't want to have to write code that knows about all
other CP's, just the one you want to talk to.

> I *think* what you are in effect doing is merging two opcodes, Branch and CSRRW, into a single atomic instruction.

Here I'm wondering how one can return from a user mode
signal handler to the interrupted user mode routine,
without a trip through the kernel, and reenable all co-processors
that were halted when the kernel was interrupted to deliver the signal,
all without knowing which xCP's are present.

So I'm thinking a CSR that can be written with a branch address
in user mode that, as a side effect, re-enables all hardware CP's
and then stuffs the address into the PC.

It's not quite right though as it has to restore saved registers too.

Just tossing spaghetti at the wall to see what sticks.

> Normally that would in effect be exactly what would be done by a trap, through the RISCV uepc being set in Supervisor Mode.

Yes, but the twist was whether it could be done without a trip
through the kernel, and without software knowing exactly what
hardware CP's are attached.



lkcl

Jun 25, 2019, 2:41:39 PM
On Tuesday, June 25, 2019 at 5:46:15 PM UTC+1, EricP wrote:

> > I am very reluctant to add new instructions to SV, or to alter existing RV behaviour. However I am interested to hear where you are going with this idea.
>
> I understand. This was mostly me thinking about co-processors.

funnily enough, in the simulation (and compiling to run on FPGAs) this
processor is going to be so massive, i was considering how to split ALUs
out to separate FPGAs, effectively *as* co-processors.

> I was thinking you were describing an asynchronous co-processor,
> kinda like the 8087 used to be for long running transcendentals,
> except it deals with vectors and as such would have to be able
> to halt and restart on a single vector element.

the idea i was toying with - which is not new, i admit - was just
to push operand data across AXI4 buses, transported over ISO Layer 2
ethernet (raw packets basically).

> And since I think the future is co-processors of all shapes and sizes,
> image processors, speech, vectors, etc, I was musing out loud
> how one might tie in such a brand X CP in a generic manner.

OmniXtend is supposed to be designed to do that, and at a much larger
level, there's Gen-Z and OpenPiton. they all kiinda do the same thing?
except at different protocol levels. Gen-Z, the speeds are mad.

> And one doesn't want to have to write code that knows about all
> other CP's, just the one you want to talk to.
>
> > I *think* what you are in effect doing is merging two opcodes, Branch and CSRRW, into a single atomic instruction.
>
> Here I'm wondering how one can return from a user mode
> signal handler to the interrupted user mode routine,
> without a trip through the kernel, and reenable all co-processors
> that were halted when the kernel was interrupted to deliver the signal,
> all without knowing which xCP's are present.

well, in the 6600-like scoreboard-based design, the Computation Units only have to tell the scoreboard "i'm done, now". there's no *actual* obligation to complete in a set time.

therefore, both hypothetically (and practically) speaking, there's nothing to stop just farming the data out - transparently as far as anyone knows - over any kind of bus you can envisage.

if latency sucks, then, well, you should have designed a better bus and given it more bandwidth :)

> So I'm thinking a CSR that can be written with a branch address
> in user mode that, as a side effect, re-enables all hardware CP's
> and then stuffs the address into the PC.

am trying to get my head round this...

> Its not quite right though as it has to restore saved registers too.

indeed.

>
> Just tossing spaghetti at the wall to see what sticks.

...the sauce? splut :)

> > Normally that would in effect be exactly what would be done by a trap, through the RISCV uepc being set in Supervisor Mode.
>
> Yes, but the twist was whether it could be done without a trip
> through the kernel, and without software knowing exactly what
> hardware CP's are attached.

RV has the concept of a U-Mode trap, completely separate and distinct from S-Mode (and M and Hypervisor Modes).

it's an interesting one, that: it would typically be used by e.g. sigsetjump / siglongjump, or user-mode timers, or c++ threads, and POSIX thread cancellation, and all sorts of weird-and-wonderful stuff, as well as system calls. or, at least, i *think* that's what it's for.

if it isn't, then, well, the point is at least: you _can_ design a trap handler that's specifically for user-mode applications, in which you get the opportunity to do atomic stuff. U-Mode traps have a flag that, when set, doesn't allow any other interrupts (not in U-Mode, certainly).

honestly though, my feeling is that (and it had better damn well return within a set time) as long as the latency isn't unreasonable, and as long as there's 100% guaranteed no possibility of the Bus going "err could you repeat that cos i kinda wasn't listening, y'ken?", hooking into the asynchronous nature of Computation Units seems to me to be a reasonable way to do it.

DMA (memory-mapping) is the "usual" way to deal with co-processors (typically I/O, but not always), of course that doesn't actually hook into the *instruction* set...

l.

nedbrek

Jun 25, 2019, 8:59:59 PM
I'm late to the party, but has anyone looked at how loop unrolling and software pipelining worked on Itanium?

I know Itanium is taboo, but it was very compact code. You could probably fit a lot of vector algorithms into a small loop buffer.

MitchAlsup

Jun 25, 2019, 10:28:32 PM
On Tuesday, June 25, 2019 at 7:59:59 PM UTC-5, nedbrek wrote:
> I'm late to the party, but has anyone looked at how loop unrolling and software pipelining worked on Itanium?

Some will say rather well; others say it took way too much HW and SW and delivered too little.

Bruce Hoult

Jun 25, 2019, 11:18:39 PM
On Tuesday, June 25, 2019 at 5:59:59 PM UTC-7, nedbrek wrote:
> I'm late to the party, but has anyone looked at how loop unrolling and software pipelining worked on Itanium?
>
> I know Itanium is taboo, but it was very compact code. You could probably fit a lot of vector algorithms into a small loop buffer.

Itanium may be ok on vector algorithms, but on scalar code it has by far the largest code size of anything near the mainstream (anything supported by the Linux kernel, say) -- far bigger even than Alpha.

See historic posts in this group with code size comparisons for various ISAs on, IIRC, gzip and grep (or bash?)

nedbrek

Jun 25, 2019, 11:49:13 PM
Yes, a large part of the code bloat is due to NOPs. Definitely would not want to follow that part.

lkcl

Jun 26, 2019, 12:11:38 AM
On Wednesday, June 26, 2019 at 4:49:13 AM UTC+1, nedbrek wrote:
> I'm late to the party,

hello, welcome :)

> Yes, a large part of the code bloat is due to NOPs.
> Definitely would not want to follow that part.

no, definitely not, although now that's mentioned i can perhaps deduce what they did: stacked lots of operations into a regular (fixed, large) block format?

mixing shorter ops with longer ones quickly becomes hell-on-earth. got an exception (and only a PC)? no problem, let's just take the trap, store the PC... and now the trap exits, back to where things left off.

so. start executing... and... um... where's the beginning of the instruction?

i mentioned this right at the beginning of the thread: the processor has to do a mad, desperate scramble "where's ma damn opcode start from?? argh!".

so this is probably why itanium went with a fixed (large) block format, because at least then you know for certain where the frickin instruction starts.

unfortunately, there, you now have (i assume) the potential for *multiple* exceptions occurring in an instruction?

um... which one goes where?

the solution i came up with is quite simple: an extra Program Counter (PCVBLK - Program Counter in Vector Blocks) that has to have its own suite of (x)ePCVBLK that must be treated identically to (x)epc - the exception program counter. (x = u / m / s / h for user, machine, supervisor, hypervisor)

it's basically a sub-program-counter. the main PC *pauses* - remains pointing at the start of the *LARGE* block. the sub-PC wanders happily through the sub-operations. even does loops (relative to the *block*), and, hypothetically, if they can fit, even mini function calls, w00t!

any exception now returns with *TWO* pieces of information:

* the main PC (which, surpriiiise, points at the beginning of the VBLOCK)
* the sub-PC PCVBLK (which, surpriiise, points at the same operation in the block as was interrupted by the trap).

sub-execution can then preserve instruction order just as main-execution can preserve instruction order.
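As a toy model (names from the post; everything else illustrative), the twin-counter resume behaviour looks like:

```python
def run_vblock(ops, pcvblk=0, trap_at=None):
    """Walk a VBLOCK's sub-operations with PCVBLK while the main PC is
    parked at the block header; a trap hands back the saved sub-PC."""
    done = []
    while pcvblk < len(ops):
        if pcvblk == trap_at:
            return done, pcvblk      # trap: caller saves (PC, PCVBLK)
        done.append(ops[pcvblk])
        pcvblk += 1
    return done, None                # block complete, main PC advances
```

Re-entering with the saved PCVBLK replays exactly the remaining sub-operations, which is what preserves instruction order across the trap.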

and in the process, all of the problems normally associated with VLIW disappear.

l.

Terje Mathisen

Jun 26, 2019, 1:35:46 AM
I agree:

When I first read the Itanium asm/architecture manuals I did work out a
few of those algorithms, and they did end up with quite nice code:
Zero unrolling, no separate startup/exit code.

The only problems were (a) it arrived at least two years too late, and
with lower core speeds than originally announced and (b) it cost far too
much.

nedbrek

Jun 26, 2019, 9:07:59 AM
They blew up the instruction size by having 128 registers (7 bits per register specifier) and 64 predicates (6 bit overhead) - combined with fully specified FMAC (1 dest, 3 sources) - 7*4+6=34 oops.
They worked around it by bundling 3 ops into 128 bits (41x3, leaving 5 bits for "template" - which was going to include dependency information).

Then some genius decided to "help" dispatch by including routing information in the 5 bit template.

Unfortunately, 32 templates is not enough to encode all the possible instruction mixes (especially taking into account dependencies, and they left a lot of them reserved - since no one had any idea what would be common).

So, you can have MIB (memory, integer, branch), but no bundle with memory toward the end (no IIM).

The lack of templates leads to nops (if you want IIM, you need nop.m,I,I and M--). About 30% on integer code, and up to 50% on float.

Exception recovery was easy enough. The bundles are 16 byte aligned, so IP[3:0] is always 0. You can stash the instruction offset in 2 bits (the exception registers also include the instruction offset). Of course, you can't directly jump into the middle of a bundle, leading to more padding at branch targets...
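The numbers above (5-bit template + 3×41-bit slots = 128 bits) can be checked with a few lines of Python; the exact bit positions used here are an assumption for illustration:

```python
def decode_bundle(bundle):
    """Split a 128-bit Itanium-style bundle (as an int) into its 5-bit
    template and three 41-bit instruction slots."""
    template = bundle & 0x1F                      # bits 0..4
    mask41 = (1 << 41) - 1
    slots = [(bundle >> (5 + 41 * i)) & mask41 for i in range(3)]
    return template, slots
```

Because bundles are 16-byte aligned, IP[3:0] is free, and two of those bits suffice to name the slot taking the exception, exactly as described.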

lkcl

Jun 26, 2019, 10:32:36 AM
On Wednesday, June 26, 2019 at 2:07:59 PM UTC+1, nedbrek wrote:

> They blew up the instruction size by having 128 registers (7 bits
> per register specifier) and 64 predicates (6 bit overhead)
> - combined with fully specified FMAC (1 dest, 3 sources) - 7*4+6=34 oops.

:)

EricP mentioned earlier, after Ivan said "soo what ISA would you design
if not shackled by RISC-V?", that the scheme i then described was
reminiscent of the LZ4 compression algorithm! which was kinda a cool
surprise.

it does beg the question, though: if a 20-year-old student can come up
with an ISA that looks like a compression algorithm, why the hell can't
Intel, funded as it is with multi-billions of dollars?? moo??

> They worked around it by bundling 3 ops into 128 bits (41x3, leaving 5
> bits for "template" - which was going to include dependency information).

aw nooo...

> Then some genius decided to "help" dispatch by including routing
> information in the 5 bit template.
>
> Unfortunately, 32 templates is not enough to encode all the possible
> instruction mixes (especially taking into account dependencies, and
> they left a lot of them reserved - since no one had any idea what
> would be common).

y'know... i do like some of these architectures, particularly the
ones that are well-funded. they serve as high-impact extremely
well-documented examples of How Not To Design An ISA.

> The lack of templates leads to nops (if you want IIM, you need
> nop.m,I,I and M--). About 30% on integer code, and up to 50% on float.

ok i thought i understood "templates" above: now i don't. what are
they, and why would they help reduce NOP proliferation?

> Exception recovery was easy enough. The bundles are 16 byte
> aligned, so IP[3:0] is always 0.

yehyeh. so the PC may always point to the start of the bundle.

> You can stash the instruction offset in 2 bits (the exception
> registers also include the instruction offset).

ok, that i like. it's kinda similar to the "sub-PC". and at least
with the regular PC...

> Of course, you can't directly jump into the middle of a
> bundle, leading to more padding at branch targets...

*sigh* at least with the variable instruction encoding, there's no
padding with NOPs. it is very weird however to have a sort-of "sub-PC",
that's the only down-side.

also i'm not really sure what to do about the encoding being 80 to 192
bits, where there's at one end the possibility of the spectrum, NOPs
*might* be needed, and at the other, you want to jam in as many 16/32-bit
opcodes as you can possibly get because the "Vector Context" could
take up (16 + 16 + 4*16) i.e. 96 of those 192 bits (!!) leaving
only 6x 16-bit "things" so you could actually only get *three* 32-bit
opcodes or 6x 16-bit RVC ones in at the end of all the context.

[i created an 8-bit format for the Vector Context blocks, to help reduce
the pressure there. it's limited in capability, though]

as a "backup" - and also to reduce pressure on compiler writers, we
have a "single-instruction-can-be-prefixed" scheme, where i kinda steered
the author (Jacob - hi if you're reading this!) so that anything in the
"SVPrefix" form could be post-processed, its prefix removed and
"converted" to the "VBLOCK" form, and the *un*-prefixed (standard, RV)
opcode dropped into one of the blocks.

this would not need the compiler itself to know about VBLOCK: although
there would be a loss of opportunity for certain optimisations, the
conversion from "SVPrefix" form to "VBLOCK" form could even be done by
a static offline *binary* conversion tool, not even a need to get in
at the asm stage (gcc -s).

branches would need to be considered "off limits" for such a tool,
unless the branch was so short and it was guaranteed to be known that
there was no jump coming *in* to the point that was to be VBLOCKed.
this would make it perhaps better to put the conversion in at the asm
stage, i don't know: never done anything like this before.

anyway, having the "SVprefix" scheme separate from VBLOCK means that
compilers can insert vectorisation context at the point it's needed
("i want register x80 and i want it as a vector and i wanna predicate
'an'na pink pony too") and neither LLVM nor gcc need to know about
VBLOCK.

l.

John Dallman

Jun 26, 2019, 11:05:33 AM
In article <c3504ad0-cb89-4410...@googlegroups.com>,
lk...@lkcl.net (lkcl) wrote:

> it does beg the question, though: if a 20-year-old student can come
> up with an ISA that looks like a compression algorithm, why the hell
> can't Intel, funded as it is with multi-billions of dollars?? moo??

For Itanium, the design work was happening over twenty years ago, and I
have the impression that a lot of the people involved had gained their
experience decades before then. The architecture presents alternative
solutions to the problems that were solved by OoO execution, but appeared
significantly later.

John

Stephen Fuld

Jun 26, 2019, 11:28:37 AM
On 6/26/2019 7:32 AM, lkcl wrote:
> On Wednesday, June 26, 2019 at 2:07:59 PM UTC+1, nedbrek wrote:

snip

>> The lack of templates leads to nops (if you want IIM, you need
>> nop.m,I,I and M--). About 30% on integer code, and up to 50% on float.
>
> ok i thought i understood "templates" above: now i don't. what are
> they, and why would they help reduce NOP proliferation?

The templates were a set of five bits that told the dispatch logic to
which FU each of the three instructions in the bundle was to be sent.
i.e. it might say that the first instruction was to be sent to the
floating point FU. In theory, this is nice in that it reduces that
decision to only five bits for three instructions from the entire (or
much of the) op code for each of the instructions in the bundle.

It did not reduce NOP proliferation. On the contrary, it increased it.
This was due to the fact that five bits can't encode all the
possibilities for routing each of the three instructions in the bundle.
For cases where there wasn't an available template choice, you had to
split the instructions into two bundles, and fill the slots that would
have been used for useful instructions with nops.

Hope this helps.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

lkcl

Jun 26, 2019, 12:53:45 PM
On Wednesday, June 26, 2019 at 4:28:37 PM UTC+1, Stephen Fuld wrote:
> > ok i thought i understood "templates" above: now i don't. what are
> > they, and why would they help reduce NOP proliferation?
>
> The templates were a set of five bits that told the dispatch logic to
> which FU each of the three instructions in the bundle was to be sent.
> i.e. it might say that the first instruction was to be sent to the
> floating point FU. In theory, this is nice in that it reduces that
> decision to only five bits for three instructions from the entire (or
> much of the) op code for each of the instructions in the bundle.

ok gotitgotit, thank you stephen.

> Hope this helps.

yes, very much, thank you.

l.

lkcl

Jun 26, 2019, 12:55:38 PM
On Wednesday, June 26, 2019 at 4:05:33 PM UTC+1, John Dallman wrote:

> For Itanium, the design work was happening over twenty years ago, and I
> have the impression that a lot of the people involved had gained their
> experience decades before then. The architecture presents alternative
> solutions to the problems that were solved by OoO execution, but appeared
> significantly later.

thank you, john. in fairness to the itanium team, the design i came up
with all that time ago was also originally VLIW-like, and would have had
similar compiler-related issues.

l.

lkcl

Jun 26, 2019, 3:35:27 PM
I just realised, as you do, that this VLIW-like format isn't VLIW at all, it's a context change.

I was thinking how insane it would have to be, to load the entire 192 bit block into a buffer, possibly even 2 of them, before being able to begin processing the instructions.

I suddenly realised that is total nonsense.

All that is needed is to recognise the long header 0b1111111 and to tell the instruction issue FSM "hello we are in VBLOCK interpreting mode, now, start incrementing PCVBLK and use it to determine where we are in the FSM, y'ken"

Once in VBLOCK "mode", the next "instruction" to expect would be the various Vectorisation Context Blocks.

Once those have been read, finally "standard RV opcode mode" can be engaged, this time however still incrementing the PCVBLK not the main PC, and of course making sure the engine knows it is in VBLOCK mode.

This makes the whole show much less VLIW-like, it really is just a context system, yet, thank goodness, the above idea is a single-level FSM, no need to have massive buffers and double levels of instruction queue.

It reminds me more of a Turing Machine than anything.

L.

MitchAlsup

Jun 26, 2019, 5:09:26 PM
On Wednesday, June 26, 2019 at 9:32:36 AM UTC-5, lkcl wrote:
> On Wednesday, June 26, 2019 at 2:07:59 PM UTC+1, nedbrek wrote:
>
> > They blew up the instruction size by having 128 registers (7 bits
> > per register specifier) and 64 predicates (6 bit overhead)
> > - combined with fully specified FMAC (1 dest, 3 sources) - 7*4+6=34 oops.
>
> :)
>
> EricP mentioned earlier, after Ivan said "soo what ISA would you design
> if not shackled by RISC-V?",

I designed an industrial quality ISA (My 66000) that has most of the
properties of RISC architectures, but includes variable length extension
to provide immediates and displacements (including storing of an immediate
into memory using a large displacement).

Several people have asked to see this 230-odd page document and I have
graciously allowed them.

What industrial quality includes:
a) a means for ATOMIC events that allows for multiple memory locations to
participate in a single ATOMIC event with moderate guarantees of forward
progress even under withering interference.
b) full feature FP including:
1) FRACT, EXPON, EADD which allow tearing apart and putting together FP
2) ability to control rounding modes in any conversion between FP
and integer types
3) full feature set of high performance transcendental instructions
c) full feature bit manipulation: EXT, INS
d) no supervisory state
e) inherent parallelism
1) multithreaded
2) multicore
3) remote thread control
4) no privileged instructions
f) Vectorization
g) predication
h) blah blah blah...

Implementable as small as 1-wide or as big as you can figure out.

The direct support of immediates and displacements removes about 9% of the
instructions from a typical RISC instruction stream.

Bruce Hoult

Jun 26, 2019, 6:53:42 PM
I got a copy of this from you three or four years ago. (I'd be delighted to get an update if there have been any significant changes) That was back before I was working on SGPU which appears to have inherited some of the features.

I've always thought it's a brilliant and very interesting design, and should be implemented. Do you have any plans to do so? Is there RTL for any implementation? Simulator? Compiler?


The thing that has always impressed me so much with what (now my bosses) Andrew and Yunsup and Krste did in 2010 to 2014 (and continuing now) is a tight and rapid iteration:

1) do some ISA design
2) implement in simulator
3) implement in gcc & binutils
4) build and run real software, measure code size and instructions
5) iterate to 1)
6) implement in RTL
7) put RTL on FPGAs
8) run real software cycle-accurate
9) iterate to 1)
10) tape out a chip on a shuttle run
11) run real software
12) iterate to 1)

That was all done and dusted for the base ISA a few years ago with current iterations being on uarch, but the full nested iteration cycle is going on as we speak with for example the Vector extension and the much less ambitious BitManipulation extension, both of which I'm involved with.

Software adopted an agile, small iterations, have something runnable/shipable (though maybe feature poor) at all times approach two decades ago.

Why is hardware still done "waterfall"?

Anton Ertl

Jun 27, 2019, 2:52:06 AM
Bruce Hoult <bruce...@gmail.com> writes:
>Software adopted an agile, small iterations, have something runnable/shipable (though maybe feature poor) at all times approach two decades ago.
>
>Why is hardware still done "waterfall"?

Is it? I have read the claim that for HPPA (released 1986) no
instruction was included unless performance gain of IIRC 1% could be
demonstrated. And they used simulators for that. And I am sure they
were not the only ones, although I consider instruction count a flawed
cost metric (nonetheless, Patterson's recent papers put a lot of
weight on instruction counts).

And if you look at Intel and AMD, they started out with the 8086 41
years ago (or maybe even the assembly-level-compatible 8080 45 years
ago), always had something shippable, and extended it over many
generations to the mess they have now; but the mess performs well.
And I am sure that they used simulations etc. to evaluate the
extensions when they made them.

On the implementation side, HP pioneered the approach of having a
long-lived microarchitecture with the PA-8000 series. Intel followed
that with the P6-based microarchitectures, first released in 1995,
with the main line being based on that in 2019 (although I guess with
the constant rebuilding they are doing the current iteration is quite
different from the original).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

Jun 27, 2019, 3:19:40 AM
Anton Ertl wrote:
> And if you look at Intel and AMD, they started out with the 8086 41
> years ago (or maybe even the assembly-level-compatible 8080 45 years
> ago), always had something shippable, and extended it over many
> generations to the mess they have now; but the mess performs well.
> And I am sure that they used simulations etc. to evaluate the
> extensions when they made them.

The use of simulation to remove instructions from the fast core really
bit them on the P4:

It looks like they did independent simulations of shifts and muls, and
decided to make both significantly slower. This really hurt all forms of
array addressing since you now could not use shift&add as a replacement
for simple constant muls. :-(

OTOH, this aberration had other problems as well, primarily heat afaik,
so it only lasted for a single generation before we got back to Core,
which looks and smells a lot like the original P6/PentiumPro Andy Glew
worked on.

lkcl

Jun 27, 2019, 3:23:44 AM
On Thursday, June 27, 2019 at 7:52:06 AM UTC+1, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >Software adopted an agile, small iterations, have something runnable/shipable
> >(though maybe feature poor) at all times approach two decades ago.
> >
> >Why is hardware still done "waterfall"?
>
> Is it? I have read the claim that for HPPA (released 1986) no
> instruction was included unless performance gain of IIRC 1% could be
> demonstrated. And they used simulators for that.

i believe that bruce (hello) may be witnessing a much more open approach
to processor development than the one that, historically, has resulted in
successful long-term designs.

point being: proprietary companies *may* have been deploying an iterative
approach, and we *just don't get to see it in action*.

in the libre/open hardware community the iterative approach has been
deployed successfully for projects such as OpenRISC1000/1200, and there
are many other cores, historically, based on e.g. MIPS (except patent
issues pissed libre/open people off, there).

RISC-V is the first time the rapid iterative approach has become more
mainstream and many of the details - not all - may be observed or
participated in, for a heavy price, sadly: ITU-style standards development.

it's a compromise half-way-house arrangement that bridges the world between
what was formerly *entirely* secretive development, and the libre/open
community, and is thus an awful lot better. unfortunately, though, libre/open
development practices and benefits don't work if the libre/open developers
do not have *full* access and *full* rights to speak freely.

so there is still a long way to go, there, yet, overall, it is a hell of
a lot better than proprietary development.

l.

lkcl

unread,
Jun 27, 2019, 3:36:11 AM6/27/19
to
On Sunday, June 23, 2019 at 4:18:45 PM UTC+1, EricP wrote:

> I see in the spec that you mention re-entrantcy but don't go into detail.
> The VCP needs to be interruptible by external devices part way through
> a vector op, as well as exceptions (page faults in particular, as you note),
> and restartable afterwards, user or kernel mode.

eric i thought how best to express this, and i suddenly realised that
there's a really simple way to explain it: python "yield".

python's "yield" statement, if used correctly, turns any standard-looking
(nested) for-loop into an opportunity to *appear* to be re-entrant code.
the most extreme version of that was something called multitaskhttpd,
which originally was a single-process SIP-to-RTSP gateway (that required
persistent connections). i turned it into a HTTP/1.0 to HTTP/1.1 persistent
connection gateway, and it is *entirely* based on *massive* deep-nested
use of "yield".

so... if you can imagine that the spec, when it comes to executing the
actual element instruction, does a python "yield" statement immediately
after the execution, that, conceptually, is how the hardware has to
operate.
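[Editor's sketch of the model lkcl describes; code and names are invented, not from the spec.] The element loop as a Python generator: each yield marks the point where an interrupt or page fault may be taken, and resuming the generator restarts at exactly the next element -- which is the re-entrancy property in question.

```python
# Sketch: a vectorised add written as a generator.  The yield after each
# element is the conceptual interrupt point; the saved generator state is
# exactly the element index the hardware must be able to restart from.

def vector_add(dst, src1, src2, vl):
    for i in range(vl):
        dst[i] = src1[i] + src2[i]
        yield i                      # interruptible here, restartable after

# Drive the loop, "interrupting" it partway through and then resuming:
d = [0] * 4
op = vector_add(d, [1, 2, 3, 4], [10, 20, 30, 40], 4)
next(op)
next(op)                             # two elements done, then "interrupted"
for _ in op:                         # resume: the remaining elements complete
    pass
```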

l.

Andy Valencia

unread,
Jun 27, 2019, 10:04:43 AM6/27/19
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> Is it? I have read the claim that for HPPA (released 1986) no
> instruction was included unless performance gain of IIRC 1% could be
> demonstrated. And they used simulators for that.

My early PA Architecture Reference Manual had reference hint bits.
I was doing hand-written assembly code, and because of high level
design knowledge I made pretty good use of the hinting. But next
major revision, the bits were gone--and yes, I was told it was
exactly because they couldn't make them reliably deliver a performance
win in their measurements.

I didn't hear the 1% number, though. I'm pretty sure this change was
well before the prototype hardware started showing up (first units
with a "LP" prefix (lab proto), later "PP" (production proto)). So
yes, all simulated sounds correct.

Andy

MitchAlsup

unread,
Jun 27, 2019, 12:12:44 PM6/27/19
to
I have a similarly detailed 1-wide in-order microarchitecture (200-odd pages), and I am working on a SuperScalar partially-ordered implementation,
taking the Calculation units of the former to be used in the latter.

I have written 1/3rd of a simulator.

I am looking for a GCC person to do a compiler port.
>
>
> The thing that has always impressed me so much with what (now my bosses) Andrew and Yunsup and Krste did in 2010 to 2014 (and continuing now) is a tight and rapid iteration:
>
> 1) do some ISA design
> 2) implement in simulator
> 3) implement in gcc & binutils
> 4) build and run real software, measure code size and instructions
> 5) iterate to 1)

always a good plan

> 6) implement in RTL
> 7) put RTL on FPGAs
> 8) run real software cycle-accurate
> 9) iterate to 1)

We plan on doing this,

> 10) tape out a chip on a shuttle run
> 11) run real software
> 12) iterate to 1)

Need some real money to do this.

The thing I fail to see in your list is verification.
>
> That was all done and dusted for the base ISA a few years ago with current iterations being on uarch, but the full nested iteration cycle is going on as we speak with for example the Vector extension and the much less ambitious BitManipulation extension, both of which I'm involved with.
>
> Software adopted an agile, small iterations, have something runnable/shipable (though maybe feature poor) at all times approach two decades ago.
>
> Why is hardware still done "waterfall"?

I think mainly because one waits for ISA and simulation before committing
enough resources (read $$$) to get it done. Especially the 1-man shop
versions.

MitchAlsup

unread,
Jun 27, 2019, 12:14:42 PM6/27/19
to
On Thursday, June 27, 2019 at 1:52:06 AM UTC-5, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >Software adopted an agile, small iterations, have something runnable/shipable
> >(though maybe feature poor) at all times approach two decades ago.
> >
> >Why is hardware still done "waterfall"?
>
> Is it? I have read the claim that for HPPA (released 1986) no
> instruction was included unless performance gain of IIRC 1% could be
> demonstrated. And they used simulators for that. And I am sure they
> were not the only ones, although I consider instruction count a flawed
> cost metric (nonetheless, Patterson's recent papers put a lot of
> weight on instruction counts).
>
> And if you look at Intel and AMD, they started out with the 8086 41
> years ago (or maybe even the assembly-level-compatible 8080 45 years
> ago), always had something shippable, and extended it over many
> generations to the mess they have now; but the mess performs well.

You might be surprised at how many have been done that were never shipped.
We just never get to see them.

MitchAlsup

unread,
Jun 27, 2019, 12:17:33 PM6/27/19
to
On Thursday, June 27, 2019 at 2:19:40 AM UTC-5, Terje Mathisen wrote:
> Anton Ertl wrote:
> > And if you look at Intel and AMD, they started out with the 8086 41
> > years ago (or maybe even the assembly-level-compatible 8080 45 years
> > ago), always had something shippable, and extended it over many
> > generations to the mess they have now; but the mess performs well.
> > And I am sure that they used simulations etc. to evaluate the
> > extensions when they made them.
>
> The use of simulation to remove instructions from the fast core really
> bit them on the P4:
>
> It looks like they did independent simulations of shifts and muls, and
> decided to make both significantly slower. This really hurt all forms of
> array addressing, since you could no longer use shift&add as a replacement
> for simple constant muls. :-(

If you look back at the Mc88100 you will find a 3-cycle integer multiply;
this took a lot of pressure off the loop induction variable
elimination in the compiler. I have always been a fan of fast integer
multiplies--even though really well-done compilers can mitigate much of
that perf advantage.

Stephen Fuld

unread,
Jun 27, 2019, 12:27:53 PM6/27/19
to
On 6/26/2019 2:09 PM, MitchAlsup wrote:


snip
I think it would be worthwhile to look at the design and compare the
results of its characteristics to a more "typical" existing RISC design.

Does the use of your "enhanced" scoreboard (vs say a ROB based design)
result in higher potential clock rate (assuming same fab process)? Or
is the advantage reduced heat generation or smaller die size without
being able to increase clock rate? Can you quantify the advantage?

Is your 9% number for reduced instruction count a static or dynamic count?

Do you have data/intuition of how much single thread performance
advantage you will get from the ISA features listed in b, c, and g above?

Given that, do you have data/intuition of how much further multi-thread
performance increase you will get from a, e, and your fast context switch?

EricP

unread,
Jun 27, 2019, 12:32:43 PM6/27/19
to
Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
>> Software adopted an agile, small iterations, have something runnable/shipable
>> (though maybe feature poor) at all times approach two decades ago.
>>
>> Why is hardware still done "waterfall"?
>
> Is it? I have read the claim that for HPPA (released 1986) no
> instruction was included unless performance gain of IIRC 1% could be
> demonstrated. And they used simulators for that. And I am sure they
> were not the only ones, although I consider instruction count a flawed
> cost metric (nonetheless, Patterson's recent papers put a lot of
> weight on instruction counts).

HP also left the integer multiply instruction off the original PA-RISC
because they drank the "one clock per RISC instruction" koolaid.
Added it later, apparently, like MIPS and Alpha on byte/word
(MIPS fixed this before shipping the R2000, but DEC had to be clubbed
into submission on this one).


MitchAlsup

unread,
Jun 27, 2019, 1:13:22 PM6/27/19
to
The design arose from having done ISAs and implementations for 35 years
and having watched the good (MIPS) and the bad (SPARC,...).
>
> Does the use of your "enhanced" scoreboard (vs say a ROB based design)
> result in higher potential clock rate (assuming same fab process)? Or
> is the advantage reduced heat generation or smaller die size without
> being able to increase clock rate? Can you quantify the advantage?

Faster clock rate depends on the ratio of data bus wire delay to ALU
logic delay. If these turn out where I expect, the frequency advantage
should be 25%-33%. But a lot has to do with how the tools "synthesize"
the logic and wires. For a full custom design like I used to do, the
frequency advantage might be as high as 50%.

Less heat due to wiggling 1 wire to denote one state change (and without
CAMs.) How much is yet to be measured. Also less heat as there is no
ROB (area or power).
>
> Is your 9% number for reduced instruction count a static or dynamic count?

Dynamic; but simply morphed up out of Hennessy and Patterson data.
>
> Do you have data/intuition of how much single thread performance
> advantage you will get from the ISA features listed in b, c, and g above?

b will be small, but it makes the 5-10 instruction sequences that do the same
thing unnecessary, especially the changes to the current rounding mode (and
back).

Transcendental instructions are competing against cos = SQRT(1.0 - sin^2),
as most modern physics code is currently written. The way things are
currently written does not allow this subgroup to really display its
performance. At least the library functions can be made::

SIN:
SIN R1,R1
JMP R0

Bit manipulation makes code like:

result = (operand << up) >> down

compress into single instructions. So one would see small gains
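[Editor's sketch; function names are invented.] The two-shift idiom Mitch cites is how C code extracts an unsigned bitfield portably, and the single fused instruction computes the same result directly:

```python
# Sketch: (operand << up) >> down as unsigned bitfield extraction on a
# 64-bit register, versus the single fused extract it compresses into.

WIDTH = 64
MASK = (1 << WIDTH) - 1

def extract_two_shifts(operand, up, down):
    """The C idiom: (operand << up) >> down, with 64-bit wraparound."""
    return ((operand << up) & MASK) >> down

def extract_fused(operand, offset, length):
    """What a single bit-extract instruction computes directly."""
    return (operand >> offset) & ((1 << length) - 1)

# An 8-bit field at bit 16: up = 64-16-8 = 40, down = 64-8 = 56.
```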

g predication is present in order to enhance the predictability of those
branches which are not predicated. Predication is also present to minimize
the loss of instructions in the instruction buffer when branching
short distances.

overall, one could argue they are not worth the bother. But they do nickel
and dime the perf up, probably more so than a huge SIMD extension.
>
> Given that, do you have data/intuition of how much further multi-thread
> performance increase you will get from a, e, and your fast context switch?

Hard to say, but what we do know:

a) one can remotely control (Activate, Enable, Disable, Deactivate) threads
without needing an IPI or (in many cases) an MP interlock. Once so
controlled, the HW will efficiently put down the current thread and pick up
a new thread, including all of its OS and HV states, without further OS/HV
interaction.

High performance context switching simply takes cycles out of stuff
architects seldom measure, or when they do, seldom talk about.

All in all, nickel and dime for the application side of things; except for
the immediates and displacements.
On the OS/HV side; I would hold judgement until I see how these things
pan out.

MitchAlsup

unread,
Jun 27, 2019, 1:15:11 PM6/27/19
to
I wonder if they caught wind of my Mc88100 and thought they had to
do something.

Bruce Hoult

unread,
Jun 27, 2019, 4:23:44 PM6/27/19
to
On Wednesday, June 26, 2019 at 11:52:06 PM UTC-7, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >Software adopted an agile, small iterations, have something runnable/shipable
> >(though maybe feature poor) at all times approach two decades ago.
> >
> >Why is hardware still done "waterfall"?
>
> Is it? I have read the claim that for HPPA (released 1986) no
> instruction was included unless performance gain of IIRC 1% could be
> demonstrated. And they used simulators for that. And I am sure they
> were not the only ones, although I consider instruction count a flawed
> cost metric (nonetheless, Patterson's recent papers put a lot of
> weight on instruction counts).

1% overall is a reasonable goal, and that's similar to what we do in RISC-V land. It doesn't make sense to always insist on it, though, if the candidate instruction improves *some* code by rather more than 1% but not average code, yet is very cheap to implement.

I think for example the ANDC/BIC instruction being considered in the BitManipulation extension falls into this category. It won't be very common in general code (but a compiler can easily opportunistically use it) but it's very very cheap to implement as the encoding fits into an existing hole (which wouldn't be true if you wanted ORC and XORC/EQV as well), and there is already an inverter on the 2nd ALU input which simply needs to be triggered.
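[Editor's sketch, not Bruce's example; names are invented.] Concretely, ANDC/BIC computes a & ~b, so a "clear these bits" pattern becomes one instruction instead of a NOT followed by an AND:

```python
# Sketch: what ANDC/BIC computes, modelled on 64-bit values.

MASK64 = (1 << 64) - 1

def bic(a, b):
    """AND with complement: a & ~b, as a single instruction."""
    return a & (~b & MASK64)

# Typical opportunistic use: knocking flag bits out of a word in one op.
flags = 0b1011_0110
CLEAR = 0b0000_0110       # bits to clear
```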

Something else that should I think be considered is the incremental documentation, learning, and software tool burden of each new instruction. Although Intel and ARM clearly don't give a damn about this!

Instruction counts make sense as long as you are careful to ensure that new instructions being considered don't negatively impact the other factors in the performance equation: the IPC and the clock cycle. These are pretty much givens for instructions that are simple combinational functions or that reuse existing circuits in a different way. You also need to consider the impact on area/cost/power.

> And if you look at Intel and AMD, they started out with the 8086 41
> years ago (or maybe even the assembly-level-compatible 8080 45 years
> ago), always had something shippable, and extended it over many
> generations to the mess they have now; but the mess performs well.
> And I am sure that they used simulations etc. to evaluate the
> extensions when they made them.

I'm not talking about iterations on the scale of two or three years -- as an alternative to .. what? Starting a processor design in 1980 and shipping in 2020?

I'm talking about iterations on the scale of two or three weeks, days or, with good automation, hours. (Iterations to booting Linux on an FPGA can be done that quickly and run maybe 1/20th the speed of the eventual SoC, which is enough to be very useful, especially as they are cycle-accurate)

MitchAlsup

unread,
Jun 27, 2019, 7:09:35 PM6/27/19
to
On Thursday, June 27, 2019 at 3:23:44 PM UTC-5, Bruce Hoult wrote:
> On Wednesday, June 26, 2019 at 11:52:06 PM UTC-7, Anton Ertl wrote:
> > Bruce Hoult <bruce...@gmail.com> writes:
> > >Software adopted an agile, small iterations, have something runnable/shipable
> > >(though maybe feature poor) at all times approach two decades ago.
> > >
> > >Why is hardware still done "waterfall"?
> >
> > Is it? I have read the claim that for HPPA (released 1986) no
> > instruction was included unless performance gain of IIRC 1% could be
> > demonstrated. And they used simulators for that. And I am sure they
> > were not the only ones, although I consider instruction count a flawed
> > cost metric (nonetheless, Patterson's recent papers put a lot of
> > weight on instruction counts).
>
> 1% overall is a reasonable goal and that's similar to what we do in RISC-V land. It doesn't make sense to always insist on it though, if the candidate instruction improves *some* code by rather more than 1% but not average code, but it's very cheap to implement.
>
> I think for example the ANDC/BIC instruction being considered in the BitManipulation extension falls into this category. It won't be very common in general code (but a compiler can easily opportunistically use it) but it's very very cheap to implement as the encoding fits into an existing hole (which wouldn't be true if you wanted ORC and XORC/EQV as well), and there is already an inverter on the 2nd ALU input which simply needs to be triggered.

It is sad that many designers and architects think that bit-manipulation
is expensive. It only takes a 'greater than' decoder and an AND gate on
top of a shift/rotator.
>
> Something else that should I think be considered is the incremental documentation, learning, and software tool burden of each new instruction. Although Intel and ARM clearly don't give a damn about this!

The more each instruction is like the rest of the members in its group,
the easier it is for SW to utilize it.

Bruce Hoult

unread,
Jun 27, 2019, 8:30:35 PM6/27/19
to
On Thursday, June 27, 2019 at 4:09:35 PM UTC-7, MitchAlsup wrote:
> On Thursday, June 27, 2019 at 3:23:44 PM UTC-5, Bruce Hoult wrote:
> > On Wednesday, June 26, 2019 at 11:52:06 PM UTC-7, Anton Ertl wrote:
> > > Bruce Hoult <bruce...@gmail.com> writes:
> > > >Software adopted an agile, small iterations, have something runnable/shipable
> > > >(though maybe feature poor) at all times approach two decades ago.
> > > >
> > > >Why is hardware still done "waterfall"?
> > >
> > > Is it? I have read the claim that for HPPA (released 1986) no
> > > instruction was included unless performance gain of IIRC 1% could be
> > > demonstrated. And they used simulators for that. And I am sure they
> > > were not the only ones, although I consider instruction count a flawed
> > > cost metric (nonetheless, Patterson's recent papers put a lot of
> > > weight on instruction counts).
> >
> > 1% overall is a reasonable goal and that's similar to what we do in RISC-V land. It doesn't make sense to always insist on it though, if the candidate instruction improves *some* code by rather more than 1% but not average code, but it's very cheap to implement.
> >
> > I think for example the ANDC/BIC instruction being considered in the BitManipulation extension falls into this category. It won't be very common in general code (but a compiler can easily opportunistically use it) but it's very very cheap to implement as the encoding fits into an existing hole (which wouldn't be true if you wanted ORC and XORC/EQV as well), and there is already an inverter on the 2nd ALU input which simply needs to be triggered.
>
> It is sad that many designers and architects think that bit-manipulation
> is expensive. It only takes a 'greater than decoder' and an AND gate on
> top of a shift/rotater.

I'm not clear exactly what operation(s) you are referring to here. Bitfield insert/extract?

MitchAlsup

unread,
Jun 27, 2019, 9:26:55 PM6/27/19
to
They also include SLA, SLL, SRA, SRL (the difference between SRA and SRL is
whether overflow is detected and raised).

So the minor logic (gate+area) cost over a shifter that does SLL, SLA,
SRA, SRL is on the order of 10% of the size of the shifter in a 64-bit
data path.

SLx is simply an Insert (into a field of zeros) operation.
SRx is simply an Extract operation.

Much depends on HOW you define the offset and size <driving> fields.
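[Editor's sketch: a reference model of the relationship Mitch describes; the code and names are mine, not his.] Plain shifts are the degenerate cases of insert-into-zero and extract, which is why the incremental hardware on top of the shifter is so small:

```python
# Reference model: field extract/insert on a 64-bit datapath, with the
# plain shifts expressed as their degenerate cases.

W = 64
M = (1 << W) - 1

def extract(x, offset, size):
    """SRx-style: pull `size` bits at `offset` down to bit 0 (unsigned)."""
    return (x >> offset) & ((1 << size) - 1)

def insert(dst, src, offset, size):
    """General field insert: place the low `size` bits of src into dst at `offset`."""
    field = ((1 << size) - 1) << offset
    return (dst & ~field & M) | ((src << offset) & field)

def sll(x, k):
    """SLL is simply an insert into a field of zeros."""
    return insert(0, x, k, W - k)

def srl(x, k):
    """SRL is simply an extract of the remaining upper bits."""
    return extract(x, k, W - k)
```

Note (per Ivan's follow-up) that the general insert takes four operands (dst, src, offset, size) where extract takes only three.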

Ivan Godard

unread,
Jun 27, 2019, 9:59:17 PM6/27/19
to
On 6/27/2019 6:26 PM, MitchAlsup wrote:
>>>> ... an inverter on the 2nd ALU input which simply needs to be triggered.
>>>
>>> It is sad that many designers and architects think that bit-manipulation
>>> is expensive. It only takes a 'greater than decoder' and an AND gate on
>>> top of a shift/rotater.
>>
>> I'm not clear exactly what operation(s) you are referring to here. Bitfield insert/extact?
>
> They also include SLA, SLL, SRA, SRL (the difference between SRA and SRL is
> whether overflow is detected and raised).
>
> So the minor logic (gate+area) cost over a shifter that does SLL, SLA,
> SRA, SRL is on the order of 10% of the size of the shifter in a 64-bit
> data path.
>
> SLx is simply an Insert (into a field of zeros) operation.
> SRx is simply an Extract operation.
>
> Much depends on HOW you define the offset and size <driving> fields.
>

The shifter itself is not a problem, but general field insert needs four
arguments, which can muck up encoding and datapaths. Extract only needs
three, which you can fit on the FMA datapath.

There's also an encoding problem for immediate forms, because there are
usually more bits in an operand than addressable registers (or belt
positions), so you can't use the regnum fields as bit numbers.

already...@yahoo.com

unread,
Jun 28, 2019, 6:47:50 AM6/28/19
to
Why not clang, like the rest of the world?

lkcl

unread,
Jun 28, 2019, 6:57:01 AM6/28/19
to
On Friday, June 28, 2019 at 11:47:50 AM UTC+1, already...@yahoo.com wrote:

> > I am looking for a GCC person to do a compiler port.

try mjc (michael clark). will email you privately, mitch.

> Why not clang, like the rest of the world?

the world runs the linux kernel: every IoT box with passwords
admin,admin, every smartphone, every tablet, every chromebook,
and the majority of the world's stable internet servers (barring
*BSD variants) - they all run the linux kernel. every IPTV
(particularly the copyright criminally-infringing ones from LG,
where they consider it a Corporate failure if you even *NOTICE*
that they're breaking the law).

that's literally billions of devices.

the linux kernel requires gcc, as does u-boot, coreboot, libreboot...

past ports of the linux kernel to run on llvm failed due to a std-C feature
*not being supported* by llvm: variable-length arrays. the last variable-length
array was only removed from the linux kernel a few months ago.

whilst this was the last major barrier, i have not checked if anyone has tried
compiling with llvm since.

so no: contrary to popular belief, the world does *not* revolve around llvm-clang.

l.

lkcl

unread,
Jun 28, 2019, 7:23:58 AM6/28/19
to
On Thursday, June 27, 2019 at 9:23:44 PM UTC+1, Bruce Hoult wrote:
> On Wednesday, June 26, 2019 at 11:52:06 PM UTC-7, Anton Ertl wrote:
> > Bruce Hoult <bruce...@gmail.com> writes:
> > >Software adopted an agile, small iterations, have something runnable/shipable
> > >(though maybe feature poor) at all times approach two decades ago.
> > >
> > >Why is hardware still done "waterfall"?
> >
> > Is it? I have read the claim that for HPPA (released 1986) no
> > instruction was included unless performance gain of IIRC 1% could be
> > demonstrated. And they used simulators for that. And I am sure they
> > were not the only ones, although I consider instruction count a flawed
> > cost metric (nonetheless, Patterson's recent papers put a lot of
> > weight on instruction counts).
>
> 1% overall is a reasonable goal and that's similar to what we
> do in RISC-V land.

this is making sense, and is good advice. i have been resisting
adding sv.setvl because it would be literally the sole and only
instruction added. however, its inclusion would provide an
opportunity to reduce IPC because every loop otherwise requires
both setting of the MAXVL CSR *and* the VL.

within the RVV setvl opcode there is more than enough space to
add a 3rd argument onto the end (MVL).
https://libre-riscv.org/simple_v_extension/specification/sv.setvl/

MAXVL is needed because it is not a behind-the-scenes parameter,
it has to be set in order to define the number of scalar (FP or INT)
registers to be typecast to a vector.

in RVV (and other traditional Vector Engines), vector registers are
that: actual vector registers, so the MAXVL becomes a hidden, hard-coded
part of the actual micro-architecture: the number of parallel lanes.



> Something else that should I think be considered is the incremental
> documentation, learning, and software tool burden of each new instruction.
> Although Intel and ARM clearly don't give a damn about this!

:) gotta get that product out the door, eh?

> I'm not talking about iterations on the scale of two or three years -- as an
> alternative to .. what? Starting a processor design in 1980 and shipping in 2020?
>
> I'm talking about iterations on the scale of two or three weeks, days or,
> with good automation, hours. (Iterations to booting Linux on an FPGA can be
> done that quickly and run maybe 1/20th the speed of the eventual SoC, which
> is enough to be very useful, especially as they are cycle-accurate)

bruce, this cycle is something that's been going on in the libre/open
hardware community for decades. it's becoming *more prevalent and well-known*,
that's what's so highly significant [again: we just don't *know* what the
secretive proprietary companies do]. that, and the reverse-engineering of
some of the lower-cost FPGAs such as the ICE40 and ECP5, plus a
part-reverse-engineering of the virtex-7 series Xilinx FPGAs, mean that
the toolchains and affordability of the actual FPGAs make hardware development
much more prevalent.

what the founders of RISC-V have done is to break the mind-set "lock" on
what was formerly the exclusive domain of a proprietary world. MIPS was
patented: ImgTec is universally hated so no-one would deal with them.
suddenly, everyone and their dog can go, "ha! i think i will make a stonkingly
good processor, today", where formerly they would go "naah, i got no hope,
i'm wasting my time". sadly the f-cpu.org effort from 20 years ago (which
i'd never heard of until last month) ignored this barrier and unfortunately
fizzled out.

i mean, for goodness sake, there was that chinese guy last month on hw-dev,
posted, "hello everybody i've just developed a superscalar multi-issue RV32
processor with an instructions-per-clock of around 3.5", i mean, just... *wow*

(link if anyone's interested, it's pretty cool what he's done, taking a standard
single-issue RV32 design and making it multi-issue:
https://groups.google.com/a/groups.riscv.org/d/msg/hw-dev/droxuIGRcPc/urq4Q6XWBQAJ)

it had *never occurred to me* that a single unit test using formal mathematical
proofs could be put to good use as a substitute for a convoluted mish-mash of
(incomplete) corner-case unit tests, guaranteeing 100% coverage, until i saw
a post by samuel falvo on hw-dev. *major* lightbulb epiphany as a software
engineer of 25 years!

*these* are the things that are hugely significant: yes, the faster iterative
development cycles, it's not just that, it's the whole community, and it's
why so many people view the secretive closed doors "ITU" style cartelling
initiated by the RISC-V Foundation as such a serious impediment *to the entire
community*. if you're on the "inside", you can't interact with the rest of
the world, not in a seriously meaningful and engaging way because (even if
you were _permitted_ to explain things to "outsiders") *nobody
within the cartel wants to bother repeating heavily contextual technical
discussion twice*.

yes, lots of progress, which is fantastic to witness: yes, still an awful
lot to accept and learn from, as well.

l.

already...@yahoo.com

unread,
Jun 28, 2019, 7:38:46 AM6/28/19
to
1.
My copy of the world does not revolve around the Linux kernel nearly as much as yours. Probably a matter of being slightly older (not as old as most c.a regulars, but still...), never living in California, and never learning Computer Science in academia.

I would very much prefer something 100 times simpler than Linux as the first boot of the new ISA/compiler. Something like bare metal no-OS + a small variant of newlib. Or, maybe, something less bloated than even that. Esp. on a simulator, but probably on FPGA too.


2. Getting rid of VLA in the kernel is a good idea, regardless of compiler issues.

3.
My copy of clang does basic VLA stuff just fine, out of the box and with no effort.
Maybe non-basic VLA, like multidimensionals, is not supported; I wouldn't know, because I don't know what they do. Besides, knowing Mr. Torvalds, I have a very hard time believing that he would ever allow such stuff into trunk.

4.
In the past gcc had a reputation for being actively hostile to attempts by 3rd parties to extract or insert intermediate representations. On the other hand, LLVM was very much built on the promise of stable intermediate representations.
I don't know where both of them stand now in that regard.
My guess is that they are converging, i.e. gcc's intermediate representation getting more stable and clang-LLVM's getting less stable, but the lines haven't crossed yet.
Would like to know the opinions of those actively involved, like Bruce.


MitchAlsup

unread,
Jun 28, 2019, 12:15:59 PM6/28/19
to
On Thursday, June 27, 2019 at 8:59:17 PM UTC-5, Ivan Godard wrote:
> On 6/27/2019 6:26 PM, MitchAlsup wrote:
> >>>> ... an inverter on the 2nd ALU input which simply needs to be triggered.
> >>>
> >>> It is sad that many designers and architects think that bit-manipulation
> >>> is expensive. It only takes a 'greater than decoder' and an AND gate on
> >>> top of a shift/rotater.
> >>
> >> I'm not clear exactly what operation(s) you are referring to here. Bitfield insert/extact?
> >
> > They also include SLA, SLL, SRA, SRL (the difference between SRA and SRL is
> > whether overflow is detected and raised).
> >
> > So the minor logic (gate+area) cost over a shifter that does SLL, SLA,
> > SRA, SRL is on the order of 10% of the size of the shifter in a 64-bit
> > data path.
> >
> > SLx is simply an Insert (into a field of zeros) operation.
> > SRx is simply an Extract operation.
> >
> > Much depends on HOW you define the offset and size <driving> fields.
> >
>
> The shifter itself is not a problem, but general field insert needs four
> arguments, which can muck up encoding and datapaths. Extract only needs
> three, which you can fit on the FMA datapath.

The total number of bits needed is 12 in the last 2 arguments. I use
the lower 32 bits for offset and the top 32 bits as length; length=0
implies length = as much of the operand as fits in the result. I also
have immediate versions perfectly fitting C bit-fields with that 12-bit
immediate.

Presto: it fits on a 3-operand data path, and when the top 32 bits are
zero (0) they perform shift instructions.
>
> There's also an encoding problem for immediate forms, because there are
> usually more bits in an operand than addressable registers (or belt
> positions), so you can't use the regnum fields as bit numbers.

And I did not!
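[Editor's sketch of the operand packing as described above; the code and the function name are mine, only the 32/32 split and the length=0 rule are from Mitch's description.] Offset sits in the low 32 bits, length in the high 32, and a zero upper half degenerates into a plain shift:

```python
# Model: extract whose control operand packs offset (low 32 bits) and
# length (high 32 bits); length == 0 means "the rest of the operand",
# so a control word with a zero top half behaves as SRL.

W = 64

def ext(src, ctl):
    offset = ctl & 0xFFFF_FFFF
    length = (ctl >> 32) & 0xFFFF_FFFF
    if length == 0:                   # length=0: as much as fits in the result
        length = W - offset
    return (src >> offset) & ((1 << length) - 1)
```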

MitchAlsup

unread,
Jun 28, 2019, 12:19:09 PM6/28/19
to
What I want is that, after the C port is done, a variety of languages
are available. If Clang can do that, it's fine by me.
Also note: my ISA is very close to the Mc 88K, which had a port in GCC,
so that might shorten the port burden.
I also need a Linux kernel build to work.

Hint: whatever gets the job done.

lkcl

unread,
Jun 28, 2019, 2:32:08 PM6/28/19
to
On Saturday, June 29, 2019 at 12:19:09 AM UTC+8, MitchAlsup wrote:

> What I want is after the C-port is done that a variety of languages
> are available. If Clang can do that its fine by me.

gcc is about 20% of the way up the critical software dependency stack, whereas LLVM is more critical at the user level.

gcc will get you critical services written in c, c++ and so on, whereas LLVM will get you rust, golang, and other upcoming stuff.

> Also note: My ISA is very close to Mc 88K which had a port in GCC,
> so that might shorten the port burden.

It would. An emulator is also going to be needed: high performance is not critical initially, nor is JIT optimisation, but cycle accuracy, a way to load an initramdisk, and the ability to boot a Linux kernel that operates from emulated RAM and interacts at the console level are essential.

Whilst qemu is usually what is recommended here, and will *eventually* be needed, I very much prefer the simplicity of the approach used with spike.

Before even running linux:

In combination with some libraries that provide an absolute-minimum POSIX subset, plus something called "SBI", very basic bare-metal and then semi-POSIX-compliant programs can be cross-compiled with the fledgling toolchain, and the two can at least verify each other.

Before even running bare metal programs:

A binutils port is needed; with it, pure-assembly bare-metal programs can be written and put into the simulator.

Also the same programs can be thrown at the hardware on an FPGA to verify them.

> I also need a Linux kernel build to work.

As the above shows, there are a hell of a lot of bootstrap phases to get to that point. Estimated... 2 to 3 man-years, less if shortcuts are taken by riding off the back of BBL, spike, OpenSBI and riscv-fesvr.

You also need to compile an absolute bare-minimum suite of applications to run *in* that initial ramdisk, and the simplest option there is probably buildroot.

buildroot was written by Not-Invented-Here specialists who did not believe that the complexities of cross-compiling were as necessary as they are, and they soon found out. Ho hum. Anyway, it does the job.

However, before *that* is possible, you will need a port of GNU libc6, which has one major critical component: how to do POSIX threads. There are also various intrinsics, such as atomic locks and mutexes; these are usually highly architecture-specific.

The very basic, minimalist POSIX compliance layer mentioned earlier will partially help.
Also, there will almost certainly be a libc6 88000 port already, if there is a gcc port. That will save time too.

> Hint: whatever gets the job done.

LLVM will certainly also be needed, later on, because as you can see from what already5 wrote, a large fraction of userspace apps now critically depend on LLVM.

Do not underestimate the software porting side, Mitch, it is really quite enormous. Shortcuts can be taken by selecting a previously-followed bootstrap path, and I would recommend following the one taken to bootstrap riscv. It is quirky (BBL is particularly weird and nonstandard) however using the code directly with minimum necessary modifications will cut the bootstrap process drastically.

L.

lkcl

unread,
Jun 28, 2019, 2:58:50 PM6/28/19
to
On Friday, June 28, 2019 at 7:38:46 PM UTC+8, already...@yahoo.com wrote:

> > l.
>
>
> 1.
> My copy of the world does not revolve around Linux kernel nearly that much as yours.

Userspace. I understand. I did reverse engineering of around... 12 devices, mostly smartphones and one laptop, and have done board bring-up on half a dozen other embedded hardware systems; it gives you a really REALLY good grasp of how hardware hangs together.

I tell the story of how the HTC Universal micro laptop (clamshell smartphone aka the O2 XDA II) works, and it puts people used to the x86 world into shock with total disbelief. They have *no idea* how complex such small devices really are.

> Probably, matter of being slightly older (not as old as most c.a regulars, but still...) never living in California and never learning Computer Science in academia.

You have not been brainwashed, then :)

> I would very much prefer something 100 times simpler than Linux as the first boot of the new ISA/compiler.

Absolutely: this is not just nice to have, it is absolutely essential. I just outlined this process in another post.

The riscv-tests are written in pure assembler, and they can be run on a console-less FPGA if you provide your own "test passed", "test failed" and "print char to something, for debug purposes" routines.

Usually, and this is what picorv32 did, that debug routine is as simple as writing to a memory location, which on an AXI4 or Wishbone bus goes out to a UART, or even a blinky light with an optical reader staring at it; who cares, as long as it gives feedback.
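A sketch in C of that kind of memory-mapped debug hook: a store to one magic address becomes console output, everything else is an ordinary bus access. The address and function names here are invented for illustration, not picorv32's actual code.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical magic address: stores here go to the "console". */
#define DEBUG_ADDR 0x10000000u

/* Returns 1 if the store was intercepted as a debug print. */
int bus_write(uint32_t addr, uint32_t data) {
    if (addr == DEBUG_ADDR) {
        putchar((int)(data & 0xff));  /* UART, blinky light, stdout... */
        return 1;
    }
    /* ordinary memory / peripheral access would go here */
    return 0;
}
```

In a simulator that intercept ends up on stdout; on an FPGA the same store drives a UART, and the test program does not need to know the difference.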

The tests then quite literally test every single opcode. Of course, many of those tests critically rely on those very same opcodes working first, before... you get the point; however, the unit tests are there to cover the corner cases.

Add 0xffffffff to 0x1: does it give zero? Etc. etc.
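That corner case can be modelled in C (a sketch only; the riscv-tests themselves are assembler): a 32-bit add at the register-width boundary must wrap to zero rather than trap.

```c
#include <stdint.h>

/* Model of a 32-bit register add: unsigned arithmetic wraps mod 2^32,
   which is exactly the boundary behaviour the unit tests probe. */
static uint32_t add32(uint32_t a, uint32_t b) {
    return a + b;
}
```

A hardware test of the same case simply compares the register result against the expected constant and branches to the "test failed" routine on mismatch.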

When running under spike, the trap that outputs a debug character goes through that minimal POSIX compliance layer I mentioned, which in turn, at the simulation level, ends up redirected to stdout, and, ta-daaa, the test can print "OK", which is enough.

Other people may know the details here better than I do.

That these tests even exist, alongside spike, has allowed me to modify spike to add the SV Vectorisation. It took only approximately 10 weeks to get the first SV draft spec implemented, including adding 25 or so new (assembler) unit tests, which allowed me to get the spec details right, in the iterative fashion that Bruce mentioned.

SV was very very specifically and deliberately designed to NOT require any new opcodes.

This meant that when it came to writing the VECTOR unit tests, I could literally write them in SCALAR assembly.

There has therefore been absolutely no need to develop a "vector" version of binutils (which contains gas, the assembler that turns assembly into binary), because there are no instructions to add to it.

A new ISA would not be able to deploy this trick, however for Mitch, the fact that his work is so similar to the 88000 will save a huge amount of porting effort.

L.

lkcl

unread,
Jun 28, 2019, 4:08:32 PM6/28/19
to
On Saturday, June 29, 2019 at 2:58:50 AM UTC+8, lkcl wrote:


> Usually, and this is what picorv32 did, that debug routine is as simple as writing to a memory location, which on an AXI4 or wishbone address goes out to a UART or even a blinky light with an optical reader staring at it, who cares as long as it gives feedback.

Forgot to mention another trick which helps with the bootstrap process, again saving time and shortening the early iterative loop:

in Verilog, that same address, if written to, results in a Verilog console print statement.

When the design is compiled to a C program with iverilog, the unit-test infrastructure just uses the resulting executable to run a test.

Running the entire test suite therefore takes seconds.

An FPGA compile and upload, by contrast, can run to several minutes, as the FPGA has to be reset and the new unit test uploaded each and every time.

ONLY when you know that all unit tests pass in the iverilog simulation do you move on to the next, slower and more expensive iterative loop: uploading to the FPGA.

As Bruce outlines, saving this kind of time at any stage is essential. The faster the iterative loop the faster mistakes can be caught and fixed.

L.


Ivan Godard

unread,
Jun 28, 2019, 5:05:03 PM6/28/19
to
You prove my point: the op needs non-standard ALU encoding even though
it fits in the ALU FU. Here you have appended an immediate word to carry
the extra manifest arguments, which an add (say) doesn't encode.

More importantly, you are omitting the general op, which needs four
register arguments (5 in IBM-style) and no immediates. Life is easy if
you can omit the annoying cases :-) Yes, you can parse a 32-bit
immediate as additional regnums, but that would be a two-level parse and
I think a problem even on your ISA. Then there's the extra datapaths.

Most designs, when they want to duck the problem, simply assert that
nobody uses dynamic insert/extract, so do it in software. To them I say:
show me your code that parses a dynamic bit-stream input.

MitchAlsup

unread,
Jun 28, 2019, 6:44:25 PM6/28/19
to
to_container<width,offset> = from_value;

Where is the 5th operand?

Also note: most dynamic uses are arrays of bit fields, so one can do::

width += R7
offset += R7

With loop induction as::

MOV Rad,<0,4> // <w,o> is built into asm
MOV Rwo,<4,0> // and builds appropriate constants.
loop:
INS R10,R10,Rwo

ADD Rwo,Rwo,Rad

Although one might want to roll this into a TRY/CATCH block to detect
offset overflow and fetch the next containers.
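The posted loop can be modelled in C roughly as follows, using the earlier operand packing (offset in the low 32 bits, width in the top 32). Register and helper names are illustrative, and extract is shown rather than INS for simplicity.

```c
#include <stdint.h>

/* Pack a <width,offset> operand: width high, offset low. */
#define WO(w, o) (((uint64_t)(w) << 32) | (uint64_t)(o))

/* Extract a <width,offset> field from a container. */
static uint64_t ext(uint64_t container, uint64_t wo) {
    uint32_t o = (uint32_t)wo, w = (uint32_t)(wo >> 32);
    return (container >> o) & ((1ULL << w) - 1);
}

/* Walk an array of 4-bit fields, as in the posted loop:
   Rwo starts at <4,0>; Rad = <0,4> bumps the offset each pass. */
static uint64_t sum_nibbles(uint64_t container, int n) {
    uint64_t rwo = WO(4, 0), rad = WO(0, 4), sum = 0;
    for (int i = 0; i < n; i++) {
        sum += ext(container, rwo);  /* EXT ..., Rwo        */
        rwo += rad;                  /* ADD Rwo, Rwo, Rad   */
    }
    return sum;
}
```

The single ADD of the packed operand is the loop induction; a real implementation would also need the offset-overflow check mentioned above to fetch the next container.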

> you can omit the annoying cases :-) Yes, you can parse a 32-bit
> immediate as additional regnums, but that would be two-level parse and I
> think a problem even on you ISA. Then there's the extra datapaths.
>
> Most designs, who want to duck the problem, simply assert that nobody
> uses dynamic insert/extract so do it in software. To them I say: show me
> your code that parses a dynamic bit-stream input.

a) What I have on the immediate side is perfect for C-family languages.
b) What I have on the dynamic side requires only 1 additional instruction
for the random bit field, in order to bind both operands into a single
operand.
c) I still keep the ability to detect "overflow" and "operand out of range"
with my construct.