SV / RVV, marking a register as VL.

lkcl

Aug 27, 2019, 3:45:12 PM8/27/19
to RISC-V ISA Dev, Libre-RISCV General Development
https://libre-riscv.org/simple_v_extension/appendix/#strncpy
https://libre-riscv.org/simple_v_extension/specification/sv.setvl/#questions

Something that has been bugging me about RVV and SV for some time: the fact that arithmetic on VL requires additional instructions to perform a transfer between VL and the registers used to carry out the necessary arithmetic.

If CSRs were treated orthogonally as actual scalar registers in RISC-V, the problem would be moot.

This particularly hits home on use of fail-on-first.

The above pseudocode for strncpy shows it well: a CSR load is required in RVV in order to get at the modifications to VL that the fail-first actioned.

In SV it is even more pronounced an effect, due to a need to increment VL by one, after the fail-first, which of course requires first transferring VL to a scalar reg, then performing the arithmetic, then getting the value *back* into VL.

I have been thinking of a solution here which I did not want to share until I was reasonably sure it would be easily implementable in hardware.

The solution is, instead of having a CSR that contains the current VL value, to have the CSR point *to* the scalar register that contains, and will continue to contain, the current VL value.
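To make the proposal concrete, here is a minimal Python sketch of the idea (a hedged model, not the SV or RVV spec: the class, method names, and the MVL value are all invented for illustration). The VL CSR holds only a 5-bit index into the scalar regfile, so any hardware update of VL — such as a fail-on-first truncation — lands directly in x1-x31:

```python
# Hypothetical model: the VL CSR is a *pointer* into the scalar regfile,
# so fail-on-first writes the scalar register directly, with no CSRR needed.
class VLPointerModel:
    def __init__(self):
        self.regs = [0] * 32      # scalar regfile x0..x31
        self.vl_ptr = 0           # VL CSR: index of the register that *is* VL

    def setvl(self, rd, requested, mvl):
        """SETVL rd, requested: link rd as the VL register and set it."""
        self.vl_ptr = rd
        vl = min(requested, mvl)  # VL = MIN(requested, MVL)
        if rd != 0:               # x0 is hardwired to zero
            self.regs[rd] = vl
        return vl

    def fail_on_first_truncates(self, new_vl):
        """A fail-first load truncating VL writes the scalar reg directly."""
        if self.vl_ptr != 0:
            self.regs[self.vl_ptr] = new_vl

m = VLPointerModel()
m.setvl(6, requested=100, mvl=16)  # link t1 (x6) as VL; VL = min(100, 16) = 16
m.fail_on_first_truncates(3)       # ffirst stopped at element 3
print(m.regs[6])                   # t1 already holds 3: no CSR read round-trip
```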

This would have the advantage that, once "linked", fail-on-first would automatically result in *direct* modification of that scalar (standard x1-x31 integer Regfile) register.

In the pseudocode above that would save 1 instruction in the inner loop in the RVV case: a reduction of around 8%.

In the SV case it would save *three* instructions in what is currently a 14 instruction loop: a significant saving (even when they're all RVC opcodes).

The hardware challenges are that these are implicit (indirect) modifications to a scalar regfile. Given that VL already has to be modified (in the current revision of RVV) conceptually it is not challenging, it's just that instead of modifying the CSR store, the integer regfile store is to be written.

For an OoO design, which was my primary concern, this makes every vector instruction require one additional read and write register hazard.

For context: some implementations may not have chosen to make VL a read / write dependency hazard, choosing instead to "stall" instruction issue whilst waiting for outstanding vector operations to complete: for such implementations the previous paragraph makes no sense and does not apply.

If on the other hand an OoO engine *has* had support for read and write hazard dependency tracking on VL added (in order to avoid stalling when VL is modified), then changing that to be the scalar register (to which this proposal modification to SETVL points) is not so much of a problem, and might even simplify the microarchitecture.

For in-order systems I am not so concerned about the hardware implications: stall is the go-to "solution", and this situation is no different.

As this is quite a radical design change I have been reluctant to come forward with it, and had to think about it for several months.

Feedback appreciated; will hold off going ahead with this on SV for a while longer.

Constructive feedback on its value in RVV is also welcomed, as it will save on instruction count in tight loops there too.

L.

Bruce Hoult

Aug 28, 2019, 6:00:12 PM8/28/19
to lkcl, RISC-V ISA Dev, Libre-RISCV General Development
Have a register (or CSR) contain some sort of pointer to another
register? Just: no way. Micro-architectural nightmare.

The scalar instructions in, for example, this strncpy loop do not take
significant time. In a real version of the code they would be
interleaved with vector instructions rather than all at the end, and
would on almost all machines be completed long before the preceding
vector instruction is. In particular the move from the VL CSR would
happen soon after the vlbff.v and the increments to the pointers soon
after that.

Maybe something like:

strncpy:
mv a3, a0 # Copy dst
loop:
setvli x0, a2, vint8 # Vectors of bytes.
vlbff.v v1, (a1) # Get src bytes
vseq.vi v0, v1, 0 # Flag zero bytes
csrr t1, vl # Get number of bytes fetched
vmfirst a4, v0 # Zero found?
add a1, a1, t1 # Bump src pointer
vmsif.v v0, v0 # Set mask up to and including zero byte.
sub a2, a2, t1 # Decrement count.
vsb.v v1, (a3), v0.t # Write out bytes
add a3, a3, t1 # Bump dst pointer
bgez a4, exit # Done
bnez a2, loop # Anymore?

exit:
ret

Rogier Brussee

Aug 29, 2019, 3:26:30 AM8/29/19
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
First of all see Bruce Hoult's remark: the whole issue may be moot and yet another layer of redirection seems meh.

ideas:
* I could imagine a CSRRA[I] (CSR read and add [immediate]) instruction complementing the "bitwise" CSR instructions. The problem is of course where to put it, because the CSR number is big. There seems to be room in the CSR/func3 == 0b100 minor opcode for an immediate version, but the privileged spec seems to be a heavy user of CSR/func3 == 0b000 (albeit all with rd = x0), which makes it a bit awkward to also have a CSRRA instruction :-(.

* As above, but just have an R-type instruction that only adds to the VL CSR.

* If you could mmap the CSR file, you could use the AMO ops to manipulate them, in particular add and subtract (and max and min!).

* Ditch the idea that a VL CSR has to specify a VL register, and simply use one register for VL by convention (t1 = x6 or t2 = x7?), using it implicitly, just like sp and ra are used implicitly in the C instructions, while allowing the VL register to be specified in the 64(?)-bit wide "allow to specify everything" version of your instructions. This, of course, requires specifying that you are in vector mode in some way other than VL != 1 if you want to use implicit vectorisation.

Ciao,
Rogier


  
On Tuesday, August 27, 2019 at 9:45:12 PM UTC+2, lkcl wrote:

lkcl

Aug 29, 2019, 3:53:53 AM8/29/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Wednesday, August 28, 2019 at 11:00:12 PM UTC+1, Bruce Hoult wrote:
Have a register (or CSR) contain some sort of pointer to another
register? Just: no way. Micro-architectural nightmare.

that's what i thought, initially: it's why i paused for a long time before raising the idea.  then it occurred to me that

(a) there's only one of them (VL is global) so the contents may be cached and
(b) in an efficient OoO design the CSRs *are* a register file which requires dependency-management anyway and
(c) the implications of the CSR-register-containing-a-pointer is just another dependency hazard

in addition, both predication and MV.X (regs[rd] = regs[regs[rs1]]) require pretty much exactly the same microarchitectural dependency hardware to be in place.  in the case of "CSR-register-is-a-pointer", the actual vector length is obtained via "regs[CSRregs[VL]]" which is near-identical to MV.X

[MV.X is the scalar equivalent of the vector-indexed move operation]
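As a concrete illustration (a minimal sketch, using a plain Python list as the regfile), the MV.X semantics quoted above — and the near-identical "regs[CSRregs[VL]]" lookup — are just one level of indirection:

```python
# Minimal model of the MV.X semantics above: regs[rd] = regs[regs[rs1]].
# The "VL-as-pointer" read (regs[CSRregs[VL]]) needs the same indirection.
def mv_x(regs, rd, rs1):
    regs[rd] = regs[regs[rs1]]

regs = [0] * 32
regs[5] = 7        # x5 holds the *index* 7
regs[7] = 42       # x7 holds the data
mv_x(regs, 3, 5)   # x3 = regs[regs[5]] = regs[7]
print(regs[3])     # -> 42
```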

so a good vector engine will already *have* the required concepts / hardware in place and/or have to solve near-identical microarchitectural design issues anyway.

in-order systems, the one-stop-shop solution to everything of course is "stall, stall, stall"... :)

The scalar instructions in, for example, this strncpy loop do not take
significant time. In a real version of the code they would be
interleaved with vector instructions rather than all at the end,

that's *if* the vector engine is a separate one from the scalar engine.  some embedded low-cost solutions may not have a separate ALU, for example in embedded 3D.  you'll meet some of the people for whom such a microarchitectural design decision will be critical, tomorrow.

that having been said...
 
and
would on almost all machines be completed long before the preceding
vector instruction is. In particular the move from the VL CSR would
happen soon after the vlbff.v and the increments to the pointers soon
after that.

...ok, great: so in an in-order system, clashes (dependency hazards) would be long gone by the time the CSR-pointing-to-the-register had been established.

as long as the code had been arranged so that the VL CSR pointer-setup was well in advance of its use.

 
Maybe something like:


i like this example.  it's really elegant.
 

strncpy:
 mv a3, a0 # Copy dst
loop:
 setvli x0, a2, vint8 # Vectors of bytes.

ok so here there's a dependency: VL has a read dependency on a2. x0 is not written to, so there's no write dependency created.


vlbff.v v1, (a1) # Get src bytes

this instruction has both a read *and* write dependency. v1 has a read dependency on VL, and because VL is written to it creates a second ongoing write-dependency.


vseq.vi v0, v1, 0 # Flag zero bytes

... which occurs here. so here, VL's (new value) creates a read dependency on both v0 and v1.


csrr t1, vl # Get number of bytes fetched

here, VL's new value from vlbff creates a read dependency on the scalar register, t1.  so there's one potential cycle's "grace" in an in-order system where stall would not occur.  as these are not complex operations i'd be really surprised if significant latency was required, no matter what the microarchitecture.

the rest of the assembly code is straightforward apart from the modification to a2 and looping back to where a2 is used...

vmfirst a4, v0 # Zero found?
 add a1, a1, t1 # Bump src pointer
 vmsif.v v0, v0 # Set mask up to and including zero byte.
 sub a2, a2, t1 # Decrement count.
 vsb.v v1, (a3), v0.t # Write out bytes
 add a3, a3, t1 # Bump dst pointer
 bgez a4, exit # Done
 bnez a2, loop # Anymore?

... here - and it was set up (written to) over 5 instructions ago as far as the entrance to the next loop iteration is concerned.  that's still a write-dependency, however, which in a seriously-fast out-of-order design may result in tripping the dependency hardware.
 

so, let's go over it again, this time with the hypothetical VL-points-to-a-scalar-reg augmentation.

strncpy:
 mv a3, a0 # Copy dst
loop:
 setvli t1, a2, vint8 # Vectors of bytes.

note that t1 is now the target.  this says - hypothetically - that t1 *is* VL.

so here there's a dependency: VL has a read dependency on a2. *t1* has a write dependency created on whatever is going to use it in the near future


vlbff.v v1, (a1) # Get src bytes

this instruction has both read *and* write dependencies. v1 has a read dependency not on VL, but on *t1*, and because *t1* is written to it creates a second ongoing write-dependency... *on t1*.


vseq.vi v0, v1, 0 # Flag zero bytes

... which occurs here. so here, t1's (new value) creates a read dependency on both v0 and v1.

# NO LONGER NEEDED csrr t1, vl # Get number of bytes fetched

t1 has *already* been set up with the required value [this is the (one) instruction in the loop that is saved, reducing the loop count in RVV by around... 8% or so].


again: the rest of the assembly code is straightforward apart from the modification to a2 and looping back to where a2 is used...

vmfirst a4, v0 # Zero found?
 add a1, a1, t1 # Bump src pointer
 vmsif.v v0, v0 # Set mask up to and including zero byte.
 sub a2, a2, t1 # Decrement count.
 vsb.v v1, (a3), v0.t # Write out bytes
 add a3, a3, t1 # Bump dst pointer
 bgez a4, exit # Done
 bnez a2, loop # Anymore?

again, t1 is all "read" here (not written to), so again the only concern is that a2 was written to 5 instructions up, which, on looping back (on very fast systems), will create a write hazard back at the setvli, just as with the current revision of RVV.

so, honestly i'm not seeing anything insurmountable, here.  if i haven't missed anything, my feeling is that a good dependency-tracking system will have the necessary hardware in place, and an in-order system is going to be using stall, stall, stall anyway.

does that look reasonable?

l.

lkcl

Aug 29, 2019, 4:23:16 AM8/29/19
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


On Thursday, August 29, 2019 at 8:26:30 AM UTC+1, Rogier Brussee wrote:
First of all see Bruce Hoult's remark: the whole issue may be moot and yet another layer of redirection seems meh.

shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier!  and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction!  one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.

if the design principles of RISC and RISC-V are to be respected and followed, small reductions in code size are significant, and big reductions even more so.


ideas:
* I could imagine a CSRRA[I] (CSR read and add [immediate]) instructions complementing the "bitwise" CSR instructions. Problem is of course where to put that because the CSR number is big. There seems to be room in the CSR/func3 == 0b100 minor opcode for an immediate version, but the privileged spec seems to be a heavy user of the CSR/func==000 however (albeit all with rd = x0), which makes it a bit awkward to also have a CSRRA instruction  :-(.

"here be dragons"... if you have one CSR being allowed this kind of special treatment (arithmetic) pretty soon there will be calls for yet more arithmetic operations.  at that point the ISA has a duplication of the *entire* suite of arithmetic operators.

CSRs were never intended for this kind of close-knit arithmetic tie-in.  you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.

VL *completely* breaks that rule, right from the SETVLI implementation (VL=MIN(rs1, MVL)), and fail-on-first even more so.  fail-on-first not only has a read-dependency on the VL CSR, it has a *write* dependency as well.
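The fail-on-first read-modify-write on VL described here can be modelled in a few lines (a hedged sketch, not the RVV spec: memory is a dict, "faulting" just means "address not mapped", and the function name is invented):

```python
# Sketch of fail-on-first: a speculative vector load *reads* VL, loads
# elements until one would fault, then *writes* VL back truncated -
# hence the read-modify-write dependency on the VL CSR.
def vlbff(memory, base, vl):
    """Return (loaded_elements, new_vl)."""
    out = []
    for i in range(vl):
        if base + i not in memory:  # first faulting element
            if i == 0:
                raise MemoryError("fault on element 0 traps as usual")
            return out, i           # VL <- i: the *write* half of the dependency
        out.append(memory[base + i])
    return out, vl                  # no fault: VL unchanged

mem = {100 + j: j for j in range(5)}  # only 5 bytes mapped at address 100
data, new_vl = vlbff(mem, 100, 16)    # asked for 16 elements, got 5
print(new_vl)                         # -> 5
```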

this is the core of the argument for special-case treatment of VL (and making it an actual scalar register): as a CSR its use goes well beyond that for which CSRs were originally designed.

whereas... if SETVLI is modified to set up a *pointer* to a scalar register, *now* the VL CSR is more along the lines of how CSRs were intended to be used.  set them up once to change the behaviour (and leave them alone), do some tightly-dependent arithmetic work, then reset them.

 
*As above, but just have an R-type instruction that only add's to the VL CSR. 

again, i'd be concerned at the special treatment.  once you want ADD, someone else will want MUL.  and DIV.  and... etc. etc.

 
*If you could mmap the CSR file,  you could use the AMO-ops to manipulate them, in particular use add and subtract (and max and min!). 

interesting.  i've mulled over the idea of mapping the CSR regfile SRAM into the actual global memory space before.  the architectural implications (and power consumption due to the load on the L1 cache) had me sliiightly concerned.

mind you, for 3D, we need separate pixel buffer memory areas and so on so it's a problem that has to be solved.

worth thinking through, some more, i feel.


*Ditch the idea that a VLCSR has to specify a VL registers but simply use one register for VL by convention (t1= x6 or t2 = x7 ???) and use it implicitly,  just like sp ra are used implicitly in the C instructions, allowing to specify the VL register in the 64(?)  bit wide "allow to specify everything" version of your instructions. This, of course,  requires specifying you are in vector mode in other ways then VL != 1 if you want to use implicit vectorisation.
 
i kinda like it, however mentally i am rebelling at the lack of orthogonality.  allocating one register to VL means it's effectively removed from use in all other circumstances...

... and if one register is allocated, you still have to have the dependency-tracking on that (one) scalar register, and if you have dependency-tracking on one scalar register (as a "hidden" VL) you might as well go the whole hog and go orthogonal.

that said: from what i saw of the statistical analysis of register-usage by gcc that WD did, many of the registers x1-x31 have near-zero percentage utilisation, so something at the high end of the regfile numbering probably wouldn't be missed.

however if you do that (x31 for example), use of RVC instructions is out of the question.  and if you _do_ allocate one of the registers accessible by RVC (x8-15) you just took out a whopping 12.5% of the available registers for use by RVC.


with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?

l.

lkcl

Aug 29, 2019, 4:55:58 AM8/29/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Thursday, August 29, 2019 at 8:53:53 AM UTC+1, lkcl wrote:
 
ok so here there's a dependency: VL has a read dependency on a2. x0 is not written to, so there's no write dependency created.

so here's the thing: (thanks to rogier for going over some of the alternatives, it allows clear highlighting by comparison of the benefits of this idea)

as long as VL is not read via a CSRR, it's actually completely out of the picture as far as dependency-tracking is concerned.  where in the current version of RVV it is the "exception" - the problematic CSR that requires close arithmetic tie-in, when it is made into a pointer-to-a-scalar-that-is-used-instead, it becomes just like every other CSR is supposed to be.

at that point, some interesting microarchitectural design-decisions (and compromises) come back into play:

(1) where before, due to its close tie-in, VL could in no way be left out of full first-order dependency-tracking, its change in role now allows it to be considered a "second-rate" CSR, where accessing it results in "stalls", thus greatly simplifying the architecture and reducing gate count.

(2) as it is the only one, VL may be hardware-cached, i.e. the fact that it points to a scalar register, well, that's only 5 bits: that's not very much to pass round and through pipelines.

(3) if it's not very much to pass around, then the possibility exists to *rewrite* a CSRR VL instruction to become a MV operation, *at execution time*!

yes, really: at instruction *decode* time, with there being only the 5 bits to check "if VL-cache-register is non-zero and CSR register == VL", it's really not that much extra logic to *directly* substitute the CSRR instruction with "MV rd, VL-where-VL-is-actually-the-contents-of-the-VL-cache"

that would then allow the substituted-instruction to go directly into dependency-tracking *on the scalar register*, nipping in the bud the need for special CSR-related dependency logic, and no longer requiring the sub-par "stall" solution, either.
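The decode-time substitution can be sketched as follows (purely illustrative: the tuple encoding, function name, and CSR number are all assumptions, not real RISC-V encodings):

```python
# Hypothetical decode-time rewrite: if the VL "cache" holds a non-zero 5-bit
# register pointer, "CSRR rd, VL" is substituted with a plain "MV rd, xN",
# dropping it into ordinary scalar-register hazard tracking.
VL_CSR = 0xC20  # placeholder CSR number for illustration only

def rewrite(insn, vl_cache_reg):
    op, rd, src = insn
    if op == "csrr" and src == VL_CSR and vl_cache_reg != 0:
        return ("mv", rd, vl_cache_reg)   # substituted instruction
    return insn                           # everything else passes through

print(rewrite(("csrr", 10, VL_CSR), vl_cache_reg=6))   # -> ('mv', 10, 6)
```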

overall i think it's not so much a nightmare: it's just different (as all new ideas are), and, ultimately, it's solvable.

l.

lkcl

Aug 30, 2019, 6:25:51 AM8/30/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
No wait... hang on... it's dead simple. There is no need for that instruction-decode-rewrite trick, because it's not the *contents* of the register pointed to by VL that is needed to be returned, it's the *pointer* that is needed to be returned by CSRR rd, VL.

[getting at the *contents* - well, um... it's already *in* the scalar register!]

That has no effect on the scalar regfile read dependencies at all (just a write dependency on rd).

That in turn means no stalls of the ALU(s) on CSRR rd, VL.

So the only thing that an implementation need do in order to have a little extra speed is to cache the contents of the-register-pointed-to-by-VL in latches, and for that register to be managed by absolutely standard register dependency hazard techniques, and for the latch to be simultaneously written whenever the pointed-to register is also updated in the register file.
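The latch-caching scheme can be sketched like so (a minimal model under assumed names: the class and field names are invented, and "latch" is just a plain attribute here):

```python
# Sketch: the regfile write port also updates a VL latch whenever the
# written index matches the VL pointer, so the current VL is always
# available without a regfile read.
class RegfileWithVLLatch:
    def __init__(self, vl_ptr):
        self.regs = [0] * 32
        self.vl_ptr = vl_ptr     # 5-bit pointer held in the VL CSR
        self.vl_latch = 0        # cached copy of regs[vl_ptr]

    def write(self, idx, value):
        if idx != 0:             # x0 stays zero
            self.regs[idx] = value
        if idx == self.vl_ptr:   # simultaneous latch update
            self.vl_latch = value

rf = RegfileWithVLLatch(vl_ptr=6)
rf.write(6, 9)                   # any write to x6 refreshes the latch
rf.write(7, 99)                  # writes elsewhere leave the latch alone
print(rf.vl_latch)               # -> 9
```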

That really is not very challenging, microarchitecturally, at all.

I think we will go ahead with this change to SV, given the benefits it brings by effectively allowing the full range of RISC-V arithmetic operations to be applied directly to VL, in addition to the savings on instruction count.

L.


Rogier Brussee

Aug 30, 2019, 12:55:25 PM8/30/19
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


On Thursday, August 29, 2019 at 10:23:16 AM UTC+2, lkcl wrote:


On Thursday, August 29, 2019 at 8:26:30 AM UTC+1, Rogier Brussee wrote:
First of all see Bruce Hoult's remark: the whole issue may be moot and yet another layer of redirection seems meh.

shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier!  and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction!  one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.

For this example.
 
 

if the design principles of RISC and RISC-V are to be respected and followed, small reductions in code size are significant, and big reductions even more so.


Maybe. It is a matter of weighing cost and benefit.

ideas:
* I could imagine a CSRRA[I] (CSR read and add [immediate]) instructions complementing the "bitwise" CSR instructions. Problem is of course where to put that because the CSR number is big. There seems to be room in the CSR/func3 == 0b100 minor opcode for an immediate version, but the privileged spec seems to be a heavy user of the CSR/func==000 however (albeit all with rd = x0), which makes it a bit awkward to also have a CSRRA instruction  :-(.

"here be dragons"... if you have one CSR being allowed this kind of special treatment (arithmetic) pretty soon there will be calls for yet more arithmetic operations.  at that point the ISA has a duplication of the *entire* suite of arithmetic operators.

CSRRA would be allowed on all CSRs, just as CSRRS and CSRRC are allowed on all CSR registers; they would just not necessarily be useful. I assume here that adds are more useful than anything else except what is available now. Also, if you go down the road of trying to squeeze into the CSR/privileged opcode, you have no room left for anything else :-(.

In the slightly crazy idea category: I seem to recall that what CSRRS and CSRRC do exactly depends on the CSR, and they are allowed to do "strange things" like nothing at all, or setting bits that are implied by others. After all, the semantics of a CSR instruction strongly depend on what the CSR is. So you may be allowed (standard-wise, but more importantly, as far as toolchains are concerned) to think of CSRRS and CSRRC as just two (atomic) operations that you can fill in to your heart's desire, and if for your VL CSR you think it is more useful for CSRRS and CSRRC to mean (atomic) ADD and SUB rather than (atomic) OR and ANDN, more power to you.
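That "slightly crazy idea" can be sketched in a few lines (a thought-experiment model only, not anything any RISC-V spec defines; the dict-of-CSRs and function name are assumptions):

```python
# Sketch: for the VL CSR only, reinterpret the set/clear CSR ops as
# atomic ADD/SUB; every other CSR keeps standard CSRRS/CSRRC semantics.
def csr_op(csrs, csr, op, value, vl_csr="vl"):
    old = csrs[csr]
    if csr == vl_csr:                       # special semantics for VL
        csrs[csr] = old + value if op == "s" else old - value
    else:                                   # standard set-bits / clear-bits
        csrs[csr] = old | value if op == "s" else old & ~value
    return old                              # rd receives the old value

csrs = {"vl": 8, "mstatus": 0b1010}
csr_op(csrs, "vl", "s", 1)                  # VL += 1 (the SV post-ffirst bump)
csr_op(csrs, "mstatus", "c", 0b0010)        # ordinary bit-clear elsewhere
print(csrs["vl"], csrs["mstatus"])          # -> 9 8
```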

(What seems standard WARL usage of a CSR is  

CSRRCI rd MVL  -1 # rd = platform dependent max vector length,  MVL = 0:  to indicate vectorisation is _on_ 

CSRRSI zero MVL -1  #  MVL = platform dependent max vector length. 

)


CSRs were never intended for this kind of close-knit arithmetic tie-in.  you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.


Right. But what is the fundamental difference between atomically setting/clearing a bit and atomically adding and subtracting?
 
VL *completely* breaks that rule, right from the SETVLI implementation (VL=MIN(rs1, MVL)), and fail-on-first even more so.  fail-on-first not only has a read-dependency on the VL CSR, it has a *write* dependency as well.

this is the core of the argument for special-case treatment of VL (and making it an actual scalar register): as a CSR its use goes well beyond that for which CSRs were originally designed.

whereas... if SETVLI is modified to set up a *pointer* to a scalar register, *now* the VL CSR is more along the lines of how CSRs were intended to be used.  set them up once to change the behaviour (and leave them alone), do some tightly-dependent arithmetic work, then reset them.

 
*As above, but just have an R-type instruction that only add's to the VL CSR. 

again, i'd be concerned at the special treatment.  once you want ADD, someone else will want MUL.  and DIV.  and... etc. etc.

 
*If you could mmap the CSR file,  you could use the AMO-ops to manipulate them, in particular use add and subtract (and max and min!). 

iinteresting.  i've mulled over the idea of mapping the CSR regfile SRAM into the actual global memoryspace before.   the architectural implications (and power consumption due to the load on the L1 cache) had me sliiightly concerned.

mind you, for 3D, we need separate pixel buffer memory areas and so on so it's a problem that has to be solved.

worth thinking through, some more, i feel.


*Ditch the idea that a VLCSR has to specify a VL registers but simply use one register for VL by convention (t1= x6 or t2 = x7 ???) and use it implicitly,  just like sp ra are used implicitly in the C instructions, allowing to specify the VL register in the 64(?)  bit wide "allow to specify everything" version of your instructions. This, of course,  requires specifying you are in vector mode in other ways then VL != 1 if you want to use implicit vectorisation.
 
i kinda like it, however mentally i am rebelling at the lack of orthogonality.  allocating one register to VL means it's effectively removed from use in all other circumstances...


It sort of depends on how you look at it. You are perfectly free to use say  t1 as vl register for any other purpose as long as you don't do SV stuff, just like you are perfectly free to use ra for any other purpose, but of course if you  C.JAL it gets implicitly trashed, and you have to save and restore the value on the stack if you call a function.  In particular there is no compatibility issue. 
 
... and if one register is allocated, you still have to have the dependency-tracking on that (one) scalar register, and if you have dependency-tracking on one scalar register (as a "hidden" VL) you might as well go the whole hog and go orthogonal.


I don't quite see what you mean here with dependency tracking, but you seem to have already decided that there is no dependency. My mental model is that an instruction "expands" to an instruction with an explicit vl register input which just happens to have a conventional number (just like sp in C.LDSP).
 

that said: from what i saw of the statistical analysis of register-usage by gcc that WD did, many of the registers x1-x31 have near-zero percentage utilisation, so something at the high end of the regfile numbering probably wouldn't be missed.

I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful, and made more difficult if one of those registers is used as vl, whereas a scalar use of t1 can be replaced with a temporary in x16-x31 (e.g. t3) without any problem. This is why t1 or t2, which are just as hard or easy to use as x31, are still better than x31.
 

however if you do that (x31 for example), use of RVC instructions is out of the question.  and if you _do_ allocate one of the registers accessible by RVC (x8-15) you just took out a whopping 12.5% of the available registers for use by RVC.


It would seem the important instructions to use with vl are C.ADDI, C.LI, C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, C.ADD and SUB. Each of these RVC instructions uses a 5-bit register number.
 
with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?


No, because even if redirection is free, once the vector length is in a register and not in a CSR, I don't really see what it buys you to be able to set _different_ registers as the vl register (even if you specify the vector length in a parameter of a function, just doing mv t1 a1 is shorter). I think such a CSR would (almost?) always point to a conventional register like t1. Also, it is a CSR's worth of state that has to be saved on context switches. Being able to make the vl register dependency explicit by making it explicitly part of the long "specify everything" version of your instructions seems "the right thing", though.

This cannot be uncoupled from a calling convention.

Anyway I was just giving you food for thought which it seems to have done :-).

Rogier


 
l.

lkcl

Aug 30, 2019, 2:25:05 PM8/30/19
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
On Saturday, August 31, 2019 at 12:55:25 AM UTC+8, Rogier Brussee wrote:

> shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier!  and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction!  one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.
>
>
> For this example.
>

Yes. Which is "the" canonical example of sequential data-dependent fail-on-first parallel processing.

In SV we're down to 8 RVC opcodes within the loop and 2 on the exceptional exit path. Use of VBLOCK, which contains the SETVL and the Vectorisation Context, requires a 64-bit overhead.

[Yes, SV can vectorise RVC opcodes]


>
>
> CSRRA would be allowed on all CSR's, just as  CSRRS and CSRRC are allowed on all CSR registers, they would just not necessarily be useful. I assume here that adds are more useful than anything else except what is available now. Also if you go the road of trying to squeeze in the CSR/ privileged opcode you have no room left for anything else :-(.  

Yep. Sigh.


> CSRs were never intended for this kind of close-knit arithmetic tie-in.  you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.
>
>
>
>
> Right. But what is the fundamental difference between atomically set/clear a bit and atomically adding and subtracting. 

That's slightly missing the point: the point is that the scalar registers are what you're supposed to do arithmetic on, and CSRs are what are supposed to change the behaviour of the engine, run a bunch of arithmetic ops, then switch it off again.

The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on.

The only reason you are "supposed" to read CSRs for is in unusual circumstances, such as context switches.

Vector data-dependent fail-on-first completely throws that rule out the window.

*in the middle of the loop*, VL is modified by the vbff opcode, giving it read modify write dependency requirements, and tying in the ALUs closely with instruction issue *and* CSR management, in the process.

Where before (without vbff) those dependencies on the VL CSR were read-only, and complexity thus _was_ greatly reduced, it isn't any more.

Clearly This Is Bad :) or, could be better.

Given that data paths and dependency tracking already exist between Vector and scalar instructions, making VL "be" a scalar register quite dramatically simplifies the microarchitectural implementation rather than complicates it.



> *Ditch the idea that a VLCSR has to specify a VL registers but simply use one register for VL by convention (t1= x6 or t2 = x7 ???) and use it implicitly,  just like sp ra are used implicitly in the C instructions,

> It sort of depends on how you look at it. You are perfectly free to use say  t1 as vl register for any other purpose as long as you don't do SV stuff,

Ah. We've switched contexts. I was trying to keep this mostly to RVV. However the idea applies to both.

Ok for SV, as it uses standard scalar opcodes, there is a lot more pressure on the scalar register number allocation. Even though the regfiles have been extended to a whopping 128 each, they are only accessible by specifying "context" that modifies the 5 bit field in *scalar* opcodes (or 3 bits in RVC).

Thus if one of those is allocated hardwired to VL it has much more serious consequences for SV than for RVV.


> > ... and if one register is allocated, you still have to have the dependency-tracking on that (one) scalar register,

> I don't quite see what you mean here with dependency tracking

Read and write hazards on registers.

> but you seem to have already decided that there is no dependency.

Ah, if you are referring to my last message: I concluded that merely reading the *pointer* of course does not affect the *data* (in this case a scalar reg),

Which now sounds obvious :)

The point was not that the *data* dependencies are gone (clearly if the reg-that-VL-points-to is modified, you just changed VL, so there is definitely a hazard there); the point was that there are no data dependencies caused by use of a *CSRR* instruction, which is subtly different.

> My mental model is that an instruction " expands" to an instruction with an explicit  vl register input which just happens to have a conventional number ((just like sp in C.LDSP).

Yeah, the case for C.LDSP was I feel slightly different. Hilariously (and OT), SV can actually redirect x2 (and override the element width), making C.LDSP useful for purposes beyond its original intent.

>
> I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful, and made more difficult if one of those registers is used as vl,

Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards. No pipeline *ever* needs to stall. evvvvverrrrr. Once data goes in, the pipeline *knows* that there will be a place for the result to go.

The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.

Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.

> whereas a scalar use of t1 can be replaced with a temporary in the  x16-x31 (e.g. t3) without any problem.

See above. SV is under a lot more register allocation pressure.


>
>
>
> It would seem the important instructions to use with vl are C.ADDI, C.LI, C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, C.ADD and SUB. Each of the RVC instructions uses a 5-bit register number.

It's not the VL-related-arithmetic ops that worry me, it's that because SV uses *all* scalar opcodes and contextually marks them as "vectorised", the *vector* operations are under pressure if one of the regs is hard coded to VL.

It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole around which the use of the surrounding scalar regs *cannot be used in a vector*.

It is just a route I do not want to go down.

>
>
> with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
>
>
>
>
> No because even if redirection is free, once the vector length is in a register and not in a CSR, I don't really see what it buys you to be able to set _different_ registers as the vl register,

I believe I explained that above: it is down to the unique design properties of SV, where the scalar regfile *is* the Vector regfile.

By overloading multi-issue execution semantics, SETVL tells the instruction issue phase how many sequentially numbered contiguous *scalar* operations are to be thrown at the *scalar* ALUs.

Making those actually SIMD ALUs is a microarchitectural optional optimisation.

Even having any parallelism *at all* is an optional performance optimisation.
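That issue-phase behaviour can be sketched in a few lines of Python. This is a hedged illustration, not the actual SV specification: the `issue` helper, the ADD example and the register numbering are all made up, and real SV also involves predication and element-width overrides.

```python
def issue(op, rd, rs1, rs2, vl, regs):
    """Expand one "vectorised" scalar opcode into VL sequential scalar ops,
    each targeting consecutively numbered registers.  A minimal (zero
    parallelism) implementation really is just this loop."""
    for i in range(vl):
        if op == "add":
            regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]

regs = list(range(64))                       # pretend extended scalar regfile
issue("add", rd=32, rs1=0, rs2=8, vl=4, regs=regs)
# regs[32..35] now hold regs[0..3] + regs[8..11], element-wise
```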

3D is a real pain, because the core ops involve 4x FP32 or 3x, and the pressure on a standard 32 entry regfile is ridiculous. The power consumption penalty of extra LD/ST is just completely unacceptable, so large regfiles are critical.

Normal 3D ISAs go with special 64-bit instructions; that instead creates pressure on executable size and the I-cache in a hybrid CPU/GPU.

A proprietary GPU wouldn't give a monkey's: the I-cache, D-cache and entire memory arrangement are radically different.

Being able to use RVC and being able to "prefix" RVC scalar opcodes to create one-off (non-VBLOCK) 32 bit vector opcodes is I feel really important for the ultra low power GPU space.


> Also, it is a CSR worth of state that has to be saved on context switches.

It has to be saved anyway.

> Being able to make the vl register dependency explicit by making it part of the long "specify everything" version of your instructions seems "the right thing" though.

On balance... yeah.


>
> Anyway I was just giving you food for thought which it seems to have done :-).

Yes, for which I am very grateful.

I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.

L.


Daniel Petrisko

Aug 30, 2019, 9:02:35 PM
to lkcl, RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
> I assume here that adds are more useful than anything else except what is available now.

Atomic adds are also significantly more expensive (in gates) than other CSR operations. Since CSR operations are not typically performance-critical, I think most architects would rather reuse the ALU with a traditional read-modify-write sequence.

An extra 64-bit adder for atomic CSR adds would be unacceptable for low-end implementations, and reusing the ALU for 3-part atomic CSR adds is unnecessary complexity.

As suggested earlier, memory-mapping CSRs would allow one to get all AMOs for “free”, but in a very platform-specific way. Does anyone know if there’s precedent for that scheme, in current RISC-V implementations or uarchs for other ISAs?

Best,
Dan Petrisko

Jacob Lifshay

Aug 30, 2019, 11:25:38 PM
to Dan Petrisko, Luke Kenneth Casson Leighton, RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
On Fri, Aug 30, 2019, 18:02 Daniel Petrisko <petr...@cs.washington.edu> wrote:
> I assume here that adds are more useful than anything else except what is available now.

Atomic adds are also significantly more expensive (in gates) than other CSR operations. Since CSR operations are not typically performance-critical, I think most architects would rather reuse the ALU with a traditional read-modify-write sequence.

An extra 64-bit adder for atomic CSR adds would be unacceptable for low-end implementations, and reusing the ALU for 3-part atomic CSR adds is unnecessary complexity.

As suggested earlier, memory-mapping CSRs would allow one to get all AMOs for “free”, but in a very platform-specific way. Does anyone know if there’s precedent for that scheme, in current RISC-V implementations or uarchs for other ISAs?

If I recall correctly, x86 has the hart-local interrupt/timer unit memory mapped.

Jacob Lifshay

lkcl

Aug 31, 2019, 12:56:57 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Saturday, August 31, 2019 at 9:02:35 AM UTC+8, Dan Petrisko wrote:
> > I assume here that adds are more useful than anything else except what is available now.
>
> Atomic adds are also significantly more expensive (gates) than other CSR operations.

In both rocket-chip and ariane they are implemented as a special FSM built into the L2 cache communications layer!

> Since CSRs operations are not typically performance critical,

Exactly, and vbff in particular not only reads VL, it writes a new value to it as well.

Which then makes subsequent (necessary) use of CSRR the exception to all the rules about CSRs.

L.

Rogier Brussee

Aug 31, 2019, 8:30:53 AM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


Op vrijdag 30 augustus 2019 20:25:05 UTC+2 schreef lkcl:
On Saturday, August 31, 2019 at 12:55:25 AM UTC+8, Rogier Brussee wrote:
 
> shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier!  and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction!  one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.
>
>
> For this example.
>

Yes. Which is "the" canonical example of sequential data-dependent fail-on-first parallel processing.


Good point.
 
[snip]


>
>
> CSRRA would be allowed on all CSRs, just as CSRRS and CSRRC are allowed on all CSR registers; they would just not necessarily be useful. I assume here that adds are more useful than anything else except what is available now. Also, if you go down the road of trying to squeeze it into the CSR/privileged opcode you have no room left for anything else :-(.

Yep. Sigh.


> CSRs were never intended for this kind of close-knit arithmetic tie-in.  you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.
>
>
>
>
> Right. But what is the fundamental difference between atomically setting/clearing a bit and atomically adding and subtracting?

That's slightly missing the point: the point is that the scalar registers are what you're supposed to do arithmetic on, and CSRs are what are supposed to change the behaviour of the engine, run a bunch of arithmetic ops, then switch it off again.

The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on. 
The only reason you are "supposed" to read CSRs for is in unusual circumstances, such as context switches.


You seem to be using CSRs a lot and pushing the boundaries anyway, and once you are accommodating a full vector processing unit it is safe to assume we are not talking about the lowest end of processors.

 
[snip]
 
>
> I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful, and made more difficult if one of those registers is used as vl,

Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards.  No pipeline *ever* needs to stall. evvvvverrrrr.  Once data goes in, the pipeline *knows* that there will be a place for the result to go.

The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.

Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.


I didn't mean the internal dependency, but....
 
> whereas a scalar use of t1 can be replaced with a temporary in the  x16-x31 (e.g. t3) without any problem.

See above. SV is under a lot more register allocation pressure.


>
>
>
> It would seem the important instructions to use with vl are C.ADDI, C.LI, C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, C.ADD and SUB. Each of the RVC instructions uses a 5-bit register number.

It's not the VL-related-arithmetic ops that worry me, it's that because SV uses *all* scalar opcodes and contextually marks them as "vectorised", the *vector* operations are under pressure if one of the regs is hard coded to VL.

It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole around which the use of the surrounding scalar regs *cannot be used in a vector*.


this point. The first six register numbers are tied up by calling/usage convention, so a seventh is less in the way down there.


 
It is just a route I do not want to go down.


Ok your call.  
>
>
> with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
>
>
>
>
> No because even if redirection is free, once the vector length is in a register and not in a CSR, I don't really see what it buys you to be able to set _different_ registers as the vl register,

I believe I explained that, above,

MMmmm no, but you understand this better than I do.  
 
[snip]

> Also, it is a CSR worth of state that has to be saved on context switches.

It has to be saved anyway.


If there is no CSR that stores which register is currently used as vl (e.g. t1), you don't have to save and restore that. Of course you still have to save the register contents of t1. As an analogue: on a context switch you have to save the stack pointer in x2; you don't have to save the fact that x2 is used as the stack pointer, because that convention is an invariant.
 
> Being able to make the vl register dependency explicit by making it part of the long "specify everything" version of your instructions seems "the right thing" though.

On balance... yeah.


>
> Anyway I was just giving you food for thought which it seems to have done :-).

Yes, for which I am very grateful.


You're welcome.
 
I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.

Which reminds me that I have a half-finished review of the B extension. That draft seems to be improving but could, I think, also use more eyeballs.
 
Rogier

 
L.


lkcl

Aug 31, 2019, 10:57:20 AM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


On Saturday, August 31, 2019 at 1:30:53 PM UTC+1, Rogier Brussee wrote:
 
The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on. 
The only reason you are "supposed" to read CSRs for is in unusual circumstances, such as context switches.


You seem to be using CSRs a lot

well.. the original version of SV was.  the spike-sv simulator quickly disabused me of that idea.  jacob came up with SVPrefix (which sort-of SIMD-ifies scalar opcodes) and i decided to do "VBLOCK", which effectively strips absolutely all of the 32-bit CSR instructions and drops the 192-bit format header (only 16 bits) in its place.
 
and pushing the boundaries anyway,

yup :)
 
and once you are accommodating a full vector processing unit it is safe to assume we are not talking about the lowest end of processors.

ah.  right.  it's really important to remember that SV is _not_ just for high-performance vector processing: it's designed to be useful right the way down to the RV32E level.  at its core it's just a series of for-loops:

PC++;
for VBLKPC in range(VBLOCKLEN)
{
   for vl in range(VL)
   { 
      for subvl in range(SUBVL)
      {
          same instruction gets repeated here,
          just with different regs each time.
          actual parallelism is entirely optional
      }
   }
}
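The nested for-loops above can be written as runnable Python. The counts below are arbitrary example values and the instruction itself is elided; the point is simply that one VBLOCK instruction expands into VL*SUBVL element operations with no parallel hardware required.

```python
VBLOCKLEN, VL, SUBVL = 3, 4, 2     # example values only

def execute_vblock(instruction):
    """Issue one VBLOCK: each of the VBLOCKLEN instructions is repeated
    VL * SUBVL times, just with different regs each time.  Any actual
    parallelism is an optional microarchitectural optimisation."""
    issued = []
    for vblkpc in range(VBLOCKLEN):
        for vl in range(VL):
            for subvl in range(SUBVL):
                # the repeated instruction itself is elided; record which
                # (instruction, element, sub-element) would be issued
                issued.append((vblkpc, vl, subvl))
    return issued

ops = execute_vblock("c.add")
# 3 instructions x VL=4 x SUBVL=2 = 24 element operations issued in sequence
```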

you don't *have* to have massive regfiles, or huge memory bandwidth, or greatly-increased register ports.

there are benefits to SV beyond vectorisation, including being able to save/restore the entire register file with one or two instructions [not one per register].

provision of LD/ST-MULTI was a complete accident, and it can even be predicated, which saves on function-call stack save/restore as well as context-switch.

so it's definitely not just about vectorisation.
 
 
[snip]
 
>
> I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful, and made more difficult if one of those registers is used as vl,

Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards.  No pipeline *ever* needs to stall. evvvvverrrrr.  Once data goes in, the pipeline *knows* that there will be a place for the result to go.

The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.

Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.


I didn't mean the internal dependency, but....

slightly lost.  no the registers are not addressed in "blocks" (which is a nice idea, btw).  they're all individually re-routeable / taggable.  it takes a whopping 16 bits to specify the full context (per register!) so i came up with a shortened 8-bit format.
 
i've been trying to think how to get that down further, for some time.

 
It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole around which the use of the surrounding scalar regs *cannot be used in a vector*.


this point. The first six registers numbers are tied up by calling /usage convention so a seventh is less in the way down there. 

hard-coding registers just makes me twitchy.  the more there are, the harder it becomes to predict what harm the interactions might do to compiler and assembler writers.  we already had to have some for the reduced (compacted) formats, and they're making me nervous.
 
 
> Also, it is a CSR worth of state that has to be saved on context switches.

It has to be saved anyway.


If there is no CSR that stores which register is currently used as vl  (e.g. t1) you don't have to save and restore that.

ok yes, now i understand.  took me a while.  *thinks*... context-switching isn't as high a priority as reducing code size (to get I-cache usage down in ultra-low-power 3D GPU scenarios) ... 
 
I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.

Which reminds me that I have a half finished review of the B extension. That draft seems to be improving but could, I think, also use more eyeballs. 

could you make it here on isa-dev?  regarding the mask-manipulation operations that have been specially added to RVV: we will need the exact same operations in BitManip.  i've been delaying a review as there's been so much else to do.

thanks rogier.

l.

lkcl

Sep 1, 2019, 10:34:11 AM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


On Saturday, August 31, 2019 at 3:57:20 PM UTC+1, lkcl wrote:
 
i've been trying to think how to get that down further, for some time.

Jacob Lifshay

Sep 1, 2019, 12:23:14 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
Few comments:

the SVPMode table needs to have all entries use different encodings (they all currently encode to 0b00)

The encodings don't match the RISC-V long-instruction encodings (SVPMode would need to be moved to accommodate the 1s and 0 needed to produce 48, 64, 80, 96, and larger bit lengths).

Jacob

lkcl

Sep 1, 2019, 1:35:28 PM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Sunday, September 1, 2019 at 5:23:14 PM UTC+1, Jacob Lifshay wrote:


Few comments:

the SVPMode table needs to have all entries use different encodings (they all currently encode to 0b00)

good catch - they're in the commentary below, but not the table.  cut/paste error.
  
The encodings don't match the RISC-V long-instruction encodings (SVPMode would need to be moved to accommodate the 1s and 0 needed to produce 48, 64, 80, 96, and larger bit lengths).

ah.  right.  the 80+-bit format works slightly differently.  from Section 1.5 of the RV spec, bits 12-14 specify the length according to the formula "80 + NNN * 16", for all values of NNN != 0b111.  when bits 12-14 are all set to 0b111, that indicates *another* encoding, the 192-and-above bit length format.

despite these being moot (as in: they're not *actually* 80+-bit or 192+-bit opcodes), i really like the longer-formats.  but that's off-topic.

it's moot because actually the opcode header/identifier is purely used to put the processor into "Vector Context Mode".  it is *NOT* required to have a massive buffer in which to store the entire 80+ or 192+ bit opcode.  Program Order is *still maintained*, EVEN THOUGH a new "context" (Vector Block Context) has been "activated".

branches and jumps are prohibited within the VBLOCK because the PC has been "frozen" i.e. it still points to the *START* of the VBLOCK.

it is bits 0-4 of PCVBLK (Program Counter for VBLOCK) that get incremented, *NOT* the Program Counter.

this is why i said (in an earlier message) in SV that there are nested hardware for-loops: PC, PCVBLK, VL and SUBVL.

* PC incrementing is paused whilst PCVBLK is in effect
* PCBVLK incrementing is paused whilst VL is in effect
* VL incrementing is paused whilst SUBVL is in effect.

l.

Jacob Lifshay

Sep 1, 2019, 5:20:34 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
On Sun, Sep 1, 2019, 10:35 lkcl <luke.l...@gmail.com> wrote:


On Sunday, September 1, 2019 at 5:23:14 PM UTC+1, Jacob Lifshay wrote:


Few comments:

the SVPMode table needs to have all entries use different encodings (they all currently encode to 0b00)

good catch - they're in the commentary below, but not the table.  cut/paste error.
  
The encodings don't match the RISC-V long-instruction encodings (SVPMode would need to be moved to accommodate the 1s and 0 needed to produce 48, 64, 80, 96, and larger bit lengths).

Oops, I had misremembered the number of 1s needed. Sorry.

Jacob

Jacob Lifshay

Sep 1, 2019, 10:32:01 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
On Sun, Sep 1, 2019, 07:34 lkcl <luke.l...@gmail.com> wrote:
>
What about taking the vector/scalar propagation idea and just applying it
to normal instructions (outside of VBLOCK):

there would be a 128-bit CSR (64-bit on RVE) with a 4-bit field for
each register field encoding [1], tentatively named SVMODES. It is
allocated to normal CSRs in the same manner as the time CSR -- by
splitting into XLEN-sized chunks.

[1]: not just register number, other extensions may have a register
field encoding to register number translation table, such as SVorig.

each field would be encoded as follows:
bits 0-1:
00: SUBVL=1
01: SUBVL=2
10: SUBVL=3
11: SUBVL=4

bits 2-3:
00: scalar
01: vector unpredicated
1x: vector (predication -- TBD)

the 4-bit field corresponding to rd would be written with the vector
mode of the result of each instruction, which is calculated from the
instruction and the vector modes as follows:
1. If the instruction specifies scalar/vector mode in the encoding
(like via SVPrefix), then the mode specified by the instruction is
used.
2. Otherwise,
a. the scalar/vector mode is calculated for each of the source
operands by reading from the SVMODES field corresponding to the
register field encoding for that source operand.
b. If the SUBVL modes from the source operands don't match and the
instruction doesn't specifically handle differing SUBVL modes, then
trigger an illegal instruction trap (swizzle can handle different
SUBVL modes, most other instructions can't)
c. The SUBVL mode for the result is chosen from one of the source
operands (doesn't matter which, they match) unless the instruction
specifically overrides the result SUBVL.
d. If all the source operands are scalar, then
1. the result is scalar unless specifically overridden by the
instruction.
e. Otherwise,
1. the result is vector unless specifically overridden by the
instruction.
2. The result predication is TBD (probably selected similarly
to SUBVL selection).
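Steps 2a-2e can be sketched in Python. The field layout follows the 4-bit encoding given above (bits 0-1 SUBVL-1, bits 2-3 scalar/vector); the helper names are made up for illustration, and predication is left TBD as in the proposal.

```python
SCALAR, VECTOR = 0b00, 0b01     # bits 2-3 of each SVMODES field

def encode(subvl, mode):
    """Pack one 4-bit SVMODES field: bits 0-1 = SUBVL-1, bits 2-3 = mode."""
    return (subvl - 1) | (mode << 2)

def result_mode(src_fields):
    """Compute the rd field from the source operands' SVMODES fields,
    per steps 2a-2e (no instruction-specific overrides modelled)."""
    subvls = {f & 0b11 for f in src_fields}
    if len(subvls) > 1:
        # step 2b: differing SUBVL and no special handling -> trap
        raise ValueError("illegal instruction: differing SUBVL modes")
    modes = [f >> 2 for f in src_fields]
    # steps 2d/2e: result is vector if any source is vector, else scalar
    mode = VECTOR if any(m != SCALAR for m in modes) else SCALAR
    return subvls.pop() | (mode << 2)

rs1 = encode(subvl=2, mode=VECTOR)
rs2 = encode(subvl=2, mode=SCALAR)
rd = result_mode([rs1, rs2])
# rd comes out vector with SUBVL=2, i.e. equal to encode(2, VECTOR)
```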

There would be a separate SVMODES csr used for context switching, on a
trap, the old value would be saved (by switching which SVMODES csr is
used or by copying to a different csr), then the SVMODES csr would be
cleared to all zero (all registers scalar with SUBVL=1).

on return from an exception, SVMODES would be restored (by copying
back from the saved csr or by switching which SVMODES csr is used).

There would be a separate instruction that clears SVMODES to all zero,
to allow calling scalar code quickly.

Since using SVMODES instead of all SVPrefix instructions just makes
the code smaller and faster, but doesn't allow using any more
instructions than before, it could be used like a more complicated way
to compress instructions and it would be possible to limit all SVMODES
handling to after the register allocator in the compiler, similar to
how RVC instructions can be substituted entirely in the assembler,
without the compiler needing to know (though knowing allows selecting
more optimal code at the expense of a more complex compiler).

Jacob Lifshay

lkcl

Sep 2, 2019, 4:52:02 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Monday, September 2, 2019 at 3:32:01 AM UTC+1, Jacob Lifshay wrote:
 
What about taking the vector/scalar propagation idea and just applying it
to normal instructions (outside of VBLOCK):

i like it.  so let's do a comparison of the resource utilisation of each, and how it would work.

the premise of the VBLOCK-Prefix idea: up to 2 SVPrefixes apply to the first (and second) instruction.  the prefixes do not specify the register *numbers* (unlike SVOrig), they specify "src1 is a scalar/vector" and the actual register number is picked up from the [first] instruction.

reminder: the cascade basically says that any register number which is "marked" as vectorised from the first [and second] instruction, those register numbers *remain* marked as vectorised and, in subsequent instructions, if the same register number is further used (as a src) it *remains marked as a vector*.  furthermore, the result register from the subsequent instructions *also get marked as vectorised* and so on.  hence a "cascade".
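The cascade can be sketched in a few lines of Python. The instruction tuples and register numbers below are purely illustrative (no real SV encoding is modelled): a register marked vectorised stays marked, and any result whose sources include a marked register becomes marked in turn.

```python
def cascade(initial_vector_regs, instructions):
    """instructions: (rd, rs1, rs2) register-number tuples in static program
    order -- valid because branches are prohibited inside a VBLOCK, so the
    compiler can compute the cascade entirely statically."""
    marked = set(initial_vector_regs)
    for rd, rs1, rs2 in instructions:
        if rs1 in marked or rs2 in marked:
            marked.add(rd)    # the result of a vector op is itself a vector
    return marked

# prefix marks x3 as a vector; the mark cascades x3 -> x5 -> x7,
# while x6 (sourced only from scalars x1, x2) stays scalar
program = [(5, 3, 4), (6, 1, 2), (7, 5, 6)]
vector_regs = cascade({3}, program)
```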

The Mill Architecture also implements this concept, so it is not a new idea.  The Mill however starts from LDs.  only "LD" instructions actually specify the "type" and bit-width of the operand!  all arithmetic operations are subsequently polymorphic (there is no ADD.W, there is only ADD.  there is no FADD, there is only ADD), and consequently the instruction encoding is extremely compact.  it's very cool.


as branches [except back to the start] and function calls are prohibited within the VBLOCK, there are guarantees that the order in which the compiler can calculate which registers will cascade through is completely static.

if the same idea was applied *outside* of a VBLOCK, there would be no such guarantee.  hmm, so that would need to be thought through.


there would be a 128-bit CSR (64-bit on RVE) with a 4-bit field for
each register field encoding [1], tentatively named SVMODES. It is
allocated to normal CSRs in the same manner as the time CSR -- by
splitting into XLEN-sized chunks.

oh hang on - 128/4 is 32, are you suggesting a table-entry *per register*? the way that VBLOCK-SVPrefix works is that the register numbers are taken from the following instructions, and cascade onwards.  which saves a huge amount of CSR space.

i would be extremely surprised if more than 4 possibly 8 sets of prefixing was needed, given the cascading.  marking the *entire* set of registers in advance, there is no need for the cascade (at all).

reading further below i am guessing you're probably not.  this would reduce the required size of the CSR significantly.

let's assume a max of 4 regs: rd, rs1, rs2, rs3.  that would be 4x4=16 bits.  let's assume say... 4 sets of those (up to 4 instructions in sequence may be "marked").  that's 64 bits.  if you wanted to cover up to 8 instructions, yes that would be 128 bits.

so the cost (number of instructions) needed to set up the CSRs would be:

* li x3, immediate # 64-bit immediate: a LD?
* CSRRW CSR1, x3
* li x3, immediate # another 64-bit immediate LD
* CSRRW CSR2, x3

if LD is used for the load-immediate, that's still a whopping *128* bits worth of setup, and other aspects of SVPrefix (extending the register numbers beyond 32, being able to use different predicate registers on different instructions) still need to be included.

by contrast, the VBLOCK prefix is only 16 bits, and whilst only up to 2 instructions may be "marked", only between 16 to 64 bits is used to do so, *and* it covers setting of VL/MVL, and targets multiple predicate registers, and so on.


[1]: not just register number, other extensions may have a register
field encoding to register number translation table, such as SVorig.

each field would be encoded as follows:
bits 0-1:
    00: SUBVL=1
    01: SUBVL=2
    10: SUBVL=3
    11: SUBVL=4

mmm... SUBVL is supposed to be "global", however... what would be implied here is that SUBVL would be applied *per register*.  exactly how that would work would need to be thought through.

[oh, ok, you cover it, below].
 

bits 2-3:
    00: scalar
    01: vector unpredicated
    1x: vector (predication -- TBD)

element-width overrides are also necessary (2 bits to specify default/8/16/32).  that would i feel be better than changing the semantics of SUBVL (from a global to a per-register).
 
the 4-bit field corresponding to rd would be written with the vector
mode of the result of each instruction, which is calculated from the
instruction and the vector modes as follows:
1. If the instruction specifies scalar/vector mode in the encoding
(like via SVPrefix), then the mode specified by the instruction is
used.
2. Otherwise,
    a. the scalar/vector mode is calculated for each of the source
operands by reading from the SVMODES field corresponding to the
register field encoding for that source operand.

... and cascades from there.
 
    b. If the SUBVL modes from the source operands don't match and the
instruction doesn't specifically handle differing SUBVL modes, then
trigger an illegal instruction trap (swizzle can handle different
SUBVL modes, most other instructions can't)

this is where it gets wasteful.  the number of permutations that raise illegal instruction traps is so high that it suggests that the encoding is not a good one.  i feel it would be better to use the 2 bits for elwidth.

        2. The result predication is TBD (probably selected similarly
to SUBVL selection). 

twin-predication also needs to be thought through.  i think it would work though.
 
There would be a separate SVMODES csr used for context switching, on a
trap, the old value would be saved (by switching which SVMODES csr is
used or by copying to a different csr), then the SVMODES csr would be
cleared to all zero (all registers scalar with SUBVL=1).

128 bits worth of context-saving... 
 

on return from an exception, SVMODES would be restored (by copying
back from the saved csr or by switching which SVMODES csr is used).

There would be a separate instruction that clears SVMODES to all zero,
to allow calling scalar code quickly.

ok so here's where the VBLOCK concept has a clear advantage: that extra instruction is not needed.  once the VBLOCK context is exited, the tear-down is automatic.



Since using SVMODES instead of all SVPrefix instructions just makes
the code smaller and faster, but doesn't allow using any more
instructions than before, it could be used like a more complicated way
to compress instructions and it would be possible to limit all SVMODES
handling to after the register allocator in the compiler, similar to
how RVC instructions can be substituted entirely in the assembler,
without the compiler needing to know (though knowing allows selecting
more optimal code at the expense of a more complex compiler).


this is the logic behind VBLOCK-SVPrefix.  a compiler could (conceivably) simply output SVPrefix instructions, and a second-phase optimiser simply goes through them, spots which registers share the same prefixes, works out how to cascade them and *replaces* multiple SVPrefix instructions with a cascading VBLOCK-SVPrefix instead.

so comparing the two:

* SVMODEs uses CSRs, which is an inherent code-size penalty compared to VBLOCK-SVPrefix.
* SVMODEs requires tear-down instructions (another code-size penalty)

these are, for me, the "killers" if the focus is to be on reducing code size (and thus I-Cache usage and thus power consumption).  the overhead of the CSR setup sequence was why i came up with VBLOCK in the first place: VBLOCK-SVPrefix simply continues to take advantage of that opportunity.

l.

lkcl

unread,
Sep 2, 2019, 6:55:01 AM9/2/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
Nuts. Just realised, a context switch (trap) is a nuisance when it comes to a cascade.

Creating the internal state which tracks the cascade, no problem: just some latches.

Storing it for a restore: now we have a problem.

In VBLOCK (the original) there is a single 32 bit CSR for context switching. It contains just enough state to be able to go *BACK* to the START of the VBLOCK, re-read the tables, then jump FORWARD to the offset in the VBLOCK where execution left off.

Funnily enough someone described a scenario a few months back where this exact same trick may be deployed. I forget how or why.

The problem with the cascade is that it becomes necessary to re-read the *entire* group of VBLOCK instructions in order to recreate the cascade.

I am therefore tempted to restrict the cascade to just the first two instructions in the VBLOCK. That results in up to six registers being potentially "marked" as vectorised, with their use also resulting in other operations also being marked.

It's a reasonable compromise, which does not require massive CSR usage.

On the other hand.... I mean, there's only actually 32 registers (31) that can be used, so is it really that bad an overhead? Is it sufficient to just have a single bit mask indicating which registers have been cascade-marked?

If so, that could easily go in the top 32 bits of PCVBLK.

L.

Jacob Lifshay

unread,
Sep 2, 2019, 8:04:31 AM9/2/19
to lkcl, RISC-V ISA Dev, Libre-RISCV General Development
On Mon, Sep 2, 2019 at 1:52 AM lkcl <luke.l...@gmail.com> wrote:
>
> On Monday, September 2, 2019 at 3:32:01 AM UTC+1, Jacob Lifshay wrote:
>
>>
>> What about taking the vector/scalar propagation idea and just apply it
>> to normal instructions (outside of VBLOCK):
>
>
> i like it. so let's do a comparison of the resource utilisation of each, and how it would work.
>
> the premise of the VBLOCK-Prefix idea: up to 2 SVPrefixes apply to the first (and second) instruction. the prefixes do not specify the register *numbers* (unlike SVOrig), they specify "src1 is a scalar/vector" and the actual register number is picked up from the [first] instruction.
>
> reminder: the cascade basically says that any register number which is "marked" as vectorised from the first [and second] instruction, those register numbers *remain* marked as vectorised and, in subsequent instructions, if the same register number is further used (as a src) it *remains marked as a vector*. furthermore, the result register from the subsequent instructions *also get marked as vectorised* and so on. hence a "cascade".
>
> The Mill Architecture also implements this concept, so it is not a new idea. The Mill however starts from LDs. only "LD" instructions actually specify the "type" and bit-width of the operand! all arithmetic operations are subsequently polymorphic (there is no ADD.W, there is only ADD. there is no FADD, there is only ADD), and consequently the instruction encoding is extremely compact. it's very cool.
>
>
> as branches [except back to the start] and function calls are prohibited within the VBLOCK, there are guarantees that the order in which the compiler can calculate which registers will cascade through is completely static.
>
> if the same idea was applied *outside* of a VBLOCK, there would be no such guarantee. hmm, so that would need to be thought through.

I am proposing that the vector/scalar flags would cascade throughout
the whole function/multiple functions, which can be handled in the
compiler with a rather simple data-flow analysis. You could think of
it as tagging the registers with a vector/scalar flag.
I think that SUBVL will actually be more useful there than elwidth,
though those both will be very useful. The majority of graphics
shaders use only i32/u32 and f32, whereas most of them use a range of
SUBVL values.
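The tagging can be sketched as a one-pass data-flow over the instruction stream (a minimal model: the struct layout and all names here are mine for illustration, not anything specified):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of vector/scalar tag propagation (the "cascade"):
 * a register tagged as vector stays vector, and any instruction
 * with a vector source tags its destination register as vector too. */
typedef struct { int rd, rs1, rs2; } Insn;

static void cascade(bool vec[32], const Insn *prog, int n) {
    for (int i = 0; i < n; i++)
        if (vec[prog[i].rs1] || vec[prog[i].rs2])
            vec[prog[i].rd] = true;   /* result inherits vector-ness */
}
```

So with x8 pre-tagged by an svp.lw, an add whose source is x8 tags its destination, which in turn tags anything computed from it, with no per-instruction prefixes needed.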

>
>> 2. The result predication is TBD (probably selected similarly
>> to SUBVL selection).
>
>
> twin-predication also needs to be thought through. i think it would work though.
>
>>
>> There would be a separate SVMODES csr used for context switching, on a
>> trap, the old value would be saved (by switching which SVMODES csr is
>> used or by copying to a different csr), then the SVMODES csr would be
>> cleared to all zero (all registers scalar with SUBVL=1).
>
>
> 128 bits worth of context-saving...

not too bad since, if we design it right, saving/restoring SVMODES can
be skipped for most system calls, it would only need to be
saved/restored for context switches between processes.

>
>>
>>
>> on return from an exception, SVMODES would be restored (by copying
>> back from the saved csr or by switching which SVMODES csr is used).
>>
>> There would be a separate instruction that clears SVMODES to all zero,
>> to allow calling scalar code quickly.
>
>
> ok so here's where the VBLOCK concept has a clear advantage: that extra instruction is not needed. once the VBLOCK context is exited, the tear-down is automatic.

Actually, I think VBLOCK not being able to work on more than a single
basic-block at a time is a disadvantage compared to SVMODES. the
SVMODES-clear instruction would mostly only be used when calling code
that is not SVMODES-aware.

>
>
>>
>> Since using SVMODES instead of all SVPrefix instructions just makes
>> the code smaller and faster, but doesn't allow using any more
>> instructions than before, it could be used like a more complicated way
>> to compress instructions and it would be possible to limit all SVMODES
>> handling to after the register allocator in the compiler, similar to
>> how RVC instructions can be substituted entirely in the assembler,
>> without the compiler needing to know (though knowing allows selecting
>> more optimal code at the expense of a more complex compiler).
>>
>
> this is the logic behind VBLOCK-SVPrefix. a compiler could (conceivably) simply output SVPrefix instructions, and a second-phase optimiser simply goes through them, spots which registers share the same prefixes, works out how to cascade them and *replaces* multiple SVPrefix instructions with a cascading VBLOCK-SVPrefix instead.
>
> so comparing the two:
>
> * SVMODEs uses CSRs, which is an inherent code-size penalty compared to VBLOCK-SVPrefix.

You're missing that SVMODES is changed by each instruction (basically,
SVMODES values follow values in registers), so it would basically only
need to be explicitly written on a context-switch. the rest of the
time, SVPrefix instructions would be used to enter vector mode.

Example:

add_loop:
    setvl x0, a3        # only CSR-type instruction, the rest is done with SVPrefix
    svp.lw x32(vector), (a0), SUBVL=3
                        # encoded using 8 in the register field;
                        # sets SVMODES[x8] to SUBVL=3, vector, unpredicated
loop:
    beqz a2, loop_end
    svp.lw x64(vector), (a1), SUBVL=3
                        # encoded using 16 in the register field;
                        # sets SVMODES[x16] to SUBVL=3, vector, unpredicated
    add x32(vector), x32(vector), x64(vector), SUBVL=3
                        # encoded as add x8, x8, x16 (could use an RVC
                        # instruction if the register numbers fit);
                        # sets SVMODES[x8] to SUBVL=3, vector, unpredicated
    addi a2, a2, -1
    bnez a2, loop
loop_end:
    svp.sw x32(vector), (a0), SUBVL=3
                        # encoded as sw x8, (a0) (could also use an RVC
                        # instruction if the register numbers fit)
    ret                 # no tear-down instructions: following code just has
                        # to use an SVPrefix instruction if it uses an
                        # uninitialized register; most of the time a register
                        # will be written the first time it is used, so
                        # SVPrefix is not required for those instructions

equivalent C code:
void add_loop(int *a0, int *a1, int a2, int a3)
{
    int VL = a3;
    int x32[VL * 3]; // excuse my dynamic arrays (C99 VLAs)
    int x64[VL * 3];
    for (int i = 0; i < VL * 3; i++)
        x32[i] = a0[i];
    while (a2 != 0)
    {
        for (int i = 0; i < VL * 3; i++)
            x64[i] = a1[i];
        for (int i = 0; i < VL * 3; i++)
            x32[i] += x64[i];
        a2--; // matches the asm's addi a2, a2, -1
    }
    for (int i = 0; i < VL * 3; i++)
        a0[i] = x32[i];
}

> * SVMODEs requires tear-down instructions (another code-size penalty)
>
> these are, for me, the "killers" if the focus is to be on reducing code size (and thus I-Cache usage and thus power consumption). the overhead of the CSR setup sequence was why i came up with VBLOCK in the first place: VBLOCK-SVPrefix simply continues to take advantage of that opportunity.
>
> l.
>
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/1a019c40-03fb-4204-bc06-8b14a4e3c4c6%40groups.riscv.org.

lkcl

unread,
Sep 2, 2019, 10:27:15 AM9/2/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Monday, September 2, 2019 at 1:04:31 PM UTC+1, Jacob Lifshay wrote:
> if the same idea was applied *outside* of a VBLOCK, there would be no such guarantee.  hmm, so that would need to be thought through.

I am proposing that the vector/scalar flags would cascade throughout
the whole function/multiple functions,

single function, no problem.  multiple functions: now it's an order of magnitude more complex.  the cascade now becomes part of the contract for the function API.  that's an entire academic research project, all on its own.

we *need* to keep this simple and straightforward, jacob.
 
which can be handled in the
compiler with a rather simple data-flow analysis.

it's nowhere neeeear simple when branch/control is involved.
 
You could think of
it as tagging the registers with a vector/scalar flag.

yes.  same principle for the Mill (except they tag with bit-width and other things).

ok, so let's say that there's a function which, on entry, the first thing it does is a data-dependent branch.

what should the engine do, when cascading the registers *from following the program counter*?

it's going to follow the branch on one call of the function, which will cause a completely different set of registers to become cascade-tagged as "vectorised", when compared to a subsequent call which goes the other direction, isn't it?

now what happens when there are six nested loops and multiple data-dependent paths, as there is in the MP4 decode algorithm?

do you imagine that a compiler would be able to cope with that?

if it was successfully patched, what do you rate the chances of successful upstream acceptance of such a patch?

i'd rate the chances of the success of such an effort at close to zero.

so let's instead go with a sequential ordering from the numbering (address) on the instructions.  instead, the cascade occurs in *instruction* order, as opposed to "execution" order.

now we have a hardware-level nightmare: it's now necessary to keep track of the program counter, perhaps read a dozen or perhaps even hundreds of instructions, tracking them all in order to create and maintain the cascade order.

the complexity is off the scale for one scenario (in hardware), and off the scale in a different direction (in compiler technology) for the other.



> this is where it gets wasteful.  the number of permutations that raise illegal instruction traps is so high that it suggests that the encoding is not a good one.  i feel it would be better to use the 2 bits for elwidth.

I think that SUBVL will actually be more useful there than elwidth,
though those both will be very useful. The majority of graphics
shaders use only i32/u32 and f32, whereas most of them use a range of
SUBVL values.


we're not just covering 3D graphics: SV is for Video Processing, numerical computation, and a lot more.  (the meetup at WD on thursday gave some great feedback, and someone specifically asked for Video Processing to be included).

with scalar RISC-V entirely missing compact 8-bit and 16-bit operations, elwidth overrides are a sane way to get them (in bulk i.e. vectorised).

remember that it's perfectly possible to call SV.SETVL within a VBLOCK.  or CSRRWI SUBVL.

also, that setting SUBVL in a VBLOCK will set it globally.  in the example you gave, i believe that the entirety of the individual "SET SUBVL" marks may be replaced with one single global SET SUBVL, right at the top of the function.

can you think of an example where SUBVL would need to change hugely and frequently, *and* where it would be sub-optimal to use VBLOCK-SVP "grouping" of opcodes (when compared to it being mandatory to call SVMODES-clear at the end of a function *or* before actually *calling* a function)?


> 128 bits worth of context-saving...

not too bad since, if we design it right, saving/restoring SVMODES can
be skipped for most system calls, it would only need to be
saved/restored for context switches between processes.


to give some context (haha), the reason why i designed VBLOCK in the first place was because i considered 128 bits worth of CSR setup in SV-Orig to be far too much.

in addition, if the assumption is that system calls will use registers that are not involved in the cascade, that's a *really* dangerous assumption (and/or requires an entire recompilation of pretty much every single distro's source code to make sure that it _is_ a correct assumption).  and/or requires setting certain boundaries (such as not utilising registers x1-x31) for VBLOCKs, which then limits the entire purpose of the exercise.

keeping things "isolated" to a single function (or to isolated functions in the same source code file that cannot be called externally) is the safest and simplest thing to do, and that means *not* spilling the cascade-context outside of the places where it's used.

think about it: one register happens to be marked as "vectorised", run on a function that is supposed to be scalar, it'll destroy the entire function by using registers as vectors that were never intended to be used as such.

it's just not safe, jacob, and expecting the entirety of the GNU/Linux world to recompile 30,000 packages is not a reasonable expectation, either.  as a hybrid processor, it has to be "compatible" within the confines of the UNIX Platform Spec.
 
>> on return from an exception, SVMODES would be restored (by copying
>> back from the saved csr or by switching which SVMODES csr is used).
>>
>> There would be a separate instruction that clears SVMODES to all zero,
>> to allow calling scalar code quickly.
>
>
> ok so here's where the VBLOCK concept has a clear advantage: that extra instruction is not needed.  once the VBLOCK context is exited, the tear-down is automatic.

Actually, I think VBLOCK not being able to work on more than a single
basic-block at a time is a disadvantage compared to SVMODES.

i'd imagine it covering between three and maybe eight instructions... then setting up a new VBLOCK context on another set of three to eight.  the VBLOCK setup overhead is small enough that it'd be ok, and still compacts instructions down.

remember that the cascade rules go *well beyond* the initial one/two instructions.  the source and destination register(s) of *both* those first instructions (up to a maximum of eight, which is a *lot*) can cause pretty much all of the ongoing instructions to end up being vectorised as well.

my feeling is that the ripple effect will make VBLOCK-SVP be extremely efficient, making the need for marking more than a couple of instructions moot.

the
SVMODES-clear instruction would mostly only be used when calling code
that is not SVMODES-aware.

yes.  understood.  i still don't like the overhead (and there is an additional point that makes the SVMODES-clear overhead moot anyway: the actual *setup* is more costly than in VBLOCK-SVP).

> * SVMODEs uses CSRs, which is an inherent code-size penalty compared to VBLOCK-SVPrefix.

You're missing that SVMODES is changed by each instruction (basically,
SVMODES values follow values in registers),

i don't understand.

i'd expect that the SVMODEs would be set up, a few instructions called, then SVMODEs-clear is called.  anything else is dangerous, costly, complex, and has far too many undesirable consequences.

exceptions to that would be static functions within the same source code file.

so it would basically only
need to be explicitly written on a context-switch.

[and to safely tear down the context so that the entirety of the GNU/Linux software base does not require total recompilation to suit this scheme].

now, if we were google, there would be no problem: propose an entirely new architecture, recompile the OS (Chromium, Android) to suit it, do what we like.  unfortunately, that's not the case, so stepping outside of certain boundaries (certain knock-on consequences) isn't ok.

svp.lw x32(vector), (a0), SUBVL=3 # encoded using 8 in the register
field; sets SVMODES[x8] to SUBVL=3, vector, unpredicated
 
ret # no tear-down instructions, following code just has to use a 
SVPrefix instruction if it uses an uninitialized register, most the 
time a register will be written to the first time it's used, so 
SVPrefix is not required for those instructions 

that is rreeeallly unsafe to do.  it's fine if the example was a static function, used exclusively by other functions in the same source code file, where the compiler can safely determine that there will be no bleed-out of the vectorisation cascade.

l.

lkcl

unread,
Sep 3, 2019, 1:23:19 AM9/3/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
so, i appreciate i might appear to be "overdoing it" on the functions-thing.  think it through, from the perspective of distro-distributed packages, which i do appreciate is not exactly very common at the moment.

* standard packages from e.g. debian or fedora are RV64GC Unix-platform compliant
* hardware has a special mode (SVMODE) which provides vectorisation, where the function-call convention is changed to "leave it on"
* that means that every debian/fedora-packaged library used by the new hardware mode needs to be multilib compiled to understand this new convention.
* that in turn means that every single package needs to have a patch submitted to the packaging (debian has over 30,000 packages)
* applications will still not be safe to run until *all* their dependent libraries have had the new SVMODE multilib recompile.

you see where that's going?  it'd be literally years before the entire debian/fedora (etc.) distro suite was able to fully take advantage of the new hardware, across the board.

or...

you could have a convention where each function is "self-contained responsible" for clean-up.  that means that any function becomes vector-accelerated *immediately*, with no knock-on recompilation consequences required.

users will put in individual requests for the new hardware-accelerated support (and/or help out in doing that by actually submitting packaging patches), on a *per-package* basis, and get immediate results.

if that convention's realistically and pragmatically accepted / acceptable, then the next phase is to actually optimise the hardware based around the convention and on a sensible desire to reduce code size.

so it's not that skipping SVMODEs-cleanup isn't a good way to save on instruction count: it is... it's just that such a dramatic change to standard function-call conventions has undesirable knock-on consequences when it comes to UNIX Platform software distribution / libraries.

interestingly, the exact same thing is going to apply for RVV when it's in common usage.  with RVV it will also not be safe to assume a convention that RVV vector setup has been left in a useable / useful state, except if a tree of functions are used statically within the same source code file.

l.

lkcl

unread,
Sep 3, 2019, 12:01:46 PM9/3/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Monday, September 2, 2019 at 6:55:01 PM UTC+8, lkcl wrote:

> On the other hand.... I mean, there's only actually 32 registers (31) that can be used, so is it really that bad an overhead? Is it sufficient to just have a single bit mask indicating which registers have been cascade-marked?
>
> If so, that could easily go in the top 32 bits of PCVBLK.

Nope, damn, again: 32 bits is not enough, it would need to be 64. The reason is that 2 bits per register are needed to express the following states:

* this reg has not been encountered yet in any VBLOCK instruction.
* this reg has been encountered and determined to be scalar
* this reg has been encountered and determined to be vector
* 4th state, wasted space (oh well).

Or... is it actually safe to say, because of the ORing, that a register need only be determined as vector by a single bit?

i.e. all regs are marked as scalar unless determined otherwise, and is that bitfield sufficient context to continue from?
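to make the arithmetic concrete, a sketch of the 2-bits-per-register packing (the layout and names here are purely illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Three states per register (not-yet-seen / scalar / vector) need 2 bits;
 * 32 registers x 2 bits = 64 bits of context, which no longer fits in the
 * top 32 bits of PCVBLK.  A single vector/not-vector bit per register
 * would fit: 32 registers x 1 bit = 32 bits. */
enum { UNSEEN = 0, SCALAR = 1, VECTOR = 2 };

static uint64_t set_state(uint64_t ctx, int reg, unsigned s) {
    ctx &= ~(3ULL << (reg * 2));            /* clear the old 2-bit field */
    return ctx | ((uint64_t)s << (reg * 2));
}

static unsigned get_state(uint64_t ctx, int reg) {
    return (unsigned)(ctx >> (reg * 2)) & 3u;
}
```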

L.

lkcl

unread,
Sep 4, 2019, 12:20:32 PM9/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
Rats.

I tried applying the new multi-SVP idea to DAXPY and strcpy, two canonical algorithms. Neither really works.

The reason is because:

1. unlike SVP where the prefixing context is applied immediately at the instruction, the two opcodes which primarily define the vector operands do *not* occur in succession (a flawed design prerequisite of this idea).

2. unlike VBLOCK, with the registers being explicitly named (despite the overhead of doing so) there can be some lookahead (static compiler analysis) in a batch of instructions.

Solutions include allowing register number context to be picked up from the 1st instruction, whilst for the 2nd the register numbers are explicitly spelled out, just like in VBLOCK.

This kinda defeats the purpose of the exercise!

Another possible solution is - was - to use some of the spare bits to specify the instruction in the VBLOCK from which the context will get its register numbers.

Whilst on the face of it this seems like a great idea, what it means in practice is that the instruction issue phase actually has to request those instructions (almost certainly requiring a new cache line miss), and decode them, before the first one in the VBLOCK may be executed.

At least with the current VBLOCK system execution may be done (read) in sequential order.

If there are no ideas anyone can think of on how to fix this, I think this VBLOCK-SVPrefix idea goes in the "retire" bin.

Not least because SVPrefix has no way to specify zeroing, inversion or fail-on-first data dependent predication, on which the canonical strcpy example critically relies (ffirst and inv, at least).

Which is a pity as I love the idea of reducing overhead, it just needs too much advance context to work.

Anyone any ideas or solutions?

L.


Jacob Lifshay

unread,
Sep 4, 2019, 4:13:36 PM9/4/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
On Wed, Sep 4, 2019, 09:20 lkcl <luke.l...@gmail.com> wrote:
Rats.
[snip]

Anyone any ideas or solutions?

I still think the full-function vectorization is workable, will explain in more detail in follow-up email.

Jacob

lkcl

unread,
Sep 4, 2019, 5:12:04 PM9/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org

Ok great. Can you also clarify or consider how ffirst, zeroing, lack of elwidth etc might be compensated for?

Also, if I understood correctly, with only 4 bits per register, a heck of a lot of SV capabilities are left out.

Those limitations, if correct, need a full evaluation.

L.


lkcl

unread,
Sep 4, 2019, 7:03:35 PM9/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
2.1.4 pack, all variants, and all permutations of all possible bitwidths of all src and dest operands, are also provided by SimpleV, using an element width override and suitable VL on a single opcode, either C.MV or its pseudo aliases (add rd rs1 x0).
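as a behavioural sketch of what one such elwidth-overridden move does, element by element (the helper name is made up, and truncation-on-narrowing is my assumption for this particular width pairing):

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural model of a single vectorised move (C.MV, or add rd rs1 x0)
 * with a destination elwidth override narrower than the source: each
 * 16-bit source element lands, truncated, in an 8-bit destination
 * element, VL elements in one opcode -- i.e. a "pack". */
static void mv_pack_16to8(uint8_t *dst, const uint16_t *src, int vl) {
    for (int i = 0; i < vl; i++)
        dst[i] = (uint8_t)src[i];     /* truncating narrow, per element */
}
```

other src/dest width permutations are the same loop with different element types, which is the point being made above.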

lkcl

unread,
Sep 4, 2019, 7:19:53 PM9/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
gorc seems to simply be the following:


grev t0, t1, rs2
or t0, t0, t1

Is that correct? If so, what significant general purpose savings are there which justify the additional opcode?

If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.
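for reference, a behavioural model of gorc as i read the draft B-extension pseudocode (the stage masks are taken on trust from my reading of the draft reference code; treat them as an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* gorc ("generalised OR-combine"): at each stage enabled by a bit of
 * shamt, adjacent bit-groups are ORed together (both groups receive the
 * OR), where grev at the same stage would swap them. */
static uint32_t gorc32(uint32_t x, unsigned shamt) {
    if (shamt & 1)  x |= ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    if (shamt & 2)  x |= ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    if (shamt & 4)  x |= ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    if (shamt & 8)  x |= ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    if (shamt & 16) x |= (x << 16) | (x >> 16);
    return x;
}
```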

L.

lkcl

unread,
Sep 9, 2019, 2:55:30 PM9/9/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
Going back to the original topic, always a good thing.

It took me a while to realise that SVPrefix's 64 bit LD/ST.MULTI capabilities are *damaged* - impeded - by the change of VL to VLPTR.

the reason is that the hardware loops on VL (and SUBVL) have to be re-entrant as far as traps are concerned.

You do not expect the PC to increment over a batch of sequential instructions and it is no different here.

Thus, because sequential program order is preserved, one of the registers that could be overwritten by a LD.MULTI is... the very same register being pointed to *as containing VL*.

oink.

With program order being preserved it is *not* possible to say "oh that version of the register which VLPTR points to must stay the same for the whole loop", it doesn't work that way.

And if LD.MULTI is restoring the entire context of a regfile, it has to restore the reg that VL points to.

urk.
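a toy model of the hazard (the VLPTR register number and the LD.MULTI range here are invented purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: x[VLPTR_REG] holds VL.  A LD.MULTI restoring a register
 * range that includes VLPTR_REG rewrites VL partway through the
 * sequential (trap-re-entrant) element sequence, so the loop bound
 * changes under the hardware's feet. */
enum { VLPTR_REG = 5 };

static int ld_multi(uint64_t x[32], const uint64_t *mem,
                    int first, int count) {
    int vl_clobbered = 0;
    for (int r = first; r < first + count; r++) {
        x[r] = mem[r - first];   /* program order must be preserved */
        if (r == VLPTR_REG)
            vl_clobbered = 1;    /* VL just changed mid-instruction */
    }
    return vl_clobbered;
}
```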

So the solution here is to allow SVPrefix to have its own independent CSR STATE, which contains all the element offsets needed for re-entrancy.

This has a nice side effect of making SVPrefix's instruction format entirely independent, with no side effects or "harm" caused to VBLOCK VL/SUBVL.

However... what would be nice is if SVPrefix could "pick up" SUBVL/VL from VBLOCK. So, for that, rather than have SVP set explicit values of 1,2,3,4 for SUBVL, I am advocating they be VBLK.SUBVL,2,3,4

Likewise for P64 (which allows setting of VL, where P48 only allows setting of SUBVL) the default option is to take VBLK.VL, and to allow that to be explicitly ignored if desired.

In this way P64 can be used to explicitly target every single one of the registers x0 - x127 or it can treat them as contiguous vector-like vectors.

The new STATE CSR for SVP is only 32 bits; it is identical, and may fit cleanly into the top 32 bits on RV64.

Thoughts?

L.
