Something that has been bugging me about RVV and SV for some time: the fact that arithmetic on VL requires additional instructions to perform a transfer between VL and the registers used to carry out the necessary arithmetic.
If CSRs were treated orthogonally as actual scalar registers in RISC-V, the problem would be moot.
This particularly hits home on use of fail-on-first.
The above pseudocode for strcpy shows it well: a CSR load is required in RVV in order to get at the modifications to VL that the failfirst actioned.
In SV the effect is even more pronounced, due to the need to increment VL by one after the fail-first, which of course requires first transferring VL to a scalar reg, then performing the arithmetic, then getting the value *back* into VL.
I have been thinking of a solution here which I did not want to share until I was reasonably sure it would be easily implementable in hardware.
The solution is: instead of having a CSR that contains the current VL value, have the CSR point *to* the scalar register that contains (and will indefinitely continue to contain) the current VL value.
This would have the advantage that, once "linked", fail-on-first would automatically result in *direct* modification of that scalar (standard x1-x31 integer Regfile) register.
In the pseudocode above that would save 1 instruction in the inner loop in the RVV case: a reduction of around 8%.
In the SV case it would save *three* instructions in what is currently a 14 instruction loop: a significant saving (even when they're all RVC opcodes).
The hardware challenges are that these are implicit (indirect) modifications to a scalar regfile. Given that VL already has to be modified (in the current revision of RVV) conceptually it is not challenging, it's just that instead of modifying the CSR store, the integer regfile store is to be written.
For an OoO design, which was my primary concern, this makes every vector instruction require one additional read and write register hazard.
For context: some implementations may not have chosen to make VL a read / write dependency hazard, choosing instead to "stall" instruction issue whilst waiting for outstanding vector operations to complete: for such implementations the previous paragraph makes no sense and does not apply.
If on the other hand an OoO engine *has* had support for read and write hazard dependency tracking on VL added (in order to avoid stalling when VL is modified), then changing that to be the scalar register (to which this proposal modification to SETVL points) is not so much of a problem, and might even simplify the microarchitecture.
Inorder systems I am not so concerned about the hardware implications: stall is the go-to "solution" and this situation is no different.
As this is quite a radical design change I have been reluctant to come forward with it; I had to think about it for several months.
Feedback appreciated; will hold off going ahead with this on SV for a while longer.
Constructive feedback on its value in RVV also welcomed as it will save on instruction count in tight loops in RVV, as well.
L.
Have a register (or CSR) contain some sort of pointer to another
register? Just: no way. Micro-architectural nightmare.
The scalar instructions in, for example, this strncpy loop do not take significant time. In a real version of the code they would be interleaved with vector instructions rather than all at the end, and would on almost all machines be completed long before the preceding vector instruction is. In particular the move from the VL CSR would happen soon after the vlbff.v, and the increments to the pointers soon after that.
Maybe something like:
strncpy:
mv a3, a0 # Copy dst
loop:
setvli x0, a2, vint8 # Vectors of bytes.
vlbff.v v1, (a1) # Get src bytes
vseq.vi v0, v1, 0 # Flag zero bytes
csrr t1, vl # Get number of bytes fetched
vmfirst a4, v0 # Zero found?
add a1, a1, t1 # Bump src pointer
vmsif.v v0, v0 # Set mask up to and including zero byte.
sub a2, a2, t1 # Decrement count.
vsb.v v1, (a3), v0.t # Write out bytes
add a3, a3, t1 # Bump dst pointer
bgez a4, exit # Done
bnez a2, loop # Anymore?
strncpy:
mv a3, a0 # Copy dst
loop:
setvli t1, a2, vint8 # Vectors of bytes.
vlbff.v v1, (a1) # Get src bytes
vseq.vi v0, v1, 0 # Flag zero bytes
# NO LONGER NEEDED csrr t1, vl # Get number of bytes fetched
vmfirst a4, v0 # Zero found?
add a1, a1, t1 # Bump src pointer
vmsif.v v0, v0 # Set mask up to and including zero byte.
sub a2, a2, t1 # Decrement count.
vsb.v v1, (a3), v0.t # Write out bytes
add a3, a3, t1 # Bump dst pointer
bgez a4, exit # Done
bnez a2, loop # Anymore?
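To make the fail-on-first behaviour in the two listings concrete, here is a Python model of the loop (a sketch only: the function names are invented, and a "fault" is modelled simply as running off the end of the source buffer, not real RVV trap semantics):

```python
def setvli(avl, mvl=8):
    """Model of setvli: VL = min(requested element count, hardware MVL)."""
    return min(avl, mvl)

def strncpy_vec(src, n, mvl=8):
    """Copy up to n bytes from 'src' (bytes), stopping at a zero byte."""
    dst = bytearray()
    i = 0                            # a1: src pointer
    while n > 0:
        vl = setvli(n, mvl)          # setvli t1, a2
        chunk = src[i:i + vl]        # vlbff.v v1, (a1): may load < vl bytes
        vl = len(chunk)              # ffirst wrote the new VL (the csrr t1, vl)
        if vl == 0:                  # real ffirst would trap on element 0
            break
        mask = 0
        for b in chunk:              # vseq.vi + vmsif.v: mask up to and
            mask += 1                # including the zero byte
            if b == 0:
                break
        dst += chunk[:mask]          # vsb.v v1, (a3), v0.t (masked store)
        i += vl                      # add a1, a1, t1
        n -= vl                      # sub a2, a2, t1
        if 0 in chunk:               # vmfirst a4, v0; bgez a4, exit
            break
    return bytes(dst)
```

The line `vl = len(chunk)` is the point at issue: the loop cannot make progress without reading the post-load VL, which in the first listing costs a `csrr` and in the second comes for free via `t1`.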
First of all see Bruce Hoult's remark: the whole issue may be moot and yet another layer of redirection seems meh.
ideas:

* I could imagine a CSRRA[I] (CSR read and add [immediate]) instruction complementing the "bitwise" CSR instructions. The problem is of course where to put it, because the CSR number is big. There seems to be room in the CSR/funct3 == 0b100 minor opcode for an immediate version, but the privileged spec seems to be a heavy user of CSR/funct3 == 0b000 (albeit all with rd = x0), which makes it a bit awkward to also have a CSRRA instruction :-(.
* As above, but just have an R-type instruction that only adds to the VL CSR.
* If you could mmap the CSR file, you could use the AMO-ops to manipulate them, in particular use add and subtract (and max and min!).
* Ditch the idea that a VL CSR has to specify a VL register, but simply use one register for VL by convention (t1 = x6 or t2 = x7 ???) and use it implicitly, just like sp and ra are used implicitly in the C instructions, allowing the VL register to be specified in the 64(?) bit wide "allow to specify everything" version of your instructions. This, of course, requires specifying that you are in vector mode in ways other than VL != 1 if you want to use implicit vectorisation.
ok so here there's a dependency: VL has a read dependency on a2. x0 is not written to, so there's no write dependency created.
[getting at the *contents* - well, um... it's already *in* the scalar register!]
That has no effect on the scalar regfile read dependencies at all (just a write dependency on rd).
That in turn means no stalls of the ALU(s) on CSRR rd, VL.
So the only thing that an implementation need do in order to gain a little extra speed is: cache the contents of the register-pointed-to-by-VL in latches; have that register managed by absolutely standard register dependency hazard techniques; and have the latch simultaneously written whenever the pointed-to register is updated in the register file.
That really is not very challenging, microarchitecturally, at all.
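As a sketch of the scheme (all names here are invented for illustration, not a definitive implementation): the VL CSR holds a scalar register *number*, SETVL links it, and fail-on-first writes the truncated length directly into the scalar regfile:

```python
class VecUnit:
    """Toy model of the VL-as-pointer proposal."""
    def __init__(self):
        self.x = [0] * 32       # scalar integer regfile x0-x31
        self.vlptr = 0          # the VL "CSR" now holds a register number
        self.MVL = 8            # hardware maximum vector length

    def setvl(self, rd, avl):
        """SETVL rd, avl: link VL to x[rd] and set it to min(avl, MVL)."""
        self.vlptr = rd
        self.x[rd] = min(avl, self.MVL)

    @property
    def vl(self):               # every vector op reads VL through the pointer
        return self.x[self.vlptr]

    def ffirst_load(self, data):
        """Fail-on-first load: truncates VL at the first faulting element.
        With VL-as-pointer the truncated length lands *directly* in the
        scalar register: no csrr needed afterwards."""
        got = data[:self.vl]            # model: "fault" at end of buffer
        self.x[self.vlptr] = len(got)   # direct scalar-regfile write
        return got

v = VecUnit()
v.setvl(6, 16)              # link VL to t1 (x6); VL = min(16, 8) = 8
got = v.ffirst_load(b"abc")
# VL was truncated to 3 and x6 already holds it, so ordinary scalar
# arithmetic (addi t1, t1, 1 etc.) can act on VL with no CSR transfer.
```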
I think we will go ahead with this change to SV, given the benefits it brings by effectively allowing the full range of RISC-V arithmetic operations to be applied directly to VL, in addition to the savings on instruction count.
L.
On Thursday, August 29, 2019 at 8:26:30 AM UTC+1, Rogier Brussee wrote:
> First of all see Bruce Hoult's remark: the whole issue may be moot and yet another layer of redirection seems meh.

shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier! and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction! one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.
if the design principles of RISC and RISC-V are to be respected and followed, small reductions in code size are significant, and big reductions even more so.
> ideas: I could imagine a CSRRA[I] (CSR read and add [immediate]) instruction complementing the "bitwise" CSR instructions. Problem is of course where to put that because the CSR number is big. There seems to be room in the CSR/funct3 == 0b100 minor opcode for an immediate version, but the privileged spec seems to be a heavy user of CSR/funct3 == 0b000 (albeit all with rd = x0), which makes it a bit awkward to also have a CSRRA instruction :-(.

"here be dragons"... if you have one CSR being allowed this kind of special treatment (arithmetic), pretty soon there will be calls for yet more arithmetic operations. at that point the ISA has a duplication of the *entire* suite of arithmetic operators.
CSRs were never intended for this kind of close-knit arithmetic tie-in. you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.
VL *completely* breaks that rule, right from the SETVLI implementation (VL = MIN(rs1, MVL)), and fail-on-first even more so. fail-on-first not only has a read-dependency on the VL CSR, it has a *write* dependency as well.

this is the core of the argument for special-case treatment of VL (and making it an actual scalar register): as a CSR its use goes well beyond that for which CSRs were originally designed.

whereas... if SETVLI is modified to set up a *pointer* to a scalar register, *now* the VL CSR is more along the lines of how CSRs were intended to be used: set them up once to change the behaviour (and leave them alone), do some tightly-dependent arithmetic work, then reset them.

> As above, but just have an R-type instruction that only adds to the VL CSR.

again, i'd be concerned at the special treatment. once you want ADD, someone else will want MUL. and DIV. and... etc. etc.

> If you could mmap the CSR file, you could use the AMO-ops to manipulate them, in particular use add and subtract (and max and min!).

interesting. i've mulled over the idea of mapping the CSR regfile SRAM into the actual global memory space before. the architectural implications (and power consumption due to the load on the L1 cache) had me sliiightly concerned.

mind you, for 3D, we need separate pixel buffer memory areas and so on, so it's a problem that has to be solved. worth thinking through, some more, i feel.

> Ditch the idea that a VL CSR has to specify a VL register, but simply use one register for VL by convention (t1 = x6 or t2 = x7 ???) and use it implicitly, just like sp and ra are used implicitly in the C instructions.

i kinda like it, however mentally i am rebelling at the lack of orthogonality.
allocating one register to VL means it's effectively removed from use in all other circumstances...
... and if one register is allocated, you still have to have the dependency-tracking on that (one) scalar register, and if you have dependency-tracking on one scalar register (as a "hidden" VL) you might as well go the whole hog and go orthogonal.
that said: from what i saw of the statistical analysis of register-usage by gcc that WD did, many of the registers x1-x31 have near-zero percentage utilisation, so something at the high end of the regfile numbering probably wouldn't be missed.
however if you do that (x31 for example), use of RVC instructions is out of the question. and if you _do_ allocate one of the registers accessible by RVC (x8-15) you just took out a whopping 12.5% of the available registers for use by RVC.
with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
l.
> I assume here that adds are more useful than anything else except what is available now.
Atomic adds are also significantly more expensive (in gates) than other CSR operations. Since CSR operations are not typically performance critical, I think most architects would rather reuse the ALU with a traditional read-modify-write sequence.
An extra 64-bit adder for atomic CSR adds would be unacceptable for low-end implementations, and reusing the ALU for 3-part atomic CSR adds is unnecessary complexity.
As suggested earlier, memory-mapping CSRs would allow one to get all AMOs for “free”, but in a very platform-specific way. Does anyone know if there’s precedent for that scheme, in current RISC-V implementations or uarchs for other ISAs?
In both rocket-chip and ariane the AMOs are implemented as a special FSM built into the L2 cache communications layer!
> Since CSR operations are not typically performance critical,
exactly, and vlbff in particular not only reads VL, it writes a new value to it as well.
Which then makes subsequent (necessary) use of CSRR the exception to all the rules about CSRs.
L.
On Saturday, August 31, 2019 at 12:55:25 AM UTC+8, Rogier Brussee wrote:
> shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier! and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction! one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.
>
>
> For this example.
>
Yes. Which is "the" canonical example of sequential data-dependent fail-on-first parallel processing.
>
>
> CSRRA would be allowed on all CSR's, just as CSRRS and CSRRC are allowed on all CSR registers, they would just not necessarily be useful. I assume here that adds are more useful than anything else except what is available now. Also if you go the road of trying to squeeze in the CSR/ privileged opcode you have no room left for anything else :-(.
Yep. Sigh.
> CSRs were never intended for this kind of close-knit arithmetic tie-in. you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.
>
>
>
>
> Right. But what is the fundamental difference between atomically setting/clearing a bit and atomically adding and subtracting?
That's slightly missing the point: the point is that the scalar registers are what you're supposed to do arithmetic on, and CSRs are what you're supposed to use to change the behaviour of the engine: set them up, run a bunch of arithmetic ops, then switch them off again.
The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on.
The only time you are "supposed" to read CSRs is in unusual circumstances, such as context switches.
[snip]
>
> I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful and made more difficult if one of those registers is used as vl,
Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards. No pipeline *ever* needs to stall. evvvvverrrrr. Once data goes in, the pipeline *knows* that there will be a place for the result to go.
The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.
Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.
> whereas a scalar use of t1 can be replaced with a temporary in the x16-x31 (e.g. t3) without any problem.
See above. SV is under a lot more register allocation pressure.
>
>
>
> It would seem the important instructions to use with vl are C.ADDI, C.LI, C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, and C.ADD and SUB. Each of the RVC instructions use a 5 bit register number.
It's not the VL-related-arithmetic ops that worry me, it's that because SV uses *all* scalar opcodes and contextually marks them as "vectorised", the *vector* operations are under pressure if one of the regs is hard coded to VL.
It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole around which the surrounding scalar regs *cannot be used in a vector*.
It is just a route I do not want to go down.
>
>
> with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
>
>
>
>
> No because even if redirection is free, once the vector length is in a register and not in a CSR, I don't really see what it buys you to be able to set _different_ registers as the vl register,
I believe I explained that, above.
[snip]
> Also, it is a CSR worth of state that has to be saved on context switches.
It has to be saved anyway.
> Being able to making the vl register dependency explicit by making it explicitly part of the long "specify everything version" of your instructions seems "the right thing" though.
On balance... yeah.
>
> Anyway I was just giving you food for thought which it seems to have done :-).
Yes, for which I am very grateful.
I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.
L.
> The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on. The only reason you are "supposed" to read CSRs for is in unusual circumstances, such as context switches.

You seem to be using CSRs a lot and pushing the boundaries anyway, and once you are accommodating a full vector processing unit it is safe to assume we are not talking about the lowest end of processors.
[snip]

> > I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful and made more difficult if one of those registers is used as vl,
>
> Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards. No pipeline *ever* needs to stall. evvvvverrrrr. Once data goes in, the pipeline *knows* that there will be a place for the result to go.
>
> The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.
>
> Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.

I didn't mean the internal dependency, but...

> It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole around which the surrounding scalar regs *cannot be used in a vector*.

...this point. The first six register numbers are tied up by calling / usage convention, so a seventh is less in the way down there.
> > Also, it is a CSR worth of state that has to be saved on context switches.
>
> It has to be saved anyway.

If there is no CSR that stores which register is currently used as vl (e.g. t1), you don't have to save and restore that.
> I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.

Which reminds me that I have a half-finished review of the B extension. That draft seems to be improving but could, I think, also use more eyeballs.
i've been trying to think how to get that down further, for some time.
> came up with something, overnight:

Few comments:

* the SVPMode table needs to have all entries use different encodings (they all currently encode to 0b00)
* The encodings don't match the RISC-V long-instruction encodings (SVPMode would need to be moved to accommodate the 1s and 0 needed to produce 48, 64, 80, 96, and larger bit lengths).
On Sunday, September 1, 2019 at 5:23:14 PM UTC+1, Jacob Lifshay wrote:
> > came up with something, overnight:
>
> Few comments:
>
> * the SVPMode table needs to have all entries use different encodings (they all currently encode to 0b00)

good catch - they're in the commentary below, but not the table. cut/paste error.

> * The encodings don't match the RISC-V long-instruction encodings (SVPMode would need to be moved to accommodate the 1s and 0 needed to produce 48, 64, 80, 96, and larger bit lengths).
What about taking the vector/scalar propagation idea and just applying it to normal instructions (outside of VBLOCK):
there would be a 128-bit CSR (64-bit on RVE) with a 4-bit field for
each register field encoding [1], tentatively named SVMODES. It is
allocated to normal CSRs in the same manner as the time CSR -- by
splitting into XLEN-sized chunks.
[1]: not just register number, other extensions may have a register
field encoding to register number translation table, such as SVorig.
each field would be encoded as follows:
bits 0-1:
00: SUBVL=1
01: SUBVL=2
10: SUBVL=3
11: SUBVL=4
bits 2-3:
00: scalar
01: vector unpredicated
1x: vector (predication -- TBD)
the 4-bit field corresponding to rd would be written with the vector
mode of the result of each instruction, which is calculated from the
instruction and the vector modes as follows:
1. If the instruction specifies scalar/vector mode in the encoding
(like via SVPrefix), then the mode specified by the instruction is
used.
2. Otherwise,
a. the scalar/vector mode is calculated for each of the source
operands by reading from the SVMODES field corresponding to the
register field encoding for that source operand.
b. If the SUBVL modes from the source operands don't match and the
instruction doesn't specifically handle differing SUBVL modes, then
trigger an illegal instruction trap (swizzle can handle different
SUBVL modes, most other instructions can't)
c. The result predication is TBD (probably selected similarly
to SUBVL selection).
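A sketch of the per-register field encoding listed above (SUBVL in bits 0-1, scalar/vector mode in bits 2-3; the function names are invented for illustration, and the 128-bit CSR is handled here as one Python integer rather than XLEN-sized chunks):

```python
def encode_svmode(subvl, mode):
    """subvl in 1..4; mode: 0 = scalar, 1 = vector unpredicated,
    2/3 = vector predicated (TBD)."""
    assert 1 <= subvl <= 4 and 0 <= mode <= 3
    return (mode << 2) | (subvl - 1)

def decode_svmode(field):
    return {"subvl": (field & 0b11) + 1, "mode": (field >> 2) & 0b11}

def svmodes_get(csr, regnum):
    """Extract one 4-bit field from the 128-bit SVMODES CSR value."""
    return (csr >> (4 * regnum)) & 0xF

def svmodes_set(csr, regnum, field):
    shift = 4 * regnum
    return (csr & ~(0xF << shift)) | ((field & 0xF) << shift)

csr = 0                                            # all scalar, SUBVL=1
csr = svmodes_set(csr, 8, encode_svmode(3, 1))     # x8: vector, SUBVL=3
```

Note that the all-zeros CSR value conveniently means "every register scalar with SUBVL=1", which is what the trap-entry clearing below relies on.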
There would be a separate SVMODES csr used for context switching: on a trap, the old value would be saved (by switching which SVMODES csr is used or by copying to a different csr), then the SVMODES csr would be cleared to all zero (all registers scalar with SUBVL=1).
on return from an exception, SVMODES would be restored (by copying
back from the saved csr or by switching which SVMODES csr is used).
There would be a separate instruction that clears SVMODES to all zero,
to allow calling scalar code quickly.
Since using SVMODES instead of all SVPrefix instructions just makes
the code smaller and faster, but doesn't allow using any more
instructions than before, it could be used like a more complicated way
to compress instructions and it would be possible to limit all SVMODES
handling to after the register allocator in the compiler, similar to
how RVC instructions can be substituted entirely in the assembler,
without the compiler needing to know (though knowing allows selecting
more optimal code at the expense of a more complex compiler).
Creating the internal state which tracks the cascade, no problem: just some latches.
Storing it for a restore: now we have a problem.
In VBLOCK (the original) there is a single 32 bit CSR for context switching. It contains just enough state to be able to go *BACK* to the START of the VBLOCK, re-read the tables, then jump FORWARD to the offset in the VBLOCK where execution left off.
Funnily enough someone described a scenario a few months back where this exact same trick may be deployed. I forget how or why.
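As a toy model of the trick (the header contents and the way "execution" is represented here are entirely invented for illustration): on a trap only the intra-block offset is saved, and on return the header is re-read to rebuild the vector context before jumping forward.

```python
def run_vblock(header, insns, start_offset=0, trap_at=None):
    """Returns ('trap', offset) or ('done', list of (insn, vl) executed)."""
    ctx = dict(header)              # re-reading the header rebuilds context
    executed = []
    for off, insn in enumerate(insns):
        if off < start_offset:      # jump FORWARD past already-executed insns
            continue
        if off == trap_at:
            return ("trap", off)    # only the offset needs saving
        executed.append((insn, ctx["vl"]))
    return ("done", executed)

header = {"vl": 4, "subvl": 1}
status, off = run_vblock(header, ["op0", "op1", "op2"], trap_at=2)
# resume: context is rebuilt from the header, execution continues at 'off'
status2, done = run_vblock(header, ["op0", "op1", "op2"], start_offset=off)
```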
The problem with the cascade is that it becomes necessary to re-read the *entire* group of VBLOCK instructions in order to recreate the cascade.
I am therefore tempted to restrict the cascade to just the first two instructions in the VBLOCK. That results in up to six registers being potentially "marked" as vectorised, with their use also resulting in other operations also being marked.
It's a reasonable compromise, which does not require massive CSR usage.
On the other hand.... I mean, there's only actually 32 registers (31) that can be used, so is it really that bad an overhead? Is it sufficient to just have a single bit mask indicating which registers have been cascade-marked?
If so, that could easily go in the top 32 bits of PCVBLK.
L.
> if the same idea was applied *outside* of a VBLOCK, there would be no such guarantee. hmm, so that would need to be thought through.
I am proposing that the vector/scalar flags would cascade throughout the whole function / multiple functions, which can be handled in the compiler with a rather simple data-flow analysis. You could think of it as tagging the registers with a vector/scalar flag.
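As a toy sketch of that tagging (nothing here is normative: the propagation rule is deliberately simplified to "explicit prefix wins, otherwise the result is vector if any source is vector"):

```python
SCALAR, VECTOR = 0, 1

def propagate(tags, rd, srcs, explicit=None):
    """Tag rd by data-flow: an explicit (prefix-encoded) mode wins,
    otherwise the result mode is the OR of the source-register tags."""
    if explicit is not None:          # e.g. an SVPrefix-encoded instruction
        tags[rd] = explicit
    else:
        tags[rd] = VECTOR if any(tags[r] == VECTOR for r in srcs) else SCALAR
    return tags

tags = {r: SCALAR for r in range(32)}
propagate(tags, rd=1, srcs=[], explicit=VECTOR)   # explicit vector load
propagate(tags, rd=2, srcs=[1, 3])                # vector source -> vector
propagate(tags, rd=4, srcs=[5, 6])                # scalar sources -> scalar
```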
> this is where it gets wasteful. the number of permutations that raise illegal instruction traps is so high that it suggests that the encoding is not a good one. i feel it would be better to use the 2 bits for elwidth.
I think that SUBVL will actually be more useful there than elwidth,
though those both will be very useful. The majority of graphics
shaders use only i32/u32 and f32, whereas most of them use a range of
SUBVL values.
> 128 bits worth of context-saving...
not too bad since, if we design it right, saving/restoring SVMODES can
be skipped for most system calls, it would only need to be
saved/restored for context switches between processes.
>> on return from an exception, SVMODES would be restored (by copying
>> back from the saved csr or by switching which SVMODES csr is used).
>>
>> There would be a separate instruction that clears SVMODES to all zero,
>> to allow calling scalar code quickly.
>
>
> ok so here's where the VBLOCK concept has a clear advantage: that extra instruction is not needed. once the VBLOCK context is exited, the tear-down is automatic.
Actually, I think VBLOCK not being able to work on more than a single basic-block at a time is a disadvantage compared to SVMODES. The SVMODES-clear instruction would mostly only be used when calling code that is not SVMODES-aware.
> * SVMODEs uses CSRs, which is an inherent code-size penalty compared to VBLOCK-SVPrefix.
You're missing that SVMODES is changed by each instruction (basically, SVMODES values follow values in registers), so it would basically only need to be explicitly written on a context-switch.
svp.lw x32(vector), (a0), SUBVL=3 # encoded using 8 in the register
field; sets SVMODES[x8] to SUBVL=3, vector, unpredicated
> On the other hand.... I mean, there's only actually 32 registers (31) that can be used, so is it really that bad an overhead? Is it sufficient to just have a single bit mask indicating which registers have been cascade-marked?
>
> If so, that could easily go in the top 32 bits of PCVBLK.
Nope. damn again. 32 bits is not enough, it would need to be 64. The reason is that 2 bits are needed per register to express the following states:
* this reg has not been encountered yet in any VBLOCK instruction.
* this reg has been encountered and determined to be scalar
* .... vector....
* 4th state, wasted space (oh well).
Or... is it actually safe to say, because of the ORing, that a register need only be determined as vector by a single bit?
i.e all regs are marked as scalar, unless determined otherwise, and is that bitfield sufficient context to continue from?
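As a toy sketch of why a single bit per register might suffice (a simplification, not a definitive answer): the OR-cascade is sticky, everything is scalar unless determined otherwise, so a 32-bit "has been marked vector" mask is the only state needed to continue.

```python
def cascade_step(mask, regs, explicitly_vector=False):
    """If the instruction is vector-marked, or touches any register already
    marked vector, all its registers become (and stay) vector."""
    if explicitly_vector or any((mask >> r) & 1 for r in regs):
        for r in regs:
            mask |= 1 << r
    return mask

mask = 0
mask = cascade_step(mask, [3, 4], explicitly_vector=True)  # x3, x4 -> vector
mask = cascade_step(mask, [4, 5])        # x4 already vector, so x5 follows
mask = cascade_step(mask, [6, 7])        # untouched by the cascade: scalar
```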
L.
I tried applying the new multi-SVP idea to DAXPY and strcpy, two canonical algorithms. Neither really works.
The reason is because:
1. unlike SVP where the prefixing context is applied immediately at the instruction, the two opcodes which primarily define the vector operands do *not* occur in succession (a flawed design prerequisite of this idea).
2. unlike VBLOCK, with the registers being explicitly named (despite the overhead of doing so) there can be some lookahead (static compiler analysis) in a batch of instructions.
Solutions include allowing register number context to be picked up from the 1st instruction, whilst for the 2nd the register numbers are explicitly spelled out, just like in VBLOCK.
This kinda defeats the purpose of the exercise!
Another possible solution is - was - to use some of the spare bits to specify the instruction in the VBLOCK from which the context will get its register numbers.
Whilst on the face of it this seems like a great idea, what it means in practice is that the instruction issue phase actually has to request those instructions (almost certainly requiring a new cache line miss), and decode them, before the first one in the VBLOCK may be executed.
At least with the current VBLOCK system execution may be done (read) in sequential order.
If there are no ideas anyone can think of on how to fix this, I think this VBLOCK-SVPrefix idea goes in the "retire" bin.
Not least because SVPrefix has no way to specify zeroing, inversion or fail-on-first data dependent predication, on which the canonical strcpy example critically relies (ffirst and inv, at least).
Which is a pity as I love the idea of reducing overhead, it just needs too much advance context to work.
Anyone any ideas or solutions?
L.
Rats.
[snip]
> Anyone any ideas or solutions?
Ok great. Can you also clarify or consider how ffirst, zeroing, lack of elwidth etc might be compensated for?
Also, if I understood correctly, with only 4 bits per register, a heck of a lot of SV capabilities is left out.
Those limitations, if correct, need a full evaluation.
L.
grev t0, t1, rs2
or t0, t0, t1
Is that correct? If so, what significant general purpose savings are there which justify the additional opcode?
If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.
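As a quick check, here is a Python transcription of grev/gorc as I understand them from the draft B-extension reference code (worth double-checking against the spec): the two-instruction sequence matches gorc for single-stage (power-of-two) shamt values, but not for multi-stage ones.

```python
M = 0xFFFFFFFF
STAGES = [(1, 0x55555555), (2, 0x33333333), (4, 0x0F0F0F0F),
          (8, 0x00FF00FF), (16, 0x0000FFFF)]

def grev32(x, k):
    """Generalised reverse: swap blocks at each stage selected by k."""
    for sh, m in STAGES:
        if k & sh:
            x = (((x & m) << sh) | ((x >> sh) & m)) & M
    return x

def gorc32(x, k):
    """Generalised OR-combine: OR in the swapped blocks at each stage."""
    for sh, m in STAGES:
        if k & sh:
            x |= (((x & m) << sh) | ((x >> sh) & m)) & M
    return x

# single-stage shamt: "grev; or" produces the same result as gorc...
assert all(gorc32(x, 4) == (grev32(x, 4) | x) for x in (0x1, 0x12345678))
# ...but multi-stage shamts (e.g. k=7, the orc.b case) diverge:
assert gorc32(0x1, 7) != (grev32(0x1, 7) | 0x1)
```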
L.
It took me a while to realise that SVPrefix's 64 bit LD/ST.MULTI capabilities are *damaged* - impeded - by the change of VL to VLPTR.
the reason is that the hardware loops on VL (and SUBVL) have to be re-entrant as far as traps are concerned.
You do not expect the PC to increment over a batch of sequential instructions and it is no different here.
Thus, because sequential program order is preserved, one of the registers that could be overwritten by a LD.MULTI is... the very same register being pointed to *as containing VL*.
oink.
With program order being preserved it is *not* possible to say "oh that version of the register which VLPTR points to must stay the same for the whole loop", it doesn't work that way.
And if LD.MULTI is restoring the entire context of a regfile, it has to restore the reg that VL points to.
urk.
So the solution here is to allow SVPrefix to have its own independent CSR STATE, which contains all the element offsets needed for re-entrancy.
This has a nice side effect of making SVPrefix's instruction format entirely independent, with no side effects or "harm" caused to VBLOCK VL/SUBVL.
However.. what would be nice is if SVPrefix could "pick up" SUBVL/VL from VBLOCK, so for that, rather than have SVP set explicit values for SUBVL to 1,2,3,4 I am advocating they be VBLK.SUBVL,2,3,4
Likewise for P64 (which allows setting of VL, where P48 only allows setting of SUBVL) the default option is to take VBLK.VL, and to allow that to be explicitly ignored if desired.
In this way P64 can be used to explicitly target every single one of the registers x0 - x127 or it can treat them as contiguous vector-like vectors.
The new STATE CSR for SVP is only 32 bit, it is identical, and may fit cleanly into the top 32 bits on RV64.
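As a sketch of how such a STATE CSR might pack (the field names and widths here are invented purely for illustration): the point is that all the loop-progress offsets needed for re-entrancy fit in one 32-bit word.

```python
def pack_state(vl=0, mvl=0, srcoffs=0, destoffs=0, subvl_offs=0):
    """Pack hypothetical loop-progress fields into one 32-bit word
    (7+7+7+7+4 bits = 32)."""
    assert max(vl, mvl, srcoffs, destoffs) < 128 and subvl_offs < 16
    return (vl | (mvl << 7) | (srcoffs << 14)
            | (destoffs << 21) | (subvl_offs << 28))

def unpack_state(s):
    return {"vl": s & 0x7F, "mvl": (s >> 7) & 0x7F,
            "srcoffs": (s >> 14) & 0x7F, "destoffs": (s >> 21) & 0x7F,
            "subvl_offs": (s >> 28) & 0xF}

s = pack_state(vl=8, mvl=8, srcoffs=3, destoffs=3, subvl_offs=1)
```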
Thoughts?
L.