Public review for the Vector standard extensions


Stephano Cetola

Sep 20, 2021, 8:11:15 PM
to isa...@groups.riscv.org, Krste Asanovic, roger....@esperantotech.com
We are delighted to announce the start of the public review period for
the following proposed standard extensions to the RISC-V ISA:
Zve32x - 32-bit elements with integer and fixed-point operations only
Zve32f - 32-bit elements with integer, fixed-point, and
single-precision floating-point operations only
Zve64x - 64-bit elements with integer and fixed-point operations only
Zve64f - 64-bit elements with integer, fixed-point, and
single-precision floating-point operations only
Zve64d - 64-bit elements with integer, fixed-point, single-precision,
and double-precision floating-point operations

The Zve* standard extensions provide more limited functionality
intended for embedded and microcontroller applications.

The review period begins today, September 20, and ends on Thursday,
November 4, 2021 (inclusive).

These extensions are described in the PDF spec available at:
https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf

which was generated from the source available in the following GitHub repo:
https://github.com/riscv/riscv-v-spec/releases/tag/v1.0

To respond to the public review, please either email comments to the
public isa-dev mailing list or add issues and/or pull requests (PRs)
to the Vector GitHub repo: https://github.com/riscv/riscv-v-spec. We
welcome all input and appreciate your time and effort in helping us by
reviewing the specification.

During the public review period, corrections, comments, and
suggestions, will be gathered for review by the Vector Task Group. Any
minor corrections and/or uncontroversial compatible changes will be
incorporated into the specification. Any remaining issues or proposed
changes will be addressed in the public review summary report. If
there are no issues that require incompatible changes to the public
review specification, the Unprivileged ISA Committee will recommend
the updated specifications be approved and ratified by the RISC-V
Technical Steering Committee and the RISC-V Board of Directors.

Thanks to all the contributors for all their hard work.

Kind Regards,
Stephano
--
Stephano Cetola
Director of Technical Programs
RISC-V International

Anthony Coulter

Sep 25, 2021, 4:39:43 PM
to isa...@groups.riscv.org, step...@riscv.org, kr...@sifive.com, roger....@esperantotech.com
=== Substantive questions and suggestions ===

1. Is there a reason that one cannot run vsetvl with rs2 = x0 to change
the vector length without setting vtype? I don't have a specific use
case in mind for this, but I was surprised that this wasn't the default
behavior.


2. Is there a reason that widening left-shifts are not provided, given
that narrowing right-shifts are? It's easy to split a 2*SEW value in
register group v8:v9 into high and low parts in two instructions:

vnsrl.vx v2, v8, x0 # v2 = (v8:v9) & 0xffff
vnsrl.vi v3, v8, 16 # v3 = ((v8:v9) >> 16) & 0xffff

But reassembling v2 + (v3<<16) seems to be substantially harder. The
obvious approach doesn't work because there is no widening left-shift
instruction:

vwsll.vi v8, v3, 16 # vwsll doesn't exist!
vwaddu.wv v8, v8, v2

So, OK, that means I need to widen v3 before shifting it. The trouble
is that I need SEW = 32 to do the left shift, but then I need to drop
back down to SEW = 16 to add v2. So in addition to a widening
instruction I'm now setting vl twice.

vwcvtu.x.x.v v8, v3
vsetvli x0, x0, e32, m2
vsll.vi v8, v8, 16
vsetvli x0, x0, e16, m2
vwaddu.wv v8, v8, v2

Maybe I shouldn't fixate on using vwaddu to add v2; if I widened v2
before shifting it I wouldn't have to drop SEW again. But widening v2
makes things worse because now I have four computational instructions
(the previous code sequence had only three non-setvli instructions)
and it also clobbers another register group!

vwcvtu.x.x.v v8, v3
vwcvtu.x.x.v v4, v2 # Clobbers v4 and v5!
vsetvli x0, x0, e32, m2
vsll.vi v8, v8, 16
vadd.vv v8, v8, v4

Of course, the last option is to recognize this computation as a fused
multiply-add of v3*65536 + v2, but now I'm paying for a multiplication,
not a left-shift:

li t0, 0x10000
vwcvtu.x.x.v v8, v2
vwmaccu.vx v8, t0, v3

Maybe this situation is too artificial, but to a certain extent I see
it as the complement of narrowing right shifts, which can be used to
unpack a 2*SEW-sized bitfield into a bunch of SEW-sized registers for
processing, after which they can be reassembled into another
2*SEW-sized bitfield and pushed to memory.

Is there a reason that widening adds and multiplies are provided, as
are narrowing right-shifts, but no widening left-shifts?
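For what it's worth, the arithmetic identity behind the reassembly can be sketched in a few lines of Python. This models only the lane arithmetic, not any real vector unit; all function names here are illustrative, not from the spec:

```python
SEW = 16
MASK = (1 << SEW) - 1

def vnsrl(wide, shamt):
    # narrowing right shift: shift each 2*SEW element, keep the low SEW bits
    return [(w >> shamt) & MASK for w in wide]

def reassemble(lo, hi):
    # lo + (hi << SEW): the step with no single widening-left-shift instruction
    return [(h << SEW) | l for l, h in zip(lo, hi)]

wide = [0x12345678, 0xdeadbeef]      # stands in for register group v8:v9
lo = vnsrl(wide, 0)                  # vnsrl.vx v2, v8, x0
hi = vnsrl(wide, SEW)                # vnsrl.vi v3, v8, 16
assert reassemble(lo, hi) == wide
# the vwmaccu route computes the same value: hi*65536 + lo
assert [h * (1 << SEW) + l for l, h in zip(lo, hi)] == wide
```

The last assertion is why vwmaccu works as a substitute: shift-and-or and multiply-accumulate by 2^SEW coincide when the low parts don't overlap the high parts.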


3. The spec could call out vmadc/vadc and vmsbc/vsbc as macro-op fusion
candidates (in addition to vmulh/vmul and vrem/vdiv, which are not
suggested as such in the vector spec either).


4. Concerning the interaction between the vector extension and the
hypervisor extension: Are hypervisors expected to emulate vector memory
operations in all MMIO regions? I can understand how vector operations
can accelerate the process of talking to real hardware on a real bus,
but as it stands right now, the RISC-V privileged spec does not define
any PMA that marks a region of memory as being accessible only to
scalar memory operations. Thus if a hypervisor wants to emulate a CLINT
and a guest tries to use a vector-store to blanket the SSIP registers
with ones to send software interrupts to every hart, the hypervisor is
obligated to emulate that vector-store instruction. Maybe a fancy
hypervisor is willing to do all that work, but I can imagine legitimate
use cases where hypervisors don't want to do so for certain regions of
memory. This could be addressed by creating a "scalar-only" or better
yet, an "integer-scalar-only" PMA that allows hypervisors to mark
emulated device I/O regions as being accessible only with the RV32I or
RV64I base instructions. Main memory would not be allowed to use this
PMA. Note that I suggested this possibility in a separate email thread
about the hypervisor spec, but nobody has commented on it.


5. Sections 5.2, 5.3, and 7.8 reserve "encodings" for which
- The instruction uses a register number that's not divisible by EMUL
- The source and destination violate overlap constraints
- The instruction cites a group which would have EMUL > 8.
- The destination is v0 and the instruction is masked, but is not
overwriting the destination with another mask.
- The instruction is a load/store-segment that would utilize vector
registers after v31

How does this reservation work? For most instructions, EMUL depends
on the value of the vtype CSR at runtime. An assembler cannot determine
at compile time whether "vwadd.vv v2, v3, v4" violates any of these
constraints. (It violates none of them if LMUL = 1, and three of them
if LMUL = 8.) But for some instructions it can be determined, e.g.
"vadd.vv v0, v1, v2, v0.t" always violates the fourth rule, and the
load/store-segment violations are easy to check on the spot because
NFIELDS and the vector number are both written in the instruction.

Is it within the scope of this specification to mandate which reserved
encodings should be rejected by assemblers (and disassemblers)? The
specification suggests that the constraint on load/store-segment
instructions will eventually be relaxed by adding extra registers, so
it seems as though assemblers might as well accept instructions
violating this constraint even though it's the easiest violation to
detect!
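A rough model may make the runtime dependence concrete. This is a sketch only (integer LMUL, no masking, no fractional registers; the function name and rule phrasing are mine, and the overlap rule is simplified from section 5.2):

```python
def widening_vv_legal(vd, vs2, vs1, lmul):
    demul = 2 * lmul                        # destination EMUL of a widening op
    if demul > 8:
        return False                        # EMUL > 8 is reserved
    if vd % demul or vs2 % lmul or vs1 % lmul:
        return False                        # register number not a multiple of EMUL
    dest = set(range(vd, vd + demul))
    top = set(range(vd + demul - lmul, vd + demul))
    for vs in (vs2, vs1):
        src = set(range(vs, vs + lmul))
        # a narrower source may overlap the destination only in its
        # highest-numbered part (section 5.2's rule, simplified)
        if (src & dest) and not src <= top:
            return False
    return True

# "vwadd.vv v2, v3, v4": legal at LMUL=1, reserved at LMUL=8
assert widening_vv_legal(2, 3, 4, 1)
assert not widening_vv_legal(2, 3, 4, 8)
```

Since lmul is an input, an assembler would need to know vtype's runtime value to evaluate this, which is exactly the problem described above.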


=== Adding clarifying language ===

6. Section 3.2 should state explicitly that setting mstatus.VS to Off
or Initial does not change any state in the vector unit.


7. The abbreviation "EEW" is used in a note in section 4 but is not
defined until section 5.2.


8. Section 10.2 says "The first vector register group operand can be
either single or double width." This would read a little more clearly
if a parenthetical note said this refers to vs2. (It takes a while to
get used to the idea that the first operand is referred to by "2".)


9. The examples in section 7.8.1 refer to vd, vs3, rs1, and vm instead
of using concrete register numbers. That is, instead of:

vlseg8e8.v vd, (rs1), vm # Load eight vector...
vlseg3e32.v vs3, (rs1), vm # Store packed vector...

The examples should probably be:

vlseg8e8.v v5, (a0), v0.t # Load eight vector...
vlseg3e32.v v17, (t1) # Store packed vector...

Incidentally, I would encourage the use of odd-numbered vector
registers in the load-segment and store-segment examples to emphasize
the fact that these instructions don't operate on register groups in
the default case.


10. The first paragraph of "Section 7: Vector Loads and Stores" says:
"[...] Vector loads and stores are masked and do not raise exceptions
on inactive elements. Masked vector loads do not update inactive
elements in the destination vector register group, unless masked
agnostic is specified (vtype.vma=1). Masked vector stores only update
active memory elements. All vector loads and stores may generate and
accept a non-zero start value."

This is a little confusing because some vector loads and stores are
unmasked, and the paragraph specifically calls out vtype.vma but not
vtype.vta. I would change "Vector loads and stores *are* masked" to
"Vector loads and stores *may be* masked", and I would change the
middle sentences to "Vector loads do not update inactive elements in
the destination vector register group except as permitted by the
masked-agnostic and tail-agnostic settings (vtype.vma and vta).
Vector stores never write inactive elements to memory, regardless of
masked-agnostic and tail-agnostic settings."
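The proposed wording can be illustrated with a toy element-wise model in Python (purely illustrative; "agnostic" is modeled here as writing all-ones, which is only one of the behaviors the spec permits):

```python
def masked_store(mem, addr, vs, mask):
    # inactive elements never reach memory, regardless of vma/vta
    for i, (val, m) in enumerate(zip(vs, mask)):
        if m:
            mem[addr + i] = val
    return mem

def masked_load(vd, mem, addr, mask, vma=0):
    out = list(vd)
    for i, m in enumerate(mask):
        if m:
            out[i] = mem[addr + i]      # active: always loaded
        elif vma:
            out[i] = -1                 # agnostic: all-ones is one legal result
        # vma=0 ("undisturbed"): inactive destination elements unchanged
    return out

mem = {0: 10, 1: 11, 2: 12, 3: 13}
assert masked_store(dict(mem), 0, [7, 8, 9, 6], [1, 0, 0, 1]) == {0: 7, 1: 11, 2: 12, 3: 6}
assert masked_load([0, 0, 0, 0], mem, 0, [1, 0, 1, 0], vma=0) == [10, 0, 12, 0]
```

The asymmetry the rewrite tries to capture is visible here: the load's result depends on vma, while the store never writes inactive elements no matter what.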


11. Section 3.10 defines the layout of vcsr but does not explicitly
reserve the upper XLEN-3 bits or encourage current hardware to set
them to zero. (The fact that the bits are reserved is implied by a
note under "Table 1: New vector CSRs" but even there it isn't
explicitly called out.)
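For reference, the defined fields occupy only the low three bits. A tiny sketch of the layout as I read section 3.10 (assuming, as the suggestion above would make explicit, that the reserved upper bits read as zero):

```python
def pack_vcsr(vxrm, vxsat):
    # vcsr[2:1] = vxrm (rounding mode), vcsr[0] = vxsat (saturation flag);
    # bits XLEN-1:3 assumed reserved and zero
    assert vxrm in range(4) and vxsat in (0, 1)
    return (vxrm << 1) | vxsat

def unpack_vcsr(vcsr):
    return (vcsr >> 1) & 0b11, vcsr & 1   # (vxrm, vxsat)

assert pack_vcsr(0b10, 1) == 0b101
assert unpack_vcsr(0b101) == (0b10, 1)
```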


12. It would be nice if there were a convention for referring to
specific register groups in text (e.g. code comments and emails like
this one). I used the shorthand "v8:v9" in my code examples earlier
in this email to refer to the EMUL=2 vector register group consisting
of registers v8 and v9 but I felt awkward doing it, and in my notes
I shift uncomfortably between that, v8-v9, and v8/v9 (which look like
arithmetic operations). While I recognize that the linguistic
conventions are beyond the scope of a technical specification, people
are going to have to refer to register groups in written text and
anything we can do now to nudge folks into consistency would be
helpful.


=== Questions motivated by curiosity only ===

13. Is there a reason you chose to provide vmulhsu.vx instead of
vmulhus.vx? (I understand why you didn't add both; but did you pick
the one you did by flipping a coin or is there a technical advantage
to it?)

14. What is the meaning of the "M" in the funct3 opcodes OPMVV and
OPMVX? I would have guessed 'mask', but these opcodes don't appear
to be much heavier in mask-related operations than the others.

Anthony

Andrew Waterman

Sep 25, 2021, 8:53:23 PM
to Anthony Coulter, RISC-V ISA Dev, Stephano Cetola, Krste Asanovic, Roger Espasa
On Sat, Sep 25, 2021 at 1:39 PM Anthony Coulter <ri...@anthonycoulter.name> wrote:
=== Substantive questions and suggestions ===

1. Is there a reason that one cannot run vsetvl with rs2 = x0 to change
the vector length without setting vtype? I don't have a specific use
case in mind for this, but I was surprised that this wasn't the default
behavior.

For normal stripmining, "set vl but retain vtype" is not the common case.  For code-size reasons, you only want to have a single static setvl at the head of the loop, and since you generally need to program vtype before entering the stripmine loop, that lone setvl needs to set both vl and vtype.

Normally, when the desired operation is "change vl, but keep vtype unchanged", the old value of vtype is known statically.  So, executing vsetvli with the new vl, and new vtype equal to old vtype, is the standard approach, with no loss of efficiency vs. your alternative.
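As an illustration of the common case described above, the stripmine pattern can be modeled in a few lines of Python. Names are mine, and vl = min(AVL, VLMAX) is the usual implementation choice rather than the only one the spec permits:

```python
def vsetvli_vl(avl, vlmax):
    # common choice: vl = min(AVL, VLMAX)
    # (the spec allows other values when VLMAX < AVL < 2*VLMAX)
    return min(avl, vlmax)

def stripmine(n, vlmax):
    # one "vsetvli" per iteration sets vl (and, in real code, vtype too)
    done = 0
    while done < n:
        vl = vsetvli_vl(n - done, vlmax)
        done += vl                      # process vl elements, bump pointers
    return done

assert stripmine(100, 8) == 100         # 12 full strips of 8, one of 4
```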
There's very limited opcode space, and while widening multiplication is ubiquitous, widening shift left is somewhat less common.  Widening multiplication can be used to emulate widening shift left, as you point out.  In any case, these instructions could be included as part of a future extension if the need becomes better justified.

In practice, most vector machines will have full-throughput pipelined multipliers, so "paying for" a multiplication won't be costly at all.



3. The spec could call out vmadc/vadc and vmsbc/vsbc as macro-op fusion
candidates (in addition to vmulh/vmul and vrem/vdiv, which are not
suggested as such in the vector spec either).

Noted.



4. Concerning the interaction between the vector extension and the
hypervisor extension: Are hypervisors expected to emulate vector memory
operations in all MMIO regions? I can understand how vector operations
can accelerate the process of talking to real hardware on a real bus,
but as it stands right now, the RISC-V privileged spec does not define
any PMA that marks a region of memory as being accessible only to
scalar memory operations. Thus if a hypervisor wants to emulate a CLINT
and a guest tries to use a vector-store to blanket the SSIP registers
with ones to send software interrupts to every hart, the hypervisor is
obligated to emulate that vector-store instruction. Maybe a fancy
hypervisor is willing to do all that work, but I can imagine legitimate
use cases where hypervisors don't want to do so for certain regions of
memory. This could be addressed by creating a "scalar-only" or better
yet, an "integer-scalar-only" PMA that allows hypervisors to mark
emulated device I/O regions as being accessible only with the RV32I or
RV64I base instructions. Main memory would not be allowed to use this
PMA. Note that I suggested this possibility in a separate email thread
about the hypervisor spec, but nobody has commented on it.

This is a platform question, not a V-extension question, but my own expectation is that unsupported vector memory accesses will generally _not_ be emulated.
 


5. Sections 5.2, 5.3, and 7.8 reserve "encodings" for which
- The instruction uses a register number that's not divisible by EMUL
- The source and destination violate overlap constraints
- The instruction cites a group which would have EMUL > 8.
- The destination is v0 and the instruction is masked, but is not
  overwriting the destination with another mask.
- The instruction is a load/store-segment that would utilize vector
  registers after v31

How does this reservation work? For most instructions, EMUL depends
on the value of the vtype CSR at runtime. An assembler cannot determine
at compile time whether "vwadd.vv v2, v3, v4" violates any of these
constraints. (It violates none of them if LMUL = 1, and three of them
if LMUL = 8.) But for some instructions it can be determined, e.g.
"vadd.vv v0, v1, v2, v0.t" always violates the fourth rule, and the
load/store-segment violations are easy to check on the spot because
NFIELDS and the vector number are both written in the instruction.

Is it within the scope of this specification to mandate which reserved
encodings should be rejected by assemblers (and disassemblers)? The
specification suggests that the constraint on load/store-segment
instructions will eventually be relaxed by adding extra registers, so
it seems as though assemblers might as well accept instructions
violating this constraint even though it's the easiest violation to
detect!

It's not really in scope of this specification, but as a practical matter, traditional assemblers lack sufficient dataflow information to perform this analysis correctly, and so mandating it would've been pointless.  Illegal-instruction exceptions will catch misuse at runtime.

I'll let Krste or others respond to the rest.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/7a0128b3ec539042%40raines.redjes.us.

Anthony Coulter

Sep 25, 2021, 11:47:56 PM
to and...@sifive.com, ri...@anthonycoulter.name, isa...@groups.riscv.org, kr...@sifive.com, roger....@esperantotech.com, step...@riscv.org

>> 4. Concerning the interaction between the vector extension and the
>> hypervisor extension: Are hypervisors expected to emulate vector memory
>> operations in all MMIO regions? I can understand how vector operations
>> can accelerate the process of talking to real hardware on a real bus,
>> but as it stands right now, the RISC-V privileged spec does not define
>> any PMA that marks a region of memory as being accessible only to
>> scalar memory operations. Thus if a hypervisor wants to emulate a CLINT
>> and a guest tries to use a vector-store to blanket the SSIP registers
>> with ones to send software interrupts to every hart, the hypervisor is
>> obligated to emulate that vector-store instruction. Maybe a fancy
>> hypervisor is willing to do all that work, but I can imagine legitimate
>> use cases where hypervisors don't want to do so for certain regions of
>> memory. This could be addressed by creating a "scalar-only" or better
>> yet, an "integer-scalar-only" PMA that allows hypervisors to mark
>> emulated device I/O regions as being accessible only with the RV32I or
>> RV64I base instructions. Main memory would not be allowed to use this
>> PMA. Note that I suggested this possibility in a separate email thread
>> about the hypervisor spec, but nobody has commented on it.
>
>
> This is a platform question, not a V-extension question, but my own
> expectation is that unsupported vector memory accesses will generally _not_
> be emulated.

Is it? I think that in its current form, this question still deals with
the vector specification, or maybe even the basic privileged spec. My
question isn't "Under what conditions should the vector memory
instructions be able to access I/O memory?" but rather "Don't we need a
way for hypervisors (and physical memory controllers) to express the
concept that they don't support vector memory operations?"

My interpretation of section 3.6 of the privileged spec is that it
provides an explicit list of valid excuses for a system not to
implement all of the memory-access instructions in all memory regions.
It is legal for an I/O memory region to allow word-sized accesses while
disallowing byte-sized accesses because that's specifically called out
as a possibility in section 3.6.2; meanwhile, it is illegal for main
memory to make that same distinction (which is also stated explicitly
in section 3.6.2). But I would also say that it is illegal for an I/O
memory region to allow zero-extended byte loads (i.e. the LBU
instruction) while disallowing sign-extended byte loads (LB). The
grounds for this illegality wouldn't be "because sign-extension is
handled in the hart, not in the memory controller"; the grounds would
actually be "because sign-extension is not listed as a valid PMA in
section 3.6 of the privileged spec." If section 3.6 doesn't list your
excuse as a PMA, then your excuse is no good, and you need to implement
or emulate whatever memory option you didn't want to have to deal with.

That's why (I assume) section 8 of the vector draft calls for the
creation of a separate PMA for misaligned vector memory accesses and
misaligned scalar accesses---if there were no PMA, then it would be
illegal for a hypervisor to emulate misaligned scalar instructions
but not misaligned vector instructions. But because there is a PMA for
it, this is now a valid excuse for the hypervisor to do one but not
the other, as long as it labels the memory region appropriately using
whatever mechanism we define for such things.

Applying that to this problem of hypervisors not emulating vector
instructions: even if we create a PMA treating misaligned vector
accesses as separate from misaligned scalar accesses, there is still
no PMA that makes *aligned* vector accesses different from aligned
scalar accesses. So from a legalistic point of view, the hypervisor
is required to emulate both, under the same principle that requires
hypervisors to emulate both LB and LBU. That's an unfair comparison
because emulating vector instructions is a lot more complicated than
supporting two types of width extension for a scalar, but if we think
vector instructions deserve to be treated differently, all we have to
do is define a PMA for them. And since we're allowed to admit certain
PMAs for I/O regions only, we would do that here: aligned vector
accesses should always be valid in main memory, but I/O memory regions
should be allowed to carry the PMA that disallows vector (and, in my
opinion, floating-point) loads and stores.

Otherwise, what's the point of section 3.6 of the privileged spec, and
why does section 8 of the vector spec take the time to call for the
creation of a new PMA? The only thing these sections do is discourage
fragmentation by listing allowable subsets of memory behaviors. But for
such a list of allowable subsets to have any value, it needs to be
comprehensive. I imagine that at some point in the future a subgroup
of the people on tech-privileged will sit down and formalize the data
that's supposed to go in that "device-tree" data structure that'll be
used for hardware discovery, and memory blocks listed in the
device-tree will be annotated with PMAs encoded either as ASCII strings
or numeric constants. But the list of ASCII strings or numeric
constants that can be used to define memory attributes will have to
come from somewhere, and that "somewhere" is section 3.6 ("Physical
Memory Attributes") of the privileged spec.

It's not quite a platform spec issue. A platform spec might say "Main
memory regions must support AMOArithmetic and I/O regions must support
either AMONone or AMOLogical" (per the privileged spec's suggestion to
skip over AMOSwap whenever possible), but the fact that "AMOSwap" is a
valid subset of AMO operations for a memory region to support is still
the business of the privileged spec, which defines the vocabulary that
platform specs will use when discussing their PMA requirements.

So while I'm not sure which proposal to attach my request to (the
vector spec, the hypervisor spec, or the privileged spec itself), I do
think that the end result that would make the most sense is to create
a new section 3.6.7 in the privileged spec called "Integer scalar
accesses" that says "To simplify hypervisor emulation, I/O regions can
specify that only memory accesses made with scalar integer instructions
are valid, and that attempts to access the memory region with floating-
point or vector memory instructions will fail with an access fault."
With this sentence, it now becomes legal for a hypervisor to decline to
emulate vector instructions, and people who write device drivers will
not be surprised when they find that their vector-optimized drivers
don't work on certain platforms. Furthermore, the people who write
platform specs will read this section and realize that if they think
that certain required devices (like the AIA interrupt controller) ought
to be implemented in such a way that vector accesses are still OK, then
they will need to explicitly call for that in their platform spec.

It's fair to say that section 3.6 of the privileged spec doesn't have a
whole lot of formal normative content right now, and personally I would
like to see more clearly-named classes like AMONone, AMOSwap, etc. in
section 3.6.3.1. But until we know what classes are actually useful,
the best we can do is keep a relatively informal list of situations in
which some memory regions behave differently from others. And since you
and I both agree that some I/O memory regions are going to be emulated
by hypervisors that aren't set up to emulate vector operations, I think
we should explicitly admit that possibility in the list of PMAs in
section 3.6 of the privileged spec.

That said, I can't think of any other arguments to make in support of
this point. So if you still disagree I'll drop the issue.

Anthony

Anthony Coulter

Sep 26, 2021, 11:01:32 AM
to isa...@groups.riscv.org, ri...@anthonycoulter.name, step...@riscv.org, kr...@sifive.com, roger....@esperantotech.com
I had two more ideas regarding the vector proposal:

15. Define pseudoinstructions vsetvtypei and vsetvtype to clean up
the common case where the vector length isn't changing.

vsetvli x0, x0, e16, m4, ta, ma # Old
vsetvtypei e16, m4, ta, ma # New

vsetvl x0, x0, a4
vsetvtype a4

16. In section 17.1, consider renaming "precise" traps to "restartable"
traps. This terminology corresponds more closely to other documentation
I've seen for other architectures. A "precise" trap should still be one
where the instruction pointed to by the sepc CSR has so far had no
architecturally-visible effects. This change shouldn't require any
changes to other documentation. Section 1.6 of the base spec calls for
"The execution environment determines for each trap whether it is
handled precisely, though the recommendation is to maintain preciseness
where possible"---this language is still good even if we call vector
traps "restartable." Impressively, the privileged spec's only
references to precise traps are for specific cases involving PMP and
PMA violations.

Anthony

Andrew Waterman

Sep 26, 2021, 7:32:01 PM
to Anthony Coulter, RISC-V ISA Dev, Stephano Cetola, Krste Asanovic, Roger Espasa
On Sun, Sep 26, 2021 at 8:01 AM Anthony Coulter <ri...@anthonycoulter.name> wrote:
I had two more ideas regarding the vector proposal:

15. Define pseudoinstructions vsetvtypei and vsetvtype to clean up
the common case where the vector length isn't changing.

        vsetvli  x0, x0, e16, m4, ta, ma        # Old
        vsetvtypei  e16, m4, ta, ma             # New

        vsetvl  x0, x0, a4
        vsetvtype a4

This was discussed in the task group.  IIRC, the concern was that these instructions only have the behavior of "set vtype only" if the old vl is valid under the new vtype, so the mnemonic was thought to be a little misleading.  Reasonable minds might disagree on this point.

In general, the R-type vsetvl isn't used in most code (usually just context switching) and so pseudoinstructions for the R-type variant are harder to justify.


16. In section 17.1, consider renaming "precise" traps to "restartable"
traps. This terminology corresponds more closely to other documentation
I've seen for other architectures. A "precise" trap should still be one
where the instruction pointed to by the sepc CSR has so far had no
architecturally-visible effects. This change shouldn't require any
changes to other documentation. Section 1.6 of the base spec calls for
"The execution environment determines for each trap whether it is
handled precisely, though the recommendation is to maintain preciseness
where possible"---this language is still good even if we call vector
traps "restartable." Impressively, the privileged spec's only
references to precise traps are for specific cases involving PMP and
PMA violations.

Conventionally, "restartable" has a significantly weaker meaning than this--it admits the possibility that younger instructions might've executed.  This is relevant in that it affects which actions the OS must take following an exception.

While perhaps a bit fuzzy, "precise" more accurately captures the software usage model: that the OS just doesn't need to consider the possibility of exposed microarchitectural state.


Anthony


David Horner

Oct 18, 2021, 8:03:22 AM
to Stephano Cetola, isa...@groups.riscv.org, Krste Asanovic, roger....@esperantotech.com
I add a preface here which is generally applicable, not just to RVV
Public Review.


The gifting of the RISCV design to the Foundation in 2015 conveyed a
profound responsibility.

At its core, the Foundation became a protectorate on behalf of humanity.

The mandate applies not only to the then current specification, but also
to protecting and developing its potential.

At the time much was said of protecting and guarding the architecture
for everyone and for decades to come.

An architecture for all realms, technical and geographic.

An architecture free and available to all.

Steps were taken to preserve it with IP transfers and registrations;
copyright and trademark...

The Foundation-to-Institution transition is a part of that protection.

However, I perceive a shift from preservation of RISCV potential to
exploitation.

Decisions, especially in opcode disbursement, demonstrate no
consideration for retention of the limited resource and provide no
justification of the trade-offs against preserving RISCV's potential.

Each decision limits RISCV's potential. There is little evidence that
long-view considerations drive the decisions; rather, short-term benefits
are proffered and debated.

RISCV.org must take the long view if it is to fulfill its mandate of
preserving this architecture for all of humanity and the future.

Bruce Hoult

Oct 18, 2021, 5:02:04 PM
to David Horner, Stephano Cetola, RISC-V ISA Dev, Krste Asanovic, roger....@esperantotech.com
I couldn't find a substantive point here.

I also have some concerns about the amount of opcode space used relative to utility by some extensions, but NOT for RVV. Last time I checked, RVV is essentially 1.5 major opcodes (out of 32 possible). This is very economical for the functionality, especially compared to the 7 major opcodes used for scalar floating point.



ds2horner

Oct 18, 2021, 8:00:13 PM
to Stephano Cetola, isa...@groups.riscv.org, Krste Asanovic, roger....@esperantotech.com
Disclosure: I have been participating on RVV since [before] the TG was
established.

All that I share below are based on publicly disclosed information, and
none from within TG meetings.


Originally the vision was for a polymorphic implementation for which the
data type [including element size] was associated with the vector register.

This would have allowed extremely efficient operation encoding as:

each operation would be shared with each data type when appropriate
[with implicit conversion]

  e.g. add for signed:overflowing:integer:word,
unsigned:saturated:fixedpoint:half, real:float:single,
complex:float:double, etc.

or conversely

the operator would be overloaded for incompatible types

  e.g. add would be logical-add, i.e. AND for binary:bitfields

This concept was dropped to reduce complexity in hardware and software
implementation, conceptual definition, and context transitions.


That substantially raised the number of bits needed to encode these
typed operations and type-specific operations.

It was proposed that 64bit opcodes would be best to fully define the
extensive application space.

However, pioneering the 64bit opcode space, and requiring that for
implementers, was seen as overly challenging and impractical.

The idea of a compressed form existing in 32bit opcodes was conceived
and pursued.

The 64bit opcodes would exist for those who wished to implement them.

The 32bit encoding would allow immediate use in an environment
expected to be unreceptive to 64bit opcodes.

CSRs were introduced to make some operation constraints dynamic, thus
removing that encoding from the instruction opcode.

Originally that was vcfg, later replaced by vtype.

Similarly, predicate registers were reduced to 1 [although 8 would exist
in the hypothetical 64-bit opcode assignment].

Four-register multiply-add variants were dropped in favor of compressed,
destructive forms to reduce the 32-bit opcode impact.


RVV has a stated objective to expand to handle many more data types and
data structures.

This will undoubtedly increase the pressure to use more of the 32-bit
encoding.

Now, during public review, is the opportunity to ask why we do not use
the 64-bit encoding [or at least provide it as the primary reference],
and how this use of a substantial part of the 32-bit opcode space for a
"compressed" vector version is justified.

I have an understanding, being on the TG, but the Public Review provides
the necessary opportunity to raise the question, not only as it benefits
the Vector Extension, but as it relates to RISC-V in general.

ds2horner

unread,
Oct 19, 2021, 1:40:37 AM10/19/21
to Bruce Hoult, Stephano Cetola, RISC-V ISA Dev, Krste Asanovic, roger....@esperantotech.com

Thank you Bruce for posting this on the list.

On 2021-10-18 5:01 p.m., Bruce Hoult wrote:
I couldn't find a substantive point here.
Installment two raises a substantive question that I ask to be answered in the context of the mission of RISC-V.


I also have some concerns about the amount of opcode space used relative to utility by some extensions, but NOT for RVV.

Last time I checked RVV is essentially 1.5 major opcodes (out of 32 possible).
Of these 32, all standard greenfield major opcodes are now spoken for [if I've tracked and projected correctly].
This is very economical for the functionality,
How do we quantify that, and especially how do we compare this proposed use to potential future use?

especially compared to the 7 major opcodes used for scalar floating point.

Here we agree completely.

Had the XFinX extension been proposed and implemented first, it would not have consumed 7 major opcodes.


The 4 multiply-add variants, each consuming a major opcode, would not exist.

Rather, it would have been obvious that a destructive form coupled with C.mv would create a superior solution.

I am not suggesting I saw this at the time. I did not, but I wish I had.

FinX and DinX would then be contained in a single major opcode.


When scalar float implemented in 32 hardware float registers was introduced, it could likewise have used a strategy like RVV's that reduces the load/store offsets. It would not have consumed 7 major opcodes; rather, I expect it would have been incorporated into OP-FP [which probably would have been called just FP].

This is a prime example of unnecessary opcode space over-allocation because we sought to provide standard industry features in the standard industry way.

We should learn from our mistakes, rather than use them as justification to continue to be wasteful, short-sighted, and entrenched in old ideas and ways.

We have the potential to be so much more.

L Peter Deutsch

unread,
Oct 19, 2021, 3:07:23 PM10/19/21
to ds2horner, br...@hoult.org, step...@riscv.org, isa...@groups.riscv.org, kr...@sifive.com, roger....@esperantotech.com
Regarding the Vector extensions, I'm hoping that ds2horner's technical
points were addressed at length in the deliberations of the TG that produced
the vector extension proposal, and that there is a record of those
deliberations. Could someone tell me where to find that record?

I'm a relative newcomer to RISC-V, and one of the reasons I wanted to
participate was that I saw a design that was doing so many things right.
However, I have to say that I too feel uneasy at the amount of opcode space
taken up by the fused multiply-add instructions. I see in the spec that
giving up static rounding modes was considered and rejected, as was forcing
rd == rs3, but I didn't see (for example) consideration of specifying rs3 by
combining the 2 high-order bits of rd with 3 explicit low-order bits, which
would have reduced the 4 major opcodes to 1, as well as possibly playing
nicely with the 3-bit register space for the compressed instructions, while
hopefully reducing the number of explicit moves significantly. (Of course,
without real investigation, I don't know whether this would actually pay
off.) I recognize that this is off the Vector topic, and too late to
consider in any case, but I would appreciate being directed to the record of
the discussions that led to the present specification of these instructions.
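For concreteness, the packing being suggested (my reading of it — the helper below is purely hypothetical) would confine rd and rs3 to the same bank of 8 registers by reusing rd's two high-order bits:

```python
def rs3_from_rd(rd: int, rs3_low3: int) -> int:
    """Hypothetical decode: rs3 reuses the 2 high-order bits of rd,
    with only 3 low-order bits explicitly encoded.  Saving 2 bits of
    encoding per instruction is what shrinks 4 major opcodes to 1."""
    assert 0 <= rd < 32 and 0 <= rs3_low3 < 8
    return (rd & 0b11000) | rs3_low3
```

The cost, of course, is that rs3 can only name a register in the same group of 8 as rd, which is why the proposal anticipates extra moves.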

I would participate happily in a thoughtful discussion about the mission of
the RISC-V enterprise, but isa-dev doesn't feel like the right place for it.
Is there an existing group where that discussion would be at home?

--

L Peter Deutsch <gh...@major2nd.com> :: Aladdin Enterprises :: Healdsburg, CA

Was your vote really counted? http://www.verifiedvoting.org

Bruce Hoult

unread,
Oct 20, 2021, 4:40:32 AM10/20/21
to L Peter Deutsch, ds2horner, step...@riscv.org, RISC-V ISA Dev, Krste Asanovic, roger....@esperantotech.com
A lot of things *could* have been done to reduce the space taken by the scalar FMA instructions, had they been thought of at the time.

Arguably, 1/8th of the opcode space isn't totally out of line with the overriding importance of FMA in floating point code, if FP is important. FMA is by far the most common ALU instruction in FP code.

For example, we could have done the same thing as is done in the Vector ISA, constraining the destination to be the same as either the addend or one of the multiplicands. That would save 5 bits of register field in exchange for 1 more bit of opcode, for a net saving of 4 bits; i.e. the FMA instructions would then need a quarter of a major opcode instead of four major opcodes -- with negligible effect on the number of instructions executed by most programs.
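The bit accounting behind that claim can be checked mechanically (assuming the standard 32-bit format: 7-bit major opcode, 5-bit register specifiers, 3-bit rounding mode, 2-bit fmt field):

```python
# Sanity-check of the opcode-space arithmetic for the destructive-FMA
# alternative sketched above.
payload = 32 - 7                 # bits available after the major opcode
full = 4 * 5 + 3 + 2             # rd, rs1, rs2, rs3 + rm + fmt
assert full == payload           # each FMA op fills an entire major opcode

destructive = 3 * 5 + 1 + 3 + 2  # drop rs3; 1 bit picks the overwritten source
saved = full - destructive       # net saving of 4 bits per instruction
assert saved == 4

fma_ops = 4                      # fmadd, fmsub, fnmsub, fnmadd
assert fma_ops / 2 ** saved == 0.25   # a quarter of one major opcode
```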

But the time to do that was before the base ISA was ratified in July 2019.

That would have obsoleted about 500 HiFive Unleashed boards. And boards using the Kendryte K210, which came out in September 2018 and uses a draft privileged ISA spec (1.9.1), so is somewhat orphaned anyway. I can't immediately think of anything else in the field that would have been a casualty. In-progress SoCs would have to be modified.

But it's far far too late to change now.



ds2horner

unread,
Oct 20, 2021, 5:50:04 AM10/20/21
to L Peter Deutsch, br...@hoult.org, step...@riscv.org, isa...@groups.riscv.org, kr...@sifive.com, roger....@esperantotech.com


On 2021-10-19 3:07 p.m., L Peter Deutsch wrote:
Regarding the Vector extensions, I'm hoping that ds2horner's technical
points were addressed at length in the deliberations of the TG that produced
the vector extension proposal, and that there is a record of those
deliberations.  Could someone tell me where to find that record?


https://github.com/riscv/riscv-v-spec/issues is the current location for discussion and resolution of Vector issues.

It is a very good starting place.

Notes on all the Task Group deliberations are publicly available.

Typically they only summarize the decisions recorded in GitHub and are not explanatory.

The location of the TG notes has changed over the years.

I would like to see the links for each of the TGs, past and present, reflected in the members' wiki and/or in the directory documents housed on Google Drive:

subcommittees, task groups, and SIGs - Google Drive


I'm a relative newcomer to RISC-V, and one of the reasons I wanted to
participate was that I saw a design that was doing so many things right.
However, I have to say that I too feel uneasy at the amount of opcode space
taken up by the fused multiply-add instructions.  I see in the spec that
giving up static rounding modes was considered and rejected, as was forcing
rd == rs3, but I didn't see 

Early discussions of design decisions prior to gifting the design to the foundation were not public.

To some extent that continues today.

The list of what was excluded is huge, and the general justification for exclusion is often:

1) no one thought to include the concept/construct. There are new ideas every day.

2) the original design goals were oriented to simplicity and thus efficiency in decode/execution.

3) another design goal was architecture acceptance.

        Again, a simple design that was useful/acceptable to academia, RISC purists, and early adopters [cf. FPGA].

        Much was made of the spartan nature of the design, the base having only 40 instructions.

              A large proliferation of instructions, even if highly encoded and effective, is readily dismissed with that mindset.

4) if the perceived cost to pursue an idea was high, it would be dismissed early, as only limited time is ever available to a small team.

(for example) consideration of specifying rs3 by
combining the 2 high-order bits of rd with 3 explicit low-order bits, which
would have reduced the 4 major opcodes to 1,

Early on, opcode pressure was not a concern; it was secondary to the concerns above.

Apparently it is still not a major concern to many.


RISC-V has a built-in opcode-expansion design, described in detail in Chapter 21 of the Volume 1 User Manual.

Bits 0 through 6 of the instruction define instruction lengths of 16, 32, 48, and 64 bits.

Three further bits in the first 16-bit instruction parcel allow additional lengths in 16-bit increments, up to 192 bits.
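That length scheme — as laid out in the variable-length encoding of the unprivileged spec — can be sketched as a decoder over the first 16-bit parcel (note the spec's defined steps top out at 176 bits, with the all-ones pattern reserved for 192 bits and longer):

```python
def insn_length(parcel: int) -> int:
    """Decode a RISC-V instruction's length in bits from its first
    16-bit parcel, per the standard variable-length encoding."""
    if parcel & 0b11 != 0b11:
        return 16                      # compressed (RVC)
    if (parcel >> 2) & 0b111 != 0b111:
        return 32                      # standard 32-bit encoding
    if (parcel >> 5) & 0b1 == 0:
        return 48
    if (parcel >> 6) & 0b1 == 0:
        return 64
    nnn = (parcel >> 12) & 0b111       # the "three further bits"
    if nnn != 0b111:
        return 80 + 16 * nnn           # 80..176 bits in 16-bit steps
    raise ValueError("reserved encoding (>= 192 bits)")
```

For example, `addi`'s parcel 0x0013 decodes as 32-bit, while the compressed `c.li` parcel 0x4501 decodes as 16-bit.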

This is an excellent design inclusion. However, it may have lulled some into thinking that it resolves opcode pressure concerns.

It does not. 32-bit instruction encodings are prime real estate that should be reserved for the highest-value instructions [considering both static and dynamic use].



 as well as possibly playing
nicely with the 3-bit register space for the compressed instructions,

RVC, the 16-bit compressed encoding, provides a good example of an attempt to rigorously determine the best candidates.

This is precisely because such an encoding consumes so much of the opcode space.

The justification of what to include and exclude is well documented.

 while
hopefully reducing the number of explicit moves significantly.  (Of course,
without real investigation, I don't know whether this would actually pay
off.) 

This is perpetually the problem: what application space will benefit, and what future advances will nullify our current investigations?

I believe we have to anticipate where the toolchain and applications will be in 10 years and base decisions on such projections.

If we design to past entrenched limitations and current products, we will fail to effectively allocate our limited opcode-space resource.

 I recognize that this is off the Vector topic, and too late to
consider in any case, but I would appreciate being directed to the record of
the discussions that led to the present specification of these instructions.

I would participate happily in a thoughtful discussion about the mission of
the RISC-V enterprise,
Excellent!

 but isa-dev doesn't feel like the right place for it.

 Is there an existing group where that discussion would be at home?

I don't believe there is.

There is a "todo" Jira ticket that is accessible only to RISC-V members:

https://jira.riscv.org/browse/RVTS-671  Fully Document Arch Review Process

However, that ticket is the means by which forums for these discussions can be established.

I expect that relevant aspects will be generally available to the public without a RISC-V membership, and that public lists such as this one will provide an opportunity for input from the general public.


Thank you so much for your inputs and participation.

Welcome!


ds2horner

unread,
Oct 20, 2021, 7:24:51 AM10/20/21
to Bruce Hoult, L Peter Deutsch, step...@riscv.org, RISC-V ISA Dev, Krste Asanovic, roger....@esperantotech.com

The FMA changes could be incorporated into XFinX.

It is not too late for that. Just as the float load/store space is freed up, those 4 major opcodes can be mostly freed.

Once it is in XFinX, a migration to revised F, D, and Q with the efficient FMA encoding is a much simpler transition.

The necessity is to identify such opportunities now and to re-evaluate the worth of the 32-bit opcode space [among other assumptions].

How much is correcting past errors worth for RISC-V's future?

We can plan and put in place the mechanisms to effect those transitions/corrections when needed.

The better we plan for them now, the more readily we can converge on the optimal ISA [and variants in their respective domains].

We can do better in the "lessons learned" category. That will require recognizing and acknowledging what we did suboptimally.
