Proposal for ISA extension “P” for DSP and Packed SIMD instructions

chuanhua.chang

Nov 23, 2017, 8:06:50 AM

to RISC-V ISA Dev

There were some discussions on the ISA-Dev mailing list about DSP instructions recently. Drawing on our past experience in this area, Andes would like to contribute our >150 DSP and Packed SIMD instructions (or simply DSP ISA) developed in our AndeStar V3 architecture as a starting basis for the RISC-V “P” extension. Our DSP ISA uses only GPRs and targets high efficiency. We have had considerable success with the DSP ISA, which made the AndesCore D1088 our most popular core after its debut 3 years ago.

The proposed ISA covers not only SIMD computations but also many of the modes needed in traditional DSP computations, such as fixed-point arithmetic, saturation with an overflow flag, shift-and-round, accumulation to 64 bits (to guarantee precision), bit reversal, etc. Accumulate-to-64b instructions consume 4 reads/2 writes of GPRs. Implementations can split them into 2 cycles to reduce the number of ports if desired, which is still much better than the 12 cycles needed with the existing instructions, or the 5 cycles needed with the best instructions one can design with only 2R1W. The attached file shows the semantics of the 32-bit DSP ISA. We’re extending the ISA for RV64.
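
To make "saturation with an overflow flag" concrete, here is a minimal Python sketch of a signed 32-bit saturating add. It is an illustration only, not the Andes definition: the names `DspState` and `sat_add32` are hypothetical, and the flag is assumed to be sticky (set when a result is clamped and cleared only by an explicit clear).

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


class DspState:
    """Hypothetical architectural state: only the overflow (OV) flag."""

    def __init__(self):
        self.ov = 0

    def clear_ov(self):
        # Assumed explicit "clear overflow flag" operation.
        self.ov = 0


def sat_add32(state, a, b):
    """Add two signed 32-bit values, clamping to the 32-bit range.

    The OV flag is set whenever clamping happens and is never cleared here
    (sticky behavior is an assumption of this sketch).
    """
    total = a + b
    if total > INT32_MAX:
        state.ov = 1
        return INT32_MAX
    if total < INT32_MIN:
        state.ov = 1
        return INT32_MIN
    return total


state = DspState()
clamped = sat_add32(state, 0x7FFFFFF0, 0x20)  # exceeds INT32_MAX, so it clamps
print(clamped, state.ov)
```

A caller would typically run a whole filter kernel and read the flag once at the end, which is why a sticky flag is convenient in DSP code.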

DSP and SIMD computations cover a wide range of applications, from audio/voice, to image/video, deep learning, and more. One size doesn’t fit all well. We’re aware that there are discussions about extensions for high-performance DSP computations. Our DSP ISA covers the high-efficiency end of the spectrum, demanded by many high-volume applications. In addition to the DSP ISA spec, Andes can contribute the corresponding compiler support as well.

Please feel free to drop us a line here, or come by for discussion in the coming workshop, where we’ll present the performance gains brought by the DSP ISA as part of our 12-minute presentation.

Chuanhua Chang

DSP_ISA_Summary.pdf

Bruce Hoult

Nov 23, 2017, 8:19:09 AM

to chuanhua.chang, RISC-V ISA Dev

Cool!

Do you have RISC-V instruction encodings worked out for those? (they aren't in this document)

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/07af1851-92fc-4a46-ad1e-9c51442c11df%40groups.riscv.org.

Allen J. Baum

Nov 23, 2017, 1:40:08 PM

to Bruce Hoult, chuanhua.chang, RISC-V ISA Dev

Note that bit reversal (at least) is part of the bit manipulation extension.

--

**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Jacob Bachmeyer

Nov 23, 2017, 10:45:26 PM

to chuanhua.chang, RISC-V ISA Dev

chuanhua.chang wrote:
>
> There were some discussions in the ISA-Dev mailing list about DSP
> instructions recently. With the past experience in this area, Andes
> would like to contribute our >150 DSP and Packed SIMD instructions (or
> simply DSP ISA) developed in our AndeStar V3 architecture as a
> starting basis for the RISC-V “P” extension. Our DSP ISA uses only
> GPRs and targets for high efficiency. We have quite successful
> experience in the DSP ISA, which made the AndesCore D1088 our most
> popular core after its debut 3 years ago.
>

On this Thanksgiving holiday here in the US, I think I can speak for all
of us in saying "thank you" for this.

> The proposed ISA covers not only SIMD computations, but also a lot of
> modes needed in traditional DSP computations such as fixed-points,
> saturation with an overflow flag, shifting-and-round, accumulation to
> 64 bits (to guarantee the precision) , bit-reversal, etc.
> Accumulate-to-64b instructions consume 4 reads/2 writes of GPRs.
> Implementations can split them to 2 cycles to reduce the ports if
> desired, but still much better than 12 cycles using the existing
> instructions, or 5 cycles using the best instructions one can design
> with only 2R1W. The attached file shows the semantics of the 32-bit
> DSP ISA. We’re extending the ISA for RV64.
>

I think that we will probably need to break this up into several smaller
extensions, with some of them under a "DSP" umbrella and others in other
categories. Unfortunately, RVP is already specified in the current
draft ISA spec as using the FP register file, but this could be a good
starting point for a new fixed-point extension (possibly our first to
use 48-bit opcodes) using the integer register file.

The major concern I have with this as written is introducing an overflow
flag, since RISC-V explicitly eschews condition codes. I presume that
AndeStar V3 uses a condition code (and can probably branch on OV, much
like x86).

A wide-accumulate instruction will probably be appropriate, but could
also easily use "magic" CSRs: define user CSRs acc0v0, acc0v1, acc0v2,
acc0i0, acc0i1 with special write behavior: values written to acc0i0
are added to acc0v0 with carry into acc0v1 with second carry into
acc0v2, while values written to acc0i1 are added to acc0v1 with carry
into acc0v2. Additional accumulators can be acc1*, acc2*, etc.
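
The write behavior of these "magic" CSRs can be sketched in Python as follows. This is a rough model under stated assumptions: XLEN is taken to be 32, written values are treated as unsigned, a carry out of acc0v2 is silently dropped, and the class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class Accumulator:
    """Hypothetical 96-bit accumulator exposed as three 32-bit value CSRs."""

    def __init__(self):
        # v[0] models acc0v0 (low word), v[1] acc0v1, v[2] acc0v2 (high word).
        self.v = [0, 0, 0]

    def write_i0(self, value):
        """Model a write to acc0i0: add into acc0v0, carries ripple upward."""
        self._add_at(0, value)

    def write_i1(self, value):
        """Model a write to acc0i1: add into acc0v1, carry goes into acc0v2."""
        self._add_at(1, value)

    def _add_at(self, index, value):
        carry = value & MASK32
        for i in range(index, 3):
            total = self.v[i] + carry
            self.v[i] = total & MASK32
            carry = total >> 32
        # A carry out of the top word is dropped in this sketch.


acc = Accumulator()
acc.write_i0(0xFFFFFFFF)
acc.write_i0(1)           # low word wraps, carry lands in acc0v1
print([hex(word) for word in acc.v])
```

Writing the low and high halves of a 64-bit product to acc0i0 and acc0i1 respectively would accumulate the full product with headroom in acc0v2.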

The zero-overhead loop mechanism is interesting and simple enough to
adapt readily, even though its current MTLBI/MTLEI instructions will not
fit in RISC-V. I suggest simplifying it to a counted implicit branch:
When loopcount is greater than 1 and the program counter is equal to
loopend, the instruction at loopend is executed, but the program counter
is loaded from loopstart instead of advancing normally and loopcount is
decremented by 1. No constraints are placed on the values in loopstart
and loopend. An ordinary instruction fetch fault will be raised if
execution is transferred to loopstart and that address is found to be
invalid. The loopstart and loopend CSRs have special behavior when
written: a sign-extended 12-bit value is multiplied by 4 and added to
the current program counter to produce the value actually stored in the
CSR. Other values are stored as given. This precludes loading
addresses directly that are very close to NULL or top-of-memory, but
neither of those pages should be holding zero-overhead loops and this
allows the loop control to be provided using ADDI x0/CSRW pairs instead
of needing new encoding. Assuming the namespace change goes through,
this could be "Zizoloop" or "Izoloop" easily.
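
The counted implicit branch described above can be sketched as a small PC-level simulation in Python. This is an illustration only: the function name `run_zol`, the fixed 4-byte instruction step, the `max_steps` guard, and the absence of ordinary branches inside the loop are all assumptions of the sketch.

```python
def run_zol(start_pc, stop_pc, loopstart, loopend, loopcount,
            step=4, max_steps=1000):
    """Return the list of PCs executed from start_pc until stop_pc is reached.

    At each step the instruction at pc executes. If loopcount > 1 and
    pc == loopend, the next pc is loopstart and loopcount is decremented;
    otherwise the pc advances normally.
    """
    trace = []
    pc = start_pc
    while pc != stop_pc and len(trace) < max_steps:
        trace.append(pc)              # the instruction at pc executes
        if loopcount > 1 and pc == loopend:
            pc = loopstart            # implicit branch back to the loop start
            loopcount -= 1
        else:
            pc += step                # normal sequential advance
    return trace


# Loop body occupies 0x104..0x108 and should run three times.
trace = run_zol(0x100, 0x10C, loopstart=0x104, loopend=0x108, loopcount=3)
print([hex(pc) for pc in trace])
```

With this rule a loopcount of N executes the body N times, and a loopcount of 0 or 1 falls straight through after a single pass.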

-- Jacob

chuanhua.chang

Nov 30, 2017, 12:49:44 AM

to RISC-V ISA Dev, chuanhu...@gmail.com, br...@hoult.org

On Thursday, November 23, 2017 at 9:19:09 PM UTC+8, Bruce Hoult wrote:

> Cool!
>
> Do you have RISC-V instruction encodings worked out for those? (they aren't in this document)

For now, those instructions will temporarily use the custom or reserved opcode space; the actual encodings will come later. The important thing right now is to get people's feedback on the operations of those DSP instructions.

Chuanhua

chuanhua.chang

Nov 30, 2017, 12:53:54 AM

to RISC-V ISA Dev, br...@hoult.org, chuanhu...@gmail.com

On Friday, November 24, 2017 at 2:40:08 AM UTC+8, Allen Baum wrote:

> Note that bit reversal (at least) is part of the bit manipulation extension.

Our plan for instructions that appear in more than one extension is as follows: if the bit manipulation extension is present, the bit reverse instruction will be removed from the DSP extension; if the bit manipulation extension is not present, the bit reverse instruction will be included in the DSP extension.

Chuanhua Chang

chuanhua.chang

Nov 30, 2017, 1:25:26 AM

to RISC-V ISA Dev, chuanhu...@gmail.com, jcb6...@gmail.com

On Friday, November 24, 2017 at 11:45:26 AM UTC+8, Jacob Bachmeyer wrote:

chuanhua.chang wrote:
>
> There were some discussions in the ISA-Dev mailing list about DSP
> instructions recently. With the past experience in this area, Andes
> would like to contribute our >150 DSP and Packed SIMD instructions (or
> simply DSP ISA) developed in our AndeStar V3 architecture as a
> starting basis for the RISC-V “P” extension. Our DSP ISA uses only
> GPRs and targets for high efficiency. We have quite successful
> experience in the DSP ISA, which made the AndesCore D1088 our most
> popular core after its debut 3 years ago.
>

> On this Thanksgiving holiday here in the US, I think I can speak for all
> of us in saying "thank you" for this.

We are very glad to contribute.

> The proposed ISA covers not only SIMD computations, but also a lot of
> modes needed in traditional DSP computations such as fixed-points,
> saturation with an overflow flag, shifting-and-round, accumulation to
> 64 bits (to guarantee the precision) , bit-reversal, etc.
> Accumulate-to-64b instructions consume 4 reads/2 writes of GPRs.
> Implementations can split them to 2 cycles to reduce the ports if
> desired, but still much better than 12 cycles using the existing
> instructions, or 5 cycles using the best instructions one can design
> with only 2R1W. The attached file shows the semantics of the 32-bit
> DSP ISA. We’re extending the ISA for RV64.
>

> I think that we will probably need to break this up into several smaller
> extensions, with some of them under a "DSP" umbrella and others in other
> categories. Unfortunately, RVP is already specified in the current
> draft ISA spec as using the FP register file, but this could be a good
> starting point for a new fixed-point extension (possibly our first to
> use 48-bit opcodes) using the integer register file.
>
> The major concern I have with this as written is introducing an overflow
> flag, since RISC-V explicitly eschews condition codes. I presume that
> AndeStar V3 uses a condition code (and can probably branch on OV, much
> like x86).

The AndeStar V3 DSP extension uses an overflow flag and has instructions to read and clear that flag, but it does not have a "branch on OV flag" instruction.

> A wide-accumulate instruction will probably be appropriate, but could
> also easily use "magic" CSRs: define user CSRs acc0v0, acc0v1, acc0v2,
> acc0i0, acc0i1 with special write behavior: values written to acc0i0
> are added to acc0v0 with carry into acc0v1 with second carry into
> acc0v2, while values written to acc0i1 are added to acc0v1 with carry
> into acc0v2. Additional accumulators can be acc1*, acc2*, etc.

In our experience, using GPRs directly for wider data type accumulation is more efficient than using separate accumulator state. That is why we define paired GPRs in the 32-bit architecture for 64-bit data types.
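
A rough Python sketch of a 64-bit multiply-accumulate into a GPR pair is shown below. It is not the Andes instruction definition: the function name `mac64_pair`, the convention that the low word lives in `regs[rd]` and the high word in `regs[rd + 1]`, and the wrap-around on 64-bit overflow are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac64_pair(regs, rd, rs1, rs2):
    """regs[rd+1]:regs[rd] += signed(regs[rs1]) * signed(regs[rs2]).

    Reads four GPRs (rd, rd+1, rs1, rs2) and writes two (rd, rd+1),
    which matches the 4-read/2-write port count mentioned in the thread.
    """
    acc = (regs[rd + 1] << 32) | regs[rd]
    product = to_signed(regs[rs1], 32) * to_signed(regs[rs2], 32)
    acc = (acc + product) & MASK64       # wraps modulo 2**64 in this sketch
    regs[rd] = acc & MASK32
    regs[rd + 1] = acc >> 32


regs = [0] * 32
regs[10], regs[11] = 0xFFFFFFFF, 0       # accumulator pair holds 0xFFFFFFFF
regs[12], regs[13] = 2, 3
mac64_pair(regs, 10, 12, 13)             # adds 6, carrying into the high word
print(hex(regs[11]), hex(regs[10]))
```

The carry from the low register into the high register happens inside one operation, which is the step that costs extra instructions when only 2R1W instructions are available.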

> The zero-overhead loop mechanism is interesting and simple enough to
> adapt readily, even though its current MTLBI/MTLEI instructions will not
> fit in RISC-V. I suggest simplifying it to a counted implicit branch:
> When loopcount is greater than 1 and the program counter is equal to
> loopend, the instruction at loopend is executed, but the program counter
> is loaded from loopstart instead of advancing normally and loopcount is
> decremented by 1. No constraints are placed on the values in loopstart
> and loopend. An ordinary instruction fetch fault will be raised if
> execution is transferred to loopstart and that address is found to be
> invalid. The loopstart and loopend CSRs have special behavior when
> written: a sign-extended 12-bit value is multiplied by 4 and added to
> the current program counter to produce the value actually stored in the
> CSR. Other values are stored as given. This precludes loading
> addresses directly that are very close to NULL or top-of-memory, but
> neither of those pages should be holding zero-overhead loops and this
> allows the loop control to be provided using ADDI x0/CSRW pairs instead
> of needing new encoding. Assuming the namespace change goes through,
> this could be "Zizoloop" or "Izoloop" easily.

Why do you say that MTLBI/MTLEI do not fit in RISC-V? Is it because of the 16-bit immediate value? The immediate can be narrowed (e.g., to 12 bits) to fit into a RISC-V encoding. The MTLBI instruction does what you described: it multiplies a sign-extended value by 2 (halfword addressing), adds it to the current program counter, and writes the result into the loopstart register (LB). The MTLEI instruction does the same thing and writes the result into the loopend register (LE). The counted implicit branch operation you suggested is exactly what the hardware ZOL mechanism does to perform the looping behavior.
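
The address computation described for MTLBI/MTLEI can be sketched in Python as below. This is only an illustration of "sign-extend, scale by 2, add to PC": the helper names, the `imm_bits` parameter for trying a narrower immediate, and the wrap to `xlen` bits are assumptions, not the AndeStar definition.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def loop_address(pc, imm, imm_bits=16, xlen=32):
    """PC plus a sign-extended immediate counted in halfwords (2 bytes)."""
    return (pc + 2 * sign_extend(imm, imm_bits)) & ((1 << xlen) - 1)


def mtlbi(state, pc, imm, imm_bits=16):
    # Hypothetical model: write the computed address to the loopstart register.
    state["LB"] = loop_address(pc, imm, imm_bits)


def mtlei(state, pc, imm, imm_bits=16):
    # Hypothetical model: write the computed address to the loopend register.
    state["LE"] = loop_address(pc, imm, imm_bits)


state = {"LB": 0, "LE": 0}
mtlbi(state, 0x1000, 4)            # 4 halfwords ahead of the instruction
mtlei(state, 0x1004, 0x00A)        # 10 halfwords ahead
print(hex(state["LB"]), hex(state["LE"]))
```

Passing `imm_bits=12` models the narrower immediate suggested in the reply; the reachable range shrinks from roughly ±64 KiB to ±4 KiB around the instruction.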

Chuanhua Chang

Aneesh Raveendran

Nov 30, 2017, 1:52:58 AM

to chuanhua.chang, RISC-V ISA Dev, jcb6...@gmail.com

Wow... Great work. You have covered most of the subclasses of the DSP instruction set. Is there any plan to derive an energy efficient ISA extension include oppose fixing etc?

I would be grateful if you could openly share the ISA formation for the DSP extension for the RISC-V ISA.

Jacob Bachmeyer

Nov 30, 2017, 10:03:19 PM

to chuanhua.chang, RISC-V ISA Dev

Fair enough, although I would still like to find some way to eliminate
that extra state. RISC-V currently does not have an overflow or carry
flag, and I am reluctant to support adding one.

> A wide-accumulate instruction will probably be appropriate, but could
> also easily use "magic" CSRs: define user CSRs acc0v0, acc0v1,
> acc0v2,
> acc0i0, acc0i1 with special write behavior: values written to acc0i0
> are added to acc0v0 with carry into acc0v1 with second carry into
> acc0v2, while values written to acc0i1 are added to acc0v1 with carry
> into acc0v2. Additional accumulators can be acc1*, acc2*, etc.
>
> From our experience, using GPRs directly for wider data type
> accumulation is more efficient than using separate accumulator states.
> That is why we define paired GPRs in 32-bit architecture for 64-bit
> data type.

Paired GPRs would be a significant addition to RISC-V. I am uncertain
about the wisdom of adding them.

Partly the 16-bit immediate (RISC-V immediates are 12-bit) but more
severely that OP-IMM is full. We have no encoding space in the base
32-bit region for more I-type instructions. Another possibility could
be a special instruction in the BRANCH opcode, which could even supply a
predicate. (The location of the "BRANCH-LOOP" instruction would be the
loop end, while its branch target would be the loop start. The first
iteration would be a normal branch, while subsequent iterations would be
zero-overhead.) There are two remaining funct3 codes in BRANCH.

> The immediate value can be adjusted to smaller bit width (e.g., 12
> bits) to fit into RISC-V encoding. The MTLBI instruction does what you
> described to multiply a sign-extended value by 2 (half word address)
> and add it to the current program counter, and write the result into
> the loopstart register (LB). The MTLEI instruction does the same thing
> and write the result into the loopend register (LE). The counted
> implicit branch operation you suggested is exactly what the hardware
> ZOL mechanism is doing to perform the looping behavior.

The counted implicit branch mechanism I suggested is intended to be
exactly ZOL, but without the (documented but apparently unenforced)
requirement that loop end be "after" loop begin and adapted to better
take advantage of the RISC-V instruction set. The special CSR behavior
allows using the baseline CSRW opcode to encode MTLB/MTLE, while using
ADDI x0/CSRW pairs to encode MTLBI/MTLEI with a general 12-bit immediate
in 64 bits. MTLBI/MTLEI with a 5-bit unsigned immediate can be encoded
using the baseline CSRWI in a single 32-bit instruction. Requiring ZOL
loops to span 32-bit instruction words (or to use longer sequences to
load the CSRs) seems a reasonable compromise, since I would like to
allow unrolled inner loops inside a ZOL. Arbitrary loop bounds can be
generated and set using AUIPC/ADDI/CSRW sequences in 96 bits.
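
The special write behavior proposed for the loopstart and loopend CSRs can be sketched in Python as follows. This is an interpretation, not a specification: the function name `write_loop_csr` is hypothetical, "current program counter" is assumed to mean the address of the CSR write instruction, and results are wrapped to `xlen` bits.

```python
def write_loop_csr(pc, value, xlen=32):
    """Return the value actually stored when `value` is written to the CSR.

    A written value that fits in a sign-extended 12-bit immediate is treated
    as a PC-relative offset in units of 4 bytes; any other value is stored
    unchanged as an absolute address.
    """
    mask = (1 << xlen) - 1
    value &= mask
    signed = value - (1 << xlen) if value >> (xlen - 1) else value
    if -2048 <= signed <= 2047:
        return (pc + 4 * signed) & mask
    return value


pc = 0x80000100
print(hex(write_loop_csr(pc, 3)))            # small value: PC-relative
print(hex(write_loop_csr(pc, 0xFFFFFFFE)))   # -2 as a 32-bit value: PC-relative
print(hex(write_loop_csr(pc, 0x80000200)))   # large value: stored as given
```

The range check is what makes absolute addresses within 2 KiB of zero or of the top of memory unreachable by a direct write, which is the limitation noted in the original suggestion.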

-- Jacob

lk...@lkcl.net

Apr 6, 2018, 9:42:14 AM

to RISC-V ISA Dev

On Thursday, November 23, 2017 at 1:06:50 PM UTC, chuanhua.chang wrote:

There were some discussions in the ISA-Dev mailing list about DSP instructions recently. With the past experience in this area, Andes would like to contribute our >150 DSP and Packed SIMD instructions (or simply DSP ISA) developed in our AndeStar V3 architecture as a starting basis for the RISC-V “P” extension. Our DSP ISA uses only GPRs and targets for high efficiency. We have quite successful experience in the DSP ISA, which made the AndesCore D1088 our most popular core after its debut 3 years ago.

Hi Chuanhua,

So apologies I am catching up here quite a lot. In this post [1] I promised to do a review of the proposed DSP-ISA. Firstly: I like it. especially that, clearly, it has a decent pedigree. Secondly, I'm concerned about the comment at the bottom of each page of the PDF, "shall not be distributed without express permission". Strictly speaking that makes it impossible for anyone to discuss the document. Can that please be rectified with an update? (also Page 2)

So onwards with a review:

(a) The most immediate thing: as mentioned in [1] it's a hybrid of "1) A means to do single-instruction, multiple data" and "2) A DSP". My feeling is that those aspects should be split out into completely separate proposals (for the reasons outlined in [1]: the multiple-data aspects can be utilised elsewhere NOT just on DSP). So the rest of these comments are within that context.

(b) Section 1.1: this looks like it's specific to AndeStar. Some RV implementors may wish to implement far wider SIMD, and may even wish to (*transparently as far as the compiler is concerned*) make the level of hardware parallelism a choice based on what operation is to be carried out.

(c) Section 1.3 zero-overhead loop mechanism. I like this, a lot. I haven't quite got my head round how it could be incorporated into a general-purpose (proposed, separate) P-SIMD/Simple-V though. Also it says it's 32-bit: that would obviously need to be... be... XLEN-bit, is it? 32 for RV32, 64 for RV64. I *think*... it *might* just be possible to add without any kind of interference with what I propose in SIMD/Simple-V [2]

(d) Section 2.2.1 and 2.2.2: hoooraaaaay, some 8-bit and 16-bit operands, yippeee! First thing though: in [2] I proposed setting an implicit data-width (borrowing the idea from V). If that were to be implemented, ADD8 SUB8 would be removed because a standard RV add would *IMPLICITLY* be an 8-bit operation. likewise ADD16, SUB16.

(e) Again section 2.2.1 and 2.2.2: What happened to UADD8/16 and USUB8/16? if [2] was followed, those would be implicitly added because there already exists UADD and USUB (I hope!)

(f) Section 2.2.3: shifting - I'd strongly recommend re-awakening the B-Extension and moving this entire section to it.

(g) Section 2.2.4 and 2.2.5: if [2] were taken up this entire section would go because the standard RV* Base comparison instructions could be used.

(h) Section 2.2.6: most of these would stay if [2] were adopted. I honestly haven't studied RV multiply enough yet to say which of these could be moved and which could stay if P-SIMD became a DSP-only Extension.

(i) Section 2.2.7: looks great as a DSP-only Extension

(j) Section 2.2.8: these look to me like byte-swap instructions. Again, I'd recommend B-Extension be re-activated and these moved in their entirety for discussion there.

(k) Section 2.3: moot point in the context of proposal [2], why are these marked as "non-SIMD"? Surely in a 64-bit or 128-bit RV environment they would *be* SIMD? but as a DSP if Section 2.3 is morphed into a DSP-Extension it's a moot point

(l) Section 2.3.1: looks great as a DSP-Ext

(m) Section 2.3.2 and 2.3.3: suggest moving to B-Ext (perhaps not all of the 2.3.2 instructions, KSLRAW looks quite comprehensive)

(n) Section 2.3.4: SMMUL is *probably* (sorry haven't studied M-extension extensively!) duplicated, the rest would stay as a DSP-Ext.

(o) Section 2.3.5, 2.3.6 and 2.3.7: my favourites - Mult + Acc :) these would stay :)

(p) Section 2.3.8: bit-wise operations and count-leading-zero/one to be moved to B-Ext for discussion there, the rest look really good. oo, max/min! and parallel byte-sum accumulate, niiiice :)

(q) Section 2.3.9 and 2.3.11 Saturation: this is clearly targeted at audio. looks very sensible.

(r) Section 2.3.11 overflow: looks fine.

(s) Section 2.4.1 64-bit add/sub: again, just as with (d), ADD64, SUB64, remove: gain UADD64 and USUB64 by using implicit data-width of [2] which is *still needed* for RV128. the rest stay.

(t) Section 2.4.2 Signed 32-bit mul w/64-bit add: stays. perfect

(u) Section 2.4.3 Signed 16-bit mul w/64-bit add: stays. perfect.

(v) Section 2.4.oink?? Signed 8-bit mul w/64-bit add? and Signed 64-bit mul w/64-bit add? don't exist?

(w) also what about RV128?

(x) Section 3.3: loop. really really like it.

So just to recap: this is clearly based on an extremely powerful and successful DSP, which has clear signs of being suited to audio (i worked for CEDAR Audio back in 1992, we used a TI DSP, there).

As mentioned towards the end of [1] this proposal is definitely more DSP-focussed than it is actually SIMD-focussed. As such I really feel that it should be turned *into* a DSP-Extension, so that some implementors may consider using RISC-V *as* a low-cost DSP.

*As* a DSP-extension, it would be possible to have specific 32-bit, 64-bit and 128-bit variants RV32/64/128, thus making it possible for implementors to choose the level and extent to which they would like to go as far as bit-width was concerned. I dread to think quite how large a single-cycle 128-bit Multiply-and-Accumulate engine would actually need to be, but that's entirely up to RV128 implementors to deal with.

Interestingly there isn't actually any description of *how* SIMD is to be implemented, other than specifying that the 8-bit operations *can* be performed in parallel (is that mandatory? can 2 8-bit operations be done in parallel? what about 8 8-bit operations? or for RV128, what about 16 8-bit operations? what about 32 8-bit operations?)
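
To illustrate the lane-count question, here is a minimal Python sketch of a packed 8-bit add whose number of lanes simply follows the register width. It is an assumption-laden model: the name `add8` is borrowed from the proposal's mnemonic, but the wrap-around (non-saturating) behavior and the `xlen` parameter are choices made for this sketch, and nothing here says how many lanes hardware must process per cycle.

```python
def add8(a, b, xlen=32):
    """Lane-wise modular add of xlen/8 bytes packed into one register value."""
    result = 0
    for lane in range(xlen // 8):
        shift = 8 * lane
        # Only the low 8 bits of each lane's sum are kept, so no carry
        # crosses from one byte lane into the next.
        lane_sum = ((a >> shift) + (b >> shift)) & 0xFF
        result |= lane_sum << shift
    return result


# RV32: four byte lanes per register.
print(hex(add8(0x01FF7F80, 0x01018080)))
# RV64: eight byte lanes per register, same code.
print(hex(add8(0xFF00000000000001, 0x0100000000000001, xlen=64)))
```

The architectural result is the same whether an implementation computes the lanes in one cycle or several, which is why the degree of hardware parallelism can stay an implementation choice.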

And that's really the point of [2], which is that the Multiple-part of SIMD really *should* I feel be completely separate, and specified separately. A separate SIMD/Simple-V extension which quite literally comprises only 1 maybe 2 instructions absolute maximum should, I feel, be applicable across the *entire* RISC-V instruction set... *if desired*. Which then comes, as mentioned towards the end of [1], with the additional benefit that a SIMD/Simple-V extension is just as equally applicable to Floating-Point as it is to DSP.

... which, Richard, would indirectly tend to support what you were advocating, namely that a DSP-extension should remain in the integer register-space of RISC-V. not least: if FENCE instructions are used, if DSP was using the floating-point registers you could not have an Out-of-Order engine simultaneously issue Floating-Point instructions.

ok that's a lot. enough :)

l.

[1] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/SEMyC9IlAgAJ

[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/GuukrSjgBH8

Xan Phung

Apr 24, 2018, 11:16:46 AM

to RISC-V ISA Dev

I've been looking at the Andes SIMD instructions and also the RV Vector extension, and I believe it is possible to harmonise the operation of SIMD (as proposed by Andes) and Vectors (as proposed by RVV Working Group).

The main restriction is that whereas Andes SIMD instructions can use all 32 integer registers arbitrarily/disjointly as either INT8 or INT16 vectors, a harmonised SIMD/RVV will need to divide the integer register file into two contiguous banks of Vector(INT8) and Vector(INT16) respectively (the split defaults to 14+14, but is resizeable up to 32-0 or down to 0-32).

See the attached PDF file for a 1 page outline of how this would work, and then a 30 page comprehensive annotation of each Andes SIMD instruction and its mapping to the RV Vector extension instructions - almost the entire SIMD instruction set fits into a simple subset of RVV. (Also, a minority of the Andes non-SIMD/DSP instructions also map onto RVV, using an INT32 vector element type).

This "Harmonised" SIMD/RVV proposal also borrows some ideas of Waterman/Patterson (https://www.sigarch.org/simd-instructions-considered-harmful/) and "Simple V" (http://libre-riscv.org/simple_v_extension), but actually is much simpler than "Simple V" (in my opinion). In particular, unlike Simple V, all parallelism per instruction is restricted to what can be stored in a single register (eg. single RVI32 register, single RVI64 register, or single dedicated Vector register - which can be 128 or more bits in width). Binary code can be written to be forward compatible with all 3 of these options.

RVP_ISA_Summary.pdf

Xan Phung

Apr 24, 2018, 9:44:42 PM

to RISC-V ISA Dev

P.S. In case it isn't obvious, I've amended my "Harmonised" RVV/Packed SIMD proposal to explicitly state:

1. There is a 1:1 fixed mapping between r0-r31 and v0-v31 (in implementations that use the integer register file for vector operations)

An exception to this might be mapping v1 to r5, to avoid v1 masking operations clashing with r1 procedure linkage use

2. VL CSR, setvl instruction and MVL all operate as per the RV Vector spec.

(However using these RVV features remains optional in Packed SIMD code that does not need to have binary forward compatibility with RVV)

Also I fixed a couple of typos/errors in my Andes instruction set annotations on pages 5-8... see revised Harmonised proposal and Andes ISA annotation attached.

Also, for the Andes non SIMD/DSP and zero overhead loop instructions that don't map neatly to Vector INT32/INT16 instructions, these are outside the scope of Harmonised RVV/Packed SIMD spec - they would be specified separately and not use VOP/VMEM major opcodes (but could still be added using another major opcode).

RVP_ISA_Summary printed.pdf

Luke Kenneth Casson Leighton

Apr 24, 2018, 10:27:29 PM

to Xan Phung, RISC-V ISA Dev

On Wed, Apr 25, 2018 at 2:43 AM, Xan Phung <xan....@gmail.com> wrote:
> P.S. In case it isn't obvious, I've amended my "Harmonised" RVV/Packed SIMD
> proposal to explicitly state:
>
> 1. There is a 1:1 fixed mapping between r0-r31 and v0-v31 (in
> implementations that use the integer register file for vector operations)
> An exception to this might be mapping v1 to r5, to avoid v1 masking
> operations clashing with r1 procedure linkage use
>
> 2. VL CSR, setvl instruction and MVL all operate as per the RV Vector spec.
> (However using these RVV features remains optional in Packed SIMD code
> that does not need to have binary forward compatibility with RVV)

Thank you, Xan: you've seen that I have a page where I am collating
information on how to retrofit (Harmonise) RVV / Packed SIMD? The
focus of the effort has the same goal.

http://libre-riscv.org/simple_v_extension/p_comparative_analysis/

It is a beginning. Also, Richard Herveille very kindly pointed me
towards TI's DSP series (and... *wow*. 770-page ISA manual).

> Also, for the Andes non SIMD/DSP and zero overhead loop instructions that
> don't map neatly to Vector INT32/INT16 instructions, these are outside the
> scope of Harmonised RVV/Packed SIMD spec - they would be specified
> separately and not use VOP/VMEM major opcodes (but could still be added
> using another major opcode).

Here I found something called ZOLC: I posted separately about it (or
at least I thought I had...) and having seen quite how powerful ZOLC
really is (it can be used to get a WHOPPING 40% reduction in time
taken to do MPEG decode for example which is an extremely complex
6-level-nested multi-entry-point, multi-exit-point algorithm) I'm
inclined to say that any Zero-Overhead Looping should be evaluated
completely separately (separate extension) on the basis that it's too
good an opportunity to let roll by, by focussing just on single inner
loops.

l.

Xan Phung

Apr 25, 2018, 12:05:31 AM

to Luke Kenneth Casson Leighton, RISC-V ISA Dev

Hi Luke, yes I have seen the comparative analyses done for Simple V.

By the way, I mostly agree with all your conclusions, ie:

1. go for "variable" parallelism (to provide forward compatibility and avoid explosion of SIMD instructions for every data type & register width)

2. go for "indirect" instruction bit width (ie: instructions are polymorphic)

**however** I harmonise this with Andes' approach of putting explicit bit width into instructions, by effectively encoding bit width into the MSD of the 5-bit register specifiers

ie: there is a default register bank layout that divides 32 registers into two banks of 14 + 14 registers for byte & word vectors respectively

(the other four registers are v0/r0 = hardwired zero, v1 = predicate, v30-31 = INT32 type registers)

(this "default" layout is compliant with RVV vector regfile config options, and the split between INT8/INT16 and signed/unsigned is configurable)

3. vector load/store are complex but beneficial... however I see RVP as a "low end" RVV implementation

- hence I propose RVP support basic VLD/VST, but trap more complex strided and indexed VMEM ops of RVV

4. tag or no tag: predicates not needed for harmonising RVV to Andes Packed ISA but I leave scope for this to be included in RVP

- I only reuse mm="01" to encode Andes per-instruction saturation/rounding options

(which according to Roger Espasa's presentation is unused by RVV)

5. implicit loops to be specified separately, don't tie into SIMD

However, I diverge from Simple V in the following ways:

- vectors map into only single integer registers (rather than multiple registers)

(if an implementation wishes to support larger vectors beyond RV32/RV64 then it has option of dedicated vector register file)

- vector operations & load/stores are mapped into RVV's VOP and VMEM opcodes, and do not remap base integer ISA instructions

this preserves upwards binary compatibility between RVV and RVP

regards

Xan

Luke Kenneth Casson Leighton

Apr 25, 2018, 8:19:46 AM4/25/18

to Xan Phung, RISC-V ISA Dev

On Wed, Apr 25, 2018 at 5:04 AM, Xan Phung <xan....@gmail.com> wrote:
> Hi Luke, yes I have seen the comparative analyses done for Simple V.

:) sorry brief words, rsi now.

http://libre-riscv.org/Comparative_analysis_Harmonised_RVP_vs_Andes_Packed_SIMD_ISA_proposal/

nice! may move/rename if that's ok, email me offlist when paused major editing.

> By the way, I mostly agree with all your conclusions, ie:
> 1. go for "variable" parallelism (to provide forward compatibility and avoid
> explosion of SIMD instructions for every data type & register width)

O(N^5). insane.

> 2. go for "indirect" instruction bit width (ie: instructions are
> polymorphic)
> **however** I harmonise this with Andes' approach of putting explicit bit
> width into instructions, by effectively encoding bit width into MSD of 5bit
> register specifiers
> ie: there is a default register bank layout that divides 32 registers into
> two banks of 14 + 14 registers for byte & word vectors respectively
> (the other four registers are v0/r0 = hardwired zero, v1 = predicate,
> v30-31 = INT32 type registers)

nice. can you pseudocode it? difficulty with words, pseudocode
crystal clear. lesson of intel TRMs (and clifford's xbitmanip).

oh. table in link above, good illustration. what is definition of "GPR"?

slight concern reduction no of registers, nice idea overload regnum
with bitwidth. whats difference vewn, sect 17.5 draft-0.4 v2.3-draft
rv manual?

> (this "default" layout is compliant with RVV vector regfile config
> options, and the split between INT8/INT16 and signed/unsigned is
> configurable)
>
> 3. vector load/store are complex but beneficial... however I see RVP as a
> "low end" RVV implementation
> - hence I propose RVP support basic VLD/VST, but trap more complex strided
> and indexed VMEM ops of RVV

yes. valid implementation. needs HW CSRs to do it well. emulate
CSRs hard in SW.
http://libre-riscv.org/simple_v_extension/#implementation_paradigms
will expand later.

> 4. tag or no tag: predicates not needed for harmonising RVV to Andes Packed
> ISA but I leave scope for this to be included in RVP

big advantages.

> - I only reuse mm="01" to encode Andes per-instruction saturation/rounding
> options
> (which according to Roger Espaza's presentation is unused by RVV)

oo that's good idea.

> 5. implicit loops to be specified separately, don't tie into SIMD
>
> However, I diverge from Simple V in the following ways:
> - vectors map into only single integer registers (rather than multiple
> registers)
> (if an implementation wishes to support larger vectors beyond RV32/RV64
> then it has option of dedicated vector register file)
> - vector operations & load/stores are mapped into RVV's VOP and VMEM
> opcodes, does not map base integer ISA instructions
> this preserves upwards binary compatibility between RVV and RVP

how to discern between the two?

also would like to see way to take "cross/uncross" out of explicit
instruction encoding. cross/uncross then can apply to compares, SMIN,
SMAX and more.

SIMD ISA has cross/uncross (CRAS) so gates are going in anyway, might
as well bring up to all of ISA, makes more powerful.

l.

Xan Phung

Apr 25, 2018, 9:22:28 AM4/25/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev

Hi Luke,

The material at http://libre-riscv.org/Comparative_analysis_Harmonised_RVP_vs_Andes_Packed_SIMD_ISA_proposal/ is stable now, feel free to move and rename.

In relation to your feedback/queries:

1. What are "GPRs"?

GPR = general purpose integer registers

ie: the target for a Harmonised RV Packed spec would be "low end" processors that don't have power/transistor budget for dedicated vector registers, hence instead integer GPRs are used for SIMD.

Once a CPU has a transistor budget for dedicated vector registers, then it should really implement full blown RV Vector and not RV Packed

2. What is difference between Harmonised RVP and RV Vector?

How does it relate to RV Vector vewn & what is pseudocode to define Harmonised vector config?

Harmonised RVP is, by design, essentially a subset of RVV (hence "harmonised").

The main difference is VLD/VST go to/from integer GPRs, and not dedicated vector regs like in RV Vector.

The use case I have in mind is something like a string copy, which on low end CPUs might use integer register SIMD to copy bytes 4x per loop

The same code should then execute unmodified on high end CPUs with wide vector regs that might copy string bytes at 16x or 32x per loop.
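
A minimal Python sketch of the kind of vector-length-agnostic copy loop described above. It assumes a hypothetical `max_vl` parameter standing in for how many bytes the hardware handles per iteration (4 for packed bytes in a 32-bit GPR, 16 or 32 for wide vector registers). The function and parameter names are illustrative only and are not taken from either spec.

```python
def copy_bytes(src: bytes, max_vl: int):
    """Copy src in chunks of at most max_vl bytes; return (copy, iterations)."""
    dst = bytearray()
    i = 0
    iterations = 0
    while i < len(src):
        # The hardware, not the program, decides how many elements are
        # processed this time round (in the spirit of a setvl-style request).
        vl = min(max_vl, len(src) - i)
        dst += src[i:i + vl]  # stands in for one vector load plus one vector store
        i += vl
        iterations += 1
    return bytes(dst), iterations


msg = b"hello, harmonised RVP"                        # 21 bytes
low_end, n_low = copy_bytes(msg, max_vl=4)    # packed bytes in a 32-bit GPR
high_end, n_high = copy_bytes(msg, max_vl=16)  # wide vector registers
```

The loop body is identical in both calls; only the per-iteration width changes, which is the point of keeping RVP binary compatible with RVV.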

If software needs to discern the two, the misa CSR will be where to look, although this can only be read by privileged code...

(I can't think of any way for user mode code to quickly check misa??)
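
For reference, the Extensions field of misa uses one bit per extension letter, with bit 0 for "A" through bit 25 for "Z". A small Python sketch of the check that privileged code could make once it has read the CSR; `has_extension` is a hypothetical helper name and the example misa value is made up for illustration.

```python
def has_extension(misa: int, letter: str) -> bool:
    """Return True if the misa Extensions bit for `letter` is set."""
    bit = ord(letter.upper()) - ord("A")
    if not 0 <= bit <= 25:
        raise ValueError("extension letter must be A-Z")
    return bool((misa >> bit) & 1)


# Illustrative value only: the I, M and P bits are set, V is not.
example_misa = (1 << 8) | (1 << 12) | (1 << 15)
```

With this value, `has_extension(example_misa, "P")` is true and `has_extension(example_misa, "V")` is false, which is the distinction software would need to make between an RVP-only and a full RVV implementation.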

Also, unlike the "blank slate" of RV Vector, in Harmonised RVP a default set of register types is defined and mapped 1:1 onto the 31 GPRs... this is to keep implementations simple & avoid "fancy" vector packing complexities. (The aim of having a default register type config is to avoid the vector unit startup overhead. Essentially, SIMD instructions can be executed straight away without doing any setup, provided programmers can live with the reduced number of registers of each type that the default partitioning gives. If a programmer needs more registers of a type then they need to do their own config.)

The default Harmonised vector config I propose can be replicated with the following RV Vector code:

li r31, <00100><00001><001 00> // 16 bit immediate value for <vshapen><verepn><vewn>
csrw vcfg0, r31 // r0-r7 are vector[signed int8]
csrw vcfg1, r31
csrw vcfg2, r31
csrw vcfg3, r31

li r31, <00100><00000><001 00>
csrw vcfg4, r31 // r8-r15 are vector[unsigned int8]
csrw vcfg5, r31
csrw vcfg6, r31
csrw vcfg7, r31

li r31, <00100><00001><002 00>
csrw vcfg8, r31 // r16-r23 are vector[signed int16]
csrw vcfg9, r31
csrw vcfg10, r31
csrw vcfg11, r31

li r31, <00100><00000><002 00>
csrw vcfg12, r31 // r24-r29 are vector[unsigned int16]
csrw vcfg13, r31
csrw vcfg14, r31

li r31, <00100><00001><003 00>
csrw vcfg15, r31 // r30-31 are vector[signed int32]
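
The listing above can also be read as a lookup from register number to element type. A Python sketch, assuming each vcfgN CSR covers two consecutive registers (so that vcfg12-vcfg14 cover r24-r29); the table and function names are hypothetical, and the special roles of r0 and the predicate register are ignored here.

```python
# (register range, element type) pairs for the assumed default layout
DEFAULT_LAYOUT = [
    (range(0, 8), "signed int8"),
    (range(8, 16), "unsigned int8"),
    (range(16, 24), "signed int16"),
    (range(24, 30), "unsigned int16"),
    (range(30, 32), "signed int32"),
]


def default_reg_type(regnum: int) -> str:
    """Return the default element type implied by a 5-bit register number."""
    for regs, elem_type in DEFAULT_LAYOUT:
        if regnum in regs:
            return elem_type
    raise ValueError(f"no such register: r{regnum}")
```

Because the type follows from the register number alone, an instruction such as a packed add needs no separate width field: the choice of operand registers selects it.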

3. Crossed adds/multiplies

Yeah, agree this is a bit of the messy part of Andes SIMD, and there isn't currently enough detail on RV Vector instructions to let me figure out the best way to do this in the RV Vector paradigm.

Luke Kenneth Casson Leighton

Apr 25, 2018, 11:20:39 PM4/25/18

to Xan Phung, RISC-V ISA Dev

On Wed, Apr 25, 2018 at 2:21 PM, Xan Phung <xan....@gmail.com> wrote:
> Hi Luke,
>

> The material at
> http://libre-riscv.org/Comparative_analysis_Harmonised_RVP_vs_Andes_Packed_SIMD_ISA_proposal/
> is stable now, feel free to move and rename.

done, can you identify which of
http://libre-riscv.org/harmonised_rvv_rvp/ and
Harmonised_RVV/Packed_SIMD are up-to-date?

> In relation to your feedback/queries:
>
> 1. What are "GPRs"?
>
> GPR = general purpose integer registers

ok will add that once i know which page is up-to-date

> ie: the target for a Harmonised RV Packed spec would be "low end" processors
> that don't have power/transistor budget for dedicated vector registers,
> hence instead integer GPRs are used for SIMD.

yes. goal of s-v as well.

> Once a CPU has a transistor budget for dedicated vector registers, then it
> should really implement full blown RV Vector and not RV Packed
>
> 2. What is difference between Harmonised RVP and RV Vector?
> How does it relate to RV Vector vewn & what is psuedocode to define
> Harmonised vector config?
>
> Harmonised RVP by intentional design essentially is a subset of RVV (hence
> "harmonised").
> The main difference is VLD/VST go to/from integer GPRs, and not dedicated
> vector regs like in RV Vector.

doesn't necessarily mean that vewn should be excluded from H-RVV.

> The use case I have in mind is something like a string copy, which on low
> end CPUs might use integer register SIMD to copy bytes 4x per loop
> The same code should then execute unmodified on high end CPUs with wide
> vector regs that might copy string byes at 16x or 32x per loop.
>
> If software needs to discern the two, the misa CSR will be where to look,
> although this can only be read by privileged code...
> (I can't think of any way for user mode code to quickly check misa??)

highlights fundamental problem with RV spec, solution WARL mvendorid-marchid

> Also, unlike the "blank slate" of RV Vector, in Harmonised RVP, a default
> set of register types are defined, and mapped 1:1 into the 31 GPRs... this
> is to keep implementations simple & avoid "fancy" vector packing
> complexities. (The aim of having a default register type config is to avoid
> the vector unit startup overhead. Essentially, SIMD instructions can be
> executed straight away without doing any setup, on the proviso programmers
> can live with reduced register numbers due to default partitioning. If
> programmer needs more registers of a type then they need to do their own
> config.)

concern: what about e.g. LTE handsets, power greatly reduced by
massively-wide SIMD (512, 1024 bit) 8 and 16-bit ops to do FFT, I/Q
etc. basically SDR.

> 3. Crossed adds/multiplies
>
> Yeah agree this is a bit of the messy part of Andes SIMD, and there isnt
> currently enough detail on RV Vector instructions to let me figure out the
> best way to do this in RV Vector paradigm.

able to reverse-engineer, takes 4-5 weeks.

similar to vewn (17.5 0.4draft v2.3draft) and verepn (17.4)
possibility exists add swapper (butterfly) 6-bit covers full
permutation of 4-wide elements. 4-wide elements permits H-L / L-H and
Q0-Q3 full byte reordering.

beyond that VSLIDE (17.20), register gather-scatter (17.19) can do
larger reordering.

however question is, use xBitManip? does same job, except concern: DSP
archs much simpler, all RV proponents assume OoO, reg-rename, other
strategies. DSP archs TI C670* way simpler.

l.

Xan Phung

Apr 26, 2018, 9:43:03 AM4/26/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev

Hi Luke,

The two pages of my material that are up to date are:

http://libre-riscv.org/harmonised_rvv_rvp/

http://libre-riscv.org/harmonised_rvv_rvp/comparative_analysis_harmonised_rvp_vs_andes_packed_simd_isa_proposal/

The first link above is an overview

The second link above contains the register file layout, instruction encoding and tables setting out each Andes SIMD instruction with equivalent Harmonised RVP instruction.

regards

Xan

Luke Kenneth Casson Leighton

Apr 26, 2018, 10:19:13 AM4/26/18

to Xan Phung, RISC-V ISA Dev, chuanhua.chang

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Apr 26, 2018 at 2:42 PM, Xan Phung <xan....@gmail.com> wrote:
> Hi Luke,
>

> The two pages of my material that are up to date are:
>
> http://libre-riscv.org/harmonised_rvv_rvp/
> http://libre-riscv.org/harmonised_rvv_rvp/comparative_analysis_harmonised_rvp_vs_andes_packed_simd_isa_proposal/

super. clear. shortened to
http://libre-riscv.org/harmonised_rvv_rvp/comparative_analysis/

> The first link above is an overview
>
> The second link above contains the register file layout, instruction
> encoding and tables setting out each Andes SIMD instruction with equivalent
> Harmonised RVP instruction.

will start linking in, tomorrow.

also going to borrow tables :)

what do you think of the "CSR cross[32][6]" idea? sorry below may
not be exactly clear, it's basically a way to generalise all
cross-operations, even the SUNPKD810 rt, ra and ZUNPKD810 rt, ra would
reduce down to one instruction as opposed to 8 right now.

def butterfly_remap(remap_me):
    # hmmm a little hazy on the details here....
    # help, help! logic-dyslexia kicking in!
    # erm do some crossover using the 6 bits from
    # the CSR cross map. first 2 bits swap
    # elements in index positions 0,1 and 2,3
    # second 2 bits swap elements in positions 0,2 and 1,3
    # then swap 0,1 and 2,3 a second time.
    # gives full set of all permutations.
    return something, something

def crossover(elidx, destreg):
    base = elidx & ~0x7
    return butterfly_remap(CSR_cross[destreg][elidx & 0x7])

def op(v1, v2, v3):
    for i in range(vlen):
        remap_src1, remap_src2 = crossover(i, v1)
        # remap_srcN references byte offsets? erm.... :)
        GPR[v1] = scalar_op(GPR[v2][remap_src1], GPR[v3][remap_src2])
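
One possible concrete reading of the 6-bit idea above, as a Python sketch: three stages of conditional swaps on a group of 4 elements, with one control bit per swap. The assignment of bits to swaps is an assumption made for illustration, not a defined encoding.

```python
# (control-bit index, position a, position b) for each conditional swap
STAGES = [
    (0, 0, 1), (1, 2, 3),  # first 2 bits: swap 0,1 and 2,3
    (2, 0, 2), (3, 1, 3),  # second 2 bits: swap 0,2 and 1,3
    (4, 0, 1), (5, 2, 3),  # last 2 bits: swap 0,1 and 2,3 a second time
]


def butterfly_remap(ctrl: int, elems=(0, 1, 2, 3)):
    """Apply the swaps selected by the 6-bit ctrl value to a 4-element group."""
    out = list(elems)
    for bit, a, b in STAGES:
        if (ctrl >> bit) & 1:
            out[a], out[b] = out[b], out[a]
    return tuple(out)


# Enumerate every control value to see which orderings can be produced.
reachable = {butterfly_remap(c) for c in range(64)}
```

This swap pattern has the same shape as a 4-input Benes network, so the 64 control values cover all 24 orderings of 4 elements, with some orderings reachable in more than one way.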

lk...@lkcl.net

Apr 27, 2018, 12:31:37 AM4/27/18

to RISC-V ISA Dev, xan....@gmail.com, chuanhu...@gmail.com

On Thursday, April 26, 2018 at 3:19:13 PM UTC+1, lk...@lkcl.net wrote:

> will start linking in, tomorrow.

added first link

> what do you think of the "CSR cross[32][6]" idea? sorry below may
> not be exactly clear, it's basically a way to generalise all
> cross-operations, even the SUNPKD810 rt, ra and ZUNPKD810 rt, ra would
> reduce down to one instruction as opposed to 8 right now.

placed question(s) here so not forgotten:

http://libre-riscv.org/harmonised_rvv_rvp/discussion/

Xan Phung

Apr 27, 2018, 6:28:05 AM4/27/18

to lk...@lkcl.net, RISC-V ISA Dev, chuanhu...@gmail.com

Can anyone on the RV Vector WG give more info on:

1. vrgather instruction

2. VOP: how many unused func7 opcodes are there? (func7=bits 31:25).

3. "mm" opcodes: are there two "mm" mask opcode bits in VOP (as per Roger Espasa's Nov 2017 presentation) or is there one predicate selection bit (as per the current draft Vector spec)?

4. missing vsle instruction (see slide 5 of Roger Espasa's presentation): why is vsge (set greater than or equal) provided when it doesn't add any new functionality not provided by vslt (set less-than), while vsle (set less than or equal) is missing from the instruction list?

5. vslide instruction

6. vextract & vpopc instructions

For the Andes SUNPKD and ZUNPKD, I believe RV Vector's vrgather instruction can provide equivalent functionality (but the draft RV Vector spec doesn't give enough detail for me to be sure).

In relation to the Andes crossed SIMD instructions, I don't know enough about the use cases of crossed SIMD arithmetic to comment in an informed way on how they should be provided. Can anyone else give insight into whether they are heavily used instructions or used only in specialised niches? Also, why does Andes only provide 16 bit crossed operations and not 8 bit?

regards

Xan

Luke Kenneth Casson Leighton

May 19, 2018, 3:02:58 PM5/19/18

to Xan Phung, RISC-V ISA Dev, chuanhua.chang

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:
> Can anyone on the RV Vector WG give more info on:
> 1. vrgather instruction
> 2. VOP: how many unused func7 opcodes are there? (func7=bits 31:25).
> 3. "mm" opcodes: are there two "mm" mask opcode bits in VOP (as per Roger
> Espaza's Nov 2017 presentation) or is there one predicate selection bit (as
> per current draft Vector spec?)
> 4. missing vsle instruction (see slide 5 of Roger Espaza presentation): why
> vsge (set greater than or equal) is provided when it doesn't add any new
> functionality not provided by vslt (set less-than), but vsle (set less than
> or equal) is missing from the instruction list?
> 5. vslide instruction
> 6. vextract & vpopc instructions

found one: p18 https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf

roger espasa's 2018 barcelona slides also had these (and are more up-to-date)

l.

Luke Kenneth Casson Leighton

May 19, 2018, 3:20:47 PM5/19/18

to Xan Phung, RISC-V ISA Dev, chuanhua.chang

On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:

> In relation to the Andes crossed SIMD instructions, I don't know enough
> about the use cases of crossed SIMD arithmetic to comment in an informed way
> on how they should be provided.

i'm currently looking at a way to implement these in simple-v using:
* xBitManip GREV followed by
* non-crossed SIMD op followed by
* xBitManip GREVI

with optimisations being free to perform macro-op fusion on 2 (or even
3) of those. i'm not totally happy with that approach however as it
does rather overload the instruction cache (3 instructions instead of
1, even if the SIMD op itself can be Compressed).
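
A Python model of that three-step sequence, as a sketch only. `grev32` follows the generalised-reverse definition from the xBitManip draft; `add16` and `crossed_add16` are hypothetical helper names, and the particular "crossing" shown (pairing each halfword of one source with the opposite halfword of the other) is assumed for illustration.

```python
def grev32(x: int, k: int) -> int:
    """Generalised bit reverse on a 32-bit value, controlled by the 5-bit k."""
    if k & 1:
        x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    if k & 2:
        x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    if k & 4:
        x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    if k & 8:
        x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    if k & 16:
        x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    return x & 0xFFFFFFFF


def add16(a: int, b: int) -> int:
    """Non-crossed packed add: two independent 16-bit lanes, wrapping."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = ((a >> 16) + (b >> 16)) & 0xFFFF
    return (hi << 16) | lo


def crossed_add16(a: int, b: int, swap_dest: bool = False) -> int:
    """Pre-swap one source with GREV, do the plain op, optionally post-swap."""
    r = add16(a, grev32(b, 16))  # k=16 swaps the two halfwords of b
    return grev32(r, 16) if swap_dest else r
```

For example, `crossed_add16(0x00010002, 0x00100020)` pairs 0x0001 with 0x0020 and 0x0002 with 0x0010. The three calls correspond to the three instructions that a fusing implementation would try to issue as one.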

i'm really quite concerned about O(N^6) SIMD instruction proliferation
[0], am having difficulty thinking how to do this.

l.

[0] 1: opcode 2: num-elements 3: el-bitwidth 4: src1 cross 5: src2
cross 6: dest cross

Jacob Bachmeyer

May 19, 2018, 9:00:00 PM5/19/18

to Luke Kenneth Casson Leighton, Xan Phung, RISC-V ISA Dev, chuanhua.chang

Luke Kenneth Casson Leighton wrote:
> On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:
>
>> In relation to the Andes crossed SIMD instructions, I don't know enough
>> about the use cases of crossed SIMD arithmetic to comment in an informed way
>> on how they should be provided.
>>
>
> i'm currently looking at a way to implement these in simple-v using:
> * xBitManip GREV followed by
> * non-crossed SIMD op followed by
> * xBitManip GREVI
>

GREVI is "GREV immediate", not "GREV inverse".

> with optimisations being free to perform macro-op fusion on 2 (or even
> 3) of those. i'm not totally happy with that approach however as it
> does rather overload the instruction cache (3 instructions instead of
> 1, even if the SIMD op itself can be Compressed).
>
> i'm really quite concerned about O(N^6) SIMD instruction proliferation
> [0], am having difficulty thinking how to do this.
>
> l.
>
> [0] 1: opcode 2: num-elements 3: el-bitwidth 4: src1 cross 5: src2
> cross 6: dest cross

In the RVP partitioning scheme I propose these map to:
1: opcode (new opcodes for partitioned ops)
2 - 3: part CSR (implicitly defines both element count and bitwidth,
but total width must equal XLEN)
4 - 6: use BitManip to shuffle inputs and outputs
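
A Python sketch of what a partitioned add under such a scheme might look like. The part CSR is modelled here as a plain list of element widths, which is an assumption: the actual CSR encoding is not given in this thread, and `partitioned_add` is a hypothetical name.

```python
XLEN = 32


def partitioned_add(a: int, b: int, widths, xlen: int = XLEN) -> int:
    """Add a and b lane by lane; `widths` lists lane sizes from the LSB up."""
    if sum(widths) != xlen:
        raise ValueError("partition widths must sum to XLEN")
    result = 0
    shift = 0
    for w in widths:
        mask = (1 << w) - 1
        lane = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane << shift  # carries never cross a lane boundary
        shift += w
    return result
```

The same opcode then gives different results depending only on the partitioning: `partitioned_add(0x0000FFFF, 1, [32])` carries into the upper half, while the `[16, 16]` partitioning wraps the low lane to zero.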

-- Jacob

Xan Phung

May 19, 2018, 10:00:00 PM5/19/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev, chuanhua.chang

Hi Luke,

Thanks also for the Berkeley teaching materials on RVV - I guess this is now mostly out of date given the Barcelona changes.

I have also seen the Barcelona slides on the revised Vector proposal & I notice that (in the base V ISA) they drop polymorphic vector instructions and instead provide multiple sets of FP and integer arithmetic instructions. An extension to V will then re-introduce polymorphic instructions.

Adding the Andes SIMD proposal, this means there will be at least 5x SIMD-like instruction sets in RISC V:

(i) RVP for integer 8 bit types

(ii) RVP for integer 16 bit types

(iii) RVV Base for integer types

(iv) RVV Base for FP types (with H, S, D and Q subsets)

(v) RVV Extension for polymorphic types

The above seems to contradict the RISC V philosophy, which had previously critiqued other architectures for their explosion of SIMD instruction counts. The reason given in the Barcelona slides was "concern on total state needed to hold reg types". I'm waiting on the video upload to find out more & exactly what was discussed.

regards

Xan

On 20 May 2018 at 05:02, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:

> Can anyone on the RV Vector WG give more info on:
> 1. vrgather instruction
> 2. VOP: how many unused func7 opcodes are there? (func7=bits 31:25).
> 3. "mm" opcodes: are there two "mm" mask opcode bits in VOP (as per Roger
> Espaza's Nov 2017 presentation) or is there one predicate selection bit (as
> per current draft Vector spec?)
> 4. missing vsle instruction (see slide 5 of Roger Espaza presentation): why
> vsge (set greater than or equal) is provided when it doesn't add any new
> functionality not provided by vslt (set less-than), but vsle (set less than
> or equal) is missing from the instruction list?
> 5. vslide instruction
> 6. vextract & vpopc instructions

Luke Kenneth Casson Leighton

May 20, 2018, 3:45:35 AM5/20/18

to jcb6...@gmail.com, Xan Phung, RISC-V ISA Dev, chuanhua.chang

On Sunday, May 20, 2018, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Luke Kenneth Casson Leighton wrote:
>> On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:
>>> In relation to the Andes crossed SIMD instructions, I don't know enough
>>> about the use cases of crossed SIMD arithmetic to comment in an informed way
>>> on how they should be provided.
>>
>> i'm currently looking at a way to implement these in simple-v using:
>> * xBitManip GREV followed by
>> * non-crossed SIMD op followed by
>> * xBitManip GREVI
>
> GREVI is "GREV immediate", not "GREV inverse".

Oh, thx. That's what i needed.

>> with optimisations being free to perform macro-op fusion on 2 (or even
>> 3) of those. i'm not totally happy with that approach however as it
>> does rather overload the instruction cache (3 instructions instead of
>> 1, even if the SIMD op itself can be Compressed).
>>
>> i'm really quite concerned about O(N^6) SIMD instruction proliferation
>> [0], am having difficulty thinking how to do this.
>>
>> l.
>>
>> [0] 1: opcode 2: num-elements 3: el-bitwidth 4: src1 cross 5: src2
>> cross 6: dest cross
>
> In the RVP partitioning scheme I propose these map to:
> 1: opcode (new opcodes for partitioned ops)
> 2 - 3: part CSR (implicitly defines both element count and bitwidth, but total width must equal XLEN)
> 4 - 6: use BitManip to shuffle inputs and outputs

Yes this seems eminently sensible to me. The tradeoff is a delay due to setup of the CSRs plus bitmanip ops will need macroop fusion.

One other way to do it is to have 4-6 as CSRs as well, and underneath it indicates implicitly that the op gets redirected through the bitmanip HW pre and post op.

However that means extra bits in the CSR and also it starts to get a bit... less-elegant if you know what I mean.

Wavering here :)

> -- Jacob

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5B00C88C.7050906%40gmail.com.

--

Luke Kenneth Casson Leighton

May 20, 2018, 4:15:15 AM5/20/18

to Xan Phung, RISC-V ISA Dev, chuanhua.chang

On Sunday, May 20, 2018, Xan Phung <xan....@gmail.com> wrote:

> Hi Luke,
>
> Thanks also for the Berkeley teaching materials on RVV - I guess this is now mostly out of date given the Barcelona changes.
>
> I have also seen the Barcelona slides

Can you send me a link on how to find them?

> on the revised Vector proposal & I notice that (in the base V ISA) they drop polymorphic vector instructions and instead provide multiple sets of FP and integer arithmetic instructions. An extension to V will then re-introduce polymorphic instructions.

After seeing how python has to implement polymorphism I am not surprised. It's really quite complex. Implicit conversion of length is fine except you have sign to deal with.

In notsosimplev I am proposing to deal with this by just overloading standard ops, and standard meanings of standard ops will apply, where the spec has this all analysed and sorted out and made absolutely clear.

That the standard base is so clear and already done is precisely why it is a sane idea to drop vectorisation directly on top. RVV has to not only duplicate the pipeline and the regfile, it also has to duplicate the spec (49 arith ops) and there's a huge amount of work to be done in gcc as well.

So you use FMV.X.W and so on and that converts int vec to fp vec, sorted. Size when type is FP16 will need to be analysed.

> Adding the Andes SIMD proposal, this means there will be at least 5x SIMD-like instruction sets in RISC V:
> (i) RVP for integer 8 bit types
> (ii) RVP for integer 16 bit types
> (iii) RVV Base for integer types
> (iv) RVV Base for FP types (with H, S, D and Q subsets)
> (v) RVV Extension for polymorphic types
>
> The above seems to contradict the RISC V philosophy, which had previously critiqued other architectures for their explosion of SIMD instruction counts.

Yes I agree wholeheartedly. With apologies to the people whose time and effort has gone into RVV and so on, no disrespect implied. I am making an effort in notsosimplev to make it possible for RVV to "lift up the carpet and have n-s-s-v underneath" if you know what i mean. Extra capabilities like 256 regs, polymorphism, GF and complex nums all added seamlessly on top.

While I appreciate that RVV has a supercomputer pedigree and contributions from some extremely knowledgeable sources, I feel that a holistic approach needs to be taken here rather than a division of labour approach (separate isolated WGs).

That implies opening up the discussions and allowing inter-working group and open unrestricted access to documents! A holistic approach cannot possibly succeed if people are not allowed to freely and openly interact on a real-time basis.

A fly in the ointment is that VCLIP is not part of standard RV, it's the only op out of 49 that is not duplicated/present.

> The reason given in the Barcelona slides was "concern on total state needed to hold reg types".

It's the interaction that gets you. Python does it by having each type be responsible for converting to other types. That's an O(n^2) algorithm and there's not a lot you can do about that. If you try to reduce die area by chaining conversions you introduce latency and complexity due to double or even triple conversions.

> I'm waiting on the video upload to find out more & exactly what was discussed.
>
> regards
>
> Xan

On 20 May 2018 at 05:02, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Fri, Apr 27, 2018 at 11:27 AM, Xan Phung <xan....@gmail.com> wrote:
> Can anyone on the RV Vector WG give more info on:
> 1. vrgather instruction
> 2. VOP: how many unused func7 opcodes are there? (func7=bits 31:25).
> 3. "mm" opcodes: are there two "mm" mask opcode bits in VOP (as per Roger
> Espaza's Nov 2017 presentation) or is there one predicate selection bit (as
> per current draft Vector spec?)
> 4. missing vsle instruction (see slide 5 of Roger Espaza presentation): why
> vsge (set greater than or equal) is provided when it doesn't add any new
> functionality not provided by vslt (set less-than), but vsle (set less than
> or equal) is missing from the instruction list?
> 5. vslide instruction
> 6. vextract & vpopc instructions

found one: p18 https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf

roger espaca's 2018 barcelona slides also had these (and are more up-to-date)

l.


Jacob Bachmeyer

May 20, 2018, 9:00:03 PM5/20/18

to Luke Kenneth Casson Leighton, Xan Phung, RISC-V ISA Dev, chuanhua.chang

Luke Kenneth Casson Leighton wrote:
> On Sunday, May 20, 2018, Jacob Bachmeyer <jcb6...@gmail.com

> <mailto:jcb6...@gmail.com>> wrote:
>
> Luke Kenneth Casson Leighton wrote:
>

> [...]

>
> with optimisations being free to perform macro-op fusion on 2
> (or even
> 3) of those. i'm not totally happy with that approach however
> as it
> does rather overload the instruction cache (3 instructions
> instead of
> 1, even if the SIMD op itself can be Compressed).
>
> i'm really quite concerned about O(N^6) SIMD instruction
> proliferation
> [0], am having difficulty thinking how to do this.
>
> l.
>
> [0] 1: opcode 2: num-elements 3: el-bitwidth 4: src1 cross 5: src2
> cross 6: dest cross
>
>
> In the RVP partitioning scheme I propose these map to:
> 1: opcode (new opcodes for partitioned ops)
> 2 - 3: part CSR (implicitly defines both element count and
> bitwidth, but total width must equal XLEN)
> 4 - 6: use BitManip to shuffle inputs and outputs
>
>
> Yes this seems eminently sensible to me. The tradeoff is a delay due
> to setup of the CSRs plus bitmanip ops will need macroop fusion.

I expect that most software using partitioning will load the part CSR
once or rarely and perform a large number of calculations for each
partitioning selected. Since partitioned ops have distinct opcodes,
hardware can continue to execute non-partitioned instructions while a
partitioning change propagates, stalling on an interlock if a
partitioned operation is reached before the configuration is ready.

Macro-op fusion varies; higher-performance implementations could fuse
BitManip and partitioned instructions, while simpler implementations can
simply execute them as written.

> One other way to do it is to have 4-6 as CSRs as well, and underneath
> it indicates implicitly that the op gets redirected through the
> bitmanip HW pre and post op.
>
> However that means extra bits in the CSR and also it starts to get a
> bit... less-elegant if you know what I mean.
>
> Wavering here :)

Using macro-op fusion is advantageous here, since it also allows fusable
sequences to be executed "unfused" on implementations that want to avoid
that complexity; further CSRs are a poor fit for this, since one
partitioning is likely to remain correct for many instructions, but
pre-swap and post-swap are likely to vary for each step.

-- Jacob

Luke Kenneth Casson Leighton

May 21, 2018, 1:48:23 AM5/21/18

to jcb6...@gmail.com, Xan Phung, RISC-V ISA Dev, chuanhua.chang

Yes I'd agree with that.

>> One other way to do it is to have 4-6 as CSRs as well, and underneath it indicates implicitly that the op gets redirected through the bitmanip HW pre and post op.
>>
>> However that means extra bits in the CSR and also it starts to get a bit... less-elegant if you know what I mean.
>>
>> Wavering here :)
>
> Using macro-op fusion is advantageous here, since it also allows fusable sequences to be executed "unfused" on implementations that want to avoid that complexity; further CSRs are a poor fit for this, since one partitioning is likely to remain correct for many instructions, but pre-swap and post-swap are likely to vary for each step.

Would really ideally like to see a comprehensive case study of expected algorithms, use cases, just to be sure. Believe you... just being thorough.

Many apologies can't trim context, damn phone doesn't have way to highlight and cut text.

> -- Jacob


Message has been deleted

Luke Kenneth Casson Leighton

Jun 5, 2018, 6:02:39 PM6/5/18

to David Smith, RISC-V ISA Dev

On Thu, May 24, 2018 at 11:09 PM, David Smith <agen...@gmail.com> wrote:
> Hi!
>
> A P subset for audio would be greatly appreciated. :-) This has got great
> potential indeed. :-)

hi david,

just speaking with chuanhua: the P-Ext is based on Andes v3's SIMD
ISA, and part of that DSP's remit is audio. you can tell by looking
at the proposed P-Ext ISA: clip, saturation, these are features you
need for audio, to avoid having extra instruction cycles for clipping
audio.
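
A small Python illustration of why saturation matters for audio samples, assuming signed 16-bit PCM; the helper names are hypothetical and do not correspond to specific P-Ext instructions.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_add16(a: int, b: int) -> int:
    """Plain two's-complement 16-bit add: overflow wraps around."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


def sat_add16(a: int, b: int) -> int:
    """Saturating 16-bit add: overflow clamps to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, a + b))
```

Mixing two loud samples such as 30000 and 10000 wraps to a large negative value with the plain add, which is heard as a harsh click, whereas the saturating add clamps at 32767. Without a saturating instruction the clamp costs extra compare-and-select cycles per sample.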

l.
