P-extension charter (SIMD)


lkcl

Apr 6, 2018, 2:51:32 AM
to Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev
On Thu, Apr 5, 2018 at 8:46 AM, Richard Herveille <richard....@roalogic.com> wrote:
 

However we’re looking at starting a P-extension charter (SIMD).


Hi Richard, would you be happy to elaborate?  As you may have noticed i'm interested in leveraging RISC-V for 3D and Video processing, minimising the addition of special function blocks unless strictly necessary (Tile-based z-buffers for 3D, DCT for MPEG, that sort of thing).

Also, have you seen the "SIMD is Bad" exposition, which explains how the fixed width of SIMD tends to make a dog's dinner of things due to excessive corner-cases?

l.

Richard Herveille

Apr 6, 2018, 3:25:22 AM
to lkcl, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, Richard Herveille, chuanhua.chang

 

Chuanhua Chang, from Andes Technologies, is driving the effort. I CC-ed him on this email.

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions. In that sense a hardware loop instruction might be out of its scope.

 

Personal interest is for video (processing) and audio applications, which have very different needs, and both have odd bit-size requirements.

There are a few instructions from the abandoned B-extensions that might be useful. Bit-stuffing and bit-selection come to mind.

 

Richard

 

 


 

Richard Herveille

Managing Director

Phone +31 (45) 405 5681

Cell +31 (6) 5207 2230

richard....@roalogic.com

 

lkcl

Apr 6, 2018, 4:00:00 AM
to Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Fri, Apr 6, 2018 at 8:25 AM, Richard Herveille <richard.herveille@roalogic.com> wrote:

 

Chuanhua Chang, from Andes Technologies, is driving the effort. I CC-ed him on this email.


 Thanks Richard, hello Chuanhua.
 

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions.


So looking up the definition of packed-register instructions, that's basically MMX, right?  If so, i have a very direct question, based on this analysis which was co-written by Andrew Waterman [1] whose opinion is especially highly regarded in the RISC-V community.

The fundamental basis of RISC-V is to learn from and not repeat the mistakes of past ISAs.  Why would it be in the interests of RISC-V to replicate the mistake of polluting RISC-V ISA with several thousand instructions, all of which require heavy setup, teardown and corner-cases?

Was this a question that was put to the Technical Committee chair and vice-chair?  Were they aware of Andrew's analysis at the time of the call?


In that sense a hardware loop instruction might be out of its scope.


The SIMD/simple-V proposal [2] would make it an implementor's choice as to whether to implement a hardware loop in parallel, sequentially, or as a combination of both.  Broadcom refer to this as "virtual parallelism" [3].


 

Personal interest is for video (processing) and audio applications, which have very different needs, and both have odd bit-size requirements.


Interesting. ok, so the others i can think of include crypto, Tensors (AI), image processing, radar, traditional DSP work and so on.

 

There are a few instructions from the abandoned B-extensions that might be useful. Bit-stuffing and bit-selection come to mind.


Ok so the other lesson from the V-Extension proposal is that adding what in effect turns out to be an entirely new (independent) engine is an awful lot of work that makes it extremely difficult for interested parties to consider contributing.  Is this something that is intended to be taken into consideration?

l.


Andrew Waterman

Apr 6, 2018, 5:19:13 AM
to lkcl, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang


On Fri, Apr 6, 2018 at 12:59 AM, lkcl <luke.l...@gmail.com> wrote:



On Fri, Apr 6, 2018 at 8:25 AM, Richard Herveille <richard....@roalogic.com> wrote:

 

Chuanhua Chang, from Andes Technologies, is driving the effort. I CC-ed him on this email.


 Thanks Richard, hello Chuanhua.
 

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions.


So looking up the definition of packed-register instructions, that's basically MMX, right?  If so, i have a very direct question, based on this analysis which was co-written by Andrew Waterman [1] whose opinion is especially highly regarded in the RISC-V community.

The fundamental basis of RISC-V is to learn from and not repeat the mistakes of past ISAs.  Why would it be in the interests of RISC-V to replicate the mistake of polluting RISC-V ISA with several thousand instructions, all of which require heavy setup, teardown and corner-cases?

Was this a question that was put to the Technical Committee chair and vice-chair?  Were they aware of Andrew's analysis at the time of the call?

I can't officially speak for the Technical Committee, but I believe the thinking is as follows: (a) several RISC-V vendors view packed-SIMD as desirable because it can be added to existing processors at low cost (both in silicon and in design effort); (b) although vectors are vastly superior to packed-SIMD for general-purpose data parallelism, packed-SIMD is certainly sufficient for some applications; and (c) since people are going to do packed-SIMD no matter what we think, we'd like to minimize ecosystem fragmentation by standardizing how it is done.

My expectation is that the community will quickly coalesce around vectors for general-purpose processors and for high-throughput embedded processing.  Packed SIMD will play an important role in the most area-constrained (but not necessarily the most energy-efficient) embedded processors.

Hopefully that's nuanced enough to avoid starting a religious war...



In that sense a hardware loop instruction might be out of its scope.


The SIMD/simple-V proposal [2] would make it an implementor's choice as to whether to implement a hardware loop in parallel, sequentially, or as a combination of both.  Broadcom refer to this as "virtual parallelism" [3].


 

Personal interest is for video (processing) and audio applications, which have very different needs, and both have odd bit-size requirements.


Interesting. ok, so the others i can think of include crypto, Tensors (AI), image processing, radar, traditional DSP work and so on.

 

There are a few instructions from the abandoned B-extensions that might be useful. Bit-stuffing and bit-selection come to mind.


Ok so the other lesson from the V-Extension proposal is that adding what in effect turns out to be an entirely new (independent) engine is an awful lot of work that makes it extremely difficult for interested parties to consider contributing.  Is this something that is intended to be taken into consideration?

l.




 

 

Richard

 

 


 

Richard Herveille

Managing Director

Phone +31 (45) 405 5681

Cell +31 (6) 5207 2230

richard....@roalogic.com

 

 

On 06/04/2018, 08:51, "lkcl" <luke.l...@gmail.com> wrote:

 

On Thu, Apr 5, 2018 at 8:46 AM, Richard Herveille <richard....@roalogic.com> wrote:

 

However we’re looking at starting a P-extension charter (SIMD).

 

Hi Richard, would you be happy to elaborate?  As you may have noticed i'm interested in leveraging RISC-V for 3D and Video processing, minimising the addition of special function blocks unless strictly necessary (Tile-based z-buffers for 3D, DCT for MPEG, that sort of thing).

 

Also, have you seen the "SIMD is Bad" exposition, which explains how the fixed width of SIMD tends to make a dog's dinner of things due to excessive corner-cases?

 

l.



Richard Herveille

Apr 6, 2018, 7:31:57 AM
to lkcl, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang, Richard Herveille

 

On 06/04/2018, 10:00, "lkcl" <luke.l...@gmail.com> wrote:

 

 

 

On Fri, Apr 6, 2018 at 8:25 AM, Richard Herveille <richard....@roalogic.com> wrote:

 

Chuanhua Chang, from Andes Technologies, is driving the effort. I CC-ed him on this email.

 

 Thanks Richard, hello Chuanhua.

 

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions.

 

So looking up the definition of packed-register instructions, that's basically MMX, right?  If so, i have a very direct question, based on this analysis which was co-written by Andrew Waterman [1] whose opinion is especially highly regarded in the RISC-V community.

 

The fundamental basis of RISC-V is to learn from and not repeat the mistakes of past ISAs.  Why would it be in the interests of RISC-V to replicate the mistake of polluting RISC-V ISA with several thousand instructions, all of which require heavy setup, teardown and corner-cases?

 

Was this a question that was put to the Technical Committee chair and vice-chair?  Were they aware of Andrew's analysis at the time of the call?

 

I think you’re jumping to conclusions or at least making premature assumptions. Your points are well taken and valid. However we will not be adding thousands of instructions, at least I hope not! MMX is an implementation of packed-register instructions, specifically one that (re)uses the floating point registers for SIMD. That doesn’t mean it is the only possible implementation.

Andes proposed their extensions https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo which uses the integer register file.

The current P-extension placeholder (https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex) suggests it should target the floating point registers.

 

My personal preference is to use the integer register file instead of the floating point one, as that would also include designs without a floating point unit. The SIMD extensions should be simple; if you want to run high performance 3D video analysis, AI, or machine learning, then the V-extension is what you’re looking for.

However I would also like to see odd bit sizes, e.g. 10, 12, 24 (video/audio). But again, my preference. Nothing has been discussed, let alone decided yet.

 

 

In that sense a hardware loop instruction might be out of its scope.

 

The SIMD/simple-V proposal [2] would make it an implementor's choice as to whether to implement a hardware loop in parallel, sequentially, or as a combination of both.  Broadcom refer to this as "virtual parallelism" [3].

 

It shouldn’t be an option; it’s either there or it isn’t. You can’t have parts of an extension be implemented and others not. Making it optional is a compiler’s nightmare.

My personal preference would be to add a hardware loop, but again … nothing has been discussed yet.

 

 

Personal interest is for video (processing) and audio applications, which have very different needs, and both have odd bit-size requirements.

 

Interesting. ok, so the others i can think of include crypto, Tensors (AI), image processing, radar, traditional DSP work and so on.

 

Yes. Although traditional DSP would also mean parallel load/store (load/store in parallel with an arithmetic operation), bit-reversed addressing, …

Nothing has been discussed yet, but that might be out of the scope of a general purpose CPU like RISC-V. But maybe it isn’t …
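
(For readers unfamiliar with bit-reversed addressing: it is the index permutation an FFT needs between natural order and butterfly order, which DSPs compute in hardware as an addressing mode. A minimal C sketch of the permutation itself; the function name is illustrative, not from any proposal:)

    #include <stdint.h>

    /* Reverse the low 'bits' bits of index i -- the permutation that a
     * DSP's bit-reversed addressing mode applies in hardware when
     * stepping through an N = 2^bits point FFT's data. */
    static uint32_t bit_reverse(uint32_t i, unsigned bits)
    {
        uint32_t r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1);
            i >>= 1;
        }
        return r;
    }

    /* e.g. for an 8-point FFT (bits = 3): 1 -> 4, 3 -> 6, 6 -> 3 */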

 

 

There are a few instructions from the abandoned B-extensions that might be useful. Bit-stuffing and bit-selection come to mind.

 

Ok so the other lesson from the V-Extension proposal is that adding what in effect turns out to be an entirely new (independent) engine is an awful lot of work that makes it extremely difficult for interested parties to consider contributing.  Is this something that is intended to be taken into consideration?

 

Personal preference again … it definitely should. The V-Extension sounds like a separate out-of-order engine in parallel to the CPU (and it probably is). My preference for the P-extension is that these instructions are part of the regular instruction flow, very much like the M-extensions.

 

 

Cheers,

Richard

 

 

 

lkcl

Apr 6, 2018, 8:33:37 AM
to Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
(hi andrew thanks for responding, just picking up on richard's reply first)

On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille <richard....@roalogic.com> wrote:

 

 

The SIMD/simple-V proposal [2] would make it an implementor's choice as to whether to implement a hardware loop in parallel, sequentially, or as a combination of both.  Broadcom refer to this as "virtual parallelism" [3].

 

It shouldn’t be an option; it’s either there or it isn’t. You can’t have parts of an extension be implemented and others not. Making it optional is a compiler’s nightmare.


Nono, sorry, you misunderstand: the option as to whether the implementor would make the implementation a hardware loop, a parallel block, or a combination of both, is proposed to be *completely transparent* as far as the compiler is concerned.

The compiler would literally issue, plain-and-simple:

<instruction to set vector width N and bit width M>
<instruction RV32/RV64 which now implicitly means r0...r(X-1) M-bit packed-registers>

where X = (64/M) * N in the RV64-bit case and
X = (32/M) * N in the RV32 case.

and "under the hood", the implementation would go "hmm i only have a 2-wide parallel ALU for that particular operation, let me just break down those X operations into 2-wide hardware-parallel operations that take X/2 clock cycles to complete"

and - and this is *exactly* what Broadcom actually did in VideoCore-IV (again, totally seamless and transparent to the compiler):

"... however *this* particular operation is a bit uncommon, there's only a *ONE*-wide ALU for this particular operation (reciprocal-square-root), so we'll just execute that as X sequential operations that take X clock cycles to complete".
 

My personal preference would be to add a hardware loop, but again … nothing has been discussed yet.

 


Imposing the requirement to duplicate an entire ALU seems to me to have "red flags" written on it, because area-conscious implementors will freak out.

Also imposing the requirement to have a hardware loop where the ALU is only one-wide *also* has "red flags" written on it, because performance-conscious implementors will freak out.

To satisfy both it is just... logical to *TRANSPARENTLY* permit implementors to do both.  or either.  and whilst you're at it, also permit the implementors to apply that strategy *on a per-operation basis*.

Then they get to choose precisely where the best power-performance ratio is for a given application.



 

Personal interest is for video (processing) and audio applications, which have very different needs, and both have odd bit-size requirements.

 

Interesting. ok, so the others i can think of include crypto, Tensors (AI), image processing, radar, traditional DSP work and so on.

 

Yes. Although traditional DSP would mean parallel load/store (load/store in parallel with arithmetic operation), bit reverse addressing, …

Nothing has been discussed yet, but that might be out of the scope of a general purpose CPU like RISC-V. But maybe it isn’t …


In this (thanks for the link, Richard) https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo there's a document (hello Chuanhua); going over it, it looks to be 95% DSP and 5% P-SIMD.  i'll reply with comments in that thread, if that's ok?

So the summary of that reply will be "some implementors will want the DSP aspects, some definitely won't: if the DSP aspects are *mandatory* as part of P-SIMD in an all-or-nothing fashion that's not going to fly".

Basically i'm continuing to make the case for a split "SIMD=Simple-V that leverages/extends all other RV Extensions including Base", such that the P-SIMD proposal you made, Chuanhua, becomes a "DSP-Extension".

I do like the zero-overhead loop idea, Chuanhua: I would advocate it stay and be part of P-SIMD/Simple-V.

 

 

 

There are a few instructions from the abandoned B-extensions that might be useful. Bit-stuffing and bit-selection come to mind.

 

Ok so the other lesson from the V-Extension proposal is that adding what in effect turns out to be an entirely new (independent) engine is an awful lot of work that makes it extremely difficult for interested parties to consider contributing.  Is this something that is intended to be taken into consideration?

 

Personal preference again … it definitely should. The V-Extension sounds like a separate out-of-order engine in parallel to the CPU (and it probably is).


yehyeh.
 

My preference for P-extension is that these instructions are part of the regular instruction flow, very much like the M-extensions.


 Having looked at both P and V as they stand, they're _both_ very much "separate engines" that, despite both their respective merits and extremely powerful features, don't really cleanly fit into the RV design ethos (or the flexible extensibility) and, as such, are both in danger of not being widely adopted.  I'm inclined towards recommending:

 * splitting out the DSP aspects of P-SIMD to create a single-issue DSP
 * splitting out the polymorphism, esoteric data types (GF, complex numbers) and unusual operations of V to create a single-issue "Esoteric Floating-Point" extension
 * splitting out the loop-aspects, vector aspects and data-width aspects of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they apply across *all* Extensions, whether those be DSP, M, Base, V, P - everything.

 In this way much of the duplication of instructions in both P-SIMD and V can be removed, and they are also useful stand-alone... you get the general idea.

l.

lkcl

Apr 6, 2018, 10:54:52 AM
to Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille <richard....@roalogic.com> wrote:

I think you’re jumping to conclusions or at least making premature assumptions. Your points are well taken and valid. However we will not be adding thousands of instructions, at least I hope not! MMX is an implementation of packed-register instructions, specifically one that (re)uses the floating point registers for SIMD. That doesn’t mean it is the only possible implementation.

Andes proposed their extensions https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo which uses the integer register file.


Super, thank you for that link, Richard.
 

The current P-extension placeholder (https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex) suggests it should target the floating point registers.


That would be bad [see below]
 

 

My personal preference is to use the integer register file instead of the floating point one, as that would also include designs without a floating point unit.


Agreed.  Not just that: the entire DSP instructions (Chuanhua, as i mention in [1], the AndesDSP3 looks to me more like an integer-based DSP ISA; it only very *very* briefly mentions anything about SIMD) are, as a DSP, integer-based.  If the FP registers are used, then implementations which make extensive use of FENCE to do simultaneous floating-point and integer vector work (think 3D where you want to do F.P. triangles followed by conversion to RGB8888) and would like to do that as an Out-of-Order Execution Engine... they could end up being screwed, couldn't they?

Not least, the compiler's job is made harder, but also, you'd have to pick-and-place the operands from FP registers into integer registers *WITHOUT* doing type-conversion, and that would likely mean a LOAD-STORE... it gets messy.  No, logic definitely leans towards leaving DSP in the integer registers.

 

The SIMD extensions should be simple; if you want to run high performance 3D video analysis, AI, or machine learning, then the V-extension is what you’re looking for.


An additional example which requires both: 3D is a weird mixture of 32-bit and 16-bit floating point *and* specialist 8 and 16-bit integer operations.  NVidia even have some weird 20-bit FPU intermediary format.
 

However I would also like to see odd bit sizes, e.g. 10, 12, 24 (video/audio). But again, my preference. Nothing has been discussed, let alone decided yet.


yyeah, YUV-to-RGB is *definitely* something that needs to be recognised / taken into consideration.  It would be a pity to have to write a special hard macro just to do YUV-to-RGB conversion.  buuut... it's been done, y'ken.  The AMD Geode-LX (formerly National Semi Geode) was, incredibly, capable of 720p30 video playback... as a *500MHz 486*!  the reason why?  a hardware YUV-to-RGB engine that covered the last phase.  It was amazing.  only a slight tearing when watching a 720p video on a 1280x1024 monitor.

Answering directly: odd-sized bits i feel (without actually doing the analysis) should at least fit into that BGS/BEXT B-extension proposal, right? (discuss on [2] or separate thread?)
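
(For concreteness, a minimal scalar C sketch of one pixel of BT.601-style YUV-to-RGB in fixed point, using the usual integer coefficients rounded to 8 fractional bits; a packed-SIMD unit would evaluate several pixels of exactly this arithmetic per instruction:)

    #include <stdint.h>

    static uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

    /* One pixel of BT.601 YCbCr -> RGB, coefficients scaled by 256. */
    void yuv_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                    uint8_t *r, uint8_t *g, uint8_t *b)
    {
        int c = y - 16, d = cb - 128, e = cr - 128;
        *r = clamp8((298 * c + 409 * e + 128) >> 8);
        *g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
        *b = clamp8((298 * c + 516 * d + 128) >> 8);
    }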

anyway: i'm putting a page together on the SIMD/Simple-V proposal (one or two instructions *only*) as there's quite a lot to track [3]

l.

[3] SIMD/Simple-V parallelism extension http://libre-riscv.org/simple_v_extension/

lkcl

Apr 6, 2018, 3:29:52 PM
to Andrew Waterman, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Fri, Apr 6, 2018 at 10:18 AM, Andrew Waterman <and...@sifive.com> wrote:


On Fri, Apr 6, 2018 at 12:59 AM, lkcl <luke.l...@gmail.com> wrote:


On Fri, Apr 6, 2018 at 8:25 AM, Richard Herveille <richard....@roalogic.com> wrote: 
 

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions.


The fundamental basis of RISC-V is to learn from and not repeat the mistakes of past ISAs.  Why would it be in the interests of RISC-V to replicate the mistake of polluting RISC-V ISA with several thousand instructions, all of which require heavy setup, teardown and corner-cases?

Was this a question that was put to the Technical Committee chair and vice-chair?  Were they aware of Andrew's analysis at the time of the call?
I can't officially speak for the Technical Committee, but I believe the thinking is as follows: (a) several RISC-V vendors view packed-SIMD as desirable because it can be added to existing processors at low cost (both in silicon and in design effort);

... with the resultant huuuge proliferation that we all know and love in both Intel (god knows how many instructions, but the transistor count on modern x86s hits the BILLION mark) and ARC (several THOUSAND for their video SIMD extensions) and many more...

... all of which is anathema to the entire principle on which RISC-V was founded.  is it possible to beat these aberrant RISC-V vendors about the head with corporately-suitably-sized rubber ducks until they get the general message that just because it's always been done that way doesn't mean it has to be done that way *in RISC-V*?  we're supposed to be *learning* from the mistakes of the past, not *repeating* them!
 
(b) although vectors are vastly superior to packed-SIMD for general-purpose data parallelism, packed-SIMD is certainly sufficient for some applications;

i would agree with that sentiment, very much.  However.... looking *at* the proposal that Chuanhua put forward, 95% of it is actually DSP *not* actual SIMD.  Which tends to support the thought, "well, if you're going to implement packed-SIMD, why not split out the SIMD bit from the DSP bit?"
 
and (c) since people are going to do packed-SIMD no matter what we think, we'd like to minimize ecosystem fragmentation by standardizing how it is done.


Whilst I appreciate that it's a "realistic" perspective, I cannot help but view it as "caving in to external pressure".  You've done absolutely amazing things, attracted huge numbers of extremely skilled people round the RISC-V banner, which is based on certain guiding design principles that everyone I've encountered on here I can pretty much say is in awe of, and whole-heartedly supports.

Can I therefore recommend standing up to those vendors and say, politely and diplomatically (which counts me out for sure), "scuse me, we don't do it that way in RISC-V if there's any choice in the matter"?  Unless anyone else would like to step forward I will be more than happy to take responsibility for coordinating an effort to create a "parallelism extension" that would be suitable for both P and V, augmenting both to provide *both* the "V" in vector *and* the "packed" in P-SIMD, and have begun writing that up here:

 
My expectation is that the community will quickly coalesce around vectors for general-purpose processors and for high-throughput embedded processing. 

 I don't know if you saw my analysis of the [draft 2.3] V-extension, but my feeling is that it's simply too much.  The fact that there's only one implementor (Hwacha), and that ended in 2017 *and hasn't published the RTL*, should speak volumes.  By splitting out the actual parallelism, many implementors I think would feel *much* more comfortable with a *general-purpose* parallelism that applies *across all extensions*, whether they be G, M, P-without-the-parallelism or V-without-the-parallelism.

 I realise that this would effectively turn P into a "DSP extension" and V into a "Floating-point esoteric types such as Half-precision IEEE754 F.P., Galois-Field and complex numbers" extension, both of which would actually be interesting and useful extensions *in their own right even without any kind of parallelism*.

Packed SIMD will play an important role in the most area-constrained (but not necessarily the most energy-efficient) embedded processors.


Yes I was particularly impressed with the zero-overhead loop mechanism proposed by Chuanhua, and noted (in the simple_v_extension page link above) that it is a really, really easy way to potentially keep a DSP pipeline 100% occupied even in a simple in-order non-super-scalar architecture where it really matters most: the inner loop of an algorithm.  Really damn good idea that, which I've never seen in any architecture before (and over the past 30 years I've studied a *lot* of different architectures).

My point is: I hear what you're saying, that Packed SIMD is viewed as being important (because everyone does it that way... mostly by blindly following Intel *sigh*): my point is, if you *actually* go back to the *requirements* - to first principles which from the SIMD article that you wrote i *know* you Grok - what people tend to put into Packed SIMD is actually *two* things: (a) specialist small-length integer operations and (b) parallelism.

*Not* splitting those out into two actually loses a golden opportunity to have the features of a DSP in RISC-V *without* needing the parallelism (where it isn't needed, in even *more* resource-constrained scenarios than plain Packed-SIMD), whilst at the same time losing the golden opportunity of introducing parallelism across *all* of RISC-V's extensions (present and future, including V), *and* losing the opportunity to remove redundant instructions present in the existing P (and V) extensions which you *have* to put there because the base RV quite legitimately cannot cater for them.  Oh, and the opportunity to add in general support for 8 and 16-bit integer types and 16-bit floating point into RISC-V in a transparent, backwards-compatible way is *also* lost...

I'm running out of words but i know that there are more opportunities lost... oh yes: a legitimate reason to revive the B-Extension WG being one of them (and regain some of the kudos / face lost by the RISC-V Foundation when they slammed the door on the community, there).

Bottom line: going ahead with Packed SIMD (and V) as they stand (no disrespect to Chuanhua and the AndesCore 3 *at all* here, or the people who developed V) - caving in to pressure from the proprietary vendors (none of whom seem to be present on these lists in any way, by the way, with the exception of Andes Technology) - is at the very least an all-round 5-hammer-blow lose-lose-lose-lose-lose whammy.

So can I ask you if you would be willing to put that to the Technical Committee: that this, no doubt about it, is a "decision crunch point" which basically makes-or-breaks RISC-V's future and viability.  They either take the vendors by the horns (or the ring through their noses, if they have one) and diplomatically steer them in the right direction, or they let the *vendors* - by default, by way of custom extensions or just forking - steer RISC-V into yet another instruction-obliterated, fragmented quagmire that absolutely nobody wants to support, and that certainly nobody wants to teach in Universities.

I'm certain that the (independent) RISC-V community will support you with the former, whole-heartedly, because we like - no *love* - the idea of being able to stick it to ARM and Intel.  err I mean, we love the idea of being part of something new and exciting that learns from the mistakes of the past, yes that's much more diplomatic market-speak.


Hopefully that's nuanced enough to avoid starting a religious war...

 :)

Andrew Waterman

Apr 6, 2018, 3:50:24 PM
to lkcl, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Fri, Apr 6, 2018 at 12:29 PM, lkcl <luke.l...@gmail.com> wrote:


On Fri, Apr 6, 2018 at 10:18 AM, Andrew Waterman <and...@sifive.com> wrote:


On Fri, Apr 6, 2018 at 12:59 AM, lkcl <luke.l...@gmail.com> wrote:


On Fri, Apr 6, 2018 at 8:25 AM, Richard Herveille <richard....@roalogic.com> wrote: 
 

I had a call with the Technical Committee chair and vice-chair and they made it very clear that the P-extensions were only about packed-register instructions.


The fundamental basis of RISC-V is to learn from and not repeat the mistakes of past ISAs.  Why would it be in the interests of RISC-V to replicate the mistake of polluting RISC-V ISA with several thousand instructions, all of which require heavy setup, teardown and corner-cases?

Was this a question that was put to the Technical Committee chair and vice-chair?  Were they aware of Andrew's analysis at the time of the call?

I can't officially speak for the Technical Committee, but I believe the thinking is as follows: (a) several RISC-V vendors view packed-SIMD as desirable because it can be added to existing processors at low cost (both in silicon and in design effort);

... with the resultant huuuge proliferation that we all know and love in both Intel (god knows how many instructions, but the transistor count on modern x86s hits the BILLION mark) and ARC (several THOUSAND for their video SIMD extensions) and many more...

... all of which is anathema to the entire principle on which RISC-V was founded.  is it possible to beat these aberrant RISC-V vendors about the head with corporately-suitably-sized rubber ducks until they get the general message that just because it's always been done that way doesn't mean it has to be done that way *in RISC-V*?  we're supposed to be *learning* from the mistakes of the past, not *repeating* them!
 
(b) although vectors are vastly superior to packed-SIMD for general-purpose data parallelism, packed-SIMD is certainly sufficient for some applications;

i would agree with that sentiment, very much.  However.... looking *at* the proposal that Chuanhua put forward, 95% of it is actually DSP *not* actual SIMD.  Which tends to support the thought, "well, if you're going to implement packed-SIMD, why not split out the SIMD bit from the DSP bit?"
 
and (c) since people are going to do packed-SIMD no matter what we think, we'd like to minimize ecosystem fragmentation by standardizing how it is done.


Whilst I appreciate that it's a "realistic" perspective, I cannot help but view it as "caving in to external pressure".  You've done absolutely amazing things, attracted huge numbers of extremely skilled people round the RISC-V banner, which is based on certain guiding design principles that everyone I've encountered on here I can pretty much say is in awe of, and whole-heartedly supports.

Can I therefore recommend standing up to those vendors and say, politely and diplomatically (which counts me out for sure), "scuse me, we don't do it that way in RISC-V if there's any choice in the matter"?  Unless anyone else would like to step forward I will be more than happy to take responsibility for coordinating an effort to create a "parallelism extension" that would be suitable for both P and V, augmenting both to provide *both* the "V" in vector *and* the "packed" in P-SIMD, and have begun writing that up here:

 
My expectation is that the community will quickly coalesce around vectors for general-purpose processors and for high-throughput embedded processing. 

 I don't know if you saw my analysis of the [draft 2.3] V-extension, but my feeling is that it's simply too much.  The fact that there's only one implementor (Hwacha), and that ended in 2017 *and hasn't published the RTL*, should speak volumes.  By splitting out the actual parallelism, many implementors I think would feel *much* more comfortable with a *general-purpose* parallelism that applies *across all extensions*, whether they be G, M, P-without-the-parallelism or V-without-the-parallelism.

Hwacha doesn't implement the V extension, and in fact the Hwacha ISA is a totally different ISA style from the V extension.  So that's just a distraction.

The reason there are no implementations of the V extension is that the ISA is still being designed.

lk...@lkcl.net

Apr 6, 2018, 4:21:01 PM
to RISC-V ISA Dev, luke.l...@gmail.com, richard....@roalogic.com, glem...@vectorblox.com, jcb6...@gmail.com, ces...@cesarb.eti.br, chuanhu...@gmail.com


On Friday, April 6, 2018 at 8:50:24 PM UTC+1, andrew wrote:

Hwacha doesn't implement the V extension, and in fact the Hwacha ISA is a totally different ISA style from the V extension.  So that's just a distraction.

Ah!  Ok, thanks for clarifying.  much to learn.
 

The reason there are no implementations of the V extension is that the ISA is still being designed.


ok.  understood.  How's that going, and are drafts published regularly so that people in the wider community can participate, review, and contribute?  i am primarily taking the best aspects of P and V and distilling them.  the last available draft of V is V2.3, and it would be very unfortunate to be proposing Simple-V (variable-width SIMD) against a background where the people most likely to look at it would treat Simple-V as "hopelessly out-of-date" because it precisely mirrors the best aspects of an *out-of-date* (V2.3) V Draft Specification.

l.

Andrew Waterman

Apr 6, 2018, 5:12:30 PM
to lk...@lkcl.net, RISC-V ISA Dev, lkcl, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
The V extension task group is drafting the specs in the public riscv-isa-manual repo, but we're perpetually in a state of documentation debt.

As a general observation, our ISA-design proposals have always benefited significantly from building prototypes and porting software.  So once the V-extension draft proposal settles down, there will be plenty of time to play around with, learn from, and improve the design before attempting to ratify it.




Luke Kenneth Casson Leighton

Apr 6, 2018, 6:49:44 PM
to Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
[btw really appreciate you taking the time to reply, andrew. cutting
things back drastically]

On Fri, Apr 6, 2018 at 10:12 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

>> ok. understood. How's that going, and are drafts published regularly so
>> that people in the wider community can participate, review, and contribute?

> The V extension task group is drafting the specs in the public
> riscv-isa-manual repo, but we're perpetually in a state of documentation
> debt.

Yaa, i hear you [libre software engineer].

> As a general observation, our ISA-design proposals have always benefited
> significantly from building prototypes and porting software.

Learning is doing. I understand.

> So once the
> V-extension draft proposal settles down, there will be plenty of time to
> play around with, learn from, and improve the design before attempting to
> ratify it.

Ok, so I have one question i'd like to ask which illustrates one
really really key strategic aspect of the Simple-V / Parallelism /
"variable-width SIMD" Extension proposal.

Excluding V: if every single other register-related instruction in
every single Extension (and that covers type-conversions from
int-to-float and vice-versa, integer operations, floating-point,
load/store, *everything*), suddenly could be parallelised i.e. had an
implicit "vector length" in a CSR which automatically and implicitly
turned *every* register-based operation into a variable-length
parallel one...

... under such hypothetical circumstances how much of the V Extension
- right now - would be duplicated / redundant opcodes?

l.

p.s. I did a similar analysis on P-Ext, where in combination with the
"V-Length" CSR there would also be a "Packing-Width" CSR, and as a
combined result of two such hypothetical instructions/CSRs as part of
a hypothetical Parallelism / variable-width-SIMD extension, about *5
percent* of P-Ext could be dropped entirely (as duplicates).  A
further (approx.) 10% could be considered redundant if moved to a
re-opened B-Ext.

Andrew Waterman

Apr 6, 2018, 7:18:11 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
If you look at the V proposal in the repo, you'll see the majority of the instructions are named after scalar instructions but with a 'V' in front.  These match your description.

However, there are also several features that go beyond simply attaching VL to a scalar operation and are crucial to being able to vectorize a lot of code.  To name a few:
- Conditional execution (i.e., predicated operations)
- Inter-lane data movement (e.g. SLIDE, SELECT)
- Reductions (e.g., VADD with a scalar destination)
- Vector-length speculation (making it possible to vectorize some loops with unknown trip count) - I don't think this part of the proposal is written down yet.

Also, note the vector ISA consumes relatively little opcode space (all the arithmetic fits in 7/8ths of a major opcode).  This is mainly because data type and size are a function of runtime configuration, rather than of opcode.
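
(To make that runtime-configuration model concrete, here is a hedged C rendering of the classic stripmined loop; set_vl and vadd_vv are illustrative stand-ins for the vector unit's interface, not the draft's actual mnemonics:)

    #include <stddef.h>

    /* Illustrative stand-ins: set_vl requests up to n elements and
     * returns the count the hardware grants; vadd_vv performs vl
     * element-wise additions under the current configuration. */
    extern size_t set_vl(size_t n);
    extern void vadd_vv(double *dst, const double *a, const double *b, size_t vl);

    /* Stripmined c[i] = a[i] + b[i]: the trip count need not be known
     * at compile time, and the same binary runs unchanged on any
     * hardware vector length. */
    void vec_add(double *c, const double *a, const double *b, size_t n)
    {
        while (n > 0) {
            size_t vl = set_vl(n);            /* hardware picks vl <= n */
            vadd_vv(c, a, b, vl);
            a += vl; b += vl; c += vl; n -= vl;
        }
    }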

Jacob Bachmeyer

Apr 6, 2018, 11:23:03 PM
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> (hi andrew thanks for responding, just picking up on richard's reply
> first)
>
> On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille
> <richard....@roalogic.com
This would seem to require adding prefix instructions that indicate
"next instruction is a vector operation" with given parameters. I
really do not like that idea if it can possibly be avoided.

> My personal preference would be to add a hardware loop, but again
> … nothing has been discussed yet.
>
>
> Imposing the requirement to duplicate an entire ALU seems to me to
> have "red flags" written on it, because area-conscious implementors
> will freak out.

Vector units have a tendency to dwarf their companion scalar units. I
expect that area-constrained implementations will not support RVV.

> Also imposing the requirement to have a hardware loop where the ALU is
> only one-wide *also* has "red flags" written on it, because
> performance-conscious implementors will freak out.
>
> To satisfy both it is just... logical to *TRANSPARENTLY* permit
> implementors to do both. or either. and whilst you're at it, also
> permit the implementors to apply that strategy *on a per-operation basis*.
>
> Then they get to choose precisely where the best power-performance
> ratio is for a given application.

I suspect that the "hardware loop" in question is actually a
zero-overhead loop unit that diverts execution from address X to address
Y if a certain condition is met. Yes, it implements COME FROM. For
small inner loops in assembler, that is not completely insane. (The
inevitable abuse of that facility is insane but also a different matter.)

[...]

>
> My preference for P-extension is that these instructions are part
> of the regular instruction flow, very much like the M-extensions.
>
>
> Having looked at both P and V as they stand, they're _both_ very much
> "separate engines" that, despite both their respective merits and
> extremely powerful features, don't really cleanly fit into the RV
> design ethos (or the flexible extensibility) and, as such, are both in
> danger of not being widely adopted. I'm inclined towards recommending:
>
> * splitting out the DSP aspects of P-SIMD to create a single-issue DSP

I suggested a similar use of that contribution.

> * splitting out the polymorphism, esoteric data types (GF, complex
> numbers) and unusual operations of V to create a single-issue
> "Esoteric Floating-Point" extension

The polymorphism in RVV is critical for keeping the number of new
instructions down; the lack of such is why there are so many
MMX/SSE/SSE2/... instructions that are "the same thing" with different
data types. The esoteric data types are possible future extensions. (I
would like to add "bignum" to that list.)

> * splitting out the loop-aspects, vector aspects and data-width
> aspects of both P and V to a *new* "P-SIMD / Simple-V" and requiring
> that they apply across *all* Extensions, whether those be DSP, M,
> Base, V, P - everything.
>
> In this way much of the duplication of instructions in both P-SIMD
> and V can be removed, they are also useful stand-alone... you get the
> general idea.

The main problem that I see with this latter goal is the fairly tight
association of RVV to an actual vector unit; these aspects are
inseparable from the vector unit's internal parallelism.


-- Jacob

Jacob Bachmeyer

Apr 6, 2018, 11:37:28 PM
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille
> <richard....@roalogic.com
> <mailto:richard....@roalogic.com>> wrote:
>
> I think you’re jumping to conclusions or at least making premature
> assumptions. Your points are well taken and valid. However we will
> not be adding thousands of instructions, at least I hope not! MMX
> is an implementation of packed-register instructions, specifically
> one that (re)uses the floating point registers for SIMD. That
> doesn’t mean it is the only possible implementation.
>
> Andes proposed their extensions
> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo
> <https://groups.google.com/a/groups.riscv.org/forum/#%21topic/isa-dev/vYVi95gF2Mo>
> which uses the integer register file.
>
>
> Super, thank you for that link, Richard.
>
>
> The current P-extension placeholder
> (https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex
> <https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex>)
> suggests it should target the floating point registers.
>
>
> That would be bad [see below]

The main argument that I see for using the FP register file is that the
FP registers can be expanded easily to support RVP without the
awkwardness that would result from similarly expanding the integer
register file or hardware complexity that would result from using
contiguous groups of integer registers. (Holy regfile port
proliferation, Batman!)

> My personal preference is to use the integer register file instead
> of the floating point one, as that would also include designs
> without a floating point unit.
>
>
> Agreed. Not just that: the entire DSP instructions (Chuanhua, as i
> mention in [1] the AndesDSP3 looks to me more like an integer-based
> DSP ISA. it's very *very* briefly mentioned anything about SIMD) are,
> as a DSP, integer-based. If the FP registers are used, then
> implementations which make extensive use of FENCE to do simultaneous
> floating-point and integer vector work (think 3D where you want to do
> F.P. triangles followed by conversion to RGB8888) and would like to do
> that as an Out-of-Order Execution Engine... they could end up being
> screwed, couldn't they?
>
> Not least, the compiler's job is made harder, but also, you'd have to
> pick-and-place the operands from FP registers into integer registers
> *WITHOUT* doing type-conversion, and that would likely mean a
> LOAD-STORE... it gets messy. No, logic definitely leans towards
> leaving DSP in the integer registers.

RISC-V has the FMV.*.* instructions for transferring bit patterns
between the integer and FP register files. Of course, these do not work
so well for RVP, as they could only access the first packed element(s).
RVP could easily add additional element-get/element-set instructions to
handle this.

Of course, oops, the current RVV draft uses the encodings that were
reserved for RVP LOAD-FP/STORE-FP.


-- Jacob

Christopher Celio

Apr 7, 2018, 12:26:35 AM
to Jacob Bachmeyer, lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
Vector units have a tendency to dwarf their companion scalar units.  I expect that area-constrained implementations will not support RVV.

They don't have to. In fact, the vector unit can re-use the existing scalar functional units (and execute temporally as needed). The only real cost to RVV is the extra register file, which can be stored very densely in SRAM.

-Chris 

On Fri, Apr 6, 2018 at 8:22 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
lkcl wrote:
(hi andrew thanks for responding, just picking up on richard's reply first)


Richard Herveille

Apr 7, 2018, 1:53:38 AM
to lkcl, Andrew Waterman, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang



On 6 Apr 2018, at 21:29, lkcl <luke.l...@gmail.com> wrote:


snip

Packed SIMD will play an important role in the most area-constrained (but not necessarily the most energy-efficient) embedded processors.


Yes I was particularly impressed with the zero-overhead loop mechanism proposed by Chuanhua, and noted (in the simple_v_extension page link above) that it is a really, really easy way to potentially keep a DSP pipeline 100% occupied even in a simple in-order non-super-scalar architecture where it really matters most: the inner loop of an algorithm.  Really damn good idea that, which I've never seen in any architecture before (and over the past 30 years I've studied a *lot* of different architectures).


Actually, zero-overhead loops have been part of DSPs since day one.
ADI DSPs typically had start/end pointers and a loop-counter. TI DSPs only had the loop-counter, which would then repeat the next instruction N times.
In a DSP you can specify a complete filter or FFT sequence in a single instruction, given that they have dedicated address registers which support auto-increment/decrement and bit-reversed addressing.
RISC-V is not a DSP architecture, so it would need more instructions for the same sequence, which is why it needs begin/end pointers and the loop counter.

Fun part is, this would not only benefit DSP sequences, but many generic for-loop sequences too. 
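
(As a rough C model of what such a loop unit does - the names are mine, purely illustrative. The ADI start/end style is shown; the TI "repeat next instruction" style is the degenerate case where start_pc == end_pc:)

    /* Zero-overhead loop unit: start/end PCs plus an iteration counter. */
    typedef struct {
        unsigned start_pc;  /* first instruction of the loop body */
        unsigned end_pc;    /* last instruction of the loop body  */
        unsigned count;     /* remaining iterations               */
    } zol_unit;

    /* Consulted every cycle with the retiring PC; returns the next PC.
     * No branch instruction is fetched or executed: zero overhead. */
    unsigned zol_next_pc(zol_unit *z, unsigned pc)
    {
        if (z->count > 0 && pc == z->end_pc && --z->count > 0)
            return z->start_pc;   /* wrap back to the loop top */
        return pc + 4;            /* fall through */
    }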

Richard 

Richard Herveille

Apr 7, 2018, 1:59:28 AM
to Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang


Which is a great idea that I really like. I can only see advantages in using a similar approach for SIMD.

Richard 

Richard Herveille

Apr 7, 2018, 2:04:35 AM
to jcb6...@gmail.com, lkcl, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang



> On 7 Apr 2018, at 05:37, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> lkcl wrote:
>> On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille <richard....@roalogic.com <mailto:richard....@roalogic.com>> wrote:
>>
>> I think you’re jumping to conclusions or at least making premature
>> assumptions. Your points are well taken and valid. However we will
>> not be adding thousands of instructions, at least I hope not! MMX
>> is an implementation of packed-register instructions, specifically
>> one that (re)uses the floating point registers for SIMD. That
>> doesn’t mean it is the only possible implementation.
>>
>> Andes proposed their extensions
>> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo
>> <https://groups.google.com/a/groups.riscv.org/forum/#%21topic/isa-dev/vYVi95gF2Mo>
>> which uses the integer register file.
>>
>>
>> Super, thank you for that link, Richard.
>>
>> The current P-extension placeholder
>> (https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex
>> <https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex>)
>> suggests it should target the floating point registers.
>>
>>
>> That would be bad [see below]
>
> The main argument that I see for using the FP register file is that the FP registers can be expanded easily to support RVP without the awkwardness that would result from similarly expanding the integer register file or hardware complexity that would result from using contiguous groups of integer registers. (Holy regfile port proliferation, Batman!)

The disadvantage now is that you mandate an FPU if one wants to implement SIMD.
My personal interest is in integer-only SIMD (audio). I can easily hook that into our integer-only CPU line.

Richard

lkcl

Apr 7, 2018, 4:15:00 AM
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Sat, Apr 7, 2018 at 4:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> "... however *this* particular operation is a bit uncommon, there's only a
>> *ONE*-wide ALU for this particular operation (reciprocal-square-root), so
>> we'll just execute that as X sequential operations that take X clock cycles
>> to complete".
>
>
> This would seem to require adding prefix instructions that indicate "next
> instruction is a vector operation" with given parameters. I really do not
> like that idea if it can possibly be avoided.

It's pretty much exactly what the existing V-Ext is proposing
(V2.3-draft, section 17.16 "Vector Length" CSR vl). I'm merely
advocating that the concept be "generalised and applied globally
across all Extensions".

> Vector units have a tendency to dwarf their companion scalar units. I
> expect that area-constrained implementations will not support RVV.

As Christopher Celio kindly points out later in the thread, they don't have to.

In the simplest possible implementation case - and i am talking about
both the current V-Ext *and* the proposed Simple-V here - I would
envisage that the engine which did the "Vector-length-ification" could
be abstracted out and inserted in between the register file, the
instruction decode and the instruction issue, feeding "virtual
instructions" to ALUs in a way that is similar to zero-overhead loops
(more on the distinction below) whilst not _actually_ being that
intrusive or disruptive to add in an existing "non-vector-capable"
design.

... unlike SIMD :)

SIMD, and i am referring to generally-implemented SIMD as well as
Simple-V's "packed datawidth" which is directly conceptually taken
from V-Ext (V2.3-draft, section 17.6 "Bitwidth"), could potentially
have a more intrusive effect.

SIMD in effect takes the register file of width N and height M, and
*overrides* the (normally fixed) width N with a dynamically-specified
width O, to turn the exact same register file on a per-instruction
basis into one of width O and height (N/O * M).
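
As a worked example (the numbers are mine): take RV32's register file
of width N = 32 bits and height M = 32; at packed width O = 8 the very
same storage behaves as width 8 and height 32/8 * 32 = 128. A hedged C
sketch of that re-view:

    #include <stdint.h>

    #define M 32            /* register file height (number of registers) */

    /* Same storage, two views: 32 x 32-bit registers == 128 x 8-bit
     * elements (N/O * M = 4 * 32), with no data movement required. */
    typedef union {
        uint32_t x[M];      /* normal scalar view, width N = 32     */
        uint8_t  b[M * 4];  /* packed view, width O = 8, height 128 */
    } regfile_t;

    /* Packed element k lives in register k/4, byte k%4 (little-endian). */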

However i *think*... i hope.... that even that would not be too
intrusive to abstract out in a general fashion.

Where it *would* get hairy is what the implications are for the ALU.
As in: implementors who have a simple RV32 engine might freak out if
they are *forced* to implement 16- and 8-bit datawidths, with the
implication that their ALU now has to handle 8 and 16-bit integer
operations, adjusting overflow suitably and so on.

With that in mind I would be inclined to let implementors decide
which packed datawidths they wish to support in hardware (and which
they wish to punt to software-emulation / trapping). Clearly, then,
if P-Ext as it stands has the SIMD removed, leaving just the DSP
aspects in it, and an implementor chooses to add DSP-Ext, adding an
8-bit ALU and a 16-bit ALU suddenly *does* make sense, as would
correspondingly including (go figure) hardware-level support for 8 and
16 bit packed datawidths.


> I suspect that the "hardware loop" in question is actually a zero-overhead
> loop unit that diverts execution from address X to address Y if a certain
> condition is met.

not quite. The zero-overhead loop unit interestingly would be at an
[independent] level above vector-length. The distinctions are as
follows:

* Vector-length issues *virtual* instructions where the register
operands are *specifically* altered (to cover a range of registers),
whereas zero-overhead loops *specifically* do *NOT* alter the operands
in *ANY* way.

* Vector-length-driven "virtual" instructions are driven by *one* and
*only* one instruction (whether it be a LOAD, STORE, or pure
one/two/three-operand operation) whereas zero-overhead loop units
specifically apply to *multiple* [and *unaltered*] instructions.

Where vector-length-driven "virtual" instructions might get
conceptually blurred with zero-overhead loops is LOAD / STORE. In the
case of LOAD / STORE, to actually be useful, vector-length-driven LOAD
/ STORE should increment the LOAD / STORE memory address to
correspondingly match the increment in the register bank. example:

* set vector-length for r0 to 4
* issue RV32 LOAD from addr 0x1230 to r0

translates effectively to:

* RV32 LOAD from addr 0x1230 to r0
* ...
* ...
* RV32 LOAD from addr 0x123C to r3
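
Rendered as a hedged decode-stage sketch (the helper name is mine),
that expansion is just:

    #include <stdint.h>

    extern void issue_load(int rd, uint32_t addr);  /* assumed issue hook */

    /* Expand one VL-tagged word LOAD into vl unit-stride loads, stepping
     * both the destination register and the address (4 bytes per word). */
    void expand_vl_load(int rd, uint32_t addr, int vl)
    {
        for (int i = 0; i < vl; i++)
            issue_load(rd + i, addr + 4u * (uint32_t)i);
    }

    /* vl = 4, rd = r0, addr = 0x1230 issues loads to r0..r3 from
     * 0x1230, 0x1234, 0x1238, 0x123C. */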


> Yes, it implements COME FROM.

i love COME FROM. oh wait, my favourite's "IF THEN UNLESS", where
implementors are permitted to try to undo the if or then clause if the
UNLESS condition is true, and, if that's not possible, start deleting
user data to hide the failure, up to and including executing a "HCF"
instruction....

> For small inner loops in
> assembler, that is not completely insane. (The inevitable abuse of that
> facility is insane but also a different matter.)

mwahaha. insanity being defined as "doing the same thing over and
over, based on the exact same conditions, anticipating and
expecting a different result each and every time". sounds like a
_perfect_ definition of zero-overhead loops and vector-length-driven
instructions to me!

hmmm... I-Extension's not taken yet....

>> * splitting out the DSP aspects of P-SIMD to create a single-issue DSP
>
> I suggested a similar use of that contribution.

i saw that, and concur.

>> * splitting out the polymorphism, esoteric data types (GF, complex
>> numbers) and unusual operations of V to create a single-issue "Esoteric
>> Floating-Point" extension
>
>
> The polymorphism in RVV is critical for keeping the number of new
> instructions down; the lack of such is why there are so many
> MMX/SSE/SSE2/... instructions that are "the same thing" with different data
> types.

[going into introspection mode here: how do i express this as a
request that also respects your right to make your own decisions? If
i get it wrong please do assume respectful-intent and adjust the
wording implicitly]

Would it be ok to ask if you could re-visit that opinion / conclusion
at a future time, first allowing me (and others) to successfully
communicate (and define, for spec purposes) the subtle distinction
between variable-length, variable-packed-datawidth and zero-overhead
loops?

> The esoteric data types are possible future extensions. (I would
> like to add "bignum" to that list.)

karatsuuuba multiplication algorithms... in hardware... niiiice :)

>> In this way much of the duplication of instructions in both P-SIMD and V
>> can be removed, they are also useful stand-alone... you get the general
>> idea.
>
>
> The main problem that I see with this latter goal is the fairly tight
> association of RVV to an actual vector unit; these aspects are inseparable
> from the vector unit's internal parallelism.

I hope I managed to provide some clarity above for at least the
critical "minimum implementation" case that would allow you to
understand that it is feasible, just as Christopher Celio later in the
thread points out, to decouple the *normally-implied* tight
association of any Vector-based implementation (including RVV, and
Simple-V), by setting the vector unit's internal parallelism width to
exactly 1 (i.e. "no change *at all* to a pre-existing scalar ALU").

Where it does start to get a bit more complex (with associated
clearly-understood performance and power-reduction benefits that make
the whole exercise worthwhile in the first place) is where
implementors choose to set the vector unit internal parallelism to a
size greater than 1.

The reason why I suggested allowing implementors to choose precisely
which parts of any given ALU to parallelise and which to leave as
"internal vector parallelism = 1" is in pragmatic recognition of it
being simply too much - and not even desirable - to expect
implementors to duplicate an *entire* RV* ALU *right* across all
Extensions, just to get potentially one and *only* one
application-specific bit of parallelism into their implementation.
That's unreasonable to expect, and any spec worthy of respecting the
RV design paradigm should, I feel, reflect that.

l.

lkcl

unread,
Apr 7, 2018, 4:59:47 AM4/7/18
to Richard Herveille, jcb6...@gmail.com, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
[combining 2 related replies into 1]

On Sat, Apr 7, 2018 at 7:04 AM, Richard Herveille
<richard....@roalogic.com> wrote:

>> On 7 Apr 2018, at 05:37, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> The main argument that I see for using the FP register file is that the
>> FP registers can be expanded easily to support RVP without the
>> awkwardness that would result from similarly expanding the integer
>> register file or hardware complexity that would result from using
>> contiguous groups of integer registers. (Holy regfile port proliferation, Batman!)

oh. Jacob, are you saying that because you believe that to achieve
vector parallelism or SIMD parallelism it is *always* required to add
multiple simultaneous ports (like dual-ported RAM) to the register
file?

In my previous reply to you, I proposed a hypothetical implementation
that stood in between the register file and the ALU, sequentially
redirecting register indexes for e.g. r0 to instead be r0..r(N-1) each
*one* per clock cycle.
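
As a behavioural C sketch of that abstractor (all names invented; note
it never exceeds the baseline 2R1W register-file bandwidth per cycle):

#include <stdint.h>

extern uint32_t regfile[32];
extern uint8_t  vlen[32];              /* per-register vector length */
extern uint32_t scalar_alu(int op, uint32_t a, uint32_t b);

void issue(int op, int rd, int rs1, int rs2)
{
    for (int i = 0; i < vlen[rd]; i++) {         /* one iteration == one cycle */
        uint32_t a = regfile[rs1 + i];           /* read port 1                */
        uint32_t b = regfile[rs2 + i];           /* read port 2                */
        regfile[rd + i] = scalar_alu(op, a, b);  /* write port, existing ALU   */
    }
}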

So are you implying that even to implement that simple an abstractor,
where there is *still only one* fetch/store from the register file per
cycle, that implementors would *still* need to add extra ports to the
register file?

I don't believe it to be necessary, but I don't precisely know the
details. I suspect that it would only be necessary to consider
multi-porting the register file if the *actual* underlying ALU was
full/partially-duplicated to become, to use a phrase you also used, an
*internal* parallel vector unit. Would you concur?

> The disadvantage now is that you mandate an FPU if one wants to implement SIMD.

or an FPU register-file... even if you have no FPU. which may be
undesirable... just *less* undesirable.

> My personal interest is in integer only SIMD (audio). I can easily hook that into our integer only CPU line.

exactly. as long as [see below] it's not completely area-wise insane
to add CSRs per register?


On Sat, Apr 7, 2018 at 5:26 AM, Christopher Celio <ce...@berkeley.edu> wrote:

>> Vector units have a tendency to dwarf their companion scalar units. I
>> expect that area-constrained implementations will not support RVV.

> They don't have to. In fact, the vector unit can re-use the existing scalar
> functional units (and temporally execute as needed). The only real cost to
> RVV is the extra register file, which can very densely be stored in SRAM.

Ok so this is an aspect of Simple-V that I hadn't thought through,
yet (proposal / idea only a few days old!). in V2.3-Draft ISA Section
17.10 the CSRs are listed. I note that there's some general-purpose
CSRs (including a global/active vector-length) and 16 vcfgN CSRs. i
don't precisely know what those are for.

In the Simple-V proposal, *every* register in both the integer
register-file *and* the floating-point register-file would have at
least a 2-bit "data-width" CSR and probably something like an 8-bit
"vector-length" CSR (less in RV32E, by exactly one bit, in recognition
of the reduction to half the number of registers).
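
laid out as a C struct, that per-register state would be on the order
of the following (a sketch; the field encodings are invented, only the
widths come from the estimate above):

/* sketch of the proposed per-register Simple-V state, ~10 bits per
   register; field widths are the guesses above, encodings invented */
struct sv_reg_cfg {
    unsigned datawidth : 2;  /* e.g. 00=default, 01=8-bit, 10=16-bit, 11=32-bit */
    unsigned veclen    : 8;  /* 7 bits would suffice in RV32E */
};
struct sv_reg_cfg sv_int_cfg[32], sv_fp_cfg[32];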

What I *don't* know is whether that would be considered perfectly
reasonable or completely insane. If it turns out that the proposed
Simple-V CSRs can indeed be stored in SRAM then I would imagine that
adding somewhere in the region of 10 bits per register would be...
okay? I really don't honestly know.

Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
be multi-ported? No I don't believe they would.

Help evaluating those assertions greatly appreciated.

l.

Jacob Bachmeyer

unread,
Apr 8, 2018, 12:06:02 AM4/8/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Sat, Apr 7, 2018 at 4:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>>> "... however *this* particular operation is a bit uncommon, there's only a
>>> *ONE*-wide ALU for this particular operation (reciprocal-square-root), so
>>> we'll just execute that as X sequential operations that take X clock cycles
>>> to complete".
>>>
>> This would seem to require adding prefix instructions that indicate "next
>> instruction is a vector operation" with given parameters. I really do not
>> like that idea if it can possibly be avoided.
>>
>
> It's pretty much exactly what the existing V-Ext is proposing
> (V2.3-draft, section 17.16 "Vector Length" CSR vl). I'm merely
> advocating that the concept be "generalised and applied globally
> across all Extensions".
>

Not quite as I see it -- RVV proposes specific vector opcodes that are
affected by the current state of vl (and later, vm and vn), while
"Simple-V" seems to intend to use some prefix that then causes
*ordinary* RISC-V instructions to be interpreted as vector operations.

>> Vector units have a tendency to dwarf their companion scalar units. I
>> expect that area-constrained implementations will not support RVV.
>>
>
> As Christopher Celio kindly points out later in the thread, they don't have to.
>
> In the simplest possible implementation case - and i am talking about
> both the current V-Ext *and* the proposed Simple-V here - I would
> envisage that the engine which did the "Vector-length-ification" could
> be abstracted out and inserted in between the register file, the
> instruction decode and the instruction issue, feeding "virtual
> instructions" to ALUs in a way that is similar to zero-overhead loops
> (more on the distinction below) whilst not _actually_ being that
> intrusive or disruptive to add in an existing "non-vector-capable"
> design.
>

Putting something "between the register file, the instruction decode and
the instruction issue" seems very likely to significantly extend the
effective critical path for all instructions, while a dedicated vector
unit has its own critical paths, but those only affect vector instructions.

> ... unlike SIMD :)
>
> SIMD, and i am referring to generally-implemented SIMD as well as
> Simple-V's "packed datawidth" which is directly conceptually taken
> from V-Ext (V2.3-draft, section 17.6 "Bitwidth"), could potentially
> have a more intrusive effect.
>
> SIMD in effect takes the register file of width N and height M, and
> *overrides* the (normally fixed) width N with a dynamically-specified
> width O, to turn the exact same register file on a per-instruction
> basis into one of width O and height (N/O * M).
>

This kind of reshapable regfile seems likely to be extremely complex to
me. (Not that RVV does not envision exactly that in the vector unit,
but not for the scalar register file.)

> [...]
These "virtual" instructions seem to me like something that only a large
out-of-order processor could use without extensive modifications. I do
not believe that such a design is a good fit for RISC-V.

>> For small inner loops in
>> assembler, that is not completely insane. (The inevitable abuse of that
>> facility is insane but also a different matter.)
>>
>
> mwahaha. insanity being defined as "doing the same thing over and
> over, based on the exact same conditions, anticipating and
> expecting a different result each and every time". sounds like a
> _perfect_ definition of zero-overhead loops and vector-length-driven
> instructions to me!
>
> hmmm... I-Extension's not taken yet....
>

Sadly, it is used: RVI is the full base Integer ISA, while RVE is the
half-register-file base Embedded ISA.

>>> * splitting out the polymorphism, esoteric data types (GF, complex
>>> numbers) and unusual operations of V to create a single-issue "Esoteric
>>> Floating-Point" extension
>>>
>> The polymorphism in RVV is critical for keeping the number of new
>> instructions down; the lack of such is why there are so many
>> MMX/SSE/SSE2/... instructions that are "the same thing" with different data
>> types.
>>
>
> [going into introspection mode here: how do i express this as a
> request that also respects your right to make your own decisions? If
> i get it wrong please do assume respectful-intent and adjust the
> wording implicitly]
>
> Would it be ok to ask if you could re-visit that opinion / conclusion
> at a future time, first allowing me (and others) to successfully
> communicate (and define, for spec purposes) the subtle distinction
> between variable-length, variable-packed-datawidth and zero-overhead
> loops?
>

Sure, but the critical requirement for RVV is that the same opcode be
usable for any type or shape of data for which that operation is
defined. (In other words, there must be *one* vector-add opcode and it
must work for all vectors where addition is defined.)

>> The esoteric data types are possible future extensions. (I would
>> like to add "bignum" to that list.)
>>
>
> karatsuuuba multiplication algorithms... in hardware... niiiice :)
>

As I understand, an RVV vector unit (like the Hwacha vector unit) is
expected to be internally microcoded, so multiplication might not need
to be defined in hardware, if an add-with-carry operation is available.
(I proposed such an operation in message-id <57EB10A7...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57EB10A7.1000801%40gmail.com>
using a prefix ("vector-save-carry") that only affects a following
vector addition and a special instruction ("vector-process-carry") to
convert saved carry information into a vector that can be added to a
"carryless" result to produce the correct sum.)

Also, vector "bignum" support would require software cooperation, since
it would also support bignums that are larger than the capacity of the
vector unit.
This is also the reason (as I see it) for RVV to use distinct vector
opcodes instead of reusing the baseline opcodes with vectors using the
main register file.

...

There is an interesting way to combine these: each X register has an
implicit "vector-tail", the length of which is given by the per-register
"vector lengths" simple-V proposes. The vector unit then "shadows" the
scalar unit and performs vector operations on the "vector-tails", which
are stored in the vector unit.
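
A behavioural C sketch of that combination (all names invented):

#include <stdint.h>
#define MAX_TAIL 16                 /* arbitrary, for the sketch */

uint64_t xreg[32];                  /* scalar X register file */
uint64_t tail[32][MAX_TAIL];        /* per-register "vector-tails", in the vector unit */
uint8_t  taillen[32];               /* simple-V's per-register vector lengths */

void vadd(int rd, int rs1, int rs2)
{
    xreg[rd] = xreg[rs1] + xreg[rs2];          /* scalar unit, unchanged    */
    for (int i = 0; i < taillen[rd]; i++)      /* vector unit "shadows" it  */
        tail[rd][i] = tail[rs1][i] + tail[rs2][i];
}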

-- Jacob

Jacob Bachmeyer

unread,
Apr 8, 2018, 12:10:11 AM4/8/18
to Richard Herveille, lkcl, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
Richard Herveille wrote:
>> On 7 Apr 2018, at 05:37, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> lkcl wrote:
>>
>>> On Fri, Apr 6, 2018 at 12:31 PM, Richard Herveille <richard....@roalogic.com <mailto:richard....@roalogic.com>> wrote:
>>>
>>> I think you’re jumping to conclusions or at least making premature
>>> assumptions. Your points are well taken and valid. However we will
>>> not be adding thousands of instructions, at least I hope not! MMX
>>> is an implementation of packed-register instructions, specifically
>>> one that (re)uses the floating point registers for SIMD. That
>>> doesn’t mean it is the only possible implementation.
>>>
>>> Andes proposed their extensions
>>> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo
>>> <https://groups.google.com/a/groups.riscv.org/forum/#%21topic/isa-dev/vYVi95gF2Mo>
>>> which uses the integer register file.
>>>
>>>
>>> Super, thank you for that link, Richard.
>>>
>>> The current P-extension placeholder
>>> (https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex
>>> <https://github.com/riscv/riscv-isa-manual/blob/master/src/p.tex>)
>>> suggests it should target the floating point registers.
>>>
>>>
>>> That would be bad [see below]
>>>
>> The main argument that I see for using the FP register file is that the FP registers can be expanded easily to support RVP without the awkwardness that would result from similarly expanding the integer register file or hardware complexity that would result from using contiguous groups of integer registers. (Holy regfile port proliferation, Batman!)
>>
>
> The disadvantage now is that you mandate an FPU if one wants to implement SIMD.
> My personal interest is in integer only SIMD (audio). I can easily hook that into our integer only CPU line.
>

Would it be possible for RVP-without-RVF to add the FP register file,
but only implement LOAD-FP/STORE-FP/FMV and the SIMD instructions?

How much of an FPU is the FP register file? I still like the idea of
putting SIMD in a separate register file, leaving the main integer
registers available for control purposes.


-- Jacob

Jacob Bachmeyer

unread,
Apr 8, 2018, 12:35:43 AM4/8/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> [combining 2 related replies into 1]
>
> On Sat, Apr 7, 2018 at 7:04 AM, Richard Herveille
> <richard....@roalogic.com> wrote:
>
>
>> On 7 Apr 2018, at 05:37, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> The main argument that I see for using the FP register file is that the
>>> FP registers can be expanded easily to support RVP without the
>>> awkwardness that would result from similarly expanding the integer
>>> register file or hardware complexity that would result from using
>>> contiguous groups of integer registers. (Holy regfile port proliferation, Batman!)
>>>
>
> oh. Jacob, are you saying that because you believe that to achieve
> vector parallelism or SIMD parallelism it is *always* required to add
> multiple simultaneous ports (like dual-ported RAM) to the register
> file?
>

If the instructions are to *actually* be executed in parallel, yes.

> In my previous reply to you, I proposed a hypothetical implementation
> that stood in between the register file and the ALU, sequentially
> redirecting register indexes for e.g. r0 to instead be r0..r(N-1) each
> *one* per clock cycle.
>

This implementation has implicit iteration, but does not actually
execute the vector operations in parallel, so there is no gain over
baseline RISC-V.

> So are you implying that even to implement that simple an abstractor,
> where there is *still only one* fetch/store from the register file per
> cycle, that implementors would *still* need to add extra ports to the
> register file?
>

In that case, no, the register file needs only the baseline 2R1W.
However, in that case, the "abstractor" has significantly extended the
critical path length, while not providing any improvement (ignoring code
size reductions) over simply unrolling the implied loop in software.

> I don't believe it to be necessary, but I don't precisely know the
> details. I suspect that it would only be necessary to consider
> multi-porting the register file if the *actual* underlying ALU was
> full/partially-duplicated to become, to use a phrase you also used, an
> *internal* parallel vector unit. Would you concur?
>

Yes, but if the ALU is not duplicated to provide multiple vector lanes,
what is the use of the vector extension? RVV, with its
quasi-independent vector unit, at least can perform a long-running
vector operation while the scalar unit continues execution, but the
simple-V model reuses the scalar execution resources for vector processing.

>> My personal interest is in integer only SIMD (audio). I can easily hook that into our integer only CPU line.
>>
>
> exactly. as long as [see below] it's not completely area-wise insane
> to add CSRs per register?
>

If the regfile is already enough of the required resources for an RVI
processor to justify an RVE base with half of the registers, I doubt any
CSR-per-register approach is likely to be well-received. :-/

> On Sat, Apr 7, 2018 at 5:26 AM, Christopher Celio <ce...@berkeley.edu> wrote:
>
>
>>> Vector units have a tendency to dwarf their companion scalar units. I
>>> expect that area-constrained implementations will not support RVV.
>>>
>> They don't have to. In fact, the vector unit can re-use the existing scalar
>> functional units (and temporally execute as needed). The only real cost to
>> RVV is the extra register file, which can very densely be stored in SRAM.
>>
>
> Ok so this is an aspect of Simple-V that I hadn't thought through,
> yet (proposal / idea only a few days old!). in V2.3-Draft ISA Section
> 17.10 the CSRs are listed.

A minor note: the version that matters is the one at Chapter 17: RVV
v0.4-draft. The overall manual has been v2.3-draft for some time and
the individual extensions are versioned independently in draft versions
of the manual. Even that version is no real guarantee; to be specific
the Git commit ID is needed. For discussions, "the current draft" is
usually close enough. Someone will point out conclusions derived from
an old version. (And sometimes the latest version in the repo is old!
The length-speculative vector loads I complained about have been fixed,
but not in the actual draft RVV ISA spec yet!)

> I note that there's some general-purpose
> CSRs (including a global/active vector-length) and 16 vcfgN CSRs. i
> don't precisely know what those are for.
>

Those store the details of which vector registers are used and what data
type each architectural vector register holds. They are explained in
sections 17.2 through 17.5.

> In the Simple-V proposal, *every* register in both the integer
> register-file *and* the floating-point register-file would have at
> least a 2-bit "data-width" CSR and probably something like an 8-bit
> "vector-length" CSR (less in RV32E, by exactly one bit, in recognition
> of the reduction to half the number of registers).
>
> What I *don't* know is whether that would be considered perfectly
> reasonable or completely insane. If it turns out that the proposed
> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
> adding somewhere in the region of 10 bits per register would be...
> okay? I really don't honestly know.
>
> Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
> be multi-ported? No I don't believe they would.
>

Every instruction would need to read at least one such CSR or possibly
two. What does an ADD of two different-sized vectors do in simple-V?

> Help evaluating those assertions greatly appreciated.

In terms of storage space, it is comparable to (or smaller than) what is
needed for the current RVV draft. In terms of dynamic effects, I think
that it is more expensive than the current RVV draft, since simple-V
must read a length CSR for (almost) every instruction executed to
determine whether that instruction is scalar or vector. And what about
instructions like JALR? What does jumping to a vector do? (Sounds like
a great way to spawn threads in a multi-processor to me.)


-- Jacob

Richard Herveille

unread,
Apr 8, 2018, 2:22:38 AM4/8/18
to jcb6...@gmail.com, lkcl, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang


Now we get to the issue that the P extension requires partial instructions from FDQ. Which may be okay, since P might also need instructions typical for the B extension.



>
> How much of an FPU is the FP register file? I still like the idea of putting SIMD in a separate register file, leaving the main integer registers available for control purposes.

However I would like not to enforce using the FP register file, initially so as to support both packed integer and packed floating-point instructions. But I guess that would be possible anyway. But then the ALU would need direct access to the FP register file, whereas so far that register file was dedicated to the FPU. This will increase the operand-selection logic and path.
It also makes using the regular I-instructions more difficult: checking if a packed RGB value is zero, for example, or checking whether one video format's color range fits into another's (RGB-YUV conversion).

Richard



lkcl

unread,
Apr 8, 2018, 7:45:48 AM4/8/18
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Sun, Apr 8, 2018 at 5:05 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl wrote:

>> It's pretty much exactly what the existing V-Ext is proposing
>> (V2.3-draft, section 17.16 "Vector Length" CSR vl). I'm merely
>> advocating that the concept be "generalised and applied globally
>> across all Extensions".
>>
>
>
> Not quite as I see it -- RVV proposes specific vector opcodes that are
> affected by the current state of vl (and later, vm and vn), while "Simple-V"
> seems to intend to use some prefix that then causes *ordinary* RISC-V
> instructions to be interpreted as vector operations.

Ordinary RISC-V instructions to be interpreted as vector operations
that are *identical* in every respect to their RVV equivalents.

As this is a P-Ext thread i should mention that the exact same
corresponding logic applies to P-Ext.

> Putting something "between the register file, the instruction decode and the
> instruction issue" seems very likely to significantly extend the effective
> critical path for all instructions, while a dedicated vector unit has its
> own critical paths, but those only affect vector instructions.

mmm.... true.... however remember i suggested it as a (conceptual)
Reference Design which illustrated the idea. I'm sure that
implementors would be able to come up with something that was optimal
and did not have such a down-side.

>> SIMD in effect takes the register file of width N and height M, and
>> *overrides* the (normally fixed) width N with a dynamically-specified
>> width O, to turn the exact same register file on a per-instruction
>> basis into one of width O and height (N/O * M).
>>
>
>
> This kind of reshapable regfile seems likely to be extremely complex to me.

Welcome to SIMD! :) Seriously: this is just how SIMD is, in effect.
It's a bit of a mind-melt.

Note that I described it as being "effectively", not *actually*.
Typically I would expect traditional SIMD to *actually* be implemented
as "the reshaping is actually done by the ALU rather than at the
instruction decode phase". i.e. the register is passed (in full, no
reshaping at all) directly to an ALU which performs N-simultaneous
bitwidth-divided register-chunked operations. and we know where that
leads: massive proliferation of instruction opcodes.

The Aspex massive-SIMD had the right idea: as an Array *String*
Processor you could break down the "string" into chunks of 32, and
each ALU had a communication path to (and a way to pass registers between)
each of its left-and-right neighbours (with breaks splitting that
string on 32-ALU boundaries). You could decide whether to (see the C
sketch after this list):

* use each 2-bit ALU to do 32 individual operations (such that a
64-bit add would be completed in 32 cycles, 32 of them in parallel) OR
* use 2 neighbouring 2-bit ALUs to do 16 operations (such that a
64-bit add would be completed in 16 cycles, 16 of them in parallel) OR
* use 4 ......................... OR
* ...
* use all 32 ALUs to do **ONE** 64-bit add, completed in **ONE**
cycle, but just *ONE* of them
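
the trade-off works out as follows, expressed as a tiny C helper (a
sketch; g is the number of neighbouring 2-bit ALUs grouped together):

struct asp_config { int cycles; int parallel_adds; };

/* 32-ALU string of 2-bit ALUs; group g neighbours (g = 1, 2, 4 ... 32)
   and ask for 64-bit adds */
struct asp_config asp_64bit_add(int g)
{
    struct asp_config c;
    c.cycles        = 64 / (2 * g);    /* 2*g bits of the add per cycle  */
    c.parallel_adds = 32 / g;          /* how many adds run side by side */
    return c;
}
/* g=1: 32 cycles, 32 adds in parallel ... g=32: 1 cycle, just ONE add */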

You think ordinary SIMD is Bad? Dude, you have nnooo idea. The ASP
was an absolute BITCH to program. Literally days PER INSTRUCTION.
The only reason to consider all possibilities of the above was because
it was a fixed amount of time for the memory (DMA) engine to load (or
store) in the background, and consequently, with all those options
above taking different times *for different algorithms* we had to
write frickin SPREADSHEETS to calculate the best way to implement a
SPECIFIC customers algorithm. IN ASSEMBLER! five, six, eight,
sometimes TEN separate implementations in hand-coded assembler before
we could even frickin decide which one would be best to deploy and
discard all the other four to nine!

Being one of the few people ever to have worked with this
ultra-ultra-specialist type of 2-bit SIMD processor, it is primarily
why I am so strongly supportive of Andrew's "SIMD Is Bad" principle.
I *know* how far down the rabbit-hole it really goes.


> (Not that RVV does not envision exactly that in the vector unit, but not for
> the scalar register file.)

... because RVV does not cover the SIMD case. However I can easily
make a case for combining the benefits of SIMD and RVV: half-precision
IEEE 754 (binary16 format). *Right now* V is wasting precious
vector-register file bits (50% in the case of double and a whopping
*75*% - a 4:1 ratio! in the case of long-double support).

>> * set vector-length for r0 to 4
>> * issue RV32 LOAD from addr 0x1230 to r0
>>
>> translates effectively to:
>>
>> * RV32 LOAD from addr 0x1230 to r0
>> * ...
>> * ...
>> * RV32 LOAD from addr 0x123C to r3
>>
>
>
> These "virtual" instructions seem to me like something that only a large
> out-of-order processor could use without extensive modifications.

why? It's just a loop (a hardware one).

> I do not believe that such a design is a good fit for RISC-V.

why? It's exactly what's in V right now, as we're discussing in a
separate thread, and I am advocating that the same functionality be
duplicated and moved to its own Extension. You even gave a
near-identical example just with an offset stride recommendation!
here is a direct link to my actual (individual) reply:
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/YzMeNNROxfc/hVSy-NvKAgAJ

and here is the link to the overall thread (msgid I hope I got that right?):
https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPweEDy8q7WNLzsO-%3Dr5rDdhW09QmVtfx_MTp%3DjJ%2Bq_sFvzVVg%40mail.gmail.com

These are *precisely* the same discussions! :)

> Sadly, it is used: RVI is the full base Integer ISA, while RVE is the
> half-register-file base Embedded ISA.

shaame, I kinda liked the idea of an Insanity-Extension...

>> Would it be ok to ask if you could re-visit that opinion / conclusion
>> at a future time, first allowing me (and others) to successfully
>> communicate (and define, for spec purposes) the subtle distinction
>> between variable-length, variable-packed-datawidth and zero-overhead
>> loops?
>>
>
>
> Sure, but the critical requirement for RVV is that the same opcode be usable
> for any type or shape of data for which that operation is defined. (In
> other words, there must be *one* vector-add opcode and it must work for all
> vectors where addition is defined.)

yes. I would expect a (modified to take into account Simple-V) RVV
to do exactly that... just using (for the most part) the "standard" RV
opcodes for its polymorphism instead of specifically having explicit
(duplicate) ones.

this really needs to be on a separate thread

> As I understand, an RVV vector unit (like the Hwacha vector unit) is
> expected to be internally microcoded, so multiplication might not need to be
> defined in hardware, if an add-with-carry operation is available. (I
> proposed such an operation in message-id <57EB10A7...@gmail.com>
> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57EB10A7.1000801%40gmail.com>

Where the heck do you get those links from using the online
interface? I can't see how to do it, anywhere.

Ok so i did actually implement big-number multiply in the Aspex
Semiconductor ASP (massively-wide deep SIMD 2-bit ALUs, with
register-tagging aka masks aka "predication"), but can we discuss this
in a *completely* separate thread?

>> The reason why I suggested allowing implementors to choose precisely
>> which parts of any given ALU to parallelise and which to leave as
>> "internal vector parallelism = 1" is in pragmatic recognition of it
>> being simply too much - and not even desirable - to expect
>> implementors to duplicate an *entire* RV* ALU *right* across all
>> Extensions, just to get potentially one and *only* one
>> application-specific bit of parallelism into their implementation.
>> That's unreasonable to expect, and any spec worthy of respecting the
>> RV design paradigm should, I feel, reflect that.
>>
>
>
> This is also the reason (as I see it) for RVV to use distinct vector opcodes
> instead of reusing the baseline opcodes with vectors using the main register
> file.

Even within an *existing* RVV spec (as-is, unmodified), particularly
if the vector register file is say 1024 or 2048 or 4096 in size, you
certainly do not want to try making an ASIC which has *all* the ALU
features of RVV parallelised 1024 or 2048 or 4096 times. That would
be completely insane as you would likely end up having to batch most
of a 10in wafer into a single BGA package.

It's that Amdahl's Law you mentioned, Jacob (was it you? damn my
memory's bad sometimes) - just applied to vectors and SIMD.

> There is an interesting way to combine these: each X register has an
> implicit "vector-tail", the length of which is given by the per-register
> "vector lengths" simple-V proposes. The vector unit then "shadows" the
> scalar unit and performs vector operations on the "vector-tails", which are
> stored in the vector unit.

Could you elaborate more, on the V-related thread? I may need a
drawing to understand it fully. I was going to suggest that
V-on-top-of-Simple-V provide some way to stop using the integer and
floating-point regfiles and to use their own vector-regfile. This
after seeing how INSANELY large the regfile was in Hwacha. 256
entries! That's a monster supercomputer. woo!

l.

lkcl

unread,
Apr 8, 2018, 8:02:51 AM4/8/18
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Sun, Apr 8, 2018 at 5:35 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl wrote:

>> oh. Jacob, are you saying that because you believe that to achieve
>> vector parallelism or SIMD parallelism it is *always* required to add
>> multiple simultaneous ports (like dual-ported RAM) to the register
>> file?
>>
>
>
> If the instructions are to *actually* be executed in parallel, yes.

Ok so as it's 95% V i moved the reply to the V-thread, sorry I
haven't yet worked out how to get those msgid links, jacob.
https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/YzMeNNROxfc

Jacob Bachmeyer

unread,
Apr 8, 2018, 8:57:45 PM4/8/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Sun, Apr 8, 2018 at 5:05 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> lkcl wrote:
>>
> [...]
>
>>> SIMD in effect takes the register file of width N and height M, and
>>> *overrides* the (normally fixed) width N with a dynamically-specified
>>> width O, to turn the exact same register file on a per-instruction
>>> basis into one of width O and height (N/O * M).
>>>
>> This kind of reshapable regfile seems likely to be extremely complex to me.
>>
>
> Welcome to SIMD! :) Seriously: this is just how SIMD is, in effect.
> It's a bit of a mind-melt.
>
> Note that I described it as being "effectively", not *actually*.
> Typically I would expect traditional SIMD to *actually* be implemented
> as "the reshaping is actually done by the ALU rather than at the
> instruction decode phase". i.e. the register is passed (in full, no
> reshaping at all) directly to an ALU which performs N-simultaneous
> bitwidth-divided register-chunked operations. and we know where that
> leads: massive proliferation of instruction opcodes.
>

This is where two different SIMD options collide: there is packed-SIMD,
with multiple smaller elements in a single register (and I think I have
an idea for using the integer registers for this, with only minor opcode
proliferation, but that is for another branch of this thread) and there
is simple-V that you have proposed, with multiple regfile slots grouped
into an SIMD "register".

The latter seems to me very difficult for the ALU to do, unless the
physical regfile is split into multiple banks, but this is exactly how
Hwacha vector lanes are divided.

> The Aspex massive-SIMD had the right idea: as an Array *String*
> Processor you could break down the "string" into chunks of 32, and
> each ALU had a communication (and a way to pass registers between)
> each of its left-and-right neighbours (with breaks splitting that
> string on 32-ALU boundaries). You could decide whether to:
>
> * use each 2-bit ALU to do 32 individual operations (such that a
> 64-bit add would be completed in 32 cycles, 32 of them in parallel) OR
> * use 2 neighbouring 2-bit ALUs to do 16 operations (such that a
> 64-bit add would be completed in 16 cycles, 16 of them in parallel) OR
> * use 4 ......................... OR
> * ...
> * use all 32 ALUs to do **ONE** 64-bit add, completed in **ONE**
> cycle, but just *ONE* of them
>
> You think ordinary SIMD is Bad? Dude, you have nnooo idea. The ASP
> was an absolute BITCH to program. Literally days PER INSTRUCTION.
> The only reason to consider all possibilities of the above was because
> it was a fixed amount of time for the memory (DMA) engine to load (or
> store) in the background, and consequently, with all those options
> above taking different times *for different algorithms* we had to
> write frickin SPREADSHEETS to calculate the best way to implement a
> SPECIFIC customer's algorithm. IN ASSEMBLER! five, six, eight,
> sometimes TEN separate implementations in hand-coded assembler before
> we could even frickin decide which one would be best to deploy and
> discard all the other four to nine!
>

Looks like you needed a better compiler. :-)

> Being one of the few people ever to have worked with this
> ultra-ultra-specialist type of 2-bit SIMD processor, it is primarily
> why I am so strongly supportive of Andrew's "SIMD Is Bad" principle.
> I *know* how far down the rabbit-hole it really goes.
>

On the other hand, there seems to be vocal support for at least some
SIMD capability.

>> (Not that RVV does not envision exactly that in the vector unit, but not for
>> the scalar register file.)
>>
>
> ... because RVV does not cover the SIMD case. However I can easily
> make a case for combining the benefits of SIMD and RVV: half-precision
> IEEE 754 (binary16 format). *Right now* V is wasting precious
> vector-register file bits (50% in the case of double and a whopping
> *75*% - a 4:1 ratio! in the case of long-double support).
>

If RVV follows Hwacha on this, no, no vector regfile bits are wasted:
configuring smaller vector elements increases the hardware-supported
vector length.

>>> * set vector-length for r0 to 4
>>> * issue RV32 LOAD from addr 0x1230 to r0
>>>
>>> translates effectively to:
>>>
>>> * RV32 LOAD from addr 0x1230 to r0
>>> * ...
>>> * ...
>>> * RV32 LOAD from addr 0x123C to r3
>>>
>> These "virtual" instructions seem to me like something that only a large
>> out-of-order processor could use without extensive modifications.
>>
>
> why? It's just a loop (a hardware one).
>

Fair enough; I keep expecting vector operations to complete in parallel.

>> I do not believe that such a design is a good fit for RISC-V.
>>
>
> why? It's exactly what's in V right now, as we're discussing in a
> separate thread, and I am advocating that the same functionality be
> duplicated and moved to its own Extension. You even gave a
> near-identical example just with an offset stride recommendation!
> here is a direct link to my actual (individual) reply:
> https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/YzMeNNROxfc/hVSy-NvKAgAJ
>
> and here is the link to the overall thread (msgid I hope I got that right?):
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPweEDy8q7WNLzsO-%3Dr5rDdhW09QmVtfx_MTp%3DjJ%2Bq_sFvzVVg%40mail.gmail.com
>
> These are *precisely* the same discussions! :)
>

There is one important difference, if I understand your proposal
correctly: RVV has separate vector opcodes for vector instructions,
while simple-V proposes to reuse the baseline scalar opcodes for vector
operations. I had also misunderstood simple-V at first, and believed
that you were proposing a "REPEAT-NEXT-INSTRUCTION" opcode, rather than
using CSRs to configure vectors attached to register numbers.


>> As I understand, an RVV vector unit (like the Hwacha vector unit) is
>> expected to be internally microcoded, so multiplication might not need to be
>> defined in hardware, if an add-with-carry operation is available. (I
>> proposed such an operation in message-id <57EB10A7...@gmail.com>
>> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57EB10A7.1000801%40gmail.com>
>>
>
> Where the heck do you get those links from using the online
> interface? I can't see how to do it, anywhere.
>

I do not use the online interface. I use a desktop email client. I
type out "message-id", copy and paste the message-id twice, then paste
in "URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/"
and edit the second copy of the message-id to URL-escape it.

>>> The reason why I suggested allowing implementors to choose precisely
>>> which parts of any given ALU to parallelise and which to leave as
>>> "internal vector parallelism = 1" is in pragmatic recognition of it
>>> being simply too much - and not even desirable - to expect
>>> implementors to duplicate an *entire* RV* ALU *right* across all
>>> Extensions, just to get potentially one and *only* one
>>> application-specific bit of parallelism into their implementation.
>>> That's unreasonable to expect, and any spec worthy of respecting the
>>> RV design paradigm should, I feel, reflect that.
>>>
>> This is also the reason (as I see it) for RVV to use distinct vector opcodes
>> instead of reusing the baseline opcodes with vectors using the main register
>> file.
>>
>
> Even within an *existing* RVV spec (as-is, unmodified), particularly
> if the vector register file is say 1024 or 2048 or 4096 in size, you
> certainly do not want to try making an ASIC which has *all* the ALU
> features of RVV parallelised 1024 or 2048 or 4096 times. That would
> be completely insane as you would likely end up having to batch most
> of a 10in wafer into a single BGA package.
>

That would be ridiculous, but some reasonable number of vector lanes can
work in mixed parallel-serial mode to evaluate much longer vectors with
significant performance gains, generally N times or close to it.

> It's that Amdahl's Law you mentioned, Jacob (was it you? damn my
> memory's bad sometimes) - just applied to vectors and SIMD.
>

Someone else probably mentioned Amdahl's Law.

>> There is an interesting way to combine these: each X register has an
>> implicit "vector-tail", the length of which is given by the per-register
>> "vector lengths" simple-V proposes. The vector unit then "shadows" the
>> scalar unit and performs vector operations on the "vector-tails", which are
>> stored in the vector unit.
>>
>
> Could you elaborate more, on the V-related thread? I may need a
> drawing to understand it fully. I was going to suggest that
> V-on-top-of-Simple-V provide some way to stop using the integer and
> floating-point regfiles and to use their own vector-regfile. This
> after seeing how INSANELY large the regfile was in Hwacha. 256
> entries! That's a monster supercomputer. woo!
>

Hwacha supports up to 256 vector registers, but can be configured to
enable fewer, trading vector registers for longer vectors.


-- Jacob

Jacob Bachmeyer

unread,
Apr 8, 2018, 9:03:47 PM4/8/18
to Richard Herveille, lkcl, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
RVP using the integer registers could use GREVI for element swaps, for
example.

>> How much of an FPU is the FP register file? I still like the idea of putting SIMD in a separate register file, leaving the main integer registers available for control purposes.
>>
>
> However I would like to not enforce using the FP register file, initially to support both packed integer and packed floating point instructions. But I guess that would be possible anyways. But then the ALU would need direct access to the FPU, whereas so far the FPU register file was dedicated to the FPU. This will increase the operand selection logic and path.
>

There would as likely be a dedicated SIMD ALU to go with the SIMD regfile.

> It also makes using the regular I-instructions more difficult. Checking if a packed RGB is zero for example. Or to check if one video format color range fits into another’s (RGB-YUV conversion).
>

I have thought about this and have a possible outline for using the
integer registers for limited SIMD (overall SIMD group size <=XLEN).
Some operations work fine for both scalar and SIMD, such as AND/OR/XOR,
while others need to act differently, such as ADD. What differences are
actually needed to turn a scalar ADD into a packed-SIMD element-wise
ADD? The only change for ADD I can think of is to suppress carry
propagation across element boundaries. Are there other operations
desired on packed-SIMD-in-X-registers and are any of them more complex
to handle than ADD?
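
For ADD, at least, that carry suppression is just masking, the classic
SWAR trick; a C sketch (8-bit lanes in a 64-bit register assumed):

#include <stdint.h>

/* element-wise add of eight 8-bit lanes, carries suppressed at lane
   boundaries: add the low 7 bits of each lane (which cannot carry out
   of the lane), then fold the lanes' top bits back in carrylessly */
uint64_t padd8(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;    /* each lane's top bit */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}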


-- Jacob

lkcl

unread,
Apr 8, 2018, 9:55:44 PM4/8/18
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Mon, Apr 9, 2018 at 1:57 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> This is where two different SIMD options collide: there is packed-SIMD,
> with multiple smaller elements in a single register (and I think I have an
> idea for using the integer registers for this, with only minor opcode
> proliferation, but that is for another branch of this thread)

Awesome look forward to it.

> and there is simple-V that you have proposed, with multiple regfile slots
> grouped into an SIMD "register".

Ah so it's important to discern that i did *not* say "Simple-V Shall
Force Implementors To Split RegFiles By BitWidth". That's why i said
*conceptually* - as a way to *understand* SIMD - the regfile *may* be
considered to be "as if the regfile were split into bitwise chunks".

> The latter seems to me very difficult for the ALU to do, unless the physical
> regfile is split into multiple banks, but this is exactly how Hwacha vector
> lanes are divided.

okay! good! so there's precedent.

>> The Aspex massive-SIMD had the right idea: as an Array *String*
> Looks like you needed a better compiler. :-)

(they just didn't have the manpower. it was done as macro
post-processing after running gcc -E. best not to ask details... )

>> Being one of the few people ever to have worked with this
>> ultra-ultra-specialist type of 2-bit SIMD processor, it is primarily
>> why I am so strongly supportive of Andrew's "SIMD Is Bad" principle.
>> I *know* how far down the rabbit-hole it really goes.
>>
>
>
> On the other hand, there seems to be vocal support for at least some SIMD
> capability.

Oh, you may have misunderstood: there's no "other hand" here, we're
on the same page that SIMD capability is needed. variable-length
Vector *is* SIMD, it's just done as variable-length where SIMD is
traditionally done fixed-width.

i haven't quite got round to mentioning it yet but i *think* it might
be possible to put a vector-like front-end onto SIMD whilst also
avoiding the worst of SIMD (the corner-cases and insane proliferation
of opcodes) by introducing Vector-inspired predication (tag/mask) down
to the SIMD elements.

The idea there is that if an implementor decides to do the bitwise
regfile parallelism *inside* the ALU (receive 64 bits but treat the
register as 8 ints for example), they would, towards the end of the
loop, *automatically* receive a tag/mask which said "hey guys only run
the S of SIMD on these last N of the Data uhkhaaay?".

I appreciate that worst-case scenario some muppet programmer ends up
using a 64-bit SIMD instruction to do only an 8-bit add but you can't
stop absolutely everyone from shooting themselves in both feet.

> If RVV follows Hwacha on this, no, no vector regfile bits are wasted:
> configuring smaller vector elements increases the hardware-supported vector
> length.

damn i wish this stuff was properly documented. *sigh*. ok. thank you.

>> why? It's just a loop (a hardware one).

> Fair enough; I keep expecting vector operations to complete in parallel.

:)

> There is one important difference, if I understand your proposal correctly:
> RVV has separate vector opcodes for vector instructions, while simple-V
> proposes to reuse the baseline scalar opcodes for vector operations.

yes.

> I had
> also misunderstood simple-V at first, and believed that you were proposing a
> "REPEAT-NEXT-INSTRUCTION" opcode,

RPT instructions _plural_. it's apparently a well-known thing in DSPs
with SIMD. saving on the accumulate, compare and branch can in some
cases turn 25% pipeline utilisation into *100%* pipeline utilisation.
that single-instruction FFT loop Richard gave was aaawesome. oh and
no super-scalar or out-of-order needed, either.

> rather than using CSRs to configure
> vectors attached to register numbers.

That instruction comes from P (and I'd like Simple-V to combine
*both* SIMD *and* Vector abstraction). Since reviewing that Hwacha
slideshow you sent I now have a better understanding of vsetl, so I
now understand what Andrew was referring to about vector loops: vsetl
will take the for-loop variable (or what's left of it, counting down)
and take the MIN of that and the "maximum vector array size", such
that on the last vector loop you only end up with the *last* bits to
do.

(that btw is what I was referring to above, about switching off some
of the SIMD elements right at the end)
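
in C terms vsetl boils down to something like this (a sketch; VLMAX
stands in for the hardware's maximum vector length):

#define VLMAX 8   /* stand-in for the hardware's maximum vector length */

/* sketch of vsetl/vsetvl: grant min(requested, VLMAX), so the final
   trip round a stripmined loop covers only the leftover elements */
int vsetl(int requested)
{
    return (requested < VLMAX) ? requested : VLMAX;
}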

I'm going to have to think about this some more as it radically
alters what CSRs are needed. i believe it may be possible to blat
Simple-V's per-register CSRs down to a single bit per register
("hey I'm a vector" and "hey i'm bit-witted").

>> Where the heck do you get those links from using the online
>> interface? I can't see how to do it, anywhere.
>>
>
>
> I do not use the online interface. I use a desktop email client. I type
> out "message-id", copy and paste the message-id twice, then paste in
> "URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/" and edit
> the second copy of the message-id to URL-escape it.

*pained look*... i maaay write a hybrid cgi-bin script and python
program for that :) if i do i'll let you know where (both) is/are.

>> Could you elaborate more, on the V-related thread? I may need a
>> drawing to understand it fully. I was going to suggest that
>> V-on-top-of-Simple-V provide some way to stop using the integer and
>> floating-point regfiles and to use their own vector-regfile. This
>> after seeing how INSANELY large the regfile was in Hwacha. 256
>> entries! That's a monster supercomputer. woo!
>>
>
>
> Hwacha supports up to 256 vector registers, but can be configured to enable
> fewer, trading vector registers for longer vectors.

I'd like to hear more about the way that's configured... on a
separate thread, if that's ok. How about the one where I ask about
vsetl?

l.

lkcl

unread,
Apr 8, 2018, 10:11:33 PM4/8/18
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Mon, Apr 9, 2018 at 2:03 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> What differences are actually
> needed to turn a scalar ADD into a packed-SIMD element-wise ADD? The only
> change for ADD I can think of is to suppress carry propagation across
> element boundaries.

ta-daaaa :) now you know why i mentioned the Array *String*
Processor. that's exactly what they did (run-time configurably)

> Are there other operations desired on
> packed-SIMD-in-X-registers and are any of them more complex to handle than
> ADD?

Multiply doesn't work that way as it's inherently O(N^2). unless you
break it down and make the time O(N) longer. but if you already have
an eennorrmous single-cycle multiplier you _could_ probably break it
down (and shut down O(N^2-N) of the matrix), it'd make quite a mess of
the nice tidy layout but... am I making sense?

ROL / ROR ... you now need N loops but that's ok. same order
complexity as ADD.

DIV *snort* that'll be fun.

SQRT if you have it, likewise.

So quick assessment (not comprehensive!) the linear O(N) ones are ok,
it's the ones which usually need iterative algorithms (newton-raphson
and so on) that would need reimplementation.

l.

lk...@lkcl.net

unread,
Apr 9, 2018, 5:02:15 PM4/9/18
to RISC-V ISA Dev, jcb6...@gmail.com, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com
i've been giving some SIMD / DSP related thought to the example listed here
http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf

which is an implementation of SAXPY
for (i=0; i<n; i++) {
 y[i] = a*x[i] + y[i];
}

a0: n, a1: a, a2: *x, a3: *y

stripmine:
 vsetvl t0,a0
 vlw v0, a2
 vlw v1, a3
 vfma v1, a1, v0, v1
 vsw v1, a3
 slli t1, t0, 2
 add a2, a2, t1
 add a3, a3, t1
 sub a0, a0, t0
 bnez a0, stripmine
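
for reference, the same stripmined loop re-expressed back in C (a
sketch; VLMAX again stands in for whatever vsetvl grants):

#include <stddef.h>
#define VLMAX 8   /* stand-in: whatever the hardware's vsetvl grants */

void saxpy(size_t n, float a, const float *x, float *y)
{
    while (n > 0) {
        size_t vl = (n < VLMAX) ? n : VLMAX; /* vsetvl t0, a0    */
        for (size_t i = 0; i < vl; i++)      /* vlw, vfma, vsw   */
            y[i] = a * x[i] + y[i];
        x += vl;                             /* slli t1 / add a2 */
        y += vl;                             /* add a3           */
        n -= vl;                             /* sub a0           */
    }
}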


in DSPs, their execution engine is kept *real* simple.  no out-of-order, no speculative, just plain pipeline slogging.  if the above were to be executed on an in-order plain DSP, its performance would be *really* quite spectacularly bad as the Vector/SIMD pipeline would be ridiculously low-loaded.

so i thought about that for a day or so, and wondered if it would be possible to propose a variant of zero-overhead loop that included auto-incrementing the two address registers a2 and a3, as well as providing a means to interact between the zero-overhead loop and the vsetvl instruction.  a sort-of pseudo-assembly of that would look like:

zero-overhead-set-auto-increment a2, t0, 4 # a2 to be auto-incremented by t0*4
zero-overhead-set-auto-increment a3, t0, 4 # a3 to be auto-incremented by t0*4
zero-overhead-set-loop-terminator-condition a0 zero
zero-overhead-set-start-end stripmine, stripmine+endoffset
stripmine:
 vsetvl t0,a0
 vlw v0, a2
 vlw v1, a3
 vfma v1, a1, v0, v1
 vsw v1, a3
 sub a0, a0, t0
stripmine+endoffset:

the question is: would something like this even be desirable?  it's a variant of auto-increment [1].  last time i saw any hint of auto-increment register opcodes was in the 1980s... 68000 if i recall correctly... yep see [1]

what's the general feeling, here?  have "processors moved on such that superscalar is the new norm"?  because rocketchip certainly isn't superscalar / OoO, the only implementation i know that is OoO is BOOM and i don't believe that's finished / stable?

or, would it be worthwhile to look at adding auto-increment to vlw and vsw?

basically what i don't know is: is this beginning to become "CISC"?  what's the trade-off?

l.

Jacob Bachmeyer

unread,
Apr 10, 2018, 7:29:58 PM4/10/18
to lk...@lkcl.net, RISC-V ISA Dev, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com
lk...@lkcl.net wrote:
> i've been giving some SIMD / DSP related thought to the example listed
> here
> http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf
>
> which is an implementation of SAXPY
> for (i=0; i<n; i++) {
> y[i] = a*x[i] + y[i];
> }
>
> a0: n, a1: a, a2: *x, a3: *y
>
> stripmine:
> vsetvl t0,a0
> vlw v0, a2
> vlw v1, a3
> vfma v1, a1, v0, v1
> vsw v1, a3
> slli t1, t0, 2
> add a2, a2, t1
> add a3, a3, t1
> sub a0, a0, t0
> bnez a0, stripmine
>
>
> in DSPs, their execution engine is kept *real* simple. no
> out-of-order, no speculative, just plain pipeline slogging. if the
> above were to be executed on an in-order plain DSP, its performance
> would be *really* quite spectacularly bad as the Vector/SIMD pipeline
> would be ridiculously low-loaded.

The vector instruction encoding that I suggested earlier (message-id
<5951A69D...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5951A69D.1000602%40gmail.com>)
on isa-dev (it was an incomplete draft and assumed the allocation of an
OP-VECTOR major opcode) avoided this problem by using an explicit queued
execution model: any vector operation that does not produce a scalar
result is simply stuffed into the vector execution queue and processed
as the vector unit gets to it while the scalar unit continues. The
trade-off is "vector lag": the scalar unit is allowed to run ahead of
the vector unit and several synchronization instructions were included
to allow programs to enforce synchronization when needed. This
synchronization may be implicit if an instruction that produces a scalar
result is executed. So that stripmine loop, in my model, would run all
the way through and stall at the next VSETVL until the vector unit
catches up. Or, in a high-performance implementation, the VSETVL could
produce a known result, be added to the queue, and scalar processing
continues to queue vector operations until the vector queue is full,
then stall at the next vector instruction until a queue slot becomes
available. (Implementations without such a queue have a queue depth of
1: at most one vector operation can be pending or in progress at any time.)
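
A minimal C model of that queueing behaviour (all names invented;
a QDEPTH of 1 degenerates to the no-queue case):

#define QDEPTH 4                       /* 1 == "no queue" implementations */

struct vop { int opcode, vd, vs1, vs2; };
extern void vexec(struct vop op);      /* the vector unit proper (invented) */

static struct vop queue[QDEPTH];
static int head, count;

/* vector unit, conceptually running concurrently: drain in order */
void vector_unit_step(void)
{
    if (count > 0) {
        vexec(queue[head]);
        head = (head + 1) % QDEPTH;
        count--;
    }
}

/* scalar unit: enqueue and continue; stall only when the queue is full */
void issue_vector_op(struct vop op)
{
    while (count == QDEPTH)            /* scalar unit stalls here...     */
        vector_unit_step();            /* ...until a queue slot frees up */
    queue[(head + count++) % QDEPTH] = op;
}

/* implicit synchronization: an op producing a scalar result drains first */
void sync_scalar_result(void)
{
    while (count > 0)
        vector_unit_step();
}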

The vector unit never stops unless idle or frozen in this model; even
continuing to process queued operations (in user mode) while the scalar
unit handles supervisor traps. Another instruction allows the
supervisor to "freeze" the vector unit and dump its state before
task-switching or using the vector unit itself.

> so i thought about that for a day or so, and wondered if it would be
> possible to propose a variant of zero-overhead loop that included
> auto-incrementing the two address registers a2 and a3, as well as
> providing a means to interact between the zero-overhead loop and the
> vsetvl instruction. a sort-of pseudo-assembly of that would look like:
>
> zero-overhead-set-auto-increment a2, t0, 4 # a2 to be auto-incremented
> by t0*4
> zero-overhead-set-auto-increment a3, t0, 4 # a3 to be auto-incremented
> by t0*4
> zero-overhead-set-loop-terminator-condition a0 zero
> zero-overhead-set-start-end stripmine, stripmine+endoffset
> stripmine:
> vsetvl t0,a0
> vlw v0, a2
> vlw v1, a3
> vfma v1, a1, v0, v1
> vsw v1, a3
> sub a0, a0, t0
> stripmine+endoffset:
>
> the question is: would something like this even be desirable? it's a
> variant of auto-increment [1]. last time i saw any hint of
> auto-increment register opcodes was in the 1980s... 68000 if i recall
> correctly... yep see [1]

Another option for auto-increment is for vector-memory-access
instructions to support post-increment addressing for unit-stride and
constant-stride modes. This can be implemented by the scalar unit
passing the operation to the vector unit while itself executing an
appropriate multiply-and-add to produce the incremented address. This
does *not* require additional ports on the scalar register file, unlike
scalar post-increment addressing modes.
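
A rough C model of that split follows; the function and parameter names are
mine, purely illustrative of the address bookkeeping involved.

#include <stdint.h>
#include <stddef.h>

/* Post-incrementing unit/constant-stride vector load, modelled in C: the
   vector unit is handed the current base address, while the scalar unit
   retires the same instruction with a single multiply-and-add on the base
   register -- reusing the register-file write port the instruction already
   has, rather than requiring an extra one. */
static uintptr_t vload_postinc(uintptr_t *base, size_t vl,
                               size_t stride, size_t elem_bytes) {
    uintptr_t addr_for_vector_unit = *base;  /* passed along with the op */
    *base += vl * stride * elem_bytes;       /* scalar multiply-and-add */
    return addr_for_vector_unit;
}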


-- Jacob

Jacob Bachmeyer

unread,
Apr 10, 2018, 7:45:49 PM4/10/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Mon, Apr 9, 2018 at 2:03 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>> What differences are actually
>> needed to turn a scalar ADD into a packed-SIMD element-wise ADD? The only
>> change for ADD I can think of is to suppress carry propagation across
>> element boundaries.
>>
>
> ta-daaaa :) now you know why i mentioned the Array *String*
> Processor. that's exactly what they did (run-time configurably)
>

The basic idea is a CSR with element boundaries and packed-SIMD
instructions that use that CSR to modify operations on the integer
register file. RVI instructions operate on integers, while RVP
instructions respect the "psplit" CSR. Only the bits that the
implementation can actually use as element boundaries are writable in
the psplit CSR; the rest are hardwired clear. This would allow, for
example, 2-tuples of RGB565 data to be processed on RV32.
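
For ADD specifically, the carry suppression is the classic SWAR trick; a
sketch below, where `msb` is my assumed rendering of the psplit CSR as a
mask with a set bit at the most-significant bit of each element.

#include <stdint.h>

/* Element-wise add across a packed 32-bit register with carries suppressed
   at element boundaries.  `msb` holds a set bit at the top of each element:
   0x80808080 for four 8-bit lanes, or 0x84108410 to mark the top bit of
   each 5/6/5 channel in a 2-tuple of RGB565 pixels on RV32. */
static uint32_t padd32(uint32_t a, uint32_t b, uint32_t msb) {
    uint32_t low = (a & ~msb) + (b & ~msb);  /* no carry crosses a boundary */
    return low ^ ((a ^ b) & msb);            /* restore top bits, dropping
                                                each element's carry-out */
}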

>> Are there other operations desired on
>> packed-SIMD-in-X-registers and are any of them more complex to handle than
>> ADD?
>>
>
> Multiply doesn't work that way as it's inherently O(N^2). unless you
> break it down and make the time O(N) longer. but if you already have
> an eennorrmous single-cycle multiplier you _could_ probably break it
> down (and shut down O(N^2-N) of the matrix), it'd make quite a mess of
> the nice tidy layout but... am I making sense?
>

So there *is* a regular pattern for packed-multiply? Is PMUL useful
enough to justify the extra hardware?

> ROL / ROR ... you now need N loops but that's ok. same order
> complexity as ADD.
>
> DIV *snort* that'll be fun.
>
> SQRT if you have it, likewise.
>

PDIV and PSQRT would almost certainly not exist.

> So quick assessment (not comprehensive!) the linear O(N) ones are ok,
> it's the ones which usually need iterative algorithms (newton-raphson
> and so on) that would need reimplementation.

So we have PADD, PSUB, PSLL, PSRL, PSRA, (RVB: PSLO, PSRO, PROL, PROR),
(RVM: PMUL) so far. Other operations like AND/OR/XOR, (RVB: GREV),
LOAD, and STORE, work just fine for RVP using their baseline RVI
definitions. GREV in particular is interesting for power-of-2-sized
elements, since it allows exchanging such elements at will.

I have since realized that "vector-tail" is really an excellent model
for SIMD, with a "pgroup" CSR defining which lanes should be active.
Again, only implemented bits in the pgroup CSR would be writable. Full
multi-lane SIMD gets a bit more complicated and will require group
LOAD/STORE instructions to handle the SIMD address issues. Each SIMD
lane would be an additional register file and integer ALU, possibly with
some amount of memory-access capability, or all lanes could use "wide"
LOAD/STORE led by lane 0.
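
As a rough sketch of the "pgroup" idea (lane count and names invented here
for illustration): a lane mask simply gates which register-file/ALU pairs
participate, which is also what makes the tail of a loop cheap.

#include <stdint.h>
#include <stddef.h>

#define NLANES 4   /* hypothetical number of implemented SIMD lanes */

/* "Vector-tail" element-wise add: one RVI-style op replicated across lanes,
   with a pgroup-style mask disabling lanes that should not participate
   (e.g. the unused lanes on the final pass of a loop). */
static void padd_lanes(uint32_t dst[NLANES], const uint32_t a[NLANES],
                       const uint32_t b[NLANES], uint8_t pgroup) {
    for (size_t lane = 0; lane < NLANES; lane++)
        if (pgroup & (1u << lane))   /* only active (and implemented) lanes */
            dst[lane] = a[lane] + b[lane];
}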



-- Jacob

Luke Kenneth Casson Leighton

unread,
Apr 10, 2018, 8:10:34 PM4/10/18
to Jacob Bachmeyer, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
On Wed, Apr 11, 2018 at 12:29 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lk...@lkcl.net wrote:
>> in DSPs, their execution engine is kept *real* simple. no out-of-order,
>> no speculative, just plain pipeline slogging. if the above were to be
>> executed on an in-order plain DSP, its performance would be *really* quite
>> spectacularly bad as the Vector/SIMD pipeline would be ridiculously
>> low-loaded.
>
>
> The vector instruction encoding that I suggested earlier (message-id
> <5951A69D...@gmail.com>
> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5951A69D.1000602%40gmail.com>)
> on isa-dev

Lots of useful insights in there. dang though it makes some of my
posts look short :)

> (it was an incomplete draft and assumed the allocation of an
> OP-VECTOR major opcode) avoided this problem by using an explicit queued
> execution model: any vector operation that does not produce a scalar result
> is simply stuffed into the vector execution queue and processed as the
> vector unit gets to it while the scalar unit continues.

oh. neat. hm... *thinks*... that would become... interestingly
challenging to implement as part of the "intertwined" proposal i'm
mulling over, because it proposes, for various reasons, that the
results of the vector be actually stored in the main integer (or FP)
regfiles.

The reasons for that are numerous, one of which being that after a
vector-computation it is possible to get directly at the results. so
it would be possible for example to carry out a scalar-style loop of
oddness that parallelisation could not possibly cope with, followed by
dropping back into vector-computation. Whether that be floating-point
or integer-based, before *or* after the "scalar oddness".

All of which kiinda makes vector execution queueing a leetle more
challenging :) for a superscalar OoO architecture, no problem.

um... can anyone tell me if say rocketchip has detection of register
inter-dependence (result from one 5-stage-pipelined operation going
directly into the input of the *immediately* following instruction)?

> The trade-off is
> "vector lag": the scalar unit is allowed to run ahead of the vector unit
> and several synchronization instructions were included to allow programs to
> enforce synchronization when needed.

yyeah so this is why I asked if say rocketchip has automatic
pipeline-respecting register output-becomes-input detection (stalling
if that's the case) because if such infrastructure exists and the
standard register file is utilised by the vectorisation engine it
can't be *that* hard to hook into the same logic.

>> so i thought about that for a day or so, and wondered if it would be
>> possible to propose a variant of zero-overhead loop that included
>> auto-incrementing the two address registers a2 and a3, as well as providing
>> a means to interact between the zero-overhead loop and the vsetvl
>> instruction. a sort-of pseudo-assembly of that would look like:

> Another option for auto-increment is for vector-memory-access instructions
> to support post-increment addressing for unit-stride and constant-stride
> modes.

oh. yes. of course. and it's in the right place: as part of the
load / store.

> This can be implemented by the scalar unit passing the operation to
> the vector unit while itself executing an appropriate multiply-and-add to
> produce the incremented address. This does *not* require additional ports
> on the scalar register file, unlike scalar post-increment addressing modes.

niice. I like it. I might borrow the idea. or at least make a note
of it in the ever-increasingly-long "simple" vector proposal...

l.

Jacob Bachmeyer

unread,
Apr 10, 2018, 11:41:08 PM4/10/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Mon, Apr 9, 2018 at 1:57 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>> This is where two different SIMD options collide: there is packed-SIMD,
>> with multiple smaller elements in a single register (and I think I have an
>> idea for using the integer registers for this, with only minor opcode
>> proliferation, but that is for another branch of this thread)
>>
>
> Awesome look forward to it.
>
>
>> and there is simple-V that you have proposed, with multiple regfile slots
>> grouped into an SIMD "register".
>>
>
> Ah so it's important to discern that i did *not* say "Simple-V Shall
> Force Implementors To Split RegFiles By BitWidth". That's why i said
> *conceptually* - as a way to *understand* SIMD - the regfile *may* be
> considered to be "as if the regfile were split into bitwise chunks".
>

The important difference is between SIMD groups <= XLEN and SIMD groups > XLEN; Simple-V is the latter, as is FP-regfile RVP.


>>> The Aspex massive-SIMD had the right idea: as an Array *String*
>>>
>> Looks like you needed a better compiler. :-)
>>
>
> (they just didn't have the manpower. it was done as macro
> post-processing after running gcc -E. best not to ask details... )
>

That is an insanely bad compiler port.

>>> Being one of the few people ever to have worked with this
>>> ultra-ultra-specialist type of 2-bit SIMD processor, it is primarily
>>> why I am so strongly supportive of Andrew's "SIMD Is Bad" principle.
>>> I *know* how far down the rabbit-hole it really goes.
>>>
>> On the other hand, there seems to be vocal support for at least some SIMD
>> capability.
>>
>
> Oh, you may have misunderstood: there's no "other hand" here, we're
> on the same page that SIMD capability is needed. variable-length
> Vector *is* SIMD, it's just done as variable-length where SIMD is
> traditionally done fixed-width.
>
> i haven't quite got round to mentioning it yet but i *think* it might
> be possible to put a vector-like front-end onto SIMD whilst also
> avoiding the worst of SIMD (the corner-cases and insane proliferation
> of opcodes) by introducing Vector-inspired predication (tag/mask) down
> to the SIMD elements.
>
> The idea there is that if an implementor decides to do the bitwise
> regfile parallelism *inside* the ALU (receive 64 bits but treat the
> register as 8 ints for example), they would, towards the end of the
> loop, *automatically* receive a tag/mask which said "hey guys only run
> the S of SIMD on these last N of the Data uhkhaaay?".
>
> I appreciate that worst-case scenario some muppet programmer ends up
> using a 64-bit SIMD instruction to do only an 8-bit add but you can't
> stop absolutely everyone from shooting themselves in both feet.
>

The question then becomes: how is packed-SIMD as you suggest any
different from vector processing? I think that the corner cases are
probably the defining difference between packed-SIMD and vector
processing, and are the consequence of packed-SIMD using fixed-length
groups, instead of variable-length vectors.

> [...]
>
>> I had
>> also misunderstood simple-V at first, and believed that you were proposing a
>> "REPEAT-NEXT-INSTRUCTION" opcode,
>>
>
> RPT instructions _plural_. it's apparently a well-known thing in DSPs
> with SIMD. saving on the accumulate, compare and branch can in some
> cases turn 25% pipeline utilisation into *100%* pipeline utilisation.
> that single-instruction FFT loop Richard gave was aaawesome. oh and
> no super-scalar or out-of-order needed, either.
>

So the iteration aspect is simply a zero-overhead loop feature, then?

>> rather than using CSRs to configure
>> vectors attached to register numbers.
>>
>
> That instruction comes from P (and I'd like Simple-V to combine
> *both* SIMD *and* Vector abstraction). Since reviewing that Hwacha
> slideshow you sent I now have a better understanding of vsetl, so I
> now understand what Andrew was referring to about vector loops: vsetl
> will take the for-loop variable (or what's left of it, counting down)
> and take the MIN of that and the "maximum vector array size") such
> that on the last vector loop you only end up with the *last* bits to
> do.
>
> (that btw is what I was referring to above, about switching off some
> of the SIMD elements right at the end)
>

I consider this the difference between packed-SIMD and vector
processing: packed-SIMD uses fixed-width groups, while vectors are more
flexible. That said, the "vector-tail" SIMD model I suggested in the
other branch does share some similarities, particularly the ability to
disable extra lanes when finishing a computation.

>>> Where the heck do you get those links from using the online
>>> interface? I can't see how to do it, anywhere.
>> I do not use the online interface. I use a desktop email client. I type
>> out "message-id", copy and paste the message-id twice, then paste in
>> "URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/" and edit
>> the second copy of the message-id to URL-escape it.
>>
>
> *pained look*... i maaay write a hybrid cgi-bin script and python
> program for that :) if i do i'll let you know where (both) is/are.
>

I type fast, so "message-id" is about nothing to me. Pasting twice is
about as easy as pasting once. The longest part is getting the "URL"
string into the selection, but I can either fetch that from klipper or
dig it out of a message I have previously sent. The editing is (thanks
to an extension) as simple as selecting the text and choosing
"encodeURIComponent" from the context menu. Overall, typing out this
paragraph took me longer than preparing one of those links does, which
is usually nothing compared to finding the message that I want to cite.

>>> Could you elaborate more, on the V-related thread? I may need a
>>> drawing to understand it fully. I was going to suggest that
>>> V-on-top-of-Simple-V provide some way to stop using the integer and
>>> floating-point regfiles and to use their own vector-regfile. This
>>> after seeing how INSANELY large the regfile was in Hwacha. 256
>>> entries! That's a monster supercomputer. woo!
>>>
>> Hwacha supports up to 256 vector registers, but can be configured to enable
>> fewer, trading vector registers for longer vectors.
>>
>
> I'd like to hear more about the way that's configured... on a
> separate thread, if that's ok. How about the one where I ask about
> vsetl?
>

There is no more about how that is configured in RVV yet. Those slides
were about it, aside from the Hwacha manuals presumably, but it has been
some time since I read those and I do not have time right now to read
them all again.


-- Jacob

lkcl

unread,
Apr 11, 2018, 8:57:00 AM4/11/18
to Jacob Bachmeyer, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
On Wed, Apr 11, 2018 at 4:41 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The important difference is between SIMD groups <= XLEN and SIMD groups >
> XLEN; Simple-V is the latter, as is FP-regfile RVP.

I'm investigating the feasibility of Simple-V covering both cases.
Call it "Parallelism Extension" if you will.

>>>> The Aspex massive-SIMD had the right idea: as an Array *String*

> That is an insanely bad compiler port.

Well, they've been bought by Ericsson (probably for use in LTE base
stations) so it's Ericsson's problem to deal with now :)

> The question then becomes: how is packed-SIMD as you suggest any different
> from vector processing?

Given that RVV effectively also does packed-SIMD (by splitting the
Vector-Regfile SRAM right the way down to 16-bit widths for
half-precision), there really isn't that much difference that I can
see! Hence why I'm interested to pursue this so that there is not a
massive duplication of effort (and instructions. and test suites.
and massively-duplicated RTL inside implementations. and. and. and).


> I think that the corner cases are probably the
> defining difference between packed-SIMD and vector processing, and are the
> consequence of packed-SIMD using fixed-length groups, instead of
> variable-length vectors.

Yes. Compared to *traditional* SIMD, yes absolutely. the key
difference is: you're only allowed to have "vector" lengths set
to precise, fixed powers of two (4, 8, 16, 32). As in, if we were to
make a Vector Engine that *precisely* mirrored the functionality of a
"traditional" SIMD, that implementation would be forced to have, in
its specification, "the vsetl instruction will ONLY subtract a fixed
power of 2 from its operand".

Bothering to do that would be a completely wasted opportunity.
If that were followed through you might as well just abandon RVV
entirely and do a floating-point SIMD engine instead. with associated
hell-on-earth ISA proliferation. estimated about 1,000 instructions
and escalating, i should imagine, given the scope of what RVV
proposes.

Which is why i considered the "implicit last-loop predication" idea,
where on the last loop, if the vsetl instruction sets the vector
length *not* equal to the maximum SIMD width (and the SIMD engine *is*
implemented as a fixed-width multiple ALU block), it enables
predication on the remaining indexed elements of the SIMD array.

That would completely cut the need for explicit corner-case code.

[btw, reflection: I *believe* this may actually be identical to the
"vector-tail" idea that you proposed, Jacob].

This idea is partly an implementation-specific detail, as keeping the
SIMD engines more akin to "traditional" SIMD would be one possible
implementation, whilst other implementations would potentially be much
more along the lines of the RVV engine, with separate (parallelised)
16-bit and potentially even separate (parallelised) 8-bit ALUs, silly
as that may seem. Whether that is worthwhile doing in practice (the
support infrastructure being far in excess of the size of a single
8-bit ALU, even a parallelised duplicated one) remains to be seen.

Ruling out implementation choices and failing to take into
consideration for example micro-controllers has been a point of
contention and friction in the past, so what may seem "silly" in one
context is "perfectly normal" in another. Respecting both would, I
feel, be... prudent.


>> [...]
>>
>>>
>>> I had
>>> also misunderstood simple-V at first, and believed that you were
>>> proposing a
>>> "REPEAT-NEXT-INSTRUCTION" opcode,
>>>
>>
>>
>> RPT instructions _plural_. it's apparently a well-known thing in DSPs
>> with SIMD. saving on the accumulate, compare and branch can in some
>> cases turn 25% pipeline utilisation into *100%* pipeline utilisation.
>> that single-instruction FFT loop Richard gave was aaawesome. oh and
>> no super-scalar or out-of-order needed, either.
>>
>
>
> So the iteration aspect is simply a zero-overhead loop feature, then?

Sorry i'm not totally clear on the question. I love the way that
Vectors (RVV / Hwacha) works: I'd like to see if it can be improved /
enhanced (benefit from P-Ext's zero-overhead loop or enhanced versions
of the same), but this is a separate topic.

>> That instruction comes from P (and I'd like Simple-V to combine
>> *both* SIMD *and* Vector abstraction). Since reviewing that Hwacha
>> slideshow you sent I now have a better understanding of vsetl, so I
>> now understand what Andrew was referring to about vector loops: vsetl
>> will take the for-loop variable (or what's left of it, counting down)
>> and take the MIN of that and the "maximum vector array size", such
>> that on the last vector loop you only end up with the *last* bits to
>> do.

> I consider this the difference between packed-SIMD and vector processing:
> packed-SIMD uses fixed-width groups, while vectors are more flexible.

it's extremely cool. I fully "Get" why SIMD was created: it's far
easier to just throw in some extra instructions, right down inside the
ALU. There, in the ALU, there's no need to get concerned about the
impact on the instruction decode engine.

> That
> said, the "vector-tail" SIMD model I suggested in the other branch does
> share some similarities, particularly the ability to disable extra lanes
> when finishing a computation.

I didn't quite follow... but I *think* we may be talking about the
same thing, just using different terminology. It would be good to
confirm that.

>>>> Where the heck do you get those links from using the online
>>>> interface? I can't see how to do it, anywhere.

> I type fast, so "message-id" is about nothing to me. Pasting twice is about
> as easy as pasting once.

ok so can I ask where you get those "message-ids" from? I use gmail
and the google groups interface, so they're... hum.. hidden. Do I
have to actually look at the original message (SMTP) and examine the
SMTP headers? (don't mind if i do, i'm just asking)

> context menu. Overall, typing out this paragraph took me longer than
> preparing one of those links does, which is usually nothing compared to
> finding the message that I want to cite.

:)

>>> Hwacha supports up to 256 vector registers, but can be configured to
>>> enable
>>> fewer, trading vector registers for longer vectors.
>>>
>>
>>
>> I'd like to hear more about the way that's configured... on a
>> separate thread, if that's ok. How about the one where I ask about
>> vsetl?
>>
>
>
> There is no more about how that is configured in RVV yet. Those slides were
> about it, aside from the Hwacha manuals presumably, but it has been some
> time since I read those and I do not have time right now to read them all
> again.

ok. I'll keep digging and asking [go through those resources
myself]. It's important that I find out, otherwise I can't possibly
propose an Extension that would be considered for adoption with a
minimum disruption to P and V. The proposal *has* to be directly
equivalent to both P *and* V's "parallelism"... with additional clear
benefits that make it worthwhile and compelling on many many
simultaneous levels.

l.

Jacob Bachmeyer

unread,
Apr 11, 2018, 10:50:38 PM4/11/18
to lkcl, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, RISC-V ISA Dev, chuanhua.chang
lkcl wrote:
> On Wed, Apr 11, 2018 at 4:41 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>> The important difference is between SIMD groups <= XLEN and SIMD groups > XLEN; Simple-V is the latter, as is FP-regfile RVP.
>>
>
> I'm investigating the feasibility of Simple-V covering both cases.
> Call it "Parallelism Extension" if you will.
>

I am writing a proposal for that.

> [...]
>> I think that the corner cases are probably the
>> defining difference between packed-SIMD and vector processing, and are the
>> consequence of packed-SIMD using fixed-length groups, instead of
>> variable-length vectors.
>>
>
> Yes. Compared to *traditional* SIMD, yes absolutely. the key
> difference is (are): you're only allowed to have "vector" lengths set
> to precise, fixed powers of two (4, 8, 16, 32). As in, if we were to
> make a Vector Engine that *precisely* mirrored the functionality of a
> "traditional" SIMD, that implementation would be forced to have, in
> its specification, "the vsetl instruction will ONLY subtract a fixed
> power of 2 from its operand".
>
> which would be a complete wasted opportunity, bothering to do that.
> If that were followed through you might as well just abandon RVV
> entirely and do a floating-point SIMD engine instead. with associated
> hell-on-earth ISA proliferation. estimated about 1,000 instructions
> and escalating, i should imagine, given the scope of what RVV
> proposes.
>
> Which is why i considered the "implicit last-loop predication" idea,
> where on the last loop, if the vsetl instruction sets the vector
> length *not* equal to the maximum SIMD width (and the SIMD engine *is*
> implemented as a fixed-width multiple ALU block), it enables
> predication on the remaining indexed elements of the SIMD array.
>
> That would completely cut the need for explicit corner-case code.
>

This suggests that RVP could emulate a subset of RVV; noted for my proposal.

> [btw, reflection: I *believe* this may actually be identical to the
> "vector-tail" idea that you proposed, Jacob].
>

The "vector-tail" idea is "grouping" in the RVP proposal sketch I am
writing.

> This idea is partly an implementation-specific detail, as keeping the
> SIMD engines more akin to "traditional" SIMD would be one possible
> implementation, whilst other implementations would potentially be much
> more along the lines of the RVV engine, with separate (parallelised)
> 16-bit and potentially even separate (parallelised) 8-bit ALUs, silly
> as that may seem. Whether that is worthwhile doing in practice (the
> support infrastructure being far in excess of the size of a single
> 8-bit ALU, even a parallelised duplicated one) remains to be seen.
>

Which could mean that RVV could be used to emulate RVP, at least unless
partitioning is used.

> [...]
>
>>>>> Where the heck do you get those links from using the online
>>>>> interface? I can't see how to do it, anywhere.
>>>>>
>> I type fast, so "message-id" is about nothing to me. Pasting twice is about
>> as easy as pasting once.
>>
>
> ok so can I ask where you get those "message-ids" from? I use gmail
> and the google groups interface, so they're... hum.. hidden. Do I
> have to actually look at the original message (SMTP) and examine the
> SMTP headers? (don't mind if i do, i'm just asking)
>

I use a desktop email client, configured to show "Message-Id" amongst
the other headers; button-3 gives a menu with "Copy Message-ID" as an
option.


-- Jacob