Re: [isa-dev] V-extension analysis (was: P-extension charter (SIMD))


Luke Kenneth Casson Leighton

Apr 6, 2018, 9:13:13 PM
to Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
(splitting to separate thread, this is 95% V, not P)

On Sat, Apr 7, 2018 at 12:17 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> On Fri, Apr 6, 2018 at 3:49 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net>
> wrote:

>> ... under such hypothetical circumstances how much of the V Extension
>> - right now - would be duplicated / redundant opcodes?
>
> If you look at the V proposal in the repo, you'll see the majority of the
> instructions are named after scalar instructions but with a 'V' in front.
> These match your description.

is that https://github.com/riscv/riscv-isa-manual ? if so, many many
apologies, i just did a "git pull" and it's still at V2.3-draft:

commit a9d7704765360679c1a5e3fa06e0b0e41d6c5f26
Author: Andrew Waterman <and...@sifive.com>
Date: Tue Apr 3 02:32:47 2018 -0700
Fix description of PMP granularity probing mechanism

and that version doesn't have a section describing V's operands!
there's figure 17.4, which has some example instructions (and i also
recall some in the "SIMD Is Bad" article that you co-authored), so i do
have a fairly clear idea of what you mean. Is there another version
around that i don't know of (sorry if there is!) which is more
up-to-date?

> However, there are also several features that go beyond simply attaching VL
> to a scalar operation and are crucial to being able to vectorize a lot of
> code. To name a few:
> - Conditional execution (i.e., predicated operations)
> - Inter-lane data movement (e.g. SLIDE, SELECT)
> - Reductions (e.g., VADD with a scalar destination)

Ok so the Conditional execution and also the Reductions are among
the reasons why, as part of SimpleV / variable-SIMD / parallelism (gah,
gotta think of a decent name), i proposed that it be implemented as "if
you say r0 is to be a vector / SIMD, that means operations actually
take place on r0,r1,r2... r(N-1)". i.e. you don't have a separate
vector register file.

Consequently any parallel operation could be paused (or... more
specifically: vectors disabled by resetting it back to a default /
scalar / vector-length=1) yet the results would actually be in the
*main register file* (integer or float) and so anything that wasn't
possible to easily do in "simple" parallel terms could be done *out*
of parallel "mode" instead.

I do appreciate that the above does imply that there is a limit to
the length to which SimpleV (whatever) can be parallelised: namely,
you run out of registers! my thought there was, "leave space for the
main V-Ext proposal to extend it to the length that V currently
supports". Honestly i had not thought through precisely how that
would work.
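
A minimal C sketch of that implicit-loop model (purely illustrative:
regfile[], csr_vlen[] and the function name are invented for this
example, not from any spec):

#include <stdint.h>

/* illustrative only: regfile[] is the ordinary scalar register file,
   csr_vlen[] a hypothetical per-register vector-length CSR
   (0 or 1 here means plain scalar behaviour) */
static uint64_t regfile[32];
static unsigned csr_vlen[32];

/* "ADD rd, rs1, rs2" under Simple-V: an implicit loop over
   r(d)..r(d+vl-1), using the *main* register file throughout */
static void simplev_add(unsigned rd, unsigned rs1, unsigned rs2)
{
    unsigned vl = csr_vlen[rd] ? csr_vlen[rd] : 1;
    for (unsigned i = 0; i < vl; i++)
        regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i];
}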

Inter-lane (SELECT): i saw 17.19 in V2.3-Draft p117 and liked it; it
reminds me of the discussion with Clifford on bit-manipulation
(gather-scatter, except not Bit Gather Scatter: *data* gather-scatter).
if applied "globally and outside of V and P", SLIDE and SELECT might
become an extremely powerful way to do fast memory copy and reordering
[2].

However I haven't quite got my head round how that would work: i am
used to the concept of register "tags" (the modern term is "masks")
and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
STORE you would get the exact same thing as SELECT. oh wait! yes,
it's the same terminology as Section 17.17, "Predicated Execution".
almost. hum. needs thought.

SLIDE you could do simply by setting say r0 vector-length to say 16
(meaning that if referred to in any operation it would be an implicit
parallel operation on *all* registers r0 through r15), and temporarily
setting say.... r7 vector-length to say... 5. Do a LOAD on r7 and it
would implicitly mean "load from memory into r7 through r11". Then
you go back and do an operation on r0 and ta-daa, you're actually
doing an operation on a SLID (SLIDed?) vector.

The advantage of Simple-V (whatever) over V would be that you could
actually do *operations* in the middle of vectors (not just SLIDEs)
simply by (as above) setting r0 vector-length to 16 and r7
vector-length to 5. There would be nothing preventing you from doing
an ADD on r0 (which meant "do an ADD on r0 through r15") followed
*immediately in the next instruction with no setup cost* by a MUL on
r7 (which actually meant "do a parallel MUL on r7 through r11").
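
In the same illustrative C as the earlier sketch (set_vlen(),
simplev_load() and simplev_mul() standing in for whatever CSR writes
and operations would actually be used; all names invented):

set_vlen(0, 16);        // r0 now means r0..r15 in any operation
set_vlen(7, 5);         // r7 now means r7..r11
simplev_load(7, addr);  // fills r7..r11 from memory: the "SLIDE"
simplev_add(0, 0, 16);  // one ADD covers all of r0..r15, including
                        //   the freshly-loaded middle section
simplev_mul(7, 7, 17);  // very next instruction, no setup cost:
                        //   a MUL on just r7..r11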

btw it's worth mentioning that you'd get scalar-vector and
vector-scalar in 2-operand instructions implicitly, by having one of
the source registers be vector-length 1 (the default) and one being N >
1, but without needing special opcodes to do it (in 17.12, rs1
specifies the scalar, so the opcode has to explicitly encode whether
the operation is v-v or v-s). i *believe* (or more like "logically
infer or deduce", as i haven't got access to the spec) that that would
result in a further opcode reduction when comparing [draft] V-Ext to
[proposed] Simple-V.
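
A sketch of that reading, using the same illustrative declarations as
before: a vlen-1 source simply doesn't advance, giving broadcast
behaviour with no separate v-s opcode:

/* variant of simplev_add: any vlen==1 operand is broadcast */
static void simplev_add_vs(unsigned rd, unsigned rs1, unsigned rs2)
{
    unsigned vl = csr_vlen[rd] ? csr_vlen[rd] : 1;
    for (unsigned i = 0; i < vl; i++) {
        uint64_t a = regfile[rs1 + (csr_vlen[rs1] > 1 ? i : 0)];
        uint64_t b = regfile[rs2 + (csr_vlen[rs2] > 1 ? i : 0)];
        regfile[rd + i] = a + b;
    }
}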

Also, Reduction *might* be possible by specifying that the
destination is a scalar (vector-length=1) whilst the source is a
vector. However... it would be an awful lot of work to go through
*every single instruction* in *every* Extension, working out which
ones could be parallelised (ADD, MUL, XOR) and which definitely
could not (DIV, SUB). Is that worth the effort? maybe. Would it
result in huge complexity? probably. Could an implementor just go "I
ain't doing *that* as parallel! let's make it virtual-parallelism
(sequential reduction) instead"? absolutely. So, now that I think it
through, Simple-V (whatever) covers Reduction as well. huh, that's a
surprise.
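
Sketching that reduction reading too (scalar destination, vector
source; as noted, only sensible for associative operations):

/* "ADD rd, rs1, rs2" with vlen(rd)==1 and vlen(rs2)==N, read as a
   sequential ("virtual parallelism") reduction: illustrative only */
static void simplev_add_reduce(unsigned rd, unsigned rs1, unsigned rs2)
{
    uint64_t acc = regfile[rs1];
    for (unsigned i = 0; i < csr_vlen[rs2]; i++)
        acc += regfile[rs2 + i];
    regfile[rd] = acc;
}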


> - Vector-length speculation (making it possible to vectorize some loops with
> unknown trip count) - I don't think this part of the proposal is written
> down yet.

Now that _is_ an interesting concept. A little scary, i imagine,
with the possibility of putting a processor into a hard infinite
execution loop... :)


> Also, note the vector ISA consumes relatively little opcode space (all the
> arithmetic fits in 7/8ths of a major opcode). This is mainly because data
> type and size is a function of runtime configuration, rather than of opcode.

yes. i love that aspect of V; i am a huge fan of polymorphism [1],
which is why i am keen to advocate that the same runtime principle be
extended to the rest of the RISC-V ISA [3].

Yikes that's a lot. I'm going to need to pull this into the wiki to
make sure it's not lost.

l.

[1] inherent data type conversion: 25 years ago i designed a
hypothetical hyper-hyper-hyper-escape-code-sequencing ISA based around
2-bit (escape-extended) opcodes and 2-bit (escape-extended) operands
that only required a fixed 8-bit instruction length. that relied
heavily on polymorphism and runtime size configurations as well. At
the time I thought it would have meant one HELL of a lot of CSRs...
but then I met RISC-V and was cured instantly of that
delusion^Wmisapprehension :)

[2] Interestingly if you then also add in the other aspect of Simple-V
(the data-size, which is effectively functionally orthogonal /
identical to "Packed" of Packed-SIMD), masked and packed *and*
vectored LOAD / STORE operations become byte / half-word / word
augmenters of B-Ext's proposed "BGS" i.e. where B-Ext's BGS dealt with
bits, masked-packed-vectored LOAD / STORE would deal with 8 / 16 / 32
bits at a time. Where it would get really REALLY interesting would be
masked-packed-vectored B-Ext BGS instructions. I can't even get my
head fully round that, which is a good sign that the combination would
be *really* powerful :)

[3] ok sadly maybe not the polymorphism, it's too complicated and I
think would be much too hard for implementors to easily "slide in" to
an existing non-Simple-V implementation. i say that despite really
*really* wanting IEEE 754 FP half-precision to end up somewhere in
RISC-V in some fashion, for optimising 3D Graphics. *sigh*.

Andrew Waterman

Apr 6, 2018, 9:29:20 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
I agree that everything you could do in the V extension you could do on the scalar registers instead.  But, as you point out, the number of registers limits parallelism.  It's not a minor point.  This strategy looks OK for things like BLAS1, but once you want to vectorize loop nests with lots of live values, the vector length becomes so short that you're practically running scalar code.

The effect is even more pronounced for higher-performance (i.e. multi-lane) implementations, which have many more values in-flight and so will need to rename registers to cope with the lack of architectural registers.

The effect is also more pronounced for high-latency memory systems, which long vectors go a long way towards ameliorating, especially for gathers.  (Of course, this is not a problem for many embedded systems.)



>> - Vector-length speculation (making it possible to vectorize some loops with
>> unknown trip count) - I don't think this part of the proposal is written
>> down yet.
>
> Now that _is_ an interesting concept.  A little scary, i imagine,
> with the possibility of putting a processor into a hard infinite
> execution loop... :)

The vector length's still finite, so there's no possibility of vector instructions running forever.  The "speculation" part is that you might execute some loop iterations that shouldn't have been executed, which usually means falling back to scalar fixup code for the last VL-or-so iterations.
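
A hedged C sketch of that shape (setvl(), vec_step() and scalar_step()
are hypothetical helpers, purely to show the speculate-then-fixup
idiom):

size_t i = 0;
for (;;) {
    size_t vl = setvl(VLMAX);     // hardware grants vl <= VLMAX
    if (vec_step(&a[i], vl)) {    // may speculate past the real exit
        for (size_t j = i; j < i + vl; j++)
            if (scalar_step(&a[j]))   // precise scalar fixup of the
                break;                //   last VL-or-so iterations
        break;
    }
    i += vl;
}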



Jacob Bachmeyer

Apr 6, 2018, 11:53:27 PM
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
Luke Kenneth Casson Leighton wrote:
> (splitting to separate thread, this is 95% V, not P)
>
> On Sat, Apr 7, 2018 at 12:17 AM, Andrew Waterman
> <wate...@eecs.berkeley.edu> wrote:
>
>
>> On Fri, Apr 6, 2018 at 3:49 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net>
>> wrote:
>>
>>> ... under such hypothetical circumstances how much of the V Extension
>>> - right now - would be duplicated / redundant opcodes?
>>>
>> If you look at the V proposal in the repo, you'll see the majority of the
>> instructions are named after scalar instructions but with a 'V' in front.
>> These match your description.
>>
>
> is that https://github.com/riscv/riscv-isa-manual ? if so, many many
> apologies, i just did a "git pull" and it's still at V2.3-draft:
>
> commit a9d7704765360679c1a5e3fa06e0b0e41d6c5f26
> Author: Andrew Waterman <and...@sifive.com>
> Date: Tue Apr 3 02:32:47 2018 -0700
> Fix description of PMP granularity probing mechanism
>
> and that version doesn't have a section describing V's operands!
> there's figure 17.4 which has some example instructions (and i also
> recall some on the "SIMD Is Bad" article that you co-authored) so i do
> have a fairly clear idea of what you mean.

The RISC-V specs use a "modular versioning" scheme. Chapter 17 in that
commit is RVV 0.4-draft, even though the manual as a whole is v2.3-draft.

>> However, there are also several features that go beyond simply attaching VL
>> to a scalar operation and are crucial to being able to vectorize a lot of
>> code. To name a few:
>> - Conditional execution (i.e., predicated operations)
>> - Inter-lane data movement (e.g. SLIDE, SELECT)
>> - Reductions (e.g., VADD with a scalar destination)
>>
>
> Ok so the Conditional and also the Reductions is one of the reasons
> why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
> of a decent name) i proposed that it be implemented as "if you say r0
> is to be a vector / SIMD that means operations actually take place on
> r0,r1,r2... r(N-1)". i.e. you don't have a separate vector register
> file.
>

This is a problem: implementations must either have a large number of
regfile read ports or abandon parallelism for this case. It sounds
great from a software point-of-view, but is infeasible to implement in
hardware. Actual vector units avoid this problem by splitting the
(vector) register file across multiple vector lanes, each of which has
its own "slice" of the vector register file and its own datapath/ALU
hardware.

> Consequently any parallel operation could be paused (or... more
> specifically: vectors disabled by resetting it back to a default /
> scalar / vector-length=1) yet the results would actually be in the
> *main register file* (integer or float) and so anything that wasn't
> possible to easily do in "simple" parallel terms could be done *out*
> of parallel "mode" instead.
>
> I do appreciate that the above does imply that there is a limit to
> the length that SimpleV (whatever) can be parallelised, namely that
> you run out of registers! my thought there was, "leave space for the
> main V-Ext proposal to extend it to the length that V currently
> supports". Honestly i had not thought through precisely how that
> would work.
>
> Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that, it
> reminds me of the discussion with Clifford on bit-manipulation
> (gather-scatter except not Bit Gather Scatter, *data* gather scatter):
> if applied "globally and outside of V and P" SLIDE and SELECT might
> become an extremely powerful way to do fast memory copy and reordering
> [2].
>

SLIDE and SELECT will work in RVV for that, since the vector unit must
have its own (wide) path to memory.

> However I haven't quite got my head round how that would work: i am
> used to the concept of register "tags" (the modern term is "masks")
> and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
> STORE you would get the exact same thing as SELECT. oh wait! yes,
> it's the same terminology as Section 17.17, "Predicated Execution".
> almost. hum. needs thought.
>

Many of the concepts in draft RVV are from Hwacha, which was described
in Berkeley technical reports UCB/EECS-2015-262 (ISA reference),
UCB/EECS-2015-263 (microarchitecture reference), and UCB/EECS-2015-264
(preliminary evaluation).

> [...]
>
>> - Vector-length speculation (making it possible to vectorize some loops with
>> unknown trip count) - I don't think this part of the proposal is written
>> down yet.
>>
>
> Now that _is_ an interesting concept. A little scary, i imagine,
> with the possibility of putting a processor into a hard infinite
> execution loop... :)
>

That would not be possible, since there is some finite maximum vector
length. When that is reached, software must iterate for the loop to
continue.


-- Jacob

Luke Kenneth Casson Leighton

Apr 7, 2018, 5:32:03 AM
to Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sat, Apr 7, 2018 at 2:28 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> I agree that everything you could do in the V extension you could do on the
> scalar registers instead. But, as you point out, the number of registers
> limits parallelism. It's not a minor point.

appreciated.

> This strategy looks OK for
> things like BLAS1, but once you want to vectorize loop nests with lots of
> live values, the vector length becomes so short that you're practically
> running scalar code.

I'm not an algorithm-expert so had to look up BLAS1:
http://people.sc.fsu.edu/~jburkardt/m_src/blas1/blas1.html

I am happy to admit that I am insufficiently imaginative to be able
to assess what you say here, Andrew. Would it be ok to ask if you
could provide a concrete example (a reference to an example
implementation online, preferably in C or Python, or perhaps a
MATLAB or similar API name)?

Also, I don't know if you spotted it but I proposed also adding the
zero-overhead loop to proposed Simple-V, which might help? I don't
honestly know, I'd need to see and analyse an example.


> The effect is even more pronounced for higher-performance (i.e. multi-lane)
> implementations, which have many more values in-flight and so will need to
> rename registers to cope with the lack of architectural registers.
>
> The effect is also more pronounced for high-latency memory systems, which
> long vectors go a long way towards ameliorating, especially for gathers.
> (Of course, this is not a problem for many embedded systems.)

agreed. so Simple-V would cover a significant number of potential
use-cases in an extremely useful fashion, whilst clearly having
nothing like the power of a full (Cray-inspired) supercomputer.

So this is the point at which I would expect V-Ext to take over from
Simple-V, in a seamless transition that extends Simple-V in a way
analogous to the mention of a "Base Vector ISA" implying that there is
more than one level being planned in V (V2.3-Draft, Section 17.5,
Element bitwidth).

That's not a cop-out on my part: it's a pragmatic recognition of, and
effort to break down, the complexity of adding Vectors (and Packing) to
RISC-V in an "incremental" fashion, which would in turn make
implementing the V (and P-which-now-becomes-DSP) Extensions a lot
less intimidating [costly].

Ultimately it's in recognition of the "doing is learning" paradigm,
inviting more contributors to participate and thus increasing the
number of eyes-on, and thus increasing the probability of both
adoption and *correct* adoption of [proposed-altered] V, P-now-DSP,
and Simple-V.

>> > - Vector-length speculation (making it possible to vectorize some loops
>> > with
>> > unknown trip count) - I don't think this part of the proposal is written
>> > down yet.
>> > [...]

> The vector length's still finite, so there's no possibility of vector
> instructions running forever.

whewww :)

> The "speculation" part is that you might
> execute some loop iterations that shouldn't have been executed, which
> usually means falling back to scalar fixup code for the last VL-or-so
> iterations.

oh my god! it really *is* an implementation of "IF THEN UNLESS"! :)
are implementors permitted to implement "HCF" if they can't satisfy
the fixup conditions? damn i can't find the reference to when i first
heard about that joke language-construct: it was 1990, after all.

Seriously though: I'm slightly scared. Do you have a suitable code
example or scenario that would allow a clearer understanding? Is this
something related to out-of-order implementations?

Thanks Andrew.

l.

Luke Kenneth Casson Leighton

Apr 7, 2018, 5:52:40 AM
to Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
On Sat, Apr 7, 2018 at 4:53 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Luke Kenneth Casson Leighton wrote:
>> is that https://github.com/riscv/riscv-isa-manual ? if so, many many
>> apologies, i just did a "git pull" and it's still at V2.3-draft:
>>
>> commit a9d7704765360679c1a5e3fa06e0b0e41d6c5f26

> The RISC-V specs use a "modular versioning" scheme. Chapter 17 in that
> commit is RVV 0.4-draft, even though the manual as a whole is v2.3-draft.

oink?? oh YUK. Ok I scanned back to the beginning of the chapter
(p97) and yes I finally at long last spotted it mentions "0.4-DRAFT".
nowhere else. That's *horribly* confusing.

>> Ok so the Conditional and also the Reductions is one of the reasons
>> why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
>> of a decent name) i proposed that it be implemented as "if you say r0
>> is to be a vector / SIMD that means operations actually take place on
>> r0,r1,r2... r(N-1)". i.e. you don't have a separate vector register
>> file.
>>
>
>
> This is a problem: implementations must either have a large number of
> regfile read ports or abandon parallelism for this case.

yes. Nothing wrong with that (as long as you mean "abandon internal
parallelism" and do "virtual parallelism", as Broadcom put it [1]).
Christopher Cello gave a very useful one-line summary of how it is
quite common to implement "virtual parallelism" by keeping a scalar
ALU internally actually scalar.

I suspect that there may be some cross-over here, in this post [2].
Summary: external appearance of vector-parallelism *does not* have to
be implemented as *internal* parallelism.


> SLIDE and SELECT will work in RVV for that, since the vector unit must have
> its own (wide) path to memory.

Again, I suspect that this is cross-over, you may be assuming that it
is *always* the case that a vector unit *will* be implemented as a
N-symmetric ALU, where N is the level of internal parallelism.
Christopher and I have both pointed out that it is perfectly valid
(and desirable) to have N=1.

It would be very useful to revisit these conclusions once this has
been clearly communicated. Many apologies for presenting
"everything-at-once" and introducing confusion and thus providing an
opportunity to fall foul of email-lag.


> Many of the concepts in draft RVV are from Hwacha, which was described in
> Berkeley technical reports UCB/EECS-2015-262 (ISA reference),
> UCB/EECS-2015-263 (microarchitecture reference), and UCB/EECS-2015-264
> (preliminary evaluation).

learning by doing. links:
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-262.html
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html

Thank you for that, Jacob. will read.

>> [speculative RVV execution]
>>
>>>
> That would not be possible, since there is some finite maximum vector
> length. When that is reached, software must iterate for the loop to
> continue.

This is why i proposed that Simple-V also include a zero-overhead
loop instruction. Unfortunately I do not yet know enough about the
[proposed] speculative vector looping to be able to usefully
contribute further, yet.

l.

[1] Broadcom VideoCore-IV https://docs.broadcom.com/docs/12358545
Figure 2 P17 and Section 3 on P16.
[2] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/E3k1uklmAgAJ

Andrew Waterman

Apr 7, 2018, 7:49:43 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
On Sat, Apr 7, 2018 at 2:31 AM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
>
> On Sat, Apr 7, 2018 at 2:28 AM, Andrew Waterman
> <wate...@eecs.berkeley.edu> wrote:
>
>> I agree that everything you could do in the V extension you could do on the
>> scalar registers instead.  But, as you point out, the number of registers
>> limits parallelism.  It's not a minor point.
>
> appreciated.
>
>> This strategy looks OK for
>> things like BLAS1, but once you want to vectorize loop nests with lots of
>> live values, the vector length becomes so short that you're practically
>> running scalar code.
>
> I'm not an algorithm-expert so had to look up BLAS1:
> http://people.sc.fsu.edu/~jburkardt/m_src/blas1/blas1.html
>
> I am happy to admit that I am insufficiently imaginative to be able
> to assess what you say here, Andrew.  Would it be ok to ask if you
> could provide a concrete example (a reference to an example
> implementation online, preferably in C or Python, or perhaps a
> MATLAB or similar API name)?

Following the BLAS theme, matrix multiplication is the best example.  There's great temporal locality, so just as cache blocking is an effective technique to reduce cache misses, register blocking is an effective technique to reduce the number of loads and stores.  But register blocking requires having lots of registers.
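
To make the register-pressure point concrete, here is a sketch (not
from the thread) of a 2x2 register-blocked kernel: the four
accumulators stay live in registers across the whole k loop, and a 4x4
block would need sixteen, which is where the architectural register
count starts to bite:

/* illustrative C only: C += A*B, row-major, m and n assumed even */
for (int i = 0; i < m; i += 2)
    for (int j = 0; j < n; j += 2) {
        double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
        for (int k = 0; k < p; k++) {
            double a0 = A[i][k],  a1 = A[i+1][k];
            double b0 = B[k][j],  b1 = B[k][j+1];
            c00 += a0*b0;  c01 += a0*b1;
            c10 += a1*b0;  c11 += a1*b1;
        }
        C[i][j]   += c00;  C[i][j+1]   += c01;
        C[i+1][j] += c10;  C[i+1][j+1] += c11;
    }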


> Also, I don't know if you spotted it but I proposed also adding the
> zero-overhead loop to proposed Simple-V, which might help?  I don't
> honestly know, I'd need to see and analyse an example.

Fetching instructions wastes lots of energy, and zero-overhead loops are one way to mitigate that fact while improving performance.  But when there's vector parallelism, the best way to exploit it is a vector machine.

Luke Kenneth Casson Leighton

Apr 7, 2018, 9:30:24 AM
to Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Jacob Bachmeyer, Cesar Eduardo Barros, chuanhua.chang
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sat, Apr 7, 2018 at 12:49 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> Following the BLAS theme, matrix multiplication is the best example.

Ok so let me grab a random example to illustrate and work from.
Apologies I like to go through things at this sort of level of detail,
so as to avoid the possibility of mistakes.
https://www.programmingsimplified.com/c-program-multiply-matrices

for (c = 0; c < m; c++) {
    for (d = 0; d < q; d++) {
        for (k = 0; k < p; k++) {
            sum = sum + first[c][k]*second[k][d];
        }
        multiply[c][d] = sum;
        sum = 0;
    }
}

Right! ok! So, sum is computed in the inner loop, and that looks fine
(initially): it's just a straight multiply-and-accumulate, right?
Except... hang on, let me rewrite to illustrate:

for (c = 0; c < m; c++) {
    float *firstc = first[c];

    for (d = 0; d < q; d++) {
        for (k = 0; k < p; k++) {
            sum = sum + firstc[k]*second[k][d];

Ok so now firstc[k] can be "vectorised" in a contiguous fashion, but
clearly whooops second[k][d] cannot... unless the vector-based ISA has
a means and method of referencing memory by non-contiguous offsets.

Which is particularly interesting to me as Aspex Microelectronics had
exactly this same problem. An ASP was a 4096-element *MASSIVELY* wide
SIMD architecture with 2-bit (!) ALUs and 256-bits of
Content-Addressable RAM per ALU. Their VASP-G chip, around 2003, was
the tenth largest ASIC being developed in the world at the time, and
they were a 15-man company with only 4 full-time hardware design
engineers!

So their architecture basically divided things down by having a
global LOAD/STORE DMA Engine that worked completely independently of
the main SIMD Engine. That DMA Engine had (in earlier versions) up to
a *THREE* dimensional linearisation system, which effectively meant
that they could "rotate" memory in 16 possible arbitrarily-offset
"rotations" if you know what I mean.

absolutely frickin awesome.

Anyway, I notice in Hwacha (I'm assuming V-Ext has something similar?)
at least 1-dimensional non-contiguous offsets in VLOAD and VSTORE,
which aaactuallyyyy.. in this particular case you could deploy with a
nasty hack:

for (d = 0; d < q; d++) {
    float *secondhack = &second[0][d];
    for (k = 0; k < p; k++) {
        sum = sum + firstc[k]*secondhack[k*q];   // stride = one row

Or something to that effect, you get the general idea: the addition of
d is external to the inner k-loop, leaving room for making a
1-dimensional non-contiguous offset of size of the.. err... row?
column? You get the idea (sorry, algorithms are where I often end up
with a strange form of logic-dyslexia).


So back to Simple-V. Could Simple-V have this added to it in a way
that's both useful, does not have implementors running away screaming
at the thought of adding of vast amounts of CSRs, and straightforward
to implement?

I *believe* the answer to that might be yes, by permitting perhaps
one (or two at the most) special-purpose CSRs that contain:

* the register number which, if used in a LOAD/STORE that has had
v-length set, "activates" non-contiguous incrementing
* the amount by which to increment memory (specified in bytes),
should this feature *be* activated.

That alone would have some very powerful general-purpose
memory-reordering capabilities, if there was an inner loop comprising
solely a LOAD followed by a STORE, so I would advocate the addition of
two independent "non-contiguous LOAD/STORE" CSRs. Aside from anything
else, that would allow a source register to be, say, 32-bit wide,
multiplied by a 32-bit value to create a 64-bit answer that needed to
be stored in a 64-bit vector-destination.

Does that look reasonable, and something that could hypothetically be
extended by a (modified) V-Ext?

Remember, one of the primary reasons to consider this exercise is
precisely to make it easier for implementors to adopt first Simple-V
and *then* V-Ext (or even just V-Ext for some of the more esoteric
data-types and opcodes, just in scalar form, should the need arise.
assuming that's feasible and that the instructions *have* meaning in a
non-vector context. VFIRST would not, for example).

And at the very least, reduce V's arithmetic opcode space by something
like EIGHTY FIVE possibly even NINETY PERCENT (!!) if my preliminary
estimates are correct, through re-use of standard RV opcodes. Even
MADD and MSUB could hypothetically be shared with
P-reduced-to-basics-as-DSP, for an even *greater* opcode reduction.


>> Also, I don't know if you spotted it but I proposed also adding the
>> zero-overhead loop to proposed Simple-V, which might help? I don't
>> honestly know, I'd need to see and analyse an example.
>
>
> Fetching instructions wastes lots of energy,

[and fetching data, too. Learned that one from Jeff Bush (Nyuzi). Do
more whilst stuff's in the registers, avoid hitting cache.]

> and zero-overhead loops are one
> way to mitigate that fact while improving performance. But when there's
> vector parallelism, the best way to exploit it is a vector machine.

Well, the Simple-V proposal is a hybrid to bring both
variable-bitwidth (Packed) and variable-length in a general-purpose
fashion to RV, and Richard Herveille gave an absolutely awesome
single-instruction FFT example he'd seen that used zero-overhead [1],
so although it's not exactly strictly "Vector" I'd be reluctant to
recommend excluding zero-overhead just because a vector machine could
do better.

Not least, where Simple-V fails to be able to do an arbitrary-length
vector, a zero-overhead loop could conceivably be deployed as an
outer-loop to a Simple-V operation, gaining back some performance.
It's by no means perfect, and reminds me *shudder* of the SIMD-is-bad
article in so many ways, but I was anticipating that V *on top* of
Simple-V would remove those kinds of kinks, providing automatic
register-bank loading or a separate Vector register-file or...
something! I'm hand-waving here, apologies.

l.

[1] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/S82Fz5JeAgAJ

Jacob Bachmeyer

Apr 8, 2018, 12:59:33 AM
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
Luke Kenneth Casson Leighton wrote:
> On Sat, Apr 7, 2018 at 4:53 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Luke Kenneth Casson Leighton wrote:
>>
>>> is that https://github.com/riscv/riscv-isa-manual ? if so, many many
>>> apologies, i just did a "git pull" and it's still at V2.3-draft:
>>>
>>> commit a9d7704765360679c1a5e3fa06e0b0e41d6c5f26
>>>
>> The RISC-V specs use a "modular versioning" scheme. Chapter 17 in that
>> commit is RVV 0.4-draft, even though the manual as a whole is v2.3-draft.
>>
>
> oink?? oh YUK. Ok I scanned back to the beginning of the chapter
> (p97) and yes I finally at long last spotted it mentions "0.4-DRAFT".
> nowhere else. That's *horribly* confusing.
>

It makes a bit more sense after considering that each chapter is
maintained as a separate LaTeX file, all of which are combined into
the manual. Still good for a nasty surprise when first found, though.
(There have been complaints on isa-dev about proposed features not being
in the manual as cited; turned out that the complainer was looking at a
version that had actually been released, while the proposed features
were only in the latest draft.)

Perhaps the drafts of the manual should have a warning about this on
their title page? Maybe even include the Git commit ID on the title
page for a draft spec (or every copy of the spec)?

>>> Ok so the Conditional and also the Reductions is one of the reasons
>>> why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
>>> of a decent name) i proposed that it be implemented as "if you say r0
>>> is to be a vector / SIMD that means operations actually take place on
>>> r0,r1,r2... r(N-1)". i.e. you don't have a separate vector register
>>> file.
>>>
>> This is a problem: implementations must either have a large number of
>> regfile read ports or abandon parallelism for this case.
>>
>
> yes. Nothing wrong with that [as long as you mean "abandon internal
> parallelism" and do "virtual parallelism" as Broadcom put it [1]).
> Christopher Cello gave a very useful one-line summary of how it is
> quite common to implement "virtual parallelism" by keeping a scalar
> ALU internlly actually scalar.
>
> I suspect that there may be some cross-over here, in this post [2].
> Summary: external appearance of vector-parallelism *does not* have to
> be implemented as *internal* parallelism.
>

While this is true, the benefits of vector operations come primarily
from parallelism, so throwing that away seems to me to defeat the purpose.

>> SLIDE and SELECT will work in RVV for that, since the vector unit must have
>> its own (wide) path to memory.
>>
>
> Again, I suspect that this is cross-over, you may be assuming that it
> is *always* the case that a vector unit *will* be implemented as a
> N-symmetric ALU, where N is the level of internal parallelism.
> Christopher and I have both pointed out that it is perfectly valid
> (and desirable) to have N=1.
>

While some implementations may be able to make use of that, particularly
the quasi-independence of the RVV vector unit permitting the scalar unit
to continue while the vector unit works through an operation, N>1 is the
major motivation for vector processing. For N=1, RISC-V has the scalar
unit.

> [2] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/E3k1uklmAgAJ
>
A minor request, when citing emails on isa-dev, could you give
message-ids or at least use the search-by-message-id links? I ask
because I prefer to read email in an email program and message-ids are
links that I can easily follow locally. They will also remain valid
even long after Google is gone and someone is reading this in some
distant future archive.


-- Jacob

Jacob Bachmeyer

Apr 8, 2018, 1:02:31 AM
to Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
Andrew Waterman wrote:
> On Sat, Apr 7, 2018 at 2:31 AM, Luke Kenneth Casson Leighton
> <lk...@lkcl.net> wrote:
>
>> ---
>> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
> [...]
>
>
>> Also, I don't know if you spotted it but I proposed also adding the
>> zero-overhead loop to proposed Simple-V, which might help?  I don't
>> honestly know, I'd need to see and analyse an example.
>
>
> Fetching instructions wastes lots of energy, and zero-overhead loops
> are one way to mitigate that fact while improving performance. But
> when there's vector parallelism, the best way to exploit it is a
> vector machine.

Aha! So there is a use for vector processing with only a single lane.


-- Jacob

Jacob Bachmeyer

Apr 8, 2018, 1:46:35 AM
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
Luke Kenneth Casson Leighton wrote:
I believe that RISC-V calls this "stride" after Hwacha's similar feature.

The vector load for a row of the first matrix would be "VLD v0,
(<first[c]>)" while the load for a column of the second matrix (assuming
both are in row-major order as that example does) would be "VLD v1,
(<second+d>)[<sizeof(second[])>]" which is a vector-load that picks
values spaced the length of a row in second starting at second+d, so v1
gets second[0][d],second[1][d],second[2][d],... The dot product of v0
and v1 is stored into multiply[c][d]. There are various ways to
parallelize this and I have of course omitted the outer loops that are
necessary if the row and column do not fit entirely into the vector unit
at once.
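
In plain C terms that strided column load amounts to the following
(sketch only; vl, v1 and the array shapes are as in the matrix example
above):

float *base = &second[0][d];
size_t stride = sizeof(second[0]) / sizeof(second[0][0]); // elements per row
for (int i = 0; i < vl; i++)
    v1[i] = base[i * stride];     // == second[i][d]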

> [...]
>
> So back to Simple-V. Could Simple-V have this added to it in a way
> that's both useful, does not have implementors running away screaming
> at the thought of adding of vast amounts of CSRs, and straightforward
> to implement?
>
> I *believe* the answer to that might be yes, by permitting perhaps
> one (or two at the most) special-purpose CSRs that contain:
>
> * the register number which, if used in a LOAD/STORE that has had
> v-length set, "activates" non-contiguous incrementing
> * the amount by which to increment memory (specified in bytes),
> should this feature *be* activated.
>
> That alone would have some very powerful general-purpose
> memory-reordering capabilities, if there was an inner loop comprising
> solely LOAD followed by STORE, so I would advocate the addition of two
> independent "non-contiguous LOAD/STORE" CSRs. Aside from anything,
> that would allow a source register to be say 32-bit wide, multiplied
> by a 32-bit to create a 64-bit answer that needed to be stored in a
> 64-bit vector-destination.
>
> Does that look reasonable, and something that could hypothetically be
> extended by a (modified) V-Ext?
>

For LOAD/STORE, could a better option be to interpret the offset in the
opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is
configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12),
t5 = *(t2+24), t6 = *(t2+36)? Perhaps include a bit in the
vector-control CSRs to select between offset-as-stride and unit-stride
memory accesses?
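
A C-ish sketch of that reading (x[] standing for the integer register
file; T2, T3 and mem64() are invented names):

/* "LOAD t3, 12(t2)" with t3 configured as a length-4 vector base:
   the 12-byte immediate is reinterpreted as the stride */
for (int i = 0; i < 4; i++)
    x[T3 + i] = mem64(x[T2] + i * 12);  // t3..t6 = *(t2), *(t2+12),
                                        //          *(t2+24), *(t2+36)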

> Remember, one of the primary reasons to consider this exercise is
> precisely to make it easier for implementors to adopt first Simple-V
> and *then* V-Ext (or even just V-Ext for some of the more esoteric
> data-types and opcodes, just in scalar form, should the need arise.
> assuming that's feasible and that the instructions *have* meaning in a
> non-vector context. VFIRST would not, for example).
>
> And at the very least, reduce V's arithmetic opcode space by something
> like EIGHTY FIVE possibly even NINETY PERCENT (!!) if my preliminary
> estimates are correct, through re-use of standard RV opcodes. Even
> MADD and MSUB could hypothetically be shared with
> P-reduced-to-basics-as-DSP, for an even *greater* opcode reduction.
>

I remain unconvinced that this kind of reduction is realistic to expect,
except for the possible case of the "vector-tail" implementation for
which I have a vague outline. Essentially, with "vector-tail", the
scalar unit becomes vector lane 0, and additional vector lanes operate
in parallel with it.

> [..]
>
>> and zero-overhead loops are one
>> way to mitigate that fact while improving performance. But when there's
>> vector parallelism, the best way to exploit it is a vector machine.
>>
>
> Well, the Simple-V proposal is a hybrid to bring both
> variable-bitwidth (Packed) and variable-length in a general-purpose
> fashion to RV, and Richard Herveille gave an absolutely awesome
> single-instruction FFT example he'd seen that used zero-overhead [1],
> so although it's not exactly strictly "Vector" I'd be reluctant to
> recommend excluding zero-overhead just because a vector machine could
> do better.
>

There are slides
(<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
that describe the programming model planned for RVV, with an example and
comparisons to other models (SIMD, SPMT). Can you make a similar
comparison between RVV and Simple-V? I think that would make Simple-V
much easier to understand, at least for me.

> Not least, where Simple-V fails to be able to do an arbitrary-length
> vector, at least zero-overhead loop could conceivably be deployed as
> an outer-loop to a Simple-V operand, gaining back some performance.
>

Zero-overhead loop deployed atop RVV could also be useful.


-- Jacob

Andrew Waterman

Apr 8, 2018, 3:54:44 AM
to Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Richard Herveille, Guy Lemieux, Cesar Eduardo Barros, chuanhua.chang
That's one important reason, but there's also a performance argument.  Single-lane machines can still exploit some degree of vector ILP, e.g., an arithmetic operation chaining from a load.  Although the load and ALU op individually execute at most one element per cycle, the two instructions can collectively execute two elements per cycle, because there are no structural hazards between the memory unit and the ALU.





lk...@lkcl.net

Apr 8, 2018, 6:00:14 AM
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com, jcb6...@gmail.com


On Sunday, April 8, 2018 at 6:46:35 AM UTC+1, Jacob Bachmeyer wrote:
> Luke Kenneth Casson Leighton wrote:
>
> I believe that RISC-V calls this "stride" after Hwacha's similar feature.


Okay!  very cool.  ok so *sigh* looking at the Hwacha vector-strided ISA manual [1] instead of the V-Ext ref (because we don't have it), Section 6.1, p16.  Checking the table on page 127 of V2.3-Draft User-ISA (which i found a couple of days ago), VLD, VLDS and VLDX all exist, so they indicate potential relevance.  [Andrew this is a *lot* of inconvenience that we're being forced to go to, effectively black-box reverse-engineering what is supposed to be an *Open* ISA, making communication for everybody that much harder.]

The VR-type format description gives us the hint that there are address registers: "base" tells us which of the (256???! wha-ha-yeowww, that's a lot) address registers to use.

I'm looking at section 6.1.2 of Hwacha ISA, the following pseudo-code:

if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided

for (int i=0; i<vl; ++i)
  if ([!]preg[p][i])
    for (int j=0; j<seglen+1; j++)
      vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];

notes:

* stride (unit or from areg[as2]) is loop-invariant.
* areg[as1] is loop-invariant i.e. there's no double-level memory indirection (wheww, that would be awkward to introduce to base RV).
* seglen is actually selecting consecutive vector registers (vd + j) and the indexing i is selecting the element *in* consecutive vector registers.

I do like that.  Thoughts:

* The most pressing question is, why were address registers left out of Base RV?  It makes me nervous to suggest adding them to Simple-V, although the flip-side is: adding them to Simple-V would be a way to *get* address registers into the (main, non-specialist) spec.
* seglen: because Simple-V advocates using the standard register file(s), which are much smaller than the vregs in supercomputers, I'm not sure it makes sense to have seglen in hardware; it may be better to explicitly code the seglen loop as standard assembler instead.
* I do appreciate that the inner seglen loop could (if stride==unit-stride) result in contiguous memory-reads.


>  * the register number which, if used in a LOAD/STORE that has had
> v-length set, "activates" non-contiguous incrementing
>  * the amount by which to increment memory (specified in bytes),
> should this feature *be* activated.
 
> Does that look reasonable, and something that could hypothetically be
> extended by a (modified) V-Ext?
>  

> For LOAD/STORE, could a better option be to interpret the offset in the
> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is
> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12),
> t5 = *(t2+24), t6 = *(t2+36)?  Perhaps include a bit in the
> vector-control CSRs to select between offset-as-stride and unit-stride
> memory accesses?

Yeah that sounds perfectly reasonable and powerful to me.

So there would be an instruction like this:

| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
| opcode | 5 bit | 1 bit             | 1 bit             | 5 bit, OFFn=XLEN |

which would mean:

* CSR-Offset register n <= (float|int) register number N
* CSR-Offset Stride-mode = offset or unit
* CSR-Offset amount register n = contents of register M

LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):

offs = 0
stride = 1
vector-len = CSR-Vector-length register N

for (o = 0; o < 2; o++)
  if (CSR-Offset register o == M)
      offs = CSR-Offset amount register o
      if (CSR-Offset Stride-mode == offset)
          stride = ldoffs
      break

for (i = 0; i < vector-len; i++)
    r[N+i] = mem[(offs*i + r[M+i])*stride]

Again, my logic-dyslexia may kick in and get this wrong; help appreciated.


> Remember, one of the primary reasons to consider this exercise is
> precisely to make it easier for implementors to adopt first Simple-V
> and *then* V-Ext (or even just V-Ext for some of the more esoteric
> data-types and opcodes, just in scalar form, should the need arise.
> assuming that's feasible and that the instructions *have* meaning in a
> non-vector context.  VFIRST would not, for example).
>
> And at the very least, reduce V's arithmetic opcode space by something
> like EIGHTY FIVE possibly even NINETY PERCENT (!!) if my preliminary
> estimates are correct, through re-use of standard RV opcodes.  Even
> MADD and MSUB could hypothetically be shared with
> P-reduced-to-basics-as-DSP, for an even *greater* opcode reduction.
>  

> I remain unconvinced that this kind of reduction is realistic to expect,

 I found the table of Instruction Listings (V2.3-Draft, Chapter 19: you want pages 125 and 126); it's *nearly* identical to Hwacha.  I'm currently adding them to the page http://libre-riscv.org/simple_v_extension/ so that I can demonstrate that yes, it really is near 85% duplication.  I will add a second column to the table to show the equivalent "base RV instruction".
 

> except for the possible case of the "vector-tail" implementation for
> which I have a vague outline.  Essentially, with "vector-tail", the
> scalar unit becomes vector lane 0, and additional vector lanes operate
> in parallel with it.

 apologies i don't follow: the terminology "vector lane" is unfamiliar.  could you elaborate?
 
> There are slides
> (<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
> that describe the programming model planned for RVV, with an example and
> comparisons to other models (SIMD, SPMT).

oo i've seen those before.
 
> Can you make a similar
> comparison between RVV and Simple-V?  I think that would make Simple-V
> much easier to understand, at least for me.


Sure.  I did however do a really quick review, and i think it's as simple as "remove the v on the instruction"... *as long as* vector-packed-bitwidth is not set.   I'll post separately though.

l.

lk...@lkcl.net

Apr 8, 2018, 6:19:27 AM
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com, jcb6...@gmail.com


On Sunday, April 8, 2018 at 5:59:33 AM UTC+1, Jacob Bachmeyer wrote:
> Luke Kenneth Casson Leighton wrote:
>
> It makes a bit more sense after considering that each chapter is
> maintained as a separate LaTeX file, all of which are combined into
> the manual.

They should be separate PDFs until such time as the spec is finalised.  Or, only "frozen" Exts placed into the one big PDF.

> Perhaps the drafts of the manual should have a warning about this on
> their title page?  Maybe even include the Git commit ID on the title
> page for a draft spec (or every copy of the spec)?

 that's a damn good idea.
 
>> I suspect that there may be some cross-over here, in this post [2].
>> Summary: external appearance of vector-parallelism *does not* have to
>> be implemented as *internal* parallelism.
>
> While this is true, the benefits of vector operations come primarily from
> parallelism, so throwing that away seems to me to defeat the purpose.


Not at all.  It's down to flexibility and pragmatism.  Even Broadcom don't put *sixteen* FPUs down into VideoCore-IV, but they'd clearly like the option to do that *transparently* in the future, keeping the ISA identical and not forcing developers to recompile code (or compiler writers to go SIMD-style-mental).

Implementors may *choose* to set the internal parallelism to 1.  *or* they may choose to set the internal parallelism (duplicating the ALU block) all the way up to the maximum vector length.

What you may be considering to be a "degenerate pointless case" is in fact an extremely simple way for implementors to gain *instruction-compatibility* of Vector capabilities... *without duplication of ALUs*.

Also, there is even more: even in the most powerful 3D GPUs and Vector VPUs in the world, they *still* don't duplicate *all* functions into parallel ALU blocks.  Broadcom, in VideoCore-IV *specifically* decided to not add 4x parallel blocks implementing the RECIP-SQRT, adding instead only one of those because it's not very commonly used.

If following the suggestion that the parallelism be explicit and yet still allow implementors to choose which instructions to parallelise and which not to parallelise, the ISA quickly becomes hell-on-earth, as vendor fragmentation on that decision would need an insane amount of run-time detection of the internal parallel vector unit width.  SIMD Width Hell but on a per-operation basis.

So it's really quite simple, you make the instructions uniformly *look* like they're parallel, and give implementors the freedom to choose precisely where to focus their time and energy.

>> [2] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/E3k1uklmAgAJ
>
> A minor request, when citing emails on isa-dev, could you give
> message-ids or at least use the search-by-message-id links?

I thought it was doing exactly that: on the google-groups web interface I'm clicking the down-arrow drop-down to right-above of each message (to the right of where you normally hit "reply"), selecting the "Link" then cut/pasting that.  This results in a link to a specific *message* in the discussion thread rather than a link to the "top of the discussion thread".  Is that working as intended or is it causing difficulties?  If you can describe instructions I am happy to follow them.

l.

lk...@lkcl.net

Apr 8, 2018, 9:42:18 AM
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com, jcb6...@gmail.com
[jacob i'm moving this reply in its entirety from the P-Ext to the V-thread]

On Sun, Apr 8, 2018 at 5:35 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl wrote:

>>  oh.  Jacob, are you saying that because you believe that to achieve
>> vector parallelism or SIMD parallelism it is *always* required to add
>> multiple simultaneous ports (like dual-ported RAM) to the register
>> file?
>>  
>
>
> If the instructions are to *actually* be executed in parallel, yes.

 So, nothing wrong with that: in fact (aside from the argument that it is important for implementation-varying internal parallelism to be up to the vendor, right from a simple implementation all the way to a hyper-full-on-supercomputer, *AND* still having the *EXACT* same instruction set for each), that would be another extremely good supporting argument for *not* forcing implementors to implement parallel engines.

 you certainly would not, in an embedded area-conscious Vector or SIMD engine, wish to be *forced* to implement dual-ported or quad-ported registers.  That would simply mean that (like V0.4 V right now) the adoption barrier is so high that nobody will use it.


>>  In my previous reply to you, I proposed a hypothetical implementation
>> that stood in between the register file and the ALU, sequentially
>> redirecting register indexes for e.g. r0 to instead be r0..r(N-1) each
>> *one* per clock cycle.
>>  
>
>
> This implementation has implicit iteration, but does not actually execute
> the vector operations in parallel, so there is no gain over baseline RISC-V.

 except for the incredibly important fact that binary-compiled applications would transparently run way way faster on future hardware that *did* implement internal parallelism.

>>  So are you implying that even to implement that simple an abstractor,
>> where there is *still only one* fetch/store from the register file per
>> cycle, that implementors would *still* need to add extra ports to the
>> register file?
>>  
>
>
> In that case, no, the register file needs only the baseline 2R1W.  However,
> in that case, the "abstractor" has significantly extended the critical path
> length, 

 i believe i covered that by saying that I only mention one potential implementation as "Reference", and that implementors would more than likely be capable of creating implementations which did not have significant increases in the critical path.  However until someone actually looks at it closely I honestly do not know.  We don't have Hwacha's implementation to look at (grrr).

> while not providing any improvement (ignoring code size reductions)
> over simply unrolling the implied loop in software.

 It provides drastic improvement: the implied loop in software would now need to know the *hardware* parallelism width, which means an absolute nightmare we know and recognise already as "SIMD is Bad".  see the Aspex ASP example if you want to see how TRULY insane that can get.

 You may be dramatically underestimating quite how strategically important the single-internal-ALU (vector=1) case really is.  However, I've given quite a few explanations and reasons and I'm reluctant to repeat them (on-list); it would, I think, be really useful to specifically review the "Note on implementation of parallelism" section of http://libre-riscv.org/simple_v_extension/ and see if that specifically covers everything that you've thought of, Jacob.  Would that be ok?

> Yes, but if the ALU is not duplicated to provide multiple vector lanes, what
> is the use of the vector extension? 

 I believe I have answered this multiple times now, across two separate threads.  If not please do let me know, but could I ask that it specifically be in the context of assessing whether "Note on implementation of parallelism" is adequate?


> RVV, with its quasi-independent vector
> unit, at least can perform a long-running vector operation while the scalar
> unit continues execution, but the simple-V model reuses the scalar execution
> resources for vector processing.

 That's right.  And the same applies for the P-Ext-with-SIMD-removed-so-its-a-DSP.  Which is an advantage as I see it because you get the benefits of V-ifying the entirety of the standard RV ISA (*and* future Extensions) *without* requiring all the specialist features of V-Ext... that only apply to floating-point anyway.

 However I *do* recognise the value of what you are saying, and had started to wonder (18 hours ago) about letting V (modified-V) specify that the register bank is to be the *vector* bank not the integer or floating point bank.

 This would be _really_ awesome because it would then (implementation-dependent notwithstanding) be theoretically possible to issue Simple-V vector operations, Simple-V integer operations *and* Simple-V floating-point operations, all at once.  Maybe not in full because under the proposal you can only discern FP and Integer through the *base opcode*.

Still, being able to do integer-based SIMD at the same time as Vector FP is still awesome and actually desirable (3D).


> If the regfile is already enough of the required resources for an RVI
> processor to justify an RVE base with half of the registers, I doubt any
> CSR-per-register approach is likely to be well-received.  :-/

 yehyeh :)  Well V is pretty insane CSR-wise and nobody's objected, and I believe that Christopher Cello said SRAM could be used, I'd like to explore how that would work in practice.  Does an SRAM block assume only one operation at a time because it's row-column addressed?

 Also, having devised the trick of having only one or two CSRs which *mark* one (or two) registers as "having Vectorised offsets" I'm inclined to go "hmmm, that trick might work to reduce the number of Simple-V CSRs as well".  The only problem being: if you have too few, such that you cannot mark all registers that you need to efficiently parallelise a particular operation, you Just Messed Up.

 To illustrate that, here are the two proposals:

* integer reg 0 *and* fp reg 0 share CSR vlen 0 *and* CSR packed-bitwidth 0
* integer reg 1 *and* fp reg 1 share CSR vlen 1 *and* CSR packed-bitwidth 1
* ...

 or instead:

* CSR vlen 0 *and* CSR packed-bitwidth 0 contain extra bits specifying an *INDEX* of WHICH int/fp register they refer to
* CSR vlen 1 *and* CSR packed-bitwidth 1 contain extra bits specifying an *INDEX* of WHICH int/fp register they refer to
* ...

We would have to be very *very* careful about not implementing too few of those (or too many).  A sketch of the two layouts follows.
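
 To make the trade-off concrete, here is a sketch of the two layouts as data structures (purely illustrative: all names are mine, and the entry counts are arbitrary):

from dataclasses import dataclass

# proposal 1: one CSR pair per register number; int reg N and
# fp reg N share entry N, so lookup is a direct index
@dataclass
class PerRegisterEntry:
    vlen: int        # vector length for int/fp reg N
    bitwidth: int    # packed bitwidth for int/fp reg N

# proposal 2: a small pool of entries, each naming WHICH register
# it applies to, so lookup is a (short) associative search
@dataclass
class IndexedEntry:
    regnum: int      # index of the int/fp register this refers to
    vlen: int
    bitwidth: int

per_register = [PerRegisterEntry(1, 64) for _ in range(32)]  # 32 entries
indexed      = [IndexedEntry(0, 1, 64) for _ in range(2)]    # only 2

 The former spends CSR space to get a trivial lookup; the latter saves CSR space but makes every decode a search, which is exactly the latency question raised below.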


>>   I note that there's some general-purpose
>> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
>> don't precisely know what those are for.
>
> Those store the details of which vector registers are used and what data
> type each architectural vector register holds.  They are explained in
> sections 17.2 through 17.5.

 ok.  got it.

> Every instruction would need to read at least one such CSR or possibly two. 

 (My understanding of this is improving; see above.)

> What does an ADD of two different-sized vectors do in simple-V?

* if the two source operands are not the same length, throw an exception.
* if the destination operand is also a vector, and the source is longer than the destination, throw an exception.
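
A minimal sketch of those two rules (all names here are mine, purely illustrative):

def raise_illegal_instruction(msg):
    raise ValueError("illegal instruction: " + msg)

def check_add_lengths(src1_vlen, src2_vlen, dest_vlen, dest_is_vector):
    # rule 1: source vector lengths must match
    if src1_vlen != src2_vlen:
        raise_illegal_instruction("source vector lengths differ")
    # rule 2: a vector destination must be long enough for the result
    if dest_is_vector and src1_vlen > dest_vlen:
        raise_illegal_instruction("result longer than destination")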

>>  Help evaluating those assertions greatly appreciated.
>
>
> In terms of storage space, it is comparable to (or smaller than) what is
> needed for the current RVV draft.  In terms of dynamic effects, I think that
> it is more expensive than the current RVV draft, 

 Or, you mean: because RVV is its own Extension (and engine), it's less expensive for the rest of RV.

> since simple-V must read a
> length CSR for (almost) every instruction executed to determine whether that
> instruction is scalar or vector. 

 Yup.  Well... implementors don't *have* to implement it: they could just not implement Simple-V :)

 So now we come to a crunch-point, and again I'd like to reply to this individually, in a separate thread / subject.

> And what about instructions like JALR? 
> What does jumping to a vector do?  (Sounds like a great way to spawn threads
> in a multi-processor to me.)

 coooool!

 Ok, so not all instructions would be parallelised.  However, catching the exception and *actually* spawning multiple threads would be pretty amazing.

l.

lk...@lkcl.net

unread,
Apr 8, 2018, 10:31:24 AM4/8/18
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com, jcb6...@gmail.com
On Sun, Apr 8, 2018 at 5:35 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> In terms of storage space, it is comparable to (or smaller than) what is
> needed for the current RVV draft.  In terms of dynamic effects, I think that
> it is more expensive than the current RVV draft, 
> since simple-V must read a
> length CSR for (almost) every instruction executed to determine whether that
> instruction is scalar or vector. 

 Ok so this is crunch-time, and it's an implication that I had not thought of until we got this far.

 It could indeed have been logically deduced (or expected): if the opcodes are overloaded to have different meanings, there is guaranteed to be some state, somewhere, directly related to the registers.  There are several cases (a decode sketch follows the list):

* All operands vector-length=1 (scalars), all operands packed-bitwidth="default": instructions are passed straight through as if Simple-V did not exist.  Simple-V is, in effect, completely disabled.
* At least one operand vector-length > 1, all operands packed-bitwidth="default": any parallel vector ALUs are placed on "alert"; virtual-parallelism looping may be activated.
* All operands vector-length=1 (scalars), at least one operand packed-bitwidth != default: degenerate SIMD case, with implementation-specific complexity (packed decode before the ALUs or *in* the ALUs).
* At least one operand vector-length > 1, at least one operand packed-bitwidth != default: parallel vector ALUs (if any) are placed on "alert", virtual-parallelism looping may be activated, and the implementation-specific SIMD complexity kicks in (packed decode before the ALUs or *in* the ALUs).
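
A sketch of how a decoder might classify those four cases (every name here is hypothetical, and the default bitwidth is an assumption):

from collections import namedtuple

Operand = namedtuple("Operand", ["vlen", "bitwidth"])
DEFAULT_BITWIDTH = 64   # assumption: elements default to XLEN wide

def classify(operands):
    any_vector = any(op.vlen > 1 for op in operands)
    any_packed = any(op.bitwidth != DEFAULT_BITWIDTH for op in operands)
    if not any_vector and not any_packed:
        return "scalar passthrough"   # Simple-V effectively disabled
    if any_vector and not any_packed:
        return "vector"               # ALUs on alert, looping possible
    if not any_vector and any_packed:
        return "packed SIMD"          # degenerate SIMD case
    return "packed vector"            # both mechanisms active

# e.g. classify([Operand(4, 64), Operand(1, 64)]) -> "vector"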

Bear in mind that I'm also proposing that the decision whether to parallelise in hardware, to virtual-parallelise (to dramatically simplify compilers and to avoid the SIMD instruction-proliferation nightmare), *or* to use a transparent combination of both, be made on a *per-operand basis*.  That way, implementors can specifically choose to create an application-optimised implementation that they believe (or know) will sell extremely well, without "Extra Standards-Mandated Baggage" that would otherwise blow their area or power budget completely out of the window.
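
As an illustration of "a transparent combination of both" (this is my sketch, not part of the proposal text; issue_to_alus is a stand-in for whatever the micro-architecture actually does):

def issue_to_alus(op, first_element, count):
    # stand-in: issue `count` elements to the parallel ALUs at once
    print(f"{op}: elements {first_element}..{first_element + count - 1}")

def execute_vector_op(op, vlen, hw_lanes):
    # hw_lanes=1 gives pure virtual (sequential) parallelism;
    # hw_lanes>=vlen gives pure hardware parallelism; anything in
    # between transparently combines the two
    i = 0
    while i < vlen:
        n = min(hw_lanes, vlen - i)   # elements issued this step
        issue_to_alus(op, i, n)
        i += n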

Additionally, two possible CSR schemes have been proposed, in order to greatly reduce CSR space:

* per-register CSRs (vector-length and packed-bitwidth)
* a smaller number of CSRs with the same information, but with an *INDEX* specifying WHICH register in one of three regfiles (vector, fp, int) the length and bitwidth apply to.


For an operation involving (referring to) register M, the latter (indexed) scheme would be implemented as (pseudo-code):

bitwidth  = default   # default bitwidth for the opcode
vectorlen = 1         # scalar unless a CSR entry says otherwise

for o in range(2):    # max number of CSR-Vector_* regs: 2
    if CSR_Vector_registernum[o] == M:
        bitwidth  = CSR_Vector_bitwidth[o]
        vectorlen = CSR_Vector_len[o]
        break

whereas for the former (per-register) scheme it would simply be:

bitwidth  = CSR_Vector_bitwidth[M]
vectorlen = CSR_Vector_len[M]

Also bear in mind that, for reasons of simplicity, I was coming round to the idea of permitting implementors to choose exactly which bitwidths they would like to support in hardware, and which to allow to fall through to software-trap emulation.

So Jacob's question boils down to:

* whether either (or both) of those two CSR schemes has significant latency that could even potentially require an extra pipeline decode stage
* whether implementations can be conceived which do *not* introduce significant latency and thus would not require an extra pipeline stage
* whether it is possible to switch OFF the decoding entirely, either explicitly (by quite simply disabling the Simple-V extension) or implicitly (by detecting the case all-vlens=1, all-SIMD-bitwidths=default), perhaps even to the extreme of dynamically skipping an entire pipeline stage (if one is needed)
* whether packed bitwidths and the associated regfile splitting are so complex that it should definitely, definitely be made mandatory for implementors to move the regfile splitting into the ALU, and what the implications of that are
* whether, even if that *is* made mandatory, software-trapped "unsupported bitwidths" are still desirable, on the basis that SIMD is such a complete nightmare that *even* a software implementation is better, making Simple-V have more in common with a software API than anything else (a sketch follows this list)
* anything else
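
On that software-trap point, a sketch of the idea (the supported set below is an arbitrary example, and the trap itself would be whatever illegal-instruction mechanism the platform already has):

HW_SUPPORTED_BITWIDTHS = {8, 16, 32, 64}   # one implementation's choice

def issue_packed_op(op, bitwidth):
    if bitwidth not in HW_SUPPORTED_BITWIDTHS:
        # trap: the handler emulates the operation element-by-element
        raise NotImplementedError(f"software-emulate {op} at {bitwidth}-bit")
    # otherwise dispatch to the packed-SIMD hardware path
    print(f"hardware: {op} at {bitwidth}-bit")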

I appreciate that this is a heck of a lot to go over.  It may take some time.  I will be at the RISC-V Workshop: I was considering proposing a BOF to discuss this?

l.

lk...@lkcl.net

unread,
Apr 8, 2018, 2:37:35 PM4/8/18
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, richard....@roalogic.com, glem...@vectorblox.com, ces...@cesarb.eti.br, chuanhu...@gmail.com, jcb6...@gmail.com


On Sunday, April 8, 2018 at 6:46:35 AM UTC+1, Jacob Bachmeyer wrote:
> I believe that RISC-V calls this "stride" after Hwacha's similar feature.
>
>> Well, the Simple-V proposal is a hybrid to bring both
>> variable-bitwidth (Packed) and variable-length in a general-purpose
>> fashion to RV, and Richard Herveille gave an absolutely awesome
>> single-instruction FFT example he'd seen that used zero-overhead [1],
>> so although it's not exactly strictly "Vector" I'd be reluctant to
>> recommend excluding zero-overhead just because a vector machine could
>> do better.
>
> There are slides
> (<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
> that describe the programming model planned for RVV, with an example and
> comparisons to other models (SIMD, SPMT).  Can you make a similar
> comparison between RVV and Simple-V?  I think that would make Simple-V
> much easier to understand, at least for me.


So, as promised, I thought it best to split answering this into a separate reply, after a bit of rest / thought.  The interim answer I gave remains, I believe, the same (or, to be clear: the *intent* is the same).  The SAXPY example, for instance, remains *exactly the same*... just with standard RV LOAD-FP, FMADD and so on directly substituted in place of the corresponding vector instructions.

Now, if there is any reason why, in the Simple-V proposal, that is *not* the case, then the Simple-V proposal is wrong and should be modified until it *is* the case :)

The only gotcha is: where V anticipated absolutely enormous register files (256 entries and above), Simple-V is definitely and deliberately limited to the size of the int / float register file.  Realistically, the amount of parallelism you'd get would be, say, up to 8 parallel 64-bit 2-operand operations at once, because that already takes *twenty-four* registers (op1: 8, op2: 8, dest: 8), and even that's probably pushing it.

Software writers *could* hypothetically use more: there's nothing to say you can't max out the entire register file as Vectors; it's just that you'd need some registers to remain scalars.
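
Writing that register-budget arithmetic out explicitly (numbers taken from the paragraph above):

INT_REGFILE_SIZE = 32    # standard RV integer register count
vlen = 8                 # elements per vector operand
operands = 3             # op1, op2, dest
used = vlen * operands   # = 24 registers consumed by one such op
spare = INT_REGFILE_SIZE - used   # = 8 left for scalars / loop control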

So let me go over this SAXPY example and understand it:

for (i = 0; i < n; i++) {
  y[i] = a*x[i] + y[i];
}

a0: n, a1: a, a2: *x, a3: *y

stripmine:
 vsetvl t0, a0
 vlw v0, a2
 vlw v1, a3
 vfma v1, a1, v0, v1
 vsw v1, a3
 slli t1, t0, 2
 add a2, a2, t1
 add a3, a3, t1
 sub a0, a0, t0
 bnez a0, stripmine

oo.  ok.  So, the interesting thing is that "vsetvl t0, a0" tries to put an arbitrary-length number (n, as stored in a0) into t0 as a vector length.  Then, at the end of the loop, t0 is *subtracted* from a0.  So what that implies is: t0 was *not* assigned exactly equal to a0; it was presumably assigned the *maximum vector size* in all but the final iteration.

niiiiiiice :)  I'm impressed.
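
Writing out what I now believe vsetvl does (a sketch; MVL stands for the hardware's maximum vector length, and both names are mine):

MVL = 4   # example: an implementation with a maximum vector length of 4

def vsetvl(requested_n):
    # grant the smaller of what was asked for and what the hardware
    # can do; the final loop iteration simply gets the remainder
    return min(requested_n, MVL)

Which is exactly why "sub a0, a0, t0" followed by "bnez a0, stripmine" terminates for any n.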

However, a point of um order, Andrew?  It um... wasn't completely obvious to me how that worked up until um... right now!  Which, like, given how much I've been reading recently is hilarious but also indicative of a... Documentation Fail, shall we say? arggh!  now I understand what you were talking about with the zero-overhead loop stuff!  rrriiiight... time to revisit *that* conversation :)

So yes, Jacob: even with that total fail on my part to comprehend exactly how vsetvl actually works until now, the above example does (or, more to the point, should/must) remain *exactly* the same, just substituting the standard, implicitly-overloaded RV opcodes in place of the explicit vector ones: vfma -> FMADD, and so on.

Does that make sense?

l.

