SIMD (or better, simpler Vector) extension proposal: invitation to assess viability.


Luke Kenneth Casson Leighton

Apr 2, 2018, 5:51:50 AM
to RISC-V ISA Dev
mindful of the lessons from this:
https://www.sigarch.org/simd-instructions-considered-harmful/

i have been looking at the bare minimum to get a
commercially-acceptable (embedded) GPU [0] and Video Decode Engine [1]
made from RISC-V and it's quite tricky. software-defined GPUs
typically do not do so well: see Intel Larrabee for an example, which
Nyuzi replicated (the difference being that Jeff Bush published
exactly *where* Nyuzi's resources disappeared: 40% on rasterisation
alone).

so i looked at the Vector Extension, and was struck by how
comprehensive it is. its deployment in its full capacity would result
in some of the most powerful supercomputers in the world. that it is
based on lessons from Cray makes the proposal an amazing piece of
inspirational work. polymorphic tagging to keep the ISA clean and
minimalist. variable-length vectors so that (just as shown in the
article above) there's no ridiculous "cleanup" or "edge cases" as
there are in SIMD.

... so the question is: why is there only one known implementation
(hwacha), its design source code not available for further
enhancement, study, analysis and ultimately widespread deployment?
more to the point: why has *nobody else* created an interoperable
implementation?

i believe that the answer to that latter key question lies
unfortunately in how comprehensive V is. its potential is its
downfall: due to how much work it would take to implement, there is
nothing close to V which would allow any implementor, except those
with huge engineering resources, to work *towards* a full
implementation of V.

with that as background i was wondering if there was any interest in
collaborating to design a simpler SIMD extension, specifically
learning from the Vector proposal. to start the discussion, i feel
there are four key things:

1 aside from the usual 8,16,32,64-bit integers and 32 and 64-bit
floats, adding half-precision floating-point is becoming recognised as
important for saving power and increasing performance on Tensors and
in the latter phases of 3D graphics. so the proposal would be: add
8-bit integer operands as mandatory and 16-bit FP as optional, and
apart from that no data types would be made mandatory. if an
implementation has only RV32 with no floating-point, that's perfectly
fine: the implementor would not be *required* to add floating point in
order to be compliant with the proposed SIMD/Simpler-V, they would
simply extend support for all *existing* types into vectors.

2 rather than have separate vector register banks, i'd advocate that
the vector operations operate on the *standard* register bank. if a
vector length is set to 3, and an operation is requested to be
performed on register 0, it is implied to be carried out on registers
1 and 2 as well. obviously the maximum vector length would then be
defined by the size of the register set [of a given implementation].
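A minimal sketch of point 2's semantics (Python standing in for hardware behaviour; the function name and layout are hypothetical, purely for illustration): an operation on register rd with vector length VL is implied to repeat across the next VL-1 registers.

```python
def vector_op(regs, op, rd, rs1, rs2, vlen):
    """Apply a scalar ALU operation across vlen consecutive registers.

    vlen == 1 is the ordinary scalar case, which is how existing
    instructions keep their meaning when no vector "tag" is set.
    """
    for i in range(vlen):
        regs[rd + i] = op(regs[rs1 + i], regs[rs2 + i])
    return regs

# e.g. "ADD x1, x4, x7" with vector length 3 behaves as:
#   ADD x1,x4,x7 ; ADD x2,x5,x8 ; ADD x3,x6,x9
```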

3 whilst polymorphism goes (sadly), the right of implementors to
choose whether to do full parallelism, full sequential operations, or
anything in between, would remain, as would the ability to operate
scalar-vector as well as vector-vector.

4 *no* new arithmetic instructions would be added (or be mandatory)
beyond those *already* part of the existing (non-vector-based) RISC-V
G standard.

so aside from 8-bit integer operations, existing implementations can "slide
in" the proposed SIMD/Simpler-V "under the rug" so to speak. meaning:
the perceived advantage of such a proposal, apart from the
all-important addition of half-precision and the division of registers
down to the 8-bit level, is that it could be implemented by leveraging
(and incrementally improving) existing implementations.

also by tagging registers (making them into groups / vectors) and yet
allowing the same instructions to be executed, *including* Compressed
ones, the instruction decode engine is not even modified! it's pretty
much *one* extra instruction: set (or clear) the vector "tag" (and
vector length) on a given register! i.e. if the length is set to 1,
behaviour clearly reverts to normal (scalar).

also, interestingly, even on implementations that stick purely to a
sequential way of processing vectors, there may actually be a
performance increase and a code-size reduction due to not needing an
explicit for-loop in certain situations.

the only other consideration - one which would make the implementation
a bit more complex but would be extremely powerful - would be to add a
way to set one register as a tag/mask. bits in that register would
indicate whether a given indexed operand in the vector was to be
entirely skipped or not. whilst i recognise just how powerful
tag/masking really is, i'm reluctant to advocate it, for the sake of
keeping the proposal simple.
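A hedged sketch of that tag/mask idea (Python as illustration only; the function name is hypothetical): bit i of a designated mask register decides whether element i of the vector is skipped entirely.

```python
def masked_vector_op(regs, op, rd, rs1, rs2, vlen, mask):
    """Vector operation where element i is entirely skipped when
    bit i of the mask register is clear."""
    for i in range(vlen):
        if (mask >> i) & 1:
            regs[rd + i] = op(regs[rs1 + i], regs[rs2 + i])
    return regs
```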

are there any major pitfalls in such a proposal? thoughts appreciated
on the viability of the above [2].

l.

[0] minimum acceptable level of performance would be comparable to the
original MALI400. roughly of the order of 100 million triangles/sec
would be a reasonable target.
[1] defined as being capable of doing 1080p60 MPEG, H264, H265 etc.
real-time decode.
[2] https://www.youtube.com/channel/UCnFP0IU4gpnmcLnVzDLUtfw

Jacob Bachmeyer

Apr 5, 2018, 1:17:17 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> mindful of the lessons from this:
> https://www.sigarch.org/simd-instructions-considered-harmful/
>
> i have been looking at the bare minimum to get a
> commercially-acceptable (embedded) GPU [0] and Video Decode Engine [1]
> made from RISC-V and it's quite tricky. software-defined GPUs
> typically do not do so well: see Intel Larrabee for an example, which
> Nyuzi replicated (the difference being that Jeff Bush published
> exactly *where* Nyuzi's resources disappeared: 40% on rasterisation
> alone).
>
> so i looked at the Vector Extension, and was struck by how
> comprehensive it is. its deployment in its full capacity would result
> in some of the most powerful supercomputers in the world. that it is
> based on lessons from Cray make the proposal an amazing piece of
> inspirational work. polymorphic tagging to keep the ISA clean and
> minimalist. variable-length vectors so that (just as shown in the
> article above) there's no ridiculous "cleanup" or "edge cases" as
> there are in SIMD.
>
> ... so the question is: why is there only one known implementation
> (hwacha), its design source code not available for further
> enhancement, study, analysis and ultimately wide-spread deployment?
> more to the point: why has *nobody else* created an interoperable
> implementation?

The main reason that there is no RVV implementation is that RVV is not
finalized yet. RVXhwacha is actually subtly different: Hwacha is a
*vector-fetch* architecture, where the vector unit has its own
instruction fetch/decode and is its own specialized hart. RVV will use
the scalar unit's instruction fetch and decode resources.


-- Jacob

Luke Kenneth Casson Leighton

Apr 5, 2018, 10:41:39 AM
to Jacob Bachmeyer, RISC-V ISA Dev, Richard Herveille, Clifford Wolf
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Thu, Apr 5, 2018 at 6:17 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The main reason that there is no RVV implementation is that RVV is not
> finalized yet. RVXhwacha is actually subtly different: Hwacha is a
> *vector-fetch* architecture, where the vector unit has its own instruction
> fetch/decode and is its own specialized hart. RVV will use the scalar
> unit's instruction fetch and decode resources.

indeed. i genuinely mean it when i say that i am deeply impressed by
how powerful and comprehensive RVV is. i did notice that section
17.12 of the V2.3 draft (vector instruction formats) effectively means that RVV is
an entirely independent processor / ALU, with nothing really in common
with the rest of RISC-V's instruction set, at all.

and that's really my point: that if there's nothing in common in the
ALU with *existing* RISC-V ALUs, that makes for an awful lot of work
(and money, and risk) for anyone wishing to contribute and help
accelerate RISC-V vector adoption.

hence i proposed a much simpler idea that borrows *directly* from
section 17.4 (Representation Encoding) and section 17.5 (Element
Bitwidth), and heck, if adopted and developed *before* RV-V, could
potentially be developed with RV-V in mind such that RV-V could
*extend* the (simpler) proposal's Rep-Encoding and El-Bwid to add the
extra data types (GF, complex numbers, etc.) i.e. make sure there's
plenty of "reserved words" in the (simpler) proposal.

basically the simpler proposal would utilise not just the scalar
unit's instruction fetch and decode resources but would utilise and
extend *all* RV-* register-based extensions including base ones, thus
saving implementors huge amounts of money, time and resources and
allowing RV-V to be added incrementally with much less risk and
development effort.

how? because if the simpler proposal is already in place,
potentially many of RV-V's more "basic" instructions would be
redundant (covered by other Extensions including Base), thus
potentially reducing the size of the RV-V instruction space.

also, with this simpler proposal, as richard pointed out, having an
element bitwidth instruction where the vector length is set to "1"
solves the issue of how to add 8-bit and 16-bit instructions to
standard RV32/64/128 without polluting or altering [frozen]
pre-existing Extensions.

l.

lkcl

Apr 6, 2018, 2:00:14 PM
to RISC-V ISA Dev, chuanhua.chang, Richard Herveille, Jacob Bachmeyer
So the cross-discussion and input from different sources (P-Ext,
B-Ext) was extremely useful but getting a little hairy, so I began an
online (open editable, open contributable, feel free to do exactly
that) document here:

http://libre-riscv.org/simple_v_extension/

I have included cross-references at the bottom to various relevant
discussions so that even I, who loves considering things from a
holistic perspective, don't lose track and instead go completely round
the bend.

i've begun with a rationale (always start with "why?"), then moved on
to analyse four out of five of the main strategic beneficial aspects
of both the P and V extensions, to ascertain whether and why each of
those aspects should be included:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops.
* Masks / tagging (selecting/preventing certain indexed elements from execution)

Masks / tagging is something that needs a bit more thought: the others
are really quite clear i.e. in the context of what i understand RISC-V
to be about, there's clear and obvious winners (and one clear and
obvious loser, sadly much as i love it: polymorphism). Masks /
tagging is *not* so clear [to me].

Again, still on the "why", outlined towards the end of this post [1] I
make a case for splitting out the best of the parallelistic aspects of
V and P extension proposals, for *general-purpose* use across *all*
RISC-V Extensions. This in turn would make both V and P much cleaner,
as well as reduce the number of opcodes that each requires (dealing
with the instruction duplication issue present in both V and P as they
stand), and many other benefits besides.

I've not yet added the full rationale, in depth, to the online
editable document: it is still at the moment embedded in this thread
(at the time I was just going "hmmm....").

Also I've specifically avoided going into details of what the encoding
of the instructions should actually *be*, as that would not make sense
yet without some in-depth analysis and consensus of what the
instructions should actually do. Not least here, if those
instructions were to be designed to be *extended* by for example V (to
add Matrices or N-way Hypercube support for example), that's the
point at which I am both "out of scope" and "out of my depth".

thoughts and constructive feedback welcomed.

l.

[1] https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IkLkQn3HvXQ/SEMyC9IlAgAJ

lk...@lkcl.net

Apr 8, 2018, 11:00:05 AM
to RISC-V ISA Dev, chuanhu...@gmail.com, richard....@roalogic.com, jcb6...@gmail.com, wate...@eecs.berkeley.edu
With thanks to input from Jacob, Richard, C. Cesar and Andrew, this proposal has got to the "crunch-point" phase, with enough fledgling understanding of what CSRs would be needed to assess complexity and latency.  The relevant section is here:


It may take some time; assistance greatly appreciated in evaluating this section (at the thread below), particularly from a hardware and implementation overview perspective, as it could potentially require an additional pipeline phase.  With apologies in advance to Jacob as I have not yet worked out how to do explicit "msgid" links, the online (google web-interface) link is here:


lk...@lkcl.net

Apr 8, 2018, 2:56:18 PM
to RISC-V ISA Dev, chuanhu...@gmail.com, richard....@roalogic.com, jcb6...@gmail.com, wate...@eecs.berkeley.edu
Oh.  I just realised a really big "plus" of Simple-V's [*] proposal to re-use standard RV opcodes for 85% of V and about 10% of P: it becomes possible to use some of the Compressed instructions.

l.

[*] don't you just love it when you use the word "Simple" and things
     turn out to be nothing of the sort? :)
     $ wc simple_v_extension.mdwn
     879  5994 38239 simple_v_extension.mdwn

lk...@lkcl.net

Apr 16, 2018, 11:32:06 PM
to RISC-V ISA Dev
On the basis of comments (Michael, Jacob, Andrew, others: many many thanks) I have been slowly crystallising on a plan that would, crucially, allow RVV to be topologically morphed to fit on top of Simple-V (and cover SIMD as well).  This involves keeping absolutely every single feature of RVV as it stands, and abstracting out the parallelism to an "indirect" model which marks registers as being "scalar or vector", and hides predication as well.

* Instructions (present *and* future) including Compressed ones become implicitly parallel
* SIMD ISA proliferation is greatly reduced
* Many other side-by-side benefits outlined here http://libre-riscv.org/simple_v_extension/#parallelism_comparisons

A key insight is that, whilst absolutely superb and without which this entire Abstracted-Parallelism proposal would simply not be possible to even begin, RVV's benefits and power are extremely likely to be ignored by SIMD advocates, resulting in RISC-V "by default" falling into the usual massively seductive proliferation of instructions seen in the majority of processors and DSPs today.

I've therefore worked out:

* a scheme to indirectly overload (retro-fit) the standard branch operands as "predicated equivalents" - including the Compressed variants (making room in the process for *four* branch opcodes)
* worked out how to fit RVV / Hwacha's LD/LD.S/LD.X and ST/ST.S/ST.X into the standard LOAD/STORE instruction space (including indirectly fitting predication), and
* worked out a way for SIMD to be vectorised that allows implementors the freedom to choose whether to actually split the register file into small bitwidth chunks (with associated ALUs of the same width) or whether to pass unaltered (unsplit) register files down to ALUs that then perform (predicated) SIMD duplicated-ALU parallel operations, or anything in between, whilst *also keeping all the benefits of Vectors*.
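The implementation freedom described in the last bullet can be sketched as follows (Python as illustration; widths and names are hypothetical). Whether hardware physically splits the register file into small-bitwidth chunks with matching ALUs, or duplicates ALUs over an unsplit register, is invisible at this level: only the packed lane-wise result is architecturally observable.

```python
LANES, WIDTH = 4, 16                 # e.g. a 64-bit register as 4 x 16-bit lanes
LANE_MASK = (1 << WIDTH) - 1

def simd_add(a, b):
    """Lane-wise add on packed values, each lane wrapping independently
    (no carry propagates across lane boundaries)."""
    result = 0
    for i in range(LANES):
        shift = i * WIDTH
        lane = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane << shift
    return result
```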

Many of the notes taken over the past few weeks have been moved to separate pages (one for RVV, another for RVP); much of the rest has been moved to an "Appendix".  It's slowly morphing into a useful document, albeit primarily one that is, out of respect for the extremely comprehensive technical capabilities of the people on this list, rather over-zealous in the "Comparative Analysis" department.

Feedback and insights always appreciated.

Jacob Bachmeyer

Apr 17, 2018, 11:58:59 PM
to lk...@lkcl.net, RISC-V ISA Dev
lk...@lkcl.net wrote:
> On the basis of comments (Michael, Jacob, Andrew, others: many many
> thanks) I have been slowly crystallising on a plan that would,
> crucially, allow RVV to be topologically morphed to fit on top of
> Simple-V (and cover SIMD as well). This involves keeping absolutely
> every single feature of RVV as it stands, and abstracting out the
> parallelism to an "indirect" model which marks registers as being
> "scalar or vector", and hides predication as well.
>
> * Instructions (present *and future) including Compressed ones become
> implicitly parallel
> * SIMD ISA proliferation is greatly reduced
> * Many other side-by-side benefits outlined here
> http://libre-riscv.org/simple_v_extension/#parallelism_comparisons
>
> A key insight is that, whilst absolutely superb and without which this
> entire Abstracted-Parallelism proposal would simply not be possible to
> even begin, RVV's benefits and power are extremely likely to be
> ignored by SIMD advocates, resulting in RISC-V "by default" falling
> into the usual massively seductive proliferation of instructions seen
> in the majority of processors and DSPs today.

The main problem (and reason for RVP to exist) is that RVV leads to a
rather large vector unit, while RVP provides a better fit to embedded
DSPs and similar small systems. The major concern that I have with
Simple-V is the added complexity to the scalar unit and the very short
maximum vector length. (Even the alternate RVP that I have proposed
allows up to XLEN lanes, but Simple-V is limited to less than 16
elements and uses the entire register file for a single operation at
that. RVV has no architectural limit on vector length.)

> [...]
>
> Feedback and insights always appreciated.
>
> http://libre-riscv.org/simple_v_extension/

I have been remiss and will expand the alternate RVP page on that wiki.


-- Jacob

lk...@lkcl.net

Apr 18, 2018, 12:49:30 AM
to RISC-V ISA Dev, lk...@lkcl.net, jcb6...@gmail.com


On Wednesday, April 18, 2018 at 4:58:59 AM UTC+1, Jacob Bachmeyer wrote:

The main problem (and reason for RVP to exist) is that RVV leads to a
rather large vector unit, while RVP provides a better fit to embedded
DSPs and similar small systems.  The major concern that I have with
Simple-V is the added complexity to the scalar unit and the very short
maximum vector length.

It's not _that_ short!  :)  (and Hwacha's MAXVECTORLEN was actually = 4).  16 64-bit floating-point registers when used with SIMD bitwidth of 32 means *thirty two* separate floating-point operations (all potentially simultaneously if an implementor actually laid out 32 parallel FP ALUs internally).

When you set a SIMD bitwidth of 16 that becomes a total SIMD width of *sixty four* elements!  A total of 1024-bit SIMD!  For RV128 that's doubled again, which is just... mad numbers.  2048-bit SIMD operations.
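The arithmetic behind these figures, as a quick check (a sketch only, not part of any spec; the numbers come from the paragraphs above):

```python
def simd_elements(num_regs, reg_bits, elem_bits):
    """Total elements = registers x (register width / element width)."""
    return num_regs * (reg_bits // elem_bits)

def simd_bits(num_regs, reg_bits):
    """Total SIMD width in bits across the whole register group."""
    return num_regs * reg_bits
```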

Bypassing the vectorisation is fairly trivial to detect and can be done with single bits' worth of internal state, created from the CSRs, meaning that scalar execution doesn't have to be adversely affected.  Put another way: is it _really_ so burdensome to have a loop going (in effect) "for (i = 0; i < 1; i++)" ? :)

 (Even the alternate RVP that I have proposed
allows up to XLEN lanes, but Simple-V is limited to less than 16
elements and uses the entire register file for a single operation at
that.  RVV has no architectural limit on vector length.)

There's some things which put this into context:

* Simple-V is designed even for the "base" variant of RVV to "take over" i.e. to have totally separate vector register files (and no... or a *different*... architectural limit on vector length)

* There is actually an architectural limit on vector length in RVV: this however sets no limit on the outer loop *using* VSETL (and likewise in Simple-V).

* Even in Hwacha the design only has 4 actual internal parallel lanes (an architectural limit on the value that VSETL can ever be set to being no greater than 4).

So even in Simple-V I think if you went up to the maximum of 16 elements as *actual* internal hardware parallelism you would end up with a screamingly-fast machine.

Secondly, the main context in which I am considering this (at all) is as the basis for an *embedded* 3D GPU (not an AMD Radeon 850 or an NVidia 1070).  4-wide internal parallelism (per core that is) would be perfectly adequate (and is what the Broadcom VideoCore-IV has).

Summary: there's a way for RVV to take over and get much better performance (with associated costs, reduced greatly by the half-way-house that Simple-V represents), and Simple-V is designed to take into account intermediary targets (embedded, DSP) anyway.

l.

Jacob Bachmeyer

Apr 18, 2018, 1:13:19 AM
to lk...@lkcl.net, RISC-V ISA Dev
lk...@lkcl.net wrote:
> On Wednesday, April 18, 2018 at 4:58:59 AM UTC+1, Jacob Bachmeyer wrote:
>
>
> The main problem (and reason for RVP to exist) is that RVV leads to a
> rather large vector unit, while RVP provides a better fit to embedded
> DSPs and similar small systems. The major concern that I have with
> Simple-V is the added complexity to the scalar unit and the very
> short
> maximum vector length.
>
>
> It's not _that_ short! :) (and Hwhacha's MAXVECTORLEN was actually =
> 4). 16 64-bit floating-point registers when used with SIMD bitwidth
> of 32 means *thirty two* separate floating-point operations (all
> potentially simultaneously if an implementor actually laid out 32
> parallel FP ALUs internally).

This is another line I would prefer to draw -- RVP should only support
integer and fixed-point. For floating-point, RVV should be required. I
am trying to ensure that RVV will actually be used and available, rather
than ignored. I want a RISC-V PC with RVV (and a bunch of other stuff,
off-topic in this thread), dammit! :-)

> When you set a SIMD bitwidth of 16 that becomes a total SIMD width of
> *sixty four* elements! A total of 1024-bit SIMD! For RV128 that's
> doubled again, which is just... mad numbers. 2048-bit SIMD operations.

Yes, but you can only have *two* effective vector registers at those
lengths, and that can only work in the integer register file if you can
tolerate one of your vector elements being the hardwired zero from x0.
This means that you lose most of the benefit of RISC-V's three operand
architecture.

> Bypassing the vectorisation is fairly trivial to detect and can be
> done with single bits worth of internal state, created from the CSRs,
> meaning that scalar execution doesn't have to be adversely affected.
> Put another way: is it _really_ so burdensome to have a loop going (in
> effect) "for i = 0; i < 1; i++)" ? :)

No, that is a perfectly valid implementation of RVP lanes. The big
sticking point I have with the Simple-V concept is the constraints it
imposes on register allocation.

> (Even the alternate RVP that I have proposed
> allows up to XLEN lanes, but Simple-V is limited to less than 16
> elements and uses the entire register file for a single operation at
> that. RVV has no architectural limit on vector length.)
>
>
> There's some things which put this into context:
>
> * Simple-V is designed even for the "base" variant of RVV to "take
> over" i.e. to have totally separate vector register files (and no...
> or a *different*... architectural limit on vector length)

I will need to read it again.

> * There is actually an architectural limit on vector length in RVV:
> this however sets no limit on the outer loop *using* VSETL (and
> likewise in Simple-V).

Ack! You are right -- vector length is at most 2^XLEN. :-)

> * Even in Hwacha the design only has 4 actual internal parallel lanes
> (an architectural limit on the value that VSETL can ever be set to
> being no greater than 4).

The Hwacha unit fabricated had only 4 lanes, but this does *not* limit
the vector length in Hwacha at all. Hwacha is a mixed-parallel-serial
vector engine -- more than four elements in a vector simply means that
the vector unit requires multiple cycles to operate on that vector.
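That mixed parallel-serial behaviour can be written down as a simple cost model (illustrative only; not drawn from any Hwacha documentation):

```python
import math

def vector_passes(vlen, lanes):
    """Elements beyond the lane count simply cost extra passes through
    the lanes: the lane count does not limit the vector length."""
    return math.ceil(vlen / lanes)
```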

> So even in Simple-V I think if you went up to the maximum of 16
> elements as *actual* internal hardware parallelism you would end up
> with a screamingly-fast machine.
>
> Secondly, the main context in which I am considering this (at all) is
> as the basis for an *embedded* 3D GPU (not an AMD Radeon 850 or an
> NVidia 1070). 4-wide internal parallelism (per core that is) would be
> perfectly adequate (and is what the Broadcom VideoCore-IV has).

This would be RVP lanes with four writable bits in the plane CSR and RVP
partitioning to divide those elements into pixel channels, either RGB565
(2/group in RV32) or RGBA8888 (1/group in RV32). RGBZ8888 is similarly
available if depth-buffer values are integers.

> Summary: there's a way for RVV to take over and get much better
> performance (with associated costs, reduced greatly by the
> half-way-house that Simple-V represents), and Simple-V is designed to
> take into account intermediary targets (embedded, DSP) anyway.

The big problem that I keep coming back to with Simple-V is that it
seems to more-or-less need most of an out-of-order superscalar processor
to achieve a parallel implementation. Using contiguous groups out of
the register file is *really* complex if the simple serial
implementation is not good enough, even though the simple serial
implementation is *really* simple.


-- Jacob

lk...@lkcl.net

Apr 20, 2018, 10:09:55 AM
to RISC-V ISA Dev, lk...@lkcl.net, jcb6...@gmail.com


On Wednesday, April 18, 2018 at 6:13:19 AM UTC+1, Jacob Bachmeyer wrote:
 
> It's not _that_ short!  :)  (and Hwhacha's MAXVECTORLEN was actually =
> 4).  16 64-bit floating-point registers when used with SIMD bitwidth
> of 32 means *thirty two* separate floating-point operations (all
> potentially simultaneously if an implementor actually laid out 32
> parallel FP ALUs internally).

This is another line I would prefer to draw -- RVP should only support
integer and fixed-point.  For floating-point, RVV should be required.  I
am trying to ensure that RVV will actually be used and available, rather
than ignored.

 each of us, in separate (overlapping) ways, is I think working towards that goal, in recognition that RVV, whilst extremely powerful, is also a bit too large a step.
 
 I want a RISC-V PC with RVV (and a bunch of other stuff,
off-topic in this thread), dammit!  :-)

 :)
 

> When you set a SIMD bitwidth of 16 that becomes a total SIMD width of
> *sixty four* elements!  A total of 1024-bit SIMD!  For RV128 that's
> doubled again, which is just... mad numbers.  2048-bit SIMD operations.

Yes, but you can only have *two* effective vector registers at those
lengths,

 Ok I would not realistically recommend using that full length, or even anything beyond the actual inherent underlying parallelism: using four F.P. registers when there are exactly 4 parallel FPUs, for example (mirroring how Broadcom's VideoCore-IV is set up, and also Nyuzi).

and that can only work in the integer register file if you can
tolerate one of your vector elements being the hardwired zero from x0.  
This means that you lose most of the benefit of RISC-V's three operand
architecture.

 Only if you really *really* wanted to use absolutely all of the registers, yes.  There are however plenty of instances where 16 at once would be incredibly useful: LOAD/STORE, or FCVT and so on.  I'm almost tempted to make it 32-wide just so that a full set of 32 registers could be LOAD/STOREd with a single instruction.


> Bypassing the vectorisation is fairly trivial to detect and can be
> done with single bits worth of internal state, created from the CSRs,
> meaning that scalar execution doesn't have to be adversely affected.  
> Put another way: is it _really_ so burdensome to have a loop going (in
> effect) "for i = 0; i < 1; i++)" ? :)

No, that is a perfectly valid implementation of RVP lanes.  The big
sticking point I have with the Simple-V concept is the constraints it
imposes on register allocation.

 that's where I would envision RVV taking over.
 
> * Simple-V is designed even for the "base" variant of RVV to "take
> over" i.e. to have totally separate vector register files (and no...
> or a *different*... architectural limit on vector length)

I will need to read it again.


I'm not sure if I've made more than a cursory mention (and a stub somewhere in the Appendix), it's been a really hectic couple of weeks.  The main thing to remember (as much a reminder to myself) is, there's a difference between:

 * The CSR which sets vector length per register (up to 16)
 * The VSETL instruction's "vector length"
 * The number of (internal) parallel ALUs.

At first glance we *might* think that having VSETL (for whatever reason) be capable of going up to only say 4, independently of the number of internal parallel ALUs, would be pointless, but if those ALUs are complex functions that take longer than a single CPU cycle to complete then suddenly it makes sense.

It's complicated!!

> * There is actually an architectural limit on vector length in RVV:
> this however sets no limit on the outer loop *using* VSETL (and
> likewise in Simple-V).

Ack!  You are right -- vector length is at most 2^XLEN.  :-)

 :)  and, interestingly, just like in RVV, nothing to do with the level of internal parallelism.
 

> * Even in Hwacha the design only has 4 actual internal parallel lanes
> (an architectural limit on the value that VSETL can ever be set to
> being no greater than 4).

The Hwacha unit fabricated had only 4 lanes, but this does *not* limit
the vector length in Hwacha at all.  

That's right.  and not in Simple-V either.  For exactly the same reason(s).  I am quite literally taking concept-for-concept exactly RVV (which is Hwacha-like) and transplanting those concepts into the standard register files.
 
Hwacha is a mixed-parallel-serial
vector engine -- more than four elements in a vector simply means that
the vector unit requires multiple cycles to operate on that vector.


Really?  I was under the impression that it actually set MAXVECTORLEN=4. Or, the Hwacha equivalent of Section 17.11 (0.4-Draft V2.3-Draft).

If it _doesn't_ then that would actually, interestingly, mean that it's a hybrid sequential-loop plus parallel hardware system, just like in the Broadcom VideoCore-IV.  Broadcom call the hybrid concept "Virtual Parallelism".

And that would be absolutely fine under Simple-V, too.
 
> So even in Simple-V I think if you went up to the maximum of 16
> elements as *actual* internal hardware parallelism you would end up
> with a screamingly-fast machine.

 *click* I keep making the same mistake, don't I!  I'm also confusing the 3 different lengths!
 
>
> Secondly, the main context in which I am considering this (at all) is
> as the basis for an *embedded* 3D GPU (not an AMD Radeon 850 or an
> NVidia 1070).  4-wide internal parallelism (per core that is) would be
> perfectly adequate (and is what the Broadcom VideoCore-IV has).

This would be RVP lanes with four writable bits in the plane CSR and RVP
partitioning to divide those elements into pixel channels, either RGB565
(2/group in RV32) or RGBA8888 (1/group in RV32).  RGBZ8888 is similarly
available if depth-buffer values are integers.

Interestingly, Z-buffers I believe can be implemented with predication in very few instructions.  A predicated compare, and two register MVs (conditionally carried out based on the compare).  One of those predicated MV operations would be the Z value, the other would be the pixel.

I love predicated vectors!
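That Z-buffer idea might look like the following (a sketch under the Simple-V predication model as discussed above; function and variable names are hypothetical, and Python stands in for the hardware): a predicated compare producing a mask, then two MVs executed only where the mask is set.

```python
def zbuffer_update(z, pix, new_z, new_pix):
    """Per element: if the incoming fragment is nearer, move both its
    depth and its pixel value in; otherwise leave the element alone."""
    pred = [nz < oz for nz, oz in zip(new_z, z)]   # predicated compare
    for i, take in enumerate(pred):
        if take:
            z[i] = new_z[i]        # predicated MV: the Z value
            pix[i] = new_pix[i]    # predicated MV: the pixel
    return z, pix
```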
 

> Summary: there's a way for RVV to take over and get much better
> performance (with associated costs, reduced greatly by the
> half-way-house that Simple-V represents), and Simple-V is designed to
> take into account intermediary targets (embedded, DSP) anyway.

The big problem that I keep coming back to with Simple-V is that it
seems to more-or-less need most of an out-of-order superscalar processor
to achieve a parallel implementation.  Using contiguous groups out of
the register file is *really* complex if the simple serial
implementation is not good enough,

 Y'know, I thought about that, and want to raise a question with Andrew at an appropriate moment: if an implementor does have an OoO implementation with register-renaming, Simple-V could hypothetically go over the "simple" loop and, instead of executing the instructions, just stuff them into the OoO FIFO and leave it at that.  The only fly in the ointment at that point is exceptions: on LOAD/STORE in particular (F.P. ones can be accumulated into an array, and the actual exception thrown only after all elements in a given VSETL length have been executed).
 
even though the simple serial
implementation is *really* simple.

 yehyeh it is.
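For concreteness, here is roughly what that "really simple" serial model amounts to, sketched in Python; the function name and register numbering are invented for illustration, not taken from any Simple-V draft:

```python
# Illustrative only: the "really simple" serial model -- one scalar
# opcode plus a vector length expands into a loop over contiguous
# registers in the *standard* register file.

def sv_add(regfile, rd, rs1, rs2, vl):
    """ADD rd, rs1, rs2 with vector length vl: element i reads
    registers rs1+i and rs2+i and writes rd+i."""
    for i in range(vl):            # vl == 1 degenerates to a scalar ADD
        regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]
    return regfile

regs = list(range(32))             # toy register file: xN holds N
sv_add(regs, 8, 16, 24, 4)         # x8..x11 = x16..x19 + x24..x27
# regs[8:12] == [40, 42, 44, 46]
```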

 I'm going to have to give this some thought: it's a good point.  Remember that I'm endeavouring to cover the DSP (SIMD) case as well, which is (as best I understand it) uuusually a single-issue in-order architecture with relatively short pipelines, where typically every operation completes in the same amount of time.  Really *really* easy to design.   From what I'm seeing of RISC-V, even without RVV or *any* of these proposals, it's not really a good fit to the traditional (single-issue, in-order) SIMD/DSP model.  I'd be really happy to be informed otherwise.

 Anyway, Jacob: one for you to consider (on a separate thread? or the RVP one?) - how would RVP deal with LOAD/STORE exceptions, and Floating-point ones too.  Also I'd be interested to help evaluate how RVP fits with different implementation paradigms (OoO, single-issue, in-order and so on).

l.

Jacob Bachmeyer

unread,
Apr 21, 2018, 12:58:32 AM4/21/18
to lk...@lkcl.net, RISC-V ISA Dev
lk...@lkcl.net wrote:
> On Wednesday, April 18, 2018 at 6:13:19 AM UTC+1, Jacob Bachmeyer wrote:
>
> [...]
>
> > Bypassing the vectorisation is fairly trivial to detect and can be
> > done with single bits worth of internal state, created from the
> CSRs,
> > meaning that scalar execution doesn't have to be adversely
> affected.
> > Put another way: is it _really_ so burdensome to have a loop
> going (in
> > effect) "for i = 0; i < 1; i++)" ? :)
>
> No, that is a perfectly valid implementation of RVP lanes. The big
> sticking point I have with the Simple-V concept is the constraints it
> imposes on register allocation.
>
>
> that's where I would envision RVV taking over.

The problem is how quickly RVV would need to take over from Simple-V:
the standard RISC-V POSIX ABI places the frame pointer at x8, so
registers x9 - x31 are readily available. If we stretch (and spill the
frame and thread pointers to the stack) we can extend that to x4 - x31,
but must not use millicode (x5 is the millicode link register) while
doing this. At most, we can have a total of 28 vector *elements* in
Simple-V, although some programs could potentially exploit the fact that
vectors in Simple-V can overlap. If we want 8-element vectors, we can
only have 3 that do not overlap.
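The register arithmetic above can be double-checked mechanically (standard RISC-V ABI register numbers, as stated in the post):

```python
# Back-of-envelope check of the register budget described above.

free_strict = 31 - 9 + 1     # x9..x31 under the standard ABI
free_stretch = 31 - 4 + 1    # x4..x31, with fp/tp spilled, no millicode
assert free_strict == 23 and free_stretch == 28

elems = 8                    # 8-element Simple-V vectors
non_overlapping = free_stretch // elems
leftover = free_stretch - non_overlapping * elems
# non_overlapping == 3, with leftover == 4 registers to spare
```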

> [...]
>
> > * There is actually an architectural limit on vector length in RVV:
> > this however sets no limit on the outer loop *using* VSETL (and
> > likewise in Simple-V).
>
> Ack! You are right -- vector length is at most 2^XLEN. :-)
>
>
> :) and, interestingly, just like in RVV, nothing to do with the
> level of internal parallelism.

Also just like RVP lanes.

> > * Even in Hwacha the design only has 4 actual internal parallel
> lanes
> > (an architectural limit on the value that VSETL can ever be set to
> > being no greater than 4).
>
> The Hwacha unit fabricated had only 4 lanes, but this does *not*
> limit
> the vector length in Hwacha at all.
>
>
> That's right. and not in Simple-V either. For exactly the same
> reason(s). I am quite literally taking concept-for-concept exactly
> RVV (which is Hwacha-like) and transplanting those concepts into the
> standard register files.

The main problem with Simple-V is the harsh limit on the total number of
vector elements. Hwacha and RVV allow up to 2^XLEN vector length (in
practice, no limit -- hardware is highly unlikely to have 16GiB of
vector register memory). The RVP lanes that I propose allow up to XLEN
lanes. Simple-V allows up to 28 vector *elements*. Each RVP lane has
more data registers than that. (OK, so in practice 28 will be the
number of usable registers in an RVP lane implementation.)

> Hwacha is a mixed-parallel-serial
> vector engine -- more than four elements in a vector simply means
> that
> the vector unit requires multiple cycles to operate on that vector.
>
>
> Really? I was under the impression that it actually set
> MAXVECTORLEN=4. Or, the Hwacha equivalent of Section 17.11 (0.4-Draft
> V2.3-Draft).
>
> If it _doesn't_ then that would actually, interestingly, mean that
> it's a hybrid sequential-loop plus parallel hardware system, just like
> in the Broadcom VideoCore-IV. Broadcom call the hybrid concept
> "Virtual Parallelism".

Yes, that seems to be the intent. The slides from an earlier RISC-V
workshop mentioned a systolic vector unit topology as well.

> And that would be absolutely fine under Simple-V, too.

Also an acceptable implementation for RVP lanes.

> [...]
>
> > Secondly, the main context in which I am considering this (at
> all) is
> > as the basis for an *embedded* 3D GPU (not an AMD Radeon 850 or an
> > NVidia 1070). 4-wide internal parallelism (per core that is)
> would be
> > perfectly adequate (and is what the Broadcom VideoCore-IV has).
>
> This would be RVP lanes with four writable bits in the plane CSR
> and RVP
> partitioning to divide those elements into pixel channels, either
> RGB565
> (2/group in RV32) or RGBA8888 (1/group in RV32). RGBZ8888 is
> similarly
> available if depth-buffer values are integers.
>
>
> Interestingly, Z-buffers I believe can be implemented with predication
> in very few instructions. A predicated compare, and two register MVs
> (conditionally carried out based on the compare). One of those
> predicated MV operations would be the Z value, the other would be the
> pixel.
>
> I love predicated vectors!

Even without predication, Z-buffers are easily enough implemented with
RVP partitioned elements: a packed-data compare, mask the result to
extract the Z channels, then a packed-subtract (using different
boundaries that treat entire pixels as elements) to form an inverse mask
by subtracting 1, AND original with inverse mask, invert mask, AND new
data with mask, OR masked original with masked new. 9 instructions
total, including the CSR writes to change the partitioning boundaries.
Total data needed is 2 mutable inputs and 1 temporary for the
"predicate" mask. Overall, an additional register is needed for
exchanging the partitioning boundaries and another to store the constant
-1, since PADDI will not exist. We could use x9 - x31 (23 registers)
for the partition boundaries, constants, and working data, allowing
seven iterations of this loop to be interleaved and the overhead of the
CSR swap to be amortized over seven groups. In an implementation with
RVP lanes, each group is also replicated across the N lanes, so the
processor could have 7*N groups "in the air" at once.

On RV32, using RGBZ8888 data, this can be even easier: since there is
only one pixel per group, we can use only a single set of partition
boundaries and the RVI SUBI instruction. This eliminates the "global"
registers and two CSR swap instructions, and might allow us to have an
*eighth* group up in the air, for up to 8*32 = 256 pixels processed for
each iteration of the loop on RV32P with 32 lanes.
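The mask-and-merge core of the sequence above is a standard SIMD-within-a-register select; a sketch follows, with Python ints standing in for packed XLEN-bit element groups.  The RVP packed-compare result is supplied as a bit-per-lane value, since its exact encoding is an assumption here:

```python
# Sketch of the mask-and-merge steps described above (one RGBZ8888
# pixel per 32-bit lane).

LANE = 0xFFFFFFFF

def inverse_mask(cmp_bits, nlanes):
    """Per-lane subtract-1 trick: a lane holding 1 (new pixel wins)
    becomes all-zeros; a lane holding 0 borrows down to all-ones."""
    m = 0
    for i in range(nlanes):
        bit = (cmp_bits >> i) & 1
        m |= ((bit - 1) & LANE) << (32 * i)
    return m

def merge(old, new, inv_mask):
    """AND/OR select: keep old where inv_mask is all-ones, new elsewhere."""
    return (old & inv_mask) | (new & ~inv_mask)

# lane 0: new pixel wins; lane 1: old pixel kept (64-bit element group)
inv = inverse_mask(0b01, 2)
out = merge(0x11111111_22222222, 0xAAAAAAAA_BBBBBBBB, inv)
# out == 0x11111111_BBBBBBBB
```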

> > Summary: there's a way for RVV to take over and get much better
> > performance (with associated costs, reduced greatly by the
> > half-way-house that Simple-V represents), and Simple-V is
> designed to
> > take into account intermediary targets (embedded, DSP) anyway.
>
> The big problem that I keep coming back to with Simple-V is that it
> seems to more-or-less need most of an out-of-order superscalar
> processor
> to achieve a parallel implementation. Using contiguous groups out of
> the register file is *really* complex if the simple serial
> implementation is not good enough,
>
>
> Y'know I thought about that, and want to raise a question with Andrew
> at an appropriate moment: if an implementor does have an OoO
> implementation with register-renaming, Simple-V could hypothetically
> quite literally go over the "simple" loop, instead of executing
> instructions just stuffing them into the OoO FIFO and leaving it at
> that. The only fly in the ointment at that point is: Exceptions (on
> LOAD/STORE in particular. F.P. ones can be turned into an array, and
> the actual exception thrown only after all elements in a given VSETL
> length have all been executed).

Instructions must be atomic; they either complete entirely or fault,
with no effect on the register file.

> even though the simple serial
> implementation is *really* simple.
>
>
> yehyeh it is.
>
> I'm going to have to give this some thought: it's a good point.
> Remember that I'm endeavouring to cover the DSP (SIMD) case as well,
> which is (as best I understand it) uuusually a single-issue in-order
> architecture with relatively short pipelines, where typically every
> operation completes in the same amount of time. Really *really* easy
> to design. From what I'm seeing of RISC-V, even without RVV or *any*
> of these proposals, it's not really a good fit to the traditional
> (single-issue, in-order) SIMD/DSP model. I'd be really happy to be
> informed otherwise.

My RVP proposal is specifically intended for DSP applications, including
a programming model that should be familiar to DSP programmers.

> Anyway, Jacob: one for you to consider (on a separate thread? or the
> RVP one?) - how would RVP deal with LOAD/STORE exceptions, and
> Floating-point ones too. Also I'd be interested to help evaluate how
> RVP fits with different implementation paradigms (OoO, single-issue,
> in-order and so on).

Group LOAD/STORE must be atomic -- essentially, all translations are
resolved and permissions verified *before* any actual memory access
occurs. RVP intentionally does not support floating-point; partitioning
an FP unit in the way RVP partitions an integer unit is a can of worms,
DSPs often do not support floating-point anyway, fixed-point is integer
arithmetic for all the hardware cares, and RVV should be needed at some
point. RVFP could have an additional pflane CSR for FP lanes, which
would *not* be conflated with integer lanes, and would *not* have
gather-LOAD/scatter-STORE capability, since only an integer lane can
address memory.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Apr 21, 2018, 1:51:19 AM4/21/18
to Jacob Bachmeyer, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 5:58 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The problem is how quickly RVV would need to take over from Simple-V:

Oh: if you're doing a supercomputer i'd say *immediately*. As in,
RVV (on top of Simple-V) would not touch the standard register files
*at all*.

With the benefit that adding for example special standards-approved
*or custom* instructions would end up *not needing to be added to
RVV*.

That's one of the key strategic down-sides of RVV.  If you want a
special Vectorised operation (such as RECIP-SQRT) you're absolutely
hosed, stuck between three choices: go with a non-standard CUSTOM
version of RVV, with the associated duplication of the ENTIRE set of
vector opcodes into CUSTOM space; write a special parallel custom
extension whose sole exclusive job is parallel RECIP-SQRT; or add a
custom Vector opcode, which can't then be used for scalar operations
and so has to be DUPLICATED into the scalar opcode space as well.

RVV, for all its value, places people into the position of making
almost impossible choices.

RECIP-SQRT btw is a really important part of 3D (it's for
normalisation of 3D vectors - 1/sqrt(x^2 + y^2 + z^2)).
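To make the cost concrete: normalisation needs 1/sqrt(x^2 + y^2 + z^2), i.e. a square root *and* a divide, which is exactly what a single (typically approximate) fused reciprocal-square-root operation replaces.  A trivial sketch:

```python
# Sketch of why a fused RECIP-SQRT matters for 3D vector normalisation.

import math

def normalise(x, y, z):
    rs = 1.0 / math.sqrt(x * x + y * y + z * z)   # the RECIP-SQRT step
    return (x * rs, y * rs, z * rs)

nx, ny, nz = normalise(3.0, 0.0, 4.0)
# approximately (0.6, 0.0, 0.8) -- a unit-length vector
```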


> The main problem with Simple-V is the harsh limit on the total number of
> vector elements. Hwacha and RVV allow up to 2^XLEN vector length (in
> practice, no limit --

ok, so this is where even I'm having difficulty keeping the 3
"lengths" clear and separate.  When I said "concept-for-concept" I
meant to make it clear that *even allowing up to 2^XLEN vector length*
is also included in Simple-V.  Now, whether implementors choose to
actually do that, instead setting it at a much lower limit, that's
entirely up to them.

> Even without predication, Z-buffers are easily enough implemented with RVP
> partitioned elements: a packed-data compare, mask the result to extract the
> Z channels, then a packed-subtract (using different boundaries that treat
> entire pixels as elements) to form an inverse mask by subtracting 1, AND
> original with inverse mask, invert mask, AND new data with mask, OR masked
> original with masked new. 9 instructions total, including the CSR writes to
> change the partitioning boundaries. Total data needed is 2 mutable inputs

Ok, so 3D is a... harsh mistress. Let me find the link to Jeff's
Nyuzi paper... ok https://github.com/jbush001/NyuziProcessor/wiki, you
want the 2016 paper.

Basically he explains that the metric to aim for is "instructions per
pixel" (or if we want to get technical, "instructions per pixel per
watt"). And in the *FAILED* (for a given value of "fail") Nyuzi
implementation, that metric came out at somewhere around 50
instructions per pixel.

I spent some time with Jeff, evaluating what that would mean when
comparing area-for-area against MALI400.  Even with 4 Nyuzi cores
giving a similar area / power budget to MALI400, at 50 instructions
per pixel the performance came out FOUR times too low: the budget
works out to roughly 12.5 instructions per pixel.

four!!

Does that help put 9 instructions just for the Z-buffer into context?
:) I do appreciate it's 9 divided by the "level of parallelism".


> Instructions must be atomic; they either complete entirely or fault, with no
> effect on the register file.

Vector LOAD/STORE is... complicated, as Andrew kindly explained.
It's just not that cut-and-dried. Section 4.6 here:
http://people.eecs.berkeley.edu/~krste/thesis.pdf

and Jeff Bush's documented exploration into the same space:
https://jbush001.github.io/2015/11/03/lost-in-translation.html

It's complicated enough that I'm giving serious consideration to
recommending that Simple-V implementors be permitted to implement
Vector LOAD/STORE in software, as an Exception/Trap.  The pseudo-code's
pretty straightforward.
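A hypothetical sketch of that trap-based approach (all names invented; a dict stands in for the address space, and the per-element fault behaviour is the precise-exception subtlety the thesis chapter above discusses):

```python
# Hypothetical sketch: a vector LOAD the hardware does not implement
# traps, and the handler expands it into vl scalar loads.

def trap_vector_load(regfile, memory, rd, base, vl, stride=1):
    """Emulate 'vector load rd, (base)' element by element.  A fault on
    element i leaves elements 0..i-1 already written back."""
    for i in range(vl):
        addr = base + i * stride
        if addr not in memory:
            raise MemoryError(f"fault on element {i} at {addr:#x}")
        regfile[rd + i] = memory[addr]
    return regfile

mem = {0x100 + i: 7 + i for i in range(4)}
regs = trap_vector_load([0] * 32, mem, 8, 0x100, 4)
# regs[8:12] == [7, 8, 9, 10]
```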

l.

Jacob Bachmeyer

unread,
Apr 21, 2018, 2:34:36 AM4/21/18
to Luke Kenneth Casson Leighton, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Sat, Apr 21, 2018 at 5:58 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> The problem is how quickly RVV would need to take over from Simple-V:
>>
>
> Oh: if you're doing a supercomputer i'd say *immediately*. As in,
> RVV (on top of Simple-V) would not touch the standard register files
> *at all*.
>
> With the benefit that adding for example special standards-approved
> *or custom* instructions would end up *not needing to be added to
> RVV*.
>
> That's one of the key strategic down-sides of RVV. If you want to
> have a special Vectorised operation (such as RECIP-SQRT) you're
> absolutely hosed: stuck between having to go with a non-standard
> CUSTOM version of RVV.. with associated duplication of the ENTIRE
> VECTOR OPCODES INTO CUSTOM SPACE..... or you have to consider writing
> a special parallel custom extension whose sole exclusive job is:
> parallel RECIP-SQRT. or you have to have a custom Vector opcode.
> where now you can't use that for scalar operations so have to
> DUPLICATE it into the scalar opcode space as well.
>
> RVV, for all its value, places people into the position of making
> almost impossible choices.
>

Why not a custom extension that just adds VRSQRT to RVV? Encode VRSQRT
in the custom opcode space, while keeping RVV in its standard encoding.

> RECIP-SQRT btw is a really important part of 3D (it's for
> normalisation of 3D vectors - 1/(x^2 + y^2 + z^2)).
>
>
>
>> The main problem with Simple-V is the harsh limit on the total number of
>> vector elements. Hwacha and RVV allow up to 2^XLEN vector length (in
>> practice, no limit --
>>
>
> ok, so this is where even I'm having difficulty keeping the 3
> "lengths" clear and separate. When I said "concept-for-concept" I
> meant to make it clear that *even allowing up to 2^XLEN vector length*
> is also included in Simple-V*. Now, whether implementors choose to
> actually do that, instead setting it at a much lower limit, that's
> entirely up to them.
>

How? Simple-V uses groups from the integer register file to store
vectors. The outer software loop can of course iterate for longer
vectors (arguably even beyond 2^XLEN if available disk storage exceeds
the address space), but Hwacha and RVV (in theory) permit up to 2^XLEN
elements to be processed with a *single* instruction using hardware
iteration.

>> Even without predication, Z-buffers are easily enough implemented with RVP
>> partitioned elements: a packed-data compare, mask the result to extract the
>> Z channels, then a packed-subtract (using different boundaries that treat
>> entire pixels as elements) to form an inverse mask by subtracting 1, AND
>> original with inverse mask, invert mask, AND new data with mask, OR masked
>> original with masked new. 9 instructions total, including the CSR writes to
>> change the partitioning boundaries. Total data needed is 2 mutable inputs
>>
>
> Ok, so 3D is a... harsh mistress. Let me find the link to Jeff's
> Nyuzi paper... ok https://github.com/jbush001/NyuziProcessor/wiki, you
> want the 2016 paper.
>
> Basically he explains that the metric to aim for is "instructions per
> pixel" (or if we want to get technical, "instructions per pixel per
> watt"). And in the *FAILED* (for a given value of "fail") Nyuzi
> implementation, that metric came out at somewhere around 50
> instructions per pixel.
>
> I spent some time with Jeff, evaluating what that would mean when
> comparing area-for-area against MALI400. Even with 4 Nyuzi Cores
> giving a similar area / power budget to MALI400, 50 instructions per
> pixel was FOUR times too low.
>
> four!!
>
> Does that help put 9 instructions just for the Z-buffer into context?
> :) I do appreciate it's 9 divided by the "level of parallelism".
>

There are a few further optimizations possible on RV32P with RGBZ8888,
due to the Z-buffer affecting the entire XLEN-bit element-group. First,
only one partitioning is needed; the mask construction can use the
scalar SUBI. This reduces the instruction count to seven. Second, if
we add a per-lane-predicated STORE that checks a flag in each lane, we
can further reduce this to 2 instructions (packed-compare, mask result),
possibly at the cost of moving the read-modify-write into hardware,
since I expect N*XLEN-bit cachelines. In other words, we may lose what
we gain from predicated STORE in additional hardware cycles required to
*execute* the predicated STORE. (This also ignores the difficulties
with encoding predicated STORE, although bit-gather and writing the
plane CSR is an option. That would make predicated STORE itself
effectively four instructions: bit-gather, swap plane, STORE, swap plane.)

So, at best, 7/32 on RV32P with RGBZ8888 and RVP with both partitioning
and lanes. That is 7/32 out of ... 12.5 instructions per pixel or 1.75%
of the total calculation budget for the Z-buffer on a 32-lane
processor. RV64P is a bit different, since now there can be 2 pixels
per XLEN-bit element-group *and* twice as many lanes, but requiring nine
instructions due to needing two different sets of partition boundaries,
yielding 9/128 out of 12.5 or 0.56% of the total calculation budget for
the Z-buffer on a 64-lane processor. RV128P extends RV64P, now
permitting 128 lanes and 4 RGBZ8888 pixels per XLEN-bit element group,
with the same main loop as RV64P, yielding 9/512 out of 12.5 or 0.14% of
the total calculation budget. ... Congratulations, I believe we have
re-invented the GPU. :-)
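The percentages quoted above check out, taking the instructions-per-pixel budget as 50/4 = 12.5 from the Nyuzi/MALI400 comparison elsewhere in this thread:

```python
# Checking the quoted Z-buffer budget fractions.

budget = 50 / 4                      # ~12.5 instructions per pixel

for insns, pixels in [(7, 32), (9, 128), (9, 512)]:
    share = (insns / pixels) / budget
    print(f"{insns} insns / {pixels} pixels = {share:.2%} of budget")
# 1.75%, 0.56% and 0.14% respectively
```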

>> Instructions must be atomic; they either complete entirely or fault, with no
>> effect on the register file.
>>
>
> Vector LOAD/STORE is... complicated, as Andrew kindly explained.
> It's just not that cut-and-dried. Section 4.6 here:
> http://people.eecs.berkeley.edu/~krste/thesis.pdf
>
> and Jeff Bush's documented exploration into the same space:
> https://jbush001.github.io/2015/11/03/lost-in-translation.html
>
> It's that complicated that I'm giving serious consideration to making
> a recommendation that Simple-V implementors be permitted to implement
> Vector LOAD/STORE in software as an Exception/Trap. The pseudo-code's
> pretty straightforward.
>

There goes nearly all of your performance. Simple-V has a painfully
limited hardware vector length of 28 elements *total*, which
necessitates frequent access to main memory. If those memory operations
trap... it will be very slow.


-- Jacob

Richard Herveille

unread,
Apr 21, 2018, 2:46:43 AM4/21/18
to jcb6...@gmail.com, lk...@lkcl.net, RISC-V ISA Dev
TLDNR
For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs

Richard


Sent from my iPhone

Luke Kenneth Casson Leighton

unread,
Apr 21, 2018, 2:56:29 AM4/21/18
to Jacob Bachmeyer, RISC-V ISA Dev
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sat, Apr 21, 2018 at 7:34 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> How? Simple-V uses groups from the integer register file to store vectors.

which... *click* yes gotcha. sorry for being dense.

> The outer software loop can of course iterate for longer vectors (arguably
> even beyond 2^XLEN if available disk storage exceeds the address space), but
> Hwacha and RVV (in theory) permit up to 2^XLEN elements to be processed with
> a *single* instruction using hardware iteration.

... so in this case there would be a vector register file of length
2^XLEN, if i am understanding correctly.

> There are a few further optimizations possible on RV32P with RGBZ8888, due
> to the Z-buffer affecting the entire XLEN-bit element-group.
> [...]
> Congratulations, I believe we have re-invented the GPU. :-)

:) bit long for me, Jacob - I trust your calculations.

>> It's that complicated that I'm giving serious consideration to making
>> a recommendation that Simple-V implementors be permitted to implement
>> Vector LOAD/STORE in software as an Exception/Trap. The pseudo-code's
>> pretty straightforward.

> There goes nearly all of your performance. Simple-V has a painfully limited
> hardware vector length of 28 elements *total*, which necessitates frequent
> access to main memory. If those memory operations trap... it will be very
> slow.

it's a legitimate compliant implementation, that's what really
matters. remember, RVV is *really* heavy-duty. i'd like there to be
a way to get parallelism that is "bare minimum compliant with minimum
effort" that can scale half the way up to what RVV can achieve... then
RVV take over seamlessly up to super-computer performance.

l.

Luke Kenneth Casson Leighton

unread,
Apr 21, 2018, 2:57:25 AM4/21/18
to Richard Herveille, jcb6...@gmail.com, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 7:46 AM, Richard Herveille
<richard....@roalogic.com> wrote:
> TLDNR
> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs

thanks richard will take a look.

Michael Clark

unread,
Apr 21, 2018, 8:31:25 AM4/21/18
to Luke Kenneth Casson Leighton, Jacob Bachmeyer, RISC-V ISA Dev


On 21/04/2018, at 5:50 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:

RECIP-SQRT btw is a really important part of 3D (it's for
normalisation of 3D vectors - 1/sqrt(x^2 + y^2 + z^2)).

We need dot product and this is a cross lane operation that results in a scalar [1].

Folk will be able to extend RVV.

Luke Kenneth Casson Leighton

unread,
Apr 21, 2018, 6:57:11 PM4/21/18
to Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 1:31 PM, Michael Clark <michae...@mac.com> wrote:

> We need dot product and this is a cross lane operation that results in a
> scalar [1].

Nice reference, good find, thank you Michael.

> Folk will be able to extend RVV.

and you have to duplicate the same opcodes in scalar space as well.
Simple-V reduces RVV opcode space by 80% (saving overall opcode
space); 98% of arith/fp opcodes are duplicates of RV*.  See
http://libre-riscv.org/simple_v_extension/v_comparative_analysis/#duplication_analysis


> [1] https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf

Luke Kenneth Casson Leighton

unread,
Apr 24, 2018, 7:38:11 AM4/24/18
to Richard Herveille, jcb6...@gmail.com, RISC-V ISA Dev
http://www.ti.com/lit/ug/sprufe8b/sprufe8b.pdf

ooo that's a... deeply impressively comprehensive, powerful (and
messy) architecture that could only come into existence by way of
evolution to meet customers' needs over a looong tiiime.

predication is built-in, VLIW is an inherent part of the design, SIMD
is built-in, Zero-overhead loops are there (not quite as powerful as
ZOLC)... and that's just from getting to page 199 of the *770* page
ISA manual. zowee this is gonna take a while :)

l.