lk...@lkcl.net wrote:
> On Wednesday, April 18, 2018 at 6:13:19 AM UTC+1, Jacob Bachmeyer wrote:
>
> [...]
>
> > Bypassing the vectorisation is fairly trivial to detect and can be
> > done with single bits worth of internal state, created from the
> CSRs,
> > meaning that scalar execution doesn't have to be adversely
> affected.
> > Put another way: is it _really_ so burdensome to have a loop
> going (in
> > effect) "for (i = 0; i < 1; i++)" ? :)
>
> No, that is a perfectly valid implementation of RVP lanes. The big
> sticking point I have with the Simple-V concept is the constraints it
> imposes on register allocation.
>
>
> that's where I would envision RVV taking over.
The problem is how quickly RVV would need to take over from Simple-V:
the standard RISC-V POSIX ABI places the frame pointer at x8, so
registers x9 - x31 are readily available. If we stretch (and spill the
frame and thread pointers to the stack) we can extend that to x4 - x31,
but must not use millicode (x5 is the millicode link register) while
doing this. At most, we can have a total of 28 vector *elements* in
Simple-V, although some programs could potentially exploit the fact that
vectors in Simple-V can overlap. If we want 8-element vectors, we can
only have 3 that do not overlap.
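A back-of-the-envelope check of that register budget, as a small Python sketch (the register numbers are the ones from the ABI discussion above; nothing here is from any draft spec text):

```python
# Simple-V register budget under the standard RISC-V POSIX ABI.

def usable_registers(spill_fp_tp):
    """Integer registers available to hold Simple-V vector elements."""
    if spill_fp_tp:
        # x4 - x31 once the frame and thread pointers are spilled
        # (millicode is then off-limits: x5 is its link register).
        return 31 - 4 + 1
    # x8 is the frame pointer, so only x9 - x31 are free by default.
    return 31 - 9 + 1

def max_nonoverlapping_vectors(elements_per_vector):
    return usable_registers(spill_fp_tp=True) // elements_per_vector

assert usable_registers(False) == 23
assert usable_registers(True) == 28
assert max_nonoverlapping_vectors(8) == 3   # the "only 3" figure above
```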
> [...]
>
> > * There is actually an architectural limit on vector length in RVV:
> > this however sets no limit on the outer loop *using* VSETL (and
> > likewise in Simple-V).
>
> Ack! You are right -- vector length is at most 2^XLEN. :-)
>
>
> :) and, interestingly, just like in RVV, nothing to do with the
> level of internal parallelism.
Also just like RVP lanes.
> > * Even in Hwacha the design only has 4 actual internal parallel
> lanes
> > (an architectural limit on the value that VSETL can ever be set to
> > being no greater than 4).
>
> The Hwacha unit fabricated had only 4 lanes, but this does *not*
> limit
> the vector length in Hwacha at all.
>
>
> That's right, and not in Simple-V either. For exactly the same
> reason(s). I am quite literally taking concept-for-concept exactly
> RVV (which is Hwacha-like) and transplanting those concepts into the
> standard register files.
The main problem with Simple-V is the harsh limit on the total number of
vector elements. Hwacha and RVV allow up to 2^XLEN vector length (in
practice, no limit -- hardware is highly unlikely to have 16GiB of
vector register memory). The RVP lanes that I propose allow up to XLEN
lanes. Simple-V allows up to 28 vector *elements*. Each RVP lane has
more data registers than that. (OK, so in practice 28 will be the
number of usable registers in an RVP lane implementation.)
> Hwacha is a mixed-parallel-serial
> vector engine -- more than four elements in a vector simply means
> that
> the vector unit requires multiple cycles to operate on that vector.
>
>
> Really? I was under the impression that it actually set
> MAXVECTORLEN=4. Or, the Hwacha equivalent of Section 17.11 (0.4-Draft
> V2.3-Draft).
>
> If it _doesn't_ then that would actually, interestingly, mean that
> it's a hybrid sequential-loop plus parallel hardware system, just like
> in the Broadcom VideoCore-IV. Broadcom call the hybrid concept
> "Virtual Parallelism".
Yes, that seems to be the intent. The slides from an earlier RISC-V
workshop mentioned a systolic vector unit topology as well.
> And that would be absolutely fine under Simple-V, too.
Also an acceptable implementation for RVP lanes.
> [...]
>
> > Secondly, the main context in which I am considering this (at
> all) is
> > as the basis for an *embedded* 3D GPU (not an AMD Radeon 850 or an
> > NVidia 1070). 4-wide internal parallelism (per core that is)
> would be
> > perfectly adequate (and is what the Broadcom VideoCore-IV has).
>
> This would be RVP lanes with four writable bits in the plane CSR
> and RVP
> partitioning to divide those elements into pixel channels, either
> RGB565
> (2/group in RV32) or RGBA8888 (1/group in RV32). RGBZ8888 is
> similarly
> available if depth-buffer values are integers.
>
>
> Interestingly, Z-buffers I believe can be implemented with predication
> in very few instructions. A predicated compare, and two register MVs
> (conditionally carried out based on the compare). One of those
> predicated MV operations would be the Z value, the other would be the
> pixel.
>
> I love predicated vectors!
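The predicated scheme described above boils down, per element, to one compare and two conditional moves. A minimal Python sketch of those semantics (the smaller-Z-is-nearer convention is my assumption, not from the thread; real code may use the opposite test):

```python
def zbuffer_update(old_z, old_pixel, new_z, new_pixel):
    # Predicated compare: one predicate bit per element.
    pred = new_z < old_z                     # assumed smaller-Z-is-nearer
    # Two predicated MVs, committed only where the predicate is set:
    # one carries the Z value, the other the pixel.
    z = new_z if pred else old_z
    pixel = new_pixel if pred else old_pixel
    return z, pixel

assert zbuffer_update(100, 0xAAAAAA, 50, 0xBBBBBB) == (50, 0xBBBBBB)
assert zbuffer_update(50, 0xAAAAAA, 100, 0xBBBBBB) == (50, 0xAAAAAA)
```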
Even without predication, Z-buffers are easily enough implemented with
RVP partitioned elements:
  1. packed-data compare;
  2. mask the result to extract the Z channels;
  3. packed-subtract 1 (using different boundaries that treat entire
     pixels as elements) to form an inverse mask;
  4. AND the original data with the inverse mask;
  5. invert the mask;
  6. AND the new data with the mask;
  7. OR the masked original with the masked new data.
That is 9 instructions total, including the CSR writes to change the
partitioning boundaries.
Total data needed is 2 mutable inputs and 1 temporary for the
"predicate" mask. Overall, an additional register is needed for
exchanging the partitioning boundaries and another to store the constant
-1, since PADDI will not exist. We could use x9 - x31 (23 registers)
for the partition boundaries, constants, and working data, allowing
seven iterations of this loop to be interleaved and the overhead of the
CSR swap to be amortized over seven groups. In an implementation with
RVP lanes, each group is also replicated across the N lanes, so the
processor could have 7*N groups "in the air" at once.
On RV32, using RGBZ8888 data, this can be even easier: since there is
only one pixel per group, we can use only a single set of partition
boundaries and the RVI SUBI instruction. This eliminates the "global"
registers and two CSR swap instructions, and might allow us to have an
*eighth* group up in the air, for up to 8*32 = 256 pixels processed for
each iteration of the loop on RV32P with 32 lanes.
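One plausible reading of the mask sequence above, modelled in Python for the RV32 RGBZ8888 case (one pixel per 32-bit register). The packed-op conventions are my assumptions: the compare is taken to leave 1 in each passing element and 0 otherwise, so that a pixel-wide "subtract 1" turns it into an all-zeros or all-ones select mask:

```python
M32 = 0xFFFFFFFF   # model 32-bit registers

def zbuf_select(old_pix, new_pix, test_passes):
    # Packed compare result, assumed to be 1 (pass) or 0 (fail).
    cmp = 1 if test_passes else 0
    # Subtract 1 at pixel granularity: 0 -> all-ones, 1 -> all-zeros.
    inv = (cmp - 1) & M32          # inverse mask, selects the original
    mask = ~inv & M32              # inverted: selects the new data
    # AND original with inverse mask, AND new with mask, OR together.
    return (old_pix & inv) | (new_pix & mask)

assert zbuf_select(0x11223344, 0x55667788, True) == 0x55667788
assert zbuf_select(0x11223344, 0x55667788, False) == 0x11223344
```

The branchless select is the point: no per-pixel control flow, so the same instruction stream applies across every partitioned element and every lane at once.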
> > Summary: there's a way for RVV to take over and get much better
> > performance (with associated costs, reduced greatly by the
> > half-way-house that Simple-V represents), and Simple-V is
> designed to
> > take into account intermediary targets (embedded, DSP) anyway.
>
> The big problem that I keep coming back to with Simple-V is that it
> seems to more-or-less need most of an out-of-order superscalar
> processor
> to achieve a parallel implementation. Using contiguous groups out of
> the register file is *really* complex if the simple serial
> implementation is not good enough,
>
>
> Y'know I thought about that, and want to raise a question with Andrew
> at an appropriate moment: if an implementor does have an OoO
> implementation with register-renaming, Simple-V could hypothetically
> quite literally go over the "simple" loop and, instead of executing
> instructions, just stuff them into the OoO FIFO, leaving it at
> that. The only fly in the ointment at that point is: Exceptions (on
> LOAD/STORE in particular. F.P. ones can be turned into an array, and
> the actual exception thrown only after all elements in a given VSETL
> length have all been executed).
Instructions must be atomic; they either complete entirely or fault,
with no effect on the register file.
> even though the simple serial
> implementation is *really* simple.
>
>
> yehyeh it is.
>
> I'm going to have to give this some thought: it's a good point.
> Remember that I'm endeavouring to cover the DSP (SIMD) case as well,
> which is (as best I understand it) uuusually a single-issue in-order
> architecture with relatively short pipelines, where typically every
> operation completes in the same amount of time. Really *really* easy
> to design. From what I'm seeing of RISC-V, even without RVV or *any*
> of these proposals, it's not really a good fit to the traditional
> (single-issue, in-order) SIMD/DSP model. I'd be really happy to be
> informed otherwise.
My RVP proposal is specifically intended for DSP applications, including
a programming model that should be familiar to DSP programmers.
> Anyway, Jacob: one for you to consider (on a separate thread? or the
> RVP one?) - how would RVP deal with LOAD/STORE exceptions, and
> Floating-point ones too. Also I'd be interested to help evaluate how
> RVP fits with different implementation paradigms (OoO, single-issue,
> in-order and so on).
Group LOAD/STORE must be atomic -- essentially, all translations are
resolved and permissions verified *before* any actual memory access
occurs. RVP intentionally does not support floating-point: partitioning
an FP unit in the way RVP partitions an integer unit is a can of worms;
DSPs often do not support floating-point anyway; as far as the hardware
cares, fixed-point is just integer arithmetic; and workloads that truly
need FP vectors should move to RVV at some point. RVFP could have an
additional pflane CSR for FP lanes, which
would *not* be conflated with integer lanes, and would *not* have
gather-LOAD/scatter-STORE capability, since only an integer lane can
address memory.
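The "all translations resolved and permissions verified before any access" rule sketches out like this in Python (the page size, fault type, and toy permission model are all illustrative, not from any RVP draft):

```python
PAGE = 4096   # illustrative page size

def group_load(mem, readable_pages, addrs):
    """Load every address in the group, or fault with no access at all."""
    # Phase 1: resolve and verify every translation up front.
    for a in addrs:
        if (a // PAGE) not in readable_pages:
            raise MemoryError("fault: group load aborted before any access")
    # Phase 2: only now perform the actual memory accesses.
    return [mem[a] for a in addrs]

mem = {0x1000: 7, 0x1004: 8, 0x2000: 9}
ok = group_load(mem, {0x1, 0x2}, [0x1000, 0x1004, 0x2000])
assert ok == [7, 8, 9]
try:
    group_load(mem, {0x1}, [0x1000, 0x2000])   # second page unreadable
except MemoryError:
    pass   # faulted in phase 1; no element was loaded
```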
-- Jacob