On Monday, July 1, 2019 at 8:09:13 PM UTC+1, Ivan Godard wrote:
> Yes, I know. But how about:
> float af[];
> double ad[];
> for (int i=0; i<N; ++i) {
> af[i] = 0.0;
> ad[i] = 1.0;
> }
>
> This is trivial in a Mill, but I don't see how to use VBLOCK to get it,
> other than by splitting it into two loops.
the per-register tags that are set up by the VBLOCK contain independent
elwidth overrides, and VLEN is, again, independent of those (or rather:
the VLEN *incrementing* is independent of them). so it's really straightforward.
there are two ways to do it: one involves just using FADD.S and FADD.D,
the other would involve an override to 32-bit FP element widths.
i don't know how to get constants efficiently into RISC-V FP regs,
so please excuse the mess at the start:
RegCSR[f4] = 32bit, f4, vector # override use of f4 to be 32-bit vector
RegCSR[f8] = dflt, f8, vector # override use of f8 to be vector @ op-size
fcvt.s.w f20, x0 # 0.0 into f20
addi x1, x0, 1 # 1 into x1
fcvt.d.w f21, x1 # convert 1 into 1.0 (double)
loop:
sv.setvl a2, t4, 8 # a2 = VL = min(t4, MAXVL=8)
FADD.S f4, f20, f20 # add zero to zero,
FADD.D f8, f21, f20 # add 1.0 to zero
sub t4, t4, a2 # subtract VL from the element count
bnez t4, loop
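the strip-mining shape of that loop can be sketched in python (a hedged
model: `setvl` here simply clamps the remaining element count to MAXVL,
and `strip_mine` is an invented name for this sketch; the real sv.setvl
semantics are richer than this):

```python
# hedged model of the setvl strip-mining pattern: setvl clamps the
# remaining element count to MAXVL, and the loop subtracts the chosen
# VL each trip until no elements remain.
def setvl(remaining, maxvl):
    return min(remaining, maxvl)

def strip_mine(n, maxvl=8):
    """return the VL chosen on each trip round the loop for n elements."""
    chunks, remaining = [], n
    while remaining:
        vl = setvl(remaining, maxvl)     # sv.setvl a2, t4, 8
        chunks.append(vl)                # ... vectorised loop body here ...
        remaining -= vl                  # sub t4, t4, a2
    return chunks

print(strip_mine(20))   # [8, 8, 4]
```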
the MAXVL is 8, here, and bear in mind that the scalar regfile is *64*
bits wide. therefore:
* FADD.S results in *8* operations (spanning 4 64-bit regs):
FADD.S f4[LO-32], f20, f20
FADD.S f4[HI-32], f20, f20
FADD.S f5[LO-32], f20, f20
FADD.S f5[HI-32], f20, f20
FADD.S f6[LO-32], f20, f20
FADD.S f6[HI-32], f20, f20
FADD.S f7[LO-32], f20, f20
FADD.S f7[HI-32], f20, f20
* FADD.D *ALSO* results in 8 operations, however they span *8* 64-bit regs:
FADD.D f8, f21, f20
FADD.D f9, f21, f20
FADD.D f10, f21, f20
FADD.D f11, f21, f20
FADD.D f12, f21, f20
FADD.D f13, f21, f20
FADD.D f14, f21, f20
FADD.D f15, f21, f20
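the two expansions above can be modelled in a few lines of python (a
rough sketch, not the real decoder: `REGS`, `write_elem` and `expand`
are invented names, and the FADD arithmetic itself is elided down to
just writing the result bits into each element position):

```python
import struct

# model the 64-bit scalar FP regfile as 32 words; an elwidth override
# of 32 packs two elements per register, 64 uses one register each.
REGS = [0] * 32                          # f0..f31, one 64-bit word each

def write_elem(base, elwidth, idx, bits):
    """pack element `idx` of width `elwidth` into the regfile at `base`."""
    per_reg = 64 // elwidth              # elements per 64-bit register
    r = base + idx // per_reg            # which scalar register it lands in
    sh = (idx % per_reg) * elwidth       # bit offset within that register
    mask = (1 << elwidth) - 1
    REGS[r] = (REGS[r] & ~(mask << sh)) | ((bits & mask) << sh)

def expand(dest, elwidth, vl, result_bits):
    """issue-time unroll: one scalar op becomes vl element-writes."""
    for i in range(vl):
        write_elem(dest, elwidth, i, result_bits)

f32 = lambda x: struct.unpack("<I", struct.pack("<f", x))[0]
f64 = lambda x: struct.unpack("<Q", struct.pack("<d", x))[0]

expand(4, 32, 8, f32(0.0))   # FADD.S @ 32-bit: 8 elements span f4..f7
expand(8, 64, 8, f64(1.0))   # FADD.D @ 64-bit: 8 elements span f8..f15
expand(16, 32, 8, f32(1.0))  # a nonzero 32-bit value, to make the
                             # two-per-register packing visible
```

after running it, f8..f15 each hold the bits of 1.0 (double), and
f16..f19 each hold two copies of the bits of 1.0f side by side.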
no need for separate VL loops.
so this emphasises: every operation is independent; it's *literally* like
a for-loop. every "tag" is independent, and those multi-element
operations are *direct* in-stream substitutes for the "one" (scalar)
operation, as if the multi-element expansion had *literally* been in
the instruction stream instead of the (one) scalar op.
the dependencies come if any of those registers... *after* the expansion
phase... overlap in any way (part or full).
dependencies are dealt with in the standard way because the expanded
operations are *literally* instructions in standard sequential program order.
no "actual" vectors, at all. SV unrolls the hardware-for-loop at the
instruction *issue* phase.
where it gets more complicated to handle mentally is in two cases:
* two registers with overrides at different element widths used in one
opcode. some polymorphic rules had to be made to deal with this.
* two registers used in one opcode whose target ranges *overlap*
onto the *SAME* scalar registers.
the latter is useful for carrying out reduce operations: you very
deliberately create two "vectors" whose target "real" registers differ by
one.
RegCSR[f4] = dflt, f4, vector # override use of f4
RegCSR[f5] = dflt, f5, vector # override use of f5
a loop that does FADD f5, f5, f4 would issue the following:
FADD f5, f5, f4
FADD f6, f6, f5
FADD f7, f7, f6
..
..
and that's basically a Vector Reduce.
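that offset-by-one reduce can be modelled directly (a hedged python
sketch: `sv_fadd` is an invented name, and all three operand bases are
tagged vector, with the expansion executing strictly in program order):

```python
# hedged model of the offset-by-one reduce: destination base f5,
# source bases f5 and f4, expanded sequentially so each element-op
# sees the result the previous one just wrote.
def sv_fadd(regs, rd, rs1, rs2, vl):
    for i in range(vl):
        regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]

f = [0.0] * 32
f[4:9] = [1.0, 2.0, 3.0, 4.0, 5.0]   # five elements to reduce

# FADD f5, f5, f4 with VL=4: each step folds the previous result in,
# so the last element ends up holding the full sum.
sv_fadd(f, 5, 5, 4, 4)
print(f[8])   # 15.0
```

the sequential dependency chain is exactly why it isn't parallel: each
element-op reads what the previous one wrote.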
[it's not optimised (no parallelism), however given that it doesn't need
any opcodes added to RISC-V *at all*, i am having a hard time caring that
it's not optimal :) ]
it's almost the same, in effect, as the Mill: if RISC-V had polymorphic
INT/FP types, and had "widen" and "narrow" opcodes, it would be
a lot closer.
i haven't thought through what happens if you try to create two
overrides with different elwidths pointing at the exact same
target registers: it probably makes a mess.
:)
l.