
vectorisation data-dependent fail-on-first conditional test


lkcl

Jun 30, 2019, 4:14:27 AM
https://libre-riscv.org/simple_v_extension/appendix/#ffirst

In the above spec I am putting together the options which, when combined, I believe would be extremely flexible and powerful.

The use case for fail-on-first is typically illustrated with vectorised strncpy and strlen. Without ffirst, a parallel set of LDs may cross a page boundary, and if the LDs fail 'as a block' the only way to continue is to fall back to scalar strip-mining (the anathema and bane of vectorisation's existence).

With ffirst, the page fault is used to *truncate* vector processing such that the remainder of the parallel loop operations only apply to the *valid* LDed data.
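(To make that concrete, here is a minimal C sketch of the idea, with made-up names: page_mapped stands in for the MMU, and strlen_ffirst for a strlen built on such loads.)

#include <stddef.h>
#include <stdbool.h>

/* made-up stand-in for the MMU: is this byte's page mapped? */
extern bool page_mapped(const char *p);

/* strlen built on ffirst-style loads of up to WIDTH bytes at a time.
 * Element 0 is attempted like any scalar load (a fault there is a
 * real fault); a fault on any later element merely shrinks vl, so
 * the loop never needs a scalar strip-mined fallback. */
static size_t strlen_ffirst(const char *s)
{
    enum { WIDTH = 16 };             /* nominal vector length */
    size_t len = 0;
    for (;;) {
        size_t vl = 1;               /* element 0: always attempted */
        while (vl < WIDTH && page_mapped(s + len + vl))
            vl++;                    /* later elements: truncate, no trap */
        for (size_t i = 0; i < vl; i++)
            if (s[len + i] == '\0')
                return len + i;      /* operate only on *valid* data */
        len += vl;
    }
}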

Someone kindly posted a link last week to ARM SVE2; it also contains data-dependent fail-on-first, as does RISC-V RVV. As best I can tell, however, both are limited to LD/ST.

This got me thinking (a dangerous activity).

What if ffirst were extended so that it did not just trigger on LD/ST hard exceptions: what if it could be applied to stop (truncate) vector processing based on actual *data* values as well?

The simplest test would be "is the element equal to zero? if yes, activate fail-on-first and truncate VLEN to cover only up to the last nonzero element".

Thus, if an FP compare (FNE, FEQ etc.) were to return integer 1 on success and 0 on fail, we could get vectorisation to ignore the remaining sequential elements within a given loop from the point at which a *parallel* test fails.

This alone I believe would save a lot of messing about with predicate masks. To recreate the above behaviour with predicate masks would require a special opcode equivalent to xBitManip "count leading ones", in order to detect where in the parallel sequence the first FEQ/FNE failed.
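(A rough C sketch of the difference, purely illustrative, with made-up function names; both functions return the same truncated VL, one via ffirst in element order, one via a parallel mask plus the extra bit-count step.)

#include <stddef.h>
#include <stdint.h>

/* Data-dependent ffirst, simplest form: scan in element order and
 * truncate VL at the first element failing the test (here: != 0),
 * so subsequent vector operations in the loop only see the passing
 * elements. */
static size_t ddffirst_nonzero(const int64_t *v, size_t vl)
{
    size_t i;
    for (i = 0; i < vl; i++)
        if (v[i] == 0)
            break;                   /* first "fail": truncate here */
    return i;                        /* the new VL */
}

/* The predicate-mask alternative: a parallel compare into a bitmask,
 * then an extra "count the contiguous set bits from element 0" step
 * (a count-leading/trailing-ones style op, depending on bit order)
 * just to find the same truncation point.  Assumes vl <= 64. */
static size_t ddffirst_via_mask(const int64_t *v, size_t vl)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < vl; i++)
        mask |= (uint64_t)(v[i] != 0) << i;   /* 1 = test passed */
    size_t n = 0;
    while (n < vl && ((mask >> n) & 1))
        n++;
    return n;                        /* same value as above */
}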

It gets even more interesting when zeroing and nonzeroing are added to the mix. Typical Vector Processors choose either zeroing of predicate-masked-out elements (an architectural optimisation that allows some implementations to save a register read) or nonzeroing (skipping the element and leaving it untouched if masked out).

SV adds support for both, very deliberately, because at the *ISA* level (not the architectural level) both are useful.

When combined with data-dependent ffirst conditional testing, zeroing and nonzeroing provide significant extra flexibility.

Nonzeroing simply skips predicate-masked-out elements. There's no "fail" because nothing was tested if the predicate bit was zero.

However on zeroing, the first masked-out element *automatically* and implicitly (unconditionally) fails, because its result is forced to zero, and zero is precisely what the "fail" condition tests for.

In combination with a test that can never fail (x0 == x0), this can be used to make the predicate mask a way to (unconditionally) alter VLEN, turning the parallel engine itself into a hidden "count leading ones" operation!

That means that the predicate mask can be used to *unconditionally* mask out elements beyond a certain point, again saving on instructions that would normally be needed for predicate bitmask manipulation.
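(Here is how I picture the interaction, as a minimal C sketch; the function and its arguments are invented purely for illustration.)

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Data-dependent ffirst under a predicate mask (bit i = element i
 * enabled), with zeroing or nonzeroing.  A compare "result" is 1 on
 * success, 0 on fail; ffirst truncates VL at the first zero result.
 * Assumes vl <= 64.  Purely illustrative. */
static size_t ddffirst_pred(const int64_t *v, uint64_t pred, size_t vl,
                            bool zeroing, bool (*test)(int64_t))
{
    for (size_t i = 0; i < vl; i++) {
        bool enabled = (pred >> i) & 1;
        int result;
        if (enabled)
            result = test(v[i]) ? 1 : 0;  /* the actual compare */
        else if (zeroing)
            result = 0;                   /* zeroed-out element */
        else
            continue;                     /* nonzeroing: not tested */
        if (result == 0)
            return i;                     /* ffirst: truncate VL */
    }
    return vl;
}

/* the "test that can never fail" (x0 == x0) */
static bool always_true(int64_t x) { (void)x; return true; }

/* zeroing + always_true: the predicate mask alone sets the new VL,
 * e.g. pred = 0x0f over vl = 8 gives VL = 4 -- the parallel engine
 * acting as a hidden "count leading ones" over the mask. */
static size_t vl_from_mask(const int64_t *v, uint64_t pred, size_t vl)
{
    return ddffirst_pred(v, pred, vl, true, always_true);
}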

Questions

1. Has anyone encountered anything like this before?

2. What historical architectures exist that have similar features?

3. How could it be improved without adding too much to the gate count? For example, whilst fail-on-first is useful, it might also be worth having "fail on second" or "fail on first then go back one", i.e. performing some kind of offset adjustment on the vector length.

4. What instructions (bitmanip, tests) could be usefully added that could take advantage of this capability?

https://libre-riscv.org/simple_v_extension/appendix/#strncpy

Regarding (3), I have noticed from the RVV strncpy example that it may be useful to have arithmetic operations on VLEN: whilst fail-on-first can detect the zero, VLEN gets set to *exclude* that zero and thus needs incrementing by 1. This is a pain, as it requires 3 operations: get VLEN into a register, add 1 to the register, put the register back into VLEN.

Thoughts and discussion appreciated as always.

In particular from Ivan as I recall that the Mill has something similar in the form of a "data is not valid" tag, which propagates along the belt and thus has a similar data-dependent effect. Is there something similar to sequential "fail first" data testing in the Mill?

L.

Ivan Godard

Jun 30, 2019, 11:34:52 AM
On 6/30/2019 1:14 AM, lkcl wrote:

> In particular from Ivan as I recall that the Mill has something similar in the form of a "data is not valid" tag, which propagates along the belt and thus has a similar data-dependent effect. Is there something similar to sequential "fail first" data testing in the Mill?


See https://millcomputing.com/docs/metadata/. The discussion of strncpy
starting at slide 55 is a worked-out example, but needs the earlier
material for understanding.

Note: you *MUST* use real MS Windows Powerpoint for the slides; Mac and
the free alternatives screw up the animations, which are critical. If
you don't have MS Windows Powerpoint, just watch the video instead.

MitchAlsup

Jun 30, 2019, 11:36:38 AM
On Sunday, June 30, 2019 at 3:14:27 AM UTC-5, lkcl wrote:
> https://libre-riscv.org/simple_v_extension/appendix/#ffirst
>
> In the above spec I am putting together the options which, when combined, I believe would be extremely flexible and powerful.
>
> The use case for fail on first is typically illustrated with vectorised strncpy and strlen. Without ffirst, a parallel set of LDs will cross a page boundary and the only way to continue if the LDs are failed 'as a block' is to use scalar strip-mining (the anathema and bane of existential vectorisation existence)
>
> With ffirst, the page fault is used to *truncate* vector processing such that the remainder of the parallel loop operations only apply to the *valid* LDed data.
>
> Soneone kindly posted a link last week to ARM SVE2, it also contained data-dependent fail-on-first, as does RISC-V RVV. As best I can tell however they are limited to LD/ST.
>
> This got me thinking (a dangerous activity).
>
> What if ffirst was extended to not just occur on LD/ST hard exceptions, what if ffirst could be applied to stop (truncate) vector processing on actual *data* values as well?

I think what you want is the ability to calculate a fail, and allow LD/STs to automatically fail (on page fault,...).

In VVM I don't think I need this because I vectorize loops not instructions.
>
> The simplest test would be "is element equal to zero, if yes, activate fail on first and truncate VLEN to cover last nonzero element".
>
> Thus, if a FP Compare FNE, FEQ etc were to return integer 1 on success and 0 on fail, we can get vectorisation to ignore the sequential elements within a given loop at the point at which a *parallel* test occurs.
>
> This alone I believe would save a lot of messing about with predicate masks. To recreate the above behaviour with predicate masks would require a special opcode equivalent to xBitManip "count leading ones", in order to detect where in the parallel sequence the first FEQ/FNE failed.

You are just predicating in a way that is not SW accessible; it remains predication.

lkcl

Jun 30, 2019, 5:44:31 PM
On Sunday, June 30, 2019 at 11:36:38 PM UTC+8, MitchAlsup wrote:

> >
> > What if ffirst was extended to not just occur on LD/ST hard exceptions, what if ffirst could be applied to stop (truncate) vector processing on actual *data* values as well?
>
> I think what you want is the ability to calculate a fail, and allow LD/STs to automatically fail (on page fault,...).

The full rules for LD/ST ffirst are:
* The first element must be treated normally. Page faults can be dealt with by the usual VM load-fault handling, but an illegal memory access definitely must throw a hard fault.
* For subsequent elements, an illegal memory access does not trap: instead it *ends* the vectorisation, *truncating* the Vector Length at the point right *BEFORE* the illegal memory access.

Thus a parallel operation has the same characteristics as a sequential one. It is kind of like speculative execution.
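(As a reference model of those two rules on the LD side, something like this C sketch; access_ok and take_fault are made-up stand-ins for the MMU and trap machinery.)

#include <stddef.h>
#include <stdbool.h>

/* Made-up stand-ins for the MMU and trap machinery. */
extern bool access_ok(const char *addr);
extern void take_fault(const char *addr);

/* The two rules above, as a reference model for the LD side:
 * element 0 behaves exactly like a scalar load (an illegal access
 * traps, so forward progress is guaranteed); for later elements an
 * illegal access does not trap, it truncates VL just before it. */
static size_t ld_ffirst(char *dst, const char *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        if (!access_ok(src + i)) {
            if (i == 0)
                take_fault(src);   /* rule 1: first element traps */
            return i;              /* rule 2: truncate VL before it */
        }
        dst[i] = src[i];
    }
    return vl;                     /* no fault: VL unchanged */
}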


> In VVM I don't think I need this because I vectorize loops not instructions.

I realised just after I posted that VVM would share some of the characteristics, because if loops are to be done in parallel, of course it is not safe to try to do computations on illegal memory accesses.

So yes, the amount of parallelism in each VVM loop would have to fit what is *safe* to carry out.

I would be interested to hear whether you concur with that assessment.

> >
> > The simplest test would be "is element equal to zero, if yes, activate fail on first and truncate VLEN to cover last nonzero element".
> >
> > Thus, if a FP Compare FNE, FEQ etc were to return integer 1 on success and 0 on fail, we can get vectorisation to ignore the sequential elements within a given loop at the point at which a *parallel* test occurs.
> >
> > This alone I believe would save a lot of messing about with predicate masks. To recreate the above behaviour with predicate masks would require a special opcode equivalent to xBitManip "count leading ones", in order to detect where in the parallel sequence the first FEQ/FNE failed.
>
> You are just predicating in a way that is not SW accessible; it remains predication.

It is indeed software-accessible, because the ffirst-ed operation modifies VLEN. And, if integer branch operations are involved, I augment them (with vector state) to store the result of each compare in a target predicate destination mask.

Forgot to mention that bit.

The branch only goes ahead if all compares are carried out and successful. If no branch is needed, the branch address may be set to PC+4, i.e. the next instruction!

A ffirst-ed branch/compare would therefore not only result in VLEN being truncated, such that subsequent parallel operations would only be done on non-failing elements; there would also be a predicate record of the successes.
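(Something like this C sketch, purely illustrative with made-up names: the ffirst-ed compare hands back both the truncated VL and the predicate record.)

#include <stddef.h>
#include <stdint.h>

/* ffirst-ed vector compare (illustrative, assumes vl <= 64): the
 * compares run in element order, each success is recorded in the
 * destination predicate mask, and VL is truncated at the first
 * failure.  The branch itself would only be taken if the returned
 * VL equals the original vl, i.e. every compare succeeded. */
static size_t ffirst_cmp_eq(const int64_t *a, const int64_t *b,
                            size_t vl, uint64_t *pred_out)
{
    uint64_t pred = 0;
    size_t i;
    for (i = 0; i < vl; i++) {
        if (a[i] != b[i])
            break;                  /* first failure: truncate VL */
        pred |= (uint64_t)1 << i;   /* record the success */
    }
    *pred_out = pred;
    return i;                       /* the new (truncated) VL */
}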

The thing is, all of this has not really been thought through at all. SV is a constant series of surprising accidental beneficial side effects of the unusual idea to take a scalar ISA and "tag" registers as scalar or vector. Predicated LD/ST-MULTI for example happened completely by accident! It was just a byproduct of "tag as vector".

I am left in the unusual position, for me, of wondering how to make sense of this and add some direction and metrics by which these various ideas and beneficial side effects can be properly assessed.

This idea in this thread is a classic example. To me it's a really obvious extension of ffirst, to make it dependent on a conditional (zero) data test; however, aside from one or two standard functions (strncpy being one) I have *no way* of assessing the strategic long-term benefit of the concept, and I certainly don't have a clue as to what additional instructions might work really well with the data-dependent conditional ffirst concept.

If it were a specific opcode such as "clz" or something, this would be a dead-easy assessment, and there would be no need to ask here.

I guess it is just down to the fact that normally this kind of design work would be done by large, well-funded teams of people working together, each bringing different skills to the table, and I am slightly freaked out by the responsibility: this is a long-term strategic project designed to go into mass-produced products, so it has to be gotten right.

L.

lkcl

Jun 30, 2019, 5:45:09 PM
Thx Ivan will watch it.

lkcl

Jul 1, 2019, 3:53:24 AM
On Sunday, June 30, 2019 at 4:34:52 PM UTC+1, Ivan Godard wrote:
> On 6/30/2019 1:14 AM, lkcl wrote:
>
> > In particular from Ivan as I recall that the Mill has something similar in the form of a "data is not valid" tag, which propagates along the belt and thus has a similar data-dependent effect. Is there something similar to sequential "fail first" data testing in the Mill?
>
>
> See https://millcomputing.com/docs/metadata/.

https://youtu.be/DZ8HN9Cnjhc?t=665
11:14 - cool! the "tag" - containing the element width, length
and whether it's a vector or a scalar - is conceptually near-identical
to SV VBLOCK. cool!

i like that ST doesn't need anything in the opcode (at all), because
the tag is carried through all the way from the LD.

widen and narrow operators. make a lot of sense.

love the FP metadata (FP flags "tag" the data) idea.

"pick" (p ? a : b) is kiinda not-exactly-cheating, it's kiinda
the same as a branch? except the range on "pick" is lower.
branch that only jumped 1 instruction would be easier to
consider doing both paths.

NULL pointers, again adding metadata: now we're getting into
"data-dependency" (in the form of tags).

slide 53: smear. this is near-identical to the Aspex Semiconductors
operator. the ASP variant *terminates* at the next 1. so:

00100001000010001000 -->
00111111000011111000

it's been a while (like... 2003) so that *might* have been a
separate register that indicated "end the smear" point(s)

basically this was the core critical operator that allowed them
to do ones "carry" propagation, of arbitrary size.

so if you really wanted to, you could use the entire processor
to do a 8192-bit integer ADD. or in the G Architecture, a 131072-bit
integer add. or, because there was an in-string bit and an out-string
bit, chains of processors could do near-unlimited bit ADDs
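(from memory, and reconstructed purely from the example bits above, the behaviour was something like this C sketch -- treat it as a guess rather than a description of the actual ASP operator:)

#include <stdint.h>

/* ASP-style smear, as best i recall it: a set bit opens a run of 1s
 * which continues up to and *including* the next set bit, then the
 * run closes.  the real operator may instead have taken the
 * termination points from a separate register. */
static uint32_t smear_terminate(uint32_t x, int nbits)
{
    uint32_t out = 0;
    int open = 0;                          /* inside a run? */
    for (int i = nbits - 1; i >= 0; i--) { /* leftmost bit first */
        int bit = (x >> i) & 1;
        if (bit) {
            out |= 1u << i;                /* a 1 always stays a 1 */
            open = !open;                  /* open or close the run */
        } else if (open) {
            out |= 1u << i;                /* fill the run with 1s */
        }
    }
    return out;
}

/* smear_terminate(0x21088, 20) == 0x3F0F8,
 * i.e. 00100001000010001000 -> 00111111000011111000 */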

ah finally, slide 55 :)

ok so you use the "smear.x" to offset to include the "0".

then the "pick" is inversion of that zeroing test.

and because the pick is done against "none", the ST doesn't
damage beyond end-of-string.

i like it!

and because of the smear.x the last tested thing, you can
branch based on that, so the loop is good.

oh - btw - how do you increment the pointer(s) by the correct
amount?


fascinatingly, then, the Mill achieves the same semantics
as the data-dependent fail-on-first, combined with the
"augmentation" of standard scalar BEQ/BNE into the vector
domain (which i forgot to describe in the OP).

the NaRs (not-a-result) are the direct equivalent of the
"illegal memory load changes VL", except that NaRs would
work even on arbitrary (indirect-addressed) LDs.

what i haven't worked out is how to do the arithmetic
that would increment the VLEN to cover the zero.
smear.x is the offset-by-one that helps with that.


slide 68: "remaining" operation. hmmmm... it's kiiinda
equivalent to ~((1<<x)-1), is that right?

vector-remaining (the other way round) is kinda like CLZ
(count leading zeros). or POPCNT (population count)

this is why i decided to use *scalar* registers as
predicate masks, because then dropping in xBitManip
operations you don't need special opcodes.


overall, i really like it, and it's precisely the kind of
"this-has-actually-been-designed-rather-than-done-by-accident"
inspiration i was looking for :)

the main thing i'm missing is the ability to do arithmetic
on the VLEN "control" register.

l.

Ivan Godard

Jul 1, 2019, 1:41:10 PM
On 7/1/2019 12:53 AM, lkcl wrote:
> On Sunday, June 30, 2019 at 4:34:52 PM UTC+1, Ivan Godard wrote:
>> On 6/30/2019 1:14 AM, lkcl wrote:
>>
>>> In particular from Ivan as I recall that the Mill has something similar in the form of a "data is not valid" tag, which propagates along the belt and thus has a similar data-dependent effect. Is there something similar to sequential "fail first" data testing in the Mill?
>>
>>
>> See https://millcomputing.com/docs/metadata/.
>
> https://youtu.be/DZ8HN9Cnjhc?t=665
> 11:14 - cool! the "tag" - containing the element width, length
> and whether it's a vector or a scalar - is conceptually near-identical
> to SV VBLOCK. cool!

Except you can mix vectors of different widths in a single loop using
tagging, whereas (if I understand it) VBLOCK is loop-global.

> i like that ST doesn't need anything in the opcode (at all), because
> the tag is carried through all the way from the LD.
>
> widen and narrow operators. make a lot of sense.
>
> love the FP metadata (FP flags "tag" the data) idea.
>
> "pick" (p ? a : b) is kiinda not-exactly-cheating, it's kiinda
> the same as a branch? except the range on "pick" is lower.
> branch that only jumped 1 instruction would be easier to
> consider doing both paths.

Pick eliminates branches when results converge at a CFG join and the
input sides were speculatively executed; we do a lot of that with our
width. Necessarily it removes any possibility of misprediction. In this
it is similar to the conditional-move and select ops in other
architectures. However, pick is not really an op like add; there's no
pick FU and it has zero latency, being done entirely in the crossbar
as data flows from cycle to cycle.

> NULL pointers, again adding metadata: now we're getting into
> "data-dependency" (in the form of tags).
>
> slide 53: smear. this is near-identical to the Aspex Semiconductors
> operator. the ASP variant *terminates* at the next 1. so:
>
> 00100001000010001000 -->
> 00111111000011111000
>
> it's been a while (like... 2003) so that *might* have been a
> separate register that indicated "end the smear" point(s)
>
> basically this was the core critical operator that allowed them
> to do ones "carry" propagation, of arbitrary size.
>
> so if you really wanted to, you could use the entire processor
> to do a 8192-bit integer ADD. or in the G Architecture, a 131072-bit
> integer add. or, because there was an in-string bit and an out-string
> bit, chains of processors could do near-unlimited bit ADDs

Do you know the IBM 1401? It did the same, only in decimal.

> ah finally, slide 55 :)
>
> ok so you use the "smear.x" to offset to include the "0".
>
> then the "pick" is inversion of that zeroing test.
>
> and because the pick is done against "none", the ST doesn't
> damage beyond end-of-string.
>
> i like it!

Thank you <modest bow>

> and because of the smear.x the last tested thing, you can
> branch based on that, so the loop is good.
>
> oh - btw - how do you increment the pointer(s) by the correct
> amount?

Static schedule, usually non-polymorphic, so the compiler knows what the
run-time width/scalarity will be and can just use the right increment.
In polymorphic code there's an op you can inquire the width of an
operand with; we don't anticipate polymorphic code being commonly useful.

> fascinatingly, then, the Mill achieves the same semantics
> as the data-dependent fail-on-first, combined with the
> "augmentation" of standard scalar BEQ/BNE into the vector
> domain (which i forgot to describe in the OP).

Fail-on-first is a special case of the Mill SIMD.

> the NaRs (not-a-result) are the direct equivalent of the
> "illegal memory load changes VL", except that NaRs would
> work even on arbitrary (indirect-addressed) LDs.

Yes, and all other possible predicates too, besides illegal load.

> what i haven't worked out is how to do the arithmetic
> that would increment the VLEN to cover the zero.
> smear.x is the offset-by-one that helps with that.
>
>
> slide 68: "remaining" operation. hmmmm... it's kiiinda
> equivalent to ~((1<<x)-1), is that right?

Yes, it's basically a conversion from vector-of-bool to bitmask.
>
> vector-remaining (the other way round) is kinda like CLZ
> (count leading zeros). or POPCNT (population count)
>
> this is why i decided to use *scalar* registers as
> predicate masks, because then dropping in xBitManip
> operations you don't need special opcodes.
>
>
> overall, i really like it, and it's precisely the kind of
> "this-has-actually-been-designed-rather-than-done-by-accident"
> inspiration i was looking for :)

Thank you. That's just one of our videos, and not the most full of
novelties. You might consider browsing the rest; I recommend doing them
in order if you do.

lkcl

Jul 1, 2019, 2:43:17 PM
On Monday, July 1, 2019 at 6:41:10 PM UTC+1, Ivan Godard wrote:

> > https://youtu.be/DZ8HN9Cnjhc?t=665
> > 11:14 - cool! the "tag" - containing the element width, length
> > and whether it's a vector or a scalar - is conceptually near-identical
> > to SV VBLOCK. cool!
>
> Except you can mix vectors of different widths in a single loop using
> tagging, whereas (if I understand it) VBLOCK is loop-global.

as in, the VBLOCK marks, and is restricted to, the start of a loop?
(this is not the case: all it does is provide the "tags"... oh, and because
one of the most common things to do is to set the vector length VLEN, i
decided to let that be optional as well).

there's also no _actual_ concept of "vectors" as such: consequently, it's
up to the compiler to say "hey you know those registers x5-x8 you used just
now, to pretend that there was a 4-long vector in them? well, we're gonna
now set VLEN=3 and you're gonna do an ADD on registers x16-18"

switching VLEN is done sequentially, and it's very weird, i appreciate,
to think of vectors being literally on top of the scalar regfile(s).

this does have the advantage that if you want to ADD to elements 2 thru 4
of a "vector", where formerly there was a MUL done on elements 0 thru 4,
normally you would have to do a shuffle on them: well... err... all you do
is: redefine the "vector" to point 2 further along in the *scalar* regfile
and make it 2 shorter.
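(in C terms, purely as an illustration: a "vector" here is nothing more than a (start register, VL) pair over the scalar regfile, so the "shuffle" becomes trivial arithmetic. sv_vec and subvec are invented names:)

/* a "vector" is just a (start register, VL) pair over the scalar
 * regfile, so "elements 2 thru 4 of the vector that started at x5"
 * is simply (x7, VL=3) -- no data movement at all. */
struct sv_vec { unsigned start_reg; unsigned vl; };

static struct sv_vec subvec(struct sv_vec v, unsigned from)
{
    struct sv_vec r = { v.start_reg + from, v.vl - from };
    return r;
}

/* e.g. the MUL used {x5, VL=5}; the follow-on ADD on elements 2..4
 * just uses subvec((struct sv_vec){5, 5}, 2), i.e. {x7, VL=3}. */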


> > i like that ST doesn't need anything in the opcode (at all), because
> > the tag is carried through all the way from the LD.
> >
> > widen and narrow operators. make a lot of sense.
> >
> > love the FP metadata (FP flags "tag" the data) idea.
> >
> > "pick" (p ? a : b) is kiinda not-exactly-cheating, it's kiinda
> > the same as a branch? except the range on "pick" is lower.
> > branch that only jumped 1 instruction would be easier to
> > consider doing both paths.
>
> Pick eliminates branches when results converge at a CFG join and input
> sides were speculatively executed; we do a lot of that with our width.
> Necessarily it removes any possibility of misprediction. In this it is
> similar to move-conditional and select ops in other architectures.
> However, pick is not really an op like add; there's not pickFU and it
> has zero latency, being done entirely in the crossbar as data flows from
> cycle to cycle.

interesting approach. the routing being easy and fast enough not to need
its own cycle.

> Do you know the IBM 1401? It did the same, only in decimal.

cool. weird, but cool.

> > and because the pick is done against "none", the ST doesn't
> > damage beyond end-of-string.
> >
> > i like it!
>
> Thank you <modest bow>

:)

> > and because of the smear.x the last tested thing, you can
> > branch based on that, so the loop is good.
> >
> > oh - btw - how do you increment the pointer(s) by the correct
> > amount?
>
> Static schedule, usually non-polymorphic so the compiler knows what the
> run-type width/scalarity will be can can just use the right increment.

ok.

> In polymorphic code there's an op you can inquire the width of an
> operand with; we don't anticipate polymorphic code to be commonly useful.

that's quite intriguing. ohh, now i know what you were mentioning last
week about the polymorphic functions.

>
> > fascinatingly, then, the Mill achieves the same semantics
> > as the data-dependent fail-on-first, combined with the
> > "augmentation" of standard scalar BEQ/BNE into the vector
> > domain (which i forgot to describe in the OP).
>
> Fail-on-first is a special case of the Mill SIMD.
>
> > the NaRs (not-a-result) are the direct equivalent of the
> > "illegal memory load changes VL", except that NaRs would
> > work even on arbitrary (indirect-addressed) LDs.
>
> Yes, and all other possible predicates too, besides illegal load.
>
> > what i haven't worked out is how to do the arithmetic
> > that would increment the VLEN to cover the zero.
> > smear.x is the offset-by-one that helps with that.
> >
> >
> > slide 68: "remaining" operation. hmmmm... it's kiiinda
> > equivalent to ~((1<<x)-1), is that right?
>
> Yes, it's basically a conversion from vector-of-bool to bitmask.

at first i couldn't think of a way to do that (without adding
something to scalar RISC-V). then i realised, BEQ/BNE is perfect:
just store the results of the *scalar* compares into a predicate
bitmask.

> > overall, i really like it, and it's precisely the kind of
> > "this-has-actually-been-designed-rather-than-done-by-accident"
> > inspiration i was looking for :)
>
> Thank you. That's just one of our videos, and not the most full of
> novelties. You might consider browsing the rest; I recommend doing them
> in order if you do.

appreciated.

Ivan Godard

Jul 1, 2019, 3:09:13 PM
On 7/1/2019 11:43 AM, lkcl wrote:
> On Monday, July 1, 2019 at 6:41:10 PM UTC+1, Ivan Godard wrote:
>
>>> https://youtu.be/DZ8HN9Cnjhc?t=665
>>> 11:14 - cool! the "tag" - containing the element width, length
>>> and whether it's a vector or a scalar - is conceptually near-identical
>>> to SV VBLOCK. cool!
>>
>> Except you can mix vectors of different widths in a single loop using
>> tagging, whereas (if I understand it) VBLOCK is loop-global.
>
> as in, the VBLOCK indicates and is restricted to be the start of a loop?
> (this is not the case: all it does is provide the "tags"... oh and because
> one of the most common things to do is to set the vector length VLEN, i
> decided to let that be optional as well).
>
> there's also no _actual_ concept of "vectors" as such: consequently, it's
> up to the compiler to say "hey you know those registers x5-x8 you used just
> now, to pretend that there was a 4-long vector in them? well, we're gonna
> now set VLEN=3 and you're gonna do an ADD on registers x16-18"
>
> switching VLEN is done sequentially, and it's very weird, i appreciate,
> to think of vectors being literally on top of the scalar regfile(s).
>
> this does have the advantage that if you want to ADD to elements 2 thru 4
> of a "vector", where formerly there was a MUL done on elements 0 thru 4,
> normally you would have to do a shuffle on them: well... err... all you do
> is: redefine the "vector" to point 2 further along in the *scalar* regfile
> and make it 2 shorter.

Yes, I know. But how about:
float af[];
double ad[];
for (int i=0; i<N; ++i) {
    af[i] = 0.0;
    ad[i] = 1.0;
}

This is trivial in a Mill, but I don't see how to use VBLOCK to get it,
other than by splitting it into two loops.

lkcl

Jul 1, 2019, 4:19:57 PM
On Monday, July 1, 2019 at 8:09:13 PM UTC+1, Ivan Godard wrote:

> Yes, I know. But how about:
> float af[];
> double ad[];
> for (int i=0; i<N; ++i) {
> af[i] = 0.0;
> ad[i] = 1.0;
> }
>
> This is trivial in a Mill, but I don't see how to use VBLOCK to get it,
> other than by splitting it into two loops.

the per-register tags that are set up by the VBLOCK contain independent
elwidth overrides, and VLEN is, again, independent of that (or rather,
the VLEN *incrementing* is independent of that), so it's really
straightforward. there are two ways to do it: one involves just using
FADD.S and FADD.D, the other would involve an over-ride to a 32-bit FP
width.

i don't know how to get constants efficiently into RISC-V FP registers,
so please excuse the mess at the start:

RegCSR[f4] = 32bit, f4, vector # override use of f4 to be 32-bit vector
RegCSR[f8] = dflt, f8, vector  # override use of f8 to be vector @ op-size
fcvt.s.w f20, x0               # 0.0 into f20
addi x1, x0, 1                 # 1 into x1
fcvt.d.w f21, x1               # convert 1 into 1.0 (double)
loop:
sv.setvl a2, t4, 8
FADD.S f4, f20, f20            # add zero to zero
FADD.D f8, f21, f20            # add 1.0 to zero
sub a2, a2, t4                 # decrement loop count
bnez a2, loop

the MAXVL here is 8, and bear in mind that the scalar regfile is *64*
bits wide; therefore:

* FADD.S results in *8* operations (spanning 4 64-bit regs):
FADD.S f4[LO-32], f20, f20
FADD.S f4[HI-32], f20, f20
FADD.S f5[LO-32], f20, f20
FADD.S f5[HI-32], f20, f20
FADD.S f6[LO-32], f20, f20
FADD.S f6[HI-32], f20, f20
FADD.S f7[LO-32], f20, f20
FADD.S f7[HI-32], f20, f20

* FADD.D *ALSO* results in 8 operations, however they span *8* 64-bit regs:
FADD.D f8, f21, f20
FADD.D f9, f21, f20
FADD.D f10, f21, f20
FADD.D f11, f21, f20
FADD.D f12, f21, f20
FADD.D f13, f21, f20
FADD.D f14, f21, f20
FADD.D f15, f21, f20

no need for separate VL loops.

so this emphasises: every operation is independent, it's *literally* like
a for-loop. every "tag" is independent, and those multi-element
operations are *direct* substitutes in-stream for the "one" (scalar)
operation, as if the multi-element expansion had *literally* been in
the instruction stream instead of the (one) scalar op.

the dependencies come if any of those registers... *after* the expansion
phase... overlap in any way (part or full).

dependencies are dealt with in the standard way because they are *literally*
instructions in a standard sequential program-order.

no "actual" vectors, at all. SV unrolls the hardware-for-loop at the
instruction *issue* phase.
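(a tiny C sketch of that issue-phase for-loop, for the FADD.S case above; illustrative only, the names are made up:)

/* the issue-phase "hardware for-loop" for the FADD.S case above:
 * one instruction whose destination is tagged as a 32-bit-element
 * vector expands into VL scalar sub-operations, packed two per
 * 64-bit register, exactly as if those sub-ops had been written in
 * the instruction stream themselves. */
static void issue_fadd_s(unsigned fd, unsigned fs1, unsigned fs2,
                         unsigned vl,
                         void (*issue)(unsigned reg, unsigned half,
                                       unsigned fs1, unsigned fs2))
{
    for (unsigned i = 0; i < vl; i++)
        issue(fd + i / 2,  /* two 32-bit elements per 64-bit reg */
              i % 2,       /* 0 = LO-32, 1 = HI-32 */
              fs1, fs2);   /* scalar sources, unchanged each time */
}

/* issue_fadd_s(4, 20, 20, 8, ...) produces exactly the eight FADD.S
 * sub-operations listed above: f4..f7, LO half then HI half. */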

where it gets more complicated to mentally handle is in two cases:

* two registers with different element widths that have overrides.
some polymorphic rules had to be made to deal with this.

* two registers used in one opcode that *overlap* in their target ranges
onto the *SAME* scalar registers.

this latter is useful for carrying out reduce operations. you very
deliberately create two "vectors" whose target "real" registers differ by
one.

RegCSR[f4] = dflt, f4, vector # override use of f4
RegCSR[f5] = dflt, f5, vector # override use of f5

a loop that does FADD f5, f5, f4 would issue the following:

FADD f5, f5, f4
FADD f6, f6, f5
FADD f7, f7, f6
..
..

and that's basically a Vector Reduce.
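(a quick C sanity-check of that expansion, illustrative only:)

#include <stdio.h>

/* the overlapping-vector FADD chain above, expanded in strict
 * program order, accumulates into the last register -- i.e. a
 * (sequential) vector reduce. */
int main(void)
{
    double f[16] = {0};
    f[4] = 1.0; f[5] = 2.0; f[6] = 3.0; f[7] = 4.0;

    /* FADD f5,f5,f4 ; FADD f6,f6,f5 ; FADD f7,f7,f6 */
    for (int i = 0; i < 3; i++)
        f[5 + i] += f[4 + i];

    printf("%g\n", f[7]);   /* prints 10: the sum of f4..f7's values */
    return 0;
}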

[it's not optimised (no parallelism), however given that it doesn't need
any opcodes added to RISC-V *at all*, i am having a hard time caring that
it's not optimal :) ]

it's almost the same, in effect, as the Mill: if RISC-V had polymorphic
INT/FP types, and had "widen" and "narrow" opcodes, it would be
a lot closer.


i haven't thought through what happens if you try to create one
register with a different elwidth pointing to the exact same
target: it probably makes a mess.

:)

l.

Bill Findlay

Jul 1, 2019, 6:02:17 PM
On 30 Jun 2019, Ivan Godard wrote (in article<qfakqq$j7b$1...@dont-email.me>):

> See https://millcomputing.com/docs/metadata/. The discussion of strncpy
> starting at slide 55 is a worked-out example, but needs the earlier
> material for understanding.
>
> Note: you *MUST* use real MS Windows Powerpoint for the slides; Mac and
> the free alternatives screw up the animations, which are critical. If
> you don't have MS Windows Powerpoint, just watch the video instead.

Ivan, for info, the animations work perfectly on my Mac with
up-to-date macOS and PowerPoint.

--
Bill Findlay


MitchAlsup

Jul 1, 2019, 6:02:18 PM
MY 66000 VVM

MOV r7,0
VEC {{SV}{SV},{VS}}
STW 0.0,[Raf+Ri<<2]
STD 1.0,[Rad+Ri<<3]
LOOP Ri,Rn,LT

MitchAlsup

Jul 1, 2019, 6:06:48 PM
There are better means to perform an FP reduce, and ones that have greater intermediate FP width give better numeric outcomes.
>
> [it's not optimised (no parallelism), however given that it doesn't need
> any opcodes added to RISC-V *at all*, i am having a hard time caring that
> it's not optimal :) ]
>
> it's almost the same, in effect, as the Mill: if RISC-V had polymorphic
> INT/FP types, and had "widen" and "narrow" opcodes, it would be
> a lot closer.
>
>
> i haven't thought through what happens if you try to create one
> register with a different elwidth pointing to the exact same
> target: it probably makes a mess.

If the registers have a mental model as cached values that have backing memory, what to do is totally obvious.
>
> :)
>
> l.

Ivan Godard

Jul 1, 2019, 8:30:44 PM
Good; thank you. We had trouble several years ago and I hadn't checked
to see if it had been fixed in the meantime. I wonder if LibreOffice and
the other clones work now too?

Bill Findlay

Jul 1, 2019, 9:04:18 PM
On 2 Jul 2019, Ivan Godard wrote
(in article <qfe8ji$t9i$1...@dont-email.me>):
LibreOffice 6.1.5.2 on macOS 10.14.5 seems to work,
but with a lethargy and clunkiness that try the patience.

--
Bill Findlay

lkcl

Jul 2, 2019, 4:56:25 AM
true. there's also no reason why a programmer should not
do the vector reduce as a reduction tree (programmatically),
using the "accidental feature that's available for free with no
extra effort which didn't have to be added"

consequently my point is more: i'm not going to spend the time on
creating actual *hardware* vector-reduce operations, as they're
quite specialist, and a use-case for the 3D GPU hasn't been
demonstrated... and a bit of programming wouldn't hurt.


> and ones that have greater intermediate FP width give better numeric outcomes.

i read a fascinating paper about dotproduct, which was highly optimised
at the normalisation/de-normalisation stage.


> >
> > [it's not optimised (no parallelism), however given that it doesn't need
> > any opcodes added to RISC-V *at all*, i am having a hard time caring that
> > it's not optimal :) ]
> >
> > it's almost the same, in effect, as the Mill: if RISC-V had polymorphic
> > INT/FP types, and had "widen" and "narrow" opcodes, it would be
> > a lot closer.
> >
> >
> > i haven't thought through what happens if you try to create one
> > register with a different elwidth pointing to the exact same
> > target: it probably makes a mess.
>
> If the registers have a mental model as cached values that have
> backing memory, what to do is totally obvious.

yehyeh, i was trying to be funny and shorten the (already long)
post by not exploring that (complex) option as well. you're right:
it can be thought through, with enough care, because ultimately
it's just scalar registers. unusual uses of the same, but ultimately
just scalar registers (or byte-addressable parts thereof in this case).

l.

lkcl

Jul 2, 2019, 4:59:14 AM
could reaally do with loading of FP immediates in RISC-V.... *sigh*

what's with the Ri<<2 and Ri<<3? Ri is the loop variable? so... shifted by 2 means "space of a single" and by 3 is "space of a double-word"...

that's ridiculously compact.

l.


lkcl

Jul 2, 2019, 7:01:55 AM
i believe i have a solution to the problem of the 3 extraneous opcodes
being needed in strncpy due to VLEN being modified (and no arithmetic
operations being available to do so), and the zero needing to be
copied.

it's quite simple: the SETVL actually marks the source
register as *becoming* VLEN:

loop:
SETVLI a2, t4, 8 # <<--- this says "t4 *IS* now VL"
ldb t0, (a1) # t0 fail first mode <<-- THIS NOW MODIFIES t4 AS WELL
bne t0, x0, allnonzero # still ff <<-- THIS NOW MODIFIES t4 AS WELL
# VL points to last nonzero
addi t4, t4, 1 # include zero <<--- THIS NOW MODIFIES VL AS WELL
stb t0, (a3) # store incl zero
ret # end subroutine
allnonzero:
stb t0, (a3) # VL legal range
# t4 *IS* VL therefore no need to do GETVL t4 here
add a1, a1, t4 # Bump src pointer
sub a2, a2, t4 # Decrement count.
add a3, a3, t4 # Bump dst pointer
bnez a2, loop # Any more?

if that is done inside a VBLOCK (if it fits), the context is destroyed
at the end of the block, so there is no need for an explicit instruction
that removes the "association" between t4 and VL.

this saves 3 instructions: one GETVL inside the main loop, and
two more in the exit path (one GETVL, one SETVL).

it's extending the "tag" (hidden state) concept one step further. all
vectorised operations now implicitly have an *additional* read-hazard (the
register indicated by SETVL), and in the case of fail-on-first, that's a
read *and* write hazard.

oof.

that's frickin complicated, as now t4 (or whatever is selected)
affects (specifies) the *number* of instructions that are issued
in any given cycle, because that's what VLEN does: it says *exactly*
how many *scalar* instructions are issued per "one-instruction-that-
has-a-register-marked-as-vectorised"

almost certainly, on changes to t4 (writes), the sane option is
to pause *all* instruction-issue on anything that's vectorised, waiting
for t4 to be written to.

still thinking it through.

l.

MitchAlsup

Jul 2, 2019, 1:17:18 PM
Yes.
>
> that's ridiculously compact.

And yet so few follow .....
>
> l.
