Issue / Clarification: does RVV vsetvl use some or all of the vector units? parallel workloads considered?


lk...@lkcl.net

unread,
Apr 8, 2018, 4:15:19 PM4/8/18
to RISC-V ISA Dev, wate...@eecs.berkeley.edu
Andrew, hi,


On page 19 the vsetvl instruction is initialised from a scalar register (the number of iterations), which I now understand to be truncated to the maximum vector execution length (yay).

I would imagine that it might be quite useful to consider interleaving vector instructions with FENCE, performing at the very least some setup of one vector operation whilst another is completing. In some particularly exotic architectures it might even be useful to vary the degree to which the *available* parallel engines are *subdivided* into lengths that best suit a particular application, so that each vector loop (in concert with assistance from FENCE in an OoO architecture) can take on different workloads *in parallel*.

Having read the paper on MIAOW I know, for example, that it does precisely that: hardware threads have runtime-varying parallel workloads. I appreciate that this is way more advanced than what is being discussed here, but it illustrates the point that tailoring parallel vector workloads to the available hardware is not unusual.

For that to work, vsetvl must not be truncated to the *maximum* size of the vector execution length but instead to a length that is determined by an additional scalar (immediate or register).

Has this scenario and possibility been considered before and taken into consideration, or did I miss it?

l.

Andrew Waterman

unread,
Apr 8, 2018, 4:47:01 PM4/8/18
to lk...@lkcl.net, RISC-V ISA Dev
The cleanest way to do this sort of thing is to multithread the scalar processor and the vector unit. Then the programming model remains conventional: each thread appears to have its own vector unit. And no ISA changes are necessary.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/279227d4-9fc5-4267-9787-1208b5bb4f79%40groups.riscv.org.

Luke Kenneth Casson Leighton

unread,
Apr 8, 2018, 9:02:50 PM4/8/18
to Andrew Waterman, RISC-V ISA Dev
On Sun, Apr 8, 2018 at 9:46 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> The cleanest way to do this sort of thing is to multithread the scalar
> processor and the vector unit. Then the programming model remains
> conventional: each thread appears to have its own vector unit.

see below: each thread would, under certain workloads, still have
less than 100% utilisation. *sustained* less than 100% utilisation.

> And no ISA changes are necessary.

ok so allow me to come up with a use-case which may demonstrate it;
there may be more.  let's say the maximum vector length is 8, and
the vector being processed is 6 wide, or 3 wide.  3-wide is 24-bit
audio (3 bytes per sample), or perhaps XYZ coordinates in 3D.  so
these are not obscure scenarios: they're quite likely.

now, in the 6-wide case, you have only 75% utilisation, because 2 out
of 8 of the vector ALUs are never going to be active (and the 3-wide
case is worse still).
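the arithmetic generalises: with vl elements on N lanes, each op takes ceil(vl/N) cycles, so sustained utilisation is vl / (ceil(vl/N) * N). a quick python sketch (purely illustrative, assuming one element per lane per cycle, not any real microarchitecture):

```python
import math

def sustained_utilisation(vl, lanes):
    """Fraction of lane-cycles doing useful work when a vector of
    length vl executes on `lanes` parallel ALUs; each vector op
    occupies the lanes for ceil(vl / lanes) cycles."""
    cycles = math.ceil(vl / lanes)
    return vl / (cycles * lanes)

# 6-element vectors on 8 lanes: 2 of 8 lanes idle every cycle
print(sustained_utilisation(6, 8))  # 0.75
# 3-element vectors on 8 lanes are worse still
print(sustained_utilisation(3, 8))  # 0.375
```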

if you *know* that to be the case and there are other simultaneous
vector operations that need to be performed, then by *explicitly*
setting the maximum vector length with vsetvl it becomes possible to
get to 100% utilisation of the vector ALUs, if FENCE is deployed.

people with more expertise in parallel compute may be able to come up
with better examples.

l.

lk...@lkcl.net

unread,
Apr 9, 2018, 4:23:09 AM4/9/18
to RISC-V ISA Dev, wate...@eecs.berkeley.edu, jcb6...@gmail.com
cc'ing jacob as this links in with another thread.

jacob: you've studied hwacha so may know more.  is it the case that RVV has a means to explicitly set the maximum vector length on a per-vector basis?  it's completely unclear, even reading 17.11.

if there is, i would then expect the implementation of vsetvl vN, a0 to be:

CSR_vector_length = min(min(CSR_vcfg_len[N], max_vector_ALUs), a0)
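a quick python model of that rule (the names are mine, purely illustrative, not from any spec):

```python
def setvl(requested, cfg_len, max_lanes):
    """Model of the rule sketched above: the granted vector length is
    the requested application length, clamped by both the configured
    per-vector length and the hardware maximum."""
    return min(min(cfg_len, max_lanes), requested)

print(setvl(100, cfg_len=16, max_lanes=8))  # 8
print(setvl(5, cfg_len=16, max_lanes=8))    # 5
```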

l.

Jacob Bachmeyer

unread,
Apr 10, 2018, 6:39:52 PM4/10/18
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Sun, Apr 8, 2018 at 9:46 PM, Andrew Waterman
> <wate...@eecs.berkeley.edu> wrote:
>
>
>> The cleanest way to do this sort of thing is to multithread the scalar
>> processor and the vector unit. Then the programming model remains
>> conventional: each thread appears to have its own vector unit.
>>
>
> see below: each thread would, under certain workloads, still have
> less than 100% utilisation. *sustained* less than 100% utilisation.
>
>
>> And no ISA changes are necessary.
>>
>
> ok so allow me to come up with a use-case which may demonstrate.
> there may be more. let's say the maximum vector length is 8. let's
> say that the vector being processed is 6 wide, or 3 wide. 3 wide is
> audio 24-bit or perhaps 3D is XYZ. so these are not obscure scenarios
> they're quite likely.
>

I think that I see a small misunderstanding of vectors here. No
surprise, since this is exactly the misunderstanding that SIMD marketing
efforts often promote.

For example, consider packed RGB888 image data. An application processing
this data with RVV would configure vectors with U8 element type and use
total vector lengths that are multiples of 3, since the data is 3-tuples
of U8 elements. For RGBA8888, the element type is still U8, but now the
elements can be grouped into 4-tuples. For some operations, like adding
RGB888 buffers together, the tuple boundaries are insignificant. For
other operations, like alpha-compositing RGBA8888 data, the tuple
boundaries are significant and the application must check that the
effective vector length is a multiple of the tuple size and round the
application vector length down if needed.
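That rounding is one line of integer arithmetic; a small python sketch (illustrative only):

```python
def tuple_aligned_vl(granted_vl, k):
    """Round a granted vector length down to a whole number of
    k-tuples, so no tuple is split across strip-mine iterations."""
    return (granted_vl // k) * k

# 8 elements granted, RGB888 (k=3): process 6 elements (2 whole pixels)
print(tuple_aligned_vl(8, 3))  # 6
# RGBA8888 (k=4) divides 8 exactly, so nothing is lost
print(tuple_aligned_vl(8, 4))  # 8
```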

This suggests possible "vector-tuple" operations, like "tuple insert"
(insert an alpha channel in RGB888 data, or unpack 24-bit RGB888 to
aligned 32-bit pixels), "tuple drop" (remove alpha channel from RGBA8888
data or pack aligned 32-bit pixels to 24-bit RGB888 or extract any
subset of channels from any of these by dropping all other channels),
and "tuple splat" (extend an alpha channel to prepare for scaling the
RGB pixel values). These would probably use vector predicates as
control inputs.



-- Jacob

Luke Kenneth Casson Leighton

unread,
Apr 10, 2018, 6:51:52 PM4/10/18
to Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Tue, Apr 10, 2018 at 11:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> I think that I see a small misunderstanding of vectors here. No surprise,
> since this is exactly the misunderstanding that SIMD marketing efforts often
> promote.

Apologies, I may not have been clear about the precise nature of the
example: the exact details matter less than the mismatch between the
number of parallel execution units and the length of the "vectorised
data" being such that the full capacity of the underlying hardware may
never, at *any* time, be fully 100% utilised.  ever.

If the examples that I provided were not of that type, please imagine
that there *do* exist examples that are of that type... or even more
usefully perhaps suggest some? :)

Vectors of maximum length 3 when the pipeline is of length 4 or 8 are
particularly nasty ones as they cannot possibly fit at anything over
75% hardware utilisation. In the more general case, vectors that are
a small prime number in length when the underlying hardware is a small
power of two are going to be an absolute nightmare.

No amount of threading / interleaving will help with such scenarios.

However... if there is control over how much of the underlying
parallel hardware is to be allocated to a given vector, the
applications writer may be able to tweak the parameters somewhat, and
possibly get closer to 100% sustained utilisation.

l.

Andrew Waterman

unread,
Apr 10, 2018, 7:13:32 PM4/10/18
to Luke Kenneth Casson Leighton, Jacob Bachmeyer, RISC-V ISA Dev
Actually, in many such cases, outer-loop vectorization is an effective strategy to get longer application vectors.




Jacob Bachmeyer

unread,
Apr 10, 2018, 11:54:25 PM4/10/18
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev
So the scenario we are concerned about is data that forms K-tuples of
native vector elements on a machine with N vector lanes, where N and K
are relatively prime?

Presumably, if vector processing is to be useful, more than one such
K-tuple is needed in a calculation. In the trivial case, the data can
be processed in groups of N K-tuples, which are guaranteed to keep the
vector unit busy, requiring K cycles across the N lanes.
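In python terms (an illustrative model only, not hardware):

```python
import math

def group_schedule(n_lanes, k):
    """The trivial grouping above: take N k-tuples (N*k elements) per
    strip-mine pass; that occupies all N lanes for exactly k cycles,
    giving 100% utilisation regardless of whether N and k share factors."""
    elements = n_lanes * k
    cycles = math.ceil(elements / n_lanes)
    utilisation = elements / (cycles * n_lanes)
    return cycles, utilisation

print(group_schedule(8, 3))  # (3, 1.0)
```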

> No amount of threading / interleaving will help with such scenarios.
>

Rearranging the outer loop to group many tuples into a single vector
will help, though, as Dr. Waterman said.

> However... if there is control over how much of the underlying
> parallel hardware is to be allocated to a given vector, the
> applications writer may be able to tweak the parameters somewhat, and
> possibly get closer to 100% sustained utilisation.
>

If that outer-loop vectorization is not possible due to inter-lane
effects within tuples that must not occur between tuples, then we need
to add tuple-related features to some future extension of RVV. (RVVtuple?)


-- Jacob

Andrew Waterman

unread,
Apr 11, 2018, 12:09:57 AM4/11/18
to Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Or K vectors of N elements, when K is statically known (e.g. RGBA kind of stuff).

Jacob Bachmeyer

unread,
Apr 11, 2018, 1:00:15 AM4/11/18
to Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Andrew Waterman wrote:
> Or K vectors of N elements, when K is statically known (e.g. RGBA kind of
> stuff).
Which is an option that constant-stride load/store will enable, since
RGBA is usually stored in packed rather than planar formats. To my
knowledge, current systems cannot do this easily.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Apr 11, 2018, 3:30:04 PM4/11/18
to Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev
On Wed, Apr 11, 2018 at 4:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> Vectors of maximum length 3 when the pipeline is of length 4 or 8 are
>> particularly nasty ones as they cannot possibly fit at anything over
>> 75% hardware utilisation. In the more general case, vectors that are
>> a small prime number in length when the underlying hardware is a small
>> power of two are going to be an absolute nightmare.
>>
>
>
> So the scenario we are concerned about is data that forms K-tuples of native
> vector elements on a machine with N vector lanes, where N and K are
> relatively prime?

ahh... yes? :) http://primes.utm.edu/glossary/xpage/relativelyprime.html
ah, yes!

> Presumably, if vector processing is to be useful, more than one such K-tuple
> is needed in a calculation. In the trivial case, the data can be processed
> in groups of N K-tuples, which are guaranteed to keep the vector unit busy,
> requiring K cycles across the N lanes.
>
>> No amount of threading / interleaving will help with such scenarios.
>>
>
>
> Rearranging the outer loop to group many tuples into a single vector will
> help, though, as Dr. Waterman said.

A potential counter-example is the RGB one, where... oh... i dunno...
R, G and B are all half-precision FP, let's suppose... and the
required task is... suppose... to find the average weight (contrast or
brightness) of each "pixel".

Under such circumstances the inter-relation between the three (the
sum) is such that I don't see how re-arranging would help.

Is there *really* no way in RVV's ISA to say how much hardware should
be allocated?

Oh wait - duh: I thought of a possible solution:

stripmine:
    a5 = min(a0, 3)
    vsetvl t0, a5
    ....

what about that? Is it as simple as that? What effect would that have
on the Hwacha (or current) hardware? Could the technique of
deliberately restricting the number of vector-elements to be processed
be used in a more general context (across all implementations)? Would
it cause some implementations to stall dramatically and others to have
reasonable performance?

l.

Luke Kenneth Casson Leighton

unread,
Apr 11, 2018, 3:46:09 PM4/11/18
to Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev
On Wed, Apr 11, 2018 at 6:00 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> Rearranging the outer loop to group many tuples into a single
>> vector will help, though, as Dr. Waterman said.
>>
>>
>> Or K vectors of N elements, when K is statically known (e.g. RGBA kind of
>> stuff).
>
>
> Which is an option that constant-stride load/store will enable, since RGBA
> is usually stored in packed rather than planar formats. To my knowledge,
> current systems cannot do this easily.

If I can... interpret this with an example in a way that I can
understand: what you're saying is that if, say, the RGBA data is in
16-bit format (I say 16-bit because I'm not sure if STRIDE can do
8-bit single-byte load/store... Hwacha looks like it can! vlb, vlh,
vlw, vld, yay!), strided load/store would be able to extract all of
the Rs into one vector, all of the Gs into another, and so on, such
that processing could take place on all the Rs as one operand and all
the Gs as another.  you could then do one of three possible
vector-operations (six if the operations are not commutative, such as
DIV): R+G, G+B, R+B, and thus do the "averaging" example by doing
first R+G followed by (R+G) + B.

something like that?
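a quick python model of what I mean, with slice steps standing in for the stride-3 loads (illustrative only, not RVV code):

```python
# Packed RGB data: [R0, G0, B0, R1, G1, B1, ...]
pixels = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Stride-3 "loads" with base offsets 0, 1, 2 pull out the three planes
r = pixels[0::3]
g = pixels[1::3]
b = pixels[2::3]

# Element-wise (R+G)+B then scale by 1/3: the whole computation runs
# at the pixel count, not at tuple width 3
avg = [(ri + gi + bi) / 3 for ri, gi, bi in zip(r, g, b)]
print(avg)  # [20.0, 50.0, 80.0]
```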

Which would indicate that yes, the vector system is incredibly
powerful (hence why I'd like to see its reach abstracted out and
extended down to *all* RV ISAs... with the side-benefit that future
ones would get the same parallelism capability *automatically* as
well)

whilst at the same time indicating that another example needs to be
imagined which is not so amenable to reordering, hmmm...

I do appreciate what you're saying, Andrew: I can just... intuitively
tell that we're missing something, here, some important use-case where
reordering of the data to fit the architecture is "tail wagging dog".
I've seen how far that can go with the Aspex ASP: Jacob followed the
example on another thread - it's not pretty :)

l.

Andrew Waterman

unread,
Apr 11, 2018, 4:13:57 PM4/11/18
to Luke Kenneth Casson Leighton, Jacob Bachmeyer, RISC-V ISA Dev
You want to produce a new array that's the average of R, G, and B?  The strip-mine loop is three stride=3 loads whose bases are offset by one byte each.  Then two vector adds, a vector multiply, and a unit-stride vector store.

The "3" doesn't manifest in the vector length.

Luke Kenneth Casson Leighton

unread,
Apr 11, 2018, 4:59:58 PM4/11/18
to Andrew Waterman, Jacob Bachmeyer, RISC-V ISA Dev

On Wed, Apr 11, 2018 at 9:13 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> You want to produce a new array that's the average of R, G, and B?

not exactly: i want to come up with a use-case that *doesn't* fit the
vector pipeline.  that i haven't been able to do so... yet... does
not mean that, among all possible (infinite) algorithms, one very,
very important use-case does not exist.

> The
> strip-mine loop is three stride=3 loads whose bases are offset by one byte
> each. Then two vector adds, a vector multiply, and a unit-stride vector
> store.
>
> The "3" doesn't manifest in the vector length.

i'm trying to think of examples that _don't_ fall into that category,
and having a hard time doing so.  from long experience of trusting
that "nagging intuition at the back of my mind that just won't go
away", i *know* that just because i can't think of one *right now*,
that doesn't mean there isn't a really, really important use-case.

annoyingly - again from experience - it can take potentially days,
weeks and sometimes months for me to go "ah damnit, *that's* the
use-case".

grrr :)

l.

Jacob Bachmeyer

unread,
Apr 11, 2018, 10:31:27 PM4/11/18
to Luke Kenneth Casson Leighton, Andrew Waterman, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Wed, Apr 11, 2018 at 9:13 PM, Andrew Waterman
> <wate...@eecs.berkeley.edu> wrote:
>
>
>> You want to produce a new array that's the average of R, G, and B?
>>
>
> not exactly: i want to come up with a use-case that *doesn't* fit the
> vector pipeline. that i haven't been able to do so... yet... does not
> mean that for all possible (infinite) algorithms, one very very
> important use-case does not exist.
>

Here's a tuple case for which the answer may be "baseline RVV will not
support that": 24-bit audio data.  Each sample can be considered a
3-tuple of U8 elements.  Or is there some way to load 24-bit integers
as U32 elements?  Perhaps a stride-3 load of 4-byte elements followed
by masking the top 8 bits?  Will RVV directly support misaligned
element accesses?
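In software terms the recombination looks like this (a python sketch; the byte gathers stand in for hypothetical stride-3 byte loads plus shifts and ORs, not a real RVV sequence):

```python
def load_u24_le(buf):
    """Unpack little-endian 24-bit samples from a packed byte buffer
    into 32-bit integer elements: three byte gathers at stride 3,
    recombined with shifts."""
    return [buf[i] | (buf[i + 1] << 8) | (buf[i + 2] << 16)
            for i in range(0, len(buf), 3)]

samples = bytes([0x01, 0x00, 0x00, 0xFF, 0xFF, 0x7F])
print(load_u24_le(samples))  # [1, 8388607]
```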

>> The
>> strip-mine loop is three stride=3 loads whose bases are offset by one byte
>> each. Then two vector adds, a vector multiply, and a unit-stride vector
>> store.
>>
>> The "3" doesn't manifest in the vector length.
>>
>
> i'm trying to think of examples that _don't_ fall into that category,
> and having a hard time doing so. from long experience from trusting
> that "nagging intuition at the back of my mind that just won't go
> away" i *know* that just because i can't think *right now* of one,
> that there isn't a really, really important use-case.
>
> annoyingly - again from experience - it can take potentially days,
> weeks and sometimes months for me to go "ah damnit, *that's* the
> use-case".
>

Realistically, we probably have months before RVV will be anywhere close
to even 1.0. RVV is large and complex; even with Hwacha as a "cheat
sheet" development will not be quick.


-- Jacob