[rfc] SimpleV vectorisation / parallelism hardware abstraction API for RISC-V

lk...@lkcl.net

Nov 5, 2018, 6:54:59 AM

over the past several months i've been developing an abstraction layer
for vectorisation and parallelisation of RISC-V. i've never encountered
anything remotely similar in any commercial processor design, hence
why it could seriously benefit from a review.

this is not a hypothetical design concept: it's *going* to be implemented,
for use as the core basis of a Libre GPGPU (general-purpose 3D Graphics
Processor), with a secondary requirement to have sufficient legs to
act as a VPU (Video Processing Unit) as well.

the core of the concept is that standard scalar registers are "tagged"
with CSRs (Control Status Registers) indicating that use of that
register shall result in not one *single* (scalar) operation but
*multiple contiguous* scalar operations being issued to the ALU.

it is then up to the micro-architectural implementor to decide whether
those multiple operations shall be executed sequentially, or in
parallel, or (as is done with the Broadcom VideoCore IV), a hybrid
mixture of both (Broadcom call this "Virtual Parallelism").

effectively SV is a hardware version of c/c++ macro loop-expansion.
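
a minimal C sketch of that expansion (VL, tagged[] and sv_add are
names i've invented for illustration; they are not the actual CSR
layout or decode logic):

    #include <stdint.h>

    #define NREGS 128        /* SV extends the scalar regfile */
    uint64_t regs[NREGS];
    int      tagged[NREGS];  /* 1 = register marked "vectorised" */
    int      VL = 4;         /* vector length, set via CSR */

    /* one scalar ADD rd,rs1,rs2 issues as VL contiguous element
     * operations; untagged (scalar) operands are not incremented */
    void sv_add(int rd, int rs1, int rs2)
    {
        int n = (tagged[rd] | tagged[rs1] | tagged[rs2]) ? VL : 1;
        for (int i = 0; i < n; i++)
            regs[rd + (tagged[rd] ? i : 0)] =
                regs[rs1 + (tagged[rs1] ? i : 0)] +
                regs[rs2 + (tagged[rs2] ? i : 0)];
    }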

SV borrows and builds on RVV (a proposed vectorisation API for RISC-V),
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc

RVV in turn builds on the Cray Architecture. the primary benefit
of variable-length vectorisation is outlined here:
https://www.sigarch.org/simd-instructions-considered-harmful/

the draft specification for SV is here:
https://libre-riscv.org/simple_v_extension/specification/


if anyone finds a c++ source code reference implementation easier
to read than "just words", that may be found here:
https://git.libre-riscv.org/?p=riscv-isa-sim.git;a=shortlog;h=refs/heads/sv

and associated unit tests here:
https://git.libre-riscv.org/?p=riscv-tests.git;a=shortlog;h=refs/heads/sv

and preliminary build instructions for the simulator and unit tests
found here:
https://libre-riscv.org/3d_gpu/spike_sv/

much appreciated.

l.

MitchAlsup

Nov 5, 2018, 11:16:32 AM

On Monday, November 5, 2018 at 5:54:59 AM UTC-6, lk...@lkcl.net wrote:
> over the past several months i've been developing an abstraction layer
> for vectorisation and parallelisation of RISC-V. i've never encountered
> anything remotely similar in any commercial processor design, hence
> why it could seriously benefit from a review.

Over the last year I have been working on a Virtual Vector extension to
My 66000 ISA. In this extension the compiler is given the illusion of
a register to register CRAY-like vector instructions set, but there are
only 2 instructions added to the ISA in support of vectors. In particular;
a) there are no vector registers (saving massive real estate)
b) no vector length register (nor a conceptual length of the vector)
c) no vector control registers (no vector state, full vector perf)
d) no vector condition codes (scalar predication covers this)

Consider for a moment DAXPY::

void daxpy(size_t n, double a, const double x[], double y[])
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a*x[i] + y[i];
    }
}

in My 66000 ISA this could be expressed as:

daxpy:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   exit
top:
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    FMAC  Ry,Ra,Rx,Ry
    STD   Ry,[Ryb+Ri<<3]
    ADD   Ri,Ri,1
    CMP   Rt,Ri,Rn
    BLT   Rt,top
exit:
    JMP   R0

Note: This takes 2 fewer instructions than RV, both overall and in the loop.
In order to convert this function into a vector function, one needs two
pieces of decoration::

daxpy:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   exit
top: VEC {SIV,SIV,VVV,SIV,ICI}    // decorate the registers
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    FMAC  Ry,Ra,Rx,Ry
    STD   Ry,[Ryb+Ri<<3]
    LOOP  Ri,1,Rn,LT              // compress loop overhead
exit:
    JMP   R0

The decoration at the top of the loop annotates which registers in the
following instructions are Scalars (S), Vectors (V), and Loop-Indexes (I).

The mechanism of execution is as follows:: The VEC instruction casts a
shadow over the instructions in the vectorized loop. Scalar registers
only have to be fetched from the register file ONCE, Vector registers
will be captured Once per loop, and Writes to the Index register cause
another LOOP to be initiated in <wherever the instructions are being
buffered::but most likely something like reservation stations.>

The LOOP instruction does not need a branch target: that was already
supplied by the VEC instruction; all it needs is the Loop index register,
an increment, a comparison, and a termination.
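
In rough C terms (loop_step and daxpy_model are just illustrative
names for this sketch, not actual My 66000 mnemonics):

    /* LOOP Ri,inc,Rn,cc: bump the index, compare, and report whether
     * another pass over the VEC shadow should be initiated; the branch
     * target is implicit, supplied by the earlier VEC instruction */
    long loop_step(long *Ri, long inc, long Rn)
    {
        *Ri += inc;
        return *Ri < Rn;               /* LT termination, as in daxpy */
    }

    void daxpy_model(long n, double a, const double x[], double y[])
    {
        long i = 0;
        if (i >= n) return;            /* the scalar CMP/BGE before top: */
        do {
            y[i] = a * x[i] + y[i];    /* LDD, LDD, FMAC, STD */
        } while (loop_step(&i, 1, n)); /* LOOP Ri,1,Rn,LT */
    }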

Since the My 66000 ISA already has predication, predication can be used
for small scale flow control inside vector loops. The reservation stations
know their predicate (if any::it gets installed in the first pass over
the loop) and fire or hold based on the predicate for that loop.

Architecturally, there can be as many lanes as the designers want to
build (NEC SX5+) with the caution that the memory system needs to scale
proportionate to the calculation system.

While executing, vector values flow through the forwarding paths and through
calculation units, not bothering to be written into the register file.

If an enabled exception is raised, the vector calculations preceding the
exception are allowed to complete, and vector data is stored in the
scalar registers, exception control transfer is performed, and the
exception is handled as if there were no vectorization at all.

In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
are quiet, waiting for the vector to complete (and getting FETCH out of the
way of vectors in memory.)

The memory system performs store-to-load forwarding analysis so that loops
such as:

void looprecurance(size_t n, double a, double x[])
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a*x[i-10] + x[i-7];
    }
}

remain vectorized; one calculation simply becomes dependent on the 7th and 10th
previous calculations through a memory system. No special analysis is required
in the compiler to vectorize this loop, nor is any special detection necessary
to avoid vectorizing this loop. The basic tenet is that if the scalar code can
be written in a simple loop, then one can change the top and bottom of the
loop and presto, it is vectorized.

All of the advantages of a CRAY-like vector ISA, virtually no change to the
actual ISA (2 instructions), and precise exceptions too.

Terje Mathisen

Nov 5, 2018, 11:46:27 AM

MitchAlsup wrote:
> The memory system performs store-to-load forwarding analysis so that
> loops such as:
>
> void looprecurance(size_t n, double a, double x[]) { for
> (size_t i = 0; i < n; i++) { x[i] = a*x[i-10] + x[i-7]; } }
>
> remain vectorized; one calculation simply becomes dependent on the
> 7th and 10th previous calculations through a memory system. No
> special analysis is required in the compiler to vectorize this loop,
> nor is any special detection necessary to avoid vectorizing this
> loop. The basic tenet is that if the scalar code can be written in a
> simple loop, then one can change the top and bottom of the loop and
> presto, it is vectorized.
>
> All of the advantages of a CRAY-like vector ISA, virtually no change
> to the actual ISA (2 instructions), and precise exceptions too.

Mitch, you have written more than enough about this feature to whet my
appetite:

How much hardware do you actually need in order to run this at
comparable speed to some Intel-specific 4-wide AVX loop code?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Nov 5, 2018, 12:27:45 PM

Well, considering that the goal was to run stuff like daxpy at 6 IPC or
2 flops per cycle per lane, the answer to your question is "not much".

So, let us postulate a set of calculation units that can perform 128-bit
or 256-bit wide AVX codes and a cache system that can supply and absorb
these sized accesses.

In virtual vector, the more natural means would be to organize 4 lanes
of calculation, each lane 64-bits wide, each lane processing 1 loop in
coordination with the other lanes.

You would need
a) 2 loads per cycle per lane
b) 1 FMAC per cycle per lane
c) 1 store per cycle per lane
d) 1 LOOP per cycle across lanes

Since all of the loads and stores are present at the datum level, they do
not necessarily need the 4 datums to reside side by side in memory, nor
do they need 128-bit wide or 256-bit wide registers. A long time ago I took
a look at radix 2 FFT and whether I could vectorize it::

SUBROUTINE FFT2( A, N )
    COMPLEX DOUBLE PRECISION A[M]
    DOUBLE PRECISION RE[M] = &REAL(A[0])
    DOUBLE PRECISION IM[M] = &IMAG(A[0])
    N2 = N
    DO 10 K = 1, M
        DOUBLE PRECISION RO = &RE[N2]
        DOUBLE PRECISION IN = &IM[N2]
        N1 = N2
        N2 = N2/2
        E = 6.28318/N1
        A = 0
        DO 20 J = 1, N2
            C = COS (A)
            S =-SIN (A)
            A = J*E
            DO 30 I = J, N, N1
                XT = RE(I) - RO(I)
                YT = IM(I) - IN(I)
                RE(I) = RE(I) + RO(I)
                IM(I) = IM(I) + IN(I)
                RO(I) = XT*C - YT*S
                IN(I) = XT*S + YT*C
30          CONTINUE
20      CONTINUE
10  CONTINUE

and got something like this:

FFT2:
    // MOV RRE,RA             // RA not damaged during routine
    ADD   RIM,RA,8
    MOV   RN2,RN
    MOV   RK,1
loop10:
    LDA   RRO,[RRE+RN2<<3]
    LDA   RIN,[RIM+RN2<<3]
    MOV   RN1,RN
    SRA   RN2,RN,1
    CVT   (double)RN1F,(int)RN1
    FDIV  RE,6.283185307179586476925286766559,RN1F
    MOV   RA,0
    MOV   RJ,1
loop20:
    COS   RC,RA
    SIN   -RS,RA
    CVT   (double)RJF,(int)RJ
    FMUL  RA,RJF,RE
    MOV   RI,1
loop30:
    VEC   {VSI*4,VVV*12,VSI*4,ISS}
    LDD   RXI,[RRE+RI<<3]
    LDD   RXL,[RRO+RI<<3]
    LDD   RYI,[RIM+RI<<3]
    LDD   RYL,[RIN+RI<<3]
    FADD  RXT,RXI,-RXL
    FADD  RYT,RYI,-RYL
    FADD  RXI,RXI,RXL
    FADD  RYI,RYI,RYL
    FMUL  RXC,RXT,RC
    FMUL  RXS,RYT,RS
    FMUL  RYS,RXT,RS
    FMUL  RYC,RYT,RC
    FADD  RXL,RXC,-RXS
    FADD  RYL,RYS,RYC
    STD   RXI,[RRE+RI<<3]
    STD   RYI,[RIM+RI<<3]
    STD   RXL,[RRO+RI<<3]
    STD   RYL,[RIN+RI<<3]
    LOOP  RI,RN1,RN,LE

    ADD   RJ,RJ,1
    CMP   Rc,RJ,RN2
    BLE   Rc,loop20
    ADD   RK,RK,1
    CMP   Rc,RK,RM
    BLE   loop10
    JMP   R0

For full perf on this one, one would need 6 lanes of FADD, 4 lanes of FMUL,
4 lanes of LDD, and 4 lanes of STD. If one had that set of resources, one
would be performing a memory to memory radix 2 FFT butterfly every cycle.

{Note: some consider it numerically dangerous to fuse an FMAC out of FFT
FMULs and FADDs; so I didn't. It did not fit the general calculation
sequence either.}

lk...@lkcl.net

Nov 5, 2018, 1:03:12 PM

On Monday, November 5, 2018 at 4:16:32 PM UTC, MitchAlsup wrote:

> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)

yes. there is a huge disadvantage of traditional vector ISAs:
massive duplication and specialisation of the opcodes.

in RVV they have had to add approximately a 25-30% duplication
of the xBitManip extension for the sole purpose of managing predicate
bits... *without* that hardware being actually available for the
purposes of general-purpose bit manipulation.

a combination of SV and xBitManip extensions provides near 100%
complete coverage of RVV's vectorisation functionality.

there is one perceived advantage: the vector pipeline may be entirely
separated from the scalar pipeline. i would perceive that as a
*dis*advantage, on the basis that it results in huge duplication of
silicon, particularly arithmetic pipelines.

> b) no vector length register (nor a conceptual length of the vector)

that's interesting.

> c) no vector control registers (no vector state, full vector perf)

some "setup" is required however. the state information needs to
be set up one way or the other. CSRs or instructions, it's one
or the other.

> d) no vector condition codes (scalar predication covers this)

yes it does. RISC-V scalar operations do not have the concept
of predication. therefore, SV has an additional CSR "predication"
table that allows the *use* of a given (ordinarily scalar) register
to indicate "hey this operation is now predicated, because you
used register x5, and the CSR lookup table says that register x3's
bits are to be used as the predicate".

in this way there was absolutely no need to modify the RISC-V
scalar ISA *at all*.
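
a rough C sketch of that decode-time lookup -- the field names, table
size and function here are invented for illustration, not the spec's
actual CSR layout:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        int  tagged_reg;  /* e.g. x5: using it triggers predication */
        int  pred_reg;    /* e.g. x3: its bits form the predicate   */
        bool invert;      /* optionally invert the predicate bits   */
    } pred_entry;

    #define NPRED 16
    pred_entry pred_table[NPRED];
    int        pred_entries;
    uint64_t   regs[32];

    /* bit i set in the result => element i of the operation executes */
    uint64_t lookup_predicate(int rd, int VL)
    {
        uint64_t all = (VL < 64) ? ((1ULL << VL) - 1) : ~0ULL;
        for (int e = 0; e < pred_entries; e++)
            if (pred_table[e].tagged_reg == rd) {
                uint64_t m = regs[pred_table[e].pred_reg];
                return (pred_table[e].invert ? ~m : m) & all;
            }
        return all;   /* not in the table: unpredicated, all elements on */
    }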


> Consider for a moment DAXPY::

we like DAXPY :)

> Note: This takes 2 fewer instructions than RV, both overall and in the loop.
> In order to convert this function into a vector function, one needs two
> pieces of decoration::
>
> daxpy:
> MOV Ri,0
> CMP Rt,Ri,Rn
> BGE exit
> top: VEC {SIV,SIV,VVV,SIV,ICI} // decorate the registers
> LDD Rx,[Rxb+Ri<<3]
> LDD Ry,[Ryb+Ri<<3]
> FMAC Ry,Ra,Rx,Ry
> STD Ry,[Ryb+Ri<<3]
> LOOP Ri,1,Rn,LT // compress loop overhead
> exit:
> JMP R0
>
> The decoration at the top of the loop annotates which registers in the
> following instructions are Scalars (S), Vectors (V), and Loop-Indexes (I).

ok, so yes: the "mark as vectorised" is the basic core principle of
SV. in SV User-Mode there are 16 available CSR entries which may be used to
mark up to 16 registers (FP or INT) as "vectorised".

what i have *not* done is add "loop indices". there is a very simple
reason: SV is *not* about adding new instructions, it's about
expressing *parallelism*.

so for loop indices to be added, a *separate* extension would be needed
that... added looping.

in DSP terminology this is called zero-overhead looping, and at its
most effective (TI DSPs for example) a full FFT may be achieved
*in a single instruction* with 100% utilisation, thanks to a zero
overhead loop control mechanism.

i therefore looked up and contacted someone who created something
called "ZOLC", and obtained the verilog source code. if anyone
is interested i can provide a copy on request, or put you in touch
with the author.

the author of ZOLC provided an amazing example of implementing
MPEG decode, and coming up with a whopping 38% decrease in
completion time... *with no branch prediction unit*.


> The mechanism of execution is as follows:: The VEC instruction casts a
> shadow over the instructions in the vectorized loop. Scalar registers
> only have to be fetched from the register file ONCE, Vector registers
> will be captured Once per loop,

interesting that you've noted the loop invariance of scalar registers.

where are the actual vector registers? *are* there any actual vector
registers? in SV, the scalar register file is "remapped" to
"vectorised" by being treated as if it was a contiguous memory block,
and the "scalar" register is a pointer to the starting element.

in this way, an ADD r10 r20 r30 with a VL of 3 actually becomes:
ADD r10 r20 r30
ADD r11 r21 r31
ADD r12 r22 r32

assuming that all of r10, r20 and r30 have been marked as "vectorised",
that is.


> and Writes to the Index register cause
> another LOOP to be initiated in <wherever the instructions are being
> buffered::but most likely something like reservation stations.>
>
> The LOOP instruction does not need a branch target, that was already
> supplied by the VEC instruction; all it needs is the Loop index register,
> an increment, a comparison, and a termination.

this is classic DSP-style zero-overhead looping. it's pretty effective,
and completely does away with the need for branch prediction. in
the ZOLC implementation it can even be nested, and there can be
multi-variable condition codes causing jumps *between nesting levels*...
all producing 100% full pipelines, no stalls, *no branch prediction at all*.

> Architecturally, there can be as many lanes as the designers want to
> build (NEC SX5+) with the caution that the memory system needs to scale
> proportionate to the calculation system.

absolutely. it's been well-known for many many decades that the
bottleneck is never the computational aspect of vector processors:
it's the memory bandwidth.

> If an enabled exception is raised, the vector calculations preceding the
> exception are allowed to complete, and vector data is stored in the
> scalar registers, exception control transfer is performed, and the
> exception is handled as if there were no vectorization at all.

in SV, one of the design criteria was that it had to be realistically
achievable for a single computer science student or a libre hardware
engineer to implement in a reasonably short time-frame.

when i first encountered RVV, the RISC-V Vectorisation proposal, i
was very surprised to learn of the design criteria that, firstly,
elements and the operations on them must be treated as being executed
sequentially, and secondly, that if an exception occurred and there
were elements further along (higher indices), the perfectly-good
results had to be THROWN OUT.

six months later i began to understand the value of this design decision.

the issue is that exceptions, such as virtual memory page table
misses, have to be re-entrant. anything more complex than the above
design criteria is just too complex to handle.

so the state that's stored in SV (and i believe RVV as well) has to
include the loop indices. in SV there are actually two indices (due
to twin-predication on certain operations).
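
a sketch of why there are two: on a twin-predicated operation such as
a vectorised register-to-register MV, source and destination each have
their own predicate, and the two element indices advance independently,
skipping masked-out elements (this is my reading of the spec; the names
here are illustrative):

    #include <stdint.h>

    void sv_mv_twin(uint64_t *regs, int rd, int rs,
                    uint64_t dpred, uint64_t spred, int VL)
    {
        int i = 0, j = 0;   /* src and dest indices: both live state */
        while (i < VL && j < VL) {
            while (i < VL && !((spred >> i) & 1)) i++;  /* skip src  */
            while (j < VL && !((dpred >> j) & 1)) j++;  /* skip dest */
            if (i < VL && j < VL)
                regs[rd + j++] = regs[rs + i++];
        }
    }

if an exception strikes mid-operation, both i and j have to be saved
and restored: hence two indices in the state.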

there's separate state for M-Mode, S-Mode (supervisor mode) and U-Mode,
allowing each to continue to use their own separate vectorisation
to manage lower-privilege execution levels.


> In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
> are quiet, waiting for the vector to complete (and getting FETCH out of the
> way of vectors in memory.)

interesting. what happens if the loop is say 2,000 to 4,000 instructions
in length?


> The memory system performs store-to-load forwarding analysis so that loops
> such as:
>
> void looprecurance(size_t n, double a, double x[])
> {
>     for (size_t i = 0; i < n; i++) {
>         x[i] = a*x[i-10] + x[i-7];
>     }
> }
>
> remain vectorized; one calculation simply becomes dependent on
> the 7th and 10th previous calculations through a memory system.

this is accidentally provided in SV as well :)

the CSR "register tagging" system actually allows redirection
of the registers as well. so, at the assembly level the
instruction says "ADD x3 x5 x7" however register x3 is redirected
to *actually* point at x10, x5 is redirected to point at x12,
and so on.

in this way, it should be clear that if the loop length is
say 10, then due to the hardware-macro-loop unrolling that ADD to
multiple contiguous registers, they're *going* to overlap:

ADD x3 x5 x7, VL=10, x3 redirects to x10, x5 redirects to x12:

becomes:

ADD x10 x12 x7
ADD x11 x13 x8
ADD x12 x14 x9 <- overlap already begins
ADD x13 x15 x10
...



> No special analysis is required
> in the compiler to vectorize this loop, nor is any special
> detection necessary to avoid vectorizing this loop. The basic
> tenet is that if the scalar code can
> be written in a simple loop, then one can change the top and bottom
> of the loop and presto, it is vectorized.

exactly. it's pretty elegant, isn't it?

> All of the advantages of a CRAY-like vector ISA, virtually no change to the
> actual ISA (2 instructions), and precise exceptions too.

one of the things that drives SV is the need to minimise the
amount of modifications and maintenance to the existing
RISCV toolchain (including LLVM and various front-ends such
as clang, rustlang, and so on).

by not adding any new instructions *at all* i can write unit tests
purely in *standard* RISC-V assembler. the *behaviour* is different...
and that's absolutely fine.

in the target primary application we can go straight ahead with
writing a 3D GPU driver (Kazan3D, a Vulkan LLVM software renderer),
use the *existing* scalar RISCV64 compilers, and just put in
hand-coded assembler...

...or, better yet, cheat a little by writing scalar functions,
and, before calling them (and on exit) have a bit of inline assembler
that sets up and tears down the required vectorisation :)

l.

lk...@lkcl.net

Nov 5, 2018, 1:33:40 PM

On Monday, November 5, 2018 at 5:27:45 PM UTC, MitchAlsup wrote:
> On Monday, November 5, 2018 at 10:46:27 AM UTC-6, Terje Mathisen wrote:

> > Mitch, you have written more than enough about this feature to whet my
> > appetite:
> >
> > How much hardware do you actually need in order to run this at
> > comparable speed to some Intel-specific 4-wide AVX loop code?

terje: do take a look at that "simd considered harmful" article
in the OP. it's well worth reading.

> DO 30 I = J, N, N1
> XT = RE(I) - RO(I)
> YT = IM(I) - IN(I)
> RE(I) = RE(I) + RO(I)
> IM(I) = IM(I) + IN(I)
> RO(I) = XT*C - YT*S
> IN(I) = XT*S + YT*C
> 30 CONTINUE

just looking at this, i believe that the regular nature
of the real and imaginary components here could benefit
from SV's "REMAP" functionality.

one of the requirements of the Khronos Group Vulkan API
is that matrices be a top level primitive. unfortunately,
the API hands you an array of data structures and whilst
you _could_ drop the data down to memory and back to
restructure it, we cannot do so for this particular
implementation as the power penalty is too great.

so to cater for this i added a 1D/2D/3D "remap"
system, which takes the loop index and "reshapes" it.

i *believe* it may be possible to apply a simple
re-shaping to the complex numbers, here, to treat
them as a 2D array. where one would normally see
array indices of:

0 1 2 3 4 5

instead the "reshaper" may map these to:

0 3 1 4 2 5

thus allowing, i believe, the real and imaginary parts to be
placed into the same sequential array, yet a seemingly
"sequential" set of FMUL and FADDs does the right thing.

this would, potentially and hypothetically, if i could get my
head round the algorithm properly, halve the total number
of instructions used in the loop. not quite, due to
the last two, one being an "add" and the other being
a "subtract".


> and got something like this:

> loop10:
> loop20:
> loop30:
> VEC {VSI*4,VVV*12,VSI*4,ISS}
> LOOP RI,RN1,RN,LE
> BLE Rc,loop20
> BLE loop10

so, like traditional DSP-style zero-overhead loops,
there's only one level of hardware-looping, here.

this is why i like ZOLC, because in ZOLC *all three*
loops may be expressed, to give full 100% ALU
utilisation with zero branch prediction needed.

> For full perf on this one, one would need 6 lanes of FADD, 4 lanes of FMUL,
> 4 lanes of LDD, and 4 lanes of STD. If one had that set of resources, one
> would be performing a memory to memory radix 2 FFT butterfly every cycle.

that sounds about on a par with the TI DSP series: it's a VLIW-like
architecture, where the inner loop fits into a single (14-word-long)
instruction. consequently they provide the same memory and
computational resources and achieve zero stalling.

l.

Terje Mathisen

Nov 5, 2018, 1:33:46 PM

Nice!

> SUBROUTINE FFT2( A, N )
>     COMPLEX DOUBLE PRECISION A[M]
>     DOUBLE PRECISION RE[M] = &REAL(A[0])
>     DOUBLE PRECISION IM[M] = &IMAG(A[0])
>     N2 = N
>     DO 10 K = 1, M
>         DOUBLE PRECISION RO = &RE[N2]
>         DOUBLE PRECISION IN = &IM[N2]
>         N1 = N2
>         N2 = N2/2
>         E = 6.28318/N1
>         A = 0
>         DO 20 J = 1, N2
>             C = COS (A)
>             S =-SIN (A)

Not even using SIN/COS tables? Wow! :-)
Here it looks like you need at least 4-way associativity in order to
make sure that the four source/target vectors don't thrash each other
even if the vector offsets are a multiple of the page size, but as long as
you have that you should be OK, right?

Those requirements don't seem too bad. :-)

MitchAlsup

Nov 5, 2018, 1:37:21 PM

On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> On Monday, November 5, 2018 at 4:16:32 PM UTC, MitchAlsup wrote:
>
> > Over the last year I have been working on a Virtual Vector extension to
> > My 66000 ISA. In this extension the compiler is given the illusion of
> > a register to register CRAY-like vector instructions set, but there are
> > only 2 instructions added to the ISA in support of vectors. In particular;
> > a) there are no vector registers (saving massive real estate)
>
> yes. there is a huge disadvantage of traditional vector ISAs:
> massive duplication and specialisation of the opcodes.
>
> in RVV they have had to add approximately a 25-30% duplication
> of the xBitManip extension for the sole purpose of managing predicate
> bits... *without* that hardware being actually available for the
> purposes of general-purpose bit manipulation.
>
> a combination of SV and xBitManip extensions provides near 100%
> complete coverage of RVV's vectorisation functionality.
>
> there is one perceived advantage: the vector pipeline may be entirely
> separated from the scalar pipeline. i would perceive that as a
> *dis*advantage, on the basis that it results in huge duplication of
> silicon, particularly arithmetic pipelines.

The Vector machine can be "attached at the hip" in a My 66000 implementation
or it can be microcoded to use the same calculation and agen units the
scalar machine uses. The SW model is invariant.
>
> > b) no vector length register (nor a conceptual length of the vector)
>
> that's interesting.
>
> > c) no vector control registers (no vector state, full vector perf)
>
> some "setup" is required however. the state information needs to
> be set up one way or the other. CSRs or instructions, it's one
> or the other.

The VEC instruction casts that information over the loop.
>
> > d) no vector condition codes (scalar predication covers this)
>
> yes it does. RISC-V scalar operations do not have the concept
> of predication.

Bad Move, By the way.

> therefore, SV has an additional CSR "predication"
> table that allows the *use* of a given (ordinarily scalar) register
> to indicate "hey this operation is now predicated, because you
> used register x5, and the CSR lookup table says that register x3's
> bits are to be used as the predicate".
>
> in this way there was absolutely no need to modify the RISC-V
> scalar ISA *at all*.

But if the RV scalar code had predication, there would have been no need
to do this and only make it available for the vector stuff. Thus by not
adding instructions, you add instructions--become less expressive in the
process.
Marking the loop indexes enables the reservation stations to increment
their counters and wait for the next operand to arrive. That is, when
issuing instructions into the RS, those entries with I watch for the
execution of the LOOP instruction, and when it fires, they increment
and wait for the operands of the i^th loop to arrive.
>
> so for loop indices to be added, a *separate* extension would be needed
> that... added looping.
>
> in DSP terminology this is called zero-overhead looping, and at its
> most effective (TI DSPs for example) a full FFT may be achieved
> *in a single instruction* with 100% utilisation, thanks to a zero
> overhead loop control mechanism.
>
> i therefore looked up and contacted someone who created something
> called "ZOLC", and obtained the verilog source code. if anyone
> is interested i can provide a copy on request, or put you in touch
> with the author.

Could you send me a copy?
>
> the author of ZOLC provided an amazing example of implementing
> MPEG decode, and coming up with a whopping 38% decrease in
> completion time... *with no branch prediction unit*.
>
>
> > The mechanism of execution is as follows:: The VEC instruction casts a
> > shadow over the instructions in the vectorized loop. Scalar registers
> > only have to be fetched from the register file ONCE, Vector registers
> > will be captured Once per loop,
>
> interesting that you've noted the loop invariance of scalar registers.

Power savings and you don't have to make the reservation stations count.
>
> where are the actual vector registers? *are* there any actual vector
> registers? in SV, the scalar register file is "remapped" to
> "vectorised" by being treated as if it was a contiguous memory block,
> and the "scalar" register is a pointer to the starting element.

The vector registers are, in effect, in the reservation station operand
containers, reutilized as calculations get done.
>
> in this way, an ADD r10 r20 r30 with a VL of 3 actually becomes:
> ADD r10 r20 r30
> ADD r11 r21 r31
> ADD r12 r22 r32
>
> assuming that all of r10, r20 and r30 have been marked as "vectorised",
> that is.

Same principle, R7[k], except k is held in the stations and has no
multiported file containing the bits.
>
>
> > and Writes to the Index register cause
> > another LOOP to be initiated in <wherever the instructions are being
> > buffered::but most likely something like reservation stations.>
> >
> > The LOOP instruction does not need a branch target, that was already
> > supplied by the VEC instruction; all it needs is the Loop index register,
> > an increment, a comparison, and a termination.
>
> this is classic DSP-style zero-overhead looping. it's pretty effective,
> and completely does away with the need for branch prediction. in
> the ZOLC implementation it can even be nested, and there can be
> multi-variable condition codes causing jumps *between nesting levels*...
> all producing 100% full pipelines, no stalls, *no branch prediction at all*.

It also gets rid of overcommit of the loop--that is, issuing the instructions
in the loop a few more times than the loop is run, so there is less to back up
when the branch (n.e., LOOP) mispredicts--saving power, again.
>
> > Architecturally, there can be as many lanes as the designers want to
> > build (NEC SX5+) with the caution that the memory system needs to scale
> > proportionate to the calculation system.
>
> absolutely. it's been well-known for many many decades that the
> bottleneck is never the computational aspect of vector processors:
> it's the memory bandwidth.
>
> > If an enabled exception is raised, the vector calculations preceding the
> > exception are allowed to complete, and vector data is stored in the
> > scalar registers, exception control transfer is performed, and the
> > exception is handled as if there were no vectorization at all.
>
> in SV, one of the design criteria was that it had to be realistically
> achievable for a single computer science student or a libre hardware
> engineer to implement in a reasonably short time-frame.
>
> when i first encountered RVV, the RISC-V Vectorisation proposal, i
> was very surprised to learn of the design criteria that, firstly,
> elements and the operations on them must be treated as being executed
> sequentially, and secondly, that if an exception occurred and there
> were elements further along (higher indices), the perfectly-good
> results had to be THROWN OUT.

Just the breaks of the game.
>
> six months later i began to understand the value of this design decision.
>
> the issue is that exceptions, such as virtual memory page table
> misses, have to be re-entrant. anything more complex than the above
> design criteria is just too complex to handle.
>
> so the state that's stored in SV (and i believe RVV as well) has to
> include the loop indices. in SV there are actually two indices (due
> to twin-predication on certain operations).
>
> there's separate state for M-Mode, S-Mode (supervisor mode) and U-Mode,
> allowing each to continue to use their own separate vectorisation
> to manage lower-privilege execution levels.

I should note there is no privilege in My 66000 architecture. No privileged
instructions, no privileged state, ... none.
>
>
> > In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
> > are quiet, waiting for the vector to complete (and getting FETCH out of the
> > way of vectors in memory.)
>
> interesting. what happens if the loop is say 2,000 to 4,000 instructions
> in length?

In the absence of exceptions or interrupts, the loop runs to completion and the
FETCH/DECODE/ISSUE units restart when the LOOP is done.
>
>
> > The memory system performs store-to-load forwarding analysis so that loops
> > such as:
> >
> > void looprecurance(size_t n, double a, double x[])
> > {
> >     for (size_t i = 0; i < n; i++) {
> >         x[i] = a*x[i-10] + x[i-7];
> >     }
> > }
> >
> > remain vectorized; one calculation simply becomes dependent on
> > the 7th and 10th previous calculations through a memory system.
>
> this is accidentally provided in SV as well :)

It better NOT be an accident!
>
> the CSR "register tagging" system actually allows redirection
> of the registers as well. so, at the assembly level the
> instruction says "ADD x3 x5 x7" however register x3 is redirected
> to *actually* point at x10, x5 is redirected to point at x12,
> and so on.
>
> in this way, it should be clear that if the loop length is
> say 10, then due to the hardware-macro-loop unrolling that ADD to
> multiple contiguous registers, they're *going* to overlap:
>
> ADD x3 x5 x7, VL=10, x3 redirects to x10, x5 redirects to x12:
>
> becomes:
>
> ADD x10 x12 x7
> ADD x11 x13 x8
> ADD x12 x14 x9 <- overlap already begins
> ADD x13 x15 x10
> ...

I just made the stations do this for me.
>
>
>
> > No special analysis is required
> > in the compiler to vectorize this loop, nor is any special
> > detection necessary to avoid vectorizing this loop. The basic
> > tenet is that if the scalar code can
> > be written in a simple loop, then one can change the top and bottom
> > of the loop and presto, it is vectorized.
>
> exactly. it's pretty elegant, isn't it?
>
> > All of the advantages of a CRAY-like vector ISA, virtually no change to the
> > actual ISA (2 instructions), and precise exceptions too.
>
> one of the things that drives SV is the need to minimise the
> amount of modifications and maintenance to the existing
> RISCV toolchain (including LLVM and various front-ends such
> as clang, rustlang, and so on).

Not my problem. But I think RV has lost a few brownie points for not getting
the base ISA up to industrial quality levels first. Then you would not be
having the issues you are having adding vectors, predication,...
>
> by not adding any new instructions *at all* i can write unit tests
> purely in *standard* RISC-V assembler. the *behaviour* is different...
> and that's absolutely fine.

Note: My 66000 vector addition was exactly 2 instructions and ZERO thread
state (ZERO!) Thus the context switch part of the OS does not care if
vectors were used or not--vectors are INVISIBLE to the OS parts of the
hardware.

{Also note: My 66000 performs context switch in hardware. A thread is
enabled (by the OS) and a processor picks it up by fetching 5 cache lines.
By the time these lines arrive, the register state and the thread header
state have arrived and been installed. So we have an IP, Root pointer,...
already loaded, ready to perform the instruction fetched at memory[IP++].
So too when the thread is disabled, the HW shoves the modified state back
whence it came. So the HW actually does the context switching...and in less
time than SW could conceive of doing it.}

MitchAlsup

Nov 5, 2018, 1:39:24 PM

Except that it is a straight vector implementation.

MitchAlsup

Nov 5, 2018, 1:41:57 PM

At 19 cycles for SIN or COS, these in HW MAY be faster than in memory (L2 hit).

No, I can use the L1 miss buffers directly for temporary vector storage
and avoid thrashing the L1 caches when processing vectors.

lk...@lkcl.net

Nov 5, 2018, 1:54:52 PM

On Monday, November 5, 2018 at 6:39:24 PM UTC, MitchAlsup wrote:

> > that sounds about on a par with the TI DSP series: it's a VLIW-like
> > architecture, where the inner loop fits into a single (14-word-long)
> > instruction. consequently they provide the same memory and
> > computational resources and achieve zero stalling.
> >
> > l.
>
> Except that it is a straight vector implementation.

yeah, which is an aspect that i like about vectorisation.
i haven't used TI's DSPs since 1992: at the time, the
compilers were pretty poor. to get full utilisation
they had to be programmed in hand-coded assembler.
maybe their proprietary compiler toolchain has improved,
it makes no odds: it's such a specialist embedded engine.
and also, TI's top DSPs are classified under US BXPA as a
"weapons-grade munition" *sigh*...

l.

MitchAlsup

Nov 5, 2018, 3:36:53 PM

On Monday, November 5, 2018 at 10:16:32 AM UTC-6, MitchAlsup wrote:
> On Monday, November 5, 2018 at 5:54:59 AM UTC-6, lk...@lkcl.net wrote:
> > over the past several months i've been developing an abstraction layer
> > for vectorisation and parallelisation of RISC-V. i've never encountered
> > anything remotely similar in any commercial processor design, hence
> > why it could seriously benefit from a review.
>
> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)

Consider the CRAY-1 with 8*64-entry-64-bit vector registers (4096 bytes
of 5 ported storage). The vector register file is larger than the area
of all the function units connected to the 512 64-bit registers in the
VRF.

Now consider a machine with reservation stations in front of each function
unit. As long as the RS storage is deeper than the pipeline depth of the
function unit, one can use the RS operand capture mechanism to stand in
the place of the vector registers and associated forwarding/chaining.

Getting rid of the VRF was simply an enabling device to slip the vector
extension into My 66000, and once thought through, it works perfectly
and seamlessly.

> b) no vector length register (nor a conceptual length of the vector)
> c) no vector control registers (no vector state, full vector perf)
> d) no vector condition codes (scalar predication covers this)

Getting rid of this stuff causes one to design the vector facility as if
one were building a LOOP around scalar code--which one actually is.

void conditional(size_t n, double a, const double x[], double y[])
{
    for (size_t i = 0; i < n; i++) {
        if( x[i] > 0.0 )
            y[i] = a*x[i] + y[i];
        else
            y[i] = x[i] + a*y[i];
    }
}

could be compiled as:

conditional:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   Rt,exit
top: VEC {VSI*2,V,VSVV,VSI,ISS}
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    PGT   Rx,TF           // predicate on condition
    FMAC  Ry,Ra,Rx,Ry     // executed if Rx > 0
    FMAC  Ry,Rx,Ry,Rx     // executed otherwise (inc. NaNs)
    STD   Ry,[Ryb+Ri<<3]
    LOOP  Ri,1,Rn
exit:
    JMP   R0

Note: VERY IMPORTANT! Individual instructions do NOT waste bits supporting
predication! These bits are provided by predicate instructions (which are
degenerate branches in effect).

The PREDICATE instruction casts predicates over the subsequent
instructions--in this case the first instruction is in the THEN
clause (T) while the next instruction is in the ELSE (F) clause.
One or the other is issued; the instruction buffering mechanism
(reservation stations) can figure out whether to fire or not,
eliminating the need to nullify a calculation that is not needed.

In the My 66000 ISA predicates can be combined to perform && and ||
branches as well as nested if-then-else stuff.

Bruce Hoult

Nov 5, 2018, 7:14:21 PM

On Monday, November 5, 2018 at 10:37:21 AM UTC-8, MitchAlsup wrote:
> On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> > one of the things that drives SV is the need to minimise the
> > amount of modifications and maintenance to the existing
> > RISCV toolchain (including LLVM and various front-ends such
> > as clang, rustlang, and so on).
>
> Not my problem. But I think RV has lost a few brownie points for not getting
> the base ISA up to industrial quality levels first. Then you would not be
> having the issues you are having adding vectors, predication,...

It's the classic bootstrapping problem with limited resources, isn't it?

What is the Minimum Viable Product? Is it better to ship first hardware in 2016 and hardware with vector support in 2019, or is it better to ship a finished (ha ha) thing in 2019?

How will you support yourself financially for those extra three years? How will you get buzz going so that well before the hardware with vector support ships you're already a 1st class citizen in binutils, gcc, glibc, newlib, gdb, llvm (March 2019 release), already have multiple Linux distributions with most packages ported, and BSDs, and several RTOSes.

I tend to the view that it's better to ship and generate buzz as soon as you have a useful product. SiFive has *plenty* of customers who want the smallest and lowest power 32 bit chip possible with just the basic integer instructions (not even multiply/divide), two stage pipeline, no cache, no branch prediction. And there are a lot of customers chomping at the bit for the 64 bit dual-issue in order processor (competitive with ARM A55 and A53) announced on October 31. There's no NEON equivalent and no GPU but many applications don't need either of those.

Of course, having those later will open up a whole new range of applications and customers. But, in the meantime, *revenue* and buzz and software support.


> > by not adding any new instructions *at all* i can write unit tests
> > purely in *standard* RISC-V assembler. the *behaviour* is different...
> > and that's absolutely fine.
>
> Note: My 66000 vector addition was exactly 2 instructions and ZERO thread
> state (ZERO!) Thus the context switch part of the OS does not care if
> vectors were used or not--vectors are INVISIBLE to the OS parts of the
> hardware.

For sure it's very nifty and worth studying. I think there's a lot to be said for a pure data streaming model without huge vector registers, especially on bigger designs that have OoO resources already.

One of the design goals of the official RISC-V Vector ISA is that it is microarchitecture-agnostic. You can implement it with basically a traditional SIMD hardware approach and the software doesn't notice, and runs fine. You can implement it with a basically streaming OoO data flow approach and the software doesn't notice, and runs fine. The streaming approach doesn't *need* the SETVL instruction and mechanism, but it also doesn't hurt it -- the hardware could just report that it supports infinitely long (or at least very long) vectors without actually having backing store for that. On the other hand, a traditional SIMD micro-architecture can just always report the same vector length. It *does* have to support some way to mask off the extra lanes in the vector tail, but that's pretty trivial.
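
The classic strip-mined loop from the "SIMD considered harmful" article (linked at the top of the thread) shows why that works. Modelling SETVL as a plain C function here, with an illustrative VLMAX:

    #include <stddef.h>

    enum { VLMAX = 8 };   /* whatever the implementation reports */

    /* stand-in for SETVL: how many elements this pass will process */
    static size_t setvl(size_t n) { return n < VLMAX ? n : VLMAX; }

    void daxpy_stripmine(size_t n, double a, const double x[], double y[])
    {
        while (n > 0) {
            size_t vl = setvl(n);  /* hardware chooses the element count */
            for (size_t i = 0; i < vl; i++)  /* stands in for vector ops */
                y[i] = a * x[i] + y[i];
            x += vl; y += vl; n -= vl;
        }
    }

A streaming design can return a huge vl, a SIMD-style design can always return its lane count, and the same source code is correct either way.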

Bruce Hoult

Nov 5, 2018, 7:29:06 PM

On Monday, November 5, 2018 at 12:36:53 PM UTC-8, MitchAlsup wrote:
> Note: VERY IMPORTANT! Individual instructions do NOT waste bits supporting
> predication! These bits are provided by predicate instructions (which are
> degenerate branches in effect).

I agree this is a good thing to do.

I worry that it is probably protected by patents. ARM does it with the "IT" instruction in Thumb2 (2003) which casts Then or Else predicates over up to the following four instructions. Does anyone else do it?

Also, ARM has backed away from actually using it. In the latest ARMv8 version all uses of IT are deprecated except those which control a single instruction, and that single instruction must be a 16 bit instruction.

In other words, in future they will be supporting only an IT and a following instruction that together can be treated as a pseudo-32 bit instruction.

jim.bra...@ieee.org

Nov 5, 2018, 8:20:58 PM

Perhaps off topic

Would like to introduce an alternative view:
Reserve n registers in the register file as a "barrel"
Vector results go into the barrel one after another.
"n" is large enough to hold all the intermediate results for one loop iteration.

So announcing the barrel with a special instruction starts the loop
And a repeat instruction is at end of loop.

The intermediate results don't actually have to reside in the barrel, instead, say, in function unit output shift registers as in the Mill. Or the barrel is just a compiler sleight-of-hand and an exception mechanism.

Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time. The problem of partially filled registers at the end of the loop can be addressed generically?

MitchAlsup

Nov 5, 2018, 8:34:53 PM

In effect, this is what my virtual vector registers are, except I cast them
as operand storage associated with reservation stations. Each operand in
an RS maintains a loop counter so it can wait for more than one loop result,
one per loop.

I wanted the mechanism to smell just like CRAY-style vectors for ease of
compiler support, but I did not want to pay for the VRF storage area
or power.
>
> Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time.

> The problem of partially filled registers at the end of the loop can be addressed generically?

But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
processed. The MMX model only works when the data is packed in memory, and
only when all sub-locations want to be processed similarly (with a couple of
exceptions.)

Stephen Fuld

Nov 5, 2018, 9:00:16 PM

This has the advantage of providing a place to hold "partial values" for
loops that e.g. sum the elements of a vector, do a dot product,or even
compute a checksum.

But it has the significant disadvantage of losing one of the nicest
features of Mitch's design: the code is identical independent of the
number of functional units, i.e., there is no need to specify the number
of registers to use for the "barrel", and it even potentially allows for
an implementation with more functional units than registers.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

jim.bra...@ieee.org

Nov 5, 2018, 9:50:10 PM

Need to plead inexperience WRT gate counts for all the different approaches.
Associating a loop count with each operand solves the alignment over multiple loops. And Mitch's design has a lower register count than CRAY-1.

When I see input reservation stations I think data flow. And wonder how it gets setup and torn down? Presumably no branches outside of the loop nor any subroutine calls within the loop?

Am thinking any of the approaches needs to be opportunistic (grab low
hanging fruit) when short operands are adjacent in memory.

IMHO it becomes a question of where to place the "crossbar" and whether it can be time multiplexed to reduce the number of write ports?

MitchAlsup

Nov 5, 2018, 10:06:15 PM

It has the same register count as a scalar design, but with vector perf.

Literally, there was no increase in thread state due to having vector
capabilities, and as far as ISA goes, there are exactly 2 instructions
added to the ISA.

>
> When I see input reservation stations I think data flow. And wonder how it gets setup and torn down? Presumably no branches outside of the loop nor any subroutine calls within the loop?

Reservation stations get setup as instructions are issued on the first pass.
Each operand in each station entry is marked Scalar, Vector, or Indexed (by
the loop count).
Scalar operands are read from the RF just like scalar instructions.
Scalar operands are captured once and remain resident over multiple firings.
As vector operands arrive, the RS sees that it has all the needed operands
and fires the instruction off to the function unit. Vector operands are
used once per loop per station entry.
Index registers run the loop overhead, and each time they calculate, their
result enables the next loop for the station vector operand capture and fire.

When the LOOP is satisfied, the stations self clear as if they were scalar
entries and the loop is done.

As far as scope, I would like to be able to vectorize great big loops
such as FPPPP, but that is simply impossible; as the loop is 8 KBytes
of instructions long (2K instructions) and bigger than any conceivable
set of RS entries.

Looking smaller, I demanded of the mechanism that it handle radix 2 FFT
(20 instructions plus loop overhead.) So, somewhere in the 16-64
instruction range is the scale of loop this vectorizing mechanism is
optimized for.

lk...@lkcl.net

Nov 5, 2018, 10:42:34 PM

On Tuesday, November 6, 2018 at 3:06:15 AM UTC, MitchAlsup wrote:

> As far as scope, I would like to be able to vectorize great big loops
> such as FPPPP, but that is simply impossible; as the loop is 8 KBytes
> of instructions long (2K instructions) and bigger than any conceivable
> set of RS entries.

i love the idea of vectorisation being opaque: i am nervous about
the impact the idea has both on microarchitectures (restricting
implementors from using certain microarchitectures as the base)
and on its understandability.

> Looking smaller, I demanded of the mechanism that it handle radix 2 FFT
> (20 instructions plus loop overhead.) So, somewhere in the 16-64
> instruction range is the scale of loop this vectorizing mechanism is
> optimized for.

ok, so this would be where RVV and SV have a distinct advantage, as:

* in RVV the vector register file is separate and therefore
may be passed as parameters to functions at an unrestricted
stack depth

* in SV, registers may be "tagged" and left that way, at the
programmer's discretion, such that the same effect is
achieved.

so both have no such limit on the number of instructions that
may be executed using vectorised registers, nor on the stack
depth.

this does mean that it is necessary to save/restore registers
and associated state on entry/exit to function calls (and
on context-switches). this being nothing new and done all
the time in scalar programs, the only question being how
much in terms of power/computing-time resources it takes to
do so.

l.

lk...@lkcl.net

Nov 5, 2018, 10:50:21 PM

On Tuesday, November 6, 2018 at 2:50:10 AM UTC, jim.bra...@ieee.org wrote:

> When I see input reservation stations I think data flow.
> And wonder how it gets setup and torn down? Presumably no
> branches outside of the loop nor any subroutine calls within
> the loop?

no, that's correct, on both counts.

(1) standard simple zero-overhead loops just cannot cope with
anything other than the most basic of for-loops. they're
designed specifically for algorithms like FFTs with fixed
loops. this is where things like ZOLC come in:

https://arxiv.org/pdf/0710.4632

(2) mitch answers in a follow-up about the limitations on
using RS. it's a leaf-node design concept, where SV
(and RVV) are unlimited length and depth on function
calls.

l.

lk...@lkcl.net

Nov 5, 2018, 11:11:57 PM

On Tuesday, November 6, 2018 at 1:20:58 AM UTC, jim.bra...@ieee.org wrote:

> Perhaps off topic

not at all: this is a thread for reviewing SimpleV

so as such, comparative and alternative vectorisation and
parallelism concepts are all up for consideration, so as to be
able to see what what would work.

> Would like to introduce an alternative view:
> Reserve n registers in the register file as a "barrel"
> Vector results go into the barrel one after another.

this is pretty much exactly what SV is based on. i did not like
the idea of having a separate register file as in a traditional
CRAY-style vector architecture, so i went, "ok, so let's see
what kind of mileage we can get out of using the *standard*
integer and floating-point register files".

> "n" is large enough to hold all the intermediate results for
> one loop iteration.

the next logical conceptual progression is to just think of
the loop as just "using the standard register file(s)".

> So announcing the barrel with a special instruction starts the loop
> And a repeat instruction is at end of loop.

why not just have a standard scalar zero-overhead loop mechanism,
which, when you happen to introduce vectorisation, may be used
there as well?

in that way, the zero-overhead loop mechanism may be used to
optimise scalar code as well?

the only tricky bit is that i believe the loop setup and usage
does have to be slightly different for vector and scalar: i
haven't investigated this fully as i have been focussing on
the parallelism.

> Makes it easier for my mind to comprehend what is going on:
> Every result has a place.

yes. using hidden machine state has me concerned for this
reason. if it's *actually* registers - real registers -
microarchitectural implementations can get a concrete handle
on where the data resides, and may decide to drop in OoO,
superscaling, register renaming, lanes, lane-crossing
crossbars, and, my favourite i'd really really like to
implement: L1/L2 SMP style coherent caches on the register
file.

> Use of wider registers and multi-media instructions
> is just optimization of resources done either at compile
> time or run time.

mmm be veerry waary of the seductive powerrrr of SIMDeeee...
https://www.youtube.com/watch?v=fZY8jUuEzJQ

SIMD i view as a de-optimisation:
https://www.sigarch.org/simd-instructions-considered-harmful/

the loops are less efficient (far more instructions than
variable-length vectorisation), and the corner-cases are
absolutely dreadful: often longer than the actual loop,
and requiring a power-of-two cascade of decreasing-size
SIMD instructions plus branches or loops to deal with.
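
a self-contained sketch of that corner-case cascade, with the 4-wide
operation written in plain C purely to show the shape (real SIMD code
would use intrinsics, and typically be worse):

    #include <stddef.h>

    static void add4(double *y, const double *x)
    { for (int i = 0; i < 4; i++) y[i] += x[i]; }

    static void add2(double *y, const double *x)
    { y[0] += x[0]; y[1] += x[1]; }

    void add_simd(size_t n, const double x[], double y[])
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) add4(&y[i], &x[i]);  /* main loop   */
        if (i + 2 <= n) { add2(&y[i], &x[i]); i += 2; } /* 2-wide tail */
        if (i < n) y[i] += x[i];                        /* scalar tail */
    }

with variable-length vectorisation the entire tail cascade disappears.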

> The problem of partially filled registers at the end of
> the loop can be addressed generically?

in SV and RVV (and vectorisation systems in general),
i believe the answer is yes.

i have not encountered however any variable-length vectorisation
systems that also have zero-overhead loops, although i am
seriously considering dropping a ZOLC implementation into
the Libre RISCV GPGPU.

l.

Bruce Hoult

Nov 5, 2018, 11:21:08 PM

On Monday, November 5, 2018 at 7:42:34 PM UTC-8, lk...@lkcl.net wrote:
> ok, so this would be where RVV and SV have a distinct advantage, as:
>
> * in RVV the vector register file is separate and therefore
> may be passed as parameters to functions at an unrestricted
> stack depth

I don't believe it's a useful or desirable thing in RVV to call functions that execute vector instructions while in the middle of a vector processing loop. Any given called function can only usefully work with one configuration (out of the thousand or so possible) of the vector registers. If the called function configures the vector registers then all current contents are lost. Saving and restoring that state is a ridiculous thing to do -- it's potentially kilobytes.

There may be certain special and specialised runtime functions that can be called from inside a vector processing loop. Transcendental functions, for example. They'll be hand-written by specialists.

But in general, any vector processing function that you want to call from inside a vector processing loop should be inlined.

There will, I believe, be no such thing in the ABI as function arguments or results with vector-register values.

There can of course be user-level vector types or classes that are argument to functions. That's a totally different matter.

lk...@lkcl.net

Nov 5, 2018, 11:52:00 PM

On Tuesday, November 6, 2018 at 4:21:08 AM UTC, Bruce Hoult wrote:
> On Monday, November 5, 2018 at 7:42:34 PM UTC-8, lk...@lkcl.net wrote:
> > ok, so this would be where RVV and SV have a distinct advantage, as:
> >
> > * in RVV the vector register file is separate and therefore
> > may be passed as parameters to functions at an unrestricted
> > stack depth
>
> I don't believe it's a useful or desirable thing in RVV to
> call functions that execute vector instructions while in the
> middle of a vector processing loop.

i am beginning to understand that RVV may not have been designed
with such a goal in mind.

> Any given called function can only usefully work with one
> configuration (out of the thousand or so possible) of the
> vector registers. If the called function configures the vector
> registers then all current contents are lost.

this is why, in SV, i provide a mechanism to "subset" the
vector CSR tables. with up to 16 entries in the table,
it would become possible to treat the table as a mini
stack. only when the "stack" depth was to be exceeded
would the state need to be switched out to a real
memory-based stack.

oo i just had an idea: when the CSR bank address is
altered to subset the CSRs, move the meaning of
CSR0 to point to the new "start" of the bank.

that way (until you run out of "stack", that is),
you get the benefits of a real stack: the callee
doesn't need to know anything about the function
that called it.

hmmm...

> Saving and restoring that state is a ridiculous thing to do
> -- it's potentially kilobytes.

in SV it's a maximum of, i think, 76 bytes for the
CSR state. that would require all 16 entries to have been used
up.

vectorised-registers may also be passed as parameters to function
calls: in SV, all that means is: you're passing a contiguous
*block* of registers to a function, you just happen to be
using one single (scalar) register to do so.
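
purely as an illustration (register numbers invented), the
redirection could be modelled as:

regfile = [0] * 128
tags = {5: (64, 4)}        # x5 tagged: redirected to x64..x67, VL=4

def operand(r):
    if r in tags:          # "vector": a contiguous block of registers...
        start, vl = tags[r]
        return regfile[start:start + vl]
    return [regfile[r]]    # ...named by one single scalar register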

hmmm... i believe that if i also add in a means to shuffle
(move) the CSRs up and down the table, the amount of state
required to be saved/restored on each function call entry
and exit is greatly reduced.


> There may be certain special and specialised runtime functions that can be called from inside a vector processing loop. Transcendental functions, for example. They'll be hand-written by specialists.
>
> But in general, any vector processing function that you want to call from inside a vector processing loop should be inlined.
>
> There will, I believe, be no such thing in the ABI as function arguments or results with vector-register values.

all of these are based, i believe, on deliberate conscious design
choices to limit RVV's scope.

whilst it may take considerable time - potentially years and *definitely*
man-years - to get the infrastructure into place in compiler toolchains,
i'm planning for SV to make nested function calls of arbitrary depth
perfectly reasonable: leveraging *existing* scalar compiler conventions
for temporary, caller and callee register names does not seem
unreasonable, as long as there is a way to push/pop the vector config
state.

of course, for the 3D application that's likely to go completely out
the window and require custom assembly, for many years, if not permanently.

i still can't quite get over the fact that even for a
lowly "embedded" 3D GPU which is only going to be doing around 5GFLOPS,
we had to extend the FP and INT regfiles to a whopping 128 64-bit entries
in order to avoid the [extremely significant] power penalty of pushing
large amounts of vector data through the L1/L2 cache.

l.

Ivan Godard

unread,
Nov 6, 2018, 12:04:01 AM11/6/18
to
On 11/5/2018 5:20 PM, jim.bra...@ieee.org wrote:

> Perhaps off topic
>
> Would like to introduce an alternative view:
> Reserve n registers in the register file as a "barrel"
> Vector results go into the barrel one after another.
> "n" is large enough to hold all the intermediate results for one loop iteration.
>
> So announcing the barrel with a special instruction starts the loop
> And a repeat instruction is at end of loop.
>
> The intermediate results don't actually have to reside in the barrel; they could instead live, say, in function unit output shift registers as in the Mill. Or the barrel is just a compiler sleight-of-hand and an exception mechanism.
>
> Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time. The problem of partially filled registers at the end of the loop can be addressed generically?
>

Didn't the Itanium do this? Of course, it's off-patent now.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:08:33 AM11/6/18
to
On Tuesday, November 6, 2018 at 1:34:53 AM UTC, MitchAlsup wrote:

> > The problem of partially filled registers at the end of the loop can be addressed generically?
>
> But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
> processed. The MMX model only works when the data is packed in memory, and
> only when all sub-locations want to be processed similarly (with a couple of
> exceptions.)

i took one look at variable-length vectors and went, "*this* makes sense".
SIMD is genuinely insane. it's an O(N^6) - N to the power of SIX - opcode
proliferation! no wonder x86 and ARC have several thousand instructions!

RVV is O(N) opcode proliferation. My 66000 is O(1), and SV is also O(1).

the nice thing about SV is that the implementation is a bit like
"pimp my ride" (i've just discovered Fast'n'Loud, sorry...) in that
you simply decouple the decode and execution phases, and drop in a
hardware-macro-loop-unroller algorithm. here's the pseudocode for
ADD (yes, it covers scalar as well as vector):

 for (i = 0; i < VL; i++)
if (predval & 1<<i) # predication uses intregs
   ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
ireg[rs2+remap(irs2)];
if (!int_vec[rd ].isvector) break;
if (int_vec[rd ].isvector)  { id += 1; }
if (int_vec[rs1].isvector)  { irs1 += 1; }
if (int_vec[rs2].isvector)  { irs2 += 1; }

note that the loop is terminated early if the destination
is a scalar, but *only* on the first bit set in the
predication.

this allows V/S + S/V -> S combinations, all
covered by the same pipeline stage.

it also allows even fully scalar operations to be
predicated.
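
here is the same macro-loop as runnable python (my own model of the
pseudocode above, with REMAP omitted for brevity):

def sv_add(ireg, isvec, rd, rs1, rs2, VL, predval):
    id = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):       # predication uses intregs
            ireg[rd + id] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
            if not isvec[rd]:        # scalar destination: terminate on
                break                # the first predicated-in element
        if isvec[rd]:  id   += 1
        if isvec[rs1]: irs1 += 1
        if isvec[rs2]: irs2 += 1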

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:13:16 AM11/6/18
to
On Tuesday, November 6, 2018 at 4:21:08 AM UTC, Bruce Hoult wrote:

> But in general, any vector processing function that you want
> to call from inside a vector processing loop should be inlined.

i forgot to mention that SV is designed to augment scalar
operation as much as it is vector operation. it extends
the reach of the register file - *without modifying the
ISA* - to up to 128 integer and 128 floating-point entries
*including* for Compressed instructions, *and* allows
predication on standard *scalar* operations as well.

there's even mention somewhere in the standard RV spec
of the idea of future extensions augmenting the
register file from the standard 32 entries.

so for SV, the idea of being able to do nested arbitrary
function call depth, and still carry even just the
*scalar* register-redirection capability and the
*scalar* predication capability, makes an awful lot
of sense.

l.

James Van Buskirk

unread,
Nov 6, 2018, 12:17:11 AM11/6/18
to
wrote in message
news:a3bc5db9-ad53-4440...@googlegroups.com...

> i *believe* it may be possible to apply a simple
> re-shaping to the complex numbers, here, to treat
> them as a 2D array. where one would normally see
> array indices of:

> 0 1 2 3 4 5

> instead the "reshaper" may map these to:

> 0 3 1 4 2 5

> thus allowing, i believe, the real and imaginary parts to be
> placed into the same sequential array, yet a seemingly
> "sequential" set of FMUL and FADDs does the right thing.

Remapping the inputs is simply wrongheaded for an FFT.
The way to go is to carry out real-half-complex FFTs
separately on the real and imaginary parts of the input
and then combine the results in a final O(N) 'zip-up' step.

Since you are doing the same stuff to the real and imaginary
parts and no complex arithmetic, you can do whatever
equivalent you have of VBROADCAST to the constants
and just perform identical operations on adjacent data
up to the zip-up.

Besides, the radix-2 butterfly algorithm is very inefficient
and a crappy test case. If you wanna make FFTs more
performant, predict triangular loops in your hardware and
find some place in your ISA to store all the fanout while your
algorithm covers floating point latency.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:51:35 AM11/6/18
to
On Tuesday, November 6, 2018 at 5:17:11 AM UTC, James Van Buskirk wrote:
> wrote in message

> > thus allowing, i believe, the real and imaginary parts to be
> > placed into the same sequential array, yet a seemingly
> > "sequential" set of FMUL and FADDs does the right thing.
>
> Remapping the inputs is simply wrongheaded for an FFT.

thanks for the insight, james. i'm definitely not
an algorithms expert: i have to use comprehensive
unit tests, and even then, asking me to explain the
resultant code that *i wrote* is... yeah :)

primarily i wanted to introduce the REMAP concept that
needed to be introduced in SV, for optimal efficiency
when used as a 3D GPU.

the Vulkan API requires 4x3 matrices of parameters
(quantity several) to be processed alongside other
data, including carrying out matrix multiplication.

without the REMAP concept the entire set of values
would need to be pushed out through the L1 (and
possibly L2) caches and back again: the power consumption
to do so is unacceptable.

by providing a way to remap the linear elements in
an arbitrary dynamic fashion, we may leave the matrix
in-place and still do the required "transposition" of
Y and X which are necessary for matrix multiplication.
the general algorithm (in python) is as follows:

(xdim, ydim, zdim) = (3, 2, 1)
lims = [xdim, ydim, zdim]
idxs = [0, 0, 0]
order = [1, 0, 2]

for idx in range(xdim * ydim * zdim):
    new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
    print(new_idx, end=' ')
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if idxs[order[i]] != lims[order[i]]:
            break
        print()
        idxs[order[i]] = 0

Vulkan not only requires matrix multiplication, it also
requires processing of multiple simultaneous X,Y,Z vectors:
QTY 4 at a time. in certain circumstances it is necessary
to access and process all of the Xs, all of the Ys, all of
the Zs.

at other times, in the same loop, it is necessary to process
the 4 vectors as X,Y,Z. in particular, for normalisation of
vectors, you want to know the length, which is:

sqrt(X^2+Y^2+Z^2)

here, again, we can avoid disrupting the data structures
by using REMAP of the contiguous data.
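
to make that concrete, here is the earlier loop wrapped up as a
function (my own illustrative wrapper, not part of the spec), applied
to four xyz vectors stored contiguously:

def remap_indices(xdim, ydim, zdim, order):
    # identical algorithm to the earlier snippet, returning the sequence
    lims = [xdim, ydim, zdim]
    idxs = [0, 0, 0]
    seq = []
    for _ in range(xdim * ydim * zdim):
        seq.append(idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
        for i in range(3):
            idxs[order[i]] += 1
            if idxs[order[i]] != lims[order[i]]:
                break
            idxs[order[i]] = 0
    return seq

# four xyz vectors stored as x0 y0 z0 x1 y1 z1 ...:
# order=[1,0,2] walks all the Xs, then all the Ys, then all the Zs
print(remap_indices(3, 4, 1, [1, 0, 2]))
# -> [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]: a transpose, no data moved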

it's really powerful and absolutely necessary, if we want
to meet the power budget. we simply cannot move the
data out to memory and back again: Jeff Bush, of Nyuzi,
highlighted that moving data through the L1/L2 cache is
a whopping 50% of the power budget in software-rendered
GPUs.

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 1:02:23 AM11/6/18
to
On Monday, November 5, 2018 at 8:36:53 PM UTC, MitchAlsup wrote:

> Consider the CRAY-1 with 8*64-entry-64-bit vector registers (4096 bytes
> of 5 ported storage). The vector register file is larger than the area
> of all the function units connected to the 512 64-bit registers in the
> VRF.

i was genuinely taken aback at the fact that we will need 128 64-bit
FP registers to have an optimal efficient Vulkan implementation.
in for a penny, in for a pound, we decided that a VPU is likely to
need 128 64-bit integer registers as well.

what we then have is a potential bit of a problem with context-switching:
a whopping 2048 bytes!

so as it's all down to conventions, what we decided to do (and this
is in no way a hardware limitation), is simply *ignore* the top
registers numbered 32-127 on context-switching, and to have a
software requirement where each 3D GPU thread is force-pinned to
one and only one SMP core.

bear in mind that the first version of Kazan3D will be
inline assembler for the inner loops.

in this way, no modifications of pre-existing ABIs, compilers etc.
are needed. the bottom "standard" registers 1-31 may be context
switched "as usual". the linux kernel won't even know that
registers 32-127 are there, and won't touch them either.

it's a Monster Hack that we can clean up later, whilst still
"Getting Product Out The Door (tm)".

> In the My 66000 ISA predicates can be combined to perform && an ||
> branches as well as nested if-the-else stuff.

nice. i mentioned in a couple other posts that SV introduces
scalar branch predication as well. one of the CSR bits for
the register/predication tables is "this register is still scalar".

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 2:31:15 AM11/6/18
to
On Tuesday, November 6, 2018 at 12:14:21 AM UTC, Bruce Hoult wrote:
> On Monday, November 5, 2018 at 10:37:21 AM UTC-8, MitchAlsup wrote:
> > On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> > > one of the things that drives SV is the need to minimise the
> > > amount of modifications and maintenance to the existing
> > > RISCV toolchain (including LLVM and various back-ends such
> > > as clang, rustlang, and so on).
> >
> > Not my problem. But I think RV has lost a few brownie point for not getting
> > the base ISA up to industrial quality levels first. Then you would not be
> > having the issues you are having adding vectors, predication,...
>
> It's the classic bootstrapping problem with limited resources, isn't it?

definitely. the modular ISA flexibility of RV has allowed useful
implementations to be made in as little as 2,500 gates (i believe).
look up PicoRV32. it was used to control multi-megawatt
electromagnets in a particle accelerator because the

india is going to be replacing 40-year-old 68k processors in their
fast nuclear breeder reactors with RV64GC home-grown processors,
fabbed on india-only 180nm foundries for obvious national security
reasons.

university students and libre hardware enthusiasts can and have
created implementations *WITHOUT* needing to do the compiler,
assembler or unit tests, as all of that has already been done.

*later* they or someone else can build on that and add extra
functionality.

> What is the Minimum Viable Product? Is it better to ship
> first hardware in 2016 and hardware with vector support in 2019,
> or is it better to ship a finished (ha ha) thing in 2019?

exactly. i'd not in the *slightest* be able to remotely
consider, "oh i think i will start a Libre 3D GPU project,
today", without all that hard work having already been
done.

i don't like the ITU-style closed doors process under which
RISC-V is developed: that has absolutely nothing to do with
the fact that i fully recognise and deeply appreciate the
amount of effort that's gone into RISC-V by the Founders.

> There's no NEON equivalent and no GPU

i put in a proposal to the sifive "democratising ideas" page
that closed last month, to address that.


> Of course, having those later will open up a whole new
> range of applications and customers.

absolutely. one of the "selling points" of ARM is the fact
that a "proven" MALI GPU is available out-of-the-box. they
even provide it free of charge to licensees of the standard
ARM cores.

> But, in the meantime, *revenue* and buzz and software support.

which funds incremental improvements, in manageable chunks.

the only issue i have with RVV is: even in its base form i
consider it to be too heavy-weight for simple verifiable
implementation by small (and one-man) teams in a reasonable
timeframe.

it's adding not just extra opcodes, it's adding an entire
new pipeline, plus features that have not been seen before
in the base RV which, if you *don't* have those features
*and* the new opcodes *and* the new pipeline, you're automatically
non-compliant with the spec...

... behind an opaque closed-doors ITU-style development process.

it's just far too risky a proposition for a small team, isolated
outside of the ITU-style development process, to consider taking on.

this heavily influenced the design of SV. the absolute minimum
compliance is to provide the CSRs, and to trigger a trap handler
whenever a "tagged" register is encountered. in this way, it
is perfectly reasonable to consider deploying SV even on top of
the greatly-reduced RV32E (only 16 32-bit integer registers).
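
the minimum-compliance decode path could be sketched like this
(python; all names invented, purely to show the shape):

class SVTrap(Exception):
    pass

def decode_and_execute(insn, tagged_regs, execute_scalar):
    # minimum SV compliance: the CSRs exist, and any use of a
    # "tagged" register traps so that software (e.g. a ROM-based
    # handler) can emulate the macro-loop
    if any(r in tagged_regs for r in (insn.rd, insn.rs1, insn.rs2)):
        raise SVTrap(insn)
    execute_scalar(insn)        # untagged: plain scalar RV behaviour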

now, whether the traps may be efficiently implemented in RV32E
software, vs providing a hardware-macro-loop instead, remains
to be seen. ROM-based software-emulation of the macro-loop
may turn out to be quite efficient: we just have to see; it's
not my primary focus.

the next level(s) include permitting implementors to provide
a mixture of partial and hybrid parallelism and software-traps.
so, if an implementor does not want (or need) non-default
element widths for their application, *great*, they can punt that
to software-emulation.

i considered this to be a really, really important aspect of
SV, to make the implementation barrier to entry as low and as
flexible as possible.

if implementors find that the decision to punt to software
emulation does not provide them the anticipated performance,
they can always ramp it up...

... but here's the thing: the software-emulation will allow
them to complete and pass *all* of the unit tests, 100% of
the time, *even* as they're developing and optimising the
hardware, all in an incremental fashion.

compare and contrast this to RVV where it's "all-or-nothing",
in one of the most complex architectural areas of computer
science: parallelisation / vectorisation.


> For sure it's very nifty and worth studying. I think there's a
> lot to be said for a pure data streaming model without huge vector
> registers, especially on bigger designs that have OoO resources already.

... with the caveat (i believe) that the data streaming model is
critically dependent *on* having OoO resources.

oh. i remember now. wasn't there someone who did a full data-streaming
(scalar) ISA? i have this vague recollection last year of seeing a
post on RV isa-dev about someone who had made a full ISA which didn't
actually have any registers at all. data was passed entirely from
LOAD to STORE through ALU chains.


> One of the design goals of the official RISC-V Vector ISA
> is that it is microarchitecture-agnostic. You can implement it
> with basically a traditional SIMD hardware approach and the
> software doesn't notice, and runs fine. You can implement
> it with a basically streaming OoO data flow approach and the
> software doesn't notice, and runs fine.

whilst that's true, there's a performance penalty built-in
to RVV for all and any non-OoO (non-register-renaming)
microarchitectures, related to the zeroing on predication.

in a traditional incredibly simple predicated DSP-style SIMD
architecture, with no register-renaming, non-predicated elements
save resources: a zero bit indicates that the ALU for that lane
(if there are lanes) is simply... completely switched off.

no inputs, no outputs. saves on reads, saves on writes.

when the specification makes it *mandatory* that non-predicated
(zero-bit) elements be zero'd, such simple microarchitectures
may no longer fully optimise on power saving: they're *forced*
into a position of writing out a zero to the register file.

in SV i took a different perspective: i noticed in the discussions
that there were benefits to both approaches, so i permitted
both, dynamically, as part of the predication CSR tables.

in this way it becomes up to the compiler and assembly writers
to decide the best approach on a case by case basis.

in addition, in a non-lanes-based scalar multi-issue microarchitecture,
non-zero-predication means that the hardware-macro-loop phase of
SV may issue multiple instructions to the underlying multi-issue
execution units from the one instruction, *not* issuing instructions
to the execution units where the predication bit is zero.

a 4-issue standard microarchitecture could therefore really efficiently
and quickly be turned into an SV Monster, as the infrastructure to
do so is already there.

the advantage of such an approach: 100% execution throughput even when
there is a significant percentage of predication going on.
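
a toy model of that issue stage (my sketch: widths and names are
illustrative only):

def issue_packets(predval, VL, issue_width=4):
    # only elements whose predicate bit is set get sent to the
    # execution units, packed into full issue groups
    active = [i for i in range(VL) if predval & (1 << i)]
    return [active[i:i + issue_width]
            for i in range(0, len(active), issue_width)]

# e.g. VL=8, predval=0b10110101 -> [[0, 2, 4, 5], [7]]: the first
# packet keeps all 4 issue slots busy despite 3 masked-out elements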


> On the other hand, a traditional SIMD micro-architecture can just
> always report the same vector length. It *does* have to support some
> way to mask off the extra lanes in the vector tail, but that's
> pretty trivial.

yes, and it's a really nice aspect of vectorisation (as long as
the SIMD architecture has predication). at the end of the loop
all you do is: hard-wire the last bits of predication to zero.
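
for example (a sketch, assuming an 8-wide unit):

SIMD_WIDTH = 8

def tail_predicate(elements_left):
    n = min(elements_left, SIMD_WIDTH)
    return (1 << n) - 1     # low n lanes active, upper lanes forced off

# e.g. 3 elements left on the final pass: tail_predicate(3) == 0b00000111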

it does mean that you don't get 100% ALU utilisation, however
there's a way to mitigate that: instead of having massively-wide
8-, 16- or 32-element SIMD ALUs, cut them down: go the other way,
and do multi-issue instead.

i like vectorisation, you can probably tell :)

l.

Bruce Hoult

unread,
Nov 6, 2018, 3:18:33 AM11/6/18
to
On Monday, November 5, 2018 at 11:31:15 PM UTC-8, lk...@lkcl.net wrote:
> > One of the design goals of the official RISC-V Vector ISA
> > is that it is microarchitecture-agnostic. You can implement it
> > with basically a traditional SIMD hardware approach and the
> > software doesn't notice, and runs fine. You can implement
> > it with a basically streaming OoO data flow approach and the
> > software doesn't notice, and runs fine.
>
> whilst that's true, there's a performance penalty built-in
> to RVV for all and any non-OoO (non-register-renaming)
> microarchitectures, related to the zeroing on predication.
>
> in a traditional incredibly simple predicated DSP-style SIMD
> architecture, with no register-renaming, non-predicated elements
> save resources: a zero bit indicates that the ALU for that lane
> (if there are lanes) is simply... completely switched off.
>
> no inputs, no outputs. saves on reads, saves on writes.
>
> when the specification makes it *mandatory* that non-predicated
> (zero-bit) elements be zero'd, such simple microarchitectures
> may no longer fully optimise on power saving: they're *forced*
> into a position of writing out a zero to the register file.

That's not actually true.

You could also implement this by having a 1-bit-per-lane bitmap of which lanes have been predicated to zero. You don't actually have to write zeros to the register bits, only to the bitmap.

On reading the register you just use the bitmap to select either the register contents or else hardwired zeroes. That adds a 2:1 mux to the data path, but you probably have that anyway, to implement the zero register.
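
In Python terms the trick looks something like this (my illustration,
not behaviour specified anywhere by RVV):

class RegFile:
    def __init__(self, n):
        self.regs = [0] * n
        self.zeroed = [False] * n      # the 1-bit-per-lane bitmap

    def write(self, i, value):
        self.regs[i] = value
        self.zeroed[i] = False

    def write_zeroed(self, i):
        self.zeroed[i] = True          # no data write to the register bits

    def read(self, i):
        # the 2:1 mux: hardwired zero or the stored contents
        return 0 if self.zeroed[i] else self.regs[i]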

lk...@lkcl.net

unread,
Nov 6, 2018, 5:03:03 AM11/6/18
to
On Tuesday, November 6, 2018 at 8:18:33 AM UTC, Bruce Hoult wrote:

> That's not actually true.
>
> You could also implement this by having a 1-bit-per-lane bitmap of which lanes have been predicated to zero. You don't actually have to write zeros to the register bits, only to the bitmap.
>
> On reading the register you just use the bitmap to select either the register contents or else hardwired zeroes. That adds a 2:1 mux to the data path, but you probably have that anyway, to implement the zero register.

interesting: i like it. and, i just realised that in the
spike-sv emulator i actually ended up implementing zeroing
in READ_REG as pretty much exactly this (returning x0
on zero-predication): what hadn't occurred to me was the
next step, to actually have the zero as a bitfield
associated with the register file.

neat trick.

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 11:31:25 AM11/6/18
to
On Tuesday, November 6, 2018 at 7:31:15 AM UTC, lk...@lkcl.net wrote:

> > It's the classic bootstrapping problem with limited resources, isn't it?
>
> definitely. the modular ISA flexibility of RV has allowed useful
> implementations to be made in as little as 2,500 gates (i believe).
> look up PicoRV32. it was used to control multi-megawatt
> electromagnets in a particle accelerator because the

... sentence didn't get finished :)

the group had contacted a formal verification company, the conversation
went like this:

"hi we need a formally-verified core, for a mission-critical safety
application involving gigawatts of power, hard radiation going in
the wrong direction and damaging people and stuff, y'ken"

"sure, we can help! our proprietary processor has been formally
verified and used by our customers for 20 years"

"ok, that's great. so, well, we'd like to make sure, by checking
that for ourselves"

"oh nooo, we can't possibly provide you with the souuurce cooode,
it's our proprietary secret, y'ken"

"so who double-checks that your design is correct?"

"oh, we do"

"and who double-checks that the formal verification suite is correct?"

"oh, we do"

"and your customers *trust* you??"

"oh yes, absolutely!"

naturally, PicoRV32, by virtue of it being downloadable and made
available under a libre license, along with a corresponding formal
verification suite, was selected instead.

what's happening with RISC-V is just... amazing.

l.

Anton Ertl

unread,
Nov 6, 2018, 12:32:05 PM11/6/18
to
You present this as an advantage, but I think that this is the biggest
disadvantage of these schemes that try to implement sequences of
vector operations as (conceptual, but more efficiently implemented)
loops of scalar operations.

E.g., desktop hardware has two load units that can each load 32 bytes
per cycle (on recent CPUs) without much more effort than they need for
loading smaller amounts [1]. If you design the architecture
such that it uses a load unit for every 8 bytes, the load units will
have only a quarter of the throughput (for vectors of bytes, only
1/32nd of the throughput). Sure, you can then run gather loads at
similar speeds as contiguous loads, but you have achieved that by
turning the contiguous loads into gather loads and thus slowing them
down.

[1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
Bridge (2011), but they needed until Skylake (2015) to widen the load
units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.

For other operations, similar considerations apply, although the
tradeoffs are different: Making a FP multiply-add unit wider probably
costs more than making a load unit wider (but I may be wrong, given
that Intel implemented 256-bit wide FP units before 256-bit wide load
units; OTOH, maybe Intel tuned the CPUs for matrix multiplication).

In any case, when building on scalar operations, you also exercise the
OoO machinery 4-32 times as much as with 32-byte wide operations.

Concerning the bow to the Cray-1 that I see in several of these works,
I think that one cool thing that the Cray-1 did was that the program
could start some vector operations, and then do scalar work while the
vector stuff was being processed.

Actually, this does not quite work out because of the limited vector
length of the Cray machines, so the programmer has to write loops
containing vector instructions after all.

But consider an architecture where vector length is limited only by
memory. Vector operations could be performed almost independently of
scalar stuff; only scalar operands and memory dependencies would cause
interactions between the scalar and the vector part of the CPU. One
can imagine various ways to implement the vector part (with different
implementations using different processing granularity); I think this
can be quite a bit simpler than exercising the full reservation
station machinery for each operation on each grain.

The downside of arbitrarily long vectors is that you have to be able
to interrupt it in the middle. This means that the interrupt state
includes information about which vector operations are in progress and
how far they are. However, if you let each dependence chain run to
completion of processing all started grains, you only need to save
metadata on an interrupt, and do not need to save registers that
contain intermediate result grains.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

unread,
Nov 6, 2018, 1:23:55 PM11/6/18
to
MitchAlsup <Mitch...@aol.com> writes:
>But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
>processed.

AVX(2) has VMASKMOVPS/VMASKMOVPD/VMASKMOVDQU that allow loading and
storing sublocations with a predicate mask. For the ALU operations,
processing more is not that bad, as long as you don't store the
results in the end, and as long as you don't produce an exception on
loading.

I have not looked closely at AVX512, but AFAIK it supports sublocation
predication for all operations.

>The MMX model only works when the data is packed in memory, and
>only when all sub-locations want to be processed similarly (with a couple of
>exceptions.)

IA-64 worked well at data-parallel stuff. Yet it still did not
conquer the supercomputing world. My explanation for this is that,
while it was more flexible than SIMD extensions, SIMD extensions are
good enough (compared to IA-64) for most stuff. I.e., the cases where
the sublocations have to be processed differently are either not that
frequent, or they are covered by the exceptions you mention. In the
meantime, we have GPGPUs, but I think that vector/SIMD stuff in the
CPU can still be useful.

MitchAlsup

unread,
Nov 6, 2018, 3:09:08 PM11/6/18
to
On Tuesday, November 6, 2018 at 11:32:05 AM UTC-6, Anton Ertl wrote:
The counterargument is that if one makes vector loads and vector stores,
one assigns the semantic that they either run to completion or did not
start. {At least every one I studied did.}

Then you get in the loop of having stride 1 vector loads and stores,
stride K loads and stores, and vector indirect (gather scatter) loads
and stores.

Whereas when it is done loop-by-loop and each load/store is its own unit
of visible pipeline work, all those variants are unnecessary. The
power overhead is all in operand routing and not in address calculation
or memory access.
>
> [1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
> Bridge (2011), but they needed until Skylake (2015) to widen the load
> units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.
>
> For other operations, similar considerations apply, although the
> tradeoffs are different: Making a FP multiply-add unit wider probably
> costs more than making a load unit wider (but I may be wrong, given
> that Intel implemented 256-bit wide FP units before 256-bit wide load
> units; OTOH, maybe Intel tuned the CPUs for matrix multiplication).
>
> In any case, when building on scalar operations, you also exercise the
> OoO machinery 4-32 times as much as with 32-byte wide operations.

But at the same time, the FETCH-DECODE-ISSUE stages of the pipeline
are quiescent. For simple instruction sequences, this part of the
pipe can represent 50% of the power to perform instructions. Leaving
this quiet balances out with the RSs firing more often and capturing
operands more often.
>
> Concerning the bow to the Cray-1 that I see in several of these works,
> I think that one cool thing that the Cray-1 did was that the program
> could start some vector operations, and then do scalar work while the
> vector stuff was being processed.

Also note: on CRAY-1 one vector load did not start until the previous
vector (or scalar) load completed. This proved to be a big problem,
and was fixed on CRAY-XMP where the vector pipe had 3 memory units
per cycle -- what we would call 3 memory lanes today.

So::
LDV Ra,[addrA]
LDV Rb,[addrB]
FMUL Rc,Ra,Rb

was piped in CRAY-1 (the second load waits for the first to complete):

LDV  |---+++++++++++++++++++++++++|
LDV                                |---V++++++++++++++++++++++++|
FMUL                                     |-----*********************|

whereas in CRAY-XMP it was piped as (the loads overlap):

LDV  |---V++++++++++++++++++++++++|
LDV   |---V++++++++++++++++++++++++|
FMUL       |-----*********************|

>
> Actually, this does not quite work out because of the limited vector
> length of the Cray machines, so the programmer has to write loops
> containing vector instructions after all.
>
----------------------------------------------------------------------

> But consider an architecture where vector length is limited only by
> memory. Vector operations could be performed almost independently of
> scalar stuff; only scalar operands and memory dependencies would cause
> interactions between the scalar and the vector part of the CPU. One
> can imagine various ways to implement the vector part (with different
> implementations using different processing granularity); I think this
> can be quite a bit simpler than exercising the full reservation
> station machinery for each operation on each grain.

On the other hand: those M-M Vector machines performed one instruction
for lots of cycles, and then performed the next instruction for lots of
cycles. There was no overlap across instructions--probably for interrupt
reasons--but routing could have been another rationale.

What we want is for the instructions in the whole loop to execute
loop by loop, not instruction by instruction.

>
> The downside of arbitrarily long vectors is that you have to be able
> to interrupt it in the middle. This means that the interrupt state
> includes information about which vector operations are in progress and
> how far they are. However, if you let each dependence chain run to
> completion of processing all started grains, you only need to save
> metadata on an interrupt, and do not need to save registers that
> contain intermediate result grains.

On the other hand, if the instructions run loop-by-loop and an interrupt
takes place, the IP and loop variable contain everything needed to
restart the calculation. No extra state needed.

George Neuner

unread,
Nov 6, 2018, 4:20:32 PM11/6/18
to
On Tue, 06 Nov 2018 14:48:14 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>But consider an architecture where vector length is limited only by
>memory. Vector operations could be performed almost independently of
>scalar stuff; only scalar operands and memory dependencies would cause
>interactions between the scalar and the vector part of the CPU. One
>can imagine various ways to implement the vector part (with different
>implementations using different processing granularity); I think this
>can be quite a bit simpler than exercising the full reservation
>station machinery for each operation on each grain.
>
>The downside of arbitrarily long vectors is that you have to be able
>to interrupt it in the middle. This means that the interrupt state
>includes information about which vector operations are in progress and
>how far they are. However, if you let each dependence chain run to
>completion of processing all started grains, you only need to save
>metadata on an interrupt, and do not need to save registers that
>contain intermediate result grains.

I've been an advocate of something like this for a long time.

I had occasion to work with DSPs that could perform 2-dimensional [X,Y
counter] memory->memory DMA. Unfortunately, all it could do was
*move* data around ... it would have been nice to be able to perform
computations on the in-flight data streams.

I would love to see something like this realized. I don't think
arbitrary striding / gather really is necessary - contiguous blocks
would be fine - but certainly I would want arbitrary length.

I think a coprocessor-like approach using DMA offers a lot more than
loops of short vector operations. Even with Cray-like vector lengths,
too much program intervention is needed even if you can do something
else while waiting.

YMMV.

>- anton

George

MitchAlsup

unread,
Nov 6, 2018, 4:47:11 PM11/6/18
to
On Tuesday, November 6, 2018 at 3:20:32 PM UTC-6, George Neuner wrote:
> On Tue, 06 Nov 2018 14:48:14 GMT, an...@mips.complang.tuwien.ac.at
> (Anton Ertl) wrote:
>
> >But consider an architecture where vector length is limited only by
> >memory. Vector operations could be performed almost independently of
> >scalar stuff; only scalar operands and memory dependencies would cause
> >interactions between the scalar and the vector part of the CPU. One
> >can imagine various ways to implement the vector part (with different
> >implementations using different processing granularity); I think this
> >can be quite a bit simpler than exercising the full reservation
> >station machinery for each operation on each grain.
> >
> >The downside of arbitrarily long vectors is that you have to be able
> >to interrupt it in the middle. This means that the interrupt state
> >includes information about which vector operations are in progress and
> >how far they are. However, if you let each dependence chain run to
> >completion of processing all started grains, you only need to save
> >metadata on an interrupt, and do not need to save registers that
> >contain intermediate result grains.
>
> I've been an advocate of something like this for a long time.
>
> I had occasion to work with DSPs that could perform 2-dimensional [X,Y
> counter] memory->memory DMA. Unfortunately, all it could do was
> *move* data around ... it would have been nice to be able to perform
> computations on the in-flight data streams.

About 30 years ago, I realized that a Vector processor was simply a
DMA device that mangles the data on the way through.

lk...@lkcl.net

unread,
Nov 7, 2018, 1:34:36 AM11/7/18
to
On Tuesday, November 6, 2018 at 5:32:05 PM UTC, Anton Ertl wrote:

> [1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
> Bridge (2011), but they needed until Skylake (2015) to widen the load
> units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.

there is no question and absolutely no doubt about the fact that
an increased computational capability requires an increased
t