
[rfc] SimpleV vectorisation / parallelism hardware abstraction API for RISC-V [1]


lk...@lkcl.net

Nov 5, 2018, 6:54:59 AM

over the past several months i've been developing an abstraction layer
for vectorisation and parallelisation of RISC-V. i've never encountered
anything remotely similar in any commercial processor design, hence
why it could seriously benefit from a review.

this is not a hypothetical design concept: it's *going* to be implemented,
for use as the core basis of a Libre GPGPU (general-purpose 3D Graphics
Processor), with a secondary requirement to have sufficient legs to
act as a VPU (Video Processing Unit) as well.

the core of the concept is that standard scalar registers are "tagged"
with CSRs (Control Status Registers) indicating that use of that
register shall result in not one *single* (scalar) operation but
*multiple contiguous* scalar operations being issued to the ALU.

it is then up to the micro-architectural implementor to decide whether
those multiple operations shall be executed sequentially, or in
parallel, or (as is done with the Broadcom VideoCore IV), a hybrid
mixture of both (Broadcom call this "Virtual Parallelism").

effectively SV is a hardware version of c/c++ macro loop-expansion.
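
as a purely illustrative sketch (not the spec's actual semantics: the
names regfile/tagged/vl below are invented, and element widths and the
CSR lookup are glossed over), a "tagged" add behaves roughly like this:

    /* illustrative model of SV "tagging": if any register of an ADD is
       marked as a vector, the single scalar ADD is issued as VL
       contiguous scalar ADDs; scalar operands simply repeat. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NREGS 128
    uint64_t regfile[NREGS];
    bool     tagged[NREGS];    /* set up via CSRs in the real design */
    int      vl;               /* vector length, also CSR state      */

    void sv_add(int rd, int rs1, int rs2)
    {
        int n = (tagged[rd] || tagged[rs1] || tagged[rs2]) ? vl : 1;
        for (int i = 0; i < n; i++)
            regfile[rd + (tagged[rd] ? i : 0)] =
                regfile[rs1 + (tagged[rs1] ? i : 0)] +
                regfile[rs2 + (tagged[rs2] ? i : 0)];
    }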

SV borrows and builds on RVV (a proposed vectorisation API for RISC-V),
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc

RVV in turn builds on the Cray Architecture. the primary benefit
of variable-length vectorisation is outlined here:
https://www.sigarch.org/simd-instructions-considered-harmful/

the draft specification for SV is here:
https://libre-riscv.org/simple_v_extension/specification/


if anyone finds a c++ source code reference implementation easier
to read than "just words", that may be found here:
https://git.libre-riscv.org/?p=riscv-isa-sim.git;a=shortlog;h=refs/heads/sv

and associated unit tests here:
https://git.libre-riscv.org/?p=riscv-tests.git;a=shortlog;h=refs/heads/sv

and preliminary build instructions for the simulator and unit tests
found here:
https://libre-riscv.org/3d_gpu/spike_sv/

much appreciated.

l.

MitchAlsup

Nov 5, 2018, 11:16:32 AM

On Monday, November 5, 2018 at 5:54:59 AM UTC-6, lk...@lkcl.net wrote:
> over the past several months i've been developing an abstraction layer
> for vectorisation and parallelisation of RISC-V. i've never encountered
> anything remotely similar in any commercial processor design, hence
> why it could seriously benefit from a review.

Over the last year I have been working on a Virtual Vector extension to
My 66000 ISA. In this extension the compiler is given the illusion of
a register to register CRAY-like vector instructions set, but there are
only 2 instructions added to the ISA in support of vectors. In particular;
a) there are no vector registers (saving massive real estate)
b) no vector length register (nor a conceptual length of the vector)
c) no vector control registers (no vector state, full vector perf)
d) no vector condition codes (scalar predication covers this)

Consider for a moment DAXPY::

void daxpy(size_t n, double a, const double x[], double y[])
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a*x[i] + y[i];
    }
}

in My 66000 ISA this could be expressed as:

daxpy:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   exit
top:
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    FMAC  Ry,Ra,Rx,Ry
    STD   Ry,[Ryb+Ri<<3]
    ADD   Ri,Ri,1
    CMP   Rt,Ri,Rn
    BLT   Rt,top
exit:
    JMP   R0

Note: This takes 2 fewer instructions than RV both overall and in the loop.
In order to convert this function into a vector function, one needs two
pieces of decoration::

daxpy:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   exit
top:  VEC   {SIV,SIV,VVV,SIV,ICI}   // decorate the registers
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    FMAC  Ry,Ra,Rx,Ry
    STD   Ry,[Ryb+Ri<<3]
    LOOP  Ri,1,Rn,LT                // compress loop overhead
exit:
    JMP   R0

The decoration at the top of the loop annotates which registers in the
following instructions are Scalars (S), Vectors (V), and Loop-Indexes (I).

The mechanism of execution is as follows:: The VEC instruction casts a
shadow over the instructions in the vectorized loop. Scalar registers
only have to be fetched from the register file ONCE, Vector registers
will be captured Once per loop, and Writes to the Index register cause
another LOOP to be initiated in <wherever the instructions are being
buffered::but most likely something like reservation stations.>

The LOOP instruction does not need a branch target, that was already
supplied by the VEC instruction; all it needs is the Loop index register,
an increment, a comparison, and a termination.
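
(Purely to illustrate the semantics rather than the encoding, LOOP Ri,1,Rn,LT
behaves roughly like the C below; the repeat target itself is supplied by the
matching VEC instruction.)

    /* rough semantics of LOOP Ri,1,Rn,LT: increment the index, compare,
       and either run the shadowed loop body again or fall through. */
    #include <stdint.h>

    static int loop_again(int64_t *Ri, int64_t inc, int64_t Rn)
    {
        *Ri += inc;          /* increment              */
        return *Ri < Rn;     /* LT comparison: repeat? */
    }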

Since the My 66000 ISA already has predication, predication can be used
for small scale flow control inside vector loops. The reservation stations
know their predicate (if any::it gets installed in the first pass over
the loop) and fire/hold based on the predicate for this loop.

Architecturally, there can be as many lanes as the designers want to
build (NEC SX5+) with the caution that the memory system needs to scale
proportionate to the calculation system.

While executing, vector values flow through the forwarding paths and through
calculation units, not bothering to be written into the register file.

If an enabled exception is raised, the vector calculations preceding the
exception are allowed to complete, and vector data is stored in the
scalar registers, exception control transfer is performed, and the
exception handled as if there were no vectorization at all.

In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
are quiet, waiting for the vector to complete (and getting FETCH out of the
way of vectors in memory.)

The memory system performs store-to-load forwarding analysis so that loops
such as:

void looprecurance(size_t n, double a, double x[])
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a*x[i-10] + x[i-7];
    }
}

remain vectorized; one calculation simply becomes dependent on the 7th and 10th
previous calculations through a memory system. No special analysis is required
in the compiler to vectorize this loop, nor is any special detection necessary
to avoid vectorizing this loop. The basic tenet is that if the scalar code can
be written in a simple loop then one can change the top and bottom of the
loop and presto--it is vectorized.

All of the advantages of a CRAY-like vector ISA, virtually no change to the
actual ISA (2 instructions), and precise exceptions too.

Terje Mathisen

Nov 5, 2018, 11:46:27 AM

MitchAlsup wrote:
> The memory system performs store-to-load forwarding analysis so that
> loops such as:
>
> void looprecurance(size_t n, double a, const double x[]) { for
> (size_t i = 0; i < n; i++) { x[i] = a*x[i-10] + x[i-7]; } }
>
> remain vectorized; one calculation simply becomes dependent on the
> 7th and 10th previous calculations through a memory system. No
> special analysis is required in the compiler to vectorize this loop,
> nor is any special detection necessary to avoid vectorizing this
> loop. The basic tenet is that if the scalar code can be written in a
> simple loop that one can change the top and bottom of the loop and
> presto-it is vectorized.
>
> All of the advantages of a CRAY-like vector ISA, virtually no change
> to the actual ISA (2 instructions), and precise exceptions too.

Mitch, you have written more than enough about this feature to whet my
appetite:

How much hardware do you actually need in order to run this at
comparable speed to some Intel-specific 4-wide AVX loop code?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Nov 5, 2018, 12:27:45 PM

Well, given that the goal was to run stuff like daxpy at 6 IPC, or
2 flops per cycle per lane, the answer to your question is "not much".

So, let us postulate a set of calculation units that can perform 128-bit
or 256-bit wide AVX codes and a cache system that can supply and absorb
these sized accesses.

In virtual vector, the more natural means would be to organize 4 lanes
of calculation, each lane 64-bits wide, each lane processing 1 loop in
coordination with the other lanes.

You would need
a) 2 loads per cycle per lane
b) 1 FMAC per cycle per lane
c) 1 store per cycle per lane
d) 1 LOOP per cycle across lanes
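
(In round numbers, assuming 64-bit doubles throughout: 4 lanes x 1 FMAC is
8 flops per cycle, and 4 lanes x (2 loads + 1 store) x 8 bytes is 96 bytes
per cycle of cache traffic to sustain at full rate.)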

Since all of the loads and stores are present at the datum level, they do
not necessarily need the 4 datums to reside side by side in memory, nor
do they need 128-bit wide or 256-bit wide registers. A long time ago I took
a look at radix 2 FFT and whether I could vectorize it::

      SUBROUTINE FFT2( A, N )
      COMPLEX DOUBLE PRECISION A[M]
      DOUBLE PRECISION RE[M] = &REAL(A[0])
      DOUBLE PRECISION IM[M] = &IMAG(A[0])
      N2 = N
      DO 10 K = 1, M
         DOUBLE PRECISION RO = &RE[N2]
         DOUBLE PRECISION IN = &IM[N2]
         N1 = N2
         N2 = N2/2
         E = 6.28318/N1
         A = 0
         DO 20 J = 1, N2
            C = COS (A)
            S =-SIN (A)
            A = J*E
            DO 30 I = J, N, N1
               XT = RE(I) - RO(I)
               YT = IM(I) - IN(I)
               RE(I) = RE(I) + RO(I)
               IM(I) = IM(I) + IN(I)
               RO(I) = XT*C - YT*S
               IN(I) = XT*S + YT*C
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE

and got something like this:

FFT2:
    // MOV  RRE,RA               // RA not damaged during routine
    ADD    RIM,RA,8
    MOV    RN2,RN
    MOV    RK,1
loop10:
    LDA    RRO,[RRE+RN2<<3]
    LDA    RIN,[RIM+RN2<<3]
    MOV    RN1,RN
    SRA    RN2,RN,1
    CVT    (double)RN1F,(int)RN1
    FDIV   RE,6.283185307179586476925286766559,RN1F
    MOV    RA,0
    MOV    RJ,1
loop20:
    COS    RC,RA
    SIN    -RS,RA
    CVT    (double)RJF,(int)RJ
    FMUL   RA,RJF,RE
    MOV    RI,1
loop30:
    VEC    {VSI*4,VVV*12,VSI*4,ISS}
    LDD    RXI,[RRE+RI<<3]
    LDD    RXL,[RRO+RI<<3]
    LDD    RYI,[RIM+RI<<3]
    LDD    RYL,[RIN+RI<<3]
    FADD   RXT,RXI,-RXL
    FADD   RYT,RYI,-RYL
    FADD   RXI,RXI,RXL
    FADD   RYI,RYI,RYL
    FMUL   RXC,RXT,RC
    FMUL   RXS,RYT,RS
    FMUL   RYS,RXT,RS
    FMUL   RYC,RYT,RC
    FADD   RXL,RXC,-RXS
    FADD   RYL,RYS,RYC
    STD    RXI,[RRE+RI<<3]
    STD    RYI,[RIM+RI<<3]
    STD    RXL,[RRO+RI<<3]
    STD    RYL,[RIN+RI<<3]
    LOOP   RI,RN1,RN,LE

    ADD    RJ,RJ,1
    CMP    Rc,RJ,RN2
    BLE    Rc,loop20
    ADD    RK,RK,1
    CMP    Rc,RK,RM
    BLE    loop10
    JMP    R0

For full perf on this one, one would need 6 lanes of FADD, 4 lanes of FMUL,
4 lanes of LDD, and 4 lanes of STD. If one had that set of resources, one
would be performing a memory to memory radix 2 FFT butterfly every cycle.

{Note: some consider it numerically dangerous to fuse an FMAC out of FFT
FMULs and FADDs; so I didn't. It did not fit the general calculation
sequence either.}

lk...@lkcl.net

Nov 5, 2018, 1:03:12 PM

On Monday, November 5, 2018 at 4:16:32 PM UTC, MitchAlsup wrote:

> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)

yes. there is a huge disadvantage of traditional vector ISAs:
massive duplication and specialisation of the opcodes.

in RVV they have had to add approximately a 25-30% duplication
of the xBitManip extension for the sole purpose of managing predicate
bits... *without* that hardware being actually available for the
purposes of general-purpose bit manipulation.

a combination of SV and xBitManip extensions provides near 100%
complete coverage of RVV's vectorisation functionality.

there is one perceived advantage: the vector pipeline may be entirely
separated from the scalar pipeline. i would perceive that as a
*dis*advantage, on the basis that it results in huge duplication of
silicon, particularly arithmetic pipelines.

> b) no vector length register (nor a conceptual length of the vector)

that's interesting.

> c) no vector control registers (no vector state, full vector perf)

some "setup" is required however. the state information needs to
be set up one way or the other. CSRs or instructions, it's one
or the other.

> d) no vector condition codes (scalar predication covers this)

yes it does. RISC-V scalar operations do not have the concept
of predication. therefore, SV has an additional CSR "predication"
table that allows the *use* of a given (ordinarily scalar) register
to indicate "hey this operation is now predicated, because you
used register x5, and the CSR lookup table says that register x3's
bits are to be used as the predicate".

in this way there was absolutely no need to modify the RISC-V
scalar ISA *at all*.
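
a tiny illustrative sketch of that lookup (the table layout and names
below are made up purely for illustration; the real CSR format is in
the draft spec):

    /* illustrative only: the predication CSR table maps a "use" register
       to the register whose bits act as the predicate for the loop. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint8_t target_reg; uint8_t pred_reg; bool active; }
        pred_entry;

    pred_entry pred_table[16];     /* table size illustrative */
    uint64_t   regfile[128];

    /* bitmask applied to the element loop for operations that use
       register r; all-ones if r is not predicated at all */
    uint64_t lookup_predicate(int r)
    {
        for (int i = 0; i < 16; i++)
            if (pred_table[i].active && pred_table[i].target_reg == r)
                return regfile[pred_table[i].pred_reg];
        return ~0ULL;
    }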


> Consider for a moment DAXPY::

we like DAXPY :)

> Note: This takes 2 fewer instructions that RV both overall and in the loop.
> In order to convert this function into a vector function, one needs two
> pieces of decoration::
>
> daxpy:
> MOV Ri,0
> CMP Rt,Ri,Rn
> BGE exit
> top: VEC {SIV,SIV,VVV,SIV,ICI} // decorate the registers
> LDD Rx,[Rxb+Ri<<3]
> LDD Ry,[Ryb+Ri<<3]
> FMAC Ry,Ra,Rx,Ry
> STD Ry,[Ryb+Ri<<3]
> LOOP Ri,1,Rn,LT // compress loop overhead
> exit:
> JMP R0
>
> The decoration at the top of the loop annotates which registers in the
> following instructions are Scalars (S), Vectors (V), and Loop-Indexes (I).

ok, so yes: the "mark as vectorised" is the basic core principle of
SV. in SV User-Mode there are 16 available CSR entries which may be used to
mark up to 16 registers (FP or INT) as "vectorised".

what i have *not* done is add "loop indices". there is a very simple
reason: SV is *not* about adding new instructions, it's about
expressing *parallelism*.

so for loop indices to be added, a *separate* extension would be needed
that... added looping.

in DSP terminology this is called zero-overhead looping, and at its
most effective (TI DSPs for example) a full FFT may be achieved
*in a single instruction* with 100% utilisation, thanks to a zero
overhead loop control mechanism.

i therefore looked up and contacted someone who created something
called "ZOLC", and obtained the verilog source code. if anyone
is interested i can provide a copy on request, or put you in touch
with the author.

the author of ZOLC provided an amazing example of implementing
MPEG decode, and coming up with a whopping 38% decrease in
completion time... *with no branch prediction unit*.


> The mechanism of execution is as follows:: The VEC instruction casts a
> shadow over the instructions in the vectorized loop. Scalar registers
> only have to be fetched from the register file ONCE, Vector registers
> will be captured Once per loop,

interesting that you've noted the loop invariance of scalar registers.

where are the actual vector registers? *are* there any actual vector
registers? in SV, the scalar register file is "remapped" to
"vectorised" by being treated as if it was a contiguous memory block,
and the "scalar" register is a pointer to the starting element.

in this way, an ADD r10 r20 r30 with a VL of 3 actually becomes:

    ADD r10 r20 r30
    ADD r11 r21 r31
    ADD r12 r22 r32

assuming that all of r10, r20 and r30 have been marked as "vectorised",
that is.


> and Writes to the Index register cause
> another LOOP to be initiated in <wherever the instructions are being
> buffered::but most likely something like reservation stations.>
>
> The LOOP instruction does not need a branch target, that was already
> supplied by the VEC instruction; all it needs is the Loop index register,
> an increment, a comparison, and a termination.

this is classic DSP-style zero-overhead looping. it's pretty effective,
and completely does away with the need for branch prediction. in
the ZOLC implementation it can even be nested, and there can be
multi-variable condition codes causing jumps *between nesting levels*...
all producing 100% full pipelines, no stalls, *no branch prediction at all*.

> Architecturally, there can be as many lanes as the designers want to
> build (NEC SX5+) with the caution that the memory system needs to scale
> proportionate to the calculation system.

absolutely. it's been well-known for many many decades that the
bottleneck is never the computational aspect of vector processors:
it's the memory bandwidth.

> If an enabled exception is raised, the vector calculations preceding the
> exception are allowed to complete, and vector data is stored in the
> scalar registers, exception control transfer is performed, and the
> exception handled as if there were no vectorization at all.

in SV, one of the design criteria was that it had to be realistically
achievable for a single computer science student or a libre hardware
engineer to implement in a reasonably short time-frame.

when i first encountered RVV, the RISC-V Vectorisation proposal, i
was very surprised to learn of the design criteria that, firstly,
elements and the operations on them must be treated as being executed
sequentially, and secondly, that if an exception occurred and there
were elements further along (higher indices), the perfectly-good
results had to be THROWN OUT.

six months later i began to understand the value of this design decision.

the issue is that exceptions, such as virtual memory page table
misses, have to be re-entrant. anything more complex than the above
design criteria is just too complex to handle.

so the state that's stored in SV (and i believe RVV as well) has to
include the loop indices. in SV there are actually two indices (due
to twin-predication on certain operations).

there's separate state for M-Mode, S-Mode (supervisor mode) and U-Mode,
allowing each to continue to use their own separate vectorisation
to manage lower-privilege execution levels.
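
(as a rough mental model only - the field names below are invented, the
actual layout is in the draft spec - the state that has to survive a trap
and be restored on re-entry looks something like this:)

    /* mental model of per-privilege-mode SV state that survives an
       exception such as a page-table miss mid-way through a vector op. */
    #include <stdint.h>

    typedef struct {
        uint16_t vl;          /* vector length                          */
        uint16_t src_index;   /* element index reached on the source... */
        uint16_t dest_index;  /* ...and destination side (twin-pred)    */
        /* plus the register-tagging and predication CSR tables         */
    } sv_state;

    sv_state state_u, state_s, state_m;   /* one per privilege mode */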


> In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
> are quiet, waiting for the vector to complete (and getting FETCH out of the
> way of vectors in memory.)

interesting. what happens if the loop is say 2,000 to 4,000 instructions
in length?


> The memory system performs store-to-load forwarding analysis so that loops
> such as:
>
> void looprecurance(size_t n, double a, const double x[])
> {
> for (size_t i = 0; i < n; i++) {
> x[i] = a*x[i-10] + x[i-7];
> }
> }
>
> remain vectorized; one calculation simply becomes dependent on
> the 7th and 10th previous calculations through a memory system.

this is accidentally provided in SV as well :)

the CSR "register tagging" system actually allows redirection
of the registers as well. so, at the assembly level the
instruction says "ADD x3 x5 x7" however register x3 is redirected
to *actually* point at x10, x5 is redirected to point at x12,
and so on.

in this way, it should be clear that if the loop length is
say 10, then due to the hardware-macro-loop unrolling that ADD to
multiple contiguous registers, they're *going* to overlap:

ADD x3 x5 x7, VL=10, x3 redirects to x10, x5 redirects to x12:

becomes:

ADD x10 x12 x7
ADD x11 x13 x8
ADD x12 x14 x9 <- overlap already begins
ADD x13 x15 x10
...



> No special analysis is required
> in the compiler to vectorize this loop, nor is any special
> detection necessary to avoid vectorizing this loop. The basic
> tenet is that if the scalr code can
> be written in a simple loop that one can change the top and bottom
> of the loop and presto-it is vectorized.

exactly. it's pretty elegant, isn't it?

> All of the advantages of a CRAY-like vector ISA, virtually no change to the
> actual ISA (2 instructions), and precise exceptions too.

one of the things that drives SV is the need to minimise the
amount of modifications and maintenance to the existing
RISCV toolchain (including LLVM and various back-ends such
as clang, rustlang, and so on).

by not adding any new instructions *at all* i can write unit tests
purely in *standard* RISC-V assembler. the *behaviour* is different...
and that's absolutely fine.

in the target primary application we can go straight ahead with
writing a 3D GPU driver (Kazan3D, a Vulkan LLVM software renderer),
use the *existing* scalar RISCV64 compilers, and just put in
hand-coded assembler..

...or, better yet, cheat a little by writing scalar functions,
and, before calling them (and on exit) have a bit of inline assembler
that sets up and tears down the required vectorisation :)

l.

lk...@lkcl.net

Nov 5, 2018, 1:33:40 PM

On Monday, November 5, 2018 at 5:27:45 PM UTC, MitchAlsup wrote:
> On Monday, November 5, 2018 at 10:46:27 AM UTC-6, Terje Mathisen wrote:

> > Mitch, you have written more than enough about this feature to whet my
> > appetite:
> >
> > How much hardware do you actually need in order to run this at
> > comparable speed to some Intel-specific 4-wide AVX loop code?

terje: do take a look at that "simd considered harmful" article
in the OP. it's well worth reading.

> DO 30 I = J, N, N1
> XT = RE(I) - RO(I)
> YT = IM(I) - IN(I)
> RE(I) = RE(I) + RO(I)
> IM(I) = IM(I) + IN(I)
> RO(I) = XT*C - YT*S
> IN(I) = XT*S + YT*C
> 30 CONTINUE

just looking at this, i believe that the regular nature
of the real/complex imaginary numbers here could benefit
from SV's "REMAP" functionality.

one of the requirements of the Khronos Group Vulkan API
is that matrices be a top level primitive. unfortunately,
the API hands you an array of data structures and whilst
you _could_ drop the data down to memory and back to
restructure it, we cannot do so for this particular
implementation as the power penalty is too great.

so to cater for this i added a 1D/2D/3D "remap"
system, which takes the loop index and "reshapes" it.

i *believe* it may be possible to apply a simple
re-shaping to the complex numbers, here, to treat
them as a 2D array. where one would normally see
array indices of:

0 1 2 3 4 5

instead the "reshaper" may map these to:

0 3 1 4 2 5

thus allowing, i believe, the real and imaginary parts to be
placed into the same sequential array, yet a seemingly
"sequential" set of FMUL and FADDs does the right thing.

this would, potentially and hypothetically, if i could get my
head round the algorithm properly, halve the total number
of instructions used in the loop. not quite, due to
the last two, one being an "add" and the other being
a "subtract".


> and got something like this:

> loop10:
> loop20:
> loop30:
> VEC {VSI*4,VVV*12,VSI*4,ISS}
> LOOP RI,RN1,RN,LE
> BLE Rc,loop20
> BLE loop10

so, like traditional DSP-style zero-overhead loops,
there's only one level of hardware-looping, here.

this is why i like ZOLC, because in ZOLC *all three*
loops may be expressed, to give full 100% ALU
utilisation with zero branch prediction needed.

> For full perf on this one, one would need 6 lanes of FADD, 4 lanes of FMUL,
> 4 lanes of LDD, and 4 lanes of STD. If one had that set of resources, one
> would be performing a memory to memory radix 2 FFT butterfly every cycle.

that sounds about on a par with the TI DSP series: it's a VLIW-like
architecture, where the inner loop fits into a single (14-word-long)
instruction. consequently they provide the same memory and
computational resources and achieve zero stalling.

l.

Terje Mathisen

Nov 5, 2018, 1:33:46 PM

Nice!

> SUBROUTINE FFT2( A, N )
> COMPLEX DOUBLE PRECISION A[M]
> DOUBLE PRECISION RE[M] = &REAL(A[0])
> DOUBLE PRECISION IM[M] = &IMAG(A[0])
> N2 = N
> DO 10 K = 1, M
> DOUBLE PRECISION RO = &RE[N2]
> DOUBLE PRECISION IN = &IM[N2]
> N1 = N2
> N2 = N2/2
> E = 6.28318/N1
> A = 0
> DO 20 J = 1, N2
> C = COS (A)
> S =-SIN (A)

Not even using SIN/COS tables? Wow! :-)

Here it looks like you need at least 4-way associativity in order to
make sure that the four source/target vectors don't trash each other
even if the vector offsets are a multiple of the page size, but as long as
you have that you should be OK, right?

Those requirements don't seem too bad. :-)

MitchAlsup

Nov 5, 2018, 1:37:21 PM

On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> On Monday, November 5, 2018 at 4:16:32 PM UTC, MitchAlsup wrote:
>
> > Over the last year I have been working on a Virtual Vector extension to
> > My 66000 ISA. In this extension the compiler is given the illusion of
> > a register to register CRAY-like vector instructions set, but there are
> > only 2 instructions added to the ISA in support of vectors. In particular;
> > a) there are no vector registers (saving massive real estate)
>
> yes. there is a huge disadvantage of traditional vector ISAs:
> massive duplication and specialisation of the opcodes.
>
> in RVV they have had to add approximately a 25-30% duplication
> of the xBitManip extension for the sole purpose of managing predicate
> bits... *without* that hardware being actually available for the
> purposes of general-purpose bit manipulation.
>
> a combination of SV and xBitManip extensions provides near 100%
> complete coverage of RVV's vectorisation functionality.
>
> there is one perceived advantage: the vector pipeline may be entirely
> separated from the scalar pipeline. i would perceive that as a
> *dis*advantage, on the basis that it results in huge duplication of
> silicon, particularly arithmetic pipelines.

The Vector machine can be "attached at the hip" in a My 66000 implementation
or it can be microcoded to use the same calculation and agen units the
scalar machine uses. The SW model is invariant.
>
> > b) no vector length register (nor a conceptual length of the vector)
>
> that's interesting.
>
> > c) no vector control registers (no vector state, full vector perf)
>
> some "setup" is required however. the state information needs to
> be set up one way or the other. CSRs or instructions, it's one
> or the other.

The VEC instruction casts that information over the loop.
>
> > d) no vector condition codes (scalar predication covers this)
>
> yes it does. RISC-V scalar operations do not have the concept
> of predication.

Bad Move, By the way.

> therefore, SV has an additional CSR "predication"
> table that allows the *use* of a given (ordinarily scalar) register
> to indicate "hey this operation is now predicated, because you
> used register x5, and the CSR lookup table says that register x3's
> bits are to be used as the predicate".
>
> in this way there was absolutely no need to modify the RISC-V
> scalar ISA *at all*.

But if the RV scalar code had predication, there would have been no need
to do this and make it available only for the vector stuff. Thus by not
adding instructions, you add instructions--and become less expressive in the
process.

Marking the loop indexes enables the reservation stations to increment
their counters and wait for the next operand to arrive. That is, when
issuing instructions into the RS, those entries marked I watch for the
execution of the LOOP instruction, and when it fires, they increment
and wait for the operands of the i^th loop to arrive.
>
> so for loop indices to be added, a *separate* extension would be needed
> that... added looping.
>
> in DSP terminology this is called zero-overhead looping, and at its
> most effective (TI DSPs for example) a full FFT may be achieved
> *in a single instruction* with 100% utilisation, thanks to a zero
> overhead loop control mechanism.
>
> i therefore looked up and contacted someone who created something
> called "ZOLC", and obtained the verilog source code. if anyone
> is interested i can provide a copy on request, or put you in touch
> with the author.

Could you send me a copy?
>
> the author of ZOLC provided an amazing example of implementing
> MPEG decode, and coming up with a whopping 38% decrease in
> completion time... *with no branch prediction unit*.
>
>
> > The mechanism of execution is as follows:: The VEC instruction casts a
> > shadow over the instructions in the vectorized loop. Scalar registers
> > only have to be fetched from the register file ONCE, Vector registers
> > will be captured Once per loop,
>
> interesting that you've noted the loop invariance of scalar registers.

Power savings and you don't have to make the reservation stations count.
>
> where are the actual vector registers? *are* there any actual vector
> registers? in SV, the scalar register file is "remapped" to
> "vectorised" by being treated as if it was a contiguous memory block,
> and the "scalar" register is a pointer to the starting element.

The vector registers are, in effect, in the reservation station operand
containers, reutilized as calculations get done.
>
> in this way, an ADD r10 r20 r30 with a VL of 3 actually becomes:
> ADD r10 r20 r30
> ADD r11 r21 r31
> ADD r12 r22 r32
>
> assuming that all of r10, r20 and r30 have been marked as "vectorised",
> that is.

Same principle, R7[k], except k is held in the stations and has no
multiported file containing the bits.
>
>
> > and Writes to the Index register cause
> > another LOOP to be initiated in <wherever the instructions are being
> > buffered::but most likely something like reservation stations.>
> >
> > The LOOP instruction does not need a branch target, that was already
> > supplied by the VEC instruction; all it needs is the Loop index register,
> > an increment, a comparison, and a termination.
>
> this is classic DSP-style zero-overhead looping. it's pretty effective,
> and completely does away with the need for branch prediction. in
> the ZOLC implementation it can even be nested, and there can be
> multi-variable condition codes causing jumps *between nesting levels*...
> all producing 100% full pipelines, no stalls, *no branch prediction at all*.

It also gets rid of overcommit of the loop--that is issuing the instructions
in loop a few more times than the loop is run, so there is less to backup when
the branch (n.e., LOOP) mispredicts--saving power, again.
>
> > Architecturally, there can be as many lanes as the designers want to
> > build (NEC SX5+) with the caution that the memory system needs to scale
> > proportionate to the calculation system.
>
> absolutely. it's been well-known for many many decades that the
> bottleneck is never the computational aspect of vector processors:
> it's the memory bandwidth.
>
> > If an enabled exception is raised, the vector calculations preceding the
> > exception are allowed to complete, and vector data is stored in the
> > scalar registers, exception control transfer is performed, and the
> > exception handled as if there were no vectorization at all.
>
> in SV, one of the design criteria was that it had to be realistically
> achievable for a single computer science student or a libre hardware
> engineer to implement in a reasonably short time-frame.
>
> when i first encountered RVV, the RISC-V Vectorisation proposal, i
> was very surprised to learn of the design criteria that, firstly,
> elements and the operations on them must be treated as being executed
> sequentially, and secondly, that if an exception occurred and there
> were elements further along (higher indices), the perfectly-good
> results had to be THROWN OUT.

Just the breaks of the game.
>
> six months later i began to understand the value of this design decision.
>
> the issue is that exceptions, such as virtual memory page table
> misses, have to be re-entrant. anything more complex than the above
> design criteria is just too complex to handle.
>
> so the state that's stored in SV (and i believe RVV as well) has to
> include the loop indices. in SV there are actually two indices (due
> to twin-predication on certain operations).
>
> there's separate state for M-Mode, S-Mode (supervisor mode) and U-Mode,
> allowing each to continue to use their own separate vectorisation
> to manage lower-privilege execution levels.

I should note there is no privilege in My 66000 architecture. No privileged
instructions, no privileged state, ... none.
>
>
> > In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
> > are quiet, waiting for the vector to complete (and getting FETCH out of the
> > way of vectors in memory.)
>
> interesting. what happens if the loop is say 2,000 to 4,000 instructions
> in length?

In the absence of exceptions or interrupts, the loop runs to completion and the
FETCH/DECODE/ISSUE units restart when the LOOP is done.
>
>
> > The memory system performs store-to-load forwarding analysis so that loops
> > such as:
> >
> > void looprecurance(size_t n, double a, const double x[])
> > {
> > for (size_t i = 0; i < n; i++) {
> > x[i] = a*x[i-10] + x[i-7];
> > }
> > }
> >
> > remain vectorized; one calculation simply becomes dependent on
> > the 7th and 10th previous calculations through a memory system.
>
> this is accidentally provided in SV as well :)

It better NOT be an accident!
>
> the CSR "register tagging" system actually allows redirection
> of the registers as well. so, at the assembly level the
> instruction says "ADD x3 x5 x7" however register x3 is redirected
> to *actually* point at x10, x5 is redirected to point at x12,
> and so on.
>
> in this way, it should be clear that if the loop length is
> say 10, then due to the hardware-macro-loop unrolling that ADD to
> multiple contiguous registers, they're *going* to overlap:
>
> ADD x3 x5 x7, VL=10, x3 redirects to x10, x5 redirects to x12:
>
> becomes:
>
> ADD x10 x12 x7
> ADD x11 x13 x8
> ADD x12 x14 x9 <- overlap already begins
> ADD x13 x15 x10
> ...

I just made the stations do this for me.
>
>
>
> > No special analysis is required
> > in the compiler to vectorize this loop, nor is any special
> > detection necessary to avoid vectorizing this loop. The basic
> > tenet is that if the scalar code can
> > be written in a simple loop that one can change the top and bottom
> > of the loop and presto-it is vectorized.
>
> exactly. it's pretty elegant, isn't it?
>
> > All of the advantages of a CRAY-like vector ISA, virtually no change to the
> > actual ISA (2 instructions), and precise exceptions too.
>
> one of the things that drives SV is the need to minimise the
> amount of modifications and maintenance to the existing
> RISCV toolchain (including LLVM and various back-ends such
> as clang, rustlang, and so on).

Not my problem. But I think RV has lost a few brownie points for not getting
the base ISA up to industrial quality levels first. Then you would not be
having the issues you are having adding vectors, predication,...
>
> by not adding any new instructions *at all* i can write unit tests
> purely in *standard* RISC-V assembler. the *behaviour* is different...
> and that's absolutely fine.

Note: My 66000 vector addition was exactly 2 instructions and ZERO thread
state (ZERO!) Thus the context switch part of the OS does not care if
vectors were used or not--vectors are INVISIBLE to the OS parts of the
hardware.

{Also note: My66000 performs context switch in hardware. A thread is
enabled (by the OS) and a processor picks it up by fetching 5 cache lines.
By the time these lines arrive, the register state, and the thread header
state have arrived and been installed. So we have an IP, Root pointer,...
already loaded ready to perform the instruction fetched at memory[IP++].
So too when the thread is disabled, the HW shoves the modified state back
whence it came. So the HW actually does the context switching...and in less
time than SW could conceive of doing it.}

MitchAlsup

Nov 5, 2018, 1:39:24 PM

Except that it is a straight vector implementation.

MitchAlsup

Nov 5, 2018, 1:41:57 PM

At 19 cycles for SIN or COS, these in HW MAY be faster than in memory (L2 hit).

No, I can use the L1 miss buffers directly for temporary vector storage
and avoid thrashing the L1 caches when processing vectors.

lk...@lkcl.net

Nov 5, 2018, 1:54:52 PM

On Monday, November 5, 2018 at 6:39:24 PM UTC, MitchAlsup wrote:

> > that sounds about on a par with the TI DSP series: it's a VLIW-like
> > architecture, where the inner loop fits into a single (14-word-long)
> > instruction. consequently they provide the same memory and
> > computational resources and achieve zero stalling.
> >
> > l.
>
> Except that it is a straight vector implementation.

yeah, which is an aspect that i like about vectorisation.
i haven't used TI's DSPs since 1992: at the time, the
compilers were pretty poor. to get full utilisation
they had to be programmed in hand-coded assembler.
maybe their proprietary compiler toolchain has improved,
it makes no odds: it's such a specialist embedded engine.
and also, TI's top DSPs are classified under US BXPA as a
"weapons-grade munition" *sigh*...

l.

MitchAlsup

Nov 5, 2018, 3:36:53 PM

On Monday, November 5, 2018 at 10:16:32 AM UTC-6, MitchAlsup wrote:
> On Monday, November 5, 2018 at 5:54:59 AM UTC-6, lk...@lkcl.net wrote:
> > over the past several months i've been developing an abstraction layer
> > for vectorisation and parallelisation of RISC-V. i've never encountered
> > anything remotely similar in any commercial processor design, hence
> > why it could seriously benefit from a review.
>
> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)

Consider the CRAY-1 with 8*64-entry-64-bit vector registers (4096 bytes
of 5 ported storage). The vector register file is larger than the area
of all the function units connected to the 512 64-bit registers in the
VRF.

Now consider a machine with reservation stations in front of each function
unit. As long as the RS storage is deeper than the pipeline depth of the
function unit, one can use the RS operand capture mechanism to stand in
the place of the vector registers and associated forwarding/chaining.

Getting rid of the VRF was simply an enabling device to slip the vector
extension into My 66000, and once thought through, it works perfectly
and seamlessly.

> b) no vector length register (nor a conceptual length of the vector)
> c) no vector control registers (no vector state, full vector perf)
> d) no vector condition codes (scalar predication covers this)

Getting rid of this stuff causes one to design the vector facility as if
one were building a LOOP around scalar code--which one actually is.

void conditional(size_t n, double a, const double x[], double y[])
{
    for (size_t i = 0; i < n; i++) {
        if( x[i] > 0.0 )
            y[i] = a*x[i] + y[i];
        else
            y[i] = x[i] + a*y[i];
    }
}

could be compiled as:

conditional:
    MOV   Ri,0
    CMP   Rt,Ri,Rn
    BGE   Rt,exit
top:  VEC   {VSI*2,V,VSVV,VSI,ISS}
    LDD   Rx,[Rxb+Ri<<3]
    LDD   Ry,[Ryb+Ri<<3]
    PGT   Rx,TF              // predicate on condition
    FMAC  Ry,Ra,Rx,Ry        // executed if Rx > 0
    FMAC  Ry,Rx,Ry,Rx        // executed otherwise (inc. NaNs)
    STD   Ry,[Ryb+Ri<<3]
    LOOP  Ri,1,Rn
exit:
    JMP   R0

Note: VERY IMPORTANT! Individual instructions do NOT waste bits supporting
predication! These bits are provided by predicate instructions (which are
degenerate branches in effect).

The PREDICATE instructions casts predicates over the subsequent
instructions--in this case the first instruction is in the THEN
clause (T) while the next instruction is in the ELSE (F) clause.
One or the other is issued, the instructions buffering mechanism
(reservation stations) can figure out whether to fire or not
eliminating the need to nullify a calculation that is not needed.

In the My 66000 ISA predicates can be combined to perform && and ||
branches as well as nested if-then-else stuff.

Bruce Hoult

Nov 5, 2018, 7:14:21 PM

On Monday, November 5, 2018 at 10:37:21 AM UTC-8, MitchAlsup wrote:
> On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> > one of the things that drives SV is the need to minimise the
> > amount of modifications and maintenance to the existing
> > RISCV toolchain (including LLVM and various back-ends such
> > as clang, rustlang, and so on).
>
> Not my problem. But I think RV has lost a few brownie point for not getting
> the base ISA up to industrial quality levels first. Then you would not be
> having the issues you are having adding vectors, predication,...

It's the classic bootstrapping problem with limited resources, isn't it?

What is the Minimum Viable Product? Is it better to ship first hardware in 2016 and hardware with vector support in 2019, or is it better to ship a finished (ha ha) thing in 2019?

How will you support yourself financially for those extra three years? How will you get buzz going so that well before the hardware with vector support ships you're already a 1st class citizen in binutils, gcc, glibc, newlib, gdb, llvm (March 2019 release), already have multiple Linux distributions with most packages ported, and BSDs, and several RTOSes.

I tend to the view that it's better to ship and generate buzz as soon as you have a useful product. SiFive has *plenty* of customers who want the smallest and lowest power 32 bit chip possible with just the basic integer instructions (not even multiply/divide), two stage pipeline, no cache, no branch prediction. And there are a lot of customers chomping at the bit for the 64 bit dual-issue in order processor (competitive with ARM A55 and A53) announced on October 31. There's no NEON equivalent and no GPU but many applications don't need either of those.

Of course, having those later will open up a whole new range of applications and customers. But, in the meantime, *revenue* and buzz and software support.


> > by not adding any new instructions *at all* i can write unit tests
> > purely in *standard* RISC-V assembler. the *behaviour* is different...
> > and that's absolutely fine.
>
> Note: My 66000 vector addition was exactly 2 instructions and ZERO thread
> state (ZERO!) Thus the context switch part of the OS does not care if
> vectors were used or not--vectors are INVISIBLE to the OS parts of the
> hardware.

For sure it's very nifty and worth studying. I think there's a lot to be said for a pure data streaming model without huge vector registers, especially on bigger designs that have OoO resources already.

One of the design goals of the official RISC-V Vector ISA is that it is microarchitecture-agnostic. You can implement it with basically a traditional SIMD hardware approach and the software doesn't notice, and runs fine. You can implement it with a basically streaming OoO data flow approach and the software doesn't notice, and runs fine. The streaming approach doesn't *need* the SETVL instruction and mechanism, but it also doesn't hurt it -- the hardware could just report that it supports infinitely long (or at least very long) vectors without actually having backing store for that. On the other hand, a traditional SIMD micro-architecture can just always report the same vector length. It *does* have to support some way to mask off the extra lanes in the vector tail, but that's pretty trivial.

Bruce Hoult

Nov 5, 2018, 7:29:06 PM

On Monday, November 5, 2018 at 12:36:53 PM UTC-8, MitchAlsup wrote:
> Note: VERY IMPORTANT! Individual instructions do NOT waste bits supporting
> predication! These bits are provided by predicate instructions (which are
> degenerate branches in effect).

I agree this is a good thing to do.

I worry that it is probably protected by patents. ARM does it with the "IT" instruction in Thumb2 (2003) which casts Then or Else predicates over up to the following four instructions. Does anyone else do it?

Also, ARM has backed away from actually using it. In the latest ARMv8 version all uses of IT are deprecated except those which control a single instruction, and that single instruction must be a 16 bit instruction.

In other words, in future they will be supporting only an IT and a following instruction that together can be treated as a pseudo-32 bit instruction.

jim.bra...@ieee.org

Nov 5, 2018, 8:20:58 PM

Perhaps off topic

Would like to introduce an alternative view:
Reserve n registers in the register file as a "barrel"
Vector results go into the barrel one after another.
"n" is large enough to hold all the intermediate results for one loop iteration.

So announcing the barrel with a special instruction starts the loop
And a repeat instruction is at end of loop.

The intermediate results don't actually have to reside in the barrel; they could instead sit, say, in function unit output shift registers as in the Mill. Or the barrel is just a compiler sleight-of-hand and an exception mechanism.

Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time. The problem of partially filled registers at the end of the loop can be addressed generically?
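
(A toy sketch of the barrel indexing - register slots rotating by one
iteration's worth of results each time around the loop; the names and
the rotation rule here are only my reading of the idea:)

    /* toy model: architectural result r of loop iteration i lands in a
       physical barrel slot that advances every iteration and wraps. */
    #define BARREL_N 16   /* illustrative size */

    static unsigned barrel_slot(unsigned r, unsigned iteration,
                                unsigned results_per_iter)
    {
        return (r + iteration * results_per_iter) % BARREL_N;
    }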

MitchAlsup

Nov 5, 2018, 8:34:53 PM

In effect, this is what my virtual vector registers are, except I cast them
as operand storage associated with reservation stations. Each operand in
an RS maintains a loop counter so it can wait for more than one loop result
one per loop.

I wanted the mechanism to smell just like CRAY-style vectors for ease of
compiler support, but I did not want to pay for the VRF storage area
or power.
>
> Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time.

> The problem of partially filled registers at the end of the loop can be addressed generically?

But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
processed. The MMX model only works when the data is packed in memory, and
only when all sub-locations want to be processed similarly (with a couple of
exceptions.)

Stephen Fuld

Nov 5, 2018, 9:00:16 PM

This has the advantage of providing a place to hold "partial values" for
loops that e.g. sum the elements of a vector, do a dot product, or even
compute a checksum.

But it has the significant disadvantage of losing one of the nicest
features of Mitch's design: the code is identical independent of the
number of functional units, i.e., no need to specify the number of
registers to use for the "barrel", and even potentially allows for an
implementation with more functional units than registers.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

jim.bra...@ieee.org

Nov 5, 2018, 9:50:10 PM

Need to plead inexperience WRT gate counts for all the different approaches.
Associating a loop count with each operand solves the alignment over multiple loops. And Mitch's design has a lower register count than CRAY-1.

When I see input reservation stations I think data flow. And wonder how it gets setup and torn down? Presumably no branches outside of the loop nor any subroutine calls within the loop?

Am thinking any of the approaches needs to be opportunistic (grab low
hanging fruit) when short operands are adjacent in memory.

IMHO it becomes a question of where to place the "crossbar" and whether it can be time multiplexed to reduce the number of write ports?

MitchAlsup

Nov 5, 2018, 10:06:15 PM

It has the same register count as a scalar design, but with vector perf.

Literally, there was no increase in thread state due to having vector
capabilities, and as far as ISA goes, there are exactly 2 instructions
added to to ISA.

>
> When I see input reservation stations I think data flow. And wonder how it gets setup and torn down? Presumably no branches outside of the loop nor any subroutine calls within the loop?

Reservation stations get set up as instructions are issued on the first pass.
Each operand in each station entry is marked Scalar, Vector, or Indexed (by
the loop count).
Scalar operands are read from the RF just like scalar instructions.
Scalar operands are captured once and remain resident over multiple firings.
As vector operands arrive, the RS sees that it has all the needed operands
and fires the instruction off to the function unit. Vector operands are
used once per loop per station entry.
Index registers run the loop overhead, and each time they calculate, their
result enables the next loop for the station vector operand capture and fire.

When the LOOP is satisfied, the stations self clear as if they were scalar
entries and the loop is done.
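
(A software caricature of one station operand, just to make the capture
rules concrete - this is not how the hardware is built:)

    /* caricature of one reservation-station operand: Scalar operands are
       captured once, Vector operands are consumed once per loop pass,
       Index operands advance when the LOOP instruction fires. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef enum { OP_SCALAR, OP_VECTOR, OP_INDEX } op_kind;

    typedef struct {
        op_kind  kind;
        bool     valid;       /* value present for the current pass   */
        uint32_t loop_count;  /* which iteration the value belongs to */
        uint64_t value;
    } rs_operand;

    static void loop_fired(rs_operand *op, uint32_t k)
    {
        switch (op->kind) {
        case OP_SCALAR:                            /* captured once    */
            break;
        case OP_VECTOR:                            /* wait for pass k  */
            op->valid = false;
            op->loop_count = k;
            break;
        case OP_INDEX:                             /* advances with k  */
            op->loop_count = k;
            break;
        }
    }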

As far as scope, I would like to be able to vectorize great big loops
such as FPPPP, but that is simply impossible; as the loop is 8 KBytes
of instructions long (2K instructions) and bigger than any conceivable
set of RS entries.

Looking smaller, I demanded of the mechanism to handle radix 2 FFT
(20 instructions plus loop overhead.) So, somewhere in the 16-64
instruction range is the scale of vectorizing this mechanism is
optimized for.

lk...@lkcl.net

Nov 5, 2018, 10:42:34 PM

On Tuesday, November 6, 2018 at 3:06:15 AM UTC, MitchAlsup wrote:

> As far as scope, I would like to be able to vectorize great big loops
> such as FPPPP, but that is simply impossible; as the loop is 8 KBytes
> of instructions long (2K instructions) and bigger than any conceivable
> set of RS entries.

i love the idea of vectorisation being opaque: i am nervous about
the impact the idea has on both microarchitectures (restricting
implementors to not being able to use certain microarchitectures
as the base) and on its understandability.

> Looking smaller, I demanded of the mechanism to handle radix 2 FFT
> (20 instructions plus loop overhead.) So, somewhere in the 16-64
> instruction range is the scale of vectorizing this mechanism is
> optimized for.

ok, so this would be where RVV and SV have a distinct advantage, as:

* in RVV the vector register file is separate, and vector registers
may therefore be passed as parameters to functions at an unrestricted
stack depth

* in SV, registers may be "tagged" and left that way, at the
programmer's discretion, such that the same effect is
achieved.

so both have no such limit on the number of instructions that
may be executed using vectorised registers, nor on the stack
depth.

this does mean that it is necessary to save/restore registers
and associated state on entry/exit to function calls (and
on context-switches). this being nothing new and done all
the time in scalar programs, the only question being how
much in terms of power/computing-time resources it takes to
do so.

l.

lk...@lkcl.net

Nov 5, 2018, 10:50:21 PM

On Tuesday, November 6, 2018 at 2:50:10 AM UTC, jim.bra...@ieee.org wrote:

> When I see input reservation stations I think data flow.
> And wonder how it gets setup and torn down? Presumably no
> branches outside of the loop nor any subroutine calls within
> the loop?

no, that's correct, on both counts.

(1) standard simple zero-overhead loops just cannot cope with
anything other than the simplest for-loops. they're
designed specifically for algorithms like FFTs with fixed
loops. this is where things like ZOLC come in:

https://arxiv.org/pdf/0710.4632

(2) mitch answers in a follow-up about the limitations on
using RS. it's a leaf-node design concept, where SV
(and RVV) are unlimited length and depth on function
calls.

l.

lk...@lkcl.net

Nov 5, 2018, 11:11:57 PM

On Tuesday, November 6, 2018 at 1:20:58 AM UTC, jim.bra...@ieee.org wrote:

> Perhaps off topic

not at all: this is a thread for reviewing SimpleV

so as such, comparative and alternative vectorisation and
parallelism concepts are all up for consideration, so as to be
able to see what would work.

> Would like to introduce an alternative view:
> Reserve n registers in the register file as a "barrel"
> Vector results go into the barrel one after another.

this is pretty much exactly what SV is based on. i did not like
the idea of having a separate register file as in a traditional
CRAY-style vector architecture, so i went, "ok, so let's see
what kind of mileage we can get out of using the *standard*
integer and floating-point register files".

> "n" is large enough to hold all the intermediate results for
> one loop iteration.

the next logical conceptual progression is to just think of
the loop as just "using the standard register file(s)".

> So announcing the barrel with a special instruction starts the loop
> And a repeat instruction is at end of loop.

why not just have a standard scalar zero-overhead loop mechanism,
which, when you happen to introduce vectorisation, may be used
there as well?

in that way, the zero-overhead loop mechanism may be used to
optimise scalar code as well?

the only tricky bit is that i believe the loop setup and usage
does have to be slightly different for vector and scalar: i
haven't investigated this fully as i have been focussing on
the parallelism.

> Makes it easier for my mind to comprehend what is going on:
> Every result has a place.

yes. using hidden machine state has me concerned for this
reason. if it's *actually* registers - real registers -
microarchitectural implementations can get a concrete handle
on where the data resides, and may decide to drop in OoO,
superscaling, register renaming, lanes, lane-crossing
crossbars, and, my favourite i'd really really like to
implement: L1/L2 SMP style coherent caches on the register
file.

> Use of wider registers and multi-media instructions
> is just optimization of resources done either at compile
> time or run time.

mmm be veerry waary of the seductive powerrrr of SIMDeeee...
https://www.youtube.com/watch?v=fZY8jUuEzJQ

SIMD i view as a de-optimisation:
https://www.sigarch.org/simd-instructions-considered-harmful/

the loops are less efficient (far more instructions than
variable-length vectorisation), and the corner-cases are
absolutely dreadful: often longer than the actual loop,
and requiring a power-of-two cascade of decreasing-size
SIMD instructions plus branches or loops to deal with.
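
to make the corner-case cascade concrete, here is the usual shape of it
in plain c (scalar statements standing in for the SIMD instructions,
widths picked arbitrarily):

    /* the classic SIMD tail problem: a 4-wide main loop needs a cascade
       of progressively narrower cases to finish the last 0..3 elements */
    void add_arrays(int n, const float *a, const float *b, float *out)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {        /* 4-wide "SIMD" body   */
            out[i+0] = a[i+0] + b[i+0];
            out[i+1] = a[i+1] + b[i+1];
            out[i+2] = a[i+2] + b[i+2];
            out[i+3] = a[i+3] + b[i+3];
        }
        if (i + 2 <= n) {                   /* 2-wide corner case   */
            out[i+0] = a[i+0] + b[i+0];
            out[i+1] = a[i+1] + b[i+1];
            i += 2;
        }
        if (i < n)                          /* final scalar element */
            out[i] = a[i] + b[i];
    }

with variable-length vectors the whole thing is a single loop: set VL to
min(remaining, MAXVL), issue one vector add, repeat.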

> The problem of partially filled registers at the end of
> the loop can be addressed generically?

in SV and RVV (and vectorisation systems in general),
i believe the answer is yes.

i have not encountered however any variable-length vectorisation
systems that also have zero-overhead loops, although i am
seriously considering dropping a ZOLC implementation into
the Libre RISCV GPGPU.

l.

Bruce Hoult

Nov 5, 2018, 11:21:08 PM

On Monday, November 5, 2018 at 7:42:34 PM UTC-8, lk...@lkcl.net wrote:
> ok, so this would be where RVV and SV have a distinct advantage, as:
>
> * in RVV the vector register file is separate and therefore
> may be passed as parameters to functions at an unrestricted
> stack depth

I don't believe it's a useful or desirable thing in RVV to call functions that execute vector instructions while in the middle of a vector processing loop. Any given called function can only usefully work with one configuration (out of the thousand or so possible) of the vector registers. If the called function configures the vector registers then all current contents are lost. Saving and restoring that state is a ridiculous thing to do -- it's potentially kilobytes.

There may be certain special and specialised runtime functions that can be called from inside a vector processing loop. Transcendental functions, for example. They'll be hand-written by specialists.

But in general, any vector processing function that you want to call from inside a vector processing loop should be inlined.

There will, I believe, be no such thing in the ABI as function arguments or results with vector-register values.

There can of course be user-level vector types or classes that are argument to functions. That's a totally different matter.

lk...@lkcl.net

Nov 5, 2018, 11:52:00 PM

On Tuesday, November 6, 2018 at 4:21:08 AM UTC, Bruce Hoult wrote:
> On Monday, November 5, 2018 at 7:42:34 PM UTC-8, lk...@lkcl.net wrote:
> > ok, so this would be where RVV and SV have a distinct advantage, as:
> >
> > * in RVV the vector register file is separate and therefore
> > may be passed as parameters to functions at an unrestricted
> > stack depth
>
> I don't believe it's a useful or desirable thing in RVV to
> call functions that execute vector instructions while in the
> middle of a vector processing loop.

i am beginning to understand that RVV may not have been designed
with such a goal in mind.

> Any given called function can only usefully work with one
> configuration (out of the thousand or so possible) of the
> vector registers. If the called function configures the vector
> registers then all current contents are lost.

this is why, in SV, i provide a mechanism to "subset" the
vector CSR tables. with up to 16 entries in the table,
it would become possible to treat the table as a mini
stack. only when the "stack" depth was to be exceeded
would the state need to be switched out to a real
memory-based stack.

oo i just had an idea: when the CSR bank address is
altered to subset the CSRs, move the meaning of
CSR0 to point to the new "start" of the bank.

that way (until you run out of "stack", that is),
you get the benefits of a real stack: the callee
doesn't need to know anything about the function
that called it.

hmmm...

> Saving and restoring that state is a ridiculous thing to do
> -- it's potentially kilobytes.

in SV it's... i think a maximum of 76 bytes for the
CSR state. that would require all 16 entries to have been used
up.

vectorised-registers may also be passed as parameters to function
calls: in SV, all that means is: you're passing a contiguous
*block* of registers to a function, you just happen to be
using one single (scalar) register to do so.

hmmm... i believe that if i also add in a means to shuffle
(move) the CSRs up and down the table, the amount of state
required to be saved/restored on each function call entry
and exit is greatly reduced.


> There may be certain special and specialised runtime functions that can be called from inside a vector processing loop. Transcendental functions, for example. They'll be hand-written by specialists.
>
> But in general, any vector processing function that you want to call from inside a vector processing loop should be inlined.
>
> There will, I believe, be no such thing in the ABI as function arguments or results with vector-register values.

all of these are based, i believe, on deliberate conscious design
choices to limit RVV's scope.

whilst it may take considerable time - potentially years and *definitely*
man-years - to get the infrastructure into place in compiler toolchains,
i'm planning for SV to make it perfectly reasonable to use nested
arbitrary function call depths: leveraging *existing* scalar compiler
infrastructure conventions for temporary, caller and callee register
naming does not seem unreasonable as long as there is
a way to push/pop the vector config state.

of course, for the 3D application that's likely to go completely out
the window and require custom assembly, for many years, if not permanently.

i still can't quite get over the fact that even for a
lowly "embedded" 3D GPU which is only going to be doing around 5GFLOPS,
we had to extend the FP and INT regfiles to a whopping 128 64-bit entries
in order to avoid the [extremely significant] power penalty of pushing
large amounts of vector data through the L1/L2 cache.

l.

Ivan Godard

unread,
Nov 6, 2018, 12:04:01 AM11/6/18
to
On 11/5/2018 5:20 PM, jim.bra...@ieee.org wrote:

> Perhaps off topic
>
> Would like to introduce an alternative view:
> Reserve n registers in the register file as a "barrel"
> Vector results go into the barrel one after another.
> "n" is large enough to hold all the intermediate results for one loop iteration.
>
> So announcing the barrel with a special instruction starts the loop
> And a repeat instruction is at end of loop.
>
> The intermediate results don't actually have to reside in the barrel, instead, say, in function unit output shift registers as in the Mill. Or the barrel is just a compiler sleight-of-hand and an exception mechanism.
>
> Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time. The problem of partially filled registers at the end of the loop can be addressed generically?
>

Didn't the Itanium do this? Of course, it's off-patent now.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:08:33 AM11/6/18
to
On Tuesday, November 6, 2018 at 1:34:53 AM UTC, MitchAlsup wrote:

> > The problem of partially filled registers at the end of the loop can be addressed generically?
>
> But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
> processed. The MMX model only works when the data is packed in memory, and
> only when all sub-locations want to be processed similarly (with a couple of
> exceptions.)

i took one look at variable-length vectors and went, "*this* makes sense".
SIMD is genuinely insane. it's an O(N^6) - N to the power of SIX - opcode
proliferation! no wonder x86 and ARC have several thousand instructions!

RVV is O(N) opcode proliferation. 66000 is O(1), and SV is also O(1).

the nice thing about SV is that the implementation is a bit like
"pimp my ride" (i've just discovered Fast'n'Loud, sorry...) in that
you simply decouple the decode and execution phases, and drop in a
hardware-macro-loop-unroller algorithm. here's the pseudocode for
ADD (yes, it covers scalar as well as vector):

 for (i = 0; i < VL; i++)
    if (predval & 1<<i)  # predication uses intregs
       ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
                             ireg[rs2+remap(irs2)];
       if (!int_vec[rd ].isvector) break;
    if (int_vec[rd ].isvector)  { id   += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }

note that the loop is terminated early if the destination
is a scalar, but *only* on the first bit set in the
predication.

this allows V/S + S/V -> S to be allowed and all
covered by the same pipeline stage.

it also allows even fully scalar operations to be
predicated.
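
(as a toy illustration only -- python, with remap left out / treated
as the identity, and the register-file contents and vector tags
invented here -- the same macro-loop can be modelled like this:)

ireg = list(range(32))                    # toy integer register file
int_vec = {3: True, 10: True, 20: True}   # regs "tagged" as vectors

def sv_add(rd, rs1, rs2, VL, predval):
    id = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):            # predication uses intregs
            ireg[rd + id] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
            if not int_vec.get(rd, False):
                break                     # scalar dest: stop at first active element
        if int_vec.get(rd, False):  id   += 1
        if int_vec.get(rs1, False): irs1 += 1
        if int_vec.get(rs2, False): irs2 += 1

sv_add(rd=20, rs1=3, rs2=10, VL=4, predval=0b1111)
print(ireg[20:24])                        # [13, 15, 17, 19]: four adds from one opcode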

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:13:16 AM11/6/18
to
On Tuesday, November 6, 2018 at 4:21:08 AM UTC, Bruce Hoult wrote:

> But in general, any vector processing function that you want
> to call from inside a vector processing loop should be inlined.

i forgot to mention that SV is designed to augment scalar
operation as much as it is vector operation. it extends
the reach of the register file - *without modifying the
ISA* - to up to 128 integer and 128 floating-point entries
*including* for Compressed instructions, *and* allows
predication on standard *scalar* operations as well.

there's even mention somewhere in the standard RV spec
of the idea of future extensions augmenting the
register file from the standard 32 entries.

so for SV, the idea of being able to do nested arbitrary
function call depth, and still carry even just the
*scalar* register-redirection capability and the
*scalar* predication capability, makes an awful lot
of sense.

l.

James Van Buskirk

unread,
Nov 6, 2018, 12:17:11 AM11/6/18
to
wrote in message
news:a3bc5db9-ad53-4440...@googlegroups.com...

> i *believe* it may be possible to apply a simple
> re-shaping to the complex numbers, here, to treat
> them as a 2D array. where one would normally see
> array indices of:

> 0 1 2 3 4 5

> instead the "reshaper" may map these to:

> 0 3 1 4 2 5

> thus allowing, i believe, the real and imaginary parts to be
> placed into the same sequential array, yet a seemingly
> "sequential" set of FMUL and FADDs does the right thing.

Remapping the inputs is simply wrongheaded for an FFT.
The way to go is to carry out separate real-half-complex
FFTs on the real and imaginary parts of the input
and then combine results in a final O(N) 'zip-up' step.

Since you are doing the same stuff to the real and imaginary
parts and no complex arithmetic, you can do whatever
equivalent you have of VBROADCAST to the constants
and just perform identical operations on adjacent data
up to the zip-up.

Besides, the radix-2 butterfly algorithm is very inefficient
and a crappy test case. If you wanna make FFTs more
performant, predict triangular loops in your hardware and
find some place in your ISA to store all the fanout while your
algorithm covers floating point latency.

lk...@lkcl.net

unread,
Nov 6, 2018, 12:51:35 AM11/6/18
to
On Tuesday, November 6, 2018 at 5:17:11 AM UTC, James Van Buskirk wrote:
> wrote in message

> > thus allowing, i believe, the real and imaginary parts to be
> > placed into the same sequential array, yet a seemingly
> > "sequential" set of FMUL and FADDs does the right thing.
>
> Remapping the inputs is simply wrongheaded for an FFT.

thanks for the insight, james. i'm definitely not
an algorithms expert: i have to use comprehensive
unit tests, and even then, asking me to explain the
resultant code that *i wrote* is... yeah :)

primarily i wanted to introduce the REMAP concept that
needed to be introduced in SV, for optimal efficiency
when used as a 3D GPU.

the Vulkan API requires 4x3 matrices of parameters
(quantity several) to be processed alongside other
data, including carrying out matrix multiplication.

without the REMAP concept the entire set of values
would need to be pushed out through the L1 (and
possibly L2) caches and back again: the power consumption
to do so is unacceptable.

by providing a way to remap the linear elements in
an arbitrary dynamic fashion, we may leave the matrix
in-place and still do the required "transposition" of
Y and X which are necessary for matrix multiplication.
the general algorithm (in python) is as follows:

(xdim, ydim, zdim) = (3, 2, 1)
lims = [xdim, ydim, zdim]
idxs = [0,0,0]
order = [1,0,2]

for idx in range(xdim * ydim * zdim):
    new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
    print new_idx,
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        print
        idxs[order[i]] = 0
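
(for the record: hand-tracing the loop above with dims (3,2,1) and
order [1,0,2] prints "0 3", "1 4", "2 5" -- i.e. exactly the
0 3 1 4 2 5 remapping described further up the thread.)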

Vulkan not only requires matrix multiplication, it also
requires processing of multiple simultaneous X,Y,Z vectors:
QTY 4 at a time. in certain circumstances it is necessary
to access and process all of the Xs, all of the Ys, all of
the Zs.

at other times, in the same loop, it is necessary to process
the 4 vectors as X,Y,Z: in particular, for normalisation of
vectors, you want to know the distance (length), which is:

sqrt(X^2+Y^2+Z^2)

here, again, we can avoid disrupting the data structures
by using REMAP of the contiguous data.
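
(to make that concrete: a hand-traced, purely illustrative variant of
the python loop above, with lims/order values chosen by me to walk
four contiguous xyz vectors column-wise, in-place:)

# four xyz vectors stored contiguously: x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3
(xdim, ydim, zdim) = (3, 4, 1)   # 3 components, 4 vectors
lims  = [xdim, ydim, zdim]
idxs  = [0, 0, 0]
order = [1, 0, 2]                # walk the "vector" axis first
seq = []
for idx in range(xdim * ydim * zdim):
    seq.append(idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
    for i in range(3):
        idxs[order[i]] += 1
        if idxs[order[i]] != lims[order[i]]:
            break
        idxs[order[i]] = 0
print(seq)   # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]: all Xs, all Ys, all Zs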

it's really powerful and absolutely necessary, if we want
to meet the power budget. we simply cannot move the
data out to memory and back again: Jeff Bush, of Nyuzi,
highlighted that moving data through the L1/L2 cache is
a whopping 50% of the power budget in software-rendered
GPUs.

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 1:02:23 AM11/6/18
to
On Monday, November 5, 2018 at 8:36:53 PM UTC, MitchAlsup wrote:

> Consider the CRAY-1 with 8*64-entry-64-bit vector registers (4096 bytes
> of 5 ported storage). The vector register file is larger than the area
> of all the function units connected to the 512 64-bit registers in the
> VRF.

i was genuinely taken aback at the fact that we will need 128 64-bit
FP registers to have an optimally efficient Vulkan implementation.
in for a penny, in for a pound, we decided that a VPU is likely to
need 128 64-bit integer registers as well.

what we then have is a potential bit of a problem with context-switching:
a whopping 2048 bytes!

so as it's all down to conventions, what we decided to do (and this
is in no way a hardware limitation), is simply *ignore* the top
registers numbered 32-127 on context-switching, and to have a
software requirement where each 3D GPU thread is force-pinned to
one and only one SMP core.

bear in mind that the first version of Kazan3D will be
inline assembler for the inner loops.

in this way, no modifications of pre-existing ABIs, compilers etc.
are needed. the bottom "standard" registers 1-31 may be context
switched "as usual". the linux kernel won't even know that
registers 32-127 are there, and won't touch them either.

it's a Monster Hack that we can clean up later, whilst still
"Getting Product Out The Door (tm)".

> In the My 66000 ISA predicates can be combined to perform && an ||
> branches as well as nested if-the-else stuff.

nice. i mentioned in a couple other posts that SV introduces
scalar branch predication as well. one of the CSR bits for
the register/predication tables is "this register is still scalar".

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 2:31:15 AM11/6/18
to
On Tuesday, November 6, 2018 at 12:14:21 AM UTC, Bruce Hoult wrote:
> On Monday, November 5, 2018 at 10:37:21 AM UTC-8, MitchAlsup wrote:
> > On Monday, November 5, 2018 at 12:03:12 PM UTC-6, lk...@lkcl.net wrote:
> > > one of the things that drives SV is the need to minimise the
> > > amount of modifications and maintenance to the existing
> > > RISCV toolchain (including LLVM and various back-ends such
> > > as clang, rustlang, and so on).
> >
> > Not my problem. But I think RV has lost a few brownie point for not getting
> > the base ISA up to industrial quality levels first. Then you would not be
> > having the issues you are having adding vectors, predication,...
>
> It's the classic bootstrapping problem with limited resources, isn't it?

definitely. the modular ISA flexibility of RV has allowed useful
implementations to be made in as little as 2,500 gates (i believe).
look up PicoRV32. it was used to control multi-megawatt electro
magnets in a particle accelerator because the

india is going to be replacing 40-year-old 68k processors in their
fast nuclear breeder reactors with RV64GC home-grown processors,
fabbed on india-only 180nm foundries for obvious national security
reasons.

university students and libre hardware enthusiasts can and have
created implementations *WITHOUT* needing to do the compiler,
assembler or unit tests, as all of that has already been done.

*later* they or someone else can build on that and add extra
functionality.

> What is the Minimum Viable Product? Is it better to ship
> first hardware in 2016 and hardware with vector support in 2019,
> or is it better to ship a finished (ha ha) thing in 2019?

exactly. i'd not in the *slightest* bit be able to remotely
consider, "oh i think i will start a Libre 3D GPU project,
today", without all that hard work having been already been
done.

i don't like the ITU-style closed doors process under which
RISC-V is developed: that has absolutely nothing to do with
the fact that i fully recognise and deeply appreciate the
amount of effort that's gone into RISC-V by the Founders.

> There's no NEON equivalent and no GPU

i put in a proposal to the sifive "democratising ideas" page
that closed last month, to address that.


> Of course, having those later will open up a whole new
> range of applications and customers.

absolutely. one of the "selling points" of ARM is the fact
that a "proven" MALI GPU is available out-of-the-box. they
even provide it free of charge to licensees of the standard
ARM cores.

> But, in the meantime, *revenue* and buzz and software support.

which funds incremental improvements, in manageable chunks.

the only issue i have with RVV is: even in its base form i
consider it to be too heavy-weight for simple verifiable
implementation by small (and one-man) teams in a reasonable
timeframe.

it's adding not just extra opcodes, it's adding an entire
new pipeline, plus features that have not been seen before
in the base RV which, if you *don't* have those features
*and* the new opcodes *and* the new pipeline, you're automatically
non-compliant with the spec...

... behind an opaque closed-doors ITU-style development process.

it's just far too risky a proposition for a small team, isolated
outside of the ITU-style development process, to consider taking on.

this heavily influenced the design of SV. the absolute minimum
compliance is to provide the CSRs, and to trigger a trap handler
whenever a "tagged" register is encountered. in this way, it
is perfectly reasonable to consider deploying SV even on top of
the greatly-reduced RV32E (only 16 32-bit integer registers).

now, whether the traps may be efficiently implemented in RV32E
software, vs providing a hardware-macro-loop instead, remains
to be seen. ROM-based software-emulation of the macro-loop
may turn out to be quite efficient: we just have to see; it's
not my primary focus.

the next level(s) include permitting implementors to provide
a mixture of partial and hybrid parallelism and software-traps.
so, if an implementor does not want (or need) non-default
element widths for their application, *great*, they can punt that
to software-emulation.

i considered this to be a really, really important aspect of
SV, to make the implementation barrier to entry as low and as
flexible as possible.

if implementors find that the decision to punt to software
emulation does not provide them the anticipated performance,
they can always ramp it up...

... but here's the thing: the software-emulation will allow
them to complete and pass *all* of the unit tests, 100% of
the time, *even* as they're developing and optimising the
hardware, all in an incremental fashion.

compare and contrast this to RVV where it's "all-or-nothing",
in one of the most complex architectural areas of computer
science: parallelisation / vectorisation.


> For sure it's very nifty and worth studying. I think there's a
> lot to be said for a pure data streaming model without huge vector
> registers, especially on bigger designs that have OoO resources already.

... with the caveat (i believe) that the data streaming model is
critically dependent *on* having OoO resources.

oh. i remember now. wasn't there someone who did a full data-streaming
(scalar) ISA? i have this vague recollection last year of seeing a
post on RV isa-dev about someone who had made a full ISA which didn't
actually have any registers at all. data was passed entirely from
LOAD to STORE through ALU chains.


> One of the design goals of the official RISC-V Vector ISA
> is that it is microarchitecture-agnostic. You can implement it
> with basically a traditional SIMD hardware approach and the
> software doesn't notice, and runs fine. You can implement
> it with a basically streaming OoO data flow approach and the
> software doesn't notice, and runs fine.

whilst that's true, there's a performance penalty built-in
to RVV for all and any non-OoO (non-register-renaming)
microarchitectures, related to the zeroing on predication.

in a traditional incredibly simple predicated DSP-style SIMD
architecture, with no register-renaming, non-predicated elements
save resources: a zero bit indicates that the ALU for that lane
(if there are lanes) is simply... completely switched off.

no inputs, no outputs. saves on reads, saves on writes.

when the specification makes it *mandatory* that non-predicated
(zero-bit) elements be zero'd, such simple microarchitectures
may no longer fully optimise on power saving: they're *forced*
into a position of writing out a zero to the register file.

in SV i took a different perspective: i noticed in the discussions
that there were benefits to both approaches, so i permitted
both, dynamically, as part of the predication CSR tables.

in this way it becomes up to the compiler and assembly writers
to decide the best approach on a case by case basis.

in addition, in a non-lanes-based scalar multi-issue microarchitecture,
non-zeroing predication means that the hardware-macro-loop phase of
SV may issue multiple instructions to the underlying multi-issue
execution units from the one instruction, *not* issuing instructions
to the execution units where the predication bit is zero.

a 4-issue standard microarchitecture could therefore really efficiently
and quickly be turned into an SV Monster, as the infrastructure to
do so is already there.

the advantage of such an approach: 100% execution throughput even when
there is a significant percentage of predication going on.


> On the other hand, a traditional SIMD micro-architecture can just
> always report the same vector length. It *does* have to support some
> way to mask off the extra lanes in the vector tail, but that's
> pretty trivial.

yes, and it's a really nice aspect of vectorisation (as long as
the SIMD architecture has predication). at the end of the loop
all you do is: hard-wire the last bits of predication to zero.

it does mean that you don't get 100% ALU utilisation, however
there's a way to mitigate that: instead of having massively-long
8-, 16- or 32-wide SIMD ALUs, cut them in half: go the other way,
and do multi-issue instead.

i like vectorisation, you can probably tell :)

l.

Bruce Hoult

unread,
Nov 6, 2018, 3:18:33 AM11/6/18
to
On Monday, November 5, 2018 at 11:31:15 PM UTC-8, lk...@lkcl.net wrote:
> > One of the design goals of the official RISC-V Vector ISA
> > is that it is microarchitecture-agnostic. You can implement it
> > with basically a traditional SIMD hardware approach and the
> > software doesn't notice, and runs fine. You can implement
> > it with a basically streaming OoO data flow approach and the
> > software doesn't notice, and runs fine.
>
> whilst that's true, there's a performance penalty built-in
> to RVV for all and any non-OoO (non-register-renaming)
> microarchitectures, related to the zeroing on predication.
>
> in a traditional incredibly simple predicated DSP-style SIMD
> architecture, with no register-renaming, non-predicated elements
> save resources: a zero bit indicates that the ALU for that lane
> (if there are lanes) is simply... completely switched off.
>
> no inputs, no outputs. saves on reads, saves on writes.
>
> when the specification makes it *mandatory* that non-predicated
> (zero-bit) elements be zero'd, such simple microarchitectures
> may no longer fully optimise on power saving: they're *forced*
> into a position of writing out a zero to the register file.

That's not actually true.

You could also implement this by have a 1-bit per lane bitmap of which lanes have been predicated to zero. You don't actually have to write zeros to the register bits, only to the bitmap.

On reading the register you just use the bitmap to select either the register contents or else hardwired zeroes. That adds a 2:1 mux to the data path, but you probably have that anyway, to implement the zero register.
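
(A minimal sketch, in python with invented names, of that idea: the
register bits for a predicated-out lane are never touched, only the
per-lane bitmap, and the read path muxes in a hardwired zero:)

regfile = [7, 8, 9, 10]
zeroed  = [False] * 4           # 1-bit-per-lane "was zeroed" bitmap

def write_lane(lane, value, pred_bit):
    if pred_bit:
        regfile[lane] = value
        zeroed[lane] = False
    else:
        zeroed[lane] = True     # "writing zero" costs one bitmap bit, not a regfile write

def read_lane(lane):
    # the 2:1 mux: hardwired zero vs actual register contents
    return 0 if zeroed[lane] else regfile[lane]

write_lane(2, 99, pred_bit=0)
print([read_lane(i) for i in range(4)])   # [7, 8, 0, 10]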

lk...@lkcl.net

unread,
Nov 6, 2018, 5:03:03 AM11/6/18
to
On Tuesday, November 6, 2018 at 8:18:33 AM UTC, Bruce Hoult wrote:

> That's not actually true.
>
> You could also implement this by have a 1-bit per lane bitmap of which lanes have been predicated to zero. You don't actually have to write zeros to the register bits, only to the bitmap.
>
> On reading the register you just use the bitmap to select either the register contents or else hardwired zeroes. That adds a 2:1 mux to the data path, but you probably have that anyway, to implement the zero register.

interesting: i like it. and, i just realised that in the
spike-sv emulator i actually ended up implementing zeroing
in READ_REG as pretty much exactly this (returning x0
on zero-predication): what hadn't occurred to me was the
next step, to actually have the zero as a bitfield
associated with the register file.

neat trick.

l.

lk...@lkcl.net

unread,
Nov 6, 2018, 11:31:25 AM11/6/18
to
On Tuesday, November 6, 2018 at 7:31:15 AM UTC, lk...@lkcl.net wrote:

> > It's the classic bootstrapping problem with limited resources, isn't it?
>
> definitely. the modular ISA flexibility of RV has allowed useful
> implementations to be made in as little as 2,500 gates (i believe).
> look up PicoRV32. it was used to control multi-megawatt electro
> magnets in a particle accelerator because the

... sentence didn't get finished :)

the group had contacted a formal verification company, the conversation
went like this:

"hi we need a formally-verified core, for a mission-critical safety
application involving gigawatts of power, hard radiation going in
the wrong direction and damaging people and stuff, y'ken"

"sure, we can help! our proprietary processor has been formally
verified and used by our customers for 20 years"

"ok, that's great. so, well, we'd like to make sure, by checking
that for ourselves"

"oh nooo, we can't possibly provide you with the souuurce cooode,
it's our proprietary secret, y'ken"

"so who double-checks that your design is correct?"

"oh, we do"

"and who double-checks that the formal verification suite is correct?"

"oh, we do"

"and your customers *trust* you??"

"oh yes, absolutely!"

naturally, PicoRV32, by virtue of it being downloadable and made
available under a libre license, along with a corresponding formal
verification suite, was selected instead.

what's happening with RISC-V is just... amazing.

l.

Anton Ertl

unread,
Nov 6, 2018, 12:32:05 PM11/6/18
to
You present this as an advantage, but I think that this is the biggest
disadvantage of these schemes that try to implement sequences of
vector operations as (conceptual, but more efficiently implemented)
loops of scalar operations.

E.g., desktop hardware has two load units that can load a cache line
(typically 32 bytes on recent CPUs) without much more effort than it
needs for loading smaller amounts [1]. If you design the architecture
such that it uses a load unit for every 8 bytes, the load units will
have only a quarter of the troughput (for vectors of bytes, only
1/32th of the throughput). Sure, you can then run gather loads at
similar speeds as contiguous loads, but you have achieved that by
turning the contiguous loads into gather loads and thus slowing them
down.

[1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
Bridge (2011), but they needed until Skylake (2015) to widen the load
units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.

For other operations, similar considerations apply, although the
tradeoffs are different: Making an FP multiply-add unit wider probably
costs more than making a load unit wider (but I may be wrong, given
that Intel implemented 256-bit wide FP units before 256-bit wide load
units; OTOH, maybe Intel tuned the CPUs for matrix multiplication).

In any case, when building on scalar operations, you also exercise the
OoO machinery 4-32 times as much as with 32-byte wide operations.

Concerning the bow to the Cray-1 that I see in several of these works,
I think that one cool thing that the Cray-1 did was that the program
could start some vector operations, and then do scalar work while the
vector stuff was being processed.

Actually, this does not quite work out because of the limited vector
length of the Cray machines, so the programmer has to write loops
containing vector instructions after all.

But consider an architecture where vector length is limited only by
memory. Vector operations could be performed almost independently of
scalar stuff; only scalar operands and memory dependencies would cause
interactions between the scalar and the vector part of the CPU. One
can imagine various ways to implement the vector part (with different
implementations using different processing granularity); I think this
can be quite a bit simpler than exercising the full reservation
station machinery for each operation on each grain.

The downside of arbitrarily long vectors is that you have to be able
to interrupt it in the middle. This means that the interrupt state
includes information about which vector operations are in progress and
how far they are. However, if you let each dependence chain run to
completion of processing all started grains, you only need to save
metadata on an interrupt, and do not need to save registers that
contain intermediate result grains.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

unread,
Nov 6, 2018, 1:23:55 PM11/6/18
to
MitchAlsup <Mitch...@aol.com> writes:
>But AVX gets nasty if you need 1,2,3 out of the 8 register sub-locations to be
>processed.

AVX(2) has VMASKMOVPS/VMASKMOVPD/VMASKMOVDQU that allow loading and
storing sublocations with a predicate mask. For the ALU operations,
processing more is not that bad, as long as you don't store the
results in the end, and as long as you don't produce an exception on
loading.

I have not looked closely at AVX512, but AFAIK it supports sublocation
predication for all operations.

>The MMX model only works when the data is packed in memory, and
>only when all sub-locations want to be processed similarly (with a couple of
>exceptions.)

IA-64 worked well at data-parallel stuff. Yet it still did not
conquer the supercomputing world. My explanation for this is that,
while it was more flexible than SIMD extensions, SIMD extensions are
good enough (compared to IA-64) for most stuff. I.e., the cases where
the sublocations have to be processed differently are either not that
frequent, or they are covered by the exceptions you mention. In the
meantime, we have GPGPUs, but I think that vector/SIMD stuff in the
CPU can still be useful.

MitchAlsup

unread,
Nov 6, 2018, 3:09:08 PM11/6/18
to
On Tuesday, November 6, 2018 at 11:32:05 AM UTC-6, Anton Ertl wrote:
The counter argument is that if one makes vector loads and vector stores,
one assigns the semantic that they either run to completion or did not
start. {At least everyone I studied did.}

Then you get in the loop of having stride 1 vector loads and stores,
stride K loads and stores, and vector indirect (gather scatter) loads
and stores.

Whereas when doing it loop-by-loop, with each load/store as its own unit
of visible pipeline work, all those variants are unnecessary. The
power overhead is all in operand routing and not in address calculation
or memory access.
>
> [1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
> Bridge (2011), but they needed until Skylake (2015) to widen the load
> units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.
>
> For other operations, similar considerations apply, although the
> tradoffs are different: Making a FP multiply-add unit wider probably
> costs more than making a load unit wider (but I may be wrong, given
> that Intel implemented 256-bit wide FP units before 256-bit wide load
> units; OTOH, maybe Intel tuned the CPUs for matrix multiplication).
>
> In any case, when building on scalar operations, you also exercise the
> OoO machinery 4-32 times as much as with 32-byte wide operations.

But at the same time, the FETCH-DECODE-ISSUE stages of the pipeline
are quiescent. For simple instruction sequences, this part of the
pipe can represent 50% of the power to perform instructions. Leaving
this quiet balances out with the RSs firing more often and capturing
operands more often.
>
> Concerning the bow to the Cray-1 that I see in several of these works,
> I think that one cool thing that the Cray-1 did was that the program
> could start some vector operations, and then do scalar work while the
> vector stuff was being processed.

Also note: on CRAY-1 one vector load did not start until the previous
vector (or scalar) load completed. This proved to be a big problem,
and was fixed on CRAY-XMP where the vector pipe had 3 memory units
per cycle -- what we would call 3 memory lanes today.

So::
LDV Ra,[addrA]
LDV Rb,[addrB]
FMUL Rc,Ra,Rb

was piped in CRAY-1:

LDV  |---+++++++++++++++++++++++++|
LDV                               |---V++++++++++++++++++++++++|
FMUL                                   |-----*************************|

whereas in CRAY-XMP it was piped as:

LDV  |---V++++++++++++++++++++++++|
LDV   |---V++++++++++++++++++++++++|
FMUL       |-----*************************|

>
> Actually, this does not quite work out because of the limited vector
> length of the Cray machines, so the programmer has to write loops
> containing vector instructions after all.
>
----------------------------------------------------------------------

> But consider an architecture where vector length is limited only by
> memory. Vector operations could be performed almost independently of
> scalar stuff; only scalar operands and memory dependencies would cause
> interactions between the scalar and the vector part of the CPU. One
> can imagine various ways to implement the vector part (with different
> implementations using different processing granularity); I think this
> can be quite a bit simpler than exercising the full reservation
> station machinery for each operation on each grain.

On the other hand: those M-M Vector machines performed one instruction
for lots of cycles, and then performed the next instruction for lots of
cycles. There was no overlap across instructions--probably for interrupt
reasons--but routing could have been another rationale.

What we want is for the instructions in the whole loop to execute
loop by loop, not instruction by instruction.

>
> The downside of arbitrarily long vectors is that you have to be able
> to interrupt it in the middle. This means that the interrupt state
> includes information about which vector operations are in progress and
> how far they are. However, if you let each dependence chain run to
> completion of processing all started grains, you only need to save
> metadata on an interrupt, and do not need to save registers that
> contain intermediate result grains.

On the other hand, if the instructions run loop-by-loop and an interrupt
takes place, the IP and loop variable contain everything needed to
restart the calculation. No extra state needed.

George Neuner

unread,
Nov 6, 2018, 4:20:32 PM11/6/18
to
On Tue, 06 Nov 2018 14:48:14 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>But consider an architecture where vector length is limited only by
>memory. Vector operations could be performed almost independently of
>scalar stuff; only scalar operands and memory dependencies would cause
>interactions between the scalar and the vector part of the CPU. One
>can imagine various ways to implement the vector part (with different
>implementations using different processing granularity); I think this
>can be quite a bit simpler than exercising the full reservation
>station machinery for each operation on each grain.
>
>The downside of arbitrarily long vectors is that you have to be able
>to interrupt it in the middle. This means that the interrupt state
>includes information about which vector operations are in progress and
>how far they are. However, if you let each dependence chain run to
>completion of processing all started grains, you only need to save
>metadata on an interrupt, and do not need to save registers that
>contain intermediate result grains.

I've been an advocate of something like this for a long time.

I had occasion to work with DSPs that could perform 2-dimensional [X,Y
counter] memory->memory DMA. Unfortunately, all it could do was
*move* data around ... it would have been nice to be able to perform
computations on the in-flight data streams.

I would love to see something like this realized. I don't think
arbitrary striding / gather really is necessary - contiguous blocks
would be fine - but certainly I would want arbitrary length.

I think a coprocessor-like approach using DMA offers a lot more than
loops of short vector operations. Even with Cray-like vector lengths,
too much program intervention is needed even if you can do something
else while waiting.

YMMV.

>- anton

George

MitchAlsup

unread,
Nov 6, 2018, 4:47:11 PM11/6/18
to
On Tuesday, November 6, 2018 at 3:20:32 PM UTC-6, George Neuner wrote:
> On Tue, 06 Nov 2018 14:48:14 GMT, an...@mips.complang.tuwien.ac.at
> (Anton Ertl) wrote:
>
> >But consider an architecture where vector length is limited only by
> >memory. Vector operations could be performed almost independently of
> >scalar stuff; only scalar operands and memory dependencies would cause
> >interactions between the scalar and the vector part of the CPU. One
> >can imagine various ways to implement the vector part (with different
> >implementations using different processing granularity); I think this
> >can be quite a bit simpler than exercising the full reservation
> >station machinery for each operation on each grain.
> >
> >The downside of arbitrarily long vectors is that you have to be able
> >to interrupt it in the middle. This means that the interrupt state
> >includes information about which vector operations are in progress and
> >how far they are. However, if you let each dependence chain run to
> >completion of processing all started grains, you only need to save
> >metadata on an interrupt, and do not need to save registers that
> >contain intermediate result grains.
>
> I've been an advocate of something like this for a long time.
>
> I had occasion to work with DSPs that could perform 2-dimensional [X,Y
> counter] memory->memory DMA. Unfortunately, all it could do was
> *move* data around ... it would have been nice to be able to perform
> computations on the in-flight data streams.

About 30 years ago, I realized that a Vector processor was simply a
DMA device that mangles the data on the way through.

lk...@lkcl.net

unread,
Nov 7, 2018, 1:34:36 AM11/7/18
to
On Tuesday, November 6, 2018 at 5:32:05 PM UTC, Anton Ertl wrote:

> [1] Counterargument: Intel introduced 256-bit-wide operations in Sandy
> Bridge (2011), but they needed until Skylake (2015) to widen the load
> units to 32 bytes; and Ryzen (2017) sticks with 128-bit-wide loads.

there is no question and absolutely no doubt about the fact that
an increased computational capability requires an increased
throughput capability to match, right the way along the data
path from LOAD, through L2/L1 cache, to ALU *and* back.

this is also well-known to require considerable amounts of power,
and consequently for the Libre-RISCV GPGPU we will be putting in
a shockingly-large register bank for such a lowly embedded design:
128 64-bit FP and 128 64-bit INT registers.

GPU workloads are i believe quite exceptional in that a *lot* of
computation is required before the data is pushed back down through
memory.

> The downside of arbitrarily long vectors is that you have to be able
> to interrupt it in the middle.

it's not just arbitrarily-long vectors: it's all vectors.

> This means that the interrupt state includes information about which
> vector operations are in progress and how far they are.

yes it does, and that's ok: there are workable schemes (mentioned
earlier, repeated in summary and expanded on below)

> However, if you let each dependence chain run to
> completion of processing all started grains, you only need to save
> metadata on an interrupt, and do not need to save registers that
> contain intermediate result grains.

so how would you deal with a TLB miss? or an absolutely essential
(real-time) context-switch? or anything else that absolutely,
unavoidably has to be dealt with?

how would real-time latency guarantees be provided if the vector length
is 4096 entries long, resulting in potentially tens of thousands of
cycles passing?

it's not practical, is it?

so this is why the requirements to keep the appearance of a sequential
element operation processing exist, even if it means throwing away
perfectly good results should one element's operation with a lower
index happen to throw an exception.

when i first encountered this requirement, i *really* didn't like it.
wasting perfectly good results?? oink?? so i thought, hmm, how could
you deal with that?

and one way is: just not have any exceptions (and keep the vector
lengths artificially short) and to pre-process memory LOADs and
STOREs. floating point, you raise flags, and it is the software's
job to post-analyse them.

the LD/ST requirements would effectively turn what was otherwise a
simple RISC design into something extremely complex and CISC-like,
with pre-analysis, pre-TLB fetching, memory reservations and many
of the characteristics of transactional processing.

hell for implementors on short timescales, basically.

instead, i thought, well, hmmm, you have these predication bits: what
if those were actually used to indicate that the operation had
completed? so, each element operation may be arbitrarily performed in
any order, and, on completion, the relevant source predicate's bit is
*cleared*.

on a [necessary] context switch, the very same "half-completed" predicate
may even be used as a means to swap out partially-completed results,
if the (un-computed) destination register(s) are known not to have
pre-existing data in them.

now, there's a couple of problems with this: the first is, not all
parallel operations are predicated; and secondly, it requires extra
writes back to the register file.

in the end, in SV, i decided that the simplest thing to do is follow
the lead of the established research here: to impose a sequential
element execution path and throw away results with a higher element
index than the point where an exception occurs.
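
(purely as an illustrative model -- python, names invented -- the
"sequential element execution, discard anything past the faulting
element" rule amounts to something like this:)

def run_vector_op(op, src, dest, state):
    # execute elements strictly in order; on a fault, record the
    # element index so the loop can be resumed once the trap handler
    # has done its job.  nothing past the faulting element is committed.
    for i in range(state["element_index"], state["VL"]):
        try:
            result = op(src[i])
        except Exception:
            state["element_index"] = i    # restart point after the trap
            raise
        dest[i] = result                  # commit strictly in element order
    state["element_index"] = 0            # loop complete: clear the progress state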

l.

lk...@lkcl.net

unread,
Nov 7, 2018, 1:58:27 AM11/7/18
to
On Tuesday, November 6, 2018 at 8:09:08 PM UTC, MitchAlsup wrote:

> The counter argument is that if one makes vector loads and vector stores,
> one assigns the semantic that they either run to completion or did not
> start. {At least everyone I studied did.}

i can recommend looking at hwacha and RVV (which SV follows).

as i outlined to anton, this requires a complex transactional design
that places a very high barrier to entry on implementors, and also
makes real-time low-latency guarantees pretty much impossible.

RVV and SV have sequential element order and throw away results beyond
the exceptions' indices. it's not nice, however it is something that
even a simple implementation may conform to and stand a chance of
getting it right.

and, honestly, i don't believe that there's going to be significant
performance reduction: you don't get exceptions coming along that
often.

> Then you get in the loop of having stride 1 vector loads and stores,
> stride K loads and stores, and vector indirect (gather scatter) loads
> and stores.

yep. amazingly, SV has all of those - with no new instructions - by
deploying something that i termed "twin predication". the pseudocode
is as follows:

function op(rd, rs):
    rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
    rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
    ps = get_pred_val(FALSE, rs);  # predication on src
    pd = get_pred_val(FALSE, rd);  # ... AND on dest
    for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break

note that this covers *all* cases: scalar-scalar, scalar-vector,
vector-scalar, vector-vector *and* predication on scalar-source
or vector-source *and* predication on scalar-dest or vector-dest.

it's really quite deceptively straightforward (and yes there
is a bug in the while loop over-running the end of the
predication array, i know, i know).

this pseudo-code therefore, amazingly, covers VINSERT, VSPLAT,
VREDUCE, VEXTRACT, VSCATTER, VGATHER, VCOPY and a whole stack
more, including a hybrid SCATTER-GATHER.

and at *no time* does it require complex transactional guarantees.

in other words, as long as both the current src-element loop
(i) and the dest-element loop index (j) are part of the vector
state that's stored on context-switch, the above pseudo-code
is *re-entrant*.
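
(an illustrative python rendering of that pseudo-code -- register
numbers, VL and predicate values are made up for the example, and the
scalar operation is just a MV -- showing the VGATHER-like "pack the
active elements into a contiguous block" behaviour:)

def twin_pred_mv(regs, rs, rd, VL, ps, pd, rs_isvec=True, rd_isvec=True):
    i = j = 0
    while i < VL and j < VL:
        if rs_isvec:
            while not (ps & (1 << i)): i += 1   # skip masked-out src elements
        if rd_isvec:
            while not (pd & (1 << j)): j += 1   # skip masked-out dest elements
        regs[rd + j] = regs[rs + i]
        if rs_isvec: i += 1
        if rd_isvec: j += 1
        else: break

regs = [10, 11, 12, 13, 14, 0, 0, 0]
twin_pred_mv(regs, rs=0, rd=5, VL=3, ps=0b101, pd=0b111)
print(regs)   # [10, 11, 12, 13, 14, 10, 12, 0]: src elements 0 and 2 packed at rd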


> > In any case, when building on scalar operations, you also exercise the
> > OoO machinery 4-32 times as much as with 32-byte wide operations.
>
> But at the same time, the FETCH-DECODE-ISSUE stages of the pipeline
> are quiescent. For simple instruction sequences, this part of the
> pipe can represent 50% of the power to perform instructions. Leaving
> this quiet balances out with the RSs firing more often and capturing
> operands more often.

the power reduction and "compactification" of the instructions
in SV, particularly given that register-redirection may allow
compilers and assembly writers to make more frequent use of
the 16-bit "Compressed" instructions from RV, was a nice accidental
unintended side-effect.


> What we want is for the instructions in the whole loop to execute
> loop by loop, not instruction by instruction.

the proviso i see is that this would require a CISC architecture

> >
> > The downside of arbitrarily long vectors is that you have to be able
> > to interrupt it in the middle. This means that the interrupt state
> > includes information about which vector operations are in progress and
> > how far they are. However, if you let each dependence chain run to
> > completion of processing all started grains, you only need to save
> > metadata on an interrupt, and do not need to save registers that
> > contain intermediate result grains.
>
> On the other hand, if the instructions run loop-by-loop and an interrupt
> takes place, the IP and loop variable contain everything needed to
> restart the calculation. No extra state needed.

how? i am probably going to understand this aspect of 66000 vectorisation
when i see some microarchitectural pseudo-code.

l.

Terje Mathisen

unread,
Nov 7, 2018, 2:56:33 AM11/7/18
to
lk...@lkcl.net wrote:
> so how would you deal with a TLB miss? or an absolutely essential
> (real-time) context-switch? or anything else that absolutely
> requires unavoidably dealing with?
>
> how would real-time latency guarantees be provided if the vector
> length is 4096 entries long, resulting in potentially tens of
> thousands of cycles passing?
>
> it's not practical, is it?

This is a good reason to want BIG/little style multicore cpus, so that
the BIG ones with the vector hw can be dedicated to such use and never
have to serve any kind of interrupt.

If you still have to handle an in-flight TLB miss you would probably end
up stalling the BIG core until the TLB has been updated.

If you also want to allow sw handling, like for a page miss, you still
stall the BIG core while a little core is handling the OS work needed.

At this point the only thing remaining is the need to handle
hibernation, but this can probably be done by allowing any outstanding
vector ops to finish first?

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

lk...@lkcl.net

unread,
Nov 7, 2018, 3:30:05 AM11/7/18
to
On Wednesday, November 7, 2018 at 7:56:33 AM UTC, Terje Mathisen wrote:
> lk...@lkcl.net wrote:
> > so how would you deal with a TLB miss? or an absolutely essential
> > (real-time) context-switch? or anything else that absolutely
> > requires unavoidably dealing with?
> >
> > how would real-time latency guarantees be provided if the vector
> > length is 4096 entries long, resulting in potentially tens of
> > thousands of cycles passing?
> >
> > it's not practical, is it?
>
> This is a good reason to want BIG/little style multicore cpus, so that
> the BIG ones with the vector hw can be dedicated to such use and never
> have to serve any kind of interrupt.

in the Libre RISCV GPGPU we will be going for a uniform SMP design,
where the SV engine may be disabled or simply not used.

> If you still have to handle an in-flight TLB miss you would probably end
> up stalling the BIG core until the TLB has been updated.
>
> If you also want to allow sw handling, like for a page miss, you still
> stall the BIG core while a little core is handling the OS work needed.

this would work well, the only proviso being that, in conversations
on isa-dev, andrew waterman points out (in different words) that
handling the TLB in software is horribly inefficient.

see below as well...

> At this point the only thing remaining is the need to handle
> hibernation, but this can probably be done by allowing any outstanding
> vector ops to finish first?

honestly i feel that hibernation is not an issue: there really is
enough state information published from SV's CSRs to do a full
swap-out and store (s2disk). now, it means swapping out all
128 64-bit INT and 64-bit FP registers *and* 76 (or so) bytes
of CSR state, which includes the current element indices of
a (partially-complete) loop, and that's fine.

which brings me back to the point you raise about big.little,
terje: the capability which allows the full state to be
saved/restored even on hibernation should also allow single-core
context-switch / interrupt scenarios to work as well.

here is where SV differs radically from RVV (RISC-V vectorisation):
there's separate CSR register and predication tables for M-Mode
and S-Mode. they're vastly smaller, however they're there,
and their purpose is to be used, by convention not by hardware-design,
for fast and efficient context-switching.

in the higher privileged modes the lower-privileged SV CSRs are
untouched, and, on switching to the lower privileged modes, the
higher privileged SV CSRs also remain as they are, so that on
return to a higher privileged mode, execution continues using
the exact same state.

thus, by *convention*, it would be possible in user-mode to
allocate one register as a predicate mask, so that LOAD/STORE
in the *higher* privileged modes need *only* LOAD/STORE those
registers that were *actually* in use [in the lower privileged
mode].

this is why the twin-predication has been added, so that the
register which (by convention) says which registers are in
use may efficiently use the VGATHER-like characteristics of
twin-predication on a STORE, to context-switch *only* the
registers in use out to the stack....

...*as a contiguous block with no gaps*...

and likewise on switching to the new process, the VSCATTER-like
characteristics of twin-predication may be used on a LOAD
from the stack, to place that contiguous block of memory
into the correct registers.

by contrast RVV is specifically designed to be a completely
separate pipeline, completely separate register file, and,
as designed, could in no way be utilised for this purpose.

effectively, twin-predication on LD/ST covers the behaviour
of LOAD/STORE-multi... without actually requiring such an
instruction.

there are a few other instances where SV provides desirable
and often-requested characteristics of CISC instruction
sets that new RISC implementors often ask for: they're
provided by not actually adding new instructions, instead
creating [virtual/real] parallelism semantics on top of
a *scalar* completely unmodified RISC ISA.

i really should write this up as an academic paper or
at least as a submission to ACM or something.

l.

George Neuner

unread,
Nov 7, 2018, 9:12:22 AM11/7/18
to
Exactly. I realized it ~15 years ago when I was playing with DSPs.
The trouble, IMHO, always has been the approach of using vector
registers (lots of real estate). I'd rather the data somehow were
just pumped through the core.

Anton Ertl

unread,
Nov 7, 2018, 9:26:45 AM11/7/18
to
MitchAlsup <Mitch...@aol.com> writes:
>On Tuesday, November 6, 2018 at 11:32:05 AM UTC-6, Anton Ertl wrote:
>> E.g., desktop hardware has two load units that can load a cache line
>> (typically 32 bytes on recent CPUs) without much more effort than it
>> needs for loading smaller amounts [1]. If you design the architecture
>> such that it uses a load unit for every 8 bytes, the load units will
>have only a quarter of the throughput (for vectors of bytes, only
>> 1/32th of the throughput). Sure, you can then run gather loads at
>> similar speeds as contiguous loads, but you have achieved that by
>> turning the contiguous loads into gather loads and thus slowing them
>> down.
>
>The counter argument is that if one makes vector loads and vector stores,
>one assigns the semantic that they either run to completion or did not
>start. {At least everyone I studied did.}

At the end of my posting I am discussing vector operation chains that
can be interrupted in the middle, but the following still applies:

>Then you get in the loop of having stride 1 vector loads and stores,
>stride K loads and stores, and vector indirect (gather scatter) loads
>and stores.

Yes, you would have stride 1 vector loads and stores, and possibly
also the others (which are slower or require more resources to
implement).

>Whereas doing it loop-by-loop and each load/store is its own unit
>of visible pipeline work, all those variants are unnecessary. The
>power overhead is all in operand routing and not in address calculation
>or memory access.

You are more of an expert on area and power consumption than I am, but
I assume that there is an area and/or power reason that Skylake can do
2 32-byte loads per cycle, but not 64 1-byte loads per cycle, (or, as
a compromise, 8 8-byte loads per cycle).

>> In any case, when building on scalar operations, you also exercise the
>> OoO machinery 4-32 times as much as with 32-byte wide operations.
>
>But at the same time, the FETCH-DECODE-ISSUE stages of the pipeline
>are quiescent. For simple instruction sequences, this part of the
>pipe can represent 50% of the power to perform instructions. Leaving
>this quiet balances out with the RSs firing more often and capturing
>operands more often.

There are uOp caches that quiesce the decoder, and the loop buffer
that helps with fetch power, and with wider operations you get that in
addition to the benefit of reducing RS firing and operand captures.

>> Concerning the bow to the Cray-1 that I see in several of these works,
>> I think that one cool thing that the Cray-1 did was that the program
>> could start some vector operations, and then do scalar work while the
>> vector stuff was being processed.
...
>> But consider an architecture where vector length is limited only by
>> memory. Vector operations could be performed almost independently of
>> scalar stuff; only scalar operands and memory dependencies would cause
>> interactions between the scalar and the vector part of the CPU. One
>> can imagine various ways to implement the vector part (with different
>> implementations using different processing granularity); I think this
>> can be quite a bit simpler than exercising the full reservation
>> station machinery for each operation on each grain.
>
>On the other hand: those M-M Vector machines performed one instruction
>for lots of cycles, and then performed the next instruction for lots of
>cycles. There was no overlap across instructions--probably for interrupt
>reasons--but routing could have been another rational.

What I have in mind is not a machine with load-op-store vector
instructions; instead, I think about generalized chains as
architectural feature: A generalized chain (mesh) is a set of vector
operations with some vector-source operations (in particular, loads),
some ALU operations, and some vector-sink operations (stores and
reductions).

Vector register names (or maybe a vector belt) are used for describing
the data flow between the operations of a mesh. While these vector
registers are conceptually of arbitrary length, each grain produced is
directly consumed by following vector operations, and at the end of a
mesh, they are dead; so the needed storage is just a single grain
(unit of processing, e.g., 32 bytes) for an ALU result or a few grains
(for reduction inputs).

While the vectors are arbitrary length, the number of operations in a
mesh is architecturally limited, in order to limit the storage needed
for recording in-progress vector operations and intermediate results.

>> The downside of arbitrarily long vectors is that you have to be able
>> to interrupt it in the middle. This means that the interrupt state
>> includes information about which vector operations are in progress and
>> how far they are. However, if you let each dependence chain run to
>> completion of processing all started grains, you only need to save
>> metadata on an interrupt, and do not need to save registers that
>> contain intermediate result grains.
>
>On the other hand, if the instructions run loop-by-loop and an interrupt
>takes place, the IP and loop variable contain everything needed to
>restart the calculation. No extra state needed.

Yes, but your machine cannot do anything else until your loop is done,
while the machine I have in mind can do scalar stuff in parallel. Of
course, it remains to be seen whether that is a significant advantage.

Anton Ertl

unread,
Nov 7, 2018, 10:00:26 AM11/7/18
to
lk...@lkcl.net writes:
>On Tuesday, November 6, 2018 at 5:32:05 PM UTC, Anton Ertl wrote:
>> The downside of arbitrarily long vectors is that you have to be able
>> to interrupt it in the middle.
>
> it's not just arbitrarily-long vectors: it's all vectors.

That depends on the purpose and design of the machine. I guess that a
Cray-1 could let vector operations run to completion in case of an
interrupt (did it have interrupts?).

>> However, if you let each dependence chain run to
>> completion of processing all started grains, you only need to save
>> metadata on an interrupt, and do not need to save registers that
>> contain intermediate result grains.
>
> so how would you deal with a TLB miss?

Synchronous faults such as page misses (or software-handled TLB
misses) impose the most stringent requirements. One way to deal with
that is to just stop each vector operation where it is; then you have
to store where it is, and also store intermediate results.

Another way to deal with that is to get the whole mesh to be at the
same stage as the faulting vector operation; i.e., drop all the
intermediate results of vector operations that have progressed
further, and continue all vector operations that are further back; if
another fault happens in a continued operation, that operation
determines the state at the interrupt. For vector sources and ALU
operations, dropping the results is enough; for vector sinks, some
buffering may be needed in order to be able to roll back the operation
to the stage of the faulting operation (or you use the other approach
for these operations). The advantage of this approach is that less
data needs to be saved; but that advantage may not be worth the
effort.
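
A rough sketch of the bookkeeping for that second approach (my own
illustration of it, nothing more): each operation in the mesh carries
a count of completed grains, and on a fault the operations that ran
ahead rewind to the faulting grain, so only a single grain index per
mesh has to be saved across the interrupt:

def sync_mesh_to_fault(grains_done, faulting_grain):
    # grains_done[i] is the progress counter of operation i in the mesh.
    # Operations ahead of the fault drop their extra results; operations
    # behind it keep running until they reach the faulting grain (or
    # fault themselves), so the saved state is just this one index.
    for i, done in enumerate(grains_done):
        if done > faulting_grain:
            grains_done[i] = faulting_grain
    return faulting_grain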

> how would real-time latency guarantees be provided if the vector length
> is 4096 entries long, resulting in potentially tens of thousands of
> cycles passing?

Vector operation meshes are interrupted and resumed in the middle.
BTW, I would not be worried about "tens of thousands of cycles" (maybe
0.01ms) on the kind of machine I have in mind. They are not that
great at real-time processing anyway. But stuff like page faults (not
present in the Cray-1) require being able to interrupt in the middle.

MitchAlsup

unread,
Nov 7, 2018, 11:24:04 AM11/7/18
to
The vector is really scalar code that has been preloaded into various stations,
and when an exception or interrupt transpires, the vector illusion collapses
back into scalar form. Thus, no added state; no vector registers, no vector
length, no vector predicate,...

lk...@lkcl.net

unread,
Nov 7, 2018, 11:40:47 AM11/7/18
to
On Wednesday, November 7, 2018 at 4:24:04 PM UTC, MitchAlsup wrote:

> > how? i am probably going to understand this aspect of 66000 vectorisation
> > when i see some microarchitectural pseudo-code.
>
>
> The vector is really scalar code that has been preloaded into various stations,
> and when an exception or interrupt transpires, the vector illusion collapses
> back into scalar form. Thus, no added state; no vector registers, no vector
> length, no vector predicate,...

i guess what i am asking, mitch, is down to me not being able to
mentally grasp the data with a register "hand" :) as in, mentally
i am literally visualising my hands trying to clutch at the data
only to find it's not in a register.

could you give a walkthrough, or is there some simulator source
code available online?

i wish i could recall the name of that ISA i heard of which took
the concept you're referring to to its full logical extent (it
was a scalar ISA with zero registers):
    memory -> chain-chain-chain -> memory
    memory ----------^

l.

lk...@lkcl.net

unread,
Nov 7, 2018, 11:45:30 AM11/7/18
to
On Wednesday, November 7, 2018 at 3:00:26 PM UTC, Anton Ertl wrote:

> Another way to deal with that is to get the whole mesh to be at the
> same stage as the faulting vector operation; i.e., drop all the
> [...]
> for these operations). The advantage of this approach is that fewer
> data needs to be saved; but that advantage may not be worth the
> effort.

if you have more thoughts on this over time, especially if it
is ever implemented, i'd be interested to hear how it works out.

l.

lk...@lkcl.net

unread,
Nov 7, 2018, 3:20:25 PM11/7/18
to
On Wednesday, November 7, 2018 at 4:40:47 PM UTC, lk...@lkcl.net wrote:
> On Wednesday, November 7, 2018 at 4:24:04 PM UTC, MitchAlsup wrote:
>
> > > how? i am probably going to understand this aspect of 66000 vectorisation
> > > when i see some microarchitectural pseudo-code.
> >
> >
> > The vector is really scalar code that has been preloaded into various stations,
> > and when an exception or interrupt transpires, the vector illusion collapses
> > back into scalar form. Thus, no added state; no vector registers, no vector
> > length, no vector predicate,...
>
> i guess what i am asking, mitch, is down to me not being able to
> mentally grasp the data with a register "hand" :) as in, mentally
> i am literally visualising my hands trying to clutch at the data
> only to find it's not in a register.
>
> could you give a walkthrough, or is there some simulator source
> code available online?

so, mitch, i had another think about this, going back to the original
post you wrote, google group link here:
https://groups.google.com/d/msg/comp.arch/bGBeaNjAKvc/59my0OrRAgAJ

i believe the source of confusion is down to the [premature] optimisation
to utilise the "in-flight" terminology. i do not believe it to be
necessary - at all - to have a full-on OoO architecture in order to implement
the vectorisation concept that you describe.

the clue is in the marking of certain registers as "vectorised" and
others as "scalar invariant" as part of the loop, which is translated /
implemented as "scalar-marked registers are loaded once, vector-marked
registers are loaded on every loop".

i believe it would be perfectly reasonable to do exactly that, *even*
in an absolutely dead-simple micro-architecture with zero "in-flight"
capability, OoO renaming, or other complex infrastructure that's assumed
to be necessary.

in other words, each element marked as "vectorised" is just... a scalar
register that "happens to be updated on each loop".

the second clue that this is reasonable lies in the fact that you say
that it's possible to "fall back" to a scalar paradigm.

with that in mind, i have a hunch that it may be possible to extend
well beyond the current limit that is believed to be imposed on the
66000 vectorisation system as envisaged, by simply falling back onto
scalar registers when the limitations of "in-flight" data are reached.

also, i see no reason why the loop-vectorisation system that you propose
should not be applied in a nested fashion. the hidden scalar/vector "tags"
may be applied to an inner loop, marking them as scalar/vector, *on top*
of and *in addition* to the outer loop's scalar/vector "tags". the
tags just happen to get longer and longer, accumulating in a stacked
fashion, where only the top of the stack will be relevant for any
given decision as to whether the register should be treated as scalar
or vector.

i would be interested to hear your thoughts on the above.

l.

MitchAlsup

unread,
Nov 7, 2018, 5:12:39 PM11/7/18
to
On Wednesday, November 7, 2018 at 2:20:25 PM UTC-6, lk...@lkcl.net wrote:
> On Wednesday, November 7, 2018 at 4:40:47 PM UTC, lk...@lkcl.net wrote:
> > On Wednesday, November 7, 2018 at 4:24:04 PM UTC, MitchAlsup wrote:
> >
> > > > how? i am probably going to understand this aspect of 66000 vectorisation
> > > > when i see some microarchitectural pseudo-code.
> > >
> > >
> > > The vector is really scalar code that has been preloaded into various stations,
> > > and when an exception or interrupt transpires, the vector illusion collapses
> > > back into scalar form. Thus, no added state; no vector registers, no vector
> > > length, no vector predicate,...
> >
> > i guess what i am asking, mitch, is down to me not being able to
> > mentally grasp the data with a register "hand" :) as in, mentally
> > i am literally visualising my hands trying to clutch at the data
> > only to find it's not in a register.
> >
> > could you give a walkthrough, or is there some simulator source
> > code available online?
>
> so, mitch, i had another think about this, going back to the original
> post you wrote, google group link here:
> https://groups.google.com/d/msg/comp.arch/bGBeaNjAKvc/59my0OrRAgAJ
>
> i believe the source of confusion is down to the [premature] optimisation
> to utilise the "in-flight" terminology. i do not believe it to be
> necessary - at all - to have a full-on OoO architecture in order to implement
> the vectorisation concept that you describe.

Correct, one can implement this in a 1 wide in order machine, too.
>
> the clue is in the marking of certain registers as "vectorised" and
> others as "scalar invariant" as part of the loop, which is translated /
> implemented as "scalar-marked registers are loaded once, vector-marked
> registers are loaded on every loop".

The vector/scalar designation is useful in setting up the data flow
data capture points (i.e., reservation stations) where instructions
wait for all operands to become available prior to launching the
calculation. On the SuperScalar OoO machines one would only read
Scalar registers once, capture them in all the RS entries, and
deliver operands to the <ignorant> function units. On a Scalar
machine, one could simply read the register from the file.
>
> i believe it would be perfectly reasonable to do exactly that, *even*
> in an absolutely dead-simple micro-architecture with zero "in-flight"
> capability, OoO renaming, or other complex infrastructure that's assumed
> to be necessary.

Yes, it scales down as far as one would build a RISC processor.
>
> in other words, each element marked as "vectorised" is just... a scalar
> register that "happens to be updated on each loop".

That is one way to think of it.

But I think of the loop being vectorized, not the instructions--an
important distinction, I believe.
>
> the second clue that this is reasonable lies in the fact that you say
> that it's possible to "fall back" to a scalar paradigm.

It is mandatory for exceptions and debug. Virtual vectors just occupy the
"machine" without needing an excessive number of Fetchers and Decoders.
The data path from the RF(s) to the calculation units is not altered, but
it has been organized such that if an implementation wants to perform
multi-lane wide vectors it is straightforward to extend the implementation
while leaving the ISA and arch alone.
>
> with that in mind, i have a hunch that it may be possible to extend
> well beyond the current limit that is believed to be imposed on the
> 66000 vectorisation system as envisaged, by simply falling back onto
> scalar registers when the limitations of "in-flight" data are reached.

Probably correct.
>
> also, i see no reason why the loop-vectorisation system that you propose
> should not be applied in a nested fashion. the hidden scalar/vector "tags"
> may be applied to an inner loop, marking them as scalar/vector, *on top*
> of and *in addition* to the outer loop's scalar/vector "tags". the
> tags just happen to get longer and longer, accumulating in a stacked
> fashion, where only the top of the stack will be relevant for any
> given decision as to whether the register should be treated as scalar
> or vector.

I agree with you to the point of filling up the reservation stations. I mean
after all, the calculations are connected by renamed registers (GBOoO model)
and adding a count to the rename is straightforward. But only so long as
everything fits in the stations.

Past this point I don't see much performance being left on the table.

jim.bra...@ieee.org

unread,
Nov 8, 2018, 12:01:07 AM11/8/18
to
On Monday, November 5, 2018 at 7:20:58 PM UTC-6, jim.bra...@ieee.org wrote:
> On Monday, November 5, 2018 at 2:36:53 PM UTC-6, MitchAlsup wrote:
> > On Monday, November 5, 2018 at 10:16:32 AM UTC-6, MitchAlsup wrote:
> > > On Monday, November 5, 2018 at 5:54:59 AM UTC-6, lk...@lkcl.net wrote:
> > > > over the past several months i've been developing an abstraction layer
> > > > for vectorisation and parallelisation of RISC-V. i've never encountered
> > > > anything remotely similar in any commercial processor design, hence
> > > > why it could seriously benefit from a review.
> > >
> > > Over the last year I have been working on a Virtual Vector extension to
> > > My 66000 ISA. In this extension the compiler is given the illusion of
> > > a register to register CRAY-like vector instructions set, but there are
> > > only 2 instructions added to the ISA in support of vectors. In particular;
> > > a) there are no vector registers (saving massive real estate)
> >
> > Consider the CRAY-1 with 8*64-entry-64-bit vector registers (4096 bytes
> > of 5 ported storage). The vector register file is larger than the area
> > of all the function units connected to the 512 64-bit registers in the
> > VRF.
> >
> > Now consider a machine with reservation stations in front of each function
> > unit. As long as the RS storage is deeper than the pipeline depth of the
> > function unit, one can use the RS operand capture mechanism to stand in
> > the place of the vector registers and associated forwarding/chaining.
> >
> > Getting rid of the VFR was simply an enabling device to slip the vector
> > extension into My 66000, and once thought through, it works perfectly
> > and seamlessly.
> >
> > > b) no vector length register (nor a conceptual length of the vector)
> > > c) no vector control registers (no vector state, full vector perf)
> > > d) no vector condition codes (scalar predication covers this)
> >
> > Getting rid of this stuff causes one to design the vector facility as if
> > one were building a LOOP around scalar code--which one actually is.
> >
> > void conditional(size_t n, double a, const double x[], double y[])
> > {
> > for (size_t i = 0; i < n; i++) {
> > if( x[i] > 0.0 )
> > y[i] = a*x[i] + y[i];
> > else
> > y[i] = x[i] + a*y[i];
> > }
> > }
> >
> > could be compiled as:
> >
> > conditional:
> > MOV Ri,0
> > CMP Rt,Ri,Rn
> > BGE Rt,exit
> > top: VEC {VSI*2,V,VSVV,VSI,ISS}
> > LDD Rx,[Rxb+Ri<<3]
> > LDD Ry,[Ryb+Ri<<3]
> > PGT Rx,TF // predicate on condition
> > FMAC Ry,Ra,Rx,Ry // executed if Rx > 0
> > FMAC Ry,Rx,Ry,Rx // executed otherwise (inc. NaNs)
> > STD Ry,[Ryb+Ri<<3]
> > LOOP Ri,1,Rn
> > exit:
> > JMP R0
> >
> > Note: VERY IMPORTANT! Individual instructions do NOT waste bits supporting
> > predication! These bits are provided by predicate instructions (which are
> > degenerate branches in effect).
> >
> > The PREDICATE instructions casts predicates over the subsequent
> > instructions--in this case the first instruction is in the THEN
> > clause (T) while the next instruction is in the ELSE (F) clause.
> > One or the other is issued, the instructions buffering mechanism
> > (reservation stations) can figure out whether to fire or not
> > eliminating the need to nullify a calculation that is not needed.
> >
> > In the My 66000 ISA predicates can be combined to perform && an ||
> > branches as well as nested if-the-else stuff.
>
> Perhaps off topic
>
> Would like to introduce an alternative view:
> Reserve n registers in the register file as a "barrel"
> Vector results go into the barrel one after another.
> "n" is large enough to hold all the intermediate results for one loop iteration.
>
> So announcing the barrel with a special instruction starts the loop
> And a repeat instruction is at end of loop.
>
> The intermediate results don't actually have to reside in the barrel, instead, say, in function unit output shift registers as in the Mill. Or the barrel is just a compiler sleight-of-hand and an exception mechanism.
>
> Makes it easier for my mind to comprehend what is going on: Every result has a place. Use of wider registers and multi-media instructions is just optimization of resources done either at compile time or run time. The problem of partially filled registers at the end of the loop can be addressed generically?

]>Perhaps off topic

Would like to add to my comments:

Possible constraints on vector hardware architecture:
(eg an opportunity for hardware design optimization)

Rant on:
If you have never programmed at the array level (Matlab, Numerical Python, IDL, etc); it is different. You try to do as much as possible with vector/array operations. Call it "programming in the large". One uses a variety of helper functions not seen in "programming in the small":
For instance the sort routine returns a vector of array indices. Sort does not move any of the items to be sorted. For instance to find the ten best employees one forms a sort criteria and does the sort using this criteria. One then chops the returned array indices to length ten and forms a new vector of the selected array elements. In this context the employee array is not a vector of structs, but a struct of vectors.
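
To make that concrete, a hedged NumPy sketch of the sort-by-index /
top-N pattern just described (field names and numbers invented for
illustration, N cut down to 3 to keep it small):

import numpy as np

# the employee "array" as a struct of vectors, not a vector of structs
names  = np.array(["ann", "bob", "cho", "dee", "eli"])
sales  = np.array([ 120,   95,   240,   180,    60 ])
tenure = np.array([   3,    7,     2,     5,     9 ])

score = sales + 10*tenure        # the sort criterion, computed vector-wise
order = np.argsort(score)[::-1]  # indices that would sort by score, best first
top   = order[:3]                # chop the index vector to the N best

best_names = names[top]          # the same index vector gathers from any field
best_sales = sales[top]

Note that nothing in the employee data is ever moved; only the index
vector is sorted and chopped.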

Say there is a vector expression and an assignment. It includes a transcendental evaluation. The expression evaluator normally generates a machine-code loop for the entire vector expression. In this case the evaluator is likely to do the transcendental separately (call the transcendental with a vector) and send the result to a temporary vector/array. This breaking of vector expressions into pieces and creating intermediate temporaries means that vector code loops will typically be relatively short, using a minimum of temporaries.

Thus IMHO compiler architecture for vector processing will always strive to maximize memory bandwidth utilization first (perhaps bypassing 1st & 2nd level caches) and secondly minimize vector temporaries (as creating them and accessing temporaries entails additional memory bandwidth utilization).

End of rant


Lowest hanging fruit (optimization for contiguous vector data): Memory move.

lk...@lkcl.net

unread,
Nov 8, 2018, 1:56:34 AM11/8/18
to
On Wednesday, November 7, 2018 at 10:12:39 PM UTC, MitchAlsup wrote:
> On Wednesday, November 7, 2018 at 2:20:25 PM UTC-6, lk...@lkcl.net wrote:

> > so, mitch, i had another think about this, going back to the original
> > post you wrote, google group link here:
> > https://groups.google.com/d/msg/comp.arch/bGBeaNjAKvc/59my0OrRAgAJ
> >
> > i believe the source of confusion is down to the [premature] optimisation
> > to utilise the "in-flight" terminology. i do not believe it to be
> > necessary - at all - to have a full-on OoO architecture in order to implement
> > the vectorisation concept that you describe.
>
> Correct, one can implement this in a 1 wide in order machine, too.

very cool. ok. so now i have a mental "handle" on the concept
[without the optimisations], it's easier to conceptualise.

for example, it should be possible to do N-wide in-order,
leaving the scalars alone and transparently loading multiple(s) of
the registers: the only thing i've not yet got my head round
is, if you have an N-wide operation underway, how do you handle
exceptions/traps?

again, here, i believe setting the exact same "sequential"
rules as RVV and SV (throwing away even perfectly good
results of elements indexed greater than or equal to the element
in the N-wide vector that caused the exception) would
suffice. what do you think?

> >
> > the clue is in the marking of certain registers as "vectorised" and
> > others as "scalar invariant" as part of the loop, which is translated /
> > implemented as "scalar-marked registers are loaded once, vector-marked
> > registers are loaded on every loop".
>
> The vector/scalar designation is useful in setting up the data flow
> data capture points (i.e., reservation stations) where instructions
> wait for all operands to become available prior to launching the
> calculation.

this sounds again like an optimisation (more below)

> On the SuperScalar OoO machines one would only read
> Scalar registers once, capture them in all the RS entries, and
> deliver operands to the <ignorant> function units. On a Scalar
> machine, one could simply read the register from the file.

ok so in the simplest 1-wide case, i am not perceiving any
need to delay the computation, or even perform any reservations.

in the 2-wide case, the simplest naive implementation would
be to have a duplicate (2nd) full register file. the
state information scalar/vector would be interpreted as
performing the loop "for i in range(2)" on every operation,
pseudo-code something like this:

uint64_t regs[2][32];       # [lane][register]
bool reg_vectorised[32];
int loop_idx;               # a global

read_reg(reg_idx):
    if !reg_vectorised[reg_idx]:
        el_idx = 0
    else:
        el_idx = loop_idx
    return regs[el_idx][reg_idx]

and a corresponding version for write_reg. now an operation becomes
a simple for-loop:

for loop_idx in range(2):    # yes, set the global loop_idx
    operand1 = read_reg(OP1)
    operand2 = read_reg(OP2)
    result = op(operand1, operand2)
    write_reg(DEST, result)
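
(for completeness, the write_reg mentioned above - not spelled out -
would, under the same naive assumptions, look something like this:)

write_reg(reg_idx, value):
    if !reg_vectorised[reg_idx]:
        el_idx = 0
    else:
        el_idx = loop_idx
    regs[el_idx][reg_idx] = value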

more complex variants of this simple in-order implementation concept
may use a CAM instead of wasting huge amounts of register file, or...
various other tricks: mild register-renaming and so on.

i see absolutely no need to pre-analyse, reserve, delay, or anything
which would prevent this from being absolutely dead-simple and
straightforward to implement in a RISC architecture.

which, if correct, would be extremely cool. can you see any issue
with the above (oversimple) implementation?

> >
> > i believe it would be perfectly reasonable to do exactly that, *even*
> > in an absolutely dead-simple micro-architecture with zero "in-flight"
> > capability, OoO renaming, or other complex infrastructure that's assumed
> > to be necessary.
>
> Yes, it scales down as far as one would build a RISC processor.

very cool.

> >
> > in other words, each element marked as "vectorised" is just... a scalar
> > register that "happens to be updated on each loop".
>
> That is one way to think of it.

i think the pseudo-code above (aside from being wasteful of
register space) makes it really clear what's going on under
the hood.

> But I think of the loop being vectorized, not the instructions--an
> important distinction, I believe.

if you look at the pseudocode above, the opposite is true, yet
i believe you'll find that it's compliant with the design
paradigm that you envisage.

and that makes the potential for implementation reach much greater
than previously thought. if correct, that is.


> > with that in mind, i have a hunch that it may be possible to extend
> > well beyond the current limit that is believed to be imposed on the
> > 66000 vectorisation system as envisaged, by simply falling back onto
> > scalar registers when the limitations of "in-flight" data are reached.
>
> Probably correct.

:)

so let's break it down, as i think it may be possible to not only
do infinite nested loop depth, i believe function calls may be possible
as well.

the 66000 ISA does the following 2 things in one opcode:

* set up the scalar-vector "tagging"
* initialise the loop

if those are conceptually separated, and also joined by the following:

* push scalar-vector "tag" state onto stack
* pop scalar-vector "tag" state from stack
* push current loop state onto stack
(including N-wide part-execution status,
only relevant for trap handlers!)
* pop loop state off of stack into "current"

i *believe* that provides sufficient coverage not only to do
infinite nested loop depth and function calls: it also provides
the basic primitives necessary to do context-switching, and
greatly simplifies (and makes really straightforward) any
exception trap implementations as well (in an explicit form).
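
a very rough sketch of how those four primitives might look from the
implementation side (pure illustration: names invented, state heavily
simplified, and the N-wide part-execution status reduced to a single
index):

tag_stack  = []    # stack of scalar/vector "tag" snapshots
loop_stack = []    # stack of (loop index, sub-element index) state

def push_tags(reg_vectorised):
    tag_stack.append(list(reg_vectorised))    # snapshot the current tags

def pop_tags():
    return tag_stack.pop()                    # restore on return / trap exit

def push_loop(loop_idx, subelement_idx):
    loop_stack.append((loop_idx, subelement_idx))

def pop_loop():
    return loop_stack.pop()

a function call (or trap entry) would push both, set up its own tags,
and pop them again on the way out.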

at the moment a 66000 implementation will have to do the above
push/pop of state, anyway, in order to handle interrupts
and debugging (as you say): the hardware definitely has to be there.

so... why not provide an instruction front-end on them, make them
explicit, have the traps and OS call the instructions explicitly,
and as icing on the cake be able to use the exact same instructions
to do infinite nested loop depth and infinite arbitrary function
call depth as well?

what do you think?

l.

lk...@lkcl.net

unread,
Nov 8, 2018, 6:36:55 AM11/8/18
to
On Thursday, November 8, 2018 at 5:01:07 AM UTC, jim.bra...@ieee.org wrote:

> ]>Perhaps off topic
>
> Would like to add to my comments:

not at all, jim: whilst it may look like i know what i'm doing,
i'm actually clueless and winging it. some genuine use-cases
from people who actually do a lot of vector and scientific
work are a huge relief.

below, the following stand out for me:

> In this context the employee array is not a vector of structs,
> but a struct of vectors.

ok so from this, i'm getting that being able to handle arrays of pointers
to data rather than just straight linear data is critical.

> This breaking of vector expressions into pieces and creating
> intermediate temporaries means that vector code loops will typically
> be relatively short using a minimum of temporaries.

that's _particularly_ interesting, given that the implications are
that much more data (the intermediates) would typically have to
go all the way out to memory and back than would otherwise be
the case.

> Thus IMHO compiler architecture for vector processing will always
> strive to maximize memory bandwidth utilization first
> (perhaps bypassing 1st & 2nd level caches)

yes, if you have a massively large array, and yet still have to
use a chain of temporaries, 1st and even 2nd level data caches are
pretty much completely useless.

fascinating.

> and secondly minimize vector temporaries (as creating them
> and accessing temporaries entails additional memory bandwidth
> utilization).

i mentioned it a couple times already, it's worth reiterating:
the Vulkan3D API basically hands you a huge set of arrays where
if you do not have enough registers to handle the complete
workload, *or* if you do not have a means to "remap" the registers,
you absolutely must put some of the data out through LOAD/STORE.

we had to increase the Libre RISCV register count to a whopping
128 64-bit FP and INT, just to cover the lowly goal of 1280x720 @ 25fps
3D resolution. staggering.

> Lowest hanging fruit (optimization for contiguous vector data):
> Memory move.

ok. so there are three types of LOAD/STORE paradigms that i'm
aware of, from hwacha and early drafts of RVV (a rough sketch of
each follows the list):

* LD   - straight contiguous loads of elwidth bits wide, no gaps
* LD.C - strided loads of elwidth bits wide, consecutive elements
         offset by a constant C bytes
* LD.X - an array of *addresses* is passed in, each element is
         loaded from the corresponding address in the *array*
         of addresses.
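
here's the rough scalar-equivalent sketch promised above (my own
illustration: "mem" is just an array of elements, so strides and
addresses are counted in elements rather than bytes):

def vec_ld(mem, base, n):               # LD: unit-stride, no gaps
    return [mem[base + i] for i in range(n)]

def vec_ld_c(mem, base, stride, n):     # LD.C: constant stride between elements
    return [mem[base + i*stride] for i in range(n)]

def vec_ld_x(mem, addrs):               # LD.X: indexed / gather
    return [mem[a] for a in addrs]

(and LD.X covers LD.C simply by passing
addrs = [base + i*stride for i in range(n)].)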

in SV i managed to fit straight LD and also LD.X into one of
the opcodes (after noting that LD.X could be used to cover
LD.C). twin-predication was added on top, given that there
is a source register and a destination register.

i believe that LD.X covers the array-of-pointers case you describe
above, jim. was there anything more specific? RISC-V doesn't
actually have a (scalar) memory-to-memory instruction, it's kinda
assumed that you'd use a memory-to-register-to-memory sequence.

what did you have in mind?

l.

lk...@lkcl.net

unread,
Nov 8, 2018, 6:39:31 AM11/8/18
to
On Thursday, November 8, 2018 at 6:56:34 AM UTC, lk...@lkcl.net wrote:

> uint64_t regs[2][32];
> bool reg_vectorised[32];
> int loop_idx; # a global
>
> read_reg(reg_idx):
>     if !reg_vectorised[reg_idx]:
>         el_idx = 0
>     else:
>         el_idx = loop_idx
>     return regs[el_idx][reg_idx]
>
> and a corresponding version for write_reg. now an operation becomes
> a simple for-loop:
>
> for loop_idx in range(2):    # yes, set the global loop_idx
>     operand1 = read_reg(OP1)
>     operand2 = read_reg(OP2)
>     result = op(operand1, operand2)
>     write_reg(DEST, result)
>
> more complex variants of this simple in-order implementation concept
> may use a CAM instead of wasting huge amounts of register file, or...
> various other tricks: mild register-renaming and so on.
>
> i see absolutely no need to pre-analyse, reserve, delay, or anything
> which would prevent this from being absolutely dead-simple and
> straightforward to implement in a RISC architecture.

sorry: i meant to say, "dead simple brain-dead micro-architecture
with no frills, bells or whistles". some RISC micro-architectures are
*really* complex.

l.

jim.bra...@ieee.org

unread,
Nov 8, 2018, 10:27:19 AM11/8/18
to
]> some genuine use-cases from people who actually do a lot of vector and
]> scientific work
I used Numpy and IDL for several years but am not fully expert:
Numpy and SciPy are open software and have conferences. The Julia language is also headed in this direction. A sizable number of vector "primitive" routines are implemented and optimized (every time the dimension of an array increases, the number of primitive functions seems to increase by an order of magnitude). There are probably two classes of users of vector/array machines, those that use vectors/arrays as one uses a pocket calculator and those that fully optimize for a particular machine.

]> ok so from this, i'm getting that being able to handle arrays of
]> pointers to data rather than just straight linear data is critical.
The way Numpy handles it is by flattening arrays to one dimensional and using the one dimensional subscript, not a pointer, so the index vector can index other same size arrays as well.

]> > be relatively short using a minimum of temporaries.
Temporary vectors that is

]> we had to increase the Libre RISCV register count
An unfortunate consequence of being a vector machine?

]> RISC-V doesn't actually have a (scalar) memory-to-memory instruction,
]> it's kinda assumed that you'd use a memory-to-register-to-memory
]> sequence.
Wish I knew the statistics on vector moves. Obviously one can not move a complete vector into a register or register buffer. So this is one area where RISC philosophy has a problem?

Jim Brakefield

MitchAlsup

unread,
Nov 8, 2018, 12:40:15 PM11/8/18
to
On Wednesday, November 7, 2018 at 11:01:07 PM UTC-6, jim.bra...@ieee.org wrote:

> Would like to add to my comments:
>
> Possible constraints on vector hardware architecture:
> (eg an opportunity for hardware design optimization)
>
> Rant on:
> If you have never programmed at the array level (Matlab, Numerical Python, IDL, etc); it is different. You try to do as much as possible with vector/array operations. Call it "programming in the large". One uses a variety of helper functions not seen in "programming in the small":
> For instance the sort routine returns a vector of array indices. Sort does not move any of the items to be sorted. For instance to find the ten best employees one forms a sort criteria and does the sort using this criteria. One then chops the returned array indices to length ten and forms a new vector of the selected array elements. In this context the employee array is not a vector of structs, but a struct of vectors.
>
> Say there is a vector expression and an assignment. It includes a transcendental evaluation. The expression evaluator normally generates machine code loop for the entire vector expression. In this case the evaluator is likely to do the transcendental separately (call the transcendental with a vector) and send the result to a temporary vector/array. This breaking of vector expressions into pieces and creating intermediate temporaries means that vector code loops will typically be relatively short using a minimum of temporaries.
>
> Thus IMHO compiler architecture for vector processing will always strive to maximize memory bandwidth utilization first (perhaps bypassing 1st & 2nd level caches) and secondly minimize vector temporaries (as creating them and accessing temporaries entails additional memory bandwidth utilization).

Any real vector machine will be memory limited primarily because it is easier
to add function units than memory ports. So, if machine k started off and was
function unit limited, throwing a few function units at it would convert it
into being memory limited.
>
> End of rant
>
>
> Lowest hanging fruit (optimization for contiguous vector data): Memory move.

Page Move, Page zero--used by OS mainly--should be a single command from CPU and performed entirely at DRAM controller.

lk...@lkcl.net

unread,
Nov 8, 2018, 12:55:03 PM11/8/18
to
On Thursday, November 8, 2018 at 5:40:15 PM UTC, MitchAlsup wrote:

> Any real vector machine will be memory limited primarily because it is easier
> to add function units than memory ports. So, if machine k started off and was
> function unit limited, throwing a few function units at it would convert it
> into being memory limited.

you should have seen what we had to do for the Aspex Array-String Processor
(ASP). A DMA engine provided a constant-time data load (into the memory
of each of a massive 4096-long array of SIMD processors with a 2-bit ALU).

as the 2-bit ALUs were chained together and had an opcode which could
be used as a "carry" between processors, it was possible to either:

* use 16 processors to perform one 32-bit add
* use 8 processors to perform two 32-bit adds in 2x the time
* use 4 processors to perform 4 32-bit adds in 4x the time
* use 2 processors to perform 8 32-bit adds in 8x the time
* use 1 processor to perform 16 32-bit adds in 16x the time

therefore... *deep breath*... we had to write *spreadsheets* to
calculate, for all possible (hand-written!!) assembly programs,
what the optimal balance between memory load/store time
and computation time was!

unsurprisingly, programming the ASP was measured in DAYS per line of
[assembly] code.

innnsane... and also absolutely fascinating at the same time.

it kinda drummed home to me the scary balance that has to be made
between computation and memory management when it comes to vector /
parallel processing.


> > Lowest hanging fruit (optimization for contiguous vector data): Memory move.
>
> Page Move, Page zero--used by OS mainly--should be a single command from CPU and performed entirely at DRAM controller.

agreed, for contiguous blocks.

i believe jim may have been referring to much more fragmented
"cherry-picking": pointers to data structures, structs with
lots of regular data where you need to get access to data
at regular offsets, that sort of thing.

would you be able to clarify, here, jim?

l.

Stephen Fuld

unread,
Nov 8, 2018, 1:09:58 PM11/8/18
to
On 11/8/2018 9:40 AM, MitchAlsup wrote:

snip

> Any real vector machine will be memory limited primarily because it is easier
> to add function units than memory ports.

Is the limit due to the need to add ports to the cache or to add more
pins to the chip to provide more DRAM accesses, or something else?


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

lk...@lkcl.net

unread,
Nov 8, 2018, 1:33:27 PM11/8/18
to
On Thursday, November 8, 2018 at 6:09:58 PM UTC, Stephen Fuld wrote:
> On 11/8/2018 9:40 AM, MitchAlsup wrote:
>
> snip
>
> > Any real vector machine will be memory limited primarily because it is easier
> > to add function units than memory ports.
>
> Is the limit due to the need to add ports to the cache or to add more
> pins to the chip to provide more DRAM accesses, or something else?

i was going to say it's the entire chain right through register
memory ports (NxRead,MxWrite, usually N = 2M), L1, L2 and
DRAM, however it's application dependent.

also, some vector processors i heard of actually had their own
separate external SRAM/DRAM bus *just* for the register file,
which complicates matters.

taking a ridiculously simple example of an array of 100 million
integer values and you just want to add one to all of them.
naturally we might think, "oh, everything will run at the speed
of memory as that's the main bottleneck".

however if the register file is single-ported write (the usual
porting is 2R1W), then it doesn't actually matter how wide
the vector unit is: there will also be a logjam getting the
data out and back into the register file.

once you've beefed up the register file to at least N*2 reads N writes,
where N is the width of the vector unit (only N*2 R, N W wide has
the possibility for 3-operand instructions to cause a pipeline stall),
the next bottleneck is going to be the data lanes to and the bandwidth
of the L1 cache. after that, L2 cache, and so on.
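
as a rough back-of-the-envelope (numbers purely illustrative), the
register file traffic alone for an N-wide 64-bit vector unit is
already substantial:

N            = 4        # vector unit width, in 64-bit lanes (assumed)
f_ghz        = 1.0      # clock (assumed)
bytes_per_op = 8
rf_read_gb_s  = 2*N * bytes_per_op * f_ghz    # = 64 GB/s of reads
rf_write_gb_s =   N * bytes_per_op * f_ghz    # = 32 GB/s of writes

and every one of those bytes still has to come from (and go back to)
the L1/L2/DRAM chain somewhere.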

anywhere you fail to improve the throughput of the full chain,
you get a bottleneck. the problem is: the improvements don't
actually provide a corresponding linear increase in *scalar*
performance, for the actual area needed to increase the throughput.

i don't know if you've ever seen a floorplan of a RISC processor
or not: the actual main core occupies a tiny corner (0.5%),
the I/O pads are about the same size as the core (!!!), the L1
caches occupy 4-5% each...

... oh and then there's this MASSIVE lane down the middle of the
die, occupying a whopping 25% to 30% of the total area, for the
memory address and data bus.

that's what i saw for an ARC processor that had extreme SIMD
VPU processing capacity.

l.

Stefan Monnier

unread,
Nov 8, 2018, 2:37:31 PM11/8/18
to
> What I have in mind is not a machine with load-op-store vector
> instructions; instead, I think about generalized chains as
> architectural feature: A generalized chain (mesh) is a set of vector
> operations with some vector-source operations (in particular, loads),
> some ALU operations, and some vector-sink operations (stores and
> reductions).

Sounds like it's basically the same as what Mitch is describing (where
his "mesh" is encoded as a sequence of scalar instructions).


Stefan

MitchAlsup

unread,
Nov 8, 2018, 5:06:11 PM11/8/18
to
On Thursday, November 8, 2018 at 12:09:58 PM UTC-6, Stephen Fuld wrote:
> On 11/8/2018 9:40 AM, MitchAlsup wrote:
>
> snip
>
> > Any real vector machine will be memory limited primarily because it is easier
> > to add function units than memory ports.
>
> Is the limit due to the need to add ports to the cache or to add more
> pins to the chip to provide more DRAM accesses, or something else?

Function units are independent: you can ship an FMUL instruction to ANY FMUL
unit.

Memory units are necessarily banked, each bank servicing a unique portion
of the physical address space. You can only ship a known address to the
proper memory unit. It can take any reasonable route to get there and back
but it has to be serviced by the memory unit holding the data.
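
For illustration only (the bank count and interleave granule are just
assumed numbers), the mapping is a fixed function of the physical
address:

NBANKS = 8
LINE   = 64                  # bytes per interleave granule (assumed)

def bank_of(addr):
    # a fixed slice of the physical address selects the bank, so a
    # given address can only ever be serviced by that one memory unit
    return (addr // LINE) % NBANKS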

MitchAlsup

unread,
Nov 8, 2018, 5:11:26 PM11/8/18
to
On Thursday, November 8, 2018 at 12:33:27 PM UTC-6, lk...@lkcl.net wrote:
> On Thursday, November 8, 2018 at 6:09:58 PM UTC, Stephen Fuld wrote:
> > On 11/8/2018 9:40 AM, MitchAlsup wrote:
> >
> > snip
> >
> > > Any real vector machine will be memory limited primarily because it is easier
> > > to add function units than memory ports.
> >
> > Is the limit due to the need to add ports to the cache or to add more
> > pins to the chip to provide more DRAM accesses, or something else?
>
> i was going to say it's the entire chain right through register
> memory ports (NxRead,MxWrite, usually N = 2M), L1, L2 and
> DRAM, however it's application dependent.
>
> also, some vector processors i heard of actually had their own
> separate external SRAM/DRAM bus *just* for the register file,
> which complicates matters.
>
> taking a ridiculously simple example of an array of 100 million
> integer values and you just want to add one to all of them.
> naturally we might think, "oh, everything will run at the speed
> of memory as that's the main bottleneck".
>
> however if the register file is single-ported write (the usual
> porting is 2R1W)

2R/1W per Scalar/Vector Width.

> , then it doesn't actually matter how wide
> the vector unit is: there will also be a logjam getting the
> data out and back into the register file.

You are making the assumption that the value has to go THROUGH the register.
The value can go to the function unit directly by forwarding without, or at
least prior to, being written into the RF.
>
> once you've beefed up the register file to at least N*2 reads N writes,
> where N is the width of the vector unit (only N*2 R, N W wide has
> the possibility for 3-operand instructions to cause a pipeline stall),

Danger Will Robinson:: FMAC is seen as up to 40% of FP activities.

MitchAlsup

unread,
Nov 8, 2018, 5:14:54 PM11/8/18
to
On Monday, November 5, 2018 at 8:50:10 PM UTC-6, jim.bra...@ieee.org wrote:
> On Monday, November 5, 2018 at 8:00:16 PM UTC-6, Stephen Fuld wrote:
> > > Perhaps off topic
> > >
> > > Would like to introduce an alternative view:
> > > Reserve n registers in the register file as a "barrel"
> > > Vector results go into the barrel one after another.
> > > "n" is large enough to hold all the intermediate results for one loop iteration.
> >
> > This has the advantage of providing a place to hold "partial values" for
> > loops that e.g. sum the elements of a vector, do a dot product,or even
> > compute a checksum.
> >
> > But it has the significant disadvantage of losing one of the nicest
> > features of Mitch's design, the code is identical independent of the
> > number of functional units i.e., no need to specify the number of
> > registers to use for the "barrel", and even potentially allows for an
> > implementation with more functional units than registers.
> >
> >
> > --
> > - Stephen Fuld
> > (e-mail address disguised to prevent spam)
>
> Need to plead inexperience WRT gate counts for all the different approaches.
> Associating a loop count with each operand solves the alignment over multiple loops. And Mitch's design has a lower register count than CRAY-1.
>
> When I see input reservation stations I think data flow. And wonder how it gets setup and torn down? Presumably no branches outside of the loop nor any subroutine calls within the loop?

Having given this some thought, there is no conceptual reason that CALL-RET
or If-Then-Else cannot be inside the LOOP.
>
> Am thinking any of the approaches needs to be opportunistic (grab low
> hanging fruit) when short operands are adjacent in memory.
>
> IMHO it becomes a question of where to place the "crossbar" and whether it can be time multiplexed to reduce the number of write ports?

MitchAlsup

unread,
Nov 8, 2018, 5:16:56 PM11/8/18
to
On Monday, November 5, 2018 at 9:42:34 PM UTC-6, lk...@lkcl.net wrote:
> On Tuesday, November 6, 2018 at 3:06:15 AM UTC, MitchAlsup wrote:
>
> > As far as scope, I would like to be able to vectorize great big loops
> > such as FPPPP, but that is simply impossible; as the loop is 8 KBytes
> > of instructions long (2K instructions) and bigger than any conceivable
> > set of RS entries.
>
> i love the idea of vectorisation being opaque: i am nervous about
> the impact the idea has on both microarchitectures (restricting
> implementors to not being able to use certain microarchitectures
> as the base) and on its understandability.
>
> > Looking smaller, I demanded of the mechanism to handle radix 2 FFT
> > (20 instructions plus loop overhead.) So, somewhere in the 16-64
> > instruction range is the scale of vectorizing this mechanism is
> > optimized for.
>
> ok, so this would be where RVV and SV have a distinct advantage, as:
>
> * in RVV the vector register file is separate and therefore
> may be passed as parameters to functions at an unrestricted
> stack depth

In VVM, one does not need vector functions, one simply calls the scalar
function inside a vector LOOP. The instructions of the function are
required to fit in the size of the LOOP instruction buffering mechanism.

>
> * in SV, registers may be "tagged" and left that way, at the
> programmer's discretion, such that the same effect is
> achieved.
>
> so both have no such limit on the number of instructions that
> may be executed using vectorised registers, nor on the stack
> depth.
>
> this does mean that it is necessary to save/restore registers
> and associated state on entry/exit to function calls (and
> on context-switches). this being nothing new and done all
> the time in scalar programs, the only question being how
> much in terms of power/computing-time resources it takes to
> do so.
>
> l.

Stephen Fuld

unread,
Nov 8, 2018, 5:31:10 PM11/8/18
to
On 11/8/2018 10:33 AM, lk...@lkcl.net wrote:
> On Thursday, November 8, 2018 at 6:09:58 PM UTC, Stephen Fuld wrote:
>> On 11/8/2018 9:40 AM, MitchAlsup wrote:
>>
>> snip
>>
>>> Any real vector machine will be memory limited primarily because it is easier
>>> to add function units than memory ports.
>>
>> Is the limit due to the need to add ports to the cache or to add more
>> pins to the chip to provide more DRAM accesses, or something else?
>
> i was going to say it's the entire chain right through register
> memory ports (NxRead,MxWrite, usually N = 2M), L1, L2 and
> DRAM, however it's application dependent.
>
> also, some vector processors i heard of actually had their own
> separate external SRAM/DRAM bus *just* for the register file,
> which complicates matters.
>
> taking a ridiculously simple example of an array of 100 million
> integer values and you just want to add one to all of them.
> naturally we might think, "oh, everything will run at the speed
> of memory as that's the main bottleneck".
>
> however if the register file is single-ported write (the usual
> porting is 2R1W), then it doesn't actually matter how wide
> the vector unit is: there will also be a logjam getting the
> data out and back into the register file.

But note that in Mitch's design, the values that are loaded/stored in
virtual vector mode are, in general, not actually written to the
registers, but are forwarded to other functional units within the core.

> once you've beefed up the register file to at least N*2 reads N writes,
> where N is the width of the vector unit (only N*2 R, N W wide has
> the possibility for 3-operand instructions to cause a pipeline stall),
> the next bottleneck is going to be the data lanes to and the bandwidth
> of the L1 cache. after that, L2 cache, and so on.

I guess so - that was the essence of my question. But perhaps for
virtual vector mode, you should be using uncached loads???

But perhaps this is pushing Mitch's design beyond where it is reasonable
to go???

jim.bra...@ieee.org

unread,
Nov 8, 2018, 7:30:46 PM11/8/18
to
]> i believe jim may have been referring to much more fragmented
]> "cherry-picking": pointers to data structures, structs with
]> lots of regular data where you need to get access to data
]> at regular offsets, that sort of thing.

Consider use of multimedia instructions on wide registers as an optimization for greater throughput. And memory move as something that is easy to optimize (eg improve performance).

The issue with structs is full of complications. By converting to a struct of vectors it is possible to improve the performance on Boolean and byte struct elements considerably (eight bytes to a word, 64 Booleans to a word). The various RISC approaches can often load an entire struct into registers and rapidly do operations thereon. So no clear win for struct of vectors over vector of structs.

Mitch's memory move thinking:
If bypassing L1 and L2 when accessing vectors, then one would want to align vectors on "page" boundaries as much as possible. Some loss of memory density between adjacent vectors which is small potatoes.

What probably needs to be done is to rewrite the benchmarks into NumPy and see where there are bottlenecks (also look at http://numba.pydata.org/)?

jim.bra...@ieee.org

unread,
Nov 8, 2018, 7:42:23 PM11/8/18
to
]> Having given this some thought, there is no conceptual reason that
]> CALL-RET or If-Then-Else cannot be inside the LOOP.

Ugh, all depends on the number of function units and their setup? If the subroutine can be pulled into the loop and become part of it? Otherwise must revert to scalar mode? Would it not be simpler to treat the vector setup as one very large VLIW instruction?

Ugh, need to study the implementation details.

Jim Brakefield

lk...@lkcl.net

unread,
Nov 8, 2018, 8:29:33 PM11/8/18
to
On Thursday, November 8, 2018 at 10:16:56 PM UTC, MitchAlsup wrote:

> In VVM, one does not need vector functions, one simply calls the scalar
> function inside a vector LOOP. The instructions of the function are
> required to fit in the size of the LOOP instruction buffering mechanism.

what if the function parameters happen to all be marked as "vectorised"?
there is an opportunity, there, to call the function once and continue
as if it was inline?

l.

lk...@lkcl.net

unread,
Nov 8, 2018, 8:51:22 PM11/8/18
to
On Thursday, November 8, 2018 at 10:31:10 PM UTC, Stephen Fuld wrote:

> > however if the register file is single-ported write (the usual
> > porting is 2R1W), then it doesn't actually matter how wide
> > the vector unit is: there will also be a logjam getting the
> > data out and back into the register file.
>
> But note that in Mitch's design, the values that are loaded/stored in
> virtual vector mode are, in general, not actually written to the
> registers, but are forwarded to other functional units within the core.

i believe that this is a kind of cacheing for registers (an
additional optimisation), and have seen something like it,
described here:

https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

so when talking about register data being "in flight" - to use
the term "RF" i recall you using, mitch (?), i believe it's a
type of register cache design.

with the addition of register renaming on top of a register cache,
i believe it becomes possible to achieve direct functional
equivalence of the "forwarding" that you refer to.


i have a design concept in mind that is based on the above paper.
the way i perceive that design (figure 4b) is that it looks remarkably
similar to an L1/L2 cache (except with registers instead of memory).

so it got me thinking, um... why not *actually* have an SMP-coherent
"Bank1 per ALU" with associated coherency lines, and a global "Bank2"
corresponding to an L2 cache?

the number of entries would be tiny in Bank1: only 6-8 registers,
as any more than that would start to become problematic on clashes.
i have absolutely no idea how many would be needed in the global
Bank2.

note that there would be absolutely no communication whatsoever between
the Bank1/Bank2 reg-caches of SMP *cores*. each SMP cores' Bank1/Bank2
reg-cache would be completely independent.

the concept's definitely at the "hmmmm" phase :)

l.

lk...@lkcl.net

unread,
Nov 8, 2018, 8:54:43 PM11/8/18
to
On Thursday, November 8, 2018 at 10:11:26 PM UTC, MitchAlsup wrote:

> > , then it doesn't actually matter how wide
> > the vector unit is: there will also be a logjam getting the
> > data out and back into the register file.
>
> You are making the assumption that the value has to go THROUGH the register.
> The value can go to the function unit directly by forwarding without, or at
> least prior to, being written into the RF.

with different experience(s) and terminology, i'm slowly getting a
handle on where you're coming from, mitch. does the concept of
"forwarding" look anything like "register cacheing" that's described
in the paper i referenced in the reply to stephen?

> >
> > once you've beefed up the register file to at least N*2 reads N writes,
> > where N is the width of the vector unit (only N*2 R, N W wide has
> > the possibility for 3-operand instructions to cause a pipeline stall),
>
> Danger Will Robinson:: FMAC is seen as up to 40% of FP activities.

appreciate the heads-up.

MitchAlsup

unread,
Nov 8, 2018, 9:04:35 PM11/8/18
to
If the vectorised loop (including any called function) is longer than the amount of instruction buffering available, some piece of SW in the tool path needs to avoid vectorizing.

Also note:: everything I have written is for a Virtual Vector Machine
architecture which needs to be implementable from the multi-clock per
instructions tiny machines, through the 1-wide scalar machines (think
1st gen RISC), though 10-wide, trace cached monster machines.

The implementation details of a few of these (6-wide packetized ICache)
machines have been used as reference points (thick reservation stations
in front of each function unit, complete renaming, branch predictions,...)

These are not the only possible embodiments, and the architectural spec
needs to avoid handicapping implementations while explaining the concept
with sufficient clarity that lots of implementations are possible.

MitchAlsup

unread,
Nov 8, 2018, 9:10:33 PM11/8/18
to
Consider a machine with only 64-bit registers and the Virtual Vector Mechanism.
A function unit can easily perform {8B,4H,2W,1D} SIMD calculations with no need
to grow registers to 128-bits (loop twice), 256-bits (loop 4 times), or 512-bits (loop 8 times).
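
For instance (an illustrative sketch of the sub-word idea, not
anything from the spec), eight byte-lane adds fit in one pass through
a 64-bit ALU, and what AVX would do in a 128-bit register is simply a
loop of two such passes:

def add_8x8(a, b):
    # eight independent byte-wide adds inside one 64-bit word: mask off
    # each lane's MSB so carries cannot cross lane boundaries, then
    # restore the MSBs with an xor
    H = 0x8080808080808080
    L = 0x7F7F7F7F7F7F7F7F
    return ((a & L) + (b & L)) ^ ((a ^ b) & H)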

At some scale in CPU design, there can be 4 (or 8) lanes of vector function
units, so the combined effect is AVX performance without the massive growth
in ISA. VVM effectively replaces 600 instructions with 2; and minimizes
restrictions on lesser members of an architectural family; without impeding
the performance of the highest members of that family.
>
> The issue with structs is full of complications. By converting to struct of vectors is is possible to improve the performance on Boolean and byte struct elements considerably (eight bytes to a word, 64 Booleans to a word). The various RISC approaches can often load an entire struct into registers and rapidly do operations thereon. So no clear win for struct of vectors over vector of structs.
>
> Mitch's memory move thinking:
> If bypassing L1 and L2 when accessing vectors, then one would want to align vectors on "page" boundaries as much as possible. Some loss of memory density between adjacent vectors which is small potatoes.

That is a question for SW. For HW, cache-line alignment is sufficient.

MitchAlsup

unread,
Nov 8, 2018, 9:14:22 PM11/8/18
to
The values flow through the calculation on forwarding paths and in the
meat of the vector, they arrive from memory, get mangled in the FUs,
and get written back to memory without ever touching the register file.

>
> > once you've beefed up the register file to at least N*2 reads N writes,
> > where N is the width of the vector unit (only N*2 R, N W wide has
> > the possibility for 3-operand instructions to cause a pipeline stall),
> > the next bottleneck is going to be the data lanes to and the bandwidth
> > of the L1 cache. after that, L2 cache, and so on.
>
> I guess so - that was the essence of my question. But perhaps for
> virtual vector mode, you should be using uncached loads???

For LDs that hit the cache, go ahead and use the data. For loads that miss
the cache, get the data from the Miss buffer multiple times, and then
discard to avoid thrashing the cache.
>
> But perhaps this is pushing Mitch's design beyond where it is reasonable
> to go???

We won't know the answer to that until someone pushes.......

MitchAlsup

unread,
Nov 8, 2018, 9:22:16 PM11/8/18
to
On Thursday, November 8, 2018 at 7:51:22 PM UTC-6, lk...@lkcl.net wrote:
> On Thursday, November 8, 2018 at 10:31:10 PM UTC, Stephen Fuld wrote:
>
> > > however if the register file is single-ported write (the usual
> > > porting is 2R1W), then it doesn't actually matter how wide
> > > the vector unit is: there will also be a logjam getting the
> > > data out and back into the register file.
> >
> > But note that in Mitch's design, the values that are loaded/stored in
> > virtual vector mode are, in general, not actually written to the
> > registers, but are forwarded to other functional units within the core.
>
> i believe that this is a kind of cacheing for registers (an
> additional optimisation), and have seen something like it,
> described here:
>
> https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf
>
> so when talking about register data being "in flight" - to use
> the term "RF" i recall you using, mitch (?), i believe it's a
> type of register cache design.

It is only a register cache if you consider the operand capture points of
a reservation station a register. I, personally, would not, as the Reservation
Stations are distributed in front of Function units, possibly quite far from the
register file itself.
>
> with the addition of register renaming on top of a register cache,
> i believe it becomes possible to achieve direct functional
> equivalence of the "forwarding" that you refer to.

Consider operand_name{ register_name, loop_counter[k,0] }. You are just running
data flow with a single renamed "name" and augmenting it with a counter (of
limited length).
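
Roughly (a minimal sketch of that pairing, nothing more; the 4-bit
counter width is an assumption):

from collections import namedtuple

# a renamed operand is just the architectural register name plus the
# loop iteration it belongs to (a counter of limited width)
Operand = namedtuple("Operand", ["register_name", "loop_counter"])

def rename(reg, iteration):
    return Operand(reg, iteration & 0xF)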

MitchAlsup

unread,
Nov 8, 2018, 9:24:29 PM11/8/18
to
On Thursday, November 8, 2018 at 7:54:43 PM UTC-6, lk...@lkcl.net wrote:
> On Thursday, November 8, 2018 at 10:11:26 PM UTC, MitchAlsup wrote:
>
> > > , then it doesn't actually matter how wide
> > > the vector unit is: there will also be a logjam getting the
> > > data out and back into the register file.
> >
> > You are making the assumption that the value has to go THROUGH the register.
> > The value can go to the function unit directly by forwarding without, or at
> > least prior to, being written into the RF.
>
> with different experience(s) and terminology, i'm slowly getting a
> handle on where you're coming from, mitch. does the concept of
> "forwarding" look anything like "register cacheing" that's described
> in the paper i referenced in the reply to stephen?

Not really, it is much more like operand capture in reservation stations.
In particular it is a decentralized snarf mechanism to capture the data
as it is broadcast, and hold onto it until it can be used. The stations
are intimately associated with function units, and are far from the
decode section of the core.

lk...@lkcl.net

unread,
Nov 8, 2018, 9:28:34 PM11/8/18
to
On Friday, November 9, 2018 at 2:24:29 AM UTC, MitchAlsup wrote:

> > with different experience(s) and terminology, i'm slowly getting a
> > handle on where you're coming from, mitch. does the concept of
> > "forwarding" look anything like "register cacheing" that's described
> > in the paper i referenced in the reply to stephen?
>
> Not really, it is much more like operand capture in reservation stations.
> In particular is is a decentralized snarf mechanism to capture the data
> as it is broadcast, and hold onto it until it can be used. The stations
> are intimately associated with function units, and are far from the
> decode section of the core.

mmm *stress!* :) do you have any kind of high-level functional
diagrams which outline this?

l.

lk...@lkcl.net

unread,
Nov 9, 2018, 7:26:01 AM11/9/18
to
On Friday, November 9, 2018 at 2:24:29 AM UTC, MitchAlsup wrote:
> On Thursday, November 8, 2018 at 7:54:43 PM UTC-6, lk...@lkcl.net wrote:
> > On Thursday, November 8, 2018 at 10:11:26 PM UTC, MitchAlsup wrote:
> >
> > > > , then it doesn't actually matter how wide
> > > > the vector unit is: there will also be a logjam getting the
> > > > data out and back into the register file.
> > >
> > > You are making the assumption that the value has to go THROUGH the register.
> > > The value can go to the function unit directly by forwarding without, or at
> > > least prior to, being written into the RF.
> >
> > with different experience(s) and terminology, i'm slowly getting a
> > handle on where you're coming from, mitch. does the concept of
> > "forwarding" look anything like "register cacheing" that's described
> > in the paper i referenced in the reply to stephen?
>
> Not really, it is much more like operand capture in reservation stations.
> In particular it is a decentralized snarf mechanism to capture the data
> as it is broadcast, and hold onto it until it can be used. The stations
> are intimately associated with function units, and are far from the
> decode section of the core.

so i thought about this overnight, and i believe we may be
miscommunicating. when i said, "is the register cacheing
(including register renaming) similar" i was referring to
the scalar operations (alone).

the *full* question should have clarified (apologies) by
including the [naive] pseudo-code that i wrote out earlier, where
there are effectively duplicate register file entries for registers
marked as "vectorised":

lanes = 2 # number of parallel (vector) lanes
num_regs = 32 # number of standard (scalar) registers
uint64_t int_regfile[lanes][num_regs]
bool reg_is_vectorised[num_regs]

when it comes to what you term as "in-flight" (or "broadcasting")
and it is time to issue a "rename" as the register contents go
into the register cache, i believe it is as simple as including
the *lane* number as part of the rename.

the clue to the above algorithm may be in what you said earlier:

> I agree with you to the point of filling up the reservation stations. I mean
> after all, the calculations are connected by renamed registers (GBOoO model)
> and adding a count to the rename is straightforward. But only so long as
> everything fits in the stations.

i am still unclear as to what the terminology of "stations" and
"reservation stations" means: they are words that i have never encountered
before in microarchitectural design: i *believe* they may be understood
as being a combination of lane number *plus* register number, with or
without register-renaming.

am i along the right track, yet?

l.

Anton Ertl

unread,
Nov 9, 2018, 11:51:46 AM11/9/18
to
Not at all. Vector operations of a mesh can occur interspersed with
scalar code and, in particular, control flow. Meshes are woven from
the vector operations at run-time; at least that's the ideal, the
necessity to limit meshes to a bounded number of operations reduces
the flexibility. The vector operations of a mesh must have no memory
dependences (stores must not overlap loads, or they must be to the
exact same address and length). It is relatively easy to make wide
implementations. The programming model behind this is manual
vectorization.

By contrast, Mitch Alsup describes a loop of scalar operations that
works like a loop of scalar operations (including for memory
dependences). This has the advantage of making compilation to this
model from scalar code easy, but the disadvantage of making wide
implementations hard.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

MitchAlsup

unread,
Nov 9, 2018, 1:10:14 PM11/9/18
to
Reservation stations go back to 360/91 and are THE key development enabling
great big out of order processing. Google up "An Efficient Algorithm for
Exploiting Multiple Arithmetic Units" (Tomasulo, 1967).

A reservation station entry has a place to store the instruction, and to store
each operand (in the modern context, it may also store condition codes.)
Once all needed operands have arrived, the instruction is free to execute.

A function unit performs calculations and is attached to a result delivery
mechanism (called the common data bus in Tomasulo; we use multiple common data
busses in modern machines).

A reservation station has multiple entries (typically 5-16) and each entry
can hold an instruction. The stations are topologically adjacent to the
function unit they service, and each entry monitors all result busses so
that when the result they are waiting for is delivered, they can snarf the
result as an operand.

The reservation stations are topologically far from the register file.
The decode process reads the register file while decoding the instruction.
And delivers the instruction and read registers to the reservation station.
There are at least 3 ways for the decode/Issue process to decide if the
value from the register file is present or "in flight". When an instruction
is inserted into the reservation station, in-flight registers are marked
"not ready". When the tag of the register is broadcast (1 or 2 cycles before
the data) the station entries matching that tag all capture the broadcast
result, and set that operand to the ready state.

When all operands are present, a picker chooses a station entry to enter
execution and launches the instruction into execution.

This is essentially how all large machines since about 1990 have been built.
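
A rough C sketch of a single station entry (sizes and field names are
illustrative only, not any particular machine):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     busy;        /* entry currently holds an instruction         */
        uint8_t  opcode;
        uint8_t  src_tag[2];  /* result-bus tag each missing operand waits on */
        bool     ready[2];    /* operand captured (or read at decode/issue)   */
        uint64_t value[2];    /* captured operand values                      */
    } rs_entry;

    /* every entry monitors every result bus; on a tag match it snarfs the data */
    void snoop(rs_entry *e, uint8_t bus_tag, uint64_t bus_data)
    {
        for (int i = 0; i < 2; i++)
            if (e->busy && !e->ready[i] && e->src_tag[i] == bus_tag) {
                e->value[i] = bus_data;
                e->ready[i] = true;
            }
    }

    /* the picker may launch an entry into execution once all operands are present */
    bool can_issue(const rs_entry *e)
    {
        return e->busy && e->ready[0] && e->ready[1];
    }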

>
> am i along the right track, yet?

Perhaps an education. Go read Tomasulo.
>
> l.

Stephen Fuld

unread,
Nov 9, 2018, 1:17:51 PM11/9/18
to
Let me take a shot at clarifying things a little. Mitch, please check
if this is at least loosely correct.

Conceptually, at the output of each functional unit (FU), ALU, Load
unit, FP unit, etc. there is a "place". I don't want to call it a
register, as that will confuse things. The place holds the output of
the FU until it can be written to a register, or forwarded directly to
an input of another FU. There is no "tag" or count or "address" for the
place, as none is needed since each is identified with the FU whose
output it holds. They aren't really a cache for anything.

So, for example, if the next instruction is, say, another add with one of
its inputs being the output from the previous add, the results from the
previous add are sent ("forwarded") from the place to one of the inputs of
the ALU. This is standard OoO stuff, as it saves the cost of the
register read for the second add.

What VVM does is "synch" all the FUs of a given type (e.g. integer
ALUs), such that they are all doing, in parallel, the same operation on
different elements of the vector. Thus if you have 4 ALUs, and the
instruction being executed is an add, you are actually doing the add on
four vector elements. In VVM, this forwarding is done exactly the same
way, except that it is done for all four ALUs at the same time.

The nice thing is that the place is already there for non-VVM mode
operations. No additional registers, tags, addresses, etc. are needed
for vector operations.
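
As a toy software model of the "place" (purely illustrative; in hardware this
is just a latch and some multiplexing, not code):

    #include <stdint.h>

    uint64_t regfile[32];
    uint64_t place;                 /* output latch of the ALU */

    void add_from_rf(int rd, int rs1, int rs2)
    {
        place = regfile[rs1] + regfile[rs2];
        regfile[rd] = place;
    }

    /* a dependent add takes one input from the place, saving an RF read */
    void add_forwarded(int rd, int rs2)
    {
        place = place + regfile[rs2];
        regfile[rd] = place;
    }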

EricP

unread,
Nov 9, 2018, 2:56:19 PM11/9/18
to
lk...@lkcl.net wrote:
>
> with different experience(s) and terminology, i'm slowly getting a
> handle on where you're coming from, mitch. does the concept of
> "forwarding" look anything like "register cacheing" that's described
> in the paper i referenced in the reply to stephen?

Operand forwarding
https://en.wikipedia.org/wiki/Operand_forwarding

Topics related to this discussion:

https://en.wikipedia.org/wiki/Tomasulo_algorithm
https://en.wikipedia.org/wiki/Register_renaming
https://en.wikipedia.org/wiki/Reservation_station


MitchAlsup

unread,
Nov 9, 2018, 7:18:52 PM11/9/18
to
Stephen is essentially correct.

Some terminology to avoid confusion.
Dfn: register; a container that holds bits and is seen by software
Dfn: flip-flop; a container that holds bits and is seen only by hardware
and in particular a container that changes state on ONE
edge of the clock cycle (typically the rising edge.)
Flip-flops have the property that for an entire clock cycle they hold the
bit pattern.

Staging flip-flops define the stages of a pipeline and are invisible to
software. Software only sees architectural registers not the flip-flops
used to implement the hardware pipelines.

Hardware is made up of "stuff" sitting between flip-flops.

A register file is a multi-ported set of storage nodes, the register addresses
held in flip-flops during access, the read-out results held in flip-flops at
the output.

A function unit has flip-flops to capture the instructions and operands so
that a calculation can be performed. A function unit performs said calculation
and delivers its result to its output flip-flop.

A data path is a series of flip-flops used to transport data from one set
of flip-flops to another set of flip-flops. There is a data path that routes
register file data to function unit operand flip-flops. There is a data path
that routes function unit results flip-flops to register file write flip-flops.
There is forwarding logic that cross couples the result data path back to the operand data path.

And that is pretty much all a computer mechanic needs to know in order to
assemble any computer (less the control logic.)

---------------------------------------------------------------------------

+--------+  +--+                 +--+  +--------+  +--+                +--+  +---
|reg file|->|FF|-> operand bus ->|FF|->|Fctn Unt|->|FF|-> result bus ->|FF|->|reg
+--------+  +--+        ^        +--+  +--------+  +--+       |        +--+  +---
             1          |         2                 3         |         4
                        +-------------------------------------+

---------------------------------------------------------------------------

When flip-flop 1 has multiple entries, it becomes a register cache
When flip-flop 2 has multiple entries, it becomes a reservation station
When flip-flop 3 has multiple entries, it becomes a belt

jim.bra...@ieee.org

unread,
Nov 9, 2018, 9:01:46 PM11/9/18
to
Thanks Mitch:
]>|reg file|->|FF1|->operandbus->|FF2|->|Fctn Unt|->|FF3|->resultbus->|FF4|->|reg
]> When flip-flop 1 has multiple entries, it becomes a register cache
]> When flip-flop 2 has multiple entries, it becomes a reservation station
]> When flip-flop 3 has multiple entries, it becomes a belt
Can quibble over the terminology; in the context of high performance uP it is close to correct (generally applicable)?

Thanks Eric:
]>Operand forwarding
Before this thread shuts down, I would like to suggest two additional vector-mode functional units:

Multi-tap shift register
This is a functional unit with a single? input and possibly multiple outputs.

Fully pipelined CORDIC and/or divide & square-root
E.g. pipelined transcendentals

Both of these can have 20+ pipeline stages which conflicts with functional units of typically under 6 stages.
Still handy to have. See Livermore Loops.
The thinking is that if used carefully and cautiously there is a performance benefit?

Jim Brakefield

MitchAlsup

unread,
Nov 9, 2018, 9:40:34 PM11/9/18
to
I prefer general purpose extract and insert (bit manipulation) instructions.
I am willing to entertain Double input single output shift.
>
> Fully pipelined CORDIC and/or divide & square-root
> E.g. pipelined transcendentals

I (will soon) own the IP that enables constructing the easier transcendentals
that perform:
SIN: 19 cycles including argument reduction
COS: 19 cycles including argument reduction
Ln2: 14 cycles
Ln: 17 cycles
EXP2: 14 cycles
EXP: 17 cycles

This stuff can be added to an FMUL unit that already performs Goldschmidt
FDIV and SQRT for the cost of about 2X growth in the seed tables. This is
about an 8% adder to the size of the FPU.

Like FDIV and SQRT these are unpipelined.
Unlike FDIV and SQRT the results are faithfully rounded but not correctly
rounded. Expect 0.5024 ULP error.

For the cost of 5-6X the FPU size, they can be fully pipelined.
>
> Both of these can have 20+ pipeline stages which conflicts with functional units of typically under 6 stages.
> Still handy to have. See Livermore Loops.
> The thinking is that if used carefully and cautiously there is a performance benefit?

Huge perf benefit.
>
> Jim Brakefield

jim.bra...@ieee.org

unread,
Nov 9, 2018, 9:58:15 PM11/9/18
to
]> I prefer general purpose extract and insert (bit manipulation) instructions.
Agree

Should have said: register size FIFOs
So one can obtain previous data or results without going to cache/memory.

lk...@lkcl.net

unread,
Nov 9, 2018, 11:32:40 PM11/9/18
to
On Friday, November 9, 2018 at 7:56:19 PM UTC, EricP wrote:
> lk...@lkcl.net wrote:
> >
> > with different experience(s) and terminology, i'm slowly getting a
> > handle on where you're coming from, mitch. does the concept of
> > "forwarding" look anything like "register cacheing" that's described
> > in the paper i referenced in the reply to stephen?
>
> Operand forwarding
> https://en.wikipedia.org/wiki/Operand_forwarding

oh! right! ok, so that's the name of the concept.
it's funny/strange (japanese word, "o-ka-shii-neh"),
to understand some of these concepts yet not know
the names in order to be able to communicate about
them with other people.

so, thank you eric.

so, the register cacheing concept would still
introduce extra clock cycles.

* the tomasulo algorithm (below) has the ALU
indicating that it's waiting for results to be
ready, such that it picks them up from another
ALU on "broadcast".

* the register cache concept doesn't do that,
however at least the results don't travel
all the way back to the main register file
(which may have reduced data ports compared
to the register file caches)

* mitch's microarchitecture has some advantage
over the tomasulo algorithm that means that
ALUs do not spend time "waiting"? synchronisation
of the chain-of-results?

lk...@lkcl.net

unread,
Nov 9, 2018, 11:49:56 PM11/9/18
to
On Friday, November 9, 2018 at 6:10:14 PM UTC, MitchAlsup wrote:

> > i am still unclear as to what the terminology of "stations" and
> > "reservation stations", they are words that i have never encountered
> > before in microarchitectural design: i *believe* they may be understood
> > as being a combination of lane number *plus* register number, with or
> > without register-renaming.
>
> Reservation stations go back to 360/91 and are THE key development enabling
> great big out of order processing. Google up "An efficient Algorithm for
> Exploiting Multiple Function Units".

ok, so thank you to everyone, mitch, stephen, erik, jim, for helping
to clarify here, and provide really valuable links and information.

so, allow me to go through some logical reasoning and outline some
context. the goal (the reason i created SV) is to allow a *small*
team to very quickly implement a full general-purpose SMP-based GPGPU
capable of achieving a modest goal of 5-6 GFLOPs, 100 MPixels/sec and
around 30MTriangles/sec, which roughly translates to around 1280x720
25fps, with an additional goal of being able to decode MP4/MPEG 1080p60.

the kicker: it has to be under 2-2.5 watts @ 28nm.

so we're not going to go at 4ghz, or even 3ghz, or even 2ghz: the target
is 4 SMP cores at only 800mhz, maybe 1ghz. power being a square law
based on speed, we will put in parallel ALUs to reach the target
performance.

given that over the past few years, Allwinner has sold hundreds of millions
of 900mhz $4 dual-core in-order "tablet" style processors, and introduced a
$3 64-bit in-order 1.5ghz core only a couple years ago, there's no
compelling financial reason to go all-out for speed, particularly when
doing so would require far more resources and far more risk than i'm
comfortable taking.

basically what i'm saying is, mitch: whilst other implementors *may*
choose an out-of-order microarchitectural implementation, it's an
optimisation that i can't take a risk on at this early phase, so am
going to skip it entirely unless someone comes along with the expertise
and wants to join the project.

now. i noted in one of the earlier messages that you were interested
to see it proven that the 66000 ISA can be implemented on simpler,
more straightforward, non-OoO microarchitectures.

with that in mind, would you be interested to continue the discussion
on this thread *without* referring to anything that involves out-of-order,
reservation stations, and so on?

by engaging with me and helping me to understand the 66000 vectorisation
*without* such optimisations, it will help open up the reach of the
66000 ISA by demonstrating to people that the modern (extremely complex)
optimisations used in multi-billion-dollar processor designs are not
a hard requirement.

what do you think?

l.

MitchAlsup

unread,
Nov 10, 2018, 4:27:13 PM11/10/18
to
My 66000 ISA is a fairly standard RISC like ISA except that displacements and
immediates can be 16-bit, 32-bit or 64-bit (i.e. variable length instructions.)
It contains the typical integer, logical, and shift instructions, along with
a modern IEEE 754-2008 FPU, including a CVT (convert) instruction that can convert
{signed, unsigned, single, double} into {signed, unsigned, single, double}
and allows the instruction to specify the rounding mode used (when useful). The
ISA also includes 20 transcendental instructions, your typical set of
memory reference instructions, and a clever way of shoestringing predication
into an ISA where instructions themselves do not waste bits on predication.

Nothing about the Instruction set architecture (ARCH) is necessarily
out-of-order; but several things were done in such a way to avoid hindering wide
superscalar OoO machine implementations.

These kinds of ISAs (almost invariably) require OoO in order to achieve 6-wide
superscalar implementations. However, if 6-wide is restricted to vector only
I can see a straightforward path to building a 6-wide vector unit at the hip
of a 1-wide in-order core.

MitchAlsup

unread,
Nov 11, 2018, 11:17:58 AM11/11/18
to
On Friday, November 9, 2018 at 10:49:56 PM UTC-6, lk...@lkcl.net wrote:
Consider a 1-wide RISC machine pipelined into 6 <classic> cycles.
The core will have 4 function units {Integer, AGEN, FMAC, FADD}.
LD will be 3 cycles:: AGEN, cache access, Load Align (assuming a cache hit).
FADD will be 4 cycles
FMUL will be 4 cycles
ST will use the HP (1986-ish) store pipelining trick

Now consider the FFT2 from above.

In std RISC the inner loop has 21 instructions {4 LD, 4 F+, 4 Fx, 2 F+, 4 ST,
3 Int}. In VVM there are only 19 instructions {4 LD, 4 F+, 4 Fx, 2 F+, 4 ST,
1 Loop}.

Each of the 4 FUs has up to 8 instructions in a station (not a reservation
station, just a station) and these stations watch the other stations and the
data flow and decide when to run. The general goal is to feed the instruction
to the FU with operands using as little of the register file as possible.

So the following execution schedule falls out:

LD |+$>|
LD |+$>|
LD |+$>|
LD |+$>|
FADD |++++|
FADD |++++|
FADD |++++|
FADD |++++|
FMUL |****|
FMUL |****|
FMUL |****|
FMUL |****|
FADD |++++|
FADD |++++|
ST |+$ >|
ST |+$ >|
ST |+$ >|
ST |+$ >|
Loop |+:b|
nextloop
LD |+$>|

The subsequent loop can begin after the greater of::
{ 3 cycles for LOOP, 8 cycles for AGEN, 6 cycles for FADD, 4 cycles for FMUL}
So, in this case given proper buffering the loop can start after 8 cycles.
Thus we have the equivalent of 21 instructions taking only 8 cycles, for an
IPC = 2.6 from a 1-wide machine.

In order to achieve this, the FADD needs to be able to forward to FMUL and
FMUL needs to be able to forward to FADD. LDs have to be able to forward to
FADD and FADD has to be able to forward to ST without interfering with
each other.

The LDs get their base register from RF and their index register from LOOP
The first 4 FADDs get both of their operands from LD forwarding
The 4 FMULs get one operand from the RF and one operand from FADD
The second 2 FADDs get both operands from FMUL
The STs get their base register from RF and their index register from LOOP
and get the data to be stored from FADD
The LOOP instruction gets its input data from the previous LOOP instruction

The RF is read 12 times every 8 cycles; easily within its read port
available bandwidth.

Many std RISC pipelines have a RF write port dedicated to service LDs, and
the second port services the FUs. In this vectorized loop, 4 of the LD
cycles are used by LD, 2 are used by the second set of FADDs. The FU
write ports are used by the first 4 FADDs and FMULs. The LOOP instruction
does not bother to write the RF write port until the vector completes.
Presto, a little multiplexing on the LD wr port and all the RF write traffic
is absorbed by the RF of a 1-wide std RISC while operating at 2.6 IPC.

The flip-flops to orchestrate all this vectorization will basically double
the number of flip-flops in all of the FUs of the data path of the std RISC
1-wide implementation. That is: take the std 1-wide RISC processor data path;
remove all logic other than flip-flops; compress these into the stations;
then add back an entire std RISC core--and you have the estimated size of
the vectorized core.

Paul A. Clayton

unread,
Nov 11, 2018, 3:16:18 PM11/11/18
to
On Monday, November 5, 2018 at 11:16:32 AM UTC-5, MitchAlsup wrote:
[snip]
> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)
> b) no vector length register (nor a conceptual length of the vector)
> c) no vector control registers (no vector state, full vector perf)
> d) no vector condition codes (scalar predication covers this)
[snip]
> The memory system performs store-to-load forwarding analysis so that loops
> such as:
>
> void looprecurance(size_t n, double a, const double x[])
> {
> for (size_t i = 0; i < n; i++) {
> x[i] = a*x[i-10] + x[i-7];
> }
> }
>
> remain vectorized; one calculation simply becomes dependent on the 7th and 10th
> previous calculations through a memory system. No special analysis is required
> in the compiler to vectorize this loop, nor is any special detection necessary
> to avoid vectorizing this loop. The basic tenet is that if the scalr code can
> be written in a simple loop that one can change the top and bottom of the
> loop and presto-it is vectorized.
>
> All of the advantages of a CRAY-like vector ISA, virtually no change to the
> actual ISA (2 instructions), and precise exceptions too.

It is not clear that blocking is practical with this
interface. In a CRAY-like vector ISA, a vector register
could be reused (i.e., the vector length could be
recognized as a blocking unit). If the load of this vector
is a generic gather or more expensive strided load,
avoiding the extra work involved in the load might be
a significant benefit.

It seems plausible that a microarchitecture could detect
such opportunities, but I think discovering with some
effort during decode or Icache load what the compiler can discover (and inexpensively communicate) is suboptimal.
(How such value reuse could be communicated is not clear,
nor whether such communication would be sufficiently
inexpensive to be worthwhile.)

In some cases, scalar-vector operations might be used to
express value reuse (it is not clear if the MY66000
virtual vector extension would support nested vectors
where the outer loop treats an inner loop scalar as a
vector), but such might not present the best ordering
of communication/storage/operation.

It seems that an implementation of such virtual vectors
could exploit reduced forwarding to reduce the cost of
extremely wide execution. Even in non-vector (and non-
mulithreaded) code with significant parallelism, various
reductions in generality may be worthwhile in using
much of the available parallelism without the overhead of
supporting all-to-all communication (at full bandwidth and
minimal latency). (This is distinct from the "bolted on"
vector unit.)

A related optimization issue is the size of loop bodies.
With an exposed microarchitecture, code can be optimized
for latencies and width (at the cost of reoptimizing for
each microarchitecture or significant loss on untargeted
microarchitectures). It is not clear how an implementation
can efficiently implement compiler optimizations like
loop splitting; two loops which each fit within the
forwarding network could be more efficient than a single
loop with less nominal work.

Earlier Mitch Alsup may have mentioned optimizations
for reductions. I don't know whether he expects an
implementation to recognize such a reduction operation
in a vector loop and allow (possibly variant result)
parallel execution.

Dynamically detecting associativity, commutativity,
value reuse, and approximate value reuse (as well as
whether and how such can be used) seems problematic in
general. Similar optimization issues seem to apply for
precision, where portions of a computation may be known
to have lower precision demands.

For some decision-oriented applications, approximation
can optimize the worth of the result relative to the cost
of the result and the worth and cost can be variable in
time (e.g., advertisement targeting might use more data
and computation in "off peak hours" at similar cost) and
inputs (aside from caching and distributing recently or
frequently used data, spending more resources may be
justified for more timing-sensitive, accuracy-sensitive,
or valuable results (e.g., it may be known that
advertising is relatively ineffective for a target so
decision and presentation cost may be reduced [it is
somewhat surprising to me that so few ads are placed
"below the fold" where load latency would be less of an
issue; this may speak of the weakness of content and
linking, that most visitors will not read past the first
screen of information, or weak layout design or
implementation (where the latency would not be hidden)]).

("Cache oblivious" blocking techniques could be applied
where multiple reasonable blocking factors are used, but
this introduces code bloat. Some tasks would probably
benefit from implementation- or use-specific code and
caching such optimizations (or at least caching their
usefulness) at a time scale larger than expected for an
instruction cache may be worthwhile.)

The more general problem is what compiler-time information
to express and how to express it in the presence of
post-compiler modification (binary translation, architecture
family based software distribution format, optimizing
trace/instruction cache, decode-time optimization, issue
queue based optimizations, et al.) or merely
microarchitectural variation. (A related issue is how
much loss of efficiency is acceptable under microarchitectural
variation (and the spatial and temporal characteristics of
such variation — physical resource sharing/proximity and
timing of code allocation to a "microarchitecture" (which
includes variations such as what code is run in a
multithreaded core; optimal code for one pairing of threads,
e.g., may be different than for another allocation).)

(Sorry about the length and diversions. I initially only
intended to mention the blocking/value reuse issue and the
side general issue of timing of specialized optimization.
At least the diversions might be mildly entertaining.)

MitchAlsup

unread,
Nov 11, 2018, 4:54:45 PM11/11/18
to
But this is exactly what is compiler determined:: an instruction at the top
of the vector operation denoting which registers are <mimicking> vector registers and which are <mimicking> scalar registers. At the bottom of the
vectorized loop is a single instruction that embodies the {ADD, CMP, BR}
semantic in compressed form which is used by the HW to order loop incarnations.

> (How such value reuse could be communicated is not clear,
> nor whether such communication would be sufficiently
> inexpensive to be worthwhile.)

But it is true that without the vector register file, one cannot hang onto
multiple handfuls of data between certain vectorizable sections. So if one
is willing to pay for the storage in area and power, CRAY-style will
outperform at least some of the time. But at the cost of area and power.
>
> In some cases, scalar-vector operations might be used to
> express value reuse (it is not clear if the MY66000
> virtual vector extension would support nested vectors
> where the out loop treats an inner loop scalar as a
> vector), but such might not present the best ordering
> of communication/storage/operation.

The VVM extension is targeting the inner loops as its main goal,
and it seems to me that this is where most of the gains will be
found.
>
> It seems that an implementation of such virtual vectors
> could exploit reduced forwarding to reduce the cost of
> extremely wide execution. Even in non-vector (and non-
> mulithreaded) code with significant parallelism, various
> reductions in generality may be worthwhile in using
> much of the available parallelism without the overhead of
> supporting all-to-all communication (at full bandwidth and
> minimal latency). (This is distinct from the "bolted on"
> vector unit.)

By having the virtual vectors organized as loops, the HW designer
can decide to throw 0, 1, 2, 4, 8 lanes of vector data path at
the performance end of things. [S]He can also decide the balance
between calculation BW and memory BW. ISA will not care if it
is run on the {0, 1, 2, 4, 8, ...} machine unless it monitors
wall clock time, and even here, only the time changes.
>
> A related optimization issue is the size of loop bodies.
> With an exposed microarchitecture, code can be optimized
> for latencies and width (at the cost of reoptimizing for
> each microarchitecture or significant loss on untargeted
> microarchitectures). It is not clear how an implementation
> can efficiently implement compiler optimizations like
> loop splitting; two loops which each fit within the
> forwarding network could be more efficient than a single
> loop with less nominal work.

Indeed!
>
> Earlier Mitch Alsup may have mentioned optimizations
> for reductions. I don't know whether he expects an
> implementation to recognize such a reduction operation
> in a vector loop and allow (possibly variant result)
> parallel execution.

For the typical kinds of reductions (inner product) I have a way
to denote that the loop is a reduction. Exactly how, I am not
ready to talk about.
>
> Dynamically detecting associativity, commutativity,
which I don't
> value reuse,
which I do
> and approximate value reuse (as well as
> whether and how such can be used) seems problematic in
> general. Similar optimization issues seem to apply for
> precision, where portions of a computation may be known
> to have lower precision demands.

If it is known, it should be expressed thusly.
>
> For some decision-oriented applications, approximation
> can optimize the worth of the result relative to the cost
> of the result and the worth and cost can be variable in
> time (e.g., advertisement targeting might use more data
> and computation in "off peak hours" at similar cost) and
> inputs (aside from caching and distributing recently or
> frequently used data, spending more resources may be
> justified for more timing-sensitive, accuracy-sensitive,

HW should make no assumptions of this ilk. HW should execute
programs as expressed in ISA. Should SW be able to make
such decisions, SW can express same in ISA.

> or valuable results (e.g., it may be known that
> advertising is relatively ineffective for a target so
> decision and presentation cost may be reduced [it is
> somewhat surprising to me that so few ads are placed
> "below the fold" where load latency would be less of an
> issue; this may speak of the weakness of content and
> linking, that most visitors will not read past the first
> screen of information, or weak layout design or
> implementation (where the latency would not be hidden)]).
>
> ("Cache oblivious" blocking techniques could be applied
> where multiple reasonable blocking factors are used, but
> this introduces code bloat. Some tasks would probably
> benefit from implementation- or use-specific code and
> caching such optimizations (or at least caching their
> usefulness) at a time scale larger than expected for an
> instruction cache may be worthwhile.)

Once again this is SW. HW should make no decisions other than
to faithfully execute the instructions that are to be performed.
>
> The more general problem is what compiler-time information
> to express and how to express it in the presence of
> post-compiler modification (binary translation, architecture
> family based software distribution format, optimizing
> trace/instruction cache, decode-time optimization, issue
> queue based optimizations, et al.)

This seems to be an Ivan question.

> or merely
> microarchitectural variation. (A related issue is how
> much loss of efficiency is acceptable under microarchitectural
> variation (and the spatial and temporal characteristics of
> such variation — physical resource sharing/proximity and
> timing of code allocation to a "microarchitecture" (which
> includes variations such as what code is run in a
> multithreaded core; optimal code for one pairing of threads,
> e.g., may be different than for another allocation).)

Any implementation should be able to process essentially ANY
vector code loop faster than the scalar machine can. As illustrated
above, a low end RISC can do 2.5x-4x better than the scalar ISA
would allow.

MitchAlsup

unread,
Nov 12, 2018, 12:22:39 PM11/12/18
to
On Friday, November 9, 2018 at 10:49:56 PM UTC-6, lk...@lkcl.net wrote:
>
> with that in mind, would you be interested to continue the discussion
> on this thread *without* referring to anything that involves out-of-order,
> reservation stations, and so on?
>
> by engaging with me and helping me to understand the 66000 vectorisation
> *without* such optimisations, it will help open up the reach of the
> 66000 ISA by demonstrating to people that the modern (extremely complex)
> optimisations used in multi-billion-dollar processor designs are not
> a hard requirement.

So, I engaged your question; where is your response?

Stephen Fuld

unread,
Nov 12, 2018, 3:11:25 PM11/12/18
to
On 11/11/2018 1:54 PM, MitchAlsup wrote:
> On Sunday, November 11, 2018 at 2:16:18 PM UTC-6, Paul A. Clayton wrote:

snip

>>
>> Earlier Mitch Alsup may have mentioned optimizations
>> for reductions. I don't know whether he expects an
>> implementation to recognize such a reduction operation
>> in a vector loop and allow (possibly variant result)
>> parallel execution.
>
> For the typical kinds of reductions (inner product) I have a way
> to denote that the loop is a reduction. Exactly how, I an not
> ready to talk about.

Great! I have been thinking about this, but have not come up with any
good solutions. I expect you to do much better.

Whenever you are ready, let us know.

lk...@lkcl.net

unread,
Nov 12, 2018, 3:23:52 PM11/12/18
to
On Monday, November 12, 2018 at 5:22:39 PM UTC, MitchAlsup wrote:

> So, I engauged your question, where is your response?

sorry, mitch, i've been ill for the past 12 hours, and dealing
with some other issues before that. i'll reply later today
(it's 4am at the moment here in taiwan).

l.

lk...@lkcl.net

unread,
Nov 13, 2018, 11:30:05 AM11/13/18
to
On Sunday, November 11, 2018 at 4:17:58 PM UTC, MitchAlsup wrote:

> Consider a 1-wide RISC machine pipelined into 6 <classic> cycles.
> The core will have 4 function units {Integer, AGEN, FMAC, FADD}
> LD will be 3 cycles:: AGEN cache access, Load Align and cache hit.
> FADD will be 4 cycles
> FMUL will be 4 cycles
> ST will use the HP (1986-ish) store pipelining trick
>
> Now consider the FFT2 from above.
>
> In std RISC the inner loop has 21 instructions {4 LD, 4 F+, 4 Fx, 2 F+, 4 ST,
> 3 Int} In VVM there are only 19 instructions {4 LD, 4 F+, 4 Fx, 2 F+, 4 ST,
> 1 Loop}
>
> Each of the 4 FUs have up to 8 instructions in a station (not a reservation
> station, just a station) and these stations watch the other stations and the
> data flow and decide when to run. The general goal is to feed the instruction
> to the FU with operands using as little of the register file as possible.

that sounds like a really good optimisation goal, which would be fantastic
for a modern high-end processor. what i'm trying to do is to understand
and confirm the core 66000 ISA vectorisation concept in terms of completely
non-optimised microarchitectural primitives, such that even a CS1 student
could seriously consider implementing a 3-stage pipeline version running
in a $99 FPGA.

or, think in terms of a single highly experienced designer being given a
total of about 6 weeks to complete an entire from-scratch 66000 "compliant"
reference implementation. it has to be correct... it does not have to be
optimised in any way.

then, due to the simplicity, with no optimisations, it becomes possible
to really get to the core of what 66000 vectorisation does.

with that in mind, i believe that a "correct" implementation would
involve an augmentation on the register file, done as a CAM (key-value
store):

    uint64_t regfile[32];           // scalars
    struct regcam {
        int      regkey;
        uint64_t value;
        int      vector_lane;       // always greater than 0
    };
    struct regcam vectors[64];
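
a naive read path on top of that, purely to illustrate the idea (lane 0
is the ordinary scalar regfile):

    uint64_t read_reg(int r, int lane)
    {
        if (lane == 0)
            return regfile[r];          // lane 0 *is* the scalar regfile
        for (int i = 0; i < 64; i++)
            if (vectors[i].regkey == r && vectors[i].vector_lane == lane)
                return vectors[i].value;
        return regfile[r];              // not vectorised: scalar value shared by lanes
    }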

this is slightly different from the "naive" implementation
i previously posted, which had vastly wasted resources.

on the 66000 ISA "start vectorisation" opcode, the number
of required vectors is added up. the goal will be to determine
if the number of key-value entries is sufficient to be able
to even _do_ vectors rather than scalars. so for that,
the number of vectorised registers has to be less than
the number of free, unused key-value store entries.

we can use the *original* scalar regfile as "lane 0",
and divide up the rest in an all-or-nothing integer-divide way:
if there are e.g. 14 vectorised registers required for the loop,
and 64 key-value regcam entries, then 14*4=56 means we can get
4 sets of 14, so, including the scalar regfile as "lane 0",
we can get a total of 5 vector lanes.

the nice thing is that, given that OoO optimisations
are not being used here and these are *actual* registers, there
is no longer a limit on the number of instructions inside
the loop.

on a context-switch, we also know that the state required
to be saved is very simple: it's the register file *and* the
key-value regcam file.

also, there's no delay at the start of the loop. the only
calculation needed is a popcount on the vectorisation
bitfield (i assume it's a bitfield marking which registers
are vectors?), a divide and a subtract to work out how
many lanes (plus 1) could be created out of the key-value
store. that should be easily possible to do in a single
pipeline stage, so there are no pipeline stalls either.
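
something like this, in naive c (the cam size and the gcc popcount builtin
are just placeholders):

    #include <stdint.h>

    #define CAM_ENTRIES 64

    int count_lanes(uint64_t vectorised_bitfield, int cam_entries_in_use)
    {
        int nvec = __builtin_popcountll(vectorised_bitfield); // no. of vectorised regs
        if (nvec == 0)
            return 1;                   // nothing vectorised: plain scalar loop
        // lane 0 is the scalar regfile itself; extra lanes come from free CAM entries
        return 1 + (CAM_ENTRIES - cam_entries_in_use) / nvec; // e.g. 1 + 64/14 = 5
    }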

what i'm saying is, then, that there may be a fallback
position for the implementation that you envisage, which
does not rely on OoO optimisations, which, if used, could
remove some of the architectural limitations (number of
instructions allowed in the loop).

does that sound reasonable? if so, i think i may have
a way to make it possible to do arbitrary depth *vectorised*
function calls, as well, however i wanted to check first
if the above sounds reasonable before discussing that.

l.

lk...@lkcl.net

unread,
Nov 13, 2018, 11:52:52 AM11/13/18
to
On Sunday, November 11, 2018 at 8:16:18 PM UTC, Paul A. Clayton wrote:

> It is not clear that blocking is practical with this
> interface. In a CRAY-like vector ISA, a vector register
> could be reused (i.e., the vector length could be
> recognized as a blocking unit). If the load of this vector
> is a generic gather or more expensive strided load,
> avoiding the extra work involved in the load might be
> a significant benefit.
>
> It seems plausible that a microarchitecture could detect
> such opportunities, but I think discovering with some
> effort during decode or Icache load what the compiler can discover
> (and inexpensively communicate) is suboptimal.
> (How such value reuse could be communicated is not clear,
> nor whether such communication would be sufficiently
> inexpensive to be worthwhile.)

so, allow me to outline a summary:

* my understanding of 66000 is that it has a
scalar regfile which can also act as a
traditional "lane 0" of a vector system,
with additional "hidden tails" that may
be placed onto any register with a "vector"
tag:

r0 [scalar]
r1 [scalar]
r2 [scalar] [hidden1] [hidden2]
r3 [scalar] [hidden1] [hidden2]
r4 ..

and it also has a Zero-Overhead Loop mechanism
which marks one (and only one) of the scalar
registers to be used as the loop counter. that
loop counter is decremented in multiples of
the length of the "tails" plus one. all tails
*must* be of equal length within any given loop.

if there are insufficient resources to add
equal-length (hidden) tails covering lane1 and
upwards, the vectorisation does *not* need to
actually take place: the entire loop decrements
by one each time i.e. the entire loop is done
as a *scalar* operation.

it's pretty neat.

* traditional vectorisation systems have completely
separate vector registers. they normally have an
SRAM as backing store, and divide it up on-demand
depending on the element width and vector length.

r0 scalar
.. scalar
r31

v0 [lane0] [lane1] .... [laneN]
v1 [lane0] [lane1] .... [laneN]

unlike 66000, however, Zero-overhead Looping is
not normally part of the ISA. it *could* be added,
it's just never been seen [publicly] as part of
any [public] published ISA.

* SV is weird. it has a vector length, just like
a vectorisation system. it uses the *standard*
register file... just like the 66000... except
instead of hidden tails the vectors sit COMPLETELY
inside the standard register file. in the example
below, VL=3 and both r3 and r7 have been marked
in a CSR key-value store as being "vectorised":

r0 (zero - it's RISC)
r1 scalar
r2 scalar
r3 scalar [v0.0]
r4 scalar [v0.1]
r5 scalar [v0.2]
r6 scalar
r7 scalar [v1.0]
r8 scalar [v1.1]
r9 scalar [v1.2]

it's *real* simple: any operation checks the
key-value store, and if one of the src or dest
are marked as "vectorised" a loop is activated
that *literally* chucks incremental operations
into the execution pipeline.

ADD r3, r3, r7 (with the example context above) results in:

ADD r3, r3, r7 *AND*
ADD r4, r4, r8 *AND*
ADD r5, r5, r9
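
in naive c the issue loop is something like this (VL and the tags come from
the CSR key-value store; the names are made up):

    #include <stdbool.h>

    #define VL 3

    extern bool reg_is_vectorised[32];              // CSR key-value tags
    extern void issue_scalar_add(int rd, int rs1, int rs2);

    void issue_add(int rd, int rs1, int rs2)
    {
        bool vec = reg_is_vectorised[rd] || reg_is_vectorised[rs1]
                                         || reg_is_vectorised[rs2];
        int n = vec ? VL : 1;
        for (int i = 0; i < n; i++)                 // vectorised regs step by one,
            issue_scalar_add(rd  + (reg_is_vectorised[rd]  ? i : 0),  // scalar regs
                             rs1 + (reg_is_vectorised[rs1] ? i : 0),  // stay fixed
                             rs2 + (reg_is_vectorised[rs2] ? i : 0));
    }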

[side-note: again, it is perfectly possible to add ZOLC to SV].

so when i first saw discussions on vectorisation, i found
it extremely puzzling to see the kinds of issues being raised
that you bring up, paul. i was thinking to myself, *surely*
vectorisation systems don't have these kinds of memory issues,
or require lanes, and so on: *surely* they can just drop
multiple element-based instructions into a massively-multi-issue
execution engine, and the instruction execution FIFO can pick
those up, even do out-of-order algorithms on them, element-based
register-renaming, and everything, right?

so... what's the deal? why are traditional vectorisation
systems so much more difficult to implement?

l.

MitchAlsup

unread,
Nov 13, 2018, 6:56:07 PM11/13/18
to
You might be able to do this, but I add no state to the register files to build VVM on top of My 66000 ISA. What I did was to add something that smells a lot more like a CDC Scoreboard as a control unit to an instruction buffering scheme that smells a lot like reservation stations, but there is NO CAMming going on!
>
> on the 66000 ISA "start vectorisation" opcode, the number
> of required vectors is added up. the goal will be to determine
> if the number of key-value entries is sufficient to be able
> to even _do_ vectors rather than scalars. so for that,
> the number of vectorised registers has to be less than
> the number of free, unused key-value store entries.

There is a one-to-one correspondence between instructions in a scalar loop
and instructions in a vector loop--with the caveat that the LOOP instruction
takes the place of 3 scalar bottom of loop instructions.
>
> we can use the *original* scalar regfile as "lane 0",
> and in an all-or-nothing integer-divide way, if there are
> e.g. 14 vectorised registers required for the loop, and 64
> key-value regcam entries, 14*4=56, we can get 4 sets of 14
> so that means that, including the scalar regfile for "lane 0",
> we can get a total of a 5-wide vector lane.

I have a different model, that each lane of vector calculation is a different
iteration of the loop. Lane[0] <= Loop[1]; lane[k] <= Loop[k+1]. It is easy
enough to take an index value and use a parallel prefix adder to generate all
of the indexes needed for each loop/each lane simultaneously for a reasonable
number of lanes (8-ish). Then each lane is independent except for the memory
addresses (ST-to-LD aliasing), but the loops never get so far out of order as to
give the disambiguator a hard time finding the special cases.
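
In C terms the index generation amounts to no more than the following
(hardware produces all NLANES sums at once with the parallel prefix adder;
this is illustrative only):

    #define NLANES 8

    void lane_indexes(unsigned base, unsigned idx[NLANES])
    {
        for (unsigned k = 0; k < NLANES; k++)
            idx[k] = base + k;      /* lane[k] runs loop iteration base+k */
    }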

Not getting too far out of order is a key.
>
> the nice thing is, that given that OoO optimisations
> are not being used here, it's *actual* registers, there
> is no longer a limit on the number of instructions inside
> the loop.
>
> on a context-switch, we also know that the state required
> to be saved is very simple: it's the register file *and* the
> key-value regcam file.
>
> also, there's no delay at the start of the loop. the only
> calculation needed is a popcount on the vectorisation
> bitfield (i assume it's a bitfield marking which registers
> are vectors?), a divide and a subtract to work out how
> many lanes (plus 1) could be created out of the key-value
> store. that should be easily possible to do in a single
> pipeline stage, so there's no pipeline stalls either.

I just take the VEC data and use it to set up the instruction buffers as
the loop is decoded and inserted into the instruction buffers. I don't have
to count it per se.
>
> what i'm saying is, then, that there may be a fallback
> position for the implementation that you envisage, which
> does not rely on OoO optimisations, which, if used, could
> remove some of the architectural limitations (number of
> instructions allowed in the loop).

I want the smallest <reasonable> implementation to accept and process
vectorized loops. They may be no faster than a 1-wide in-order base RISC
processor, but they are not any slower either.
>
> does that sound reasonable? if so, i think i may have
> a way to make it possible to do arbitrary depth *vectorised*
> function calls, as well, however i wanted to check first
> if the above sounds reasonable before discussing that.

There will be dozens of ways to implement VVM in typical RISC processors.
>
> l.

MitchAlsup

unread,
Nov 13, 2018, 7:05:51 PM11/13/18
to
On Tuesday, November 13, 2018 at 10:52:52 AM UTC-6, lk...@lkcl.net wrote:
> On Sunday, November 11, 2018 at 8:16:18 PM UTC, Paul A. Clayton wrote:
>
> > It is not clear that blocking is practical with this
> > interface. In a CRAY-like vector ISA, a vector register
> > could be reused (i.e., the vector length could be
> > recognized as a blocking unit). If the load of this vector
> > is a generic gather or more expensive strided load,
> > avoiding the extra work involved in the load might be
> > a significant benefit.
> >
> > It seems plausible that a microarchitecture could detect
> > such opportunities, but I think discovering with some
> > effort during decode or Icache load what the compiler can discover
> > (and inexpensively communicate) is suboptimal.
> > (How such value reuse could be communicated is not clear,
> > nor whether such communication would be sufficiently
> > inexpensive to be worthwhile.)
>
> so, allow me to outline a summary:
>
> * my understanding of 66000 is that it has a
> scalar regfile

so far so good.

> which can also act as a
> traditional "lane 0" of a vector system,
> with additional "hidden tails" that may
> be placed onto any register with a "vector"
> tag:
>
> r0 [scalar]
> r1 [scalar]
> r2 [scalar] [hidden1] [hidden2]
> r3 [scalar] [hidden1] [hidden2]
> r4 ..
>
It could, but none of the implementations I have conceived needed to do any
of this.

> and it also has a Zero-Overhead Loop mechanism
> which marks one (and only one) of the scalar
> registers to be used as the loop counter. that
> loop counter is decremented in multiples of
> the length of the "tails" plus one. all tails
> *must* be of equal length within any given loop.

Truth be told, the I designation is a placeholder. For the code sequences I
have investigated, I==V represents a value which is unique for each passage
through the loop. I am hoping that I!=V will enable some simplification of
the partially ordered control machine. But if I can't make a good case for it,
it is easy enough to drop back to I==V.
>
> if there are insufficient resources to add
> equal-length (hidden) tails covering lane1 and
> upwards, the vectorisation does *not* need to
> actually take place: the entire loop decrements
> by one each time i.e. the entire loop is done
> as a *scalar* operation.

That is your model. In my model the vectorized register file is the scalar
register file but the vast majority of the vector accesses are done by
forwarding and never pass through the RF. Finally, no state bits were added
to RF.
>
> it's pretty neat.
>
> * traditional vectorisation systems have completely
> separate vector registers. they normally have an
> SRAM as backing store, and divide it up on-demand
> depending on the element width and vector length.

The FRF versus VRF creates a big routing headache in classical vector machines
which is completely unnecessary in VVM.
>
> r0 scalar
> .. scalar
> r31
>
> v0 [lane0] [lane1] .... [laneN]
> v1 [lane0] [lane1] .... [laneN]
>
> unlike 66000, however, Zero-overhead Looping is
> not normally part of the ISA. it *could* be added,
> it's just never been seen [publicly] as part of
> any [public] published ISA.
>
> * SV is weird. it has a vector length, just like
> a vectorisation system. it uses the *standard*
> register file... just like the 66000... except
> instead of hidden tails the vectors sit COMPLETELY
> inside the standard register file.

This sounds completely insane!!
The key to making a Great Big Out of Order (GBOoO) machine do vectors
well is to reduce the amount of out-of-orderness and institute a lot
more partial-orderness. Mostly just to allow the heads and tails of one
loop to overlap those of the previous or successive one.

Stephen Fuld

unread,
Nov 14, 2018, 12:39:11 PM11/14/18
to
On 11/13/2018 3:56 PM, MitchAlsup wrote:

snip

> There is a one-to-one correspondence between instructions in a scalar loop
> and instructions in a vector loop--with the caveat that the LOOP instruction
> takes the place of 3 scalar bottom of loop instructions.


Question. Is it reasonable to use the LOOP instruction (saving the extra
two instructions) for a loop that, for various reasons, is not to be
vectorized, in order to reduce the bottom-of-loop overhead?