On Monday, November 5, 2018 at 4:16:32 PM UTC, MitchAlsup wrote:
> Over the last year I have been working on a Virtual Vector extension to
> My 66000 ISA. In this extension the compiler is given the illusion of
> a register to register CRAY-like vector instructions set, but there are
> only 2 instructions added to the ISA in support of vectors. In particular;
> a) there are no vector registers (saving massive real estate)
yes. there is a huge disadvantage of traditional vector ISAs:
massive duplication and specialisation of the opcodes.
in RVV they have had to duplicate approximately 25-30% of the
xBitManip extension for the sole purpose of managing predicate
bits... *without* that hardware actually being available for
general-purpose bit manipulation.
a combination of SV and xBitManip extensions provides near 100%
complete coverage of RVV's vectorisation functionality.
there is one perceived advantage: the vector pipeline may be entirely
separated from the scalar pipeline. i would perceive that as a
*dis*advantage, on the basis that it results in huge duplication of
silicon, particularly arithmetic pipelines.
> b) no vector length register (nor a conceptual length of the vector)
that's interesting.
> c) no vector control registers (no vector state, full vector perf)
some "setup" is required, however: the state information needs to
be established one way or the other, either through CSRs or through
instructions.
> d) no vector condition codes (scalar predication covers this)
yes it does. RISC-V scalar operations do not have the concept
of predication. therefore, SV has an additional CSR "predication"
table that allows the *use* of a given (ordinarily scalar) register
to indicate "hey this operation is now predicated, because you
used register x5, and the CSR lookup table says that register x3's
bits are to be used as the predicate".
in this way there was absolutely no need to modify the RISC-V
scalar ISA *at all*.
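to make the lookup idea concrete, here is a minimal python sketch of
how "use of a register implies predication" could behave. the table
layout, register numbers and the fact that only the destination is
checked are all illustrative assumptions, not the actual SV spec:

```python
# minimal sketch of SV's predication-lookup idea: a small CSR table maps
# an (ordinarily scalar) register number to a predicate register, so that
# merely *using* that register makes the operation predicated.
# table layout and register choices are illustrative, not the real spec.

# hypothetical CSR predication table: {operand regnum: predicate regnum}
pred_table = {5: 3}      # "if an op uses x5, x3's bits are the predicate"

regs = [0] * 32
regs[3] = 0b0101         # predicate bits set up beforehand

def vec_add(rd, rs1, rs2, vl):
    """execute ADD rd, rs1, rs2 over vl elements, honouring predication.
    for brevity this sketch only checks the destination register."""
    pred_reg = pred_table.get(rd)
    pred = regs[pred_reg] if pred_reg is not None else ~0  # ~0: all enabled
    for i in range(vl):
        if (pred >> i) & 1:                    # element enabled?
            regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]

regs[10:14] = [1, 2, 3, 4]       # elements of rs1: x10..x13
regs[20:24] = [10, 20, 30, 40]   # elements of rs2: x20..x23
vec_add(5, 10, 20, 4)            # x5 is in pred_table, so x3 predicates
# only elements 0 and 2 are written (pred = 0b0101)
```

note how the scalar ADD semantics are untouched: the predication is an
effect of the table lookup, not of any new opcode.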
> Consider for a moment DAXPY::
we like DAXPY :)
> Note: This takes 2 fewer instructions than RV both overall and in the loop.
> In order to convert this function into a vector function, one needs two
> pieces of decoration::
>
> daxpy:
> MOV Ri,0
> CMP Rt,Ri,Rn
> BGE exit
> top: VEC {SIV,SIV,VVV,SIV,ICI} // decorate the registers
> LDD Rx,[Rxb+Ri<<3]
> LDD Ry,[Ryb+Ri<<3]
> FMAC Ry,Ra,Rx,Ry
> STD Ry,[Ryb+Ri<<3]
> LOOP Ri,1,Rn,LT // compress loop overhead
> exit:
> JMP R0
>
> The decoration at the top of the loop annotates which registers in the
> following instructions are Scalars (S), Vectors (V), and Loop-Indexes (I).
ok, so yes: the "mark as vectorised" is the basic core principle of
SV. in SV User-Mode there are 16 available CSR entries which may be used to
mark up to 16 registers (FP or INT) as "vectorised".
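for reference, the scalar loop that the decoration above is turning
into a vector loop is just daxpy ("double a*x plus y"), here as a
plain sketch:

```python
# daxpy as a plain scalar loop: exactly the code that the VEC/LOOP
# decoration (or SV's CSR register-marking) vectorises, with no vector
# registers and no new arithmetic opcodes.
def daxpy(a, x, y):
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]   # the FMAC in the assembly above
    return y

# daxpy(2.0, [1.0, 2.0], [10.0, 20.0]) → [12.0, 24.0]
```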
what i have *not* done is add "loop indices". there is a very simple
reason: SV is *not* about adding new instructions, it's about
expressing *parallelism*.
so for loop indices to be added, a *separate* extension would be needed
that... added looping.
in DSP terminology this is called zero-overhead looping, and at its
most effective (TI DSPs for example) a full FFT may be achieved
*in a single instruction* with 100% utilisation, thanks to a zero
overhead loop control mechanism.
i therefore looked up and contacted someone who created something
called "ZOLC", and obtained the verilog source code. if anyone
is interested i can provide a copy on request, or put you in touch
with the author.
the author of ZOLC provided an amazing example of implementing
MPEG decode, and coming up with a whopping 38% decrease in
completion time... *with no branch prediction unit*.
> The mechanism of execution is as follows:: The VEC instruction casts a
> shadow over the instructions in the vectorized loop. Scalar registers
> only have to be fetched from the register file ONCE, Vector registers
> will be captured Once per loop,
interesting that you've noted the loop invariance of scalar registers.
where are the actual vector registers? *are* there any actual vector
registers? in SV, the scalar register file is "remapped" to
"vectorised" by being treated as if it was a contiguous memory block,
and the "scalar" register is a pointer to the starting element.
in this way, an ADD r10 r20 r30 with a VL of 3 actually becomes:
ADD r10 r20 r30
ADD r11 r21 r31
ADD r12 r22 r32
assuming that all of r10, r20 and r30 have been marked as "vectorised",
that is.
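a minimal python sketch of that remapping, purely to show the
hardware macro-loop behaviour (the set-membership test standing in
for the CSR "vectorised" markings; all numbers illustrative):

```python
# minimal sketch of SV's "remap scalar regfile as vector" idea: the
# register number in the instruction is a pointer to the *starting*
# element, and the hardware macro-loop walks contiguous registers.
# illustrative only.

regfile = [0] * 32
vectorised = {10, 20, 30}    # registers marked "vectorised" via CSR entries

def sv_add(rd, rs1, rs2, vl):
    for i in range(vl):
        # a register only steps per-element if marked as vectorised;
        # an unmarked (scalar) register stays fixed across the loop
        d  = rd  + i if rd  in vectorised else rd
        s1 = rs1 + i if rs1 in vectorised else rs1
        s2 = rs2 + i if rs2 in vectorised else rs2
        regfile[d] = regfile[s1] + regfile[s2]

regfile[20:23] = [1, 2, 3]
regfile[30:33] = [10, 20, 30]
sv_add(10, 20, 30, 3)
# ADD r10 r20 r30 with VL=3 expands to:
#   ADD r10 r20 r30 ; ADD r11 r21 r31 ; ADD r12 r22 r32
```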
> and Writes to the Index register cause
> another LOOP to be initiated in <wherever the instructions are being
> buffered::but most likely something like reservation stations.>
>
> The LOOP instruction does not need a branch target, that was already
> supplied by the VEC instruction; all it needs is the Loop index register,
> an increment, a comparison, and a termination.
this is classic DSP-style zero-overhead looping. it's pretty effective,
and completely does away with the need for branch prediction. in
the ZOLC implementation it can even be nested, and there can be
multi-variable condition codes causing jumps *between nesting levels*...
all producing 100% full pipelines, no stalls, *no branch prediction at all*.
> Architecturally, there can be as many lanes as the designers want to
> build (NEC SX5+) with the caution that the memory system needs to scale
> proportionate to the calculation system.
absolutely. it's been well-known for many many decades that the
bottleneck is never the computational aspect of vector processors:
it's the memory bandwidth.
> If an enabled exception is raised, the vector calculations preceding the
> exception are allowed to complete, and vector data is stored in the
> scalar registers, exception control transfer is performed, and the
> exception handled as if there were no vectorization at all.
in SV, one of the design criteria was that it had to be realistically
achievable for a single computer science student or a libre hardware
engineer to implement in a reasonably short time-frame.
when i first encountered RVV, the RISC-V Vectorisation proposal, i
was very surprised to learn of the design criteria that, firstly,
elements and the operations on them must be treated as being executed
sequentially, and secondly, that if an exception occurred and there
were elements further along (higher indices), the perfectly-good
results had to be THROWN OUT.
six months later i began to understand the value of this design decision.
the issue is that exceptions, such as virtual memory page table
misses, have to be re-entrant. anything more complex than the above
design criteria is just too complex to handle.
so the state that's stored in SV (and i believe RVV as well) has to
include the loop indices. in SV there are actually two indices (due
to twin-predication on certain operations).
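the re-entrancy model can be sketched in a few lines of python. the
function and exception names here are invented for illustration; the
point is only that completed elements stay committed and the saved
element index is sufficient state to resume:

```python
# sketch of the "precise vector exception" model described above:
# elements commit in order; on a trap the completed elements' results
# stay in the scalar registers and the element index is saved as state,
# so the hardware macro-loop can resume exactly where it stopped.
# names are illustrative only.

class PageFault(Exception):
    pass

saved_index = 0            # the per-privilege-mode SV state would hold this

def run_vector_op(op, vl, resume_from=0):
    global saved_index
    for i in range(resume_from, vl):
        try:
            op(i)          # execute element i
        except PageFault:
            saved_index = i    # elements 0..i-1 are already committed
            raise

results = []
handled = [False]

def element(i):
    if i == 2 and not handled[0]:
        raise PageFault()      # e.g. a virtual-memory miss on element 2
    results.append(i * i)

try:
    run_vector_op(element, vl=5)
except PageFault:
    handled[0] = True          # the OS fixes the page table...
    run_vector_op(element, vl=5, resume_from=saved_index)

# elements 0 and 1 completed before the trap; 2..4 after resumption
```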
there's separate state for M-Mode, S-Mode (supervisor mode) and U-Mode,
allowing each to continue to use their own separate vectorisation
to manage lower-privilege execution levels.
> In the absence of exceptions, the FETCH/DECODE/ISSUE sections of the pipeline
> are quiet, waiting for the vector to complete (and getting FETCH out of the
> way of vectors in memory.)
interesting. what happens if the loop is say 2,000 to 4,000 instructions
in length?
> The memory system performs store-to-load forwarding analysis so that loops
> such as:
>
> void looprecurance(size_t n, double a, double x[])
> {
> for (size_t i = 0; i < n; i++) {
> x[i] = a*x[i-10] + x[i-7];
> }
> }
>
> remain vectorized; one calculation simply becomes dependent on
> the 7th and 10th previous calculations through a memory system.
this is accidentally provided in SV as well :)
the CSR "register tagging" system actually allows redirection
of the registers as well. so, at the assembly level the
instruction says "ADD x3 x5 x7" however register x3 is redirected
to *actually* point at x10, x5 is redirected to point at x12,
and so on.
in this way, it should be clear that if the vector length is,
say, 10, then due to the hardware macro-loop unrolling that ADD
across multiple contiguous registers, the register ranges are
*going* to overlap:
ADD x3 x5 x7, VL=10, x3 redirects to x10, x5 redirects to x12:
becomes:
ADD x10 x12 x7
ADD x11 x13 x8
ADD x12 x14 x9 <- overlap already begins
ADD x13 x15 x10
...
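to make the dependency direction explicit, here is a python sketch of
the same redirection trick, except with the bases deliberately flipped
(dest redirected to x12, source to x10 — hypothetical choices) so that
iteration i reads what iteration i-2 wrote, directly analogous to the
x[i-7]/x[i-10] memory recurrence:

```python
# sketch of register redirection producing a recurrence: the CSR table
# redirects x3 -> x12 (dest) and x5 -> x10 (src), so the destination
# range trails two registers behind the source range and iteration i
# consumes the result of iteration i-2.  bases are illustrative only.

regfile = list(range(64))      # x_n initially holds n, for visibility
redirect = {3: 12, 5: 10}      # hypothetical CSR redirection entries

def sv_add(rd, rs1, rs2, vl):
    rd  = redirect.get(rd, rd)
    rs1 = redirect.get(rs1, rs1)
    rs2 = redirect.get(rs2, rs2)
    for i in range(vl):        # in-order hardware macro-loop
        regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]

sv_add(3, 5, 7, 6)   # "ADD x3 x5 x7" at the assembly level
# i=2 computes x14 = x12 + x9, where x12 was freshly written at i=0:
# a genuine recurrence, with no special compiler analysis required
```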
> No special analysis is required
> in the compiler to vectorize this loop, nor is any special
> detection necessary to avoid vectorizing this loop. The basic
> tenet is that if the scalar code can
> be written in a simple loop, then one can change the top and bottom
> of the loop and presto, it is vectorized.
exactly. it's pretty elegant, isn't it?
> All of the advantages of a CRAY-like vector ISA, virtually no change to the
> actual ISA (2 instructions), and precise exceptions too.
one of the things that drives SV is the need to minimise the
amount of modifications and maintenance to the existing
RISC-V toolchain (including LLVM and the front-ends that
target it, such as clang, rust, and so on).
by not adding any new instructions *at all* i can write unit tests
purely in *standard* RISC-V assembler. the *behaviour* is different...
and that's absolutely fine.
in the target primary application we can go straight ahead with
writing a 3D GPU driver (Kazan3D, a Vulkan LLVM software renderer),
use the *existing* scalar RISCV64 compilers, and just put in
hand-coded assembler...
...or, better yet, cheat a little by writing scalar functions,
and, before calling them (and on exit) have a bit of inline assembler
that sets up and tears down the required vectorisation :)
l.