Thank you !
One specific aspect of VVM is that when the code contains a potential
memory alias, the hardware processes the loop correctly whether the
alias is real or fake (the fake case running faster than the real one).
So the same source code works both ways, the HW adjusting to the best
performance it can achieve while obeying all the rules.
This means the compiler does not have to perform alias analysis to
the nth degree; and codes like::
for( unsigned i = k; i < MAX; i++ )
    y[i] = y[i-5]*a + y[i-7]*b + y[i-j];
vectorize, but run based on the actual memory address aliasing.
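As a scalar reference point for that loop, here is a minimal C sketch (the
`kernel` wrapper, `MAX`, and the parameter choices are invented for
illustration; callers must keep k >= j and k >= 7 so the indices stay in
bounds):

```c
#include <stddef.h>

#define MAX 64

/* The same source loop for any j.  When j is small, iteration i reads a
   value written only a few iterations earlier (a real alias) and the
   hardware must honor that ordering; when j is large, the loads never
   see an in-flight store (a fake alias) and the loop can run at full
   width.  Either way the code is identical. */
static void kernel(double *y, unsigned k, unsigned j, double a, double b)
{
    for (unsigned i = k; i < MAX; i++)
        y[i] = y[i - 5] * a + y[i - 7] * b + y[i - j];
}
```

Calling `kernel(y, 8, 3, a, b)` creates a short-distance recurrence (real
alias), while `kernel(y, 32, 32, a, b)` reads only far-behind elements
(fake alias); neither variant needs recompilation.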
>
> I do not have enough experience with CPU design to fully envision
> the potential pitfalls etc of such a design, so I would not know how
> to approach it - hence my explicit vector register file
> >> (The last point does not have much to do with your question).
> >>
> >> The first point aims to solve the problem that x86 is in right now,
> >> namely gradually evolving and incompatible SIMD ISA:s with wider and
> >> wider registers. Pretty much every modern ISA tries to avoid that
> >> situation (e.g. see ARM SVE, RISC-V V and ForwardCom).
> >
> > This is the curse of SIMD--also addressed (without adding state) in
> > My 66000 through the Virtual Vector Method. Each implementation
> > can decide for itself how wide to make the various paths.
> Exactly. I think that this is very important.
> >>
> >> The second point is kind of a natural byproduct of the first point,
> >> but it also makes it easier to scale the ISA to different hardware
> >> implementations if it does not make any assumptions about how the
> >> vector elements are processed.
> >
> > Exactly why no VRF is desirable.
> In my ISA the vector length (i.e. the number of vector elements to
> process by each instruction) is controlled by the software. There is
> an upper limit to the vector length that is dictated by the hardware
> implementation (i.e. the size of each vector register), but this can
> (and usually must) be queried by the software. The extra overhead for
> this is usually one instruction before the loop (query the max vector
> length) and one instruction inside the loop (a min instruction to
> decide the vector length for each iteration).
Yes, this is the path CRAY and NEC went down (I believe RISC-V is going
down this route also), and it was a big advance in 1975 ! where 1 FP
multiplier, 1 memory port, 1 FP adder,... could be assembled into several
calculations per cycle sustained over long stretches of time.
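The query-then-min pattern Marcus describes (one instruction before the
loop, one inside) is the classic strip-mining idiom. A C sketch, with
`hw_max_vl()` as a hypothetical stand-in for the ISA's max-vector-length
query:

```c
#include <stddef.h>

/* Hypothetical stand-in for the ISA's "query max vector length"
   instruction; a real implementation returns a hardware-dependent
   value. */
static size_t hw_max_vl(void) { return 8; }

/* Strip-mined element-wise add: one query before the loop, one min
   inside the loop to pick this pass's vector length. */
static void vadd(double *dst, const double *a, const double *b, size_t n)
{
    size_t maxvl = hw_max_vl();                     /* before the loop  */
    for (size_t i = 0; i < n; ) {
        size_t vl = n - i < maxvl ? n - i : maxvl;  /* min, in the loop */
        for (size_t k = 0; k < vl; k++)  /* stands in for one vector op */
            dst[i + k] = a[i + k] + b[i + k];
        i += vl;
    }
}
```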
>
> Thus I think that the extra visible state that the VRF introduces does
> not limit hardware implementations to freely select the actual vector
> width.
No, but the VRF is as big as the FP function units, and has significant
cost at context switch time. The CRAY-1 (et al.) had 4 KBytes of VRF.
>
> Regarding VVM, there are a couple of things that I'm curious about.
>
> First, it seems to me that it is designed to vectorize loops where
> the loop count is known before entering the loop (which, I'm sure,
> is the most important use case for vectorization).
This is not true. VVM vectorizes Loops rather than vectorizing
instructions. This gives VVM the opportunity to see the memory
aliasing and adjust its throughput accordingly. Almost all of
the inner loops of Livermore loops get vectorized in Brian's compiler.
> Is there any
> provision for terminating a loop based on other conditions? (E.g.
> consider a Mandelbrot iteration loop where there are typically two
> different termination criteria: a loop count and a distance
> threshold), or is the idea that such loops are handled by regular
> super-scalar OoO scheduling?
Almost all of the C str* and mem* library functions are vectorized!
There are three kinds of LOOP behavior: straightforward counted loops,
counted loops with a termination clause, and incremented loops with
comparison termination.
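In C, the three shapes look like this (the function names and wrappers
here are illustrative, not from My 66000's toolchain):

```c
#include <stddef.h>

/* 1. Straightforward counted loop. */
static void scale(double *y, const double *x, double a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] * a;
}

/* 2. Counted loop with a termination clause: returns the index of key,
      or n if not found (the memchr shape). */
static int find(const int *x, int key, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (x[i] == key)
            break;
    return i;
}

/* 3. Incremented loop with comparison termination (the strlen shape):
      no trip count is known before the loop starts. */
static size_t length(const char *p)
{
    size_t n = 0;
    while (p[n] != '\0')
        n++;
    return n;
}
```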
>
> Second, is there a way to do "horizontal" operations (e.g. common
> when calculating the sum or min/max of an array or for logical
> operations)? In traditional SIMD ISA:s this is typically implemented
> as successive pairwise operations (as in working your way through a
> binary tree), so that you get a dependency chain of log2(N)
> operations instead of N operations (and usually better FP rounding
> behavior).
Yes, and it is all choice of the implementation. Small lowest power
machines may simply iterate via the FETCH unit. Slightly bigger
machines may iterate via the instructions scheduler. GBOoO
machines can bring as much memory resources and calculation
resources as they can figure out.
For example, the memmove subroutine, when vectorized, can process
a cache-port access width per cycle on small or large machines--
equivalent to ~32-64 instructions per cycle. I suspect the small
machines will do 1/2 a cache line per cycle, while larger machines
may do 2-3 cache lines per cycle.
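For reference, the scalar loop being widened here is just a byte-copy; a
minimal memmove sketch (not the actual library source, and using the
usual implementation license to compare the pointers):

```c
#include <stddef.h>

/* Minimal memmove: the copy direction is chosen so overlapping regions
   come out right.  Under a scheme like VVM it is this plain byte loop
   that the hardware widens to a cache-port access per cycle. */
static void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s)
        for (size_t i = 0; i < n; i++)      /* copy forward  */
            d[i] = s[i];
    else
        for (size_t i = n; i > 0; i--)      /* copy backward */
            d[i - 1] = s[i - 1];
    return dst;
}
```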
Given an implementation that can afford 4 FMAC units and a Loop
with 1 FMAC in it, that loop can be performed at 4 iterations per cycle.
Given a strcmp subroutine, you can do string comparison (byte by byte)
cache-access width per cycle, or 4-to-16 iterations per cycle.
And the code doing 16 iterations per cycle is the exact same code as
that doing 1 on the lowest implementation. You do not have to
specialize the code to the machine you are on. You specify what needs
to happen and the HW chooses how to pull it all off.
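For comparison, the pairwise (tree) reduction Marcus asked about, with
its log2(N) dependence chain, can be sketched in C for a power-of-two
length (the function name and scratch-buffer convention are invented
here):

```c
#include <stddef.h>
#include <string.h>

/* Pairwise (tree) sum of n elements, n a power of two: each pass folds
   the upper half onto the lower half, so the dependence chain is
   log2(n) adds deep instead of n.  The scratch copy keeps the input
   intact.  For floating point this generally rounds differently than a
   strictly sequential sum. */
static double tree_sum(const double *x, size_t n, double *scratch)
{
    memcpy(scratch, x, n * sizeof *x);
    for (size_t half = n / 2; half >= 1; half /= 2)
        for (size_t i = 0; i < half; i++)
            scratch[i] += scratch[i + half];
    return scratch[0];
}
```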
>
> For integer operations there are usually no observable differences
> between sequential and pairwise operations, but for floating-point
> there can be (and usually is) a real difference - which means that
> you can't automatically do such transformations without having a hint
> from the programmer (so that the CPU will produce the expected
> result).
VVM gets rid of the need to hint:: basically, if the compiler can find
a loop and the loop has "reasonable" iteration and termination
properties, the loop can be vectorized! Even when vectorized, all
memory and data ordering is obeyed !
>
> In my ISA I have implemented this with successive folding (combine
> the upper and lower halves of two vector operands - usually the same
> vector register). Since you have to explicitly specify that you are
> doing a folding operation, it is clearly defined what the result will
> be.
I wish you luck.....
>
> /Marcus