> MitchAlsup <?????> wrote:
> >As I have discussed here before, I have invented a means to process
> >instructions quickly on not-very-wide machines called the "Virtual Vector
> >Method". Today's topic concerns some "other" stuff one may want to do when
> >processing Vector Loops that would be unavailable when one is processing
> >scalar instructions (even several at a time.)
>
> [ snip ]
>
> >In the VVM version of My 66000 ISA the above loop is written:
> >
> > VEC 0
> > LDB R5,[R1+R4]
> > STB R5,[R2+R4]
> > LOOP LT,R4,#1,R3
>
> I've wondered about the need for the "VEC" instruction. I think the
> hardware can figure out which registers are scalar or vector by the end of the
> 2nd iteration.
The VEC instruction serves 2 purposes (one of them slightly different than
when we talked about VVM 3-5 months ago):
a) it directly annotates the registers with "loop carried dependencies"
b) it provides the position of the top of the loop (so the LOOP instruction
does not need a branch offset)
One can argue that one can figure out LCDs in 2 passes; however, I assigned
a third unit of utility to VEC. Those registers marked in VEC are copied
out of the Loop back into scalar registers; the vector registers "in the
loop" are only used for data dependencies (and are not copied out.) Since
most data in a loop is never used again, thinning this out streamlines
the epilogue of the Loop.
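
The copy-out rule can be sketched in a few lines of Python. This is only an
illustration: the bitmask encoding of the VEC register list, the dictionaries
standing in for register files, and the name copy_out are all assumptions made
here, not part of My 66000 or VVM.

```python
def copy_out(vec_mask, loop_values, scalar_regs):
    """Loop epilogue sketch: write back only the registers marked in VEC.

    vec_mask    -- assumed encoding: bit x set means Rx is marked in VEC
    loop_values -- final in-loop value of each register the loop wrote
    scalar_regs -- architectural (scalar) register file, updated in place
    """
    for reg, value in loop_values.items():
        if (vec_mask >> reg) & 1:
            scalar_regs[reg] = value   # marked in VEC: copied out
        # otherwise the in-loop value is simply dropped
    return scalar_regs

# Hypothetical loop that wrote R4 and R5, with only R4 marked in VEC.
regs = copy_out(1 << 4, {4: 10, 5: 99}, {4: 0, 5: 0})
```

With only R4 marked, R5 keeps its pre-loop value, so the epilogue has one
register to write back instead of two.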
>
> Imagine having a 32-bit hidden VEC register which the hardware
> manages (roughly creating the state of the VEC instruction, where in the VEC
> register each bit refers to R0-R31, and '1' means that register is scalar,
> and '0' means it's a vector register). And there's a 2-bit VEC_STATUS
> register indicating state. We need another VEC_RD 32-bit register. Normal
> machine operation is VEC_STATUS=0 (and exiting a LOOP sets VEC_STATUS=0,
> VEC_RD=0, VEC=0).
VVM has no state above that of the non-vectorized ISA. What you describe
above has state that must be managed at context switch,....
>
> Detecting a scalar register means detecting a register which is first read
> in the loop, and then later written. Other combinations are not scalar.
Scalar registers "in a loop" are read and are not written. Registers which
are not carrying loop dependencies, but are written are vector registers.
>
> Upon seeing a taken LOOP instruction while VEC_STATUS=0, set VEC_STATUS=1,
> meaning learn. Each register accessed is now classified as scalar or not
> by whether it lives across iterations. Set
> VEC_RD[x]=1 if register x is read from. If register x is written to, set
> VEC[x]=1 if VEC_RD[x]=1. When LOOP is seen again and taken, VEC now knows
> which registers are scalars, so VEC_STATUS=2 and HW enters the usual VVM mode,
> using VEC to inform the execution units appropriately as each instruction
> is fetched, and then the vector mode takes over.
While you could do this, it ends up a total waste if the loop is traversed
only 2 to 5 times. It only pays off when the loop count is larger.
Point 2: the compiler KNOWS this at compile time, the VEC instruction gives
a way to communicate this to HW.
Point 3: If the compiler so communicates, all one has to see is a use as
a destination to qualify the register as vector.
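
For concreteness, the learning pass quoted above can be sketched roughly as
follows, in Python. The tuple encoding of instructions and the name learn_vec
are illustrative assumptions, as is modeling LOOP as an instruction that both
reads and writes its index register.

```python
def learn_vec(body):
    """One learning iteration (VEC_STATUS=1) of the quoted scheme.

    body is a list of (dests, srcs) register-number tuples, one per
    instruction. Returns the VEC bitmask: bit x set means Rx was read
    and later written, i.e. it lives across iterations.
    """
    vec_rd = 0   # VEC_RD: bit x set once Rx has been read
    vec = 0      # VEC: bit x set when Rx is written after being read
    for dests, srcs in body:
        for reg in srcs:
            vec_rd |= 1 << reg
        for reg in dests:
            if (vec_rd >> reg) & 1:
                vec |= 1 << reg
    return vec

# The byte-copy loop from the top of the post.
copy_loop = [
    ((5,), (1, 4)),      # LDB R5,[R1+R4]
    ((),   (5, 2, 4)),   # STB R5,[R2+R4]
    ((4,), (4, 3)),      # LOOP LT,R4,#1,R3 (assumed: reads R4,R3; writes R4)
]
mask = learn_vec(copy_loop)   # only bit 4 (R4) ends up set
```

R5 is written before it is read, so it is never flagged; only R4 is. That is
information the compiler already has when it emits the loop.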
>
> Why slow things down (since these first 2 iterations won't be as fast as
> the usual VVM)? First, to avoid the software problem: if you ask
> software to do special notation for loops (e.g., put in the VEC instruction),
> it will generally make poor choices. Either unimportant loops will be
> vectorized, or important loops will be missed.
I am trying to vectorize loops with trip counts as low as 3 and still
be faster than if vectorization was not performed at all. Thus, even
the symbol table stuff in a C compiler can use vectorized strcpy to
fill in the symbol table from the parsed text.
> Second, by leaving it to
> hardware, you can enhance the feature in the future without needing
> software changes. As I envision it, software would ALWAYS use the LOOP
> instruction for any loop, even ones it knew would be short, and just not
> care since the hardware just does the right thing in all cases.
There is a maximum number of instructions that can be co-vectorized.
But you are correct, I want all short loops to be vectorized--unless they
contain ATOMIC instructions.
>
> And third, a future implementation could allow nested LOOPs on GBOoO
> implementations, which would be even more powerful.
I thought about this a lot, and ended up with no.
It is perfectly feasible to have additional resources execute the code after
the Loop and figure out a new Loop is on the way WHILE the current Loop is
being iterated.
But it is not feasible to nest vectorization without getting lost in the
resource complexity.
> No software changes
> would be needed, as long as software just made most/all loops use LOOP, even
> nested ones. Hardware can nest LOOP as deep as it feels it can support,
> dynamically.
>
> And fourth, if you add LOOP to a LOOP prediction table, you can get these
> cases right for future calls, even nested cases. Like a branch lookup to
> get the target, you detect the start of a loop by remembering it, and pull
> out the VEC register information, and be vectorized from the first iteration
> (as long as you've seen it before). This table could be pretty small,
> since it would just contain loops that lasted long enough to really take
> advantage of the vectorized loops.
I think just taking Looping branches out of the Branch prediction Table will
be a big enough win to make Loop prediction rather smaller. Remember we are
performing the Increment, compare, and branch in a single cycle, so the
minimum Loop time is 1 cycle, but as discussed in this thread above, one
can perform several iterations of a Loop in a single cycle.
>
> Taking this idea to the extreme, you technically don't need the LOOP
> instruction either. But detecting loops to try to vectorize is much
> harder without it, and the LOOP operation is a very useful one to have.
I am actually considering making the LOOP instruction part of the std ISA
so that the only Vectorization instruction is VEC.
LOOP performs the ADD, the COMPARE, and the BRANCH in a single cycle from
a single instruction. LOOP can also perform an ADD, a compare of a different
register to zero, and a Branch NE0 in a single cycle and single instruction.
Several other constructs are in the works so that things like strncpy
can be performed with a single LOOP instruction.
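
As a rough model of the first form, here is the byte-copy loop from the top of
the post written in Python. The reading of LOOP LT,R4,#1,R3 as "add 1 to R4,
then branch back while R4 < R3" is an assumption about operand order, and the
function names are made up for illustration.

```python
def loop_lt(index, increment, limit):
    """Model of LOOP LT: the ADD, the COMPARE and the BRANCH decision
    folded into one step. Returns (new index, branch taken?)."""
    index += increment
    return index, index < limit

def byte_copy(src, dst, count):
    """Models VEC / LDB / STB / LOOP with R4 starting at 0.

    The test sits at the bottom of the loop, so the body runs at least
    once; count is assumed to be >= 1.
    """
    r4 = 0
    while True:
        dst[r4] = src[r4]                    # LDB R5,[R1+R4] ; STB R5,[R2+R4]
        r4, taken = loop_lt(r4, 1, count)    # LOOP LT,R4,#1,R3
        if not taken:
            return r4                        # final value of the index

src = bytearray(b"abc")
dst = bytearray(3)
final_index = byte_copy(src, dst, 3)
```

The second form would differ only in the condition: the ADD still updates the
index, but the branch decision comes from testing another register against
zero.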
>
> Kent