The only reason I see are instruction scheduling techniques that might
mix up life ranges but the literature does not mention these
optimizations in context of spilling and unrolling.
Best,
Tim
Your are right, there is no reason that loop unrolling increases the
register pressure if no instruction scheduling is applied afterwards.
The register pressure may increase after loop unrolling if instruction
scheduling is applied afterwards. If the instruction scheduler is
aggressive, then the register pressure (MAXLIVE) may increase, because
more instructions are scheduled, so more lifetime intervals may be in
conflict. If no care is taken, additional spill code may be introduced.
Bounding and/or analysing register pressure before instruction
scheduling is done using the register saturation concept. The following
article studies the relationship between loop unrolling and register
pressure increase:
Sid-Ahmed-Ali Touati. Register Saturation in Instruction Level
Parallelism. International Journal of Parallel Programming,
Springer-Verlag, Volume 33, Issue 4, August 2005. 57 pages.
Instruction scheduling is a very important optimization for pipelined
processors, they really should have mentioned how it affects
spilling/unrolling
Software pipelining is a good example that combines scheduling with
unrolling..
Anyway, when an array is accessed in the loop body, and it reads back from
the same array a few iterations after it had written that entry, unrolling
could cause the compiler to keep the value in a register (staying live
across the boundaries of the copies of the loop bodies), adding to the
pressure. I don't expect that to be overly common, but it's at least one
example of how unrolling could increase register pressure - but again it
only occurs in combination with other optimizations.
Some compilers may also be tricked by their own unrolling to think that the
index variable (loop counter) is hotter and therefore decide not to split
its live range, but that wouldn't be very clever of them. It could happen
though, heuristics for the hot-ness of variables are hard to get right,
especially for loops of which the length can not be determined at
compile-time.
This is the only example I can think of at the moment that will cause plain
unrolling (no reordering or caching of array elements etc) to affect
register pressure - which doesn't mean that there are no other ways.
I wrote a compiler and I don't see why this should be the case...
any spilling that would occur as a result of unrolling I would think would
also occur as a result of having had the loop.
but, then again, I also wrote a fairly stupid compiler, which tends to spill
its registers whenever running across a label or jump (the register
allocator/... is stupid, and so will not bother to take control flow and
similar into account, but will rather just spill any registers in order to
make sure everything is in a consistent state prior to either entering the
label, or initiating the jump).
so, in my case, unrolling would actually help, as since there is no longer a
need for labels and conditional jumps (resulting from the looping code), it
may be able to avoid spilling registers...
then again, I didn't read any of the existing literature prior to originally
writing my compiler, rather I just used the powers of past experience and
programmer intuition.
or such...
> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be
> overly common...
Maybe you don't do numerical analysis codes...
A simple example is a numerical second-derivative code; let's chase
through that:
For pseudocode, you have:
input array Y[0:N+1]. For realism, assume N large.
spacing DX, scratch variable DXSQ=DX*DX
output array D2Y[1:N]
Loop: I=1,...,N
D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ
With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
FP registers.
And long pipeline depth will kill your performance.
Unrolled by 4:
Loop:
D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ
The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.
For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(