Unrolling and spilling

Stephan Ceram

unread,

Jan 28, 2009, 6:24:40 PM1/28/09

to

In many compiler books and papers you read that loop unrolling
possibly increases register pressure that leads to extra spill code. I
don't see the reason how this can happen. When you perform unrolling
at assembly level, the loop bodies are just replicated, so the
register life ranges are not really increased. Also, when unrolling is
done at the source code, I don't really see a reason why the register
allocation should add more spill instructions. More spilling would
mean that you have a change in the life ranges of the variables that
adversely influenced the interference graph. But why should I get
more conflicting life ranges after unrolling?

The only reason I see are instruction scheduling techniques that might
mix up life ranges but the literature does not mention these
optimizations in context of spilling and unrolling.

Best,
Tim

Sid Touati

unread,

Jan 29, 2009, 7:21:50 AM1/29/09

to

Stephan Ceram a icrit :

Your are right, there is no reason that loop unrolling increases the
register pressure if no instruction scheduling is applied afterwards.

The register pressure may increase after loop unrolling if instruction
scheduling is applied afterwards. If the instruction scheduler is
aggressive, then the register pressure (MAXLIVE) may increase, because
more instructions are scheduled, so more lifetime intervals may be in
conflict. If no care is taken, additional spill code may be introduced.

Bounding and/or analysing register pressure before instruction
scheduling is done using the register saturation concept. The following
article studies the relationship between loop unrolling and register
pressure increase:

Sid-Ahmed-Ali Touati. Register Saturation in Instruction Level
Parallelism. International Journal of Parallel Programming,
Springer-Verlag, Volume 33, Issue 4, August 2005. 57 pages.

Harold Aptroot

unread,

Jan 29, 2009, 9:53:44 AM1/29/09

to

"Stephan Ceram" <linuxkaffee_@_gmx.net> wrote in message

> In many compiler books and papers you read that loop unrolling
> possibly increases register pressure that leads to extra spill code. I
> don't see the reason how this can happen. When you perform unrolling
> at assembly level, the loop bodies are just replicated, so the
> register life ranges are not really increased.

>...

> The only reason I see are instruction scheduling techniques that might
> mix up life ranges but the literature does not mention these
> optimizations in context of spilling and unrolling.

Instruction scheduling is a very important optimization for pipelined
processors, they really should have mentioned how it affects
spilling/unrolling
Software pipelining is a good example that combines scheduling with
unrolling..

Anyway, when an array is accessed in the loop body, and it reads back from
the same array a few iterations after it had written that entry, unrolling
could cause the compiler to keep the value in a register (staying live
across the boundaries of the copies of the loop bodies), adding to the
pressure. I don't expect that to be overly common, but it's at least one
example of how unrolling could increase register pressure - but again it
only occurs in combination with other optimizations.

Some compilers may also be tricked by their own unrolling to think that the
index variable (loop counter) is hotter and therefore decide not to split
its live range, but that wouldn't be very clever of them. It could happen
though, heuristics for the hot-ness of variables are hard to get right,
especially for loops of which the length can not be determined at
compile-time.
This is the only example I can think of at the moment that will cause plain
unrolling (no reordering or caching of array elements etc) to affect
register pressure - which doesn't mean that there are no other ways.

cr88192

unread,

Jan 29, 2009, 3:41:08 PM1/29/09

to

"Stephan Ceram" <linuxkaffee_@_gmx.net> wrote in message

I wrote a compiler and I don't see why this should be the case...

any spilling that would occur as a result of unrolling I would think would
also occur as a result of having had the loop.

but, then again, I also wrote a fairly stupid compiler, which tends to spill
its registers whenever running across a label or jump (the register
allocator/... is stupid, and so will not bother to take control flow and
similar into account, but will rather just spill any registers in order to
make sure everything is in a consistent state prior to either entering the
label, or initiating the jump).

so, in my case, unrolling would actually help, as since there is no longer a
need for labels and conditional jumps (resulting from the looping code), it
may be able to avoid spilling registers...

then again, I didn't read any of the existing literature prior to originally
writing my compiler, rather I just used the powers of past experience and
programmer intuition.

or such...

Carlie Coats

unread,

May 23, 2009, 11:52:46 AM5/23/09

to

Harold Aptroot wrote:
[snip...]

> Instruction scheduling is a very important optimization for pipelined
> processors, they really should have mentioned how it affects
> spilling/unrolling
> Software pipelining is a good example that combines scheduling with
> unrolling..

> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be

> overly common...

Maybe you don't do numerical analysis codes...

A simple example is a numerical second-derivative code; let's chase
through that:

For pseudocode, you have:

input array Y[0:N+1]. For realism, assume N large.
spacing DX, scratch variable DXSQ=DX*DX
output array D2Y[1:N]
Loop: I=1,...,N
D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ

With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
FP registers.

And long pipeline depth will kill your performance.

Unrolled by 4:
Loop:
D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ

The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.

For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(