On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:
// original code
for (int i=1; i<=3; i++)
{
    for (int j=0; j<i; j++)
    {
        inner_loop_body;
    }
    outer_loop_body;
}
// unrolled assembly
    call label3
    outer_loop_body
    call label2
    outer_loop_body
    call label1
    outer_loop_body
    ...
label1:
    inner_loop_body
label2:
    inner_loop_body
label3:
    inner_loop_body
    ret
Using these ideas I was able to do a 2048-bit RSA decryption in 9.7e6 cycles
on an Intel Core 2 (single-threaded, in 32-bit mode using SSE2
instructions).
(This code will be in the next version of Crypto++.) Naive loop unrolling
was about 30% slower, apparently due to thrashing of the L1 instruction
cache.
I have a nine-level nested loop I need to make faster, but the array
indexes are functionally related.
Say I had a large example like
s=1000
k=f(x)
> for (int i=k; i<=s; i++)
> {
>     for (int j=i-k; j<s+k; j++)
>     {
>         inner_loop_body;
>     }
>     outer_loop_body;
> }
Does your speed method work on this kind of nesting?
Thanks in advance, Doug
This overlap trick seems sort of similar to Duff's Device, except that
you fully unroll the inner loop and just jump in at the right point.
> I have a nine-level nested loop I need to make faster, but the array
> indexes are functionally related.
>
> Say I had a large example like
>
> s=1000
> k=f(x)
>
>> for (int i=k; i<=s; i++)
>> {
>>     for (int j=i-k; j<s+k; j++)
>>     {
>>         inner_loop_body;
>>     }
>>     outer_loop_body;
>> }
>
> Does your speed method work on this kind of nesting?
If I understand this trick correctly, I believe a requirement for the
overlap technique is that the ending point for the inner loop must be
the same for all iterations of the outer loop (since in the unrolled
version there _is_ no check for the ending condition, just a single
"return" at the end of the unrolled code).
Of course, it should be technically possible to rewrite just about any
loop in such a way that this condition becomes true; depending on what
the code in the loop does, such a rewrite may or may not incur its own
performance penalty.
--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
Yes, that's a good point that I forgot to mention. For Comba multiplication,
it is possible to do this rewrite without penalty, but it may not be true in
other cases.
Would you mind giving an example of "Naive loop unrolling", just to be
sure that we understand it in the same way.
I have a real need for loop speed, and an example or reference would
be really helpful.
Thanks.
Using this same example,
>> // original code
>> for (int i=1; i<=3; i++)
>> {
>>     for (int j=0; j<i; j++)
>>     {
>>         inner_loop_body;
>>     }
>>     outer_loop_body;
>> }
the straightforward way to unroll it would be:
inner_loop_body
outer_loop_body
inner_loop_body
inner_loop_body
outer_loop_body
inner_loop_body
inner_loop_body
inner_loop_body
outer_loop_body
If we replace the iteration count 3 on the outer loop with n, "naive loop
unrolling" causes a code size of O(n^2), compared to O(n) for unrolling
using the "overlapping functions" idea.