On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:
// original code
for (int i=1; i<=3; i++)
{
    for (int j=0; j<i; j++)
    {
        inner_loop_body;
    }
    outer_loop_body;
}
// unrolled assembly
    call label3
    outer_loop_body
    call label2
    outer_loop_body
    call label1
    outer_loop_body
    ...
label1:
    inner_loop_body
label2:
    inner_loop_body
label3:
    inner_loop_body
    ret
Using these ideas I was able to do a 2048-bit RSA decryption in 9.7e6 cycles
on an Intel Core 2 (single-threaded, in 32-bit mode using SSE2
instructions).
(This code will be in the next version of Crypto++.) Naive loop unrolling
was about 30% slower, apparently due to thrashing of the L1 instruction
cache.
I have a nine-level nested loop I need to make faster, but the array
indexes are functionally related.
Say I had a large example like
s=1000
k=f(x)
> for (int i=k; i<=s; i++)
> {
>     for (int j=i-k; j<s+k; j++)
>     {
>         inner_loop_body;
>     }
>     outer_loop_body;
> }
Does your speed method work on this kind of nesting?
Thanks in advance, Doug
This overlap trick seems sort of similar to Duff's Device, except that
you fully unroll the inner loop and just jump in at the right point.
> I have a nine-level nested loop I need to make faster, but the array
> indexes are functionally related.
>
> Say I had a large example like
>
> s=1000
> k=f(x)
>
>> for (int i=k; i<=s; i++)
>> {
>>     for (int j=i-k; j<s+k; j++)
>>     {
>>         inner_loop_body;
>>     }
>>     outer_loop_body;
>> }
>
> Does your speed method work on this kind of nesting?
If I understand this trick correctly, I believe a requirement for the
overlap technique is that the ending point for the inner loop must be
the same for all iterations of the outer loop (since in the unrolled
version there _is_ no check for the ending condition, just a single
"return" at the end of the unrolled code).
Of course, it should be technically possible to rewrite just about any
loop in such a way that this condition becomes true; depending on what
the code in the loop does, such a rewrite may or may not incur its own
performance penalty.
--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
Yes, that's a good point that I forgot to mention. For Comba multiplication,
it is possible to do this rewrite without penalty, but it may not be true in
other cases.
Would you mind giving an example of "Naive loop unrolling", just to be
sure that we understand it in the same way.
I have a real need for loop speed, and an example or reference would
be really helpful.
Thanks.
Using this same example,
>> // original code
>> for (int i=1; i<=3; i++)
>> {
>>     for (int j=0; j<i; j++)
>>     {
>>         inner_loop_body;
>>     }
>>     outer_loop_body;
>> }
the straightforward way to unroll it would be:
inner_loop_body
outer_loop_body
inner_loop_body
inner_loop_body
outer_loop_body
inner_loop_body
inner_loop_body
inner_loop_body
outer_loop_body
If we replace the iteration count 3 on the outer loop with n, "naive loop
unrolling" causes a code size of O(n^2), compared to O(n) for unrolling
using the "overlapping functions" idea.