SSE2 Comba multiplication

131 views

Skip to first unread message

Wei Dai

unread,

Apr 10, 2009, 6:14:04 PM4/10/09

to crypto-op...@googlegroups.com

eBATS results seem to show that Crypto++ in x86 32-bit mode
(http://bench.cr.yp.to/web-impl/x86-berlekamp-crypto_dh.html) is competitive
in speed with GMP in 64-bit mode
(http://bench.cr.yp.to/web-impl/amd64-berlekamp-crypto_dh.html) for some
operations. This note describes a couple of the techniques used by Crypto++
in x86 32-bit mode since version 5.5.

[The first part is copied from an email I sent to Mike Scott a couple of
years ago.]

When using the XMM registers, the Core 2 can do 2 32x32 multiplications per
cycle in parallel. It seems to make sense to do 4 such multiplications at
once, resulting in 4 64-bit products in 2 XMM registers. Each of the 2 XMM
registers is then separated into a high 16-bit half and a low 16-bit half
and added to a column sum. (So a total of 4 sum registers.) 16-bit parts are
used because the 32-bit vector addition instruction is twice as fast as
64-bit vector addition. The code looks like this:

x1 = _mm_mul_epu32(x1, x2);
x2 = _mm_srli_epi32(x1, 16);
x1 = _mm_and_si128(x1, x3); // x3 =
0x0000ffff0000ffff0000ffff0000ffff
x7 = _mm_add_epi32(x7, x1);
x6 = _mm_add_epi32(x6, x2);
x0 = _mm_mul_epu32(x0, x2);
x2 = _mm_srli_epi32(x0, 16);
x1 = _mm_and_si128(x0, x3);
x5 = _mm_add_epi32(x5, x1);
x4 = _mm_add_epi32(x4, x2);

[The second part is a repost of
http://groups.google.com/group/sci.crypt/browse_thread/thread/e1a0f08f2331290/.]

I'd like to report an assembly language optimization technique that I
haven't seen used before. The idea is to unroll a loop while
not increasing the code size as much as the usual way, by putting the loop
body into a separate naked function (i.e., without any explicit parameter
passing, prologue, or epilogue since it has knowledge of the caller's stack
and register allocations), and then calling it a fixed number of times
corresponding to the loop iteration count.

On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:

// original code
for (int i=1; i<=3; i++)
{
for (int j=0; j<i; j++)
{
inner_loop_body;
}
outer_loop_body;

}

// unrolled assembly
call label3
outer_loop_body
call label2
outer_loop_body
call label1
outer_loop_body
...
label1:
inner_loop_body
label2:
inner_loop_body
label3:
inner_loop_body
ret

Reply all

Reply to author

Forward

0 new messages