sigma_i(x) and SIGMA_i(x)
These 4 functions require 3 ROR (or SHR) and 2 XOR.
They also require that we make 2 copies of the parameter,
since ROR and SHR will destroy it.
Total = 7 instructions
Ch(e,f,g)
We can write Ch(x,y,z) = (e & (y ^ z)) ^ z
Ch() then requires 2 XOR and 1 AND. We also need to make
a copy of either y or z.
Total = 4 instructions
Maj(x,y,z)
We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
Maj() then requires 3 XOR and 1 AND. We also need to make
a copy of either x or y and a copy of either x or z.
Total = 6 instructions
First loop (prepare message schedule):
W[t] = sigma_1(W[t-2]) + W[t-7] + sigma_0(W[t-15]) + W[t-16]
Total = (7 + 7 + 3)*48 = 816 instructions
Second loop (is it called compress?):
T1 = SIGMA_1(e) + Ch(e,f,g) + h + W[t] + K[t]
T2 = SIGMA_0(a) + Maj(a,b,c)
d = d + T1
h = T1 + T2
Line1 = 7 + 4 + 4
Line2 = 7 + 6 + 1
Line3 = 1
Line4 = 1
Total = 31*64 = 1984 instructions
Therefore, if we use only integer code (I will investigate SIMD
later, improvements are unlikely) we need to execute *AT LEAST* 2800
instructions for one round of SHA-256.
Take an Athlon which can retire 3 instructions per cycle. Assume
ideal throughput, a round of SHA-256 requires AT LEAST 933 cycles.
933 cycles/block = 14.58 cycles/byte
This would translate to:
68,571,000 byte/s on a 1 GHz Athlon.
137,143,000 byte/s on a 2 GHz Athlon.
Any comments? Did I get the numbers right?
Cheers,
Shill
I looked at it before, and I think unrolling can remove unnecessary copies
in the second loop. (You did not include those in your lower bound, so this
should not affect the count.)
Also, perhaps full unrolling of the algorithm could eliminate some of the
computations? You only take h at the end, right?
-Matt
I believe you are leaving out the endian swapping. Some-one needs to do
this and you may as well use bswap in the ASM code. There is also the
addition of the previous values at the end of the loop.
I have, in my implementation on x86_64, 2891 instructions in the inner
loop. For the x86 (I back-ported the x86_64 version), I have 3061,
mostly due to register pressure.
CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58
So while in theory things should take 933 cycles, the x86 architecture
has register issues, and instruction re-ordering issues. The
Hammer/Opteron/Athlon64 does a very good job at re-ordering x86
instructions, hence the IPC (Instructions Per Cycle) getting close to
the maximum of 3.
eric
Very true. I confess that I didn't want to deal with the details
right away ;-)
Let's add 16 BSWAP instructions to my lower bound computation.
Copying H to a-h and adding a-h back to H: assume I keep H inside 4
MMX registers (I will have to figure out how to hide the
store-to-load forwarding penalty). Each operation requires 4
instructions.
This brings my lower bound total to 2800+16+8 = 2824 instructions.
Assume optimal throughput, that's a lower bound (an optimum to shoot
for) of 941 cycles per 64-byte block.
> I have, in my implementation on x86_64, 2891 instructions in the inner
> loop. For the x86 (I back-ported the x86_64 version), I have 3061,
> mostly due to register pressure.
Wow! I am impressed... 2891 instructions, that is very close to the
lower bound I've derived above. Did you use SIMD instructions, or
just integer instructions? Did you use the full 64-bit registers or
just the lower 32-bits? In other words, did you take advantage of
the wider registers?
> CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
> Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
> Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
> Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
> Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58
It's great to actually see some numbers! Thanks.
> So while in theory things should take 933 cycles, the x86 architecture
> has register issues, and instruction re-ordering issues. The
> Hammer/Opteron/Athlon64 does a very good job at re-ordering x86
> instructions, hence the IPC (Instructions Per Cycle) getting close to
> the maximum of 3.
I wasn't saying it should take 933 cycles, I was only deriving a
lower bound :-)
BTW, I think my implementation on the Athlon is faster than yours. I
didn't put all the pieces together but I run the first loop in
approximately 350 cycles (I don't have my results with me) and the
second loop in 717 cycles. The total should come in under 1200
cycles ;-)
I'm still working on a genetic algorithm to find the optimal
instruction schedule. It's a cool little project :-)
Shill
Do you have similar data for SHA-1?
Thanks
Andrew
He already has, in reply to a post you wrote...
http://www.google.com/groups?selm=3f8d51ec$0$23594$5a62...@freenews.iinet.net.au
Sorry, let me be more specific. I am looking particularly for #inst,
cycles/bl and cycles/block.
I see.
You can derive the last two, since Eric provided throughput and
frequency.
e.g. for a 1.6 GHz CPU i.e. 1.6e9 cycle/s
138,084,000 byte/s = 11.6 cycle/byte = 741.6 cycle/block
(SHA-1's blocks are 512 bits, i.e. 64 bytes.)
However, there is no way to deduce the number of instructions,
unless you make an educated guess for the IPC.
So to fill in the missing parts for sha1,
cygwin(win32) ASM | Athlon 1.6ghz | 32bit | 172,897,000
linux ASM | Hammer 1.2ghz | 32bit | 162,574,000
linux ASM | Hammer 1.2ghz | 64bit | 177,902,000
linux ASM | Pentium4 2ghz | 32bit | 172,516,000
CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
Hammer 1.2ghz 64bit 177,902,000 1265 431 6.74 2.93
Hammer 1.2ghz 32bit 162,574,000 1353 472 7.38 2.86
Athlon 1.6ghz 32bit 172,897,000 1353 592 9.25 2.28
Pentium4 2.0ghz 32bit 172,516,000 1353 741 11.59 1.82
The code sequence for Hammer-32, Athlon and Pentium4 is the same in each
case, so Hammer is quite a bit better at re-ordering things than the
Athlon. I have code sequences for 586, 686, 786, k7 and k8_32, and the
best on all of these chips is the 'k7' version (my k8_32 was a dead end :-(.
The hammer does a very good job at untangling the x86 instruction stream
on things like the shaXXX algorithms.
eric
Yuch, you're saying that even with 64 bit operations, SHA256 is almost
3x slower on Hammer than SHA-1 is? I'd always heard that SHA256's
slowness on earlier x86 was an artifact of trying to do 64 bit
operations on 32 bit processors. But apparently a 64 bit processor
doesn't help much.
You are mixing sha256, which is all 32bit, with sha512, which all 64bit.
For sha512, with ASM, I get
CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
Hammer 1.2ghz 64bit 116,080,000 3699 1323 10.34 2.79
I have not even contemplated trying to do a 32bit x86 version of sha1
:-). sha512 is definitely faster than sha256 on 64bit machines.
eric
If only I could proof read.....
eric
> I looked at it before, and I think unrolling can remove unnecessary copies
> in the second loop. (You did not include those in your lower bound, so this
> should not affect the count.)
Hmmm. I didn't count any loop overhead, my computation implies full
unrolling (no loops at all). Would you mind expanding your comment
about the second loop?
> Also, perhaps full unrolling of the algorithm could eliminate some of the
> computations? You only take h at the end, right?
At the end of one iteration, a through h are added back to the
previous hash vector. Perhaps I can shave a store or two...
[ My true email address is my nickname at free.fr ]