SHA-256 performance on x86

56 views
Skip to first unread message

Shill

unread,
Oct 15, 2003, 1:44:50 PM10/15/03
to
This is my attempt to derive a lower bound for the number of
instructions required to perform one round of SHA-256 on a modern
x86 CPU.

sigma_i(x) and SIGMA_i(x)

These 4 functions require 3 ROR (or SHR) and 2 XOR.

They also require that we make 2 copies of the parameter,
since ROR and SHR will destroy it.

Total = 7 instructions

Ch(e,f,g)

We can write Ch(x,y,z) = (e & (y ^ z)) ^ z

Ch() then requires 2 XOR and 1 AND. We also need to make
a copy of either y or z.

Total = 4 instructions

Maj(x,y,z)

We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))

Maj() then requires 3 XOR and 1 AND. We also need to make
a copy of either x or y and a copy of either x or z.

Total = 6 instructions


First loop (prepare message schedule):

W[t] = sigma_1(W[t-2]) + W[t-7] + sigma_0(W[t-15]) + W[t-16]

Total = (7 + 7 + 3)*48 = 816 instructions


Second loop (is it called compress?):

T1 = SIGMA_1(e) + Ch(e,f,g) + h + W[t] + K[t]
T2 = SIGMA_0(a) + Maj(a,b,c)
d = d + T1
h = T1 + T2

Line1 = 7 + 4 + 4
Line2 = 7 + 6 + 1
Line3 = 1
Line4 = 1
Total = 31*64 = 1984 instructions

Therefore, if we use only integer code (I will investigate SIMD
later, improvements are unlikely) we need to execute *AT LEAST* 2800
instructions for one round of SHA-256.

Take an Athlon which can retire 3 instructions per cycle. Assume
ideal throughput, a round of SHA-256 requires AT LEAST 933 cycles.

933 cycles/block = 14.58 cycles/byte

This would translate to:
68,571,000 byte/s on a 1 GHz Athlon.
137,143,000 byte/s on a 2 GHz Athlon.

Any comments? Did I get the numbers right?

Cheers,

Shill


Matt Taylor

unread,
Oct 15, 2003, 9:28:04 PM10/15/03
to
"Shill" <dev...@kma.eu.org> wrote in message
news:3f8d87ec$0$27599$626a...@news.free.fr...

I looked at it before, and I think unrolling can remove unnecessary copies
in the second loop. (You did not include those in your lower bound, so this
should not affect the count.)

Also, perhaps full unrolling of the algorithm could eliminate some of the
computations? You only take h at the end, right?

-Matt


Eric Young

unread,
Oct 16, 2003, 5:21:05 AM10/16/03
to
Shill wrote:

I believe you are leaving out the endian swapping. Some-one needs to do
this and you may as well use bswap in the ASM code. There is also the
addition of the previous values at the end of the loop.

I have, in my implementation on x86_64, 2891 instructions in the inner
loop. For the x86 (I back-ported the x86_64 version), I have 3061,
mostly due to register pressure.

CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58

So while in theory things should take 933 cycles, the x86 architecture
has register issues, and instruction re-ordering issues. The
Hammer/Opteron/Athlon64 does a very good job at re-ordering x86
instructions, hence the IPC (Instructions Per Cycle) getting close to
the maximum of 3.

eric

Grumble

unread,
Oct 16, 2003, 9:51:35 AM10/16/03
to

Very true. I confess that I didn't want to deal with the details
right away ;-)

Let's add 16 BSWAP instructions to my lower bound computation.

Copying H to a-h and adding a-h back to H: assume I keep H inside 4
MMX registers (I will have to figure out how to hide the
store-to-load forwarding penalty). Each operation requires 4
instructions.

This brings my lower bound total to 2800+16+8 = 2824 instructions.
Assume optimal throughput, that's a lower bound (an optimum to shoot
for) of 941 cycles per 64-byte block.

> I have, in my implementation on x86_64, 2891 instructions in the inner
> loop. For the x86 (I back-ported the x86_64 version), I have 3061,
> mostly due to register pressure.

Wow! I am impressed... 2891 instructions, that is very close to the
lower bound I've derived above. Did you use SIMD instructions, or
just integer instructions? Did you use the full 64-bit registers or
just the lower 32-bits? In other words, did you take advantage of
the wider registers?

> CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
> Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
> Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
> Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
> Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58

It's great to actually see some numbers! Thanks.

> So while in theory things should take 933 cycles, the x86 architecture
> has register issues, and instruction re-ordering issues. The
> Hammer/Opteron/Athlon64 does a very good job at re-ordering x86
> instructions, hence the IPC (Instructions Per Cycle) getting close to
> the maximum of 3.

I wasn't saying it should take 933 cycles, I was only deriving a
lower bound :-)

BTW, I think my implementation on the Athlon is faster than yours. I
didn't put all the pieces together but I run the first loop in
approximately 350 cycles (I don't have my results with me) and the
second loop in 717 cycles. The total should come in under 1200
cycles ;-)

I'm still working on a genetic algorithm to find the optimal
instruction schedule. It's a cool little project :-)

Shill

Andrew Morgan

unread,
Oct 16, 2003, 3:00:34 PM10/16/03
to
Eric Young wrote:
> CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
> Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
> Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
> Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
> Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58

Do you have similar data for SHA-1?

Thanks
Andrew

Nudge

unread,
Oct 16, 2003, 5:08:45 PM10/16/03
to

Andrew Morgan

unread,
Oct 17, 2003, 3:26:52 AM10/17/03
to
Nudge wrote:
> He already has, in reply to a post you wrote...
> http://www.google.com/groups?selm=3f8d51ec$0$23594$5a62...@freenews.iinet.net.au
>

Sorry, let me be more specific. I am looking particularly for #inst,
cycles/bl and cycles/block.

Grumble

unread,
Oct 17, 2003, 4:52:43 AM10/17/03
to

I see.

You can derive the last two, since Eric provided throughput and
frequency.

e.g. for a 1.6 GHz CPU i.e. 1.6e9 cycle/s

138,084,000 byte/s = 11.6 cycle/byte = 741.6 cycle/block

(SHA-1's blocks are 512 bits, i.e. 64 bytes.)

However, there is no way to deduce the number of instructions,
unless you make an educated guess for the IPC.

Eric Young

unread,
Oct 17, 2003, 10:02:51 AM10/17/03
to

So to fill in the missing parts for sha1,
cygwin(win32) ASM | Athlon 1.6ghz | 32bit | 172,897,000
linux ASM | Hammer 1.2ghz | 32bit | 162,574,000
linux ASM | Hammer 1.2ghz | 64bit | 177,902,000
linux ASM | Pentium4 2ghz | 32bit | 172,516,000

CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC

Hammer 1.2ghz 64bit 177,902,000 1265 431 6.74 2.93
Hammer 1.2ghz 32bit 162,574,000 1353 472 7.38 2.86
Athlon 1.6ghz 32bit 172,897,000 1353 592 9.25 2.28
Pentium4 2.0ghz 32bit 172,516,000 1353 741 11.59 1.82

The code sequence for Hammer-32, Athlon and Pentium4 is the same in each
case, so Hammer is quite a bit better at re-ordering things than the
Athlon. I have code sequences for 586, 686, 786, k7 and k8_32, and the
best on all of these chips is the 'k7' version (my k8_32 was a dead end :-(.

The hammer does a very good job at untangling the x86 instruction stream
on things like the shaXXX algorithms.

eric

Paul Rubin

unread,
Oct 17, 2003, 11:16:55 AM10/17/03
to
Eric Young <eay_n...@pobox.com> writes:
> >>>CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
> >>>Hammer 1.2ghz 64bit 77,207,000 2891 994 15.53 2.90
> >>>Hammer 1.2ghz 32bit 71,727,000 3061 1070 16.71 2.86
> >>>Athlon 1.6ghz 32bit 78,104,000 3061 1311 20.48 2.33
> >>>Pentium4 2.0ghz 32bit 66,425,000 3061 1926 30.10 1.58
> ...

> So to fill in the missing parts for sha1,
> ...

> CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC
> Hammer 1.2ghz 64bit 177,902,000 1265 431 6.74 2.93
> Hammer 1.2ghz 32bit 162,574,000 1353 472 7.38 2.86
> Athlon 1.6ghz 32bit 172,897,000 1353 592 9.25 2.28
> Pentium4 2.0ghz 32bit 172,516,000 1353 741 11.59 1.82
> ...

> The hammer does a very good job at untangling the x86 instruction
> stream on things like the shaXXX algorithms.

Yuch, you're saying that even with 64 bit operations, SHA256 is almost
3x slower on Hammer than SHA-1 is? I'd always heard that SHA256's
slowness on earlier x86 was an artifact of trying to do 64 bit
operations on 32 bit processors. But apparently a 64 bit processor
doesn't help much.

Eric Young

unread,
Oct 17, 2003, 5:17:00 PM10/17/03
to

You are mixing sha256, which is all 32bit, with sha512, which all 64bit.
For sha512, with ASM, I get


CPU Speed mode bytes/sec #inst cycles/bl cycles/byte IPC

Hammer 1.2ghz 64bit 116,080,000 3699 1323 10.34 2.79

I have not even contemplated trying to do a 32bit x86 version of sha1
:-). sha512 is definitely faster than sha256 on 64bit machines.

eric

Eric Young

unread,
Oct 17, 2003, 6:25:30 PM10/17/03
to
Eric Young wrote:
> I have not even contemplated trying to do a 32bit x86 version of sha1
sha256

> :-). sha512 is definitely faster than sha256 on 64bit machines.

If only I could proof read.....

eric


Shill

unread,
Oct 18, 2003, 12:29:35 PM10/18/03
to Matt Taylor
Matt Taylor wrote:

> I looked at it before, and I think unrolling can remove unnecessary copies
> in the second loop. (You did not include those in your lower bound, so this
> should not affect the count.)

Hmmm. I didn't count any loop overhead, my computation implies full
unrolling (no loops at all). Would you mind expanding your comment
about the second loop?

> Also, perhaps full unrolling of the algorithm could eliminate some of the
> computations? You only take h at the end, right?

At the end of one iteration, a through h are added back to the
previous hash vector. Perhaps I can shave a store or two...

[ My true email address is my nickname at free.fr ]

Reply all
Reply to author
Forward
0 new messages