Dan Bernstein wrote:
> There was some discussion of SHA-256 performance on sci.crypt in 2003.
> Eric Young said he had code using 2891 instructions/block and running at
> 15.53 cycles/byte on an Opteron. The consensus was that 2824 arithmetic
> instructions were necessary; that means at least 14.7 cycles/byte on any
> current CPU.
Thanks for pointing to that discussion. After reading it, I thought of a
possible optimization that improves the 2824 instructions bound by 2*63 to
2698, and the theoretical speed limit to 14.1 cpb. From the newsgroup post
(http://groups.google.com/group/sci.crypt/browse_thread/thread/84bfde56e8ac6047/):
> Maj(x,y,z)
>
> We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
>
> Maj() then requires 3 XOR and 1 AND. We also need to make
> a copy of either x or y and a copy of either x or z.
We can instead write Maj(x,y,z) = y ^ ((x ^ y) & (y ^ z)). Then after
computing x^y, cache it until the next round, where it can be reused as y^z.
This reduces the number of XOR by 1, and the number of copies by 1, for
every round except the first.
> On the other hand, I haven't seen a serious analysis of the
> possibility of combining these instructions into vector instructions.
I found this: http://islab.oregonstate.edu/papers/04Aciicmez.html. The
abstract says:
> The SIMD (single-instruction, multiple-data) architecture is implemented
> in
> many popular general-purpose processor families, including Intel Pentium.
> In
> this paper, we examine if any performance gains can be obtained in the
> implementation of the Secure Hash Algorithms (SHA-1, SHA-256, SHA-384, and
> SHA-512) and present the details of our optimization methods and
> implementation details. It turns out that while the hashing of a single
> message is slower, the hashing of 2 or 4 independent message streams can
> be
> made significantly faster.
Unfortunately the full text of this thesis is not posted online.