SHA Maj trick

230 views
Skip to first unread message

Wei Dai

unread,
Mar 27, 2009, 10:15:43 PM3/27/09
to crypto-op...@googlegroups.com
[This is a repost that was originally written for the NIST SHA-3 forum.]

Dan Bernstein wrote:
> There was some discussion of SHA-256 performance on sci.crypt in 2003.
> Eric Young said he had code using 2891 instructions/block and running at
> 15.53 cycles/byte on an Opteron. The consensus was that 2824 arithmetic
> instructions were necessary; that means at least 14.7 cycles/byte on any
> current CPU.

Thanks for pointing to that discussion. After reading it, I thought of a
possible optimization that improves the 2824 instructions bound by 2*63 to
2698, and the theoretical speed limit to 14.1 cpb. From the newsgroup post
(http://groups.google.com/group/sci.crypt/browse_thread/thread/84bfde56e8ac6047/):

> Maj(x,y,z)
>
> We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
>
> Maj() then requires 3 XOR and 1 AND. We also need to make
> a copy of either x or y and a copy of either x or z.

We can instead write Maj(x,y,z) = y ^ ((x ^ y) & (y ^ z)). Then after
computing x^y, cache it until the next round, where it can be reused as y^z.
This reduces the number of XOR by 1, and the number of copies by 1, for
every round except the first.

> On the other hand, I haven't seen a serious analysis of the
> possibility of combining these instructions into vector instructions.

I found this: http://islab.oregonstate.edu/papers/04Aciicmez.html. The
abstract says:

> The SIMD (single-instruction, multiple-data) architecture is implemented
> in
> many popular general-purpose processor families, including Intel Pentium.
> In
> this paper, we examine if any performance gains can be obtained in the
> implementation of the Secure Hash Algorithms (SHA-1, SHA-256, SHA-384, and
> SHA-512) and present the details of our optimization methods and
> implementation details. It turns out that while the hashing of a single
> message is slower, the hashing of 2 or 4 independent message streams can
> be
> made significantly faster.

Unfortunately the full text of this thesis is not posted online.

Reply all
Reply to author
Forward
0 new messages