> I suggest working on this simplified problem first. If you cannot make
> your software adder do 32 adds in comparable time to the time to do the
> 32 adds consecutively, you are going to have trouble making up the time
> anywhere else.
Spot-on analysis, I was at that point about a month ago (except that I
used 6-gate adder). But I still made a program to calculate a whole
circuit and got suprising results.
SHA-256 has
7*64 + 48*3 + 8 = 600 add operations
6*64 + 4*48 = 576 bitwise rotations
2*48 = 96 bitwise shifts
12*64 + 4*48 = 960 bitwise operations
(number might be slightly different for different implementation)
Together that would be 2232 operations, only 600 of which are additions.
But additions make the meat of cryptographic diffusion and require many
logic gates to implement.
On a CPU which requires one cycle per each operation it takes 2232*32 =
71424 cycles to compute 32 SHA-256 outputs.
Logic gate implementation doesn't need to implement rotations and shifts
as they just reorder inputs.
So what's left is 600 * 160 + 960 * 32 = 126720.
(It can be further reduced a bit through constant propagation of zeroes
from shifts.)
So we have 71424 against 126720. That's worse, but not five times worse,
not even twice worse.
Furthermore, some CPU architectures do not implement rightrotate
natively and have to implement it though three operations (two shifts
and OR). They would require 3384*32 = 108288 operations, which is pretty
close.
So now if I could shave something like 20% through circuit minimization
I could compete with traditional implementation at least on some CPUs,
in theory. (Logic circuit, however, requires many temporary variables
which are reused so it either needs a CPU with lots of registers or some
further optimizations.)
Note that I'm interested in something like a preimage attack rather than
computing SHA-256 for arbitrary inputs (actually I'm trying to optimize
bitcoin 'mining' which uses an analog of preimage attack for proof-of-work).
In this case we can assume 736 of 768 bit variables fixed and only
32-bit 'nonce' value changing. (Of course, it is possible to leave even
less variables changing if that helps.) So it's possible to do a
constant propagation optimization just once for 2^32 computations.
Constant propagation is applicable to a traditional implementation as
well, but to a different extent.
For example, XOR of two variables are not different from XOR of variable
and constant when we speak about 32-bit words. But in logic circuit
implementation XOR with a constant corresponds to negating some inputs
while leaving others intact, so it is essentially optimized away.
Furthermore, it's possible to eliminate some of those negations through
de Morgan's law or double negation elimination.
Same thing ADD, when one addend is a constant it needs less than 160 bit
operations.
Maybe having rightrotates eliminated helps with constant propagation
too, although I'm not sure about this.
Other optimizations are possible as well. For example, precomputation:
you can notice that some nodes near 'fringe' depend only on few input
variables. For a node which depends on 8 bit variables only 256
different values are possible, thus it is possible to do that
computation in advance and store it in a table instead of doing it in
each of 2^32 SHA-256 computations.
This approach can be generalized with application of binary decision
diagrams -- they make use both of limited number of dependencies and
symmetries within the truth table.
But scope of these optimizations is rather different, so I'm exploring
other options.
Say, representing circuit with only ANDs and ORs has a nice property
that you can get 1 only if one of inputs is 1. So there is a possibility
to skip through certain circuits prematurely.
>> For some optimizations I might want to have independent computation
>> pieces which do not interact with each other, I wonder how many
>> additional operations it would require.
> You could get this sort of benefit much more simply by loop unrolling in
> a conventional implementation.
Well, I'm interested in other sort of benefit :)
1. Dependency elimination would mean that less loads/stores are required
which improves performance on architecture with limited number of
registers and expensive loads/stores. I want to know a trade-off between
doing redundant computations and eliminating loads/stores.
2. There are other ways to go through all possible input combinations,
such as operating on parts of truth table represented in one way or
another. I've found that when computation is represented in disjunctive
normal form it's possible to do very few operations per one bit of a
truth table because some paths are eliminated early. But then there is
problem to break whole computation into a set of DNFs in an optimal way.
E.g. for DNF: ab + ac + ad + bc + bd we can know that result is zero of
both a and b are zero, not even considering values of c and d.
Now suppose that we represent a truth table for each variable as a list
of numbers of rows where they equal 1.
If we know that a is first set on row 7 we can skip considering all
conjunctions involving a (ab+ac+ad) up to bit 7.