> So it looks like boolean function implementation has 32x advantage
> EXCEPT slow + operator. I doubt that having cheap + gives 32+x advantage.
Well, it turns out that the "boolean function" approach is viable,
although not as good as I had hoped.
Full SHA-256 (64 rounds plus block expansion) requires 121438 operations
(DAG nodes: AND, XOR, OR, NOT).
But when 480 of the 512 input bits are fixed to constants (32 left
free), only about 86000 operations remain after optimization.
This is a lot, but on a 64-bit CPU I can try 64 combinations (of the
lower 6 bits) at once, since it does AND, XOR, OR, NOT on 64 bits in
parallel. SSE can do 128 bits at once, or even 256 bits with AVX2.
Now it works out to the equivalent of 1343 operations per combination
(86000 / 64).
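The bit-parallel trick works like this: one 64-bit word holds a single
input bit across all 64 candidate messages, so one AND/XOR/OR/NOT
evaluates that gate for all 64 candidates at once. A minimal sketch on a
toy gate network (the gates and variable names here are hypothetical,
not actual SHA-256 nodes):

```python
MASK = (1 << 64) - 1

# Slice constants for the 6 varying low bits: bit j of SLICE[i]
# is bit i of candidate number j (j = 0..63).
SLICE = [sum(((j >> i) & 1) << j for j in range(64)) for i in range(6)]

def eval_gates(x):
    """Evaluate a toy 3-gate DAG bit-parallel over 64-bit words."""
    a = x[0] & x[1]           # AND gate
    b = (x[2] ^ x[3]) & MASK  # XOR gate
    return (~(a | b)) & MASK  # NOR of the two

# Fixed input bits would be all-zeros or all-ones words; the varying
# bits use the SLICE constants.
out = eval_gates([SLICE[0], SLICE[1], SLICE[2], SLICE[3]])

# Cross-check candidate 37 against a plain scalar evaluation.
j = 37
bits = [(j >> i) & 1 for i in range(6)]
assert (out >> j) & 1 == 1 - ((bits[0] & bits[1]) | (bits[2] ^ bits[3]))
```

The same code vectorizes directly to 128/256-bit registers: only the
word width and the slice constants change.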
On the other hand, traditional SHA-256 requires:
19 and/or/xor/+ operations per round
+ 6 right-rotates per round, each a single ror instruction on x86,
thus 6 more operations
= 25*64 = 1600 operations
+ 21 operations per expanded word in the message schedule (taking
ror32 as 3 operations where there is no native rotate, AFAIK), thus
(64-16)*21 = 1008 operations
So, together 2608 operations. (Or 166912 operations per 64 combinations.)
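For reference, these counts come straight from the standard FIPS 180-4
SHA-256 structure: Ch, Maj and the two big Sigmas plus adds per round,
and the two small sigmas plus three adds per expanded word. A compact
Python transcription, checked against hashlib:

```python
from math import isqrt
import hashlib

M32 = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M32

def _primes(n):
    ps, c = [], 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def _icbrt(n):
    # Exact integer cube root (Newton's method), to derive K precisely.
    x = 1 << -(-n.bit_length() // 3)
    while True:
        y = (2 * x + n // (x * x)) // 3
        if y >= x:
            return x
        x = y

P = _primes(64)
K = [_icbrt(p << 96) & M32 for p in P]       # frac(cbrt(prime)) * 2^32
H0 = [isqrt(p << 64) & M32 for p in P[:8]]   # frac(sqrt(prime)) * 2^32

def sha256(msg):
    h = list(H0)
    bits = len(msg) * 8
    msg = msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64) \
        + bits.to_bytes(8, "big")
    for off in range(0, len(msg), 64):
        w = [int.from_bytes(msg[off + 4*i: off + 4*i + 4], "big")
             for i in range(16)]
        for t in range(16, 64):          # expansion: 2 sigmas + 3 adds
            s0 = rotr(w[t-15], 7) ^ rotr(w[t-15], 18) ^ (w[t-15] >> 3)
            s1 = rotr(w[t-2], 17) ^ rotr(w[t-2], 19) ^ (w[t-2] >> 10)
            w.append((w[t-16] + s0 + w[t-7] + s1) & M32)
        a, b, c, d, e, f, g, hh = h
        for t in range(64):              # round: Ch, Maj, 2 Sigmas, adds
            S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
            ch = (e & f) ^ (~e & g & M32)
            t1 = (hh + S1 + ch + K[t] + w[t]) & M32
            S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            hh, g, f, e, d, c, b, a = \
                g, f, e, (d + t1) & M32, c, b, a, (t1 + S0 + maj) & M32
        h = [(x + y) & M32 for x, y in zip(h, (a, b, c, d, e, f, g, hh))]
    return b"".join(x.to_bytes(4, "big") for x in h)

assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

Counting the rotates, shifts, boolean ops and adds in the two inner
loops reproduces the per-round and per-word figures above.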
Of course, a modern superscalar CPU can execute more than one operation
per cycle, and SIMD (SSE, AVX) can do several 32-bit operations at
once. But the same optimizations apply to the boolean function
implementation, except that it might require more loads and stores.
Thus:
* when no variables are fixed, we have 121k operations against 167k,
but the 167k require almost no loads/stores, so the traditional code
might still be faster
* fixing a number of variables cuts the boolean function case down to
86k operations, so it might be viable (roughly half the operations),
but then it requires dynamic code generation (JIT compilation). It
isn't rocket science, but it requires some work.
* boolean function based implementation can be further optimized
through:
* * pre-computation of some nodes at the 'fringe', close to the input
block material -- ones which depend on a limited number of variables.
Such nodes can be replaced with lookup tables, BDDs, disjunctive normal
forms or whatever, and thus their dependencies can be shaved off.
* * conditionally removing portions of the graph which do not affect
the result. E.g. if we have a node AND(x1, x2) and x1 is FALSE, we do
not need to compute x2. Some nodes at the fringe are likely to be zero
or one, so it makes sense to compute their dependencies conditionally.
* * tuning number of fixed variables
* * straight boolean function optimization, and use of combined ops
where the CPU provides them, like AND-NOT (pandn/andn) etc.
* * early rejection -- there is no need to compute everything before we
can rule out combination(s), and I think it is possible to find the
most informative nodes through a combination of statistics and boolean
function transformations
* I have tried only the pre-computation approach so far, and it appears
that at least 5k operations out of the 86k can be optimized away that
way. Likely more with deeper pre-computation.
So this approach is a potentially viable optimization if you want to
run a SHA-256 pre-image attack on a CPU, although the gain probably
isn't huge. I don't know whether it can be used on a GPU (likely there
will be some obstacle like a code size limit) or in hardware (FPGA).
It isn't interesting for bitcoin mining, because mining computes
SHA256(SHA256(block_header))
and the second SHA-256 invocation cannot be optimized through
pre-computation. So even if the first SHA-256 becomes twice as fast,
the second stays the same or gets slower, so in the best case it is
only 25% faster. Hardly worth the effort when people mine on GPUs.
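The 25% figure is just this arithmetic, normalizing one SHA-256 pass to
one time unit:

```python
# Best case: first SHA-256 pass becomes 2x faster, second is unchanged.
t_plain = 1.0 + 1.0   # two full SHA-256 passes per double-hash
t_opt = 0.5 + 1.0     # halved first pass + unchanged second
saving = (t_plain - t_opt) / t_plain
assert saving == 0.25  # at best 25% less time per double-hash
```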