circuit minization

Captain Obvious

unread,

Jan 15, 2012, 7:54:19 AM1/15/12

to

hi

I'm experimenting with calculating a cryptographic hash function (SHA-256)
in an unusual way: instead of computing this function in terms of 32-bit
words, the way it is defined and is usually implemented, I represent a whole
function as a logic circuit and compute it this way.

It is wasteful because 32-bit addition (and SHA-256 has a lot of it)
requires lot of logic gates, but on the other hand a lot of optimization
become possible, e.g. computing function for 32 different inputs at once. So
it might actually be faster.

I've got to a point where it appears viable, but I lack a knowledge to do
further optimizations like circuit minimization.

I know there are general algorithms, but optimizing whole thing at once seem
untractable.

I'd rather try smaller parts, but I totally lack knowledge and intuition on
what is optimizable and what isn't. So I'm looking for help here.

As I said, SHA-256 needs lots of 32-bit addition operations, e.g. here are
four of them in a row:

t1 := h + s1 + ch + k[i] + w[i]

I implement it as a ripple carry adder:

http://en.wikipedia.org/wiki/Adder_(electronics)#Ripple_carry_adder

as it has minimum number of operations (as far as I understand), and I don't
care about delay as I do it in software.
But is it still optimal for doing many additions at once?

Next, as I understand adders which have lower delay require more logic
gates, but how much more, what is the tradeoff?
For some optimizations I might want to have independent computation pieces
which do not interact with each other, I wonder how many additional
operations it would require.

Then, adder is usually implemented with operations XOR, AND, OR. But to
enable certain optimizations, I've converted my circuit to a form which has
only AND and OR and double number of input (i.e. X1 and NOT(X1) are
different inputs). I did this conversion mostly mechanically, but maybe
there is some clever special form of adder which requires only ANDs and ORs?

Also, I did conversion mentioned above to implement certain high-level
optimizations, but I can use XOR and NOT operations in software. Is there a
way to find places where additional operations will 'speed up' things?

(I've posted to this newsgroup because I've found discussion of circuit
minization in archive. Maybe there better places to discuss this, I
dunno...)

Patricia Shanahan

unread,

Jan 15, 2012, 2:30:21 PM1/15/12

to

On 1/15/2012 4:54 AM, Captain Obvious wrote:
> hi
>
> I'm experimenting with calculating a cryptographic hash function
> (SHA-256) in an unusual way: instead of computing this function in
> terms of 32-bit words, the way it is defined and is usually
> implemented, I represent a whole function as a logic circuit and
> compute it this way.
>
> It is wasteful because 32-bit addition (and SHA-256 has a lot of it)
> requires lot of logic gates, but on the other hand a lot of
> optimization become possible, e.g. computing function for 32
> different inputs at once. So it might actually be faster.

...

Consider a simpler, but related, problem: adding 32 pairs of 32 bit numbers.

Doing it using the processor's integer add, operating on one pair of
operands at a time, takes 32 adds.

A full adder has 5 bit-wise operations, so doing it in software
operating on all 32 pairs of operands at once, takes almost 160 bit-wise
operations. There is a small saving if the least significant bit
carry-in is always zero and you don't need the most significant bit
carry-out.

If your version has a different number of operations, multiply the
actual number of operations per full adder by 32 to get the number for
doing the 32 adds in parallel, then subtract any operations you can skip
on the first and last iterations.

Now you need to look at the performance data for the processor you are
using. How long does it take to do the ripple adder? How long to do 32
consecutive adds? My guess is that you are going to find that the ripple
adder takes longer even under the best of circumstances.

I suggest working on this simplified problem first. If you cannot make
your software adder do 32 adds in comparable time to the time to do the
32 adds consecutively, you are going to have trouble making up the time
anywhere else.

> For some optimizations I might want to have independent computation
> pieces which do not interact with each other, I wonder how many
> additional operations it would require.

You could get this sort of benefit much more simply by loop unrolling in
a conventional implementation.

Patricia

Captain Obvious

unread,

Jan 16, 2012, 6:42:10 AM1/16/12

to

> I suggest working on this simplified problem first. If you cannot make
> your software adder do 32 adds in comparable time to the time to do the
> 32 adds consecutively, you are going to have trouble making up the time
> anywhere else.

Spot-on analysis, I was at that point about a month ago (except that I
used 6-gate adder). But I still made a program to calculate a whole
circuit and got suprising results.

SHA-256 has
7*64 + 48*3 + 8 = 600 add operations
6*64 + 4*48 = 576 bitwise rotations
2*48 = 96 bitwise shifts
12*64 + 4*48 = 960 bitwise operations
(number might be slightly different for different implementation)

Together that would be 2232 operations, only 600 of which are additions.
But additions make the meat of cryptographic diffusion and require many
logic gates to implement.

On a CPU which requires one cycle per each operation it takes 2232*32 =
71424 cycles to compute 32 SHA-256 outputs.

Logic gate implementation doesn't need to implement rotations and shifts
as they just reorder inputs.
So what's left is 600 * 160 + 960 * 32 = 126720.
(It can be further reduced a bit through constant propagation of zeroes
from shifts.)

So we have 71424 against 126720. That's worse, but not five times worse,
not even twice worse.

Furthermore, some CPU architectures do not implement rightrotate
natively and have to implement it though three operations (two shifts
and OR). They would require 3384*32 = 108288 operations, which is pretty
close.

So now if I could shave something like 20% through circuit minimization
I could compete with traditional implementation at least on some CPUs,
in theory. (Logic circuit, however, requires many temporary variables
which are reused so it either needs a CPU with lots of registers or some
further optimizations.)

Note that I'm interested in something like a preimage attack rather than
computing SHA-256 for arbitrary inputs (actually I'm trying to optimize
bitcoin 'mining' which uses an analog of preimage attack for proof-of-work).

In this case we can assume 736 of 768 bit variables fixed and only
32-bit 'nonce' value changing. (Of course, it is possible to leave even
less variables changing if that helps.) So it's possible to do a
constant propagation optimization just once for 2^32 computations.

Constant propagation is applicable to a traditional implementation as
well, but to a different extent.

For example, XOR of two variables are not different from XOR of variable
and constant when we speak about 32-bit words. But in logic circuit
implementation XOR with a constant corresponds to negating some inputs
while leaving others intact, so it is essentially optimized away.
Furthermore, it's possible to eliminate some of those negations through
de Morgan's law or double negation elimination.

Same thing ADD, when one addend is a constant it needs less than 160 bit
operations.

Maybe having rightrotates eliminated helps with constant propagation
too, although I'm not sure about this.

Other optimizations are possible as well. For example, precomputation:
you can notice that some nodes near 'fringe' depend only on few input
variables. For a node which depends on 8 bit variables only 256
different values are possible, thus it is possible to do that
computation in advance and store it in a table instead of doing it in
each of 2^32 SHA-256 computations.

This approach can be generalized with application of binary decision
diagrams -- they make use both of limited number of dependencies and
symmetries within the truth table.

But scope of these optimizations is rather different, so I'm exploring
other options.

Say, representing circuit with only ANDs and ORs has a nice property
that you can get 1 only if one of inputs is 1. So there is a possibility
to skip through certain circuits prematurely.

>> For some optimizations I might want to have independent computation
>> pieces which do not interact with each other, I wonder how many
>> additional operations it would require.

> You could get this sort of benefit much more simply by loop unrolling in
> a conventional implementation.

Well, I'm interested in other sort of benefit :)

1. Dependency elimination would mean that less loads/stores are required
which improves performance on architecture with limited number of
registers and expensive loads/stores. I want to know a trade-off between
doing redundant computations and eliminating loads/stores.

2. There are other ways to go through all possible input combinations,
such as operating on parts of truth table represented in one way or
another. I've found that when computation is represented in disjunctive
normal form it's possible to do very few operations per one bit of a
truth table because some paths are eliminated early. But then there is
problem to break whole computation into a set of DNFs in an optimal way.

E.g. for DNF: ab + ac + ad + bc + bd we can know that result is zero of
both a and b are zero, not even considering values of c and d.

Now suppose that we represent a truth table for each variable as a list
of numbers of rows where they equal 1.

If we know that a is first set on row 7 we can skip considering all
conjunctions involving a (ab+ac+ad) up to bit 7.

Captain Obvious

unread,

Jan 16, 2012, 6:54:41 AM1/16/12

to

> Patricia

Thanks for your reply, by the way -- it have motivated me to go through
all numbers one more time as I've got suspicious there is a bug
somewhere, and now I have more certainty about what I'm dealing with.

Patricia Shanahan

unread,

Jan 16, 2012, 3:54:26 PM1/16/12

to

On 1/16/2012 3:42 AM, Captain Obvious wrote:
...

> On a CPU which requires one cycle per each operation it takes 2232*32 =
> 71424 cycles to compute 32 SHA-256 outputs.

Is one cycle per operation a reasonable assumption for whatever you
intend to run this on?

I think if you look a bit more closely at your processor you will find
it can do different numbers of different types of operations on each
cycle. It may affect the outcome either way. For example, if the
processor only has one shifter, it can only do one shift or rotate per
cycle.

Patricia

Captain Obvious

unread,

Jan 17, 2012, 4:35:39 AM1/17/12

to

>> On a CPU which requires one cycle per each operation it takes 2232*32 =
>> 71424 cycles to compute 32 SHA-256 outputs.

> Is one cycle per operation a reasonable assumption for whatever you
> intend to run this on?

Yes, GPUs generally work this way. They might or might not have native
1-cycle ROTR and MUX.
AMD's GPU has both, while NVIDIA's doesn't.

> I think if you look a bit more closely at your processor you will find
> it can do different numbers of different types of operations on each
> cycle. It may affect the outcome either way. For example, if the
> processor only has one shifter, it can only do one shift or rotate per
> cycle.

It's fairly hard to analyze performance on a superscalar CPU, especially
without a concrete implementation. I hope to get something which can at
least in theory be fast and then do a test of a concrete program.

Patricia Shanahan

unread,

Jan 17, 2012, 9:47:21 AM1/17/12

to

On 1/17/2012 1:35 AM, Captain Obvious wrote:
> >> On a CPU which requires one cycle per each operation it takes 2232*32 =
> >> 71424 cycles to compute 32 SHA-256 outputs.
>
> > Is one cycle per operation a reasonable assumption for whatever you
> > intend to run this on?
>
> Yes, GPUs generally work this way. They might or might not have native
> 1-cycle ROTR and MUX.
> AMD's GPU has both, while NVIDIA's doesn't.

I'm glad to see you are looking at GPUs. They are the best solution for
fast data parallel work these days.

>
> > I think if you look a bit more closely at your processor you will find
> > it can do different numbers of different types of operations on each
> > cycle. It may affect the outcome either way. For example, if the
> > processor only has one shifter, it can only do one shift or rotate per
> > cycle.
>
> It's fairly hard to analyze performance on a superscalar CPU, especially
> without a concrete implementation. I hope to get something which can at
> least in theory be fast and then do a test of a concrete program.

"fairly hard" is a serious understatement. I spent many years analyzing
performance of multiprocessor systems using superscalar CPUs - before
the system had been built.

Patricia