If you have a m->n box, you may try to explore exhaustively small
circuits. For each output bit (j=1..n), you compute a 2^m-bit word t_j
which contains the value of output bit o_j for all of the 2^m possible
combinations of the m input bits. The words t_j are your targets.
Now, for each input bit (i=1..m), you compute words 2^m-bit words s_i
such that bit k of s_i is equal to the value of input bit number i
in the combination number k. The words s_i are the sources.
An example: you have a 3->2 box, defined as such:
000 -> 01
001 -> 10
010 -> 01
011 -> 11
100 -> 01
101 -> 00
110 -> 11
111 -> 00
There are eight combinations of the three input bits, shown above.
Combination 0 is "000". Combination 1 is "001". And so on. So the
three source words are:
s_1 = 00001111
s_2 = 00110011
s_3 = 01010101
(these are the columns in the patterns above).
The two target words are:
t_1 = 01010010
t_2 = 10111010
Then you enumerate circuits by adding boolean gates, in a big backtrack.
At each point, you have a number of words which are the values you may
get from your current circuit. Initially, you have only the source
words. You are looking for a circuit such that the set of values
includes the target words. The set of values represents all values you
may find on any wire in the circuit (i.e. the box inputs, and the output
of each gate).
When you have a circuit and its set of values, you add a gate by
using one or two of the values as inputs, and the gate evaluation
produces an extra value, which you add to the set. For instance,
from the initial situation, if I add a XOR between input 1 and 2,
I get a four-words set:
Then I can add an AND between values 2 and 4:
We can thus enumerate all small circuits, and for each of them produce
the set of values which can be extracted from the circuit. Ideally, we
would enumerate circuits in order of speed, where "speed" means how fast
the circuit is evaluated as CPU opcodes (in bitslicea mode), so it
depends on CPU characteristics such as latency and degree of internal
parallelism. An approximation would be to enumerate circuits by number
of gates, which is easier to program; the hope here is that a small
circuit is fast.
This leads to a backtrack search:
if u == 0 then return
for all possible gates over values:
add gate g over two values -> yields extra value z
if "values+z" contain all of t_j then report match
"values+z" here means "the set of values, plus the extra value z". The
"u" parameter is the number of gates that you allow yourself to add
to the current circuit.
The simple backtrack above examinates some circuits several times (e.g.
it looks at "OR(1,2)->4, XOR(1,3)->5" and at "XOR(1,3)->4, OR(1,2)->5"
which are really the same circuit) but this can be solved by using a
kind of lexicographic order over the gates, skipping those which have
already been tried. There is also the "NOT" gate which takes a single
input, which makes the code a bit more complex (you can also handle it
by an extra 111...11 word, the XOR computing the NOT). What kind of
gates you add depend on what your target architecture is (if you want to
use SSE2 registers, you use gates which are translated to a single SSE2
Be warned that the number of small circuits increases _a lot_ with the
number of gates so you will not go very far. For instance, this
exhaustive search cannot compute a complete DES S-box, or not even a
6->1 "quarter of a DES S-box". Yet it may be worth a try.
If we compute 12 Serpent blocks at once (using 3 sets of 5 128-bit
registers), we should be able to achieve the maximum 3 IPC of current x86
CPUs, taking 1728 cycles for 12 blocks. Serpent then has a cost of at least
(and probably not much more than) 1728/(12*16) = 9 cpb.
Can anyone see an improvement to this bound?
In fact it seems likely that Serpent is now the fastest AES finalist for
this kind of usage, since none of the other 3 can take good advantage of
SSE2 vector instructions.
About implementation of this method, are you going to include it in
Crypto++ any time soon, or should I go about implementing it?