# Serpent optimisation

227 views

### Kasra Nassiri

Apr 1, 2009, 6:28:29 AM4/1/09
to crypto-optimization
Hi,

I think using 8 mmx registers with addition of 128-bit registers we
could find a bitslice pattern (using Dag Arne Osvik's method) for the
s-boxes that is much shorter than the current implementations.

Brian Gladman has use mmx to process 2 parallel blocks, however, it
hasn't used new s-boxes which I think is where there is room for
further optimisation.

Does any one know of such undergone/undergoing work?
Any suggestion on how to write the search pattern?

With best regards,

### Thomas Pornin

Apr 1, 2009, 9:14:08 AM4/1/09
On Wed, Apr 01, 2009 at 03:28:29AM -0700, Kasra Nassiri wrote:
> Does any one know of such undergone/undergoing work?
> Any suggestion on how to write the search pattern?

If you have a m->n box, you may try to explore exhaustively small
circuits. For each output bit (j=1..n), you compute a 2^m-bit word t_j
which contains the value of output bit o_j for all of the 2^m possible
combinations of the m input bits. The words t_j are your targets.
Now, for each input bit (i=1..m), you compute words 2^m-bit words s_i
such that bit k of s_i is equal to the value of input bit number i
in the combination number k. The words s_i are the sources.

An example: you have a 3->2 box, defined as such:

000 -> 01
001 -> 10
010 -> 01
011 -> 11
100 -> 01
101 -> 00
110 -> 11
111 -> 00

There are eight combinations of the three input bits, shown above.
Combination 0 is "000". Combination 1 is "001". And so on. So the
three source words are:
s_1 = 00001111
s_2 = 00110011
s_3 = 01010101
(these are the columns in the patterns above).

The two target words are:
t_1 = 01010010
t_2 = 10111010

Then you enumerate circuits by adding boolean gates, in a big backtrack.
At each point, you have a number of words which are the values you may
get from your current circuit. Initially, you have only the source
words. You are looking for a circuit such that the set of values
includes the target words. The set of values represents all values you
may find on any wire in the circuit (i.e. the box inputs, and the output
of each gate).

When you have a circuit and its set of values, you add a gate by
using one or two of the values as inputs, and the gate evaluation
produces an extra value, which you add to the set. For instance,
from the initial situation, if I add a XOR between input 1 and 2,
I get a four-words set:

00001111
00110011
01010101
00111100

Then I can add an AND between values 2 and 4:

00001111
00110011
01010101
00111100
00110000

We can thus enumerate all small circuits, and for each of them produce
the set of values which can be extracted from the circuit. Ideally, we
would enumerate circuits in order of speed, where "speed" means how fast
the circuit is evaluated as CPU opcodes (in bitslicea mode), so it
depends on CPU characteristics such as latency and degree of internal
parallelism. An approximation would be to enumerate circuits by number
of gates, which is easier to program; the hope here is that a small
circuit is fast.

This leads to a backtrack search:

look_for_circuit(values, u)
if u == 0 then return
for all possible gates over values:
add gate g over two values -> yields extra value z
if "values+z" contain all of t_j then report match
look_for_circuit(values+z, u-1)

"values+z" here means "the set of values, plus the extra value z". The
"u" parameter is the number of gates that you allow yourself to add
to the current circuit.

The simple backtrack above examinates some circuits several times (e.g.
it looks at "OR(1,2)->4, XOR(1,3)->5" and at "XOR(1,3)->4, OR(1,2)->5"
which are really the same circuit) but this can be solved by using a
kind of lexicographic order over the gates, skipping those which have
already been tried. There is also the "NOT" gate which takes a single
input, which makes the code a bit more complex (you can also handle it
by an extra 111...11 word, the XOR computing the NOT). What kind of
gates you add depend on what your target architecture is (if you want to
use SSE2 registers, you use gates which are translated to a single SSE2
opcode).

Be warned that the number of small circuits increases _a lot_ with the
number of gates so you will not go very far. For instance, this
exhaustive search cannot compute a complete DES S-box, or not even a
6->1 "quarter of a DES S-box". Yet it may be worth a try.

--Thomas Pornin

### Wei Dai

Apr 2, 2009, 8:06:27 PM4/2/09
Dag Arne Osvik hasn't published his code for optimizing the Serpent s-box
calculations, but he did publish the resulting instruction sequences, at
http://www.osvik.no/serpent/. Those instruction sequences are probably close
to being optimal for SSE2 as well, since the only operation available in
SSE2 that his code doesn't use is AND NOT. In
http://www.osvik.no/pub/aes3.pdf, he also says that using more than 5
registers doesn't help.

### Cactus

Apr 3, 2009, 2:36:31 AM4/3/09
to crypto-optimization

On Apr 3, 1:06 am, "Wei Dai" <wei...@weidai.com> wrote:
> Dag Arne Osvik hasn't published his code for optimizing the Serpent s-box
> calculations, but he did publish the resulting instruction sequences, athttp://www.osvik.no/serpent/. Those instruction sequences are probably close
> to being optimal for SSE2 as well, since the only operation available in
> SSE2 that his code doesn't use is AND NOT. Inhttp://www.osvik.no/pub/aes3.pdf, he also says that using more than 5
> registers doesn't help.

But I did publish my Serpent S-Box optimisation search code here:

This code is only sparsely documented but Sam Simpson and I published
S-Box functions found by using it.

### Kasra Nassir

Apr 4, 2009, 12:13:32 PM4/4/09
Now that I think about it maybe I could use your code to generate
Serpent s-boxes using 128-bit bitslice.

Kasra Nassiri

### Wei Dai

Apr 6, 2009, 6:58:42 PM4/6/09
It seems that in an SSE2 implementation of Serpent, most of the time will be
spent in the linear transformation, rather than the S-boxes. Here's an
estimate of the minimum instruction count. The linear transformation is
expressed as 6 rotates, 2 shifts, and 8 xors. Each rotate needs 4 SSE2
instructions (1 mov, 2 shifts, 1 xor) to implement, and each shift in the LT
also needs a mov. So the LT needs 6*4+2*2+8 = 36 instructions. Assuming that
the S-boxes average 15 instructions each and 4 instructions per round for
key mixing, we have an instruction count of 31*36 + 32*15 + 33*4 = 1728.

If we compute 12 Serpent blocks at once (using 3 sets of 5 128-bit
registers), we should be able to achieve the maximum 3 IPC of current x86
CPUs, taking 1728 cycles for 12 blocks. Serpent then has a cost of at least
(and probably not much more than) 1728/(12*16) = 9 cpb.

Can anyone see an improvement to this bound?

### Wei Dai

Apr 9, 2009, 8:13:45 PM4/9/09
It occurs to me that if we do a bitsliced implementation of Serpent in SSE2
(computing 128 blocks in parallel), the rotates and shifts in the linear
instructions for 4 blocks, it would take an average of only 31*7.7 + 32*15 +
33*4 = 850 instructions per 4 blocks, or around 4.43 cpb at an IPC of 3.
This doesn't account for loads and stores caused by register spills (which
will require some optimization to minimize), but it seems that Serpent may
finally be able to overtake Rijndael in performance, at least for long
messages in counter mode.

In fact it seems likely that Serpent is now the fastest AES finalist for
this kind of usage, since none of the other 3 can take good advantage of
SSE2 vector instructions.

### Kasra Nassir

Apr 9, 2009, 8:59:02 PM4/9/09
interesting, this is some news! I always liked Serpent more than
Rijndael (an opinion which I have expressed to Ross Anderson many
times).

About implementation of this method, are you going to include it in
Crypto++ any time soon, or should I go about implementing it?

Kasra Nassiri

### Wei Dai

Apr 9, 2009, 9:51:19 PM4/9/09
You should do it if you're interested. Crypto++ already has Salsa20 and a
couple of other fast stream ciphers, so I don't have too much motivation to
implement bitsliced Serpent myself. I may incorporate it into Crypto++ later
if you decide to make your code public domain.

--------------------------------------------------