209 views

Skip to first unread message

Apr 1, 2009, 6:28:29 AM4/1/09

to crypto-optimization

Hi,

I think using 8 mmx registers with addition of 128-bit registers we

could find a bitslice pattern (using Dag Arne Osvik's method) for the

s-boxes that is much shorter than the current implementations.

Brian Gladman has use mmx to process 2 parallel blocks, however, it

hasn't used new s-boxes which I think is where there is room for

further optimisation.

Does any one know of such undergone/undergoing work?

Any suggestion on how to write the search pattern?

With best regards,

I think using 8 mmx registers with addition of 128-bit registers we

could find a bitslice pattern (using Dag Arne Osvik's method) for the

s-boxes that is much shorter than the current implementations.

Brian Gladman has use mmx to process 2 parallel blocks, however, it

hasn't used new s-boxes which I think is where there is room for

further optimisation.

Does any one know of such undergone/undergoing work?

Any suggestion on how to write the search pattern?

With best regards,

Apr 1, 2009, 9:14:08 AM4/1/09

to crypto-op...@googlegroups.com

On Wed, Apr 01, 2009 at 03:28:29AM -0700, Kasra Nassiri wrote:

> Does any one know of such undergone/undergoing work?

> Any suggestion on how to write the search pattern?

> Does any one know of such undergone/undergoing work?

> Any suggestion on how to write the search pattern?

If you have a m->n box, you may try to explore exhaustively small

circuits. For each output bit (j=1..n), you compute a 2^m-bit word t_j

which contains the value of output bit o_j for all of the 2^m possible

combinations of the m input bits. The words t_j are your targets.

Now, for each input bit (i=1..m), you compute words 2^m-bit words s_i

such that bit k of s_i is equal to the value of input bit number i

in the combination number k. The words s_i are the sources.

An example: you have a 3->2 box, defined as such:

000 -> 01

001 -> 10

010 -> 01

011 -> 11

100 -> 01

101 -> 00

110 -> 11

111 -> 00

There are eight combinations of the three input bits, shown above.

Combination 0 is "000". Combination 1 is "001". And so on. So the

three source words are:

s_1 = 00001111

s_2 = 00110011

s_3 = 01010101

(these are the columns in the patterns above).

The two target words are:

t_1 = 01010010

t_2 = 10111010

Then you enumerate circuits by adding boolean gates, in a big backtrack.

At each point, you have a number of words which are the values you may

get from your current circuit. Initially, you have only the source

words. You are looking for a circuit such that the set of values

includes the target words. The set of values represents all values you

may find on any wire in the circuit (i.e. the box inputs, and the output

of each gate).

When you have a circuit and its set of values, you add a gate by

using one or two of the values as inputs, and the gate evaluation

produces an extra value, which you add to the set. For instance,

from the initial situation, if I add a XOR between input 1 and 2,

I get a four-words set:

00001111

00110011

01010101

00111100

Then I can add an AND between values 2 and 4:

00001111

00110011

01010101

00111100

00110000

We can thus enumerate all small circuits, and for each of them produce

the set of values which can be extracted from the circuit. Ideally, we

would enumerate circuits in order of speed, where "speed" means how fast

the circuit is evaluated as CPU opcodes (in bitslicea mode), so it

depends on CPU characteristics such as latency and degree of internal

parallelism. An approximation would be to enumerate circuits by number

of gates, which is easier to program; the hope here is that a small

circuit is fast.

This leads to a backtrack search:

look_for_circuit(values, u)

if u == 0 then return

for all possible gates over values:

add gate g over two values -> yields extra value z

if "values+z" contain all of t_j then report match

look_for_circuit(values+z, u-1)

"values+z" here means "the set of values, plus the extra value z". The

"u" parameter is the number of gates that you allow yourself to add

to the current circuit.

The simple backtrack above examinates some circuits several times (e.g.

it looks at "OR(1,2)->4, XOR(1,3)->5" and at "XOR(1,3)->4, OR(1,2)->5"

which are really the same circuit) but this can be solved by using a

kind of lexicographic order over the gates, skipping those which have

already been tried. There is also the "NOT" gate which takes a single

input, which makes the code a bit more complex (you can also handle it

by an extra 111...11 word, the XOR computing the NOT). What kind of

gates you add depend on what your target architecture is (if you want to

use SSE2 registers, you use gates which are translated to a single SSE2

opcode).

Be warned that the number of small circuits increases _a lot_ with the

number of gates so you will not go very far. For instance, this

exhaustive search cannot compute a complete DES S-box, or not even a

6->1 "quarter of a DES S-box". Yet it may be worth a try.

--Thomas Pornin

Apr 2, 2009, 8:06:27 PM4/2/09

to crypto-op...@googlegroups.com

Dag Arne Osvik hasn't published his code for optimizing the Serpent s-box

calculations, but he did publish the resulting instruction sequences, at

http://www.osvik.no/serpent/. Those instruction sequences are probably close

to being optimal for SSE2 as well, since the only operation available in

SSE2 that his code doesn't use is AND NOT. In

http://www.osvik.no/pub/aes3.pdf, he also says that using more than 5

registers doesn't help.

calculations, but he did publish the resulting instruction sequences, at

http://www.osvik.no/serpent/. Those instruction sequences are probably close

to being optimal for SSE2 as well, since the only operation available in

SSE2 that his code doesn't use is AND NOT. In

http://www.osvik.no/pub/aes3.pdf, he also says that using more than 5

registers doesn't help.

Apr 3, 2009, 2:36:31 AM4/3/09

to crypto-optimization

On Apr 3, 1:06 am, "Wei Dai" <wei...@weidai.com> wrote:

> Dag Arne Osvik hasn't published his code for optimizing the Serpent s-box

> to being optimal for SSE2 as well, since the only operation available in

> SSE2 that his code doesn't use is AND NOT. Inhttp://www.osvik.no/pub/aes3.pdf, he also says that using more than 5
> registers doesn't help.

But I did publish my Serpent S-Box optimisation search code here:

http://gladman.me.uk/cryptography_technology/serpent/anal1.cpp

This code is only sparsely documented but Sam Simpson and I published

S-Box functions found by using it.

Brian Gladman

Apr 4, 2009, 12:13:32 PM4/4/09

to crypto-op...@googlegroups.com

Now that I think about it maybe I could use your code to generate

Serpent s-boxes using 128-bit bitslice.

Serpent s-boxes using 128-bit bitslice.

Kasra Nassiri

Apr 6, 2009, 6:58:42 PM4/6/09

to crypto-op...@googlegroups.com

It seems that in an SSE2 implementation of Serpent, most of the time will be

spent in the linear transformation, rather than the S-boxes. Here's an

estimate of the minimum instruction count. The linear transformation is

expressed as 6 rotates, 2 shifts, and 8 xors. Each rotate needs 4 SSE2

instructions (1 mov, 2 shifts, 1 xor) to implement, and each shift in the LT

also needs a mov. So the LT needs 6*4+2*2+8 = 36 instructions. Assuming that

the S-boxes average 15 instructions each and 4 instructions per round for

key mixing, we have an instruction count of 31*36 + 32*15 + 33*4 = 1728.

spent in the linear transformation, rather than the S-boxes. Here's an

estimate of the minimum instruction count. The linear transformation is

expressed as 6 rotates, 2 shifts, and 8 xors. Each rotate needs 4 SSE2

instructions (1 mov, 2 shifts, 1 xor) to implement, and each shift in the LT

also needs a mov. So the LT needs 6*4+2*2+8 = 36 instructions. Assuming that

the S-boxes average 15 instructions each and 4 instructions per round for

key mixing, we have an instruction count of 31*36 + 32*15 + 33*4 = 1728.

If we compute 12 Serpent blocks at once (using 3 sets of 5 128-bit

registers), we should be able to achieve the maximum 3 IPC of current x86

CPUs, taking 1728 cycles for 12 blocks. Serpent then has a cost of at least

(and probably not much more than) 1728/(12*16) = 9 cpb.

Can anyone see an improvement to this bound?

Apr 9, 2009, 8:13:45 PM4/9/09

to crypto-op...@googlegroups.com

It occurs to me that if we do a bitsliced implementation of Serpent in SSE2

(computing 128 blocks in parallel), the rotates and shifts in the linear

transform can be done for free by variable-renaming. So instead of 1728

instructions for 4 blocks, it would take an average of only 31*7.7 + 32*15 +

33*4 = 850 instructions per 4 blocks, or around 4.43 cpb at an IPC of 3.

This doesn't account for loads and stores caused by register spills (which

will require some optimization to minimize), but it seems that Serpent may

finally be able to overtake Rijndael in performance, at least for long

messages in counter mode.

(computing 128 blocks in parallel), the rotates and shifts in the linear

transform can be done for free by variable-renaming. So instead of 1728

instructions for 4 blocks, it would take an average of only 31*7.7 + 32*15 +

33*4 = 850 instructions per 4 blocks, or around 4.43 cpb at an IPC of 3.

This doesn't account for loads and stores caused by register spills (which

will require some optimization to minimize), but it seems that Serpent may

finally be able to overtake Rijndael in performance, at least for long

messages in counter mode.

In fact it seems likely that Serpent is now the fastest AES finalist for

this kind of usage, since none of the other 3 can take good advantage of

SSE2 vector instructions.

Apr 9, 2009, 8:59:02 PM4/9/09

to crypto-op...@googlegroups.com

interesting, this is some news! I always liked Serpent more than

Rijndael (an opinion which I have expressed to Ross Anderson many

times).

Rijndael (an opinion which I have expressed to Ross Anderson many

times).

About implementation of this method, are you going to include it in

Crypto++ any time soon, or should I go about implementing it?

Kasra Nassiri

Apr 9, 2009, 9:51:19 PM4/9/09

to crypto-op...@googlegroups.com

You should do it if you're interested. Crypto++ already has Salsa20 and a

couple of other fast stream ciphers, so I don't have too much motivation to

implement bitsliced Serpent myself. I may incorporate it into Crypto++ later

if you decide to make your code public domain.

--------------------------------------------------

From: "Kasra Nassir" <kasra....@googlemail.com>

Sent: Thursday, April 09, 2009 5:59 PM

To: <crypto-op...@googlegroups.com>

Subject: Re: Serpent optimisation

couple of other fast stream ciphers, so I don't have too much motivation to

implement bitsliced Serpent myself. I may incorporate it into Crypto++ later

if you decide to make your code public domain.

--------------------------------------------------

From: "Kasra Nassir" <kasra....@googlemail.com>

Sent: Thursday, April 09, 2009 5:59 PM

To: <crypto-op...@googlegroups.com>

Subject: Re: Serpent optimisation

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu