I read the paper by Dieter Schmidt on Blowfish. He said that the first 64 bits of the key are ignored for the third and fourth round.
Although this isn't an attack, is there any evidence that this discovery would make it easier to extend any of the other attacks, such as the Rijmen attack against 4 rounds, the Kelsey attack against 3 rounds, or the Vaudenay attack against weak keys?
rzel1...@yahoo.com wrote: > I read the paper by Dieter Schmidt on Blowfish. He said that the first > 64 bits of the key are ignored for the third and fourth round.
No way, and the second 64 bits are ignored in the first and second round!!!
...
You do realize that all key bits influence the sbox generation right?
Tom St Denis wrote: > rzel1...@yahoo.com wrote: > > I read the paper by Dieter Schmidt on Blowfish. He said that the > first > > 64 bits of the key are ignored for the third and fourth round.
> No way, and the second 64 bits are ignored in the first and second > round!!!
> ...
> You do realize that all key bits influence the sbox generation right?
> Tom
So then the paper was a hoax? Maybe I misquoted it. It says that the subkeys for the third and fourth round do not depend on the first 64 bits of the user key.
rzel1...@yahoo.com wrote: > Tom St Denis wrote: > > rzel1...@yahoo.com wrote: > > > I read the paper by Dieter Schmidt on Blowfish. He said that the > > first > > > 64 bits of the key are ignored for the third and fourth round.
> > No way, and the second 64 bits are ignored in the first and second > > round!!!
> > ...
> > You do realize that all key bits influence the sbox generation right?
> > Tom
> So then the paper was a hoax? Maybe I misquoted it. It says that the > subkeys for the third and fourth round do not depend on the first 64 > bits of the user key.
Five seconds with LibTomCrypt disproves the result...
#include "tomcrypt.h"
int main(void) { symmetric_key skey; unsigned char key[16];
Tom St Denis wrote: > > So then the paper was a hoax? Maybe I misquoted it. It says that the > > subkeys for the third and fourth round do not depend on the first 64 > > bits of the user key.
> Five seconds with LibTomCrypt disproves the result...
Some careful thinking later... What they're saying about P1/P2 may be true but the conclusion is not.
It is not correct that all of the round keys do not depend on the first 64-bits of the key. My demo disproves that.
It may be true that they don't depend on P1/P2 specifically but if your key is shorter than 448-bits [which I'd suggest it should be] you will repeat the internal key. So really if you use 128-256 bit keys with Blowfish this is of little to no consequence.
Better yet though use Twofish or AES instead which are faster and more modern.
> Some careful thinking later... What they're saying about P1/P2 may be > true but the conclusion is not.
> It is not correct that all of the round keys do not depend on the first > 64-bits of the key. My demo disproves that.
I used to use a Blowfish variant where the whole key schedule was done twice (once with the usual S and P tables initialized with hex digits of pi, and a second time starting with the result of the first pass). This was endorsed by Bruce and guarantees total avalanche through the entire S and P tables. I agree with you that Blowfish isn't of much interest any more and people should use AES instead (not sure about Twofish).
Lately I'm interested in Serpent, which did poorly in the AES competition because the designers used twice as many rounds as they felt they needed to provide good security, while the other entrants didn't do the same. For a personal project that needs high speed encryption, I'm thinking of using Serpent reduced to 16 rounds. Also, from Pentium II timings, it looks like using MMX speeds up Serpent considerably despite any parallelism in the P2's normal integer pipes. I think the P3/P4/Athlon won't have radically more parallelism than the P2 had.
Paul Rubin wrote: > "Tom St Denis" <tomstde...@gmail.com> writes: > > Some careful thinking later... What they're saying about P1/P2 may be > > true but the conclusion is not.
> > It is not correct that all of the round keys do not depend on the first > > 64-bits of the key. My demo disproves that.
> I used to use a Blowfish variant where the whole key schedule was done > twice (once with the usual S and P tables initialized with hex digits > of pi, and a second time starting with the result of the first pass). > This was endorsed by Bruce and guarantees total avalanche through the > entire S and P tables. I agree with you that Blowfish isn't of much > interest any more and people should use AES instead (not sure about > Twofish).
A "mini-twofish" would be better. the fact the round function isn't bijective is annoying...
Twofish is a decent cipher on its own. AES is better because it's an SPN but if I had a choice between memcpy and twofish ... ;-)
> Lately I'm interested in Serpent, which did poorly in the AES > competition because the designers used twice as many rounds as they > felt they needed to provide good security, while the other entrants > didn't do the same. For a personal project that needs high speed > encryption, I'm thinking of using Serpent reduced to 16 rounds. Also, > from Pentium II timings, it looks like using MMX speeds up Serpent > considerably despite any parallelism in the P2's normal integer pipes. > I think the P3/P4/Athlon won't have radically more parallelism than > the P2 had.
If you need high speed over short messages [and are not going standard] why not use RC5-12/32/b. It's very fast [~100 cycles per block on an AMD64] and why attacks exist are not applicable to short messages.
> > I used to use a Blowfish variant where the whole key schedule was...
> A "mini-twofish" would be better. the fact the round function isn't > bijective is annoying...
This was before the Twofish era. I'd have certainly used AES in that application if it had been available back then.
> If you need high speed over short messages [and are not going standard] > why not use RC5-12/32/b. It's very fast [~100 cycles per block on an > AMD64] and why attacks exist are not applicable to short messages.
I want to avoid patents, and I'm not concerned about short messages (I want to encrypt 4KB blocks)). That makes it feasible to do things like use AES to create an RC4 key schedule and then encrypt the block with RC4. So I'm hoping for something faster than RC4. If I understand RC5 nomenclature, RC5/12/32 means a 64-bit blocksize, so 100 cycles/block would be 12.5 cycles/byte, which is pretty good, but slower than RC4. I'd like to think that bitsliced 16-round Serpent using 128-bit SSE2 operations can beat that speed.
rzel1...@yahoo.com wrote: > Why do you suggest keys shorter than 448 bits? Is there some security > disadvantage of having a key size too big?
The human brain cannot remember that number of random digits and most people would make mistakes typing the number in. However using emails typing may no longer be required.
Paul Rubin wrote: > "Tom St Denis" <tomstde...@gmail.com> writes: > > > I used to use a Blowfish variant where the whole key schedule was...
> > A "mini-twofish" would be better. the fact the round function isn't > > bijective is annoying...
> This was before the Twofish era. I'd have certainly used AES in that > application if it had been available back then.
CAST5? hehehe
> > If you need high speed over short messages [and are not going standard] > > why not use RC5-12/32/b. It's very fast [~100 cycles per block on an > > AMD64] and why attacks exist are not applicable to short messages.
> I want to avoid patents, and I'm not concerned about short messages (I > want to encrypt 4KB blocks)). That makes it feasible to do things > like use AES to create an RC4 key schedule and then encrypt the block > with RC4. So I'm hoping for something faster than RC4. If I > understand RC5 nomenclature, RC5/12/32 means a 64-bit blocksize, so > 100 cycles/block would be 12.5 cycles/byte, which is pretty good, but > slower than RC4. I'd like to think that bitsliced 16-round Serpent > using 128-bit SSE2 operations can beat that speed.
On an AMD64 with it's 16 GPRs you could probably do 3-4 blocks at once and interleave the instructions to get some parallelism.
Tom St Denis wrote: > On an AMD64 with it's 16 GPRs you could probably do 3-4 blocks at once > and interleave the instructions to get some parallelism.
Using the new CTR accelerator framework I'm adding to LTC I wrote a 2x RC5 core [which largely just uses the RC5 code ... mmm re-use]. I get 49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x improvement. The %cl register is a bottleneck [which with register renaming is alleviated] so 3x/4x may not be much better.
Single RC5 in CTR mode takes around 100 cycles/block.
Anyways this demo serves two purposes...
1. To show off RC5 in 2x CTR mode [it's designed to be compatible with RC5 in CTR mode normally].
2. To show off how easy it is to write accelerators for LTC. ;-)
> Using the new CTR accelerator framework I'm adding to LTC I wrote a 2x > RC5 core [which largely just uses the RC5 code ... mmm re-use]. I get > 49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x > improvement. The %cl register is a bottleneck [which with register > renaming is alleviated] so 3x/4x may not be much better.
That's pretty fast. What's 2x mean? You're doing two RC5 operations in a single loop with separate registers, to get more parallelism? Are there two barrel shifters, or do you just schedule it carefully?
I think I'd rather stay away from the data dependent rotates, to avoid both patents and timing attacks. Serpent is really nice--pure combinational logic in bitslice mode.
Paul Rubin wrote: > "Tom St Denis" <tomstde...@gmail.com> writes: > > Using the new CTR accelerator framework I'm adding to LTC I wrote a 2x > > RC5 core [which largely just uses the RC5 code ... mmm re-use]. I get > > 49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x > > improvement. The %cl register is a bottleneck [which with register > > renaming is alleviated] so 3x/4x may not be much better.
> That's pretty fast. What's 2x mean? You're doing two RC5 operations > in a single loop with separate registers, to get more parallelism? > Are there two barrel shifters, or do you just schedule it carefully?
CTR requires you encrypt a counter, so I'm encrypting "2i" and "2i + 1" at the same time.
As for "scheduling" it's C code [with possibly unaligned mem access but otherwise portable]. The trick is that modern cpus can rename registers. e.g. the rotate value is stored in %cl but the value of %cl is short lived. so the cpu can just map it to a temp register.
The cpu decodes the instructions and sends them into instruction queues but since the value of "edx" [this is gas syntax so the destination is on the right] is short lived so let's rename %edx in the first case to %t0 for "the first temp" we get
The net result of the code after the last instruction is %edx has the correct value [e.g. (esi+4)+eax] but the code can be run in parallel. This is how the x86 series of processors gets any sort of performance with only 6 or so GPRs. The K8 [amd64] solves this a bit with the 16 GPRs but even it has register renaming support.
I would be very surprised if you manually renamed %edx in the first case to say %ebx that the code would run any faster on the Athlon k7/k8 processors.
> I think I'd rather stay away from the data dependent rotates, to avoid > both patents and timing attacks. Serpent is really nice--pure > combinational logic in bitslice mode.
You're missing the point. You want todo SSE2 optimized Serpent and I'm saying you'll have a nice API to plug that into [e.g. use the cool LibTomCrypt API that will call your optimized Serpent code]. See we all win ;-)
like my "49 cycles/block" is from the my generic testing which calls ctr_encrypt() [a generic CTR routine that you use todo CTR with *ANY* cipher]. Because the rc5_2x descriptor has a CTR accelerator my code calls it instead of emulating CTR mode with a single block ECB. The net result is a 2x performance boost.
> As for "scheduling" it's C code [with possibly unaligned mem access but > otherwise portable]. The trick is that modern cpus can rename > registers. e.g. the rotate value is stored in %cl but the value of %cl > is short lived. so the cpu can just map it to a temp register.
OK, did you happen to count the instructions in the compiler output to see how many cpi you got? That would give some indication of whether SSE helps or not.
> You're missing the point. You want todo SSE2 optimized Serpent and I'm > saying you'll have a nice API to plug that into [e.g. use the cool > LibTomCrypt API that will call your optimized Serpent code]. See we > all win ;-)
This is good to know, though for this particular application (rather special purpose), a semi-heavyweight general purpose library is sort of overkill.
> As for "scheduling" it's C code [with possibly unaligned mem access but > otherwise portable]. The trick is that modern cpus can rename > registers. e.g. the rotate value is stored in %cl but the value of %cl > is short lived. so the cpu can just map it to a temp register.
Hmm, did you happen to count the instructions in the compiler output and compute the cpi? That would say how much real parallelism you're getting.
> You're missing the point. You want todo SSE2 optimized Serpent and I'm > saying you'll have a nice API to plug that into [e.g. use the cool > LibTomCrypt API that will call your optimized Serpent code]. See we > all win ;-)
Thanks, that's good to know in general. For the particular specialized application I have in mind though, I'm better off with something more bare bones.
Paul Rubin wrote: > "Tom St Denis" <tomstde...@gmail.com> writes: > > As for "scheduling" it's C code [with possibly unaligned mem access but > > otherwise portable]. The trick is that modern cpus can rename > > registers. e.g. the rotate value is stored in %cl but the value of %cl > > is short lived. so the cpu can just map it to a temp register.
> Hmm, did you happen to count the instructions in the compiler output > and compute the cpi? That would say how much real parallelism you're > getting.
I didn't. Roughly speaking 12 rounds of RC5 requires *at least* 24 rotates, 26 additions, 24 XORs. So at a very minimum it's 74 instructions per block. So roughly speaking I'm getting 74/49 = 1.5 instructions per clock.
The original code was actually less than 1 cycle per clock and mostly because there was a huge dependency chain forming around the rotates.
I'll try doing 4x later on tonight [my testbench uses 4KiB blocks so as long as I divide that I'm all good... ;-)] and see what it gets up to.
> > You're missing the point. You want todo SSE2 optimized Serpent and I'm > > saying you'll have a nice API to plug that into [e.g. use the cool > > LibTomCrypt API that will call your optimized Serpent code]. See we > > all win ;-)
> Thanks, that's good to know in general. For the particular > specialized application I have in mind though, I'm better off with > something more bare bones.
Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)
> I didn't. Roughly speaking 12 rounds of RC5 requires *at least* 24 > rotates, 26 additions, 24 XORs. So at a very minimum it's 74 > instructions per block. So roughly speaking I'm getting 74/49 = 1.5 > instructions per clock.
> The original code was actually less than 1 cycle per clock and mostly > because there was a huge dependency chain forming around the rotates.
Interesting.
> I'll try doing 4x later on tonight [my testbench uses 4KiB blocks so as > long as I divide that I'm all good... ;-)] and see what it gets up to.
You mean 4 loops using the extra a64 registers? Hmm, might save you a little loop overhead but it would sure be nice if you could use 64-bit operations somehow.
> Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)
The hardware isn't the problem, I just don't want to hair up the code so much.
Paul Rubin wrote: > > The original code was actually less than 1 cycle per clock and mostly > > because there was a huge dependency chain forming around the rotates.
> Interesting.
hmm. rather.
Sorry... couldn't resist...
> > I'll try doing 4x later on tonight [my testbench uses 4KiB blocks so as > > long as I divide that I'm all good... ;-)] and see what it gets up to.
> You mean 4 loops using the extra a64 registers? Hmm, might save you a > little loop overhead but it would sure be nice if you could use 64-bit > operations somehow.
Problem is they are not fixed rotates. The x86 [even the amd64] still uses %cl for the rotate count. As annoying that is....
> > Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)
> The hardware isn't the problem, I just don't want to hair up the code > so much.
So far ~910 satifisfied downloaders of LTC 1.00 [they'll like 1.01 I'm sure] and no complaints.
As an aside... some of my downloads are from weird places like Goldman Sachs, the USDA, Ford, etc...
Anyone got a clue why folks like that would be downloading crypto?