Dieter Schmidt Blowfish Paper Results

rzel...@yahoo.com

unread,

Mar 27, 2005, 10:29:46 PM3/27/05

to

I read the paper by Dieter Schmidt on Blowfish. He said that the first
64 bits of the key are ignored for the third and fourth round.

Although this isn't an attack, is there any evidence that this
discovery would make it easier to extend any of the other attacks, such
as the Rijmen attack against 4 rounds, the Kelsey attack against 3
rounds, or the Vaudenay attack against weak keys?

Tom St Denis

unread,

Mar 28, 2005, 6:12:05 AM3/28/05

to

rzel...@yahoo.com wrote:
> I read the paper by Dieter Schmidt on Blowfish. He said that the
first
> 64 bits of the key are ignored for the third and fourth round.

No way, and the second 64 bits are ignored in the first and second
round!!!

...

You do realize that all key bits influence the sbox generation right?

Tom

rzel...@yahoo.com

unread,

Mar 28, 2005, 10:37:47 AM3/28/05

to

So then the paper was a hoax? Maybe I misquoted it. It says that the
subkeys for the third and fourth round do not depend on the first 64
bits of the user key.

Tom St Denis

unread,

Mar 28, 2005, 9:37:43 PM3/28/05

to

Five seconds with LibTomCrypt disproves the result...

#include "tomcrypt.h"

int main(void)
{
symmetric_key skey;
unsigned char key[16];

memset(key, 0, 16);
blowfish_setup(key, 16 ,0, &skey);
memset(key, 1, 8);
blowfish_setup(key, 16 ,0, &skey);
return 0;
}

Gives

Key[ 0] = 706d9fcc1792d23a
Key[ 2] = 2db9d714966e1439
Key[ 4] = ac21a76d8324e988
Key[ 6] = ac0dc9dd2c38f6b3
Key[ 8] = 70619520fa23ecbe
Key[10] = 17b2f676eba13a04
Key[12] = 8b949e617a147caf
Key[14] = 56ccc6b64461b24d
Key[16] = 7361e6a1196a7c43
Key[ 0] = d4f62ce183c464ce
Key[ 2] = 61b90b19303a9d85
Key[ 4] = d26617236c259b3f
Key[ 6] = 2e7ad6052567fcab
Key[ 8] = 6479c8b49fd5df0b
Key[10] = 78e7bdd01ea558e0
Key[12] = a601910958356c0d
Key[14] = 5b4608b08c801daa
Key[16] = 956fdab1534d5baf

As output. Both keys are completely different.

I edited my blowfish.c to have

for (x = 0; x < 18; x += 2) {
/* encrypt it */
blowfish_ecb_encrypt(B, B, skey);
/* copy it */
LOAD32H(skey->blowfish.K[x], &B[0]);
LOAD32H(skey->blowfish.K[x+1], &B[4]);
printf("Key[%2d] = %08lx%08lx\n", x, skey->blowfish.K[x],
skey->blowfish.K[x+1]);
}

So his claim that "the first two round keys don't affect the latter
round keys" is totally wrong.

Tom

Tom St Denis

unread,

Mar 28, 2005, 9:55:04 PM3/28/05

to

Tom St Denis wrote:
> > So then the paper was a hoax? Maybe I misquoted it. It says that
the
> > subkeys for the third and fourth round do not depend on the first
64
> > bits of the user key.
>
> Five seconds with LibTomCrypt disproves the result...

Some careful thinking later... What they're saying about P1/P2 may be
true but the conclusion is not.

It is not correct that all of the round keys do not depend on the first
64-bits of the key. My demo disproves that.

It may be true that they don't depend on P1/P2 specifically but if your
key is shorter than 448-bits [which I'd suggest it should be] you will
repeat the internal key. So really if you use 128-256 bit keys with
Blowfish this is of little to no consequence.

Better yet though use Twofish or AES instead which are faster and more
modern.

Tom

Paul Rubin

unread,

Mar 28, 2005, 10:01:13 PM3/28/05

to

"Tom St Denis" <tomst...@gmail.com> writes:
> Some careful thinking later... What they're saying about P1/P2 may be
> true but the conclusion is not.
>
> It is not correct that all of the round keys do not depend on the first
> 64-bits of the key. My demo disproves that.

I used to use a Blowfish variant where the whole key schedule was done
twice (once with the usual S and P tables initialized with hex digits
of pi, and a second time starting with the result of the first pass).
This was endorsed by Bruce and guarantees total avalanche through the
entire S and P tables. I agree with you that Blowfish isn't of much
interest any more and people should use AES instead (not sure about
Twofish).

Lately I'm interested in Serpent, which did poorly in the AES
competition because the designers used twice as many rounds as they
felt they needed to provide good security, while the other entrants
didn't do the same. For a personal project that needs high speed
encryption, I'm thinking of using Serpent reduced to 16 rounds. Also,
from Pentium II timings, it looks like using MMX speeds up Serpent
considerably despite any parallelism in the P2's normal integer pipes.
I think the P3/P4/Athlon won't have radically more parallelism than
the P2 had.

Tom St Denis

unread,

Mar 28, 2005, 10:12:43 PM3/28/05

to

Paul Rubin wrote:
> "Tom St Denis" <tomst...@gmail.com> writes:
> > Some careful thinking later... What they're saying about P1/P2 may
be
> > true but the conclusion is not.
> >
> > It is not correct that all of the round keys do not depend on the
first
> > 64-bits of the key. My demo disproves that.
>
> I used to use a Blowfish variant where the whole key schedule was
done
> twice (once with the usual S and P tables initialized with hex digits
> of pi, and a second time starting with the result of the first pass).
> This was endorsed by Bruce and guarantees total avalanche through the
> entire S and P tables. I agree with you that Blowfish isn't of much
> interest any more and people should use AES instead (not sure about
> Twofish).

A "mini-twofish" would be better. the fact the round function isn't
bijective is annoying...

Twofish is a decent cipher on its own. AES is better because it's an
SPN but if I had a choice between memcpy and twofish ... ;-)

> Lately I'm interested in Serpent, which did poorly in the AES
> competition because the designers used twice as many rounds as they
> felt they needed to provide good security, while the other entrants
> didn't do the same. For a personal project that needs high speed
> encryption, I'm thinking of using Serpent reduced to 16 rounds.
Also,
> from Pentium II timings, it looks like using MMX speeds up Serpent
> considerably despite any parallelism in the P2's normal integer
pipes.
> I think the P3/P4/Athlon won't have radically more parallelism than
> the P2 had.

If you need high speed over short messages [and are not going standard]
why not use RC5-12/32/b. It's very fast [~100 cycles per block on an
AMD64] and why attacks exist are not applicable to short messages.

Tom

Paul Rubin

unread,

Mar 28, 2005, 10:25:27 PM3/28/05

to

"Tom St Denis" <tomst...@gmail.com> writes:

> > I used to use a Blowfish variant where the whole key schedule was...

>
> A "mini-twofish" would be better. the fact the round function isn't
> bijective is annoying...

This was before the Twofish era. I'd have certainly used AES in that
application if it had been available back then.

> If you need high speed over short messages [and are not going standard]
> why not use RC5-12/32/b. It's very fast [~100 cycles per block on an
> AMD64] and why attacks exist are not applicable to short messages.

I want to avoid patents, and I'm not concerned about short messages (I
want to encrypt 4KB blocks)). That makes it feasible to do things
like use AES to create an RC4 key schedule and then encrypt the block
with RC4. So I'm hoping for something faster than RC4. If I
understand RC5 nomenclature, RC5/12/32 means a 64-bit blocksize, so
100 cycles/block would be 12.5 cycles/byte, which is pretty good, but
slower than RC4. I'd like to think that bitsliced 16-round Serpent
using 128-bit SSE2 operations can beat that speed.

rzel...@yahoo.com

unread,

Mar 29, 2005, 1:29:34 AM3/29/05

to

Why do you suggest keys shorter than 448 bits? Is there some security
disadvantage of having a key size too big?

Tom St Denis

unread,

Mar 29, 2005, 6:05:04 AM3/29/05

to

rzel...@yahoo.com wrote:
> Why do you suggest keys shorter than 448 bits? Is there some security
> disadvantage of having a key size too big?

Is there a security advantage?

Tom

Andrew Swallow

unread,

Mar 29, 2005, 6:42:28 AM3/29/05

to

rzel...@yahoo.com wrote:

> Why do you suggest keys shorter than 448 bits? Is there some security
> disadvantage of having a key size too big?
>

The human brain cannot remember that number of random digits and most
people would make mistakes typing the number in. However using emails
typing may no longer be required.

Andrew Swallow

Tom St Denis

unread,

Mar 29, 2005, 12:26:29 PM3/29/05

to

Paul Rubin wrote:
> "Tom St Denis" <tomst...@gmail.com> writes:
> > > I used to use a Blowfish variant where the whole key schedule
was...
> >
> > A "mini-twofish" would be better. the fact the round function
isn't
> > bijective is annoying...
>
> This was before the Twofish era. I'd have certainly used AES in that
> application if it had been available back then.

CAST5? hehehe

> > If you need high speed over short messages [and are not going
standard]
> > why not use RC5-12/32/b. It's very fast [~100 cycles per block on
an
> > AMD64] and why attacks exist are not applicable to short messages.
>
> I want to avoid patents, and I'm not concerned about short messages
(I
> want to encrypt 4KB blocks)). That makes it feasible to do things
> like use AES to create an RC4 key schedule and then encrypt the block
> with RC4. So I'm hoping for something faster than RC4. If I
> understand RC5 nomenclature, RC5/12/32 means a 64-bit blocksize, so
> 100 cycles/block would be 12.5 cycles/byte, which is pretty good, but
> slower than RC4. I'd like to think that bitsliced 16-round Serpent
> using 128-bit SSE2 operations can beat that speed.

On an AMD64 with it's 16 GPRs you could probably do 3-4 blocks at once
and interleave the instructions to get some parallelism.

;-)

Tom

Tom St Denis

unread,

Mar 29, 2005, 7:04:32 PM3/29/05

to

Tom St Denis wrote:
> On an AMD64 with it's 16 GPRs you could probably do 3-4 blocks at
once
> and interleave the instructions to get some parallelism.

Using the new CTR accelerator framework I'm adding to LTC I wrote a 2x
RC5 core [which largely just uses the RC5 code ... mmm re-use]. I get
49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x
improvement. The %cl register is a bottleneck [which with register
renaming is alleviated] so 3x/4x may not be much better.

Single RC5 in CTR mode takes around 100 cycles/block.

Anyways this demo serves two purposes...

1. To show off RC5 in 2x CTR mode [it's designed to be compatible with
RC5 in CTR mode normally].

2. To show off how easy it is to write accelerators for LTC. ;-)

Tom

Paul Rubin

unread,

Mar 29, 2005, 9:32:30 PM3/29/05

to

"Tom St Denis" <tomst...@gmail.com> writes:

> Using the new CTR accelerator framework I'm adding to LTC I wrote a 2x
> RC5 core [which largely just uses the RC5 code ... mmm re-use]. I get
> 49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x
> improvement. The %cl register is a bottleneck [which with register
> renaming is alleviated] so 3x/4x may not be much better.

That's pretty fast. What's 2x mean? You're doing two RC5 operations
in a single loop with separate registers, to get more parallelism?
Are there two barrel shifters, or do you just schedule it carefully?

I think I'd rather stay away from the data dependent rotates, to avoid
both patents and timing attacks. Serpent is really nice--pure
combinational logic in bitslice mode.

Tom St Denis

unread,

Mar 29, 2005, 9:45:18 PM3/29/05

to

Paul Rubin wrote:
> "Tom St Denis" <tomst...@gmail.com> writes:
> > Using the new CTR accelerator framework I'm adding to LTC I wrote a
2x
> > RC5 core [which largely just uses the RC5 code ... mmm re-use]. I
get
> > 49 cycles/block [6.2 cycles/byte] in CTR mode which is nearly a 2x
> > improvement. The %cl register is a bottleneck [which with register
> > renaming is alleviated] so 3x/4x may not be much better.
>
> That's pretty fast. What's 2x mean? You're doing two RC5 operations
> in a single loop with separate registers, to get more parallelism?
> Are there two barrel shifters, or do you just schedule it carefully?

CTR requires you encrypt a counter, so I'm encrypting "2i" and "2i + 1"
at the same time.

As for "scheduling" it's C code [with possibly unaligned mem access but
otherwise portable]. The trick is that modern cpus can rename
registers. e.g. the rotate value is stored in %cl but the value of %cl
is short lived. so the cpu can just map it to a temp register.

A quick example of this would be

movl (%esi),%edx
addl %eax,%edx
movl %edx,(%edi)

movl (%esi+4),%edx
addl %eax,%edx
movl %edx,(%edi+4)

The cpu decodes the instructions and sends them into instruction queues
but since the value of "edx" [this is gas syntax so the destination is
on the right] is short lived so let's rename %edx in the first case to
%t0 for "the first temp" we get

movl (%esi),%t0
addl %eax,%t0
movl %t0,(%edi)

movl (%esi+4),%edx
addl %eax,%edx
movl %edx,(%edi+4)

But now the cpu can also pair them

movl (%esi),%t0 || movl (%esi+4),%edx
addl %eax,%t0 || addl %eax,%edx
movl %t0,(%edi) || movl %edx,(%edi)

The net result of the code after the last instruction is %edx has the
correct value [e.g. (esi+4)+eax] but the code can be run in parallel.
This is how the x86 series of processors gets any sort of performance
with only 6 or so GPRs. The K8 [amd64] solves this a bit with the 16
GPRs but even it has register renaming support.

I would be very surprised if you manually renamed %edx in the first
case to say %ebx that the code would run any faster on the Athlon k7/k8
processors.

> I think I'd rather stay away from the data dependent rotates, to
avoid
> both patents and timing attacks. Serpent is really nice--pure
> combinational logic in bitslice mode.

You're missing the point. You want todo SSE2 optimized Serpent and I'm
saying you'll have a nice API to plug that into [e.g. use the cool
LibTomCrypt API that will call your optimized Serpent code]. See we
all win ;-)

like my "49 cycles/block" is from the my generic testing which calls
ctr_encrypt() [a generic CTR routine that you use todo CTR with *ANY*
cipher]. Because the rc5_2x descriptor has a CTR accelerator my code
calls it instead of emulating CTR mode with a single block ECB. The
net result is a 2x performance boost.

Tom

Paul Rubin

unread,

Mar 29, 2005, 11:22:39 PM3/29/05

to

"Tom St Denis" <tomst...@gmail.com> writes:
> As for "scheduling" it's C code [with possibly unaligned mem access but
> otherwise portable]. The trick is that modern cpus can rename
> registers. e.g. the rotate value is stored in %cl but the value of %cl
> is short lived. so the cpu can just map it to a temp register.

OK, did you happen to count the instructions in the compiler output to
see how many cpi you got? That would give some indication of whether
SSE helps or not.

> You're missing the point. You want todo SSE2 optimized Serpent and I'm
> saying you'll have a nice API to plug that into [e.g. use the cool
> LibTomCrypt API that will call your optimized Serpent code]. See we
> all win ;-)

This is good to know, though for this particular application (rather
special purpose), a semi-heavyweight general purpose library is sort
of overkill.

Paul Rubin

unread,

Mar 30, 2005, 2:58:54 AM3/30/05

to

"Tom St Denis" <tomst...@gmail.com> writes:
> As for "scheduling" it's C code [with possibly unaligned mem access but
> otherwise portable]. The trick is that modern cpus can rename
> registers. e.g. the rotate value is stored in %cl but the value of %cl
> is short lived. so the cpu can just map it to a temp register.

Hmm, did you happen to count the instructions in the compiler output
and compute the cpi? That would say how much real parallelism you're
getting.

> You're missing the point. You want todo SSE2 optimized Serpent and I'm
> saying you'll have a nice API to plug that into [e.g. use the cool
> LibTomCrypt API that will call your optimized Serpent code]. See we
> all win ;-)

Thanks, that's good to know in general. For the particular
specialized application I have in mind though, I'm better off with
something more bare bones.

Tom St Denis

unread,

Mar 30, 2005, 6:17:46 AM3/30/05

to

Paul Rubin wrote:
> "Tom St Denis" <tomst...@gmail.com> writes:
> > As for "scheduling" it's C code [with possibly unaligned mem access
but
> > otherwise portable]. The trick is that modern cpus can rename
> > registers. e.g. the rotate value is stored in %cl but the value of
%cl
> > is short lived. so the cpu can just map it to a temp register.
>
> Hmm, did you happen to count the instructions in the compiler output
> and compute the cpi? That would say how much real parallelism you're
> getting.

I didn't. Roughly speaking 12 rounds of RC5 requires *at least* 24
rotates, 26 additions, 24 XORs. So at a very minimum it's 74
instructions per block. So roughly speaking I'm getting 74/49 = 1.5
instructions per clock.

The original code was actually less than 1 cycle per clock and mostly
because there was a huge dependency chain forming around the rotates.

I'll try doing 4x later on tonight [my testbench uses 4KiB blocks so as
long as I divide that I'm all good... ;-)] and see what it gets up to.

> > You're missing the point. You want todo SSE2 optimized Serpent and
I'm
> > saying you'll have a nice API to plug that into [e.g. use the cool
> > LibTomCrypt API that will call your optimized Serpent code]. See
we
> > all win ;-)
>
> Thanks, that's good to know in general. For the particular
> specialized application I have in mind though, I'm better off with
> something more bare bones.

Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)

Tom

Paul Rubin

unread,

Mar 30, 2005, 6:31:03 AM3/30/05

to

"Tom St Denis" <tomst...@gmail.com> writes:
> I didn't. Roughly speaking 12 rounds of RC5 requires *at least* 24
> rotates, 26 additions, 24 XORs. So at a very minimum it's 74
> instructions per block. So roughly speaking I'm getting 74/49 = 1.5
> instructions per clock.
>
> The original code was actually less than 1 cycle per clock and mostly
> because there was a huge dependency chain forming around the rotates.

Interesting.

> I'll try doing 4x later on tonight [my testbench uses 4KiB blocks so as
> long as I divide that I'm all good... ;-)] and see what it gets up to.

You mean 4 loops using the extra a64 registers? Hmm, might save you a
little loop overhead but it would sure be nice if you could use 64-bit
operations somehow.

> Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)

The hardware isn't the problem, I just don't want to hair up the code
so much.

Tom St Denis

unread,

Mar 30, 2005, 7:13:25 AM3/30/05

to

Paul Rubin wrote:
> > The original code was actually less than 1 cycle per clock and
mostly
> > because there was a huge dependency chain forming around the
rotates.
>
> Interesting.

hmm. rather.

Sorry... couldn't resist...

> > I'll try doing 4x later on tonight [my testbench uses 4KiB blocks
so as
> > long as I divide that I'm all good... ;-)] and see what it gets up
to.
>
> You mean 4 loops using the extra a64 registers? Hmm, might save you
a
> little loop overhead but it would sure be nice if you could use
64-bit
> operations somehow.

Problem is they are not fixed rotates. The x86 [even the amd64] still
uses %cl for the rotate count. As annoying that is....

> > Bullocks. Can't get much more barebones than a WG511T or PS2 ;-)
>
> The hardware isn't the problem, I just don't want to hair up the code
> so much.

So far ~910 satifisfied downloaders of LTC 1.00 [they'll like 1.01 I'm
sure] and no complaints.

As an aside... some of my downloads are from weird places like Goldman
Sachs, the USDA, Ford, etc...

Anyone got a clue why folks like that would be downloading crypto?

Tom