
The blowfish encryption algorithm, in CL [CODE]


Jochen Schmidt
Jun 24, 2002, 3:44:18 AM
Alain Picard wrote:

>
> Fellow lispers,
>
> I needed a simple, in-lisp (i.e. no FFI) encryption for an
> application, so I wrote the following. Perhaps it will be
> of some use to someone else.
>
> I'm posting this here because I don't have a public FTP site, and
> I'm not too sure where would be a good place to send this to.
> Certainly, either the CCLAN or CCLOCC (or both!) people are welcome
> to incorporate this into their offerings, if they desire.
>
> My thanks to Pierre Mai for posting an excellent MD5 implementation
> in lisp, from which I learned many things which went into this code.

I could offer to include it on my "Cryptography in Common Lisp" site
at http://www.dataheaven.de if you want.

It is rather minimalistic but it will certainly grow in the future. I've
contributed my "in-lisp" RSA implementation, math utilities, and my binding
to CL-SSL.

Some months ago I worked on a version of ARC4 (a stream cipher) but I have
to look it up on my hard disk first.

I remember there were implementations of RIPEMD and SHA too but I don't
remember the details.

ciao,
Jochen

--
http://www.dataheaven.de

Roland Kaufmann
Jun 24, 2002, 11:40:37 AM

A Common Lisp implementation of ARC4 can be found on
http://www.xs4all.nl/~cg/ciphersaber/
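
For the curious, the cipher itself is tiny. Here is a from-scratch sketch
of the well-known alleged-RC4 algorithm in portable CL (an illustration of
the cipher, not the code at that URL, and the names are made up here):

(defun arc4-key-schedule (key)
  ;; KEY is a non-empty vector of (unsigned-byte 8).
  ;; Returns the 256-byte permutation the keystream generator walks.
  (let ((s (make-array 256 :element-type '(unsigned-byte 8)))
        (j 0))
    (dotimes (i 256) (setf (aref s i) i))
    (dotimes (i 256)
      (setf j (mod (+ j (aref s i) (aref key (mod i (length key)))) 256))
      (rotatef (aref s i) (aref s j)))
    s))

(defun arc4-crypt (key octets)
  ;; XORs OCTETS with the ARC4 keystream; the same call decrypts.
  (let ((s (arc4-key-schedule key)) (i 0) (j 0))
    (map 'vector
         (lambda (byte)
           (setf i (mod (1+ i) 256)
                 j (mod (+ j (aref s i)) 256))
           (rotatef (aref s i) (aref s j))
           (logxor byte (aref s (mod (+ (aref s i) (aref s j)) 256))))
         octets)))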

best regards
Roland Kaufmann

Espen Vestre
Jun 25, 2002, 9:47:40 AM
Jochen Schmidt <j...@dataheaven.de> writes:

> I could offer to include it on my "Cryptography in Common Lisp" site
> at http://www.dataheaven.de if you want.

This is great :-)

I just implemented both blowfish and RSA myself (mostly because I
needed something small and light which could be cross-platform
portable without having to depend on external libraries), so I don't
need any of them right now, but whenever I might find some spare time,
I'll have a look and see if I can contribute anything at all to what
you guys have already done!

--
(espen)

Jochen Schmidt
Jun 25, 2002, 5:27:29 PM
Espen Vestre wrote:

I really would be interested in seeing your modular exponentiation
function, as this is the major bottleneck in my RSA implementation.

I think a fast, low-level tuned modular exponentiation function would be
a good enhancement to CL's numeric function repertoire. Many cryptographic
algorithms make use of it. It is possible to implement it fast by using
things like "Montgomery multiplication".

Erann Gat
Jun 25, 2002, 7:28:31 PM
(defun expt-mod-n (a x n)
  (let ((result 1))
    (while (> x 0)
      (if (oddp x)
          (setf result (mod (* result a) n)))
      (setf x (ash x -1))
      (setf a (mod (* a a) n)))
    result))

In article <afan3a$k3m$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
<j...@dataheaven.de> wrote:

Espen Vestre
Jun 26, 2002, 3:53:39 AM
Jochen Schmidt <j...@dataheaven.de> writes:

> I really would be interested in seeing your modular exponentiation
> function, as this is the major bottleneck in my RSA implementation.

It's not fast. It's the main bottleneck in my implementation, too.

> I think a fast, low-level tuned modular exponentiation function would be
> a good enhancement to CL's numeric function repertoire. Many cryptographic
> algorithms make use of it. It is possible to implement it fast by using
> things like "Montgomery multiplication".

I whole-heartedly agree!

--
(espen)

Jochen Schmidt
Jun 26, 2002, 4:28:23 AM
Erann Gat wrote:

> (defun expt-mod-n (a x n)
>   (let ((result 1))
>     (while (> x 0)
>       (if (oddp x)
>           (setf result (mod (* result a) n)))
>       (setf x (ash x -1))
>       (setf a (mod (* a a) n)))
>     result))

I rewrote it using (loop while (> x 0) ...)
because while is not part of ANSI CL.
I also added declarations.

(defun expt-mod-n (a x n)
  (declare (optimize (speed 3) (safety 0)))
  (let ((result 1))
    (loop while (> x 0)
          do (if (oddp x)
                 (setf result (mod (* result a) n)))
             (setf x (ash x -1))
             (setf a (mod (* a a) n)))
    result))

I compared it to mine:

(defun expt-mod (n exponent modulus)
  "As (mod (expt n exponent) modulus), but more efficient."
  (declare (optimize (speed 3) (safety 0) (space 0) (debug 0)))
  (loop with result = 1
        for i of-type fixnum from 0 below (integer-length exponent)
        for sqr = n then (mod (* sqr sqr) modulus)
        when (logbitp i exponent)
          do (setf result (mod (* result sqr) modulus))
        finally (return result)))
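
(A quick sanity check of both versions against the naive definition, with
arbitrarily chosen small values:)

(assert (= (expt-mod-n 5 117 19)
           (expt-mod 5 117 19)
           (mod (expt 5 117) 19)))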

Of course the result of integer-length can be bigger than a fixnum, but
not in the cryptographic applications I used it for. A more general CL
implementation could test the ranges of its parameters first and then
choose the best (and, more importantly, a correct) implementation.

The difference in LispWorks 4.2.6 was significant. (With mine being faster).

Here is my reasoning why this is so:
I think that the use of logbitp and integer-length enables the
implementation to reduce consing here. The ASH will allocate a new bignum
per cycle in your code, while increasing the bit counter i in mine will not.

There are two optimizations I see which would make modular exponentiation
faster. One is reusing allocated bignums. The result does not have to get
allocated over and over; the same goes for sqr. Both result and sqr, as
well as the intermediate results, are never bigger than twice the size of
the modulus. It would be nice if one could preallocate two "bignum buffers"
(for sqr and the result) and then let the arithmetic operations use those
buffers. I'm not sure how far dataflow optimization can deduce such reuse
of bignums, or whether any implementation does something like that.

The other optimization would be to use Montgomery representation for the
numbers. This would mean having a conversion step at the beginning and
then using "Montgomery multiplication" instead of (mod (* ...) modulus).
Once the result is calculated, one converts it back from Montgomery
representation to normal representation. This allows one to leave out
the MODs.
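
For illustration, here is a minimal portable sketch of that scheme (the
names are made up for this post, the modulus must be odd, and a serious
version would work destructively on the bignum digits instead of consing
fresh integers):

(defun modular-inverse (a m)
  ;; Inverse of A mod M via the extended Euclidean algorithm;
  ;; assumes (gcd a m) = 1.
  (let ((old-r a) (r m) (old-s 1) (s 0))
    (loop until (zerop r)
          do (let ((q (floor old-r r)))
               (psetf old-r r
                      r (- old-r (* q r))
                      old-s s
                      s (- old-s (* q s)))))
    (mod old-s m)))

(defun montgomery-parameters (n)
  ;; For odd N, returns (values k r n-prime) with R = 2^k > N
  ;; and N * N-PRIME = -1 (mod R).
  (let* ((k (integer-length n))
         (r (ash 1 k))
         (n-prime (mod (- (modular-inverse n r)) r)))
    (values k r n-prime)))

(defun redc (u n k r n-prime)
  ;; Montgomery reduction: U/R mod N, using only shifts and masks
  ;; where a division by N would otherwise be needed.
  (let* ((m (logand (* (logand u (1- r)) n-prime) (1- r)))
         (v (ash (+ u (* m n)) (- k))))
    (if (>= v n) (- v n) v)))

(defun expt-mod-monty (a e n)
  ;; As (mod (expt a e) n) for odd N: square-and-multiply in
  ;; Montgomery representation; only the two conversions use MOD.
  (multiple-value-bind (k r n-prime) (montgomery-parameters n)
    (let ((result (mod r n))      ; 1 in Montgomery form
          (base (mod (* a r) n))) ; A in Montgomery form
      (loop for i from 0 below (integer-length e)
            when (logbitp i e)
              do (setf result (redc (* result base) n k r n-prime))
            do (setf base (redc (* base base) n k r n-prime)))
      (redc result n k r n-prime)))) ; convert back out

A quick check: (expt-mod-monty 3 5 7) => 5, agreeing with (mod (expt 3 5) 7).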

Erann Gat
Jun 26, 2002, 12:32:37 PM
In article <afbtqb$aju$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
<j...@dataheaven.de> wrote:

> I rewrote it using (loop while (> x 0) ...)
> because while is not part of ANSI CL.

Yes, but it should be ;-)

> The difference in LispWorks 4.2.6 was significant. (With mine being faster).

Indeed. In MCL the difference is a factor of 10. Zowie!

> Here is my reasoning why this is so:
> I think that the use of logbitp and integer-length enables the
> implementation to reduce consing here. The ASH will allocate a new bignum
> per cycle in your code while increasing the bitcounter i in mine will not.

Sounds plausible.

E.

Erann Gat
Jun 26, 2002, 12:42:09 PM
In article <864rftp...@gondolin.local.net>, Alain Picard
<apicard+die...@optushome.com.au> wrote:

> ;;;; Blowfish Constants.
> ;;
> ;; These values taken from constants.txt,
> ;; at counterpane.com, in the description of blowfish.
> ;; They're actually the digits of pi, in Hex.
> ;;

Here's a wizzy piece of code that will allow you to compute these values
instead of having them clutter up your source. (It's also less
error-prone.)

(defun compute-pi-hex (n &aux (p 0) r)
  (dotimes (i n)
    (incf p (- (/ 4 (+ 1 (* 8 i)))
               (/ 2 (+ 4 (* 8 i)))
               (/ 1 (+ 5 (* 8 i)))
               (/ 1 (+ 6 (* 8 i)))))
    (multiple-value-setq (r p) (truncate p 16))
    (format t "~X" r)
    (if (= (mod i 8) 1) (princ #\space))
    (setf p (* p 16))))

FWIW, here's a decimal pi-computer as well:

(defun compute-pi-decimal (n &aux (p 0) r)
  (dotimes (i n)
    (incf p (/ (- (/ 4 (+ 1 (* 8 i)))
                  (/ 2 (+ 4 (* 8 i)))
                  (/ 1 (+ 5 (* 8 i)))
                  (/ 1 (+ 6 (* 8 i))))
               (expt 16 i))))
  (dotimes (i n)
    (multiple-value-setq (r p) (truncate p 10))
    (format t "~D" r)
    (if (= (mod i 10) 1) (princ #\space))
    (setf p (* p 10))))

E.

Alain Picard
Jun 27, 2002, 4:05:47 AM
g...@jpl.nasa.gov (Erann Gat) writes:

> In article <864rftp...@gondolin.local.net>, Alain Picard
> <apicard+die...@optushome.com.au> wrote:
>
> > ;;;; Blowfish Constants.
> > ;;
> > ;; These values taken from constants.txt,
> > ;; at counterpane.com, in the description of blowfish.
> > ;; They're actually the digits of pi, in Hex.
> > ;;
>
> Here's a wizzy piece of code that will allow you to compute these values
> instead of having them clutter up your source. (It's also less
> error-prone.)
>

[SNIP]

Thanks, but no thanks. If Bruce Schneier made a mistake in the constants,
at least he and I (and all other implementors in the world) are making
the same mistake. :-)

Ob. conjecture: I'm guessing that PI is used to initialize the s-boxes
because there was a long-standing suspicion in the cryptographic community
that the NSA had diddled the s-boxes of DES (presumably to give themselves
a back door). Bruce probably thought that using PI only gives God a
back door.

Ob lisp: I've metered my code a bit more, and CMU is > 120X faster
than LispWorks. Those kernel:32bit-logical-xor functions sure do
make a difference. I've been reading the fast modular exponentiation
thread with interest; there doesn't seem to be an API in LW to do
this 32-bit diddling, and even if there were, I suspect all the return
values would still have to be coerced to bignums, sort of defeating the
purpose. I'm starting to think CMUCL is pretty cool.
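
For what it's worth, the portable spelling of those 32-bit operations is
just type declarations plus LDB, as in the sketch below (helper names
invented here); whether a given compiler open-codes them without boxing
the intermediate results is exactly the implementation-dependent part:

(deftype u32 () '(unsigned-byte 32))

(declaim (inline xor32 add32))

(defun xor32 (a b)
  ;; LOGXOR of two (unsigned-byte 32) values never exceeds 32 bits.
  (declare (type u32 a b) (optimize (speed 3) (safety 0)))
  (the u32 (logxor a b)))

(defun add32 (a b)
  ;; Addition mod 2^32, as Blowfish's round function requires.
  (declare (type u32 a b) (optimize (speed 3) (safety 0)))
  (the u32 (ldb (byte 32 0) (+ a b))))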


Erann Gat
Jun 27, 2002, 12:49:17 PM
In article <86n0thn...@gondolin.local.net>, Alain Picard
<apicard+die...@optushome.com.au> wrote:

> g...@jpl.nasa.gov (Erann Gat) writes:
>
> > In article <864rftp...@gondolin.local.net>, Alain Picard
> > <apicard+die...@optushome.com.au> wrote:
> >
> > > ;;;; Blowfish Constants.
> > > ;;
> > > ;; These values taken from constants.txt,
> > > ;; at counterpane.com, in the description of blowfish.
> > > ;; They're actually the digits of pi, in Hex.
> > > ;;
> >
> > Here's a wizzy piece of code that will allow you to compute these values
> > instead of having them clutter up your source. (It's also less
> > error-prone.)
> >
> [SNIP]
>
> Thanks, but no thanks. If Bruce Schneier made a mistake in the constants,
> at least he and I (and all other implementors in the world) are making
> the same mistake. :-)

Well, you can check the algorithm against Bruce's constants once to
convince yourself that he got them right, then use the algorithm to
protect yourself against future typos and bit-rot. But whatever.

E.

Espen Vestre
Jun 28, 2002, 4:01:05 AM
g...@jpl.nasa.gov (Erann Gat) writes:

> > Thanks, but no thanks. If Bruce Schneier made a mistake in the constants,
> > at least he and I (and all other implementors in the world) are making
> > the same mistake. :-)
>
> Well, you can check the algorithm against Bruce's constants once to
> convince yourself that he got them right, then use the algorithm to
> protect yourself against future typos and bit-rot. But whatever.

...but isn't the usage of pi in blowfish pretty random anyway (just
used as a useful non-repeating pattern)?
--
(espen)

Espen Vestre
Jun 28, 2002, 4:13:02 AM
Espen Vestre <espen@*do-not-spam-me*.vestre.net> writes:

> ...but isn't the usage of pi in blowfish pretty random anyway (just
> used as a useful non-repeating pattern)?

Of course it is (*), here's a quote from B. Schneier:

I chose the digits of pi as the initial subkey table for two
reasons: because it is a random sequence not related to the
algorithm, and because it could either be stored as part of the
algorithm or derived when needed. There is nothing sacred about pi;
any string of random bits--digits of e, RAND tables, output of a
random number generator--will suffice.

(http://www.counterpane.com/bfsverlag.html)

(*) I apologize for posting before I had checked my bookmarks.
--
(espen)

Shannon Spires
Jun 28, 2002, 10:40:27 PM
In article <afbtqb$aju$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
<j...@dataheaven.de> wrote:

> The other optimization would be to use Montgomery representation for the
> numbers. This would mean having a conversion step at the beginning and
> then using "Montgomery multiplication" instead of (mod (* ...) modulus).
> Once the result is calculated, one converts it back from Montgomery
> representation to normal representation. This allows one to leave out
> the MODs.

In C, it is certainly true that the Montgomery representation speeds up
modular exponentiation dramatically. In Common Lisp, it's more murky.
Montgomery representation lets you do modular multiplies with shifts
instead of divides for the modulus operation. But in Common Lisp, shifts
of bignums typically cons, and may not be much faster than divides.
It depends on your implementation.

Other optimizations are worth trying too. For example, the Karatsuba
formulation is a good way to optimize multiplication of large numbers in
general, regardless of modulo operations (Montgomery only applies to
modular multiplication). Karatsuba is elegantly recursive and
substitutes two additions for one of the four needed sub-multiplications
at each step. (Do a Google search on Karatsuba for more information).
Karatsuba can multiply in O[N^1.585] operations instead of O[N^2]
operations. CMUCL already uses Karatsuba internally for large bignum
multiplications, but other CLs may not.
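
As a concrete toy illustration of that three-multiplication split, over
ordinary CL integers rather than raw digit vectors (the cutoff is
arbitrary, and the ASH/LOGAND splitting conses, which is part of why a
Lisp-level version struggles to pay off):

(defparameter *karatsuba-cutoff* 512
  "Bit size below which the built-in * is used.")

(defun karatsuba* (a b)
  ;; Split each factor around r = 2^half, then recover the middle
  ;; term from three recursive products instead of four.
  (let ((n (max (integer-length a) (integer-length b))))
    (if (< n *karatsuba-cutoff*)
        (* a b)
        (let* ((half (ash n -1))
               (mask (1- (ash 1 half)))
               (a1 (logand a mask)) (a2 (ash a (- half)))
               (b1 (logand b mask)) (b2 (ash b (- half)))
               (low (karatsuba* a1 b1))
               (high (karatsuba* a2 b2))
               ;; A2B1 + A1B2, recovered from LOW, HIGH and one more product
               (mid (+ low high (karatsuba* (- a2 a1) (- b1 b2)))))
          (+ (ash high (* 2 half)) (ash mid half) low)))))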

Squaring is another avenue for optimization. It's possible to formulate
squaring to be faster than general multiplication, and since
exponentiation requires squaring, speeding up squaring speeds up
exponentiation.

See http://members.tripod.com/careybloodworth/karatsuba.htm
for a good description of how to go one step further and combine the
Karatsuba reformulation trick with squaring, which makes squaring faster
still.

While it's possible to code all these optimizations in Lisp--and it's
an enlightening exercise to do so--such an approach will probably not
help the speed of your code much unless you're using a very smart
compiler that knows how to avoid excess consing.

If you have LAP available, however, they could make a real difference.

--
Shannon Spires

Jochen Schmidt
Jun 29, 2002, 6:02:37 AM
Shannon Spires wrote:

> In article <afbtqb$aju$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
> <j...@dataheaven.de> wrote:
>
>> The other optimization would be to use Montgomery representation for the
>> numbers. This would mean having a conversion step at the beginning and
>> then using "Montgomery multiplication" instead of (mod (* ...) modulus).
>> Once the result is calculated, one converts it back from Montgomery
>> representation to normal representation. This allows one to leave out
>> the MODs.
>
> In C, it is certainly true that the Montgomery representation speeds up
> modular exponentiation dramatically. In Common Lisp, it's more murky.
> Montgomery representation lets you do modular multiplies with shifts
> instead of divides for the modulus operation. But in Common Lisp, shifts
> of bignums typically cons, and may not be much faster than divides.
> It depends on your implementation.

Yes, of course. I actually meant implementing Montgomery multiplication
using low-level means of particular implementations (for example, access
to the bignum vector).

> Other optimizations are worth trying too. For example, the Karatsuba
> formulation is a good way to optimize multiplication of large numbers in
> general, regardless of modulo operations (Montgomery only applies to
> modular multiplication). Karatsuba is elegantly recursive and
> substitutes two additions for one of the four needed sub-multiplications
> at each step. (Do a Google search on Karatsuba for more information).
> Karatsuba can multiply in O[N^1.585] operations instead of O[N^2]
> operations. CMUCL already uses Karatsuba internally for large bignum
> multiplications, but other CLs may not.
>
> Squaring is another avenue for optimization. It's possible to formulate
> squaring to be faster than general multiplication, and since
> exponentiation requires squaring, speeding up squaring speeds up
> exponentiation.

Hm... this may be interesting too - I remember reading something about
that in the Handbook of Applied Cryptography.

> See http://members.tripod.com/careybloodworth/karatsuba.htm
> for a good description of how to go one step further and combine the
> Karatsuba reformulation trick with squaring, which makes squaring faster
> still.
>
> While it's possible to code all these optimizations in Lisp--and it's
> an enlightening exercise to do so--such an approach will probably not
> help the speed of your code much unless you're using a very smart
> compiler that knows how to avoid excess consing.

Yes, as I said, one has to solve the consing problem first. A
WITH-BIGNUM-BUFFERS macro which enables the compiler to reuse the bignum
buffers could be a good idea (combined with dataflow analysis in the body
of the macro).

> If you have LAP available, however, they could make a real difference.

Whats LAP?

Shannon Spires
Jun 30, 2002, 4:50:16 PM
In article <afk0f1$3s$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
<j...@dataheaven.de> wrote:

> Shannon Spires wrote:
> > Squaring is another avenue for optimization. It's possible to formulate
> > squaring to be faster than general multiplication, and since
> > exponentiation requires squaring, speeding up squaring speeds up
> > exponentiation.
>
> Hm... this may be interesting too - I remember reading something about
> that in the Handbook of Applied Cryptography.

Here's why. Pardon me if parts of this are too elementary, but I want
to be clear for the benefit of readers who might not have studied
multiplication algorithms.

Notice that if you think of bignum multiplication as
polynomial multiplication (which is what scalar multiplication
really is at its core), you could express (* A B) as the two-digit
polynomials

A2r + A1
times
B2r + B1

where r is your base or digit size. For computation, it's usually most
convenient to make r a power of 2. (When one writes an ordinary decimal
number, r is just ten and it and the "+" signs are implicit.)

So multiplication is then of course

A2B2r^2 + A1B2r + B1A2r + A1B1 [1]

which is four sub-multiplies (A2B2, A1B2, B1A2, and A1B1) plus some
additions plus some shifts if we choose r to be a power of 2.

Thus our sub-multiplications (presumably the most expensive
sub-operations) are O[n^2], where n is the number of digits. One could
make r smaller and increase the number of digits; the multiplication is
still O[n^2]. But I chose two digits for reasons that will become clear.

Notice that each of these four multiplications could be reformulated
exactly the same way with a new r that is half the length of the old r.
Thus multiplication itself is inherently recursive. This buys you no
performance; it's just an interesting observation at this point,
especially when you contrast it with the manifestly non-recursive
algorithm we were all taught for multiplication in grade school.

Now let's examine squaring by the same mechanism.
(* A A) becomes

A2r + A1
times
A2r + A1

which is

A2A2r^2 + A1A2r + A1A2r + A1A1
or

A2^2r^2 + 2A1A2r + A1^2 [2]

aha! Only 3 multiplications now, and one fewer addition. Thus squaring
is already faster than multiplication.

But we're just getting started.

By now, your recursion detector should be lit up bright red, and you've
noticed that two of these multiplications are recursive calls to our
square routine itself. The third is a general-purpose multiply that we
cannot do anything about (just yet). So not only do we need 3/4 as many
sub-multiplies, but two-thirds of _those_ need only 3/4 as many
sub-sub-multiplies, etc.

Thus, without any Karatsuba tricks, we can formulate squaring
recursively to be faster than general multiplication, provided we can
do so without the recursion itself adding too much overhead.

The key further insights of the Karatsuba mechanism are that

a) You can reformulate [1] above to have one fewer multiplication than
it has now, and if you implement it recursively, each sub-multiply gets
the same advantage. Now, recursively expressing general multiplication
_does_ buy you something. To wit:

A2B2(r^2+r) + (A2-A1)(B1-B2)r + A1B1(r+1) [3]

b) You can reformulate [2] above to make all _three_ sub-multiplies into
recursive square calls, which makes things even faster than either [2]
or [3] if you're squaring:

A2^2(r^2+r) - ((A2-A1)^2)r + A1^2(r+1) [4]

Karatsuba imparts overhead of its own in the form of additional
additions, shifts, and recursive calls. Reminiscent of Quicksort, it
typically pays off only on fairly large numbers and one switches
to ordinary multiplications in the leaf nodes of the recursion tree.

The potential speed boost of squaring is why I wish Common Lisp
had included #'square in the spec. Although it probably only makes
much difference to cryptographers. Perhaps even better, implementers
might consider including a compiler macro to detect it in the case of
(* foo foo).
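
Formula [4] drops straight into code. A toy version over CL integers
(cutoff arbitrary; as before, a serious version would avoid consing the
intermediate shifts and differences):

(defun ksquare (a)
  ;; A2^2(r^2 + r) - ((A2 - A1)^2)r + A1^2(r + 1), with r = 2^half;
  ;; note that all three sub-multiplies are recursive squarings.
  (let ((n (integer-length a)))
    (if (< n 512)
        (* a a)
        (let* ((half (ash n -1))
               (a1 (logand a (1- (ash 1 half))))
               (a2 (ash a (- half)))
               (s1 (ksquare a1))
               (s2 (ksquare a2))
               (sm (ksquare (- a2 a1))))
          ;; = s2*r^2 + (s2 + s1 - sm)*r + s1
          (+ (ash s2 (* 2 half))
             (ash (- (+ s2 s1) sm) half)
             s1)))))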

> > If you have LAP available, however, they could make a real difference.
>
> Whats LAP?

Lisp Assembly Programming. Some CL implementations (e.g. MCL and OpenMCL
and some others) allow you to write assembler using Lisp notation.

--
Shannon Spires

Raymond Toy
Jul 1, 2002, 10:00:17 AM
>>>>> "Jochen" == Jochen Schmidt <j...@dataheaven.de> writes:

>> Other optimizations are worth trying too. For example, the Karatsuba
>> formulation is a good way to optimize multiplication of large numbers in
>> general, regardless of modulo operations (Montgomery only applies to
>> modular multiplication). Karatsuba is elegantly recursive and
>> substitutes two additions for one of the four needed sub-multiplications
>> at each step. (Do a Google search on Karatsuba for more information).
>> Karatsuba can multiply in O[N^1.585] operations instead of O[N^2]
>> operations. CMUCL already uses Karatsuba internally for large bignum
>> multiplications, but other CLs may not.

No, CMUCL doesn't use Karatsuba. I wrote one, but it was never
incorporated into the sources, mostly because it only speeds up
multiplication if the numbers are 512 or 1024 bits or more. That was
as fast as I could get the Lisp implementation to go with
CMUCL-specific stuff.

Ray

Jochen Schmidt
Jul 1, 2002, 2:02:15 PM
Raymond Toy wrote:

While your post was certainly informative, I only wanted to note that it
was not me who wrote the paragraph you quoted - it was Shannon.

Raymond Toy
Jul 1, 2002, 2:40:29 PM
>>>>> "Jochen" == Jochen Schmidt <j...@dataheaven.de> writes:

Jochen> Raymond Toy wrote:
>>>>>>> "Jochen" == Jochen Schmidt <j...@dataheaven.de> writes:
>>
>> >> Other optimizations are worth trying too. For example, the
>> >> Karatsuba formulation is a good way to optimize multiplication of
>> >> large numbers in general, regardless of modulo operations
>> >> (Montgomery only applies to modular multiplication). Karatsuba is
>> >> elegantly recursive and substitutes two additions for one of the
>> >> four needed sub-multiplications at each step. (Do a Google search
>> >> on Karatsuba for more information). Karatsuba can multiply in
>> >> O[N^1.585] operations instead of O[N^2] operations. CMUCL already
>> >> uses Karatsuba internally for large bignum multiplications, but
>> >> other CLs may not.
>>
>> No, CMUCL doesn't use Karatsuba. I wrote one, but it was never
>> incorporated into the sources, mostly because it only speeds up
>> multiplication if the numbers are 512 or 1024 bits or more. That was
>> as fast as I could get the Lisp implementation to go with
>> CMUCL-specific stuff.

Jochen> While your post was certainly informative, I only wanted to note
Jochen> that it was not me who wrote the paragraph you quoted - it was Shannon.

Sorry, I obviously snipped the wrong part. Mea culpa.

Ray
