add Poly1305?

zooko

unread,

Feb 1, 2008, 9:38:28 AM2/1/08

to Crypto++ Users

Dear maintainers of Crypto++:

Would there be any interest in adding Poly1305 MAC? I was just
looking at DJB's timings.html [1], which shows that while VMAC is
faster than Poly1305 on amd64 (e.g. [2] vs [3]), VMAC is slower than
Poly1305 on the same machine in 32-bit mode ([4] vs [5]),
VMAC is slower than Poly1305 on an UltraSPARC III ([6] vs [7]) and
VMAC is dramatically slower than Poly1305 on an old PowerPC G4 ([8]
vs [9]).

This makes me interested in having Poly1305 available within the
Crypto++ class framework.

Regards,

Zooko

Wei Dai

unread,

Feb 1, 2008, 8:21:22 PM2/1/08

to Crypto++ Users, zooko

DJB used an old implementation of VMAC in those timings, which wasn't
optimized for 32-bit platforms at all. The fastest VMAC implementation
currently available is in Crypto++ 5.5 and later. If you compare
http://www.cryptopp.com/benchmarks-p4.html and
http://groups.google.com/group/sci.crypt/msg/622a03847bc41f9b, it's faster
than Poly1305 on Pentium 4. I don't have a direct comparison on other 32-bit
platforms, but I think VMAC should be fairly competitive.

--------------------------------------------------
From: "zooko" <zo...@zooko.com>
Sent: Friday, February 01, 2008 6:38 AM
To: "Crypto++ Users" <cryptop...@googlegroups.com>
Subject: add Poly1305?

zooko

unread,

Feb 5, 2008, 6:09:56 PM2/5/08

to Crypto++ Users

On Feb 1, 2008, at 6:21 PM, Wei Dai wrote:

>
> DJB used an old implementation of VMAC in those timings, which wasn't
> optimized for 32-bit platforms at all. The fastest VMAC implementation
> currently available is in Crypto++ 5.5 and later. If you compare
> http://www.cryptopp.com/benchmarks-p4.html and
> http://groups.google.com/group/sci.crypt/msg/622a03847bc41f9b, it's
> faster
> than Poly1305 on Pentium 4. I don't have a direct comparison on
> other 32-bit
> platforms, but I think VMAC should be fairly competitive.

Here are some timings on my Pentium-M 1.86 GHz, called "zaula":

https://zooko.com/speedreport-vmac-zaula.jpg

https://zooko.com/speedreport-poly1305aes-zaula.jpg

Here are the raw timings tables from which these graphs were generated:

https://zooko.com/speedreport-vmac-zaula.txt.bz2

https://zooko.com/speedreport-poly1305aes-zaula.txt.bz2

The X axis is size of message to be MACed, and the Y axis is CPU
cycles. For more information, the legend for the graphs and the
explanation of the timings tables is on DJB's page:

http://cr.yp.to/mac/speed.html

Here is Crypto++'s own benchmark results on the same machine:

I guess that the "cycles per byte" number reported by Crypto++ is the
slope of the lines visible in the graphs. While this number -- the
slope of the line -- is of course useful for engineers designing and
implementing algorithms, it isn't the number that best predicts the
performance for a user. The performance for a user is best predicted
by some point or points on these graphs, depending on your specific
use case. (Note that the X axis on these graphs goes up to only 8192
bytes, so if your use case is to MAC larger messages than that then
the point or points relevant to you lie somewhere off of the right-
hand edge.)

One thing I don't understand is the faint gray marks substantially
above the colorful lines on the Poly1305 graphs. Looking at the
Poly1305 timings tables, those could only be the "first count"
measurements, when the code is not yet in the icache...

http://cr.yp.to/mac/speed.html#firstcount

Oh dear -- looking at the timings tables tells me what is going on
here -- the equivalent grey marks for VMAC are so high that they are
off the top of the graph -- (this implementation of) Poly1305 takes
about 15,000 CPU cycles to MAC the first 64-byte message, (this
implementation of) VMAC
takes about 60,000 CPU cycles. I'm not sure, but this may be due to
the fact that I didn't strip debug symbols from either my "vmac-
speed" executable or my "poly1305aes-speed" executable, and the "vmac-
speed" executable had a lot more debug symbols. Rerunning this
experiment without debug symbols will have to wait until I (or
someone else, hint hint) gets another round tuit.

I used DJB's own poly1305aes package, version 20050218, to implement
Poly1305, and Crypto++ v5.5.2 release to implement VMAC. The timing
code and graph generation was from the poly1305aes package.

Here is the source code of my hack of DJB's poly1305aes package to
time Crypto++'s VMAC:

https://zooko.com/poly1305aes-vmac-hack.tar.bz2

Regards,

Zooko

zooko

unread,

Feb 5, 2008, 7:44:31 PM2/5/08

to Crypto++ Users

following up to my own post, as I accidentally omitted a URL:

On Feb 5, 2008, at 4:09 PM, zooko wrote:
>
> Here is Crypto++'s own benchmark results on the same machine:

https://zooko.com/cryptopp-bench-zaula.html

The VMAC measurement reported is 3.8 cycles per byte.

This isn't really roughly consistent with the colorful graphs --
according to DJB's speed.html page, the line from lower-left to upper-
right corner of a graph is about 12 cycles per byte. According to
the VMAC timings table, it took, for example, 37915 cycles to MAC
8192 bytes of message (4.6 cycles/byte) or 22867 cycles to MAC 4096
bytes of message (5.6 cycles/byte).

Regards,

Zooko

zooko

unread,

Feb 5, 2008, 8:18:16 PM2/5/08

to zooko, Crypto++ Users

following up to my own post, as I accidentally missed something obvious:

On Feb 5, 2008, at 5:44 PM, zooko wrote:

> https://zooko.com/cryptopp-bench-zaula.html
>
> The VMAC measurement reported is 3.8 cycles per byte.
>
> This isn't really roughly consistent with the colorful graphs --
> according to DJB's speed.html page, the line from lower-left to upper-
> right corner of a graph is about 12 cycles per byte. According to
> the VMAC timings table, it took, for example, 37915 cycles to MAC
> 8192 bytes of message (4.6 cycles/byte) or 22867 cycles to MAC 4096
> bytes of message (5.6 cycles/byte).

Oh, right -- the line that passes through those two data points has a
slope of 3.7 cycles per byte.

Regards,

Zooko

zooko

unread,

Feb 19, 2008, 5:04:47 PM2/19/08

to Crypto++ Users

Dear Wei and Crypto++ folks:

For people who didn't examine the graphs closely, I should summarize
that they show that DJB's implementation of Poly1305-AES is
substantially faster than Crypto++'s implementation of VMAC-AES-128
at all measured messages sizes and situations.

Considering this, and considering Crypto++'s long history of offering
multiple alternate algorithms side-by-side, would you be interested
in including Poly1305?

Thanks,

Regards,

Zooko

Wei Dai

unread,

Feb 22, 2008, 1:48:57 AM2/22/08

to zooko, Crypto++ Users

I'd be interested if there is a way to add Poly1305 in a way that is
portable and maintainable for me without a huge amount of work, but it's not
clear how to do that. If you take a look at
http://cr.yp.to/mac/poly1305_athlon.s, it can't be compiled with MSVC, and I
can't even begin to understand it without learning a new language, namely
DJB's qhasm. (The last time I tried putting code into Crypto++ that I didn't
understand, i.e. the deflate compression code, it went pretty badly in the
long run and I ended up having to rewrite it.)

From what I can tell from your graphs, on Pentium M Crypto++'s VMAC-AES-128
has a higher per-message overhead than Poly1305-AES, but their
per-additional-byte cost is pretty close. I'm curious how many messages you
are planning to MAC per second, and why? If the per-message overhead is
really a serious problem for you, I can look into if anything can be done to
reduce it.

--------------------------------------------------
From: "zooko" <zo...@zooko.com>
Sent: Tuesday, February 19, 2008 2:04 PM
To: "Crypto++ Users" <cryptop...@googlegroups.com>
Subject: Re: add Poly1305?

Reply all

Reply to author

Forward