Initial SUPERCOP Haswell results for round 2

D. J. Bernstein

Aug 15, 2019, 8:21:04 PM
to pqc-...@list.nist.gov
New tables appear here:

https://bench.cr.yp.to/results-kem.html#amd64-hiphop
https://bench.cr.yp.to/results-sign.html#amd64-hiphop

There are also several new graphs (sorry for the poor label placement in
the last graph---I'm still experimenting with layout algorithms):

https://bench.cr.yp.to/graph/amd64-hiphop-kem-pkcycles,pkbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-kem-ccycles,cbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-kem-kcycles,cbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-kem-pkbytes,cbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-sign-pkcycles,pkbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-sign-smcycles,sbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-sign-mcycles,pkbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-sign-mcycles,sbytes-nistpqc.pdf
https://bench.cr.yp.to/graph/amd64-hiphop-sign-pkbytes,sbytes-nistpqc.pdf

Implementors should follow the links for their implementations from

https://bench.cr.yp.to/web-impl/amd64-hiphop-crypto_kem.html
https://bench.cr.yp.to/web-impl/amd64-hiphop-crypto_sign.html

to look for any compiler warnings, compilation failures (e.g., current
gcc breaks crypto_kem/ntruhrss701/avx2; I'll explain this in more detail
below), test failures, and unexpected performance regressions.

All of this is from supercop-20190811 on "hiphop", an Intel Xeon E3-1220
v3 (Haswell) running the latest long-term-support version of Ubuntu
(Ubuntu 18.04; gcc 7.4.0, clang 6.0.0). I limited compiler options to

gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE
clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE

to make sure a run would finish promptly. Of course, when both options
work, the faster one is used for benchmarking. Other options could
produce even better results. For people who want to check the results on
their own Haswell machines, a complete run with these two compilers
takes 27 hours on 4 cores at 3.1GHz.

The caveats from my previous messages are applicable, except that this
SUPERCOP version includes picnic2*. This brings the round-2 list up to
14 candidates (including 10 that submitted their round-2 code to
SUPERCOP and 4 further patented candidates), so the round-1 list is down
to bike*, frodo*, gemss*, lac*, luov*, mqdss*, newhope*, nts*, qtesla*,
rainbow{1a,1b,1c,3b,3c,4a,5c,6a,6b} (the other "rainbow" primitives are
Rainbow variants that predate the competition), *saber, and sike*.

The rest of this message explains the -fPIC -fPIE above. On exactly the
same long-term-support version of Ubuntu, with the supposedly compatible
system-provided versions of gcc and clang, if you run

gcc -c x.c; clang -c y.c; gcc -o x x.o y.o

where x.c says

#include <stdio.h>
extern int thenumber(void);
int main() { printf("%d\n",thenumber()); return 0; }

and y.c says

static int myconstant = 5;
int thenumber(void) { return myconstant; }

then compilation fails, while doing

gcc -c x.c; clang -c y.c; clang -o x x.o y.o

or

gcc -c x.c; gcc -c y.c; gcc -o x x.o y.o

or

clang -c x.c; clang -c y.c; clang -o x x.o y.o

works fine. The underlying problem is compiler-writer mismanagement of
an ongoing transition to

* -fPIC: compiling libraries as "position-independent code" (this is
typically advertised as being important for shared libraries);

* -fPIE: compiling main() etc. as position-independent code (this is
typically advertised as being important for the claimed security
benefits of address-space layout randomization); and

* -pie: linking position-independent executables.

Code that's compiled as position-independent code can be linked into
position-dependent executables or position-independent executables. A
correctly managed transition would have consisted of

* turning on -fPIC and -fPIE by default,

* issuing automatic _warnings_ for any position-dependent code,

* waiting a specified number of years for people to get rid of any
previously compiled position-dependent code, and finally

* turning on -pie by default.

What happened instead was gcc suddenly turning on -pie, clumsily
breaking all existing position-dependent code and even managing to break
compatibility with clang on the same system---clang still produces
position-dependent code, and then gcc fails to produce an executable.
This is also why gcc now breaks crypto_kem/ntruhrss701/avx2.

For the moment gcc has a -no-pie option. I could use this as part of a
broader plan to continue benchmarking position-dependent code. However,
I don't like the idea of promoting an ongoing fight between ASLR and
cryptographic code (even if I'm skeptical about the claimed security
benefits of ASLR), so I'm leaving gcc with its -pie default. Meanwhile
I've turned on -fPIC and -fPIE for clang so that the -pie default for
gcc doesn't cause the same failures as in my x.c/y.c example above.
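
To check which mode a given compiler invocation actually used for an object file, one option is to look at the predefined macros __PIC__ and __PIE__, which gcc and clang set when -fPIC or -fPIE is in effect. The following is a minimal sketch, not part of SUPERCOP; the function names code_model and links_into_pie are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Reports how this translation unit was compiled.
   __PIE__ and __PIC__ are predefined by gcc and clang. -fPIE typically
   defines both macros, so __PIE__ is tested first. */
const char *code_model(void)
{
#if defined(__PIE__)
  return "PIE";
#elif defined(__PIC__)
  return "PIC";
#else
  return "position-dependent";
#endif
}

/* Returns 1 if this object was compiled as position-independent code
   (and so should be acceptable to a linker that defaults to -pie),
   0 otherwise. Hypothetical helper, for illustration only. */
int links_into_pie(void)
{
  const char *m = code_model();
  assert(m != NULL);
  return strcmp(m, "position-dependent") != 0;
}
```

On a system like the one described above, compiling this file with plain "clang -c" would be expected to report "position-dependent", while adding -fPIC -fPIE should report "PIE". This only covers compiled C code; hand-written assembly has to be checked separately.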

Position-dependent code such as crypto_kem/ntruhrss701/avx2 is still
benchmarked with clang for the moment, but this code should be replaced
by position-independent code as soon as possible (I already did a bunch
of PIC fixes as part of importing various round-1 implementations into
libpqcrypto), and asm programmers should stop writing new
position-dependent code for any platform that supports
position-independent code.

---Dan

D. J. Bernstein

Aug 19, 2019, 4:43:07 PM
to pqc-...@list.nist.gov
The tables online from "hiphop" are now updated in two ways. First,
they're now from supercop-20190816, which in particular has much faster
round-2 code from the SABER team (so there are now 11 candidates left
without round-2 code in SUPERCOP). Second, the compiler options have
expanded to

gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE
clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -Os -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O2 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE

adding -O2, -O, -Os beyond the previous -O3. Sometimes -O3 is
suboptimal, as illustrated by

https://bench.cr.yp.to/web-impl/amd64-hiphop-crypto_kem-ledakemlt10.html

saying that clang -O is 7% faster for crypto_kem/ledakemlt10 than clang
-O3 (or gcc -O3). As always, for each primitive, SUPERCOP chooses the
fastest compiled code that it finds for that primitive across all of the
combinations of implementations and compiler options.
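
The selection rule in the last sentence can be sketched as a minimum over all working (implementation, compiler option) pairs. This is only an illustration under the assumption that each build has already been summarized by a single cycle count; it is not SUPERCOP's actual code, and the struct, the function name pick_fastest, and the numbers in example_builds are hypothetical placeholders rather than measurements.

```c
#include <assert.h>
#include <stddef.h>

/* One measured build: an implementation compiled with one option string. */
struct build {
  const char *implementation;
  const char *compiler;
  long long cycles;  /* summary cycle count; <= 0 marks a failed build */
};

/* Returns the index of the fastest working build, or -1 if none works. */
int pick_fastest(const struct build *b, size_t n)
{
  int best = -1;
  size_t i;
  assert(b != NULL || n == 0);
  for (i = 0; i < n; ++i) {
    if (b[i].cycles <= 0) continue;  /* skip compile or test failures */
    if (best < 0 || b[i].cycles < b[best].cycles) best = (int) i;
  }
  return best;
}

/* Made-up placeholder values, NOT benchmark results. */
static const struct build example_builds[] = {
  { "ref",  "gcc -O3",   1000 },
  { "ref",  "clang -O",   930 },
  { "avx2", "gcc -O3",      0 },  /* pretend this build failed */
  { "avx2", "clang -O3",  410 },
};
```

With the placeholder array, pick_fastest(example_builds, 4) selects the last entry. A failed build is simply skipped, which matches the situation where one compiler breaks an implementation and the other compiler's result is used instead.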

There are also Skylake results online from "samba". Beware, however,
that Skylake optimization is often noticeably different from Haswell
optimization, and Haswell has been a larger focus in the literature than
Skylake, so assuming that teams have obtained the best possible Skylake
results is even less safe than assuming that they've obtained the best
possible Haswell results.

---Dan

D. J. Bernstein

Sep 15, 2019, 3:35:56 PM
to pqc-...@list.nist.gov
New tables are online for supercop-20190910 on Haswell (hiphop),
including substantial speedups for rollo and rqc from new code
contributed to SUPERCOP by the submission teams. As before, this is
taking the best of 4 gcc options and 4 clang options.

supercop-20190910 results are also online now for various other Intel
microarchitectures, including Ivy Bridge (hydra8), Broadwell (bolero),
Skylake (samba), Skylake with AVX-512 (pmnod003), and Cascade Lake
(pmnod076). Most candidates don't have SSE2 implementations for the
older microarchitectures or AVX-512 implementations for the newer
microarchitectures, so most of these cycle counts can clearly be
improved, and the achievable speedups depend on the candidate; but
implementors may find the cycle counts useful in deciding what to
optimize next.

Something else that implementors may find useful is new tables of code
size on various implementation-notes web pages. (As above, this has
usually not been optimized, and it shouldn't be assumed to be anywhere
near optimal.) For example, the first line of

https://bench.cr.yp.to/web-impl/amd64-hiphop-crypto_scalarmult-curve25519.html

shows "object size" of "20816 0 0", meaning that the compiled *.o files
have 20816 bytes of code (including constants), 0 bytes of read-write
data (this should be 0 for typical crypto libraries), and 0 bytes of
read-write zero-filled data (this again should be 0). The line also
shows "test size" of "40902 776 1608"; this is the size of the compiled
"try" program, including measurement mechanisms and any used lower-level
SUPERCOP libraries but not including external shared libraries such as
OpenSSL. One way to find this implementation web page is as follows:

main page -> https://bench.cr.yp.to
"List of subroutines: scalarmult" -> https://bench.cr.yp.to/primitives-scalarmult.html
"curve25519" -> https://bench.cr.yp.to/impl-scalarmult/curve25519.html
"hiphop" -> https://bench.cr.yp.to/web-impl/amd64-hiphop-crypto_scalarmult-curve25519.html

Similar comments apply to, e.g., your favorite KEM or signature system.
There are also various cross-links from the "Measurements indexed by
machine" pages.
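
For anyone post-processing these pages, the "object size" and "test size" triples described above can be read as three byte counts: code (including constants), read-write data, and read-write zero-filled data. The following sketch parses such a string; the struct and function names are hypothetical, and it assumes the triple has already been extracted from the web page as plain text.

```c
#include <assert.h>
#include <stdio.h>

/* The three numbers of a size triple, in the order described above. */
struct size_triple {
  unsigned long code;  /* code, including constants */
  unsigned long data;  /* read-write data */
  unsigned long zero;  /* read-write zero-filled data */
};

/* Parses a string such as "20816 0 0".
   Returns 1 on success, 0 if there are not exactly three numbers. */
int parse_size_triple(const char *s, struct size_triple *out)
{
  unsigned long c, d, z;
  char extra;
  assert(s != NULL && out != NULL);
  /* The trailing %c catches leftover non-space text after the third number. */
  if (sscanf(s, "%lu %lu %lu %c", &c, &d, &z, &extra) != 3) return 0;
  out->code = c;
  out->data = d;
  out->zero = z;
  return 1;
}

/* Sum of the three fields. */
unsigned long total_bytes(const struct size_triple *t)
{
  return t->code + t->data + t->zero;
}
```

A simple check built on this would flag any library whose data or zero field is nonzero, since the message above says both should be 0 for typical crypto libraries.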

---Dan

D. J. Bernstein

Oct 19, 2019, 4:47:56 PM
to pqc-...@list.nist.gov
supercop-20191017 benchmarks are online for Haswell (hiphop), now
including round-2 Frodo, round-2 GeMSS, corrected Falcon code (slower),
and new Picnic code. This brings the overall picture to the following:

* 17 candidates have round-2 functions in SUPERCOP (and in some cases
round-1 functions under separate names). 15 of these candidates
submitted their code to SUPERCOP: Classic McEliece, Dilithium,
Falcon, Frodo, GeMSS, Kyber, LEDA, NTRU, NTRU Prime, Picnic, ROLLO,
RQC, SABER, SPHINCS+, ThreeBears. See my email dated 7 Aug 2019
13:24:12 -0000 for how I handled HQC and Round5.

* The other 9 candidates have only round-1 functions in SUPERCOP:
BIKE, LAC, LUOV, MQDSS, NewHope, NTS-KEM, qTESLA, Rainbow, SIKE.

Previously stated caveats remain applicable.

---Dan