I wrote, back in February:
> We're now a similar amount of time after the on-ramp submissions were
> posted (July 2023). Does that mean that this is a good moment to collect
> software for comprehensive SUPERCOP benchmarking of all submissions? Or
> is it still too early?
As an update, SUPERCOP version 20250307 includes code submitted to
SUPERCOP by the CROSS team. Results from that version, including
benchmarks, correctness tests, and TIMECOP's constant-time tests, are
already online from 20 machines for various gcc and clang options.
However, there's still the question of whether there should be a call
for _all_ of the teams to submit code for benchmarking. The situation
back in July 2019 was that NIST clearly asked for a speed competition:
"we do very much want to see Supercop (and any other 3rd party testing
platform) release performance numbers for all of the Round 2
candidates". There has been no such statement regarding the on-ramp.
Last month, I wrote that there are many speedups possible in on-ramp
code. Benchmarks collected at this point will understate the speeds that
various submissions can reach, and there's no reason to think that the
understatements will be of similar size across submissions, so
comparisons of today's numbers can easily mislead. Being pressured to
join a speed competition can also distract teams from focusing on
security.
On the other hand, it seems possible that speed was already used to
exclude some submissions. I've tried to figure this out from NIST IR
8528, but the report (1) doesn't have specific comments on the excluded
submissions and (2) doesn't explain how specific factors were combined
into decisions. There is in any case ample reason for on-ramp teams to
be concerned about how exactly speed will be used in NIST's next
selection round. So I'm hesitant to suggest holding off on demonstrating
what speeds have already been achieved!
As always, teams that would like to submit to SUPERCOP are welcome to do
so, and to update the code for any future improvements. On each machine,
for each primitive, SUPERCOP automatically tries each implementation for
each compiler in its list, and then reports benchmarks specifically for
the best combination. Furthermore, results from updated code in SUPERCOP
systematically replace any results from older code. (They're directly
replaced via re-benchmarking on the same machines if possible; for
older results that haven't been re-benchmarked, there's a schedule of
gradually adding warnings and then removing them from the web pages, in
all cases
with versions clearly labeled.) People running Linux on the same CPUs
can download SUPERCOP and re-run it for verifiability.
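For teams that haven't packaged code for SUPERCOP before, here's a
minimal sketch of the shape of a crypto_sign implementation as SUPERCOP
expects it. The byte counts below are placeholders for illustration,
not any real submission's parameters; the SUPERCOP documentation is the
authoritative reference for the interface.

   /* api.h: sizes for this implementation (placeholder numbers) */
   #define CRYPTO_SECRETKEYBYTES 64
   #define CRYPTO_PUBLICKEYBYTES 32
   #define CRYPTO_BYTES 64   /* maximum signature overhead in bytes */

   /* sign.c: the three entry points exercised by the framework */
   int crypto_sign_keypair(unsigned char *pk, unsigned char *sk);
   int crypto_sign(unsigned char *sm, unsigned long long *smlen,
                   const unsigned char *m, unsigned long long mlen,
                   const unsigned char *sk);
   int crypto_sign_open(unsigned char *m, unsigned long long *mlen,
                        const unsigned char *sm, unsigned long long smlen,
                        const unsigned char *pk);

SUPERCOP compiles code of this shape with each compiler in its list,
runs its correctness tests, and records the results, as described
above.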
Most benchmarking projects don't have the same protections. Even in the
cases where the numbers from those projects are reasonably accurate in
the first place, the numbers end up falling out of date. This poses a
problem for teams that expect future speedups and that haven't yet
volunteered to have their code integrated into benchmarking projects.
Clear assurance from NIST about limits on the role of benchmarks could
help alleviate this problem. Alternatively, teams can take some control
by sending code to SUPERCOP, so that benchmark results are available
from a site that's systematically updated to reflect code improvements.
On a related note, some serious fixes seem necessary to the procedures
that NIST uses for handling benchmark numbers. Pages 5-6 of a recent
NIST report
https://nvlpubs.nist.gov/nistpubs/ir/2025/NIST.IR.8545.pdf
claim to display benchmarks of the round-4 submissions "on x86_64 [1]".
It seems that the tables were assembled manually with copy-and-paste
errors (one of which was noted on this list and led to the report being
updated, apparently without being given a new report number), but
there's also a much more fundamental problem here, as I'll now explain.
Let's focus specifically on mceliece348864f key generation, for which the
report claims a benchmark of "114 189" kcycles. For comparison, the
round-4 Classic McEliece submission reported a median of 35976620
Haswell cycles, i.e., 35977 kcycles, for the same operation, and
presented ample information for reproducing this speed:
These software measurements were collected using supercop-20220506
running on a computer named hiphop. The CPU on hiphop is an Intel
Xeon E3-1220 v3 running at 3.10GHz. This CPU does not support
hyperthreading. It does support Turbo Boost but
/sys/devices/system/cpu/intel_pstate/no_turbo was set to 1,
disabling Turbo Boost. hiphop has 32GB of RAM and runs Ubuntu 18.04.
Benchmarks used ./do-part, which ran on one core of the CPU. The
compiler list was reduced to just gcc -march=native -mtune=native
-O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE.
Source:
https://classic.mceliece.org/mceliece-impl-20221023.pdf
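As background for readers who haven't worked with such measurements: a
figure such as "median of 35976620 Haswell cycles" comes from timing
many runs of the operation and taking the median cycle count. Here's a
simplified sketch of that idea. This is not SUPERCOP's actual
measurement harness; keygen() below is a dummy placeholder for the
operation being measured, and, as in the setup quoted above, Turbo
Boost should be disabled so that cycle counts are comparable.

   #include <stdio.h>
   #include <stdlib.h>
   #include <x86intrin.h>   /* __rdtsc() with gcc/clang on x86_64 */

   static volatile unsigned long long sink;

   /* dummy placeholder workload; a real measurement would call the
      submission's key-generation function here */
   static void keygen(void)
   {
     unsigned long long x = 0;
     for (int i = 0; i < 1000000; ++i) x += (unsigned long long)i * i;
     sink = x;
   }

   static int cmp(const void *a, const void *b)
   {
     unsigned long long x = *(const unsigned long long *)a;
     unsigned long long y = *(const unsigned long long *)b;
     return (x > y) - (x < y);
   }

   int main(void)
   {
     enum { RUNS = 63 };
     unsigned long long t[RUNS];
     for (int i = 0; i < RUNS; ++i) {
       unsigned long long before = __rdtsc();
       keygen();
       t[i] = __rdtsc() - before;
     }
     qsort(t, RUNS, sizeof t[0], cmp);
     printf("median: %llu cycles\n", t[RUNS / 2]);
     return 0;
   }

SUPERCOP's own cycle counters and reporting handle many more details
per platform; the sketch is only meant to illustrate the
median-of-many-runs idea behind numbers like the ones quoted here.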
So why does NIST's report claim 114189 kcycles? Is the unspecified
"x86_64" something much slower than a Haswell, deviating from
https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/BjLtcwXALbA/m/Bjj_77pzCAAJ
in which NIST specifically asked teams to provide "an AVX2 (Haswell)
optimized implementation"?
(Haswell, the most common optimization target in the relevant literature
at that point, was a reasonable choice of target. Earlier statements
from NIST hadn't specified the target beyond saying "the Intel x64
processor". I've seen nothing from NIST requesting public feedback on a
proposal to switch to a different Intel target.)
Reference "[1]" in the report is "Open quantum safe (OQS) algorithm
performance visualizations. Available at
https://openquantumsafe.org/benchmarking." Looking around that web site
shows
that these are numbers for "Intel(R) Xeon(R) Platinum 8259CL CPU", which
is Cascade Lake (basically Skylake), a _newer_ Intel microarchitecture
having generally _smaller_ cycle counts than Haswell.
Concretely, SUPERCOP shows 31314 kcycles for mceliece348864f keygen on
Skylake. So, yes, NIST did deviate from its announced comparison
platform, but this makes the 114189 kcycles even harder to explain.
More importantly, https://openquantumsafe.org/benchmarking has a
disclaimer right at the top:
These pages visualize measurements taken by the now-defunct OQS
profiling project. This project is not currently maintained, and
these measurements are not up to date.
The words "now-defunct" are even in boldface. The archived page
https://web.archive.org/web/20240425050637/https://openquantumsafe.org/benchmarking/
shows that this warning was on the page in April 2024.
Tung Chou sped up the Classic McEliece software in many ways after the
original submission in 2017. In particular, he already published much
faster keygen code in 2019 (as also reported in the round-3 submission
https://classic.mceliece.org/nist/mceliece-20201010.pdf#page.33
in 2020). It's astonishing to see NIST issuing a report in 2025 with
benchmarks of code that's six years out of date, and presenting those as
benchmarks of the round-4 submission, especially when the source that
NIST cites is a page that says at the top that it's presenting obsolete
measurements from a defunct benchmarking project.
If NIST had been carrying out its selection discussions on a public
mailing list, then this 4x error could have been corrected as soon as it
appeared. But FOIA results show that NIST's selection discussions are
mostly carried out in secret. There's no evident way that on-ramp teams
will be able to correct similar benchmarking errors.
---D. J. Bernstein