I wrote, back in February:
> We're now a similar amount of time after the on-ramp submissions were
> posted (July 2023). Does that mean that this is a good moment to collect
> software for comprehensive SUPERCOP benchmarking of all submissions? Or
> is it still too early?
As an update, SUPERCOP version 20250307 includes code submitted to
SUPERCOP by the CROSS team. Results from that version, including
benchmarks, correctness tests, and TIMECOP's constant-time tests, are
already online from 20 machines for various gcc and clang options.
However, there's still the question of whether there should be a call
for _all_ of the teams to submit code for benchmarking. The situation
back in July 2019 was that NIST clearly asked for a speed competition:
"we do very much want to see Supercop (and any other 3rd party testing
platform) release performance numbers for all of the Round 2
candidates". There has been no such statement regarding the on-ramp.
Last month, I wrote that there are many speedups possible in on-ramp
code. Benchmarks collected at this point will understate the speeds that
various submissions can reach, and there's no reason to think that the
understatements will be of similar size across submissions, so
comparisons of today's numbers can easily mislead. Being pressured to
join a speed competition can also distract teams from focusing on
security.
On the other hand, it seems possible that speed was already used to
exclude some submissions. I've tried to figure this out from NIST IR
8528, but the report (1) doesn't have specific comments on the excluded
submissions and (2) doesn't explain how specific factors were combined
into decisions. There is in any case ample reason for on-ramp teams to
be concerned about how exactly speed will be used in NIST's next
selection round. So I'm hesitant to suggest holding off on demonstrating
what speeds have already been achieved!
As always, teams that would like to submit to SUPERCOP are welcome to do
so, and to update the code for any future improvements. On each machine,
for each primitive, SUPERCOP automatically tries each implementation for
each compiler in its list, and then reports benchmarks specifically for
the best combination. Furthermore, results from updated code in SUPERCOP
systematically replace any results from older code. (They're directly
replaced via re-benchmarking on the same machines if possible; for
older results that haven't been re-benchmarked, there's a schedule of
gradually adding warnings and then removing them from the web pages, in
all cases
with versions clearly labeled.) People running Linux on the same CPUs
can download SUPERCOP and re-run it for verifiability.
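For teams that haven't packaged code for SUPERCOP before, here's a
minimal sketch of the shape of a crypto_sign implementation as SUPERCOP
expects it. The byte counts below are placeholders for illustration,
not any real submission's parameters; the SUPERCOP documentation is the
authoritative reference for the interface.

   /* api.h: sizes for this implementation (placeholder numbers) */
   #define CRYPTO_SECRETKEYBYTES 64
   #define CRYPTO_PUBLICKEYBYTES 32
   #define CRYPTO_BYTES 64   /* maximum signature overhead in bytes */

   /* sign.c: the three entry points exercised by the framework */
   int crypto_sign_keypair(unsigned char *pk, unsigned char *sk);
   int crypto_sign(unsigned char *sm, unsigned long long *smlen,
                   const unsigned char *m, unsigned long long mlen,
                   const unsigned char *sk);
   int crypto_sign_open(unsigned char *m, unsigned long long *mlen,
                        const unsigned char *sm, unsigned long long smlen,
                        const unsigned char *pk);

SUPERCOP compiles code of this shape with each compiler in its list,
runs its correctness tests, and records the results, as described
above.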
Most benchmarking projects don't have the same protections. Even in the
cases where the numbers from those projects are reasonably accurate in
the first place, the numbers end up falling out of date. This poses a
problem for teams that expect future speedups and that haven't yet
volunteered to have their code integrated into benchmarking projects.
Clear assurance from NIST about limits on the role of benchmarks could
help alleviate this problem. Alternatively, teams can take some control
by sending code to SUPERCOP, so that benchmark results are available
from a site that's systematically updated to reflect code improvements.
On a related note, some serious fixes seem necessary to the procedures
that NIST uses for handling benchmark numbers. Pages 5-6 of a recent
NIST report
https://nvlpubs.nist.gov/nistpubs/ir/2025/NIST.IR.8545.pdf
claim to display benchmarks of the round-4 submissions "on x86_64 [1]".
It seems that the tables were assembled manually with copy-and-paste
errors (one of which was noted on this list and led to the report being
updated, apparently without being given a new report number), but
there's also a much more fundamental problem here, as I'll now explain.
Let's focus specifically on mceliece348864f key generation, for which the
report claims a benchmark of "114 189" kcycles. For comparison, the
round-4 Classic McEliece submission reported a median of 35976620
Haswell cycles, i.e., 35977 kcycles, for the same operation, and
presented ample information for reproducing this speed:
These software measurements were collected using supercop-20220506
running on a computer named hiphop. The CPU on hiphop is an Intel
Xeon E3-1220 v3 running at 3.10GHz. This CPU does not support
hyperthreading. It does support Turbo Boost but
/sys/devices/system/cpu/intel_pstate/no_turbo was set to 1,
disabling Turbo Boost. hiphop has 32GB of RAM and runs Ubuntu 18.04.
Benchmarks used ./do-part, which ran on one core of the CPU. The
compiler list was reduced to just gcc -march=native -mtune=native
-O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE.
Source:
https://classic.mceliece.org/mceliece-impl-20221023.pdf
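As background for readers who haven't worked with such measurements: a
figure such as "median of 35976620 Haswell cycles" comes from timing
many runs of the operation and taking the median cycle count. Here's a
simplified sketch of that idea. This is not SUPERCOP's actual
measurement harness; keygen() below is a dummy placeholder for the
operation being measured, and, as in the setup quoted above, Turbo
Boost should be disabled so that cycle counts are comparable.

   #include <stdio.h>
   #include <stdlib.h>
   #include <x86intrin.h>   /* __rdtsc() with gcc/clang on x86_64 */

   static volatile unsigned long long sink;

   /* dummy placeholder workload; a real measurement would call the
      submission's key-generation function here */
   static void keygen(void)
   {
     unsigned long long x = 0;
     for (int i = 0; i < 1000000; ++i) x += (unsigned long long)i * i;
     sink = x;
   }

   static int cmp(const void *a, const void *b)
   {
     unsigned long long x = *(const unsigned long long *)a;
     unsigned long long y = *(const unsigned long long *)b;
     return (x > y) - (x < y);
   }

   int main(void)
   {
     enum { RUNS = 63 };
     unsigned long long t[RUNS];
     for (int i = 0; i < RUNS; ++i) {
       unsigned long long before = __rdtsc();
       keygen();
       t[i] = __rdtsc() - before;
     }
     qsort(t, RUNS, sizeof t[0], cmp);
     printf("median: %llu cycles\n", t[RUNS / 2]);
     return 0;
   }

SUPERCOP's own cycle counters and reporting handle many more details
per platform; the sketch is only meant to illustrate the
median-of-many-runs idea behind numbers like the ones quoted here.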
So why does NIST's report claim 114189 kcycles? Is the unspecified
"x86_64" something much slower than a Haswell, deviating from
https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/BjLtcwXALbA/m/Bjj_77pzCAAJ
in which NIST specifically asked teams to provide "an AVX2 (Haswell)
optimized implementation"?
(Haswell, the most common optimization target in the relevant literature
at that point, was a reasonable choice of target. Earlier statements
from NIST hadn't specified the target beyond saying "the Intel x64
processor". I've seen nothing from NIST requesting public feedback on a
proposal to switch to a different Intel target.)
Reference "[1]" in the report is "Open quantum safe (OQS) algorithm
performance visualizations. Available at
https://openquantumsafe.org/benchmarking." Looking around that web site
shows
that these are numbers for "Intel(R) Xeon(R) Platinum 8259CL CPU", which
is Cascade Lake (basically Skylake), a _newer_ Intel microarchitecture
having generally _smaller_ cycle counts than Haswell.
Concretely, SUPERCOP shows 31314 kcycles for mceliece348864f keygen on
Skylake. So, yes, NIST did deviate from its announced comparison
platform, but this makes the 114189 kcycles even harder to explain.
More importantly, https://openquantumsafe.org/benchmarking has a
disclaimer right at the top:
These pages visualize measurements taken by the now-defunct OQS
profiling project. This project is not currently maintained, and
these measurements are not up to date.
The words "now-defunct" are even in boldface. The archived page
https://web.archive.org/web/20240425050637/https://openquantumsafe.org/benchmarking/
shows that this warning was on the page in April 2024.
Tung Chou sped up the Classic McEliece software in many ways after the
original submission in 2017. In particular, he already published much
faster keygen code in 2019 (as also reported in the round-3 submission
https://classic.mceliece.org/nist/mceliece-20201010.pdf#page.33
in 2020). It's astonishing to see NIST issuing a report in 2025 with
benchmarks of code that's six years out of date, and presenting those as
benchmarks of the round-4 submission, especially when the source that
NIST cites is a page that says at the top that it's presenting obsolete
measurements from a defunct benchmarking project.
If NIST had been carrying out its selection discussions on a public
mailing list, then this 4x error could have been corrected as soon as it
appeared. But FOIA results show that NIST's selection discussions are
mostly carried out in secret. There's no evident way that on-ramp teams
will be able to correct similar benchmarking errors.
---D. J. Bernstein