Moody, Dustin (Fed) writes:
> As stated in NIST IR 8545, the benchmark tables were intended to
> provide representative data points,
Here we go again. Representative of _what_?
The numbers NIST used aren't for the round-4 submission: they're for
much slower software (and we now know that the software was slowed
down by a third party). But NIST labels them as benchmarks of the
round-4 submission. This is data falsification by NIST.
> not authoritative
No, the report had no such disclaimer, nor would such a disclaimer be
relevant to the problem at hand, namely NIST fabricating data.
> or exhaustive metrics.
Irrelevant. Table 5 uses the typical speed metrics (keygen cycles, enc
cycles, dec cycles), and presents misinformation from NIST regarding the
performance of the round-4 Classic McEliece submission in those metrics.
> To that end, we cited Open Quantum Safe because it provided a
> consistently structured and well-documented dataset that served our
> goal of presenting a fair, representative comparison across
> submissions.
No, that reference did not cover the round-4 Classic McEliece
submission. NIST promised that it would evaluate the "submitted
algorithms"; that reference presented measurements of a much slower
third-party algorithm; ergo, it was wrong for NIST to use that data.
We now know exactly what created the bulk of the slowdown: namely,
liboqs removed the fast sorting subroutine in the AVX2-optimized round-4
Classic McEliece code, and replaced it with much slower sorting code.
But, even without this specific knowledge, NIST had a responsibility to
check whether it was looking at evaluations _of the submission_, rather
than evaluations of something else.
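As a concrete illustration of how little work such a check is: below
is a minimal sketch in C. It assumes liboqs's public KEM API
(OQS_KEM_new, OQS_KEM_keypair, OQS_KEM_free) and the
OQS_KEM_alg_classic_mceliece_348864 parameter-set name; the run count
and the use of the raw x86 cycle counter are illustrative, not a
benchmarking methodology. Comparing the printed median against the
speed table in the submission, or against SUPERCOP, immediately shows
whether the same software is being measured.

   /* Minimal check of what a given liboqs build actually costs for
      Classic McEliece keygen.  Assumes liboqs is installed with this
      parameter set enabled; x86 only; link with -loqs. */
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <x86intrin.h>          /* __rdtsc() */
   #include <oqs/oqs.h>

   #define RUNS 64                 /* illustrative run count */

   static int cmp_u64(const void *a, const void *b)
   {
     uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
     return (x > y) - (x < y);
   }

   int main(void)
   {
     OQS_init();                   /* library initialization */
     OQS_KEM *kem = OQS_KEM_new(OQS_KEM_alg_classic_mceliece_348864);
     if (!kem) { fprintf(stderr, "parameter set not enabled\n"); return 1; }

     uint8_t *pk = malloc(kem->length_public_key);
     uint8_t *sk = malloc(kem->length_secret_key);
     uint64_t t[RUNS];
     if (!pk || !sk) return 1;

     for (int i = 0; i < RUNS; i++) {
       uint64_t t0 = __rdtsc();
       if (OQS_KEM_keypair(kem, pk, sk) != OQS_SUCCESS) return 1;
       t[i] = __rdtsc() - t0;
     }
     qsort(t, RUNS, sizeof t[0], cmp_u64);
     printf("keygen: median %llu cycles over %d runs\n",
            (unsigned long long)t[RUNS / 2], RUNS);

     free(sk); free(pk); OQS_KEM_free(kem);
     return 0;
   }

The same loop, pointed at the submission's own crypto_kem_keypair,
gives the other side of the comparison.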
The page cited by NIST reports that it's measuring OQS version
0.9.0-rc1. NIST suppressed that information, and falsely claimed that
these were benchmarks of the round-4 submission.
Regarding fairness, reporting slowed-down numbers for just _one_ of the
submissions would have been glaringly unfair even if NIST had openly
admitted the substitution. But NIST _hid_ the substitution. That's what
turns the unfairness into outright falsification.
There are other reasons NIST obviously shouldn't have used this data set.
For example:
* NIST's information-quality standards require measurements to be
  accompanied by quantitative indications of variability. This data
  set flunks that requirement, as does Table 5 of NIST IR 8545. (A
  minimal sketch of that sort of reporting appears after this list.)
* NIST had asked teams for "an AVX2 (Haswell) optimized
implementation". This data set covers only a few CPUs, _not_
including Haswell. NIST has been repeatedly waving at platform
differences as supposedly explaining Table 5; that isn't true (the
numbers that NIST took from this data set are for Cascade Lake,
which is _faster_ than Haswell), but _seeing_ that it isn't true
is extra work for readers. NIST should have stuck to its announced
comparison platforms instead of suddenly complicating the picture.
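To make "quantitative indications of variability" concrete, here is an
equally small sketch: it reads repeated cycle counts (one per line on
stdin; the input format is just an assumption for illustration) and
prints quartiles alongside the median, roughly the shape of reporting
that SUPERCOP provides, instead of a single unqualified number.

   /* Report quartiles of repeated measurements rather than one number.
      Input: one cycle count per line on stdin (illustrative format). */
   #include <stdio.h>
   #include <stdlib.h>

   static int cmp(const void *a, const void *b)
   {
     double x = *(const double *)a, y = *(const double *)b;
     return (x > y) - (x < y);
   }

   int main(void)
   {
     static double v[1 << 20];
     size_t n = 0;
     while (n < sizeof v / sizeof v[0] && scanf("%lf", &v[n]) == 1) n++;
     if (n == 0) { fprintf(stderr, "no measurements\n"); return 1; }
     qsort(v, n, sizeof v[0], cmp);
     printf("n %zu  q1 %.0f  median %.0f  q3 %.0f\n",
            n, v[n / 4], v[n / 2], v[3 * n / 4]);
     return 0;
   }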
But the really big problem here is NIST pretending that the numbers are
for the submission, when in fact the numbers are for a much slower
third-party algorithm.
> We accurately reproduced the numbers reported there.
If a data point says "The observed temperature in D.C. at 23:59 on 8
August 2025 was 67 degrees", manipulating it to say "The observed
temperature in D.C. at 23:59 on 8 August 2025 was 73 degrees" is data
fabrication. Manipulating it to say "The observed temperature in
Anchorage at 23:59 on 8 August 2025 was 67 degrees" is also data
fabrication. One can't defend this by saying that the 67 was copied
correctly. The data point includes the number _and_ the statement of
what the number was observing.
NIST took information from a source that said it was measuring "OQS
version 0.9.0-rc1". NIST removed the original labeling of these numbers
as measurements of "OQS version 0.9.0-rc1". NIST pretended that these
were measurements of the round-4 Classic McEliece submission. This is
NIST fabricating data, and more specifically NIST falsifying data. The
misconduct here is similar to, e.g., the case reported in
https://retractionwatch.com/2025/03/21/osaka-dental-university-fabricated-data-investigation/
("Because images in the other two articles were identical to images in
the 2014 paper, the university determined the data in the later papers
were fabricated").
> We did not disregard SUPERCOP; it was among the sources we reviewed
> during the evaluation process. We are of course also aware of the
> numbers reported in the submission documents and took those into
> account.
To clarify, you're saying that NIST _saw_ that the numbers in Table 5
were much larger than the numbers from the submission and from SUPERCOP?
How exactly are you claiming that NIST took this into account?
The public evidence is that NIST IR 8545 says nothing about this
gigantic gap. The most charitable explanation is that NIST looked at
only one source, somehow misunderstood what that source was measuring,
and recklessly took those numbers without checking any other sources
(such as the speed table provided with the submission).
It's much worse if NIST looked at multiple sources and _knew_ that it
was taking an outlier while suppressing the numbers from other sources.
A 4x gap in cycle counts is screaming "we're not measuring the same
software".
> Not every number, data point, or result that we evaluated made its way
> into the Report, as it was a summary.
"We operate transparently. We've shown all our work" (source:
https://web.archive.org/web/20211115191840/https://www.nist.gov/blogs/taking-measure/post-quantum-encryption-qa-nists-matt-scholl)
> The key generation cycle counts for Classic McEliece did not play a
> significant role in the rationale for not selecting it for
> standardization.
NIST's report fails to quantify the weights placed on individual
comparison factors. This gives NIST the freedom to respond to _any_
error by claiming that the error wasn't "significant" and that fixing
the error wouldn't have changed NIST's decisions.
Such claims lack credibility: why is NIST putting information into a
report in the first place if the information doesn't matter? These speed
tables are prominently placed on page 6 of a 34-page report. They're the
only justification that the report provides for various statements in
the text, such as the claim that Classic McEliece keygen is "three
orders of magnitude more costly than HQC".
The report also isn't limited to looking at past decisions, and isn't
limited to looking at NIST's decisions. In particular, even though the
report fails to recognize the existing deployments of Classic McEliece
documented on
https://mceliece.org, the report does admit that there's
an ongoing process of multiple parties at least _considering_ Classic
McEliece (to quote the report: "After the ISO standardization process
has been completed, NIST may consider developing a standard for Classic
McEliece based on the ISO standard"). It's irresponsible for NIST to be
feeding misinformation into future decisions.
In the end, claiming that NIST's falsified data hasn't had an impact
doesn't remove NIST's obligation to issue a correction. As
https://www.ams.org/about-us/governance/policy-statements/sec-ethics
puts it, we have a responsibility to "correct in a timely way or to
withdraw work that is erroneous".