Two years ago an NTRU Prime update talk announced the new "factored"
NTRU Prime software, a wrapper around "modules with separate tests and
optimizations". I gave a talk today announcing, among other things,
computer verification for most of these modules that the existing "avx"
implementation produces the same output as the existing "ref"
implementation for _all_ possible inputs:
https://cr.yp.to/talks.html#2021.09.03
This is a big step towards full verification of the optimized NTRU Prime
software. Next steps are matching "avx" to "ref" for the other modules
(notably multiplication, where another tool has gotten through some
important parts of the code but not yet everything) and matching the C
code to the Sage reference code.
The "saferewrite" tool used for this verification has a broad range of
applicability beyond NTRU Prime. The first example in the talk is how
saferewrite catches both of the array-comparison problems that were
announced in the official Frodo software. However, to enable this
analysis, I had to define a Frodo array-comparison module and write
reference code for that module, so that saferewrite could compare the
official code to my reference code. This wasn't a big deal since this
particular module is so simple, but an analogous analysis for larger
components of Frodo (short of taking the entire KEM as a monolith!)
would require additional work to write reference code for those modules.
For NTRU Prime, this work is already done.
For each module, saferewrite compiles each implementation with clang -O1
and with gcc -O3, uses the angr symbolic-execution toolkit to convert
each binary into unrolled code in a much simpler language, and uses the
Z3 theorem prover (via angr's claripy) to verify equivalence or find a
counterexample. The automatic equivalence chains look like this
(although this pattern isn't optimal in general):
opt clang -O1 = ref clang -O1 = avx clang -O1
||
opt gcc -O3 = ref gcc -O3 = avx gcc -O3
There could be compiler bugs affecting outputs, but to evade detection
these bugs would have to have the same effect on every node in the
diagram simultaneously. (It would also be possible to hook a direct
Python-to-the-simpler-language conversion into the picture.) There could
be unrolling bugs, but saferewrite also runs the binaries on some random
inputs (plus all-0 and all-1 to make sure every bit is touched) and
checks that these match the unrolled code; also, angr has been heavily
exercised in a variety of reverse-engineering applications. There could
be bugs in saferewrite itself, but reviewing saferewrite is a much
smaller task than reviewing a ton of optimized post-quantum code.
The examples supplied in the saferewrite package include deliberately
buggy code to exercise saferewrite's tests, in particular producing 16
analyses printing "differentfrom" counterexamples (which I've checked by
hand), providing some evidence that saferewrite is working as desired.
Some of these bugs are also found by random tests, but some aren't. More
advanced fuzzing can do better than random tests but has no hope of
finding typical cryptographic overflow bugs.
Seeing C code working with two compilers doesn't mean that the same code
will work with further compilers, but if analyses are fast enough then
it's realistic to re-apply the analyses whenever the compiler changes. I
tried the saferewrite analysis of 107 implementations of 27 functions,
times 2 compilers, on a dual EPYC 7742; it finished in 8 minutes of
wall-clock time, using 20 cores on average, using under 200GB of RAM.
I'm filing this as an OFFICIAL COMMENT regarding NTRU Prime because it's
directly relevant to the official NISTPQC evaluation criteria, notably
the following:
The algorithms can be implemented securely and efficiently on a wide
variety of platforms, including constrained environments, such as smart
cards.
Various NISTPQC submissions have provided fast AVX2 software, fast M4
software, etc., but the primary evidence for "securely" is that the
software is constant-time. (I'll skip discussion of the broken masked
implementations.) The problem is that the same optimizations add massive
complexity to the software, and this complexity is a security threat:
*
https://arxiv.org/abs/2107.04940 studied the vulnerabilities
announced between 2010 and 2020 in eight well-known cryptographic
libraries, and found 73 vulnerabilities in the cryptographic
computations, including 11 known to be exploitable ("severe"),
along with "evidence of a strong correlation between the complexity
of these libraries and their (in)security". (There were also
hundreds of further bugs, such as buffer overflows.)
* Post-quantum software is newer, more complicated, and much harder
to thoroughly review. Superficial reviews of post-quantum software
have caught one devastating bug after another; the only reasonable
prediction is that more serious reviews will find many more bugs.
Is it reasonable to say that an algorithm "can be implemented securely
and efficiently" if fast implementations are so complex that the experts
are getting them wrong? If the answer is pointing to an implementation
and saying "No bugs are known in this implementation", then why should
we think that the code is correct, rather than thinking that security
reviewers are overloaded and that this answer is pure selection bias?
There's stronger evidence for "securely and efficiently" when optimized
modules are verified to match much simpler reference implementations.
Covering more modules will further strengthen this evidence.
Another relevant NISTPQC evaluation criterion is the following:
Factors that might hinder or promote widespread adoption of an
algorithm or implementation will be considered in the evaluation
process, including, but not limited to, ...
The availability of modularized implementations, and the availability of
verification tools applicable to some of those modules, certainly help
promote widespread adoption of those implementations and the algorithm.
---Dan (speaking for myself)