Alperin-Sheriff, Jacob (Fed) writes:
> Just for reference, of the 65 schemes that have not officially
> withdrawn yet, 38 submissions did not contain additional
> implementations using assembly/vector instructions (i.e. that are
> likely faster).
I don't see enough information to reproduce the object code that NIST
benchmarked, but it seems quite likely that _all_ of the submissions
used vectorization and/or hand-optimized assembly and/or hand-optimized
intrinsics in this object code.
How is this possible, given NIST's requirement for the source code to be
portable ANSI C without such instructions? Common answers include
* vectorization generated by compilers and
* hand-written assembly and/or intrinsics hidden inside preexisting
libraries that were exempted from the requirement.
For example, NIST allowed submissions to use the GMP library. Checking
for gmp.h suggests that the library is used in 8 of the 38 submissions
you mentioned: Compact LWE, Mersenne-756839, Odd Manhattan, pqRSA (which
NIST counts as two submissions), Ramstake, RankSign, and RQC. GMP, in
turn, uses platform-specific optimizations, which presumably affected
the numbers in NIST's benchmarks. Whether or not the statement about
"additional implementations" is correct, it's missing the point.
Is NIST going to replace GMP with an unoptimized big-integer library? Or
is NIST going to continue giving these submissions a speed boost through
GMP's use of platform-specific optimizations, while prohibiting such
optimizations in the subroutines used in most submissions? Of course the
marketing department wants to say that NIST's measurements are fair and
balanced, but looking at real examples such as GMP shows that these
measurements were collected under incoherent ground rules.
The situation becomes even worse when one looks at vectorization added
by gcc. Spot checks suggest that _every_ submission ends up using _some_
vector instructions---but at a small and apparently random fraction of
the locations where vectorization is possible. For lucky submissions,
this fraction includes critical inner loops, producing a big speed
boost, although there's generally also much more room for improvement.
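Here is a toy illustration (my own example; the exact behavior depends
on the gcc version and flags): with -O3, gcc typically vectorizes the
first loop, whose iterations are independent, and leaves the second loop
scalar because of the loop-carried dependence. Whether a submission's
critical inner loops look more like the first or the second is much of
what decides whether it gets the boost.

   #include <stdint.h>

   void add_vectors(int32_t *r, const int32_t *a, const int32_t *b, int n)
   {
     /* independent iterations: gcc -O3 will usually vectorize this */
     for (int i = 0; i < n; ++i)
       r[i] = a[i] + b[i];
   }

   int32_t chained(int32_t *a, int n)
   {
     /* each iteration depends on the previous one: usually left scalar */
     int32_t s = 1;
     for (int i = 0; i < n; ++i) {
       s = (s * 31 + a[i]) % 1000003;
       a[i] = s;
     }
     return s;
   }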
Of course it's possible to study gcc in detail and figure out structure
in what's going on---and search for modifications to C code that make
gcc produce better results without compromising some theoretical notion
of portability. I used to do this sort of thing many years ago but then
switched to code-generation tools that achieve better performance with
much less programming time: intrinsics, assembly, etc.
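For comparison, here is what the first loop above looks like when
written directly with AVX2 intrinsics (a sketch assuming n is a multiple
of 8; compile with -mavx2). This is the kind of code excluded by the
language restrictions, even though it is closer to what a
performance-conscious implementor actually ships on this platform:

   #include <immintrin.h>
   #include <stdint.h>

   void add_vectors_avx2(int32_t *r, const int32_t *a, const int32_t *b, int n)
   {
     for (int i = 0; i < n; i += 8) {
       __m256i va = _mm256_loadu_si256((const __m256i *) (a + i));
       __m256i vb = _mm256_loadu_si256((const __m256i *) (b + i));
       _mm256_storeu_si256((__m256i *) (r + i), _mm256_add_epi32(va, vb));
     }
   }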
The big picture is that CPUs are evolving farther and farther away from
the naive machine model expressed by the textbook-level subset of C, and
compilers are having more and more trouble closing this gap. This is why
it's less and less common for real-world performance-critical code to be
written in this subset. Real-world portability is achieved not by
banning instructions, but by having enough code alternatives to cover
all target processors.
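A typical shape for this, sketched here with invented names purely for
illustration, is a thin dispatch layer in which the plain C version
serves as the fallback rather than as the ceiling:

   /* hypothetical dispatch; names are illustrative, not from any submission */
   #if defined(__AVX2__)
   #  define crypto_op_impl crypto_op_avx2     /* intrinsics/assembly path */
   #elif defined(__ARM_NEON)
   #  define crypto_op_impl crypto_op_neon
   #else
   #  define crypto_op_impl crypto_op_portable /* plain C fallback */
   #endif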
> apples-to-apples comparison
What we care about is the time that the crypto needs. What NIST has
actually measured is something that's generally _much_ larger:
      time that the crypto needs
    + time lost to deficiencies in this unrealistic language subset
    + time lost to deficiencies in compilers for this language
    + time lost depending on how implementations are written (which could
      be blamed on the implementor, but this presumes that the artificial
      setup is something implementors should be spending time on).
As I said before, this is primarily random noise, and has very little
predictive power regarding the timings that users will end up with.
> mandatory optimized implementations
NIST's use of the word "optimized" here is contradicted by NIST's
restrictions on the language---restrictions that make a huge difference
in performance on the reference platform. In response to objections from
various people, NIST stated the following on pqc-forum:
   When we evaluate performance, we’ll be interested in the performance
   of the best-available implementations on a particular platform.
Reporting the performance of the best implementations available gives
teams a healthy incentive to produce software showing the performance
that the crypto is really capable of. This requires platform-specific
optimizations, and the majority of submissions already provided _some_
platform-specific optimizations in "additional implementations" or via
preexisting libraries.
Unfortunately, what NIST actually measured is something that's generally
much slower, namely performance of implementations under artificial
restrictions. This replaces the healthy incentive with an unhealthy
incentive to produce faster implementations under these restrictions.
These are generally much slower than unrestricted implementations
(unless all the critical code is in libraries exempted from the
restrictions); these aren't what users who care about performance will
use; but these still consume software-engineering resources and
verification resources.
Mike Hamburg writes:
> The unclear security estimates are mostly a problem with lattice
> systems. And as you so convincingly argued last week, it’s probably
> best if these at least go through the standardization process before
> we start rolling those out.
You're attributing to me statements that I never made. Pointing out
security risks---and pointing out false statements regarding those
risks---is very different from endorsing any particular strategy for
handling those risks. Furthermore, in the message you're replying to, I
mentioned applications that are "agile enough to deploy before
standardization", and I objected to misinformation delaying deployment
of post-quantum crypto in those applications.
For the record, I was happy to see Google rolling out a cryptosystem in
the NTRU family, at least _trying_ to protect user data against future
quantum computers. It seems entirely possible that
* quantum computation will eventually be cheap enough for wide-scale
attacks by various countries that have been recording data but
* this crypto is unbreakable even by such adversaries,
and in this scenario there's a tremendous societal benefit to what
Google started doing.
Recognizing _risks_ of breaks in the crypto, and taking sensible
risk-management steps, isn't the same as saying that there _are_ breaks.
It also isn't the same as saying that Google should have delayed. On the
contrary: the benefits were potentially massive, and Google could afford
the engineering costs. Rumors indicate that Google was then threatened
with a patent lawsuit, but that's a different issue.
---Dan