Some recent comments seem to assume that Kyber is the most efficient
lattice KEM in NISTPQC. However, if I want an ARM Cortex-M4 to decrypt
messages, specifically with Core-SVP >= 2^128, then my costs (sorted
by cycles+1000*bytes) are as follows, according to (1) pqm4 benchmarks
from https://github.com/mupq/pqm4/blob/master/benchmarks.md, (2) tables
of ciphertext sizes, and (3) tables of Core-SVP:
* 486707 cycles + receiving 897 bytes with NTRU Prime (sntrup653).
* 812608 cycles + receiving 931 bytes with NTRU (ntruhps2048677).
* 748573 cycles + receiving 1088 bytes with Kyber (kyber768-90s).
* 774055 cycles + receiving 1088 bytes with SABER (saber).
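The ranking above can be reproduced with a short script; the cycle and byte counts are copied verbatim from the list, and the cost metric is the stated cycles+1000*bytes:

```python
# Rank the four KEMs by the cost metric used above: cycles + 1000 * ciphertext bytes.
# Cycle counts are the pqm4 decapsulation numbers; byte counts are ciphertext sizes.
kems = [
    ("sntrup653",      486707,  897),
    ("ntruhps2048677", 812608,  931),
    ("kyber768-90s",   748573, 1088),
    ("saber",          774055, 1088),
]

def cost(entry):
    _name, cycles, ct_bytes = entry
    return cycles + 1000 * ct_bytes

for name, cycles, ct_bytes in sorted(kems, key=cost):
    print(f"{name:15s} {cycles + 1000 * ct_bytes:>8d}")
```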
---Dan
Amazingly, Kyber's M4-optimized software manages to take more cycles
than NTRU Prime at _every_ security level >=128! How is this possible?
Derek Atkins writes:
> Efficiency is not just number of cycles (or transmission). There are
> also size and memory usage considerations.
Sure, it's always possible to pick benchmarks that teams haven't
optimized code for yet. I fully agree that optimizing RAM is important
for some applications, and I said in the first message that there are
"other benchmarks where Kyber does better".
None of this explains Kyber's losses in Cortex-M4 cycles, which it _did_
optimize code for, and in ciphertext size. These make Kyber considerably
slower than NTRU Prime in this simple message-receiving scenario: Kyber
receives 20% more bytes and then spends 50% more cycles.
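The "20% more bytes ... 50% more cycles" figures follow directly from the numbers in the first message:

```python
# Ratios behind the "20% more bytes, 50% more cycles" claim, using the
# kyber768-90s and sntrup653 numbers quoted in the first message.
kyber_bytes, sntrup_bytes = 1088, 897
kyber_cycles, sntrup_cycles = 748573, 486707

byte_overhead = kyber_bytes / sntrup_bytes - 1     # ~0.21, i.e. about 20% more bytes
cycle_overhead = kyber_cycles / sntrup_cycles - 1  # ~0.54, i.e. about 50% more cycles
print(f"{byte_overhead:.1%} more bytes, {cycle_overhead:.1%} more cycles")
```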
The call for submissions explicitly considers applications that "can
cache public keys, or otherwise avoid transmitting them frequently".
Key reuse is an important, commonly applied speedup, and if we're aiming
for top performance then of course we should take this into account.
The call also considers other scenarios, but benchmarks that are
presented as overall performance while failing to cover this scenario
are (1) mislabeled, (2) misleading, and (3) not following the rules.
This is "benchmarking crime B1" in http://arxiv.org/abs/1801.02381.
P.S. https://eprint.iacr.org/2021/826, to appear at USENIX Security
2022, demonstrates _much_ faster keygen for sntrup inside TLS 1.3---
faster than NIST P-256 keygen! Older sntrup keygen benchmarks are thus
obsolete, including every number in the message I'm replying to. The
same approach is being ported to more platforms. Even without that, pqm4
sntrup keygen costs only about a dozen decaps---well under a second on a
highly constrained M4 for a key that can be reused for years.
Vadim Lyubashevsky writes:
> on Cortex-M4 as on X86. The majority of that time is spent on
> expanding a public seed into a 3x3 polynomial matrix which is used in
> decapsulation.
Thank you for indirectly confirming my point about Kyber's "increasingly
severe performance problems after 1024".
> (Dan,)
>
> You know this well, but let me be explicit -- it's not just a "common
> speedup." Key reuse essentially forgoes forward security, which is an
> integral part of the security feature set of protocols such as TLS
> 1.3. Benchmarks with enc/dec only should perhaps be labeled "no
> forward security" to emphasize what type of security assurance is
> lost with that approach.
Session tickets, which avoid the entire handshake, have similar
issues. But a limited loss to forward secrecy of say 10 seconds or 20
seconds on a busy server that can then handle many more connections is
a tradeoff that's worth making. Having KEM keys in certificates is one
approach to optimizing TLS for the postquantum era, and there the keys
cannot change.
I'll have to say what Derek didn't: the NTRU Prime Cortex-M4
implementations that you chose as examples in this thread are clearly
not practical for use in embedded engineering. They have very limited
value as evidence of the suitability of NTRU Prime for embedded use
(if any at all).

It's not just RAM but also their large size: they occupy a very large
part of the Flash program memory (for a single algorithm variant). The
large size is not a result of a lack of optimization, but the result
of doing way too much of it. Significant performance improvements can
be gained by completely unrolling the multipliers -- asymptotically
efficient non-NTT multipliers (e.g. Toom-Cook) have quite complex
control logic, and the Cortex-M4 has only a very primitive program
cache. These performance gains go away when one has to shrink the
implementation to a realistic size.
As an extreme example, one can look at the 40,996-line source code for
the sntrup1277 multiplier:
https://github.com/mupq/pqm4/blob/master/crypto_kem/sntrup1277/m4f/mod3_mul128xN.S
The actual implementation of this variant requires 455,956 bytes of
Flash. Yes, half a megabyte for a single-variant embedded KEM firmware.
Although currently the clearest violator, the NTRU Prime team has not been the only one who has been guilty of ignoring the purpose of Cortex M4 evaluation. [cut]
Dear Bo-Yin Yang,

Yes, I didn't know that the NTRU Prime code was written by Master's
students. For that it is very impressive!
The newer paper that you mention [2] is a nice one but does not contain data about NTRU Prime, only Saber.
As you know, Saber, unlike NTRU, benefits from its module (LWR)
structure, which allows a fixed-size multiplier to be reused
(effectively in a loop) without an increase in code size as the
dimension grows. Furthermore, the fixed n=256 ring size of SABER makes
it much more suitable for NTT (the technique used in [2]) than NTRU. I
don't see how these results extend to NTRU Prime, at least without
actual data being presented.
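The code-size point can be illustrated with a sketch (this is not actual SABER code; the function names and the schoolbook multiplier are illustrative only): in a module-structured scheme, one fixed-size ring multiplier is reused for every entry of the l x l matrix, so the multiplier code does not grow with the module rank.

```python
# Sketch, not actual SABER code: a module-structured scheme reuses a single
# fixed-size ring multiplier for every matrix entry, so the code size of the
# multiplier stays constant as the module rank l grows.
N = 256   # fixed ring dimension, as in R_q = Z_q[x]/(x^N + 1)
Q = 8192

def ring_mul(a, b):
    """Schoolbook multiplication in Z_Q[x]/(x^N + 1); x^N wraps to -1."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

def matrix_vector_mul(A, s):
    """Multiply an l x l matrix of ring elements by a length-l vector,
    looping over the one fixed-size multiplier instead of unrolling."""
    l = len(s)
    out = []
    for i in range(l):
        acc = [0] * N
        for j in range(l):
            prod = ring_mul(A[i][j], s[j])
            acc = [(x + y) % Q for x, y in zip(acc, prod)]
        out.append(acc)
    return out
```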
If there are more compact or otherwise efficient implementations of NTRU Prime available, I'd suggest submitting them to pqm4 for public evaluation.
Dear all,
To increase speed in this simple message-receiving scenario, at the
cost of more storage, the decryptor can store the expanded matrix as
part of the key (this was already specified in the Kyber
documentation). This would add another 256*9*12/8 = 3456 bytes to
storage. Since the code size of Kyber is 10KB smaller than sntrup653
and the RAM requirement is 16KB smaller, increasing the key size by
3.5KB should still keep the storage/ram of Kyber noticeably less
regardless of where these extra bytes are being stored.
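The 3456-byte figure is just the size of the 3x3 matrix of ring elements, each with 256 coefficients of 12 bits:

```python
# Arithmetic behind the 3456-byte figure for storing Kyber768's expanded
# matrix: 3x3 = 9 ring elements, 256 coefficients each, 12 bits per coefficient.
n, matrix_entries, bits_per_coeff = 256, 3 * 3, 12
extra_bytes = n * matrix_entries * bits_per_coeff // 8
print(extra_bytes)  # 3456
```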
The time saving comes from not needing to hash (or run AES in Kyber
90s) to expand the key. I don't know what the current profile is for
key expansion vs other hashing (would be happy if someone chimed in),
but let's just say conservatively that half of the hashing is saved.
The running time would therefore decrease by 30% to about 520K cycles.
So Kyber768-90s would be roughly 10% slower (this number is obviously
not very precise), require 20% more ciphertext bytes, and provide ~50
more bits of Core-SVP security than sntrup653.
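The rough estimate above can be retraced numerically (the 30% saving is the message's own conservative guess, not a measured figure):

```python
# Rough estimate from the message above: if ~30% of the kyber768-90s
# decapsulation time is saved by storing the expanded matrix, compare
# the result against sntrup653's decapsulation cycle count.
kyber_cycles = 748573     # pqm4 kyber768-90s decapsulation
sntrup_cycles = 486707    # pqm4 sntrup653 decapsulation

estimated = kyber_cycles * 0.7            # ~524K cycles, i.e. roughly 520K
slowdown = estimated / sntrup_cycles - 1  # under 10% slower than sntrup653
print(f"{estimated:.0f} cycles, {slowdown:.0%} slower than sntrup653")
```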