Kyber's inefficiency: some data points


D. J. Bernstein

Oct 20, 2021, 1:59:03 PM10/20/21
to pqc-...@list.nist.gov
Some recent comments seem to assume that Kyber is the most efficient
lattice KEM in NISTPQC. However, if I want an ARM Cortex-M4 to decrypt
messages, specifically with Core-SVP >= 2^128, then my costs (sorted
by cycles+1000*bytes) are as follows, according to (1) pqm4 benchmarks
from https://github.com/mupq/pqm4/blob/master/benchmarks.md, (2) tables
of ciphertext sizes, and (3) tables of Core-SVP:

* 486707 cycles + receiving 897 bytes with NTRU Prime (sntrup653).

* 812608 cycles + receiving 931 bytes with NTRU (ntruhps2048677).

* 748573 cycles + receiving 1088 bytes with Kyber (kyber768-90s).

* 774055 cycles + receiving 1088 bytes with SABER (saber).
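
For reference, a minimal sketch reproducing this ranking under the stated
cost metric (cycles + 1000*bytes; figures copied from the list above):

```python
# Rank KEM decryption costs on Cortex-M4 by cycles + 1000*bytes,
# using the cycle counts and ciphertext sizes listed above.
candidates = {
    "sntrup653":      (486707,  897),
    "ntruhps2048677": (812608,  931),
    "kyber768-90s":   (748573, 1088),
    "saber":          (774055, 1088),
}

def cost(cycles, ct_bytes):
    # The metric stated above: each received byte weighted as 1000 cycles.
    return cycles + 1000 * ct_bytes

for name, (cyc, ct) in sorted(candidates.items(), key=lambda kv: cost(*kv[1])):
    print(f"{name:15s} {cost(cyc, ct)}")
```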

NTRU can improve a bit by specifying a smaller dimension, while Kyber
and SABER can't vary their dimensions---it's not just "more effort" but
structurally impossible. Of course cycle counts could improve, but the
current situation is a solid loss for Kyber and SABER.

(Should this be accompanied by histrionics regarding how it would be a
disaster for NIST to be selecting inferior, poorly designed candidates
when there's clearly a much better scheme available?)

It has become reasonably clear that Kyber will do even worse in ASIC
implementations. There are other benchmarks where Kyber does better, but
cherry-picking pro-Kyber benchmarks and pretending that Kyber always
wins is highly inappropriate. The actual situation is that the
performance winner depends on the application.

---Dan

Derek Atkins

Oct 20, 2021, 2:53:05 PM10/20/21
to d...@cr.yp.to, pqc-...@list.nist.gov
Dan,

On Wed, 2021-10-20 at 19:58 +0200, D. J. Bernstein wrote:
Some recent comments seem to assume that Kyber is the most efficient
lattice KEM in NISTPQC. However, if I want an ARM Cortex-M4 to decrypt

Efficiency is not just the number of cycles (or transmission). There are also code-size and memory-usage considerations.

messages, specifically with Core-SVP >= 2^128, then my costs (sorted
by cycles+1000*bytes) are as follows, according to (1) pqm4 benchmarks
from https://github.com/mupq/pqm4/blob/master/benchmarks.md, (2) tables
of ciphertext sizes, and (3) tables of Core-SVP:

   * 486707 cycles + receiving 897 bytes with NTRU Prime (sntrup653).

This requires 16,088 bytes of RAM and a total executable size of 237,580 bytes.


   * 812608 cycles + receiving 931 bytes with NTRU (ntruhps2048677).

This requires 19,728 bytes of RAM and a total executable size of 149,988 bytes.


   * 748573 cycles + receiving 1088 bytes with Kyber (kyber768-90s).

This requires 3,520 bytes of RAM and a total executable size of 10,796 bytes.


   * 774055 cycles + receiving 1088 bytes with SABER (saber).

This requires 7,324 bytes of RAM and a total executable size of 18,708 bytes.

This data is also available from that same benchmark location.

My quick analysis: the two NTRU systems require 8-20x the code space and 2-5x the RAM of Kyber and SABER.

-derek

-- 
Derek Atkins
Chief Technology Officer
Veridify Security - Securing the Internet of Things®

Office: 203.227.3151  x1343
Direct: 617.623.3745
Mobile: 617.290.5355
Email: DAt...@Veridify.com


D. J. Bernstein

Oct 20, 2021, 4:52:13 PM10/20/21
to pqc-...@list.nist.gov
Derek Atkins writes:
> Efficiency is not just number of cycles (or transmission). There are
> also size and memory usage considerations.

Sure, it's always possible to pick benchmarks that teams haven't
optimized code for yet. I fully agree that optimizing RAM is important
for some applications, and I said in the first message that there are
"other benchmarks where Kyber does better".

None of this explains Kyber's losses in Cortex-M4 cycles, which it _did_
optimize code for, and in ciphertext size. These make Kyber considerably
slower than NTRU Prime in this simple message-receiving scenario: Kyber
receives 20% more bytes and then spends 50% more cycles.
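
(The two percentages follow directly from the figures in the first
message; a quick check:)

```python
# Ratios behind the "20% more bytes ... 50% more cycles" claim,
# using the kyber768-90s and sntrup653 figures from the first message.
kyber_cycles, kyber_bytes = 748573, 1088
sntrup_cycles, sntrup_bytes = 486707, 897

byte_overhead = kyber_bytes / sntrup_bytes - 1     # about 21%
cycle_overhead = kyber_cycles / sntrup_cycles - 1  # about 54%
print(f"{byte_overhead:.0%} more bytes, {cycle_overhead:.0%} more cycles")
```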

---Dan

Vadim Lyubashevsky

Oct 22, 2021, 8:14:31 AM10/22/21
to pqc-forum
Dear all,

On Wed, Oct 20, 2021 at 10:52 PM D. J. Bernstein <d...@cr.yp.to> wrote:
>
> None of this explains Kyber's losses in Cortex-M4 cycles, which it _did_
> optimize code for, and in ciphertext size. These make Kyber considerably
> slower than NTRU Prime in this simple message-receiving scenario: Kyber
> receives 20% more bytes and then spends 50% more cycles.

I just want to add a small observation to the message-receiving application.

The hashing evaluation table at
https://github.com/mupq/pqm4/blob/master/benchmarks.md states that 60%
of the time of Kyber768-90s is spent on hashing, which is not as fast
on Cortex-M4 as on x86. The majority of that time is spent on
expanding a public seed into a 3x3 polynomial matrix which is used in
decapsulation.

To increase speed in this simple message-receiving scenario, at the
cost of more storage, the decryptor can store the expanded matrix as
part of the key (this was already specified in the Kyber
documentation). This would add another 256*9*12/8 = 3456 bytes to
storage. Since the code size of Kyber is 10KB smaller than sntrup653
and the RAM requirement is 16KB smaller, increasing the key size by
3.5KB should still keep the storage/ram of Kyber noticeably less
regardless of where these extra bytes are being stored.

The time saving comes from not needing to hash (or run AES in Kyber
90s) to expand the key. I don't know what the current profile is for
key expansion vs other hashing (would be happy if someone chimed in),
but let's just say conservatively that half of the hashing is saved.
The running time would therefore decrease by 30% to 520K cycles. So
Kyber768-90s would be 10% slower (this number is obviously not very
precise), require 20% more ciphertext bytes, while providing ~ 50 more
bits of Core-SVP security than sntrup653.
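
A quick check of the arithmetic above (the 60%-hashing figure is from
pqm4; the half-saved split is the conservative guess stated above, not a
measurement):

```python
# Storage for caching the expanded 3x3 matrix of degree-256
# polynomials with 12-bit coefficients:
matrix_bytes = 3 * 3 * 256 * 12 // 8
print(matrix_bytes)  # 3456

# Back-of-envelope decapsulation time: 60% of the cycles are hashing;
# assume half of that (the matrix expansion) is saved by caching.
full_cycles = 748573                   # kyber768-90s decaps (pqm4)
saved = int(full_cycles * 0.60 * 0.5)  # ~30% of the total
print(full_cycles - saved)             # roughly 520K cycles
```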

Best,
Vadim



D. J. Bernstein

Oct 23, 2021, 6:10:41 AM10/23/21
to pqc-forum
Vadim Lyubashevsky writes:
> ~ 50 more bits of Core-SVP security

Sure, but if we're considering more Core-SVP levels then NTRU Prime
offers a spectrum of further options (with many more possible)---

* 486707 cycles + receiving 897 bytes for Core-SVP 2^129 (sntrup653),
* 538141 cycles + receiving 1039 bytes for Core-SVP 2^153 (sntrup761),
* 689920 cycles + receiving 1184 bytes for Core-SVP 2^175 (sntrup857),
* 744434 cycles + receiving 1349 bytes for Core-SVP 2^196 (sntrup953),
* 838171 cycles + receiving 1455 bytes for Core-SVP 2^209 (sntrup1013)
* 1071964 cycles + receiving 1847 bytes for Core-SVP 2^270 (sntrup1277)

---whereas Kyber offers only the following jumps (plus the slower
non-90s versions, and in any case no intermediate dimensions possible):

* 456990 cycles + receiving 768 bytes for Core-SVP 2^118 (kyber90s512),
* 748573 cycles + receiving 1088 bytes for Core-SVP 2^181 (kyber90s768),
* 1146844 cycles + receiving 1568 bytes for Core-SVP 2^256 (kyber90s1024).
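
To read the two lists side by side, one can pick, for each Kyber
parameter set, the cheapest sntrup option at the same or higher Core-SVP
level (a sketch over the figures above):

```python
# (cycles, ciphertext bytes, Core-SVP exponent), from the lists above.
sntrup = [
    (486707,  897, 129), (538141, 1039, 153), (689920, 1184, 175),
    (744434, 1349, 196), (838171, 1455, 209), (1071964, 1847, 270),
]
kyber90s = [(456990, 768, 118), (748573, 1088, 181), (1146844, 1568, 256)]

for k_cyc, k_ct, k_svp in kyber90s:
    # cheapest sntrup option meeting or exceeding this Core-SVP level
    s_cyc, s_ct, s_svp = min((s for s in sntrup if s[2] >= k_svp),
                             key=lambda s: s[0])
    print(f"Core-SVP >= 2^{k_svp}: kyber {k_cyc} cyc / {k_ct} B "
          f"vs sntrup {s_cyc} cyc / {s_ct} B (2^{s_svp})")
```

At each Kyber level the matched sntrup option takes fewer cycles, while
at the two higher levels it takes more ciphertext bytes, consistent with
the point that the size comparison can point either way.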

Amazingly, Kyber's M4-optimized software manages to take more cycles
than NTRU Prime at _every_ security level >=128! How is this possible?
Wasn't Kyber supposed to be a super-fast KEM, while NTRU and NTRU Prime
ignore decades of research and are 15x behind the state of the art? Or
could it _possibly_ be that recent advertisements have gone too far?

Of course one also has to look at the cost of communication. The size
comparison can point either way---sometimes in favor of Kyber, sometimes
in favor of NTRU Prime, depending on the user's target security level.

> 60% of the time of Kyber768-90s is spent on hashing, which is not as fast
> on Coretx-M4 as on X86. The majority of that time is spent on
> expanding a public seed into a 3x3 polynomial matrix which is used in
> decapsulation.

Thank you for indirectly confirming my point about Kyber's "increasingly
severe performance problems after 1024".

I agree that kilobytes of intermediate Kyber hash output could be cached
to save _some_ cycles, but for the cycle counts I'd want to see actual
measurements rather than guesses. For NTRU Prime, the inefficiency of
this matrix doesn't exist in the first place.

---Dan

Markku-Juhani O. Saarinen

Oct 23, 2021, 8:00:57 AM10/23/21
to pqc-forum

On Sat, Oct 23, 2021, 11:10 D. J. Bernstein <d...@cr.yp.to> wrote:
Amazingly, Kyber's M4-optimized software manages to take more cycles
than NTRU Prime at _every_ security level >=128! How is this possible?

Dan,

Yeah, that doesn't seem possible. This certainly wasn't the case in my evaluation of the candidates. It seems that you are forgetting the cost of key generation.

Applications like TLS would typically be using KEMs for ephemeral key establishment, which requires key generation. Because of this, the total NTRU Prime energy and latency needs are at least an order of magnitude higher (>20x or 50x typically) on most platforms than those of, say, SABER and Kyber.

While protocols may be redesigned around this issue, or key generation can be assumed to always be done by an all-powerful "server", most people do include it in the cost of TLS, IPsec, etc. key establishment. Key caching may alleviate the latency, but not the energy costs.

Cheers,
- Markku

D. J. Bernstein

Oct 23, 2021, 11:29:08 AM10/23/21
to pqc-...@list.nist.gov
Markku-Juhani O. Saarinen writes:
> It seems that you are forgetting the cost of key generation.

No, keygen has essentially zero impact on performance in the stated
scenario ("I want an ARM Cortex-M4 to decrypt messages"). This is true
even if the key is generated on-device.

The call for submissions explicitly considers applications that "can
cache public keys, or otherwise avoid transmitting them frequently".
Key reuse is an important, commonly applied speedup, and if we're aiming
for top performance then of course we should take this into account.

The call also considers other scenarios, but benchmarks that are
presented as overall performance while failing to cover this scenario
are (1) mislabeled, (2) misleading, and (3) not following the rules.
This is "benchmarking crime B1" in http://arxiv.org/abs/1801.02381.

---Dan

P.S. https://eprint.iacr.org/2021/826, to appear at USENIX Security
2022, demonstrates _much_ faster keygen for sntrup inside TLS 1.3---
faster than NIST P-256 keygen! Older sntrup keygen benchmarks are thus
obsolete, including every number in the message I'm replying to. The
same approach is being ported to more platforms. Even without that, pqm4
sntrup keygen costs only about a dozen decaps---well under a second on a
highly constrained M4 for a key that can be reused for years.

Markku-Juhani O. Saarinen

Oct 23, 2021, 11:33:06 AM10/23/21
to pqc-forum
On Wed, Oct 20, 2021 at 9:52 PM D. J. Bernstein <d...@cr.yp.to> wrote:
Derek Atkins writes:
> Efficiency is not just number of cycles (or transmission).  There are
> also size and memory usage considerations.

Sure, it's always possible to pick benchmarks that teams haven't
optimized code for yet. I fully agree that optimizing RAM is important
for some applications, and I said in the first message that there are
"other benchmarks where Kyber does better".

Hi Again,

I'll have to say what Derek didn't: the NTRU Prime Cortex-M4 implementations that you chose as examples in this thread are clearly not practical for use in embedded engineering. They have very limited value as evidence of the suitability of NTRU Prime for embedded use (if any at all).

It's not just RAM but also their large size: they occupy a very large part of the Flash program memory (for a single algorithm variant).

The large size is not a result of a lack of optimization, but the result of doing way too much of it. This is because significant performance improvements can be gained by completely unrolling the multipliers -- asymptotically efficient non-NTT multipliers (e.g., Toom-Cook) have quite complex control logic, and the Cortex-M4 has only a very primitive program cache. These performance gains go away when one has to shrink the implementation to a realistic size.

As an extreme example, one can look at the 40,996-line source file for the sntrup1277 multiplier: https://github.com/mupq/pqm4/blob/master/crypto_kem/sntrup1277/m4f/mod3_mul128xN.S The actual implementation of this variant requires 455,956 bytes of Flash. Yes, half a megabyte for a single-variant embedded KEM firmware.

Although currently the clearest offender, the NTRU Prime team is not the only one guilty of ignoring the purpose of Cortex-M4 evaluation. If an embedded engineer finds that the Cortex-M4 is not suitable for the task at hand, she can generally switch to a different microcontroller more suitable for the task. This is much cheaper, easier to support, and more performant than going all the way and redesigning protocols to support the limitations of an arbitrary $5 microcontroller. In embedded, firmware and hardware are generally both designed to meet some specific goals.

The word "embedded" means that the MCU is embedded into something, perhaps to control a simple electronic appliance or to take care of a special function in a larger SoC (the Cortex-M4 is an IP block). This means that there is typically some primary function for that MCU, and cryptography is there merely to accomplish a side task (perhaps related to communication, authentication, or software integrity). So a realistic commercial embedded implementation can't just take all the Flash for a single asymmetric algorithm.

As a side note; we've even seen hardware implementations where the encapsulate and decapsulate functions are on two separate FPGAs, basically occupying them in their entirety! This wasn't from a submitting team, so I don't think that this was an attempt to game the system, merely a result of a complete disconnect with the reality of security engineering. In reality, when writing commercial FPGA implementations, we find that the FPGAs are already partially occupied by other functions. We also have to interface things over (AMBA) buses and figure out key transport and storage mechanisms.

None of this explains Kyber's losses in Cortex-M4 cycles, which it _did_
optimize code for, and in ciphertext size. These make Kyber considerably
slower than NTRU Prime in this simple message-receiving scenario: Kyber
receives 20% more bytes and then spends 50% more cycles.

Note that the message-receiving scenario you chose rests on a big assumption. Many (if not most) IoT devices would need to support a forward-secure ephemeral key exchange scenario, where NTRU Prime is tens of times slower.

Cheers,
- Markku

 Dr. Markku-Juhani O. Saarinen <mj...@iki.fi>

Markku-Juhani O. Saarinen

Oct 23, 2021, 1:53:36 PM10/23/21
to pqc-forum
On Sat, Oct 23, 2021 at 4:29 PM D. J. Bernstein <d...@cr.yp.to> wrote:

The call for submissions explicitly considers applications that "can
cache public keys, or otherwise avoid transmitting them frequently".
Key reuse is an important, commonly applied speedup, and if we're aiming
for top performance then of course we should take this into account.

Dan,

You know this well, but let me be explicit -- It's not just a "common speedup." Key reuse essentially forgoes forward security, which is an integral part of the security feature set of protocols such as TLS 1.3. Benchmarks with enc/dec only should perhaps be labeled "no forward security" to emphasize what type of security assurance is lost with that approach.

The call also considers other scenarios, but benchmarks that are
presented as overall performance while failing to cover this scenario
are (1) mislabeled, (2) misleading, and (3) not following the rules.
This is "benchmarking crime B1" in http://arxiv.org/abs/1801.02381.
 
Yes, I think that one should consider the "no forward security" scenario among others; KEMs doing PKE would be like that. By the way, you just presented data in this thread without mentioning the forward-secure scenario... so I guess that was quite a clear "B1 benchmarking crime" then?

Quickly looking, there might be additional counts:
- Not presenting the cycle counts of NTRU Prime key generation might be "A3: Selective data set hiding deficiencies."
- Ignoring the binary size in a Cortex-M4 benchmark: "E2: Only measure runtime overhead."

P.S. https://eprint.iacr.org/2021/826, to appear at USENIX Security
2022, demonstrates _much_ faster keygen for sntrup inside TLS 1.3---
faster than NIST P-256 keygen! Older sntrup keygen benchmarks are thus
obsolete, including every number in the message I'm replying to. The
same approach is being ported to more platforms. Even without that, pqm4
sntrup keygen costs only about a dozen decaps---well under a second on a
highly constrained M4 for a key that can be reused for years.
 
That's a nice paper, but I can see what you did there :) 

This is close to the crime "D2: Only evaluate against yourself" since you're now comparing a "new" NTRU Prime to "old" NTRU Prime, not Kyber or Saber (which have a faster keygen than most P-256 implementations). I'm not sure if I agree with the applicability of the batched technique in the embedded use scenarios. Or which NTRU Prime performance numbers have you now declared obsolete? All of them?

Even from your handwavy estimate, it sounds like the ephemeral NTRU Prime key establishment is still going to be about 10 times more expensive than Kyber or SABER ("dozen decaps") in the regular ephemeral key exchange in terms of cycles/energy. I wonder why you didn't just say that.

I'm not sure what clock frequency you're using but 1 second is really bad UX, while something well under 100ms is actually tolerable. If I was a system engineer, I would still not "buy" NTRU Prime for my embedded system, given these options.

Cheers,
- markku

Watson Ladd

Oct 23, 2021, 2:37:01 PM10/23/21
to Markku-Juhani O. Saarinen, pqc-forum
On Sat, Oct 23, 2021 at 10:53 AM Markku-Juhani O. Saarinen
<mjos....@gmail.com> wrote:
>
> On Sat, Oct 23, 2021 at 4:29 PM D. J. Bernstein <d...@cr.yp.to> wrote:
>
>> The call for submissions explicitly considers applications that "can
>> cache public keys, or otherwise avoid transmitting them frequently".
>> Key reuse is an important, commonly applied speedup, and if we're aiming
>> for top performance then of course we should take this into account.
>
>
> Dan,
>
> You know this well, but let me be explicit -- It's not just a "common speedup." Key reuse essentially forgoes forward security, which is an integral part of the security feature set of protocols such as TLS 1.3. Benchmarks with enc/dec only should perhaps be labeled "no forward security" to emphasize what type of security assurance is lost with that approach.

Session tickets, which avoid the entire handshake, have similar
issues. But a limited loss of forward secrecy of, say, 10 or 20
seconds on a busy server that can then handle many more connections is
a tradeoff worth making. Having KEM keys in certificates is one
approach to optimizing TLS for the postquantum era, and there the keys
cannot change.

Sincerely,
Watson




--
Astra mortemque praestare gradatim

Vadim Lyubashevsky

Oct 23, 2021, 3:24:57 PM10/23/21
to pqc-forum


On Sat, Oct 23, 2021, 12:10 D. J. Bernstein <d...@cr.yp.to> wrote:
Vadim Lyubashevsky writes:

> on Cortex-M4 as on x86. The majority of that time is spent on
> expanding a public seed into a 3x3 polynomial matrix which is used in
> decapsulation.

Thank you for indirectly confirming my point about Kyber's "increasingly
severe performance problems after 1024".

I have no idea what you mean, and I believe that the opposite is actually true. As the dimension increases, the vast majority of the hashing is going to be due to public-key expansion. You're proposing a scenario that excludes key generation from efficiency considerations, which means that the (expanded) public key is necessarily cached or stored somewhere, so all this hashing in Kyber will be unnecessary. The slowdown of Kyber at higher dimensions should therefore be smaller in the cached case than in the ephemeral one. The exact timings would of course need to be confirmed, but you are drawing conclusions about the static-key scenario from Kyber code that was optimized for the ephemeral case.

Vadim


Markku-Juhani O. Saarinen

Oct 23, 2021, 3:57:20 PM10/23/21
to Watson Ladd, pqc-forum
On Sat, Oct 23, 2021 at 10:53 AM Markku-Juhani O. Saarinen <mjos....@gmail.com> wrote:

> (Dan,)

>
> You know this well, but let me be explicit -- It's not just a "common speedup." Key reuse essentially forgoes forward security, which is an integral part of the security feature set of protocols such as TLS 1.3. Benchmarks with enc/dec only should perhaps be labeled "no forward security" to emphasize what type of security assurance is lost with that approach.

On Sat, Oct 23, 2021 at 7:36 PM Watson Ladd <watso...@gmail.com> wrote:
 
Session tickets, which avoid the entire handshake, have similar
issues. But a limited loss to forward secrecy of say 10 seconds or 20
seconds on a busy server that can then handle many more connections is
a tradeoff that's worth making. Having KEM keys in certificates is one
approach to optimizing TLS for the postquantum era, and there the keys
cannot change.

Watson,
 
(Yes, I meant "forward secrecy", I hope "forward security" isn't something completely different and confusing!)

Session tickets... Since this is TLS 1.3, you can't be talking about RFC 5077, but rather the "new session ticket" message (Sect. 4.6.1 of RFC 8446), I guess. But that's just essentially a PSK and avoids encaps/decaps too, right? You skip the entire handshake, as you say, so it is not really related to asymmetric key establishment. I guess you're saying that TLS 1.3 doesn't always 100% have forward secrecy, which is true.

The proposals I've seen use KEMs in certificates for TLS session authentication, while still performing an ephemeral key exchange. The client in KEMTLS (at least in the CCS 2020 version) still needs to do a key generation operation for the handshake. To me, the performance saving seemed to come mainly from the fact that a PQC KEM private key op is often more efficient than a PQC signature private key op, and both can be used for challenge-response authentication at essentially the same security level.

If a single private key were used for both shared key establishment and authentication, then forward secrecy would truly be gone. TLS 1.3 explicitly removed all ciphersuites without forward secrecy only a few years back, and reversing that decision would need a lot of explaining at IETF. It would be unnecessary too, as there are PQC algorithms available that can easily support this security feature.

There are of course other uses for KEMs in certs, such as old-fashioned public-key encryption (PKE) in S/MIME email -- There one doesn't really have to care about key generation and can just look at the PQC KEMs as functional RSA(-KEM) replacements.

Blumenthal, Uri - 0553 - MITLL

Oct 23, 2021, 10:05:35 PM10/23/21
to Watson Ladd, Markku-Juhani O. Saarinen, pqc-forum
Some use cases tolerate, or indeed encourage reuse of the key. Others (including mine) do not accept that.

I hope that the selected algorithm would have good performance in all three operations - keygen, encap, decap.

Regards,
Uri

> On Oct 23, 2021, at 14:37, Watson Ladd <watso...@gmail.com> wrote:
>
> On Sat, Oct 23, 2021 at 10:53 AM Markku-Juhani O. Saarinen

Bo-Yin Yang

Oct 24, 2021, 1:52:57 AM10/24/21
to pqc-forum, mjos....@gmail.com
Dear Markku:

On Saturday, October 23, 2021 at 11:33:06 PM UTC+8 mjos....@gmail.com wrote:
I'll have to say what Derek didn't; the NTRU Prime Cortex M4 implementations that you chose as examples in this thread are clearly not practical for use in embedded engineering. They have very limited value as evidence of the suitability of NTRU Prime for embedded use (if any at all).

It's not just RAM but also their large size: they occupy a very large part of the Flash program memory (for a single algorithm variant).

The large size is not a result of a lack of optimization, but the result of doing way too much of it. This is because significant performance improvements can be gained by completely unrolling the multipliers -- asymptotically efficient non-NTT multipliers (e.g., Toom-Cook) have quite complex control logic, and the Cortex-M4 has only a very primitive program cache. These performance gains go away when one has to shrink the implementation to a realistic size.

I happen to know that much of the bloat resulted from a lack of time to optimize. The young people who worked on NTRU Prime were in a hurry to graduate from their MS programs and may not have done the best job of optimizing for space, so I ask for a little patience.

As an extreme example, one can look at the 40,996-line source file for the sntrup1277 multiplier: https://github.com/mupq/pqm4/blob/master/crypto_kem/sntrup1277/m4f/mod3_mul128xN.S The actual implementation of this variant requires 455,956 bytes of Flash. Yes, half a megabyte for a single-variant embedded KEM firmware.

Although currently the clearest offender, the NTRU Prime team is not the only one guilty of ignoring the purpose of Cortex-M4 evaluation. [cut]

My bright students (some of the same group, and with outsiders) had worked on Saber also, and the result of their initial labor (1) was fast but not space-optimized; nine months later a space-optimized version for the M4 was produced that did not lose much speed (2).

(1) C.-M. M. Chung, V. Hwang, M. J. Kannwischer, G. Seiler, C.-J. Shih and B.-Y. Yang, NTT Multiplication for NTT-unfriendly Rings, New Speed Records for Saber and NTRU on Cortex-M4 and AVX2, IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2021(2), pp. 159–188. https://doi.org/10.46586/tches.v2021.i2.159-188, available as IACR e-Print 2020/1397.
(2) A. Abdulrahman, J.-P. Chen, Y.-J. Chen, V. Hwang, M. J. Kannwischer, and B.-Y. Yang, Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4, to appear, IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2022(1). Available as IACR e-Print 2021/995.

Best wishes,
Bo-Yin

D. J. Bernstein

Oct 24, 2021, 2:06:15 AM10/24/21
to pqc-...@list.nist.gov
Markku-Juhani O. Saarinen writes:
> Key reuse essentially forgoes forward security, which is an integral part of
> the security feature set of protocols such as TLS 1.3. Benchmarks with enc/dec
> only should perhaps be labeled "no forward security" to emphasize what type of
> security assurance is lost with that approach.

No. It's essential to avoid confusing _one-time_ with _ephemeral_. What
matters for forward secrecy is being _ephemeral_ ("lasting a very short
time"). Being _one-time_ is neither necessary nor sufficient for this:
it doesn't put any limit on _how long the secret is used_.

Compared to a protocol properly designed to erase secrets after, e.g.,
1 minute, a protocol designed to be _one-time_ (1) incurs unnecessary
slowdowns and (2) fails to achieve the security goal of forward secrecy.
This is crucial for analyses of the weight to assign to keygen.
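
The distinction can be made concrete as a key-rotation policy: the key
may serve many handshakes, but it is erased after a fixed short
lifetime. A hypothetical sketch (the class name and the 60-second window
are illustrative, not from any specification):

```python
import time

class RotatingKEMKey:
    """Reuse a KEM key pair within a short lifetime, then erase it.

    Ephemeral means short-lived, not one-time: many handshakes may
    share the key, but it never outlives the rotation window.
    """

    def __init__(self, keygen, lifetime_s=60.0):
        self._keygen = keygen            # () -> (public_key, secret_key)
        self._lifetime = lifetime_s
        self._born = float("-inf")       # force keygen on first use
        self._pk = self._sk = None

    def current(self):
        # Rotate (and thereby discard the old secret) once the key is
        # older than the configured lifetime.
        if time.monotonic() - self._born > self._lifetime:
            self._pk, self._sk = self._keygen()
            self._born = time.monotonic()
        return self._pk, self._sk
```

Amortizing keygen over a one-minute window pays its cost once per window
while keeping forward secrecy at minute granularity.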

> By the way, you just presented data in this thread without mentioning
> the forward-secure scenario

In fact, I _did_ mention other scenarios: "There are other benchmarks
where Kyber does better, but cherry-picking pro-Kyber benchmarks and
pretending that Kyber always wins is highly inappropriate. The actual
situation is that the performance winner depends on the application."

> .. so I guess that was a quite clear "B1 benchmarking crime" then?

Nope. Read the definition from the cited paper:

Misrepresenting the results of microbenchmarks is classified as _B1:
microbenchmarks representing overall performance_.

Or read what I wrote about "benchmarks that are presented as overall
performance while failing to cover this scenario".

Example: Intel keygen+enc+dec microbenchmarks---assuming that a key is
used for only one ciphertext, ignoring communication costs, and ignoring
other platforms where Kyber struggles---have been misrepresented as
overall performance, producing, e.g., claims that NTRU is "15x behind in
speed". (These claims also ignore recent keygen speedups and mislead
people regarding Kyber's Intel cycles, but those are separate issues.)

> Quickly looking, there might be additional counts:

Nope. "The actual situation is that the performance winner depends on
the application." This is exactly what the pro-Kyber advertising has
been failing to say.

> Kyber or Saber (which have a faster keygen than most P-256
> implementations)

Yes, all options at hand have faster keygen than NIST P-256 (and faster
keygen+enc+dec than NIST P-256) at higher security levels against known
attacks. This information is useful for people evaluating the claim that
keygen time in any of these systems prohibits forward secrecy.

> I'm not sure if I agree with the applicability of the batched
> technique in the embedded use scenarios.

If the device is generating only an occasional key then the keygen time
is so fast as to be irrelevant. If the device is generating many keys
then it can batch (and most of the speedup comes from batching the first
few keys; see the paper for numbers). What exactly is the disagreement?

> Even from your handwavy estimate

I pointed to a paper and to pqm4. You can easily find, e.g., sntrup761
keygen listed as 7951328 cycles in pqm4, without the speedups from that
paper. Feel free to divide by your favorite M4 clock frequency.
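
Dividing out, for a few clock frequencies an M4 might plausibly run at
(the cycle count is the pqm4 figure quoted above; the clock choices are
illustrative):

```python
keygen_cycles = 7_951_328        # pqm4 sntrup761 keygen, quoted above
for mhz in (24, 100, 168):       # illustrative Cortex-M4 clock speeds
    ms = keygen_cycles / (mhz * 1e6) * 1000
    print(f"{mhz} MHz: {ms:.0f} ms")
```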

---Dan

Markku-Juhani O. Saarinen

Oct 24, 2021, 2:31:09 AM10/24/21
to Bo-Yin Yang, pqc-forum
Dear Bo-Yin Yang,

I didn't know that the NTRU Prime code was written by Master's students; in that light it is very impressive!

The newer paper that you mention [2] is a nice one but does not contain data about NTRU Prime, only Saber.

As you know, Saber, unlike NTRU, benefits from its Module (LWR) structure, which allows a fixed-size multiplier to be reused (effectively in a loop) without increase in code size as dimensions grow higher. Furthermore, the fixed n=256 ring size of SABER makes it much more suitable for NTT (the technique used in [2]) than NTRU. I don't see how these results extend to NTRU Prime, at least without actual data being presented.
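
The code-size point can be illustrated schematically: a module scheme
reuses one fixed-size n=256 ring multiplier l*l times, so the
multiplier's code size is independent of the module dimension l. A toy
sketch with schoolbook negacyclic multiplication (illustrative
parameters, not an optimized or real implementation):

```python
N, Q = 256, 3329  # toy parameters: ring R_q = Z_q[x]/(x^N + 1)

def ring_mul(a, b):
    """One fixed-size negacyclic multiplier in R_q; this is the only
    multiplication routine, whatever the module dimension l."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + ai * bj) % Q
            else:                      # x^N = -1 wraps with a sign flip
                c[k - N] = (c[k - N] - ai * bj) % Q
    return c

def module_mul(A, v):
    """(l x l matrix over R_q) times (length-l vector): the same
    multiplier is looped l*l times, so code size does not grow with l."""
    out = []
    for row in A:
        acc = [0] * N
        for a_ij, v_j in zip(row, v):
            p = ring_mul(a_ij, v_j)
            acc = [(x + y) % Q for x, y in zip(acc, p)]
        out.append(acc)
    return out
```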

If there are more compact or otherwise efficient implementations of NTRU Prime available, I'd suggest submitting them to pqm4 for public evaluation.

Cheers,
Markku

Bo-Yin Yang

Oct 24, 2021, 3:15:15 AM10/24/21
to pqc-forum, mjos....@gmail.com, pqc-forum, Bo-Yin Yang
Dear Markku,

On Sunday, October 24, 2021 at 2:31:09 PM UTC+8 mjos....@gmail.com wrote:
Dear Bo-Yin Yang,

Yes, I didn't know that the NTRU Prime code was written by Master's students. For that it is very impressive!

Thank you for your kind words which I will relay to said young people.
 
The newer paper that you mention [2] is a nice one but does not contain data about NTRU Prime, only Saber.

Er, yes, I did mention those papers as an example of how Saber's pqm4 code first got faster, then later got more compact.  I was hoping that people could be patient and similarly allow NTRU Prime some time to improve.
 
As you know, Saber, unlike NTRU, benefits from its Module (LWR) structure, which allows a fixed-size multiplier to be reused (effectively in a loop) without increase in code size as dimensions grow higher. Furthermore, the fixed n=256 ring size of SABER makes it much more suitable for NTT (the technique used in [2]) than NTRU. I don't see how these results extend to NTRU Prime, at least without actual data being presented.
If there are more compact or otherwise efficient implementations of NTRU Prime available, I'd suggest submitting them to pqm4 for public evaluation.

There would definitely be a more compact NTRU Prime available.  The current speed-optimized code uses things like Rader's trick for speed, which results in somewhat larger code that cannot be reused.  While I don't really see an embedded system carrying more than one instance of NTRU Prime or, for that matter, Saber (I am sure that a change from NIST level 1 to 3, for example, would be accompanied by a reflash of the firmware), space-optimized code would no doubt use Good's trick, which allows some code reuse.

If and when different NTRU Prime code is produced, it would surely be released for public evaluation as soon as feasible.

Best wishes,
Bo-Yin

D. J. Bernstein

unread,
Oct 24, 2021, 4:39:59 AM10/24/21
to pqc-...@list.nist.gov
> > > 60% of the time of Kyber768-90s is spent on hashing, which is not as fast
> > > on Cortex-M4 as on X86. The majority of that time is spent on
> > > expanding a public seed into a 3x3 polynomial matrix which is used in
> > > decapsulation.
> > Thank you for indirectly confirming my point about Kyber's "increasingly
> > severe performance problems after 1024".
> I have no idea what you mean, and I believe that the opposite is actually true.

To see the general scaling, consider dimensions 2048, 4096, and 8192
(all reasonable choices for users who see how much the attack picture
has been changing recently). Kyber performs various
computations on matrices of 256-coefficient polynomials:

* n=2048: an 8x8 matrix, overall 8*n = 16384 coefficients;
* n=4096: a 16x16 matrix, overall 16*n = 65536 coefficients;
* n=8192: a 32x32 matrix, overall 32*n = 262144 coefficients.

The matrix-related computations are already easily visible in a variety
of existing Kyber benchmarks for, e.g., n=768, and obviously are going
to dominate each benchmark as n increases.
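A quick sketch of the arithmetic behind the list above (using the module rank k = n/256):

```python
# Kyber-style module dimensions: total dimension n = 256*k, and the public
# matrix is k x k polynomials of 256 coefficients each, i.e. k*k*256 = k*n
# coefficients overall -- quadratic growth in the rank k.
for n in (768, 2048, 4096, 8192):
    k = n // 256
    print(f"n={n}: {k}x{k} matrix, {k * k * 256} coefficients (= {k}*n)")
```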

Also, the Kyber documentation doesn't seem to explain how to scale up
Kyber to larger values of n, so there's a starting question of what's
being evaluated---and this appears to be an uncomfortable question for
Kyber to answer. How would Kyber deal with decryption failures caused by
modulus 3329? Increasingly small errors, affecting Core-SVP? Or a larger
modulus, with its own effects on performance, security, and software?
There seem to be serious limits to the story that a Kyber implementor
can just do an NTT for (\Z/3329)[x]/(x^256+1). Does this story cover
more than 3 parameter sets (times 2 for 90s)?

For comparison, all 6 of the specified NTRU Prime parameter sets (times
2 for sntrup versus ntrulpr) share AVX2-optimized NTT code generated by
a new range-checking NTT compiler (see https://eprint.iacr.org/2021/826).
Beyond NTTs, the full NTRU Prime code for all sizes is automatically
generated from a unified code base, as I've mentioned before. All of the
variations in moduli and dimensions are handled automatically. Further
intermediate sizes are already fully supported. Some minor tweaks will
be required to the generator for larger n, but the specification is
unchanged, the parameter-set generation and Core-SVP evaluation are
already automated, and various larger parameter sets already appear in

https://ntruprime.cr.yp.to/nist/ntruprime-20201007.pdf

pages 104--111. Section 5.5 of that document notes that all NTRU Prime
operations have essentially linear scaling, exponent 1+o(1), in the log
of the ring size; this isn't a substitute for concrete cost evaluations,
but NTRU Prime structurally doesn't have Kyber's issues here.

The claim that "varying security levels in NTRU / NTRUPrime requires
significantly more effort than in Kyber/Saber" gets the facts backwards
at least for NTRU Prime vs. Kyber, and really should be withdrawn. For
intermediate dimensions, the actual situation is done vs. impossible;
for larger dimensions, the actual situation is done vs. problematic for
the parameter sets, and basically done vs. tricky for the software.

---Dan
signature.asc

Matthias Kannwischer

unread,
Oct 24, 2021, 11:03:08 PM10/24/21
to Vadim Lyubashevsky, pqc-forum
Dear Vadim, dear all, 

On Fri, 22 Oct 2021 at 20:14, Vadim Lyubashevsky <vadim....@gmail.com> wrote:
Dear all,


To increase speed in this simple message-receiving scenario, at the
cost of more storage, the decryptor can store the expanded matrix as
part of the key (this was already specified in the Kyber
documentation).  This would add another 256*9*12/8 = 3456 bytes to
storage. Since the code size of Kyber is 10KB smaller than sntrup653
and the RAM requirement is 16KB smaller, increasing the key size by
3.5KB should still keep the storage/ram of Kyber noticeably less
regardless of where these extra bytes are being stored.

The time saving comes from not needing to hash (or run AES in Kyber
90s) to expand the key.  I don't know what the current profile is for
key expansion vs other hashing (would be happy if someone chimed in),
but let's just say conservatively that half of the hashing is saved.
The running time would therefore decrease by 30% to 520K cycles.  So
Kyber768-90s would be 10% slower (this number is obviously not very
precise), require 20% more ciphertext bytes, while providing ~ 50 more
bits of Core-SVP security than sntrup653.
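The storage figure quoted above checks out; a one-liner, with Kyber768's parameters written out for clarity:

```python
# Caching Kyber768's expanded public matrix A: a 3x3 matrix of
# 256-coefficient polynomials mod q = 3329, each coefficient packed
# into 12 bits.
k, n, coeff_bits = 3, 256, 12
extra_bytes = k * k * n * coeff_bits // 8
print(extra_bytes)  # 3456
```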


kyber512
  decaps: 512k
  gen_a: 174k
  decaps w/o gen_a: 338k
kyber512-90s
  decaps: 457k
  gen_a: 141k
  decaps w/o gen_a: 316k
kyber768
  decaps: 839k
  gen_a: 394k
  decaps w/o gen_a: 445k
kyber768-90s
  decaps: 749k
  gen_a: 318k
  decaps w/o gen_a: 431k
kyber1024
  decaps: 1295k
  gen_a: 713k
  decaps w/o gen_a: 582k
kyber1024-90s
  decaps: 1147k
  gen_a: 561k
  decaps w/o gen_a: 586k
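As a consistency check, the "decaps w/o gen_a" figures above are simply the difference of the other two rows (all values in kilocycles, copied from the profile above):

```python
# pqm4 decapsulation profiles from above, in kilocycles:
# (total decaps, matrix-expansion gen_a).
bench = {
    "kyber512":      (512, 174),
    "kyber512-90s":  (457, 141),
    "kyber768":      (839, 394),
    "kyber768-90s":  (749, 318),
    "kyber1024":     (1295, 713),
    "kyber1024-90s": (1147, 561),
}
for name, (decaps, gen_a) in bench.items():
    print(f"{name}: decaps w/o gen_a = {decaps - gen_a}k")
```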

Cheers,
Matthias
 

Vadim Lyubashevsky

unread,
Oct 26, 2021, 5:44:06 AM10/26/21
to Matthias Kannwischer, pqc-forum
Dear Matthias, all,

Thanks a lot for running the numbers.

So, just to summarize: this thread began with

"cherry-picking pro-Kyber benchmarks and pretending that Kyber always
wins is highly inappropriate,"

then chose a scenario in which Kyber might do worse and stated

"Amazingly, Kyber's M4-optimized software manages to take more cycles
than NTRU Prime at _every_ security level >=128! How is this possible?"

The conclusion is that even in this decryption scenario where the key
is cached (and we don't count the 10X advantage in Key Generation in
the full key exchange), Kyber is still faster unless the caching is
forced to be done in an "adversarial" manner.

689K vs. 445K for sntrup857 (Core-SVP 2^175) vs Kyber768 (Core-SVP 2^181)
838K vs. 582K for sntrup1013 (Core-SVP 2^209) vs Kyber1024 (Core-SVP 2^256)

Best,
Vadim

P.S. I don't think anyone claimed that Kyber does better in every
imaginable scenario (maybe in the *encryption* scenario without key
caching and where the symmetric primitives are very slow, Kyber could
do slightly worse?). But in most scenarios, and almost certainly all
the ones where one counts key generation, Kyber will be faster.

D. J. Bernstein

unread,
Nov 1, 2021, 3:16:14 AM11/1/21
to pqc-...@list.nist.gov
Vadim Lyubashevsky writes:
> The conclusion is that even in this decryption scenario where the key
> is cached (and we don't count the 10X advantage in Key Generation in
> the full key exchange), Kyber is still faster unless the caching is
> forced to be done in an "adversarial" manner.

Nope. https://ntruprime.cr.yp.to/latticerisks-20211031.pdf includes a
full analysis (1) showing that Kyber is slower in this scenario and (2)
pinpointing the benchmarking errors that led to the above statement.

---Dan
signature.asc