Hello All,
This is not a brand new research result per se, but I wanted to offer some views on the Round 2 candidates on embedded / IoT targets, specifically on energy consumption and the quantitative work [1] I've done on that front.
The IEEE workshop where the write-up [1] will appear has been moved to August. The slides for the talk that I gave at the ETSI PQC workshop last November cover the same subject matter [2]. The energy measurement laboratory experiments are intended to be reproducible and are described in [3].
Anyway, some broad observations on NIST PQC candidates on embedded and hardware targets:
- In 2017, substantial implementation work had already been done on lattice schemes but almost none on isogeny systems (i.e. SIKE). This situation has since radically changed. Over a series of papers, the embedded performance (and energy consumption) gap between ring/module lattice and isogeny systems at the same security level has shrunk from 1000-fold to about 100-fold, but it is still very significant. Unfortunately, these improved SIKE Cortex-M4 implementations have not been made publicly available, so I could not measure them. Given the size of the performance gap, the same general conclusions remain.
- On microcontroller targets, one should focus solely on cycle counts and ignore millisecond measurements; ULP circuits such as the Cortex-M4 have been deliberately designed to support a wide range of clock frequencies, and they can run slow specifically because that saves energy. Note that this is unrelated to sleep states. The dynamic power of most circuits is (locally) linearly dependent on clock frequency, but some MCUs also scale up the voltage at higher clock frequencies, resulting in higher-than-linear power consumption. Furthermore, Flash program memory is less likely to keep up, requiring more cycles for the same task. For hardware implementations, Cycles x Area is a reasonable energy estimation metric if more advanced measurements are not available.
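To make the frequency argument concrete, here is a small sketch of the standard dynamic-power model; the capacitance and voltage constants are illustrative numbers I picked for the example, not measurements of any real MCU:

```python
# Dynamic power model: P = C_eff * V^2 * f, so the energy for a task of
# `cycles` cycles is E = P * (cycles / f) = cycles * C_eff * V^2 --
# the clock frequency f cancels out at a fixed supply voltage.
# All constants here are illustrative assumptions, not measurements.

C_EFF = 100e-12  # effective switched capacitance per cycle (farads), assumed

def task_energy_joules(cycles, volts):
    """Energy to execute `cycles` cycles at supply voltage `volts`."""
    return cycles * C_EFF * volts ** 2

# At a fixed voltage, energy per task is the same whether the core
# runs the 1M-cycle task at 16 MHz or 96 MHz.
e_slow = task_energy_joules(1_000_000, 1.2)
e_fast = task_energy_joules(1_000_000, 1.2)
assert e_slow == e_fast

# If the MCU must raise its core voltage to sustain a higher clock,
# energy per task grows with the square of that voltage increase:
# here (1.5/1.2)^2 ~ 1.56x more energy for the same work.
e_boost = task_energy_joules(1_000_000, 1.5)
print(e_slow, e_boost)
```

This is why millisecond measurements mislead: the slow clock takes longer in wall time but, per task, costs no more energy under pure dynamic power, and often less once voltage scaling is accounted for.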
- A hypothesis about a relatively high "transmission energy cost" of public keys and ciphertexts is still being propagated by some researchers. Transmission energy is measured in Joules/bit -- bit rates have vastly increased since the early 2000s, and this, in turn, has decreased transmission energy. The write-up [1] goes into more detail about this, but a cross-over point can be estimated for how expensive transmission has to be to justify greatly increased computation. Based on my research, it is difficult to find a scenario where CPU energy (cycles) is not the dominant cost in the energy budget for these public-key algorithms (in mobile systems), so I feel that this hypothesis is largely a myth.
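The cross-over estimate can be sketched numerically. The two energy constants below are placeholder assumptions of roughly the orders of magnitude discussed in this thread, not measured values, and the "scheme" is hypothetical:

```python
# Compare CPU energy vs. radio transmission energy for a public-key
# operation. Both constants are illustrative assumptions, not data.

J_PER_CYCLE = 1e-10  # ~0.1 mJ per million MCU cycles, assumed
J_PER_BIT   = 1e-9   # ~1 nJ/bit transmission cost, assumed (modern radio)

def cpu_energy(cycles):
    """Energy spent computing, in joules."""
    return cycles * J_PER_CYCLE

def tx_energy(num_bytes):
    """Energy spent transmitting, in joules."""
    return num_bytes * 8 * J_PER_BIT

def breakeven_j_per_bit(cycles, num_bytes):
    """How expensive transmission must be (J/bit) to match CPU energy."""
    return cycles * J_PER_CYCLE / (num_bytes * 8)

# Hypothetical scheme: 1 kB of public data, 1 M cycles of computation.
# With these assumed constants the CPU term dominates by over 10x,
# and transmission would need to cost ~12.5 nJ/bit to break even.
print(cpu_energy(1_000_000), tx_energy(1000))
print(breakeven_j_per_bit(1_000_000, 1000))
```

The point of the sketch is the method, not the numbers: plug in measured J/cycle and J/bit for a given platform and the cross-over falls out directly.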
- We urge NIST to be cautious and pragmatic when viewing hardware results. Actual "real world" PQC implementations in smartcards, secure elements, and HSMs are much more likely to be of coprocessor (hardware/software codesign) type than monolithic ("hardware API"). Large monolithic designs that occupy the area equivalent of half a dozen lightweight CPU cores or (as has been proposed) the entire FPGA chip do not make much sense. Furthermore, hardware results based on "high-level synthesis" (HLS) do not tell anything meaningful about the algorithms being investigated, simply that a vendor tool was able to translate a C reference implementation into some HDL.
- A recommendation to those designing hardware support for PQC: Many algorithms benefit very significantly from a fast Keccak (SHA3/SHAKE) permutation; evidence indicates +50% for faster ring/module lattice algorithms, and even more for the general lattice algorithms Round5 R5N1 and FrodoKEM. A good memory-mapped Keccak implementation also makes SPHINCS+ and XMSS signatures feasible on IoT and embedded targets, speeding signatures up 20x or more. Since the area requirement of the permutation is rather large (mostly due to its 1600-bit state), it would not make much sense to package it into a monolithic hardware implementation with a PQC algorithm; instead, make it available to the CPU in a hardware/software codesign. Our experience in the RISC-V Crypto Task Group indicates that ISA extensions are not as helpful for SHA-3 as for some other algorithms, so a separate SHA3 hardware module is my current "first aid" suggestion when people ask for PQC hardware acceleration. You can't go wrong with a fast SHA3 core.
- On a final note, Dilithium and Falcon are really the preferable signature algorithms on IoT and hardware targets, as they are generally faster than RSA or ECC. The company I work for is also involved with composite ("hybrid") PKI and signatures, and these are the primary algorithms used in that work.
Stay safe, everyone.
Cheers,
- markku
[1] M.-J. O. Saarinen: "Mobile Energy Requirements of the Upcoming NIST Post-Quantum Cryptography Standards." To appear in Proc. IEEE MobileCloud 2020. Preprint: https://arxiv.org/abs/1912.00916
[2] M.-J. O. Saarinen: "Towards μJoule PQC." ETSI/IQC Quantum-Safe Cryptography Workshop 2019. Slides: https://docbox.etsi.org/Workshop/2019/201911_QSCWorkshop/TECHNICAL_TRACK/05_PROTOTYPING_RESOURCE_CONSTRAINTS/PQSHIELD_SAARINEN.pdf
[3] M.-J. O. Saarinen: "PQPS (Post-Quantum Power Sandwich)." https://github.com/mjosaarinen/pqps
Dr. Markku-Juhani O. Saarinen <mj...@pqshield.com> PQShield, Oxford UK.
> Based on my research, it is difficult to find a scenario where CPU energy
> (cycles) is not the dominant cost in the energy budget for these public key
I assume you mean the TLS model where public keys are processed often. That is
far too wasteful for many microcontroller deployments.
If public-key exchanges happen once or infrequently, then the energy cost of
e.g. post-quantum-resistant AES-256 encryption is tiny in comparison to the
data transmission cost.
Hi Markku,
The SIKE Cortex-M4 implementations were posted on GitHub just before your email came out, at https://github.com/solowal/SIKE_M4
The companion paper is https://eprint.iacr.org/2020/410 (note
that these results are about 1.5x faster than previously
announced).
Regarding HW/SW codesigns, the latest work for SIKE is
https://eprint.iacr.org/2020/040 and the implementation will be
available shortly.
I am planning a more comprehensive email describing these and other implementation advances in time for the April 15 deadline. This is just a short note to respond to the specific points you brought up.
Sorry that we were not able to get these posted sooner. As Dan
mentioned before, many of us are dealing with previously unplanned
child care responsibilities.
-David
--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.
To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/33ff1220-645d-476b-9f94-26a097de6cf2%40list.nist.gov.
Hi all,
Just wanted to highlight that the hw/sw co-design implementation of SIKE described in https://eprint.iacr.org/2020/040 is available here: https://github.com/pmassolino/hw-sike
Best,
Patrick
A 1000*bytes+cycles cost metric consists mostly of bytes for typical
lattice systems but mostly of cycles for SIKE. (See Section 5.4 of
https://ntruprime.cr.yp.to/nist/ntruprime-20190330.pdf for a data point
justifying this metric, and for a call for the community to agree on a
spectrum of such data points.)
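The metric itself is trivial to compute; here is a sketch with hypothetical example numbers (not taken from any submission document) just to show how the bytes term dominates for one cost profile and the cycles term for the other:

```python
# The 1000*bytes + cycles metric: bandwidth (public key + ciphertext
# bytes) is weighted 1000:1 against computation cycles.

def cost(num_bytes, cycles):
    """Combined bandwidth/computation cost under the 1000:1 weighting."""
    return 1000 * num_bytes + cycles

# Hypothetical profiles, purely illustrative:
lattice = cost(num_bytes=2000, cycles=200_000)      # bytes term: 2,000,000
isogeny = cost(num_bytes=700,  cycles=300_000_000)  # cycles term: 300,000,000

# For the lattice-like profile the bytes term is ~91% of the total;
# for the SIKE-like profile the cycles term is >99% of the total.
print(lattice, isogeny)  # 2200000 300700000
```

Varying the 1000:1 weight sweeps out the "spectrum of data points" mentioned above: at a low enough weight the isogeny-like profile wins, at a high enough weight the lattice-like profile does.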
For at least a year the SIKE team has been quite reasonably arguing that
long-term chip evolution will make SIKE's computation disadvantage less
and less noticeable compared to the communication advantage. This
doesn't mean that SIKE is about to beat lattice systems, but it does
mean that complaints about SIKE's cost need to be tempered by analysis
of how costs will evolve during the lifetime of the NISTPQC standards.
I don't see where the SIKE team ever endorsed the extreme claim that
"CPU/MCU energy consumption is automatically negligible when compared to
transmission energy", and I don't see where the SIKE team ever claimed
that SIKE's current microcontroller cycles are outweighed by its current
savings in "transmission energy cost". I also don't see how these
strawman claims help anyone analyze SIKE's current and future costs.
Related problem: From spot checks of https://arxiv.org/abs/1912.00916, I
extrapolate that the paper already has a problematic volume of obsolete
data. How are typical readers supposed to figure out which information
is obsolete, or to find updated information? The paper does not appear
to be attached to a framework that systematically includes new software.
The paper does not include even the most minimal effort to label
reference implementations. Even worse, the paper mischaracterizes
reference-implementation performance as inherent algorithm performance,
actively denying the reality that people then optimize implementations.
I already pointed out (email dated 4 Jun 2019 16:52:56 -0000) that your
published guesses of 0.1 mJ for 1 million Cortex-M4 cycles and 75 nJ for
I don't see how focusing on energy consumption eliminates, or even
reduces, the difficulties of defending your anti-SIKE hype.
> I'd be prepared to guess that the improvements in de facto Joule/bit
> will be larger than improvements in computation energy Joule/cycle
> (normalized equivalent) in the immediate future.
https://www.openfabrics.org/images/eventpresos/workshops2015/DevWorkshop/Tuesday/tuesday_10.pdf
has Intel stating the opposite trend from your guess: computation cost
will "scale well with process and voltage", while the energy cost of
11.20 pJ per 5 mm to move 8 bytes at 22nm is "more difficult to scale
down".
I don't see how this handles the structural problems I described. For
example, how exactly does the SIKE team get you to update your numbers
for their newly published code, and how exactly are you addressing the
risk of readers taking your obsolete numbers? Sure, you _could_ re-run
everything and publish new numbers after taking new code (via pqm4), but
is this actually going to happen? When?