Programmable Hardware, Microcontrollers and Vector Instructions


Alperin-Sheriff, Jacob (Fed)

Dec 13, 2018, 11:07:04 AM12/13/18
to pqc-...@list.nist.gov

Hi all, 

 

We have received requests on the forum to indicate the programmable hardware and/or microcontrollers that teams should focus their implementations on for Round 2, in order to better enable direct comparisons between schemes. 

 

In the spirit of this request, NIST has noticed that recent post-quantum implementations have been done on the ARM Cortex-M4, the Atmel ATxmega128, and a number of Xilinx FPGAs (Spartan-6, Artix-7, Virtex-7), so these may be good choices for teams to focus on to ensure that a large number of direct comparisons are possible in the second round, when performance (on a wide range of devices) will become more important.

 

If any of these microcontrollers or FPGAs seem unsuitable, or someone has another suggestion for a significantly used microcontroller or FPGA that people should focus implementations on, please let us know immediately.

 

In addition, we would like to note that we STRONGLY RECOMMEND teams have an AVX2/Haswell implementation for Round 2.
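
To illustrate the kind of speedup at stake, here is a minimal portable sketch (an illustration only, not any team's code; the Kyber-style modulus q = 3329 is an assumed example parameter) of a routine that benefits from AVX2:

```c
#include <stdint.h>
#include <stddef.h>

#define Q 3329  /* example modulus (Kyber's q); chosen only for illustration */

/* Coefficient-wise polynomial addition mod Q.  With gcc/clang at
 * -O3 -mavx2 this loop typically auto-vectorizes to wide 16-bit adds;
 * hand-written AVX2 intrinsics usually gain more still, via lazy
 * reduction and interleaving. */
void poly_add(int16_t *r, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int16_t t = (int16_t)(a[i] + b[i]);  /* inputs assumed in [0, Q) */
        r[i] = (int16_t)(t - Q * (t >= Q));  /* conditional subtraction  */
    }
}
```

The same source compiles for portable builds and for Haswell; only the compiler flags change, which is one reason AVX2-friendly C is a reasonable baseline alongside dedicated intrinsics code.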

 

 

—Jacob Alperin-Sheriff

 

Kevin Chadwick

Dec 14, 2018, 8:22:18 AM12/14/18
to Alperin-Sheriff, Jacob (Fed), pqc-...@list.nist.gov
On 12/13/18 4:06 PM, 'Alperin-Sheriff, Jacob (Fed)' via pqc-forum wrote:
> If any of these microcontrollers or FPGAs seem unsuitable or someone has another
> suggestion for a significantly-used micontroller FPGA that people should focus
> implementations on, please let us know immediately.

Probably outside the remit, and I don't know to what degree they apply to post-quantum algorithms, but "30.4.2.2 Available Instructions" shows the accelerated instructions of the crypto engine in a popular Cortex-M4 chip. Some of these instructions, like MMUL, have certainly sped up ECC beyond software capabilities.

"https://www.silabs.com/documents/public/reference-manuals/efm32pg12-rm.pdf"

Pedro Maat Costa Massolino

Dec 17, 2018, 10:20:52 AM12/17/18
to pqc-...@list.nist.gov

Hi Jacob,

Instead of having 2 microcontrollers and 3 FPGA families, why not recommend 1 microcontroller, 1 FPGA family, 1 ASIC cell library and 1 "big" CPU?

I am not saying we should not have results for different devices, as those show how different instructions or characteristics can be exploited to get better results.

I am saying that for comparison purposes we should have one base device to aim for.

Best Regards,

Pedro


Carlos Andres Lara Niño

Dec 17, 2018, 11:46:49 AM12/17/18
to Alperin-Sheriff, Jacob (Fed), pqc-...@list.nist.gov
Dear Jacob Alperin-Sheriff, Dear All,

In addition to specifying implementation platforms, as previous comments have suggested, I'd recommend also stating the design goals implementers should be aiming for (e.g., high performance, low energy, small area).

The evaluation metrics should also be consistent. For software realizations, the cycle count and operating frequency are good candidates. In the case of hardware, the cycle count and critical path, as well as the hardware usage, should be reported.

For FPGA, since all the proposals seem to be Xilinx, the use of SLC, LUT, FF, BRAM, DSP, uC, and MUX ought to be disclosed. Alternatively, you could request results for architectures exploiting the FPGA fabric, as well as results for the same design but with the use of FPGA-specific elements (BRAM, DSP, MUX) disabled.

If you consider ASIC libraries, as has been previously mentioned, the area in um^2 might be preferred unless you specify the conversion for reporting the results in GEs.

Lastly, in the case of hardware realizations, the kind of architecture you are interested in should be specified. Coprocessors? Accelerators? Stand-alone cores? The results for one or the other would vary wildly, as we can all imagine. If the architectures are targeted at working with a general processor, it might be a good idea to also specify that element (physical core or soft core? uBlaze, pBlaze?). The new SoC devices have built-in ARM cores tightly linked with an FPGA fabric, and might provide interesting results for high-throughput applications.

All of this hoping that common goals can be set, so that fairer comparisons can be made.

Kind regards,
Andres
--
Carlos Andres Lara Niño
CINVESTAV Campus Tamaulipas

D. J. Bernstein

Feb 1, 2019, 1:28:07 AM2/1/19
to pqc-...@list.nist.gov
Pedro Maat Costa Massolino writes:
> Instead of having 2 microcontrollers and 3 FPGA families, why not
> recommend 1 microcontroller, 1 FPGA family, 1 ASIC cell library and 1
> "big" CPU?

I support this idea, for three basic reasons:

* Implementors have less time for serious optimization when they're
splitting work between (e.g.) Cortex-M4 and AVR. This means larger
risks that important optimizations are still missing a year from
now, and thus larger risks of big errors in performance comparisons
across submissions. It's already a huge project to seriously
optimize 26 submissions for just _one_ platform.

* The top argument for benchmarking more than one platform is that
different platforms sometimes favor different submissions. Some of
the submissions rely critically on multiplication, for example,
while others don't; if the selected platforms all have serious
hardware area devoted to multipliers then the results will be poor
predictors of ASIC results. It's much more important to have some
sort of ASIC results than to cover a second FPGA with multipliers.

* As the number of different platforms grows, it becomes harder and
harder for readers to absorb all the different benchmark results.
Centralized selection of one platform minimizes this difficulty.
Centralized averaging across platforms (with some sort of weights)
loses touch with reality---one can't compare the average to the
requirements of any particular application. Decentralized selection
encourages each submission to cherry-pick its favorite platform,
again making comparisons difficult.

Concretely, I presume that implementors can agree that Cortex-M4 will be
the best microcontroller for comparisons carried out next year. NIST can
and should specify Cortex-M4 as the recommended microcontroller for
round-2 comparisons. See

https://eprint.iacr.org/2018/723
https://eprint.iacr.org/2018/1018
https://eprint.iacr.org/2018/1116

for examples of initial results. If there are disputes about the choice
of Cortex-M4 then implementors can speak up to say what they think the
best microcontroller for comparisons will be, and NIST can take the most
frequently given answer---the point is to prioritize _something_.
Similar comments apply to FPGAs et al.

I see that NIST's press release mentions smartphones. If it's important
to NIST to name a smartphone CPU then Cortex-A7 is the obvious choice:
more than a billion devices sold, easy availability for development in
(e.g.) the Raspberry Pi 2 v1.1, and more separation from Haswell than a
64-bit ARM CPU would have. But if submission teams are still working on
Haswell optimization today (e.g., I haven't seen AVX2 code for Saber,
even though clearly it would be faster than the portable code), does it
really make sense to say that in a year NIST will be doing software
comparisons for Haswell _and_ Cortex-M4 _and_ Cortex-A7?

---Dan

Sujoy Sinha Roy

Feb 1, 2019, 4:58:10 AM2/1/19
to pqc-...@list.nist.gov
On 2019-02-01 06:27, D. J. Bernstein wrote:
> Pedro Maat Costa Massolino writes:
>> Instead of having 2 microcontrollers and 3 FPGA families, why not
>> recommend 1 microcontroller, 1 FPGA family, 1 ASIC cell library and 1
>> "big" CPU?
>
> I support this idea, for three basic reasons:
>
> * Implementors have less time for serious optimization when they're
> splitting work between (e.g.) Cortex-M4 and AVR. This means larger
> risks that important optimizations are still missing a year from
> now, and thus larger risks of big errors in performance
> comparisons
> across submissions. It's already a huge project to seriously
> optimize 26 submissions for just _one_ platform.

Cortex-M4 has been a popular choice for implementing the post-quantum
crypto algorithms.
If the concern is "can it be implemented on constrained platforms? can
it be implemented in a manner that it won't take forever to run?", then
more constrained platforms such as Cortex-M0 or AVR could be useful.

>
> * The top argument for benchmarking more than one platform is that
> different platforms sometimes favor different submissions. Some of
> the submissions rely critically on multiplication, for example,
> while others don't; if the selected platforms all have serious
> hardware area devoted to multipliers then the results will be poor
> predictors of ASIC results. It's much more important to have some
> sort of ASIC results than to cover a second FPGA with multipliers.

I think benchmarks of hardware implementations on both ASIC and FPGA
will be
useful and should be treated differently.

All FPGAs have multipliers/adders/RAMs. Hence it is meaningful to use
these building blocks to implement optimized hardware architectures.
Note that NIST allows the use of platform-specific AVX instructions for writing more efficient software implementations.
Sujoy

Alessandro Barenghi

Feb 1, 2019, 6:18:40 AM2/1/19
to pqc-...@list.nist.gov
On 2/1/19 10:58 AM, Sujoy Sinha Roy wrote:
> On 2019-02-01 06:27, D. J. Bernstein wrote:
>> Pedro Maat Costa Massolino writes:
>>> Instead of having 2 microcontrollers and 3 FPGA families, why not
>>> recommend 1 microcontroller, 1 FPGA family, 1 ASIC cell library and 1
>>> "big" CPU?
>>
>
> Cortex-M4 has been a popular choice for implementing the post-quantum
> crypto algorithms.
> If the concern is "can it be implemented on constrained platforms? can
> it be implemented in a manner that it won't take forever to run?", then
> more constrained platforms such as Cortex-M0 or AVR could be useful.

Speaking as an implementor, I second the idea of targeting a more energy
efficient microcontroller of the cortex family, i.e., Cortex-M0 or
Cortex-M0+ as a valuable platform for benchmarking.
Indeed, Cortex-M0+ has superior energy efficiency even w.r.t. AVRs while
Cortex-M4 is closer to a cut-down Cortex-A than to an actual embedded,
low power uC.


>> I see that NIST's press release mentions smartphones. If it's important
>> to NIST to name a smartphone CPU then Cortex-A7 is the obvious choice:
>> more than a billion devices sold, easy availability for development in
>> (e.g.) the Raspberry Pi 2 v1.1, and more separation from Haswell than a
>> 64-bit ARM CPU would have.

While the Cortex-A7 is surely a widely deployed platform, I'd like to point out that a very significant number of current smartphones/tablets rely on at least a Cortex-A53-based SoC. Cortex-A53 development platforms are also easily available, in the same price range as the Raspberry Pi.
Concerning the separation-from-Haswell point, I agree with Dan: a Cortex-A53 is "closer" to a Haswell platform (despite its vector instruction width being half that of AVX2) than to a uC. On the other hand, a Cortex-A7 is arguably "closer" to the Cortex-M4 than to a modern smartphone CPU (dual issue only for a three-operand/two-operand instruction pair, simple in-order pipeline).

Cheers,

-- Alessandro

Peter Schwabe

Feb 1, 2019, 6:30:50 AM2/1/19
to Sujoy Sinha Roy, pqc-...@list.nist.gov
Sujoy Sinha Roy <Sujoy.S...@esat.kuleuven.be> wrote:

Dear Sujoy, dear all,

> > * Implementors have less time for serious optimization when they're
> > splitting work between (e.g.) Cortex-M4 and AVR. This means larger
> > risks that important optimizations are still missing a year from
> > now, and thus larger risks of big errors in performance comparisons
> > across submissions. It's already a huge project to seriously
> > optimize 26 submissions for just _one_ platform.
>
> Cortex-M4 has been a popular choice for implementing the post-quantum crypto
> algorithms.
> If the concern is "can it be implemented on constrained platforms? can
> it be implemented in a manner that it won't take forever to run?", then
> more constrained platforms such as Cortex-M0 or AVR could be useful.

Benchmarks on the M4 should of course always report the usage of ROM and RAM, so that you can draw some conclusions about whether an implementation fits into smaller microcontrollers. The big advantage of the M4 is that you can get most of the algorithms to work there and compare. Suggesting the M0 as a common microcontroller platform means that for most candidates you won't have any implementation. I'm not saying that optimizing for and benchmarking on the M0 isn't useful; it's just maybe not the best platform for comparing (most of) the round-2 candidates.

All the best,

Peter

Markku-Juhani O. Saarinen

Feb 1, 2019, 6:31:45 AM2/1/19
to pqc-forum
Hi,

Regarding hardware implementations: I'd expect that for many candidates the Keccak permutation and/or AES core will consume a significant portion of the implementation footprint. However, if we look at how hardware implementations of asymmetric algorithms are done in practice, the implementations rarely live in "isolation" but rather as part of some module with a controlling CPU. There we would clearly like to have the Keccak permutation or AES core available to the module CPU for other purposes as well.

This is how most current lightweight security controllers, smart cards, TPMs, and HSMs actually work. The RSA/ECC implementation is often simply a raw bignum unit, not some isolated module that expands seeds, performs padding, generates randomness, etc. Message formatting and various other signaling tasks are actually performed by firmware (i.e., the CPU in the module). This allows greater flexibility and support for a wider range of applications. See for example https://eprint.iacr.org/2018/425

Therefore I hope that NIST will *not* choose to define a hardware API that treats the algorithm as a "hermetic black box", but will allow things like trivial padding operations to be offloaded to drivers. Power consumption and throughput measurements would, of course, be expected to take into account the work done by the CPU.

Other notes:

- Cortex-M4 has optional DSP and floating-point units. The Cortex-M0 has options for a 1-cycle or a 32-cycle multiply. So it's important to note which options are being used in an implementation.
- AVR has no multiplier and only 1-bit shifts (making things like the Keccak permutation a pain, but doable). Only a couple of algorithms will fit into AVR (Round5 does, largely thanks to its use of ternary secrets), some more on Cortex-M4, many on neither. Some are borderline and open for discussion (SIKE requires minutes for a single operation on the M4).
- Regarding differences between FPGA platforms: actually, not all FPGAs have DSP blocks ("multipliers"). I quite like iCE40 LP/HX/LM myself, especially thanks to the fully open-source toolchain from Verilog to bitstream ( http://www.clifford.at/icestorm/ ). Furthermore, NIST should ask for generic Verilog/VHDL source code, which we may synthesize for multiple targets.
- Regarding ARMv7 vs. ARMv8: NEON makes a big difference, and ARMv7 without it is rather close to the Cortex-M4. ARMv8 is not really an "embedded" CPU, as it is also used in laptops and mainframes. The Apple Bionic chips are more powerful than any desktop computer from just a few years ago.

Cheers,
- markku

Dr. Markku-Juhani O. Saarinen <mj...@pqshield.com> PQShield, Oxford UK. 

Sujoy Sinha Roy

Feb 1, 2019, 6:42:49 AM2/1/19
to Peter Schwabe, pqc-...@list.nist.gov
Hi Peter,

The Cortex-M4 is definitely a better and more popular choice, as most schemes can be implemented on it.
In my previous email I wanted to say that the Cortex-M0 can be an additional platform to test "can it be implemented on constrained platforms?"

I agree with you that benchmarks should report the usage of ROM and RAM.

Regards
Sujoy

Kevin Chadwick

Feb 1, 2019, 6:46:51 AM2/1/19
to pqc-...@list.nist.gov
On Fri, 1 Feb 2019 11:18:34 +0000


> > Cortex-M4 has been a popular choice for implementing the
> > post-quantum crypto algorithms.
> > If the concern is "can it be implemented on constrained platforms?
> > can it be implemented in a manner that it won't take forever to
> > run?", then more constrained platforms such as Cortex-M0 or AVR
> > could be useful.
>
> Speaking as an implementor, I second the idea of targeting a more
> energy efficient microcontroller of the cortex family, i.e.,


The Cortex-M4 has an FPU if needed; the M0+ may use half the power, but both are low. They are also similar in package size, despite M4 variants often having far more flash/RAM, which may aid in reducing power consumption. Flash/RAM sizes will only increase, so it would make sense to provide that, IMO.

4.1.6: M0+, 46 uA/MHz @ 26 MHz, 128 KB flash, 32 KB RAM
https://www.silabs.com/documents/public/data-sheets/efm32tg11-datasheet.pdf

4.1.5: M4, 102 uA/MHz @ 26 MHz, 1024 KB flash, 256 KB RAM
https://www.silabs.com/documents/public/data-sheets/efm32pg12-datasheet.pdf

> Cortex-M0 or Cortex-M0+ as a valuable platform for benchmarking.
> Indeed, Cortex-M0+ has superior energy efficiency even w.r.t. AVRs
> while Cortex-M4 is closer to a cut-down Cortex-A than to an actual
> embedded, low power uC.

That is a completely false statement. The Cortex-M4 bears almost no relation to the Cortex-A and is a direct relative of the M0.

Peter Schwabe

Feb 1, 2019, 6:49:53 AM2/1/19
to Alessandro Barenghi, pqc-...@list.nist.gov
Alessandro Barenghi <alessandr...@polimi.it> wrote:

Dear Alessandro, dear all,

> > Cortex-M4 has been a popular choice for implementing the post-quantum
> > crypto algorithms.
> > If the concern is "can it be implemented on constrained platforms? can
> > it be implemented in a manner that it won't take forever to run?", then
> > more constrained platforms such as Cortex-M0 or AVR could be useful.
>
> Speaking as an implementor, I second the idea of targeting a more energy
> efficient microcontroller of the cortex family, i.e., Cortex-M0 or
> Cortex-M0+ as a valuable platform for benchmarking.
> Indeed, Cortex-M0+ has superior energy efficiency even w.r.t. AVRs while
> Cortex-M4 is closer to a cut-down Cortex-A than to an actual embedded,
> low power uC.

Could compromise on M3, which doesn't have the DSP instructions, but at
least you can get development boards with enough RAM and ROM to make
most of the round-2 candidates work.

My impression is that the 8-bit AVR is mainly superseded by 32-bit
M0/M0+. It's fun to optimize for the AVR, but I wouldn't put much effort
into it anymore, in particular when taking into account that it's going
to be a few more years until standardized PQC will be deployed.

> >> I see that NIST's press release mentions smartphones. If it's important
> >> to NIST to name a smartphone CPU then Cortex-A7 is the obvious choice:
> >> more than a billion devices sold, easy availability for development in
> >> (e.g.) the Raspberry Pi 2 v1.1, and more separation from Haswell than a
> >> 64-bit ARM CPU would have.
>
> While Cortex-A7 is surely a widely deployed platform, I'd like to point
> out that a very significant number of current smartphones/tablets rely
> on at least on a Cortex-A53 CPU based SoC.
> Cortex-A53 development platforms are also easily available in the same
> price range as the Raspberry Pi, and have the same price range.

Unfortunately, the A53, unlike most other ARM CPUs, isn't properly documented. Figuring out latencies and throughputs of instructions requires a lot of micro-benchmarking.

All the best,

Peter

Alessandro Barenghi

Feb 1, 2019, 6:59:55 AM2/1/19
to pqc-...@list.nist.gov
I agree that it is a bit of an overstatement to say that a Cortex-M4 is closer to an A-series CPU; however, it's still a microcontroller with a significantly richer ISA than an M0 and possibly lower instruction latencies (e.g., it has no option for a cut-down 32-cycle multiplier).

Alessandro

Alessandro Barenghi

Feb 1, 2019, 7:05:12 AM2/1/19
to pqc-...@list.nist.gov
On 2/1/19 12:49 PM, Peter Schwabe wrote:
> Alessandro Barenghi <alessandr...@polimi.it> wrote:
>
> Dear Alessandro, dear all,
>
>>> Cortex-M4 has been a popular choice for implementing the post-quantum
>>> crypto algorithms.
>>> If the concern is "can it be implemented on constrained platforms? can
>>> it be implemented in a manner that it won't take forever to run?", then
>>> more constrained platforms such as Cortex-M0 or AVR could be useful.
>>
>> Speaking as an implementor, I second the idea of targeting a more energy
>> efficient microcontroller of the cortex family, i.e., Cortex-M0 or
>> Cortex-M0+ as a valuable platform for benchmarking.
>> Indeed, Cortex-M0+ has superior energy efficiency even w.r.t. AVRs while
>> Cortex-M4 is closer to a cut-down Cortex-A than to an actual embedded,
>> low power uC.
>
> Could compromise on M3, which doesn't have the DSP instructions, but at
> least you can get development boards with enough RAM and ROM to make
> most of the round-2 candidates work.

Both of these sound like good points to me; getting M3s with enough Flash/ROM to compare the candidates shouldn't be an issue.

> My impression is that the 8-bit AVR is mainly superseded by 32-bit
> M0/M0+. It's fun to optimize for the AVR, but I wouldn't put much effort
> into it anymore, in particular when taking into account that it's going
> to be a few more years until standardized PQC will be deployed.

I also have the same impression. While AVR retains a significant legacy
in deployed devices, Cortex-M seems to be a more future-proof choice for
benchmarks.

>>>> I see that NIST's press release mentions smartphones. If it's important
>>>> to NIST to name a smartphone CPU then Cortex-A7 is the obvious choice:
>>>> more than a billion devices sold, easy availability for development in
>>>> (e.g.) the Raspberry Pi 2 v1.1, and more separation from Haswell than a
>>>> 64-bit ARM CPU would have.
>>
>> While Cortex-A7 is surely a widely deployed platform, I'd like to point
>> out that a very significant number of current smartphones/tablets rely
>> on at least on a Cortex-A53 CPU based SoC.
>> Cortex-A53 development platforms are also easily available in the same
>> price range as the Raspberry Pi, and have the same price range.
>
> Unfortunately, the A53, unlike most other ARM CPUs isn't properly
> documented. Figuring out latencies and throughputs of instructions
> requires a lot of micro-benchmarking.

I agree on this; even the information available from compiler backends is less detailed. On the other hand, it is a core gaining quite some significance even in relatively low-end smartphones.

Cheers,

Alessandro

Kevin Chadwick

Feb 1, 2019, 7:29:03 AM2/1/19
to pqc-...@list.nist.gov
On 2/1/19
>> I agree that it is a bit of an overstatement stating that a Cortex-M4 is
>> closer to an A-series CPU, however, it's still a microcontroller with a
>> significantly richer ISA than an M0 and possibly lower instruction
>> latencies (e.g., it has no option for a cut-down 32 cycle multiplier).

Perhaps "completely false" was also an overstatement then, as you appear to know more of the underlying details than I do. I'm not even sure whether all A-series cores have an MMU or not.


> Could compromise on M3, which doesn't have the DSP instructions, but at
> least you can get development boards with enough RAM and ROM to make
> most of the round-2 candidates work.

An M3 is 20% (80p) cheaper, but for us, in contrast to product price, it's worth the extra in case the features are beneficial in the future (size and power are similar, so why not). I guess for others, any level of price saving may be crucial.

If the DSP instructions are useful to an implementation, then unless there is a good reason (consistency of availability, maybe?) to take DSP out of the equation, I don't see why an implementation should be penalised. I don't have the experience to answer the DSP benefit/relevancy question at all, though.

Sujoy Sinha Roy

Feb 1, 2019, 12:30:55 PM2/1/19
to pqc-...@list.nist.gov
Hi,

Microcontrollers have widely varying computational capabilities.

The Cortex-M0 falls in the most constrained segment, the Cortex-M4 falls in the middle, and then there are more powerful processors such as the Cortex-A7.

Cortex-M3 and Cortex-M4 are nearly identical, with the only major difference being that the Cortex-M4 includes DSP instructions. Lattice-based schemes with a power-of-two modulus can utilize DSP instructions to speed up polynomial multiplications.

https://eprint.iacr.org/2018/682.pdf
https://eprint.iacr.org/2018/1018.pdf

Additionally, DSP instructions have been used to speed up Frodo in
https://eprint.iacr.org/2018/1116.pdf

Choosing Cortex M3 instead of Cortex M4 would penalize the schemes that
use generic polynomial multiplications and will skew the performance
comparison in favor of NTT based schemes.
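
To make the DSP point concrete, here is a minimal portable sketch (an illustration with hypothetical parameters, not code from any submission) of schoolbook negacyclic multiplication with a power-of-two modulus: the reduction mod 2^13 is a single AND, and the inner multiply-accumulate loop is exactly the operation the M4's DSP multiply-accumulate instructions speed up.

```c
#include <stdint.h>
#include <stddef.h>

/* Schoolbook multiplication in Z_{2^13}[x]/(x^n + 1).
 * LOGQ = 13 is a Saber-like choice, used here only as an example. */
#define LOGQ 13
#define QMASK ((1u << LOGQ) - 1)

void poly_mul(uint16_t *r, const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t acc = 0;
        for (size_t j = 0; j <= i; j++)
            acc += (uint32_t)a[j] * b[i - j];
        for (size_t j = i + 1; j < n; j++)
            acc -= (uint32_t)a[j] * b[n + i - j];  /* x^n = -1 wraps negated */
        r[i] = (uint16_t)(acc & QMASK);            /* mod 2^13 is a free AND */
    }
}
```

With a prime modulus, each `acc & QMASK` would instead be a full modular reduction, which is the asymmetry being debated here.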

If we should choose a single microcontroller for the evaluation of
candidates (as Bernstein suggested in his email), it should be the
Cortex-M4. Moreover, it is very often chosen by implementers and many
schemes have been optimized for this platform.

Saying that comparisons on the M3 are better because it doesn't have DSP instructions is like saying that high-end implementations should not use AVX instructions for comparisons. When we optimize an implementation we want to use the available resources in the best way, and we do optimizations at all levels.

If we should choose more than one microcontroller, then a wide range from the Cortex-M0 (most constrained) through the Cortex-M4 (in the middle) to the Cortex-A7 (more powerful) seems more rational.


Regards
Sujoy

James

Feb 4, 2019, 6:42:52 AM2/4/19
to Markku-Juhani O. Saarinen, pqc-forum
Hi,

I think Markku brings up an important point about Keccak. It consumed a large amount of area and in the end was essentially the bottleneck in the FrodoKEM FPGA implementation [1]. I've swapped it out for a smaller PRNG (unrolled Trivium), and the performance gains for the same area consumption are about 16x. This is interesting because I wouldn't expect other schemes (e.g., Kyber) to be affected as much.
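
For readers unfamiliar with Trivium, a bit-at-a-time reference sketch (purely illustrative; the FPGA version mentioned above would be unrolled many bits per clock) shows how little state the cipher needs, 288 bits versus Keccak's 1600:

```c
#include <stdint.h>
#include <string.h>

/* Trivium sketch; key and IV are 80-bit values given one bit per byte.
 * s[i] holds the spec's state bit s(i+1). */
typedef struct { uint8_t s[288]; } trivium;

/* One round: emit one keystream bit and clock the three shift registers. */
uint8_t trivium_bit(trivium *t)
{
    uint8_t *s = t->s;
    uint8_t t1 = s[65] ^ s[92];               /* s66 ^ s93   */
    uint8_t t2 = s[161] ^ s[176];             /* s162 ^ s177 */
    uint8_t t3 = s[242] ^ s[287];             /* s243 ^ s288 */
    uint8_t z  = t1 ^ t2 ^ t3;                /* keystream bit */
    t1 ^= (s[90] & s[91]) ^ s[170];
    t2 ^= (s[174] & s[175]) ^ s[263];
    t3 ^= (s[285] & s[286]) ^ s[68];
    memmove(s + 1,   s,       92);            /* register A: s1..s93   */
    memmove(s + 94,  s + 93,  83);            /* register B: s94..s177 */
    memmove(s + 178, s + 177, 110);           /* register C: s178..s288 */
    s[0] = t3;  s[93] = t1;  s[177] = t2;
    return z;
}

void trivium_init(trivium *t, const uint8_t key[80], const uint8_t iv[80])
{
    memset(t->s, 0, sizeof t->s);
    memcpy(t->s, key, 80);                    /* s1..s80  = key bits */
    memcpy(t->s + 93, iv, 80);                /* s94..s173 = IV bits */
    t->s[285] = t->s[286] = t->s[287] = 1;    /* s286..s288 = 1      */
    for (int i = 0; i < 4 * 288; i++)
        (void)trivium_bit(t);                 /* 1152 warm-up rounds */
}
```

Note that Trivium is a stream cipher, not a hash: it works as a seed expander here, but it cannot replace Keccak where the scheme needs hashing.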

A few people have mentioned this already, but regarding FPGA families I think it's sensible to stick to just one: a common one that has DSP slices and has been used a few times already is the Artix-7 (the Spartan-6 is quite old now). Also, following what Carlos said, when reporting the area consumption of designs it's pretty straightforward to report both cases, with and without BRAM usage; I think this is quite important and enables a much fairer comparison. And (this might be obvious) as well as reporting LUT/FF/slices etc., I think it's also key to state how much each sub-module (e.g., the Gaussian samplers) is utilising. Another obvious point is adding the exact model of the FPGA (e.g., Artix-7 XC7A35T CPG236) and the version of the software (e.g., Xilinx Vivado v2018.3).


Cheers
James

--

Nicholas Cavignac

Mar 29, 2019, 6:59:44 PM3/29/19
to pqc-forum
Hello, I'm new to the PQC forum. I would like to contribute a suggestion regarding programmable hardware and vector instructions, though I do not have any implementations ready at this moment.

How well do you think a GPU-based implementation would work? General-purpose GPU computing has advanced quite far in terms of just how powerful and flexible it is, and while nowhere near as efficient as an ASIC or a purpose-engineered microcontroller, it is still considerably more efficient than conventional CPUs as far as single-instruction-multiple-data (SIMD)/vector operations go. And as modern GPUs are commonplace, a GPU-based implementation might be ideal for helping post-quantum cryptography techniques quickly achieve widespread adoption.