Optimizing embedded polynomial multiplication with the Helium vector extension and low-memory Toom-Cook.

257 views

Skip to first unread message

Angshuman Karmakar

unread,

Aug 5, 2021, 3:46:54 AM8/5/21

to pqc-...@list.nist.gov

Dear all,

In the context of the performance of PQC on embedded systems, we would
like to bring two advances to your attention:

- A "striding" variant of Toom-Cook/Karatsuba based negacyclic
polynomial multiplication, reducing the memory usage of
Toom-Cook/Karatsuba by 50%

Striding Toom-Cook/Karatsuba is known (e.g. [1]) but doesn't appear to
have been applied in the context of lattice-based cryptography, and
Saber in particular. In essence, by using an even/odd instead of
lower/upper half decomposition for Karatsuba, the base ring remains
negacyclic, and thus size expansion can be avoided during the base
multiplication. The same applies to k-way Toom-Cook. We have
implemented a striding degree-256 negacyclic polynomial multiplication
on the Cortex-M4 CPU.

- The study of negacyclic polynomial multiplication on the Cortex-M55
processor, using the Helium vector extension

We give an overview over the challenges of bringing a vector extension
to resource-constrained embedded systems and how the Helium vector
extension and the Cortex-M55 CPU address those. We then implement
NTT-based and striding Toom-Cook/Karatsuba based negacyclic polynomial
multiplications on the Cortex-M55 CPU.

For the 32-bit NTT, we obtain a near-optimal speedup of 3.4x compared
to highly optimized scalar assembly on Cortex-M4 with DSP. This 3.4x
speedup is achieved with only a 2x increase in data-path width, and
maintaining numerous cost-saving measures such as 8 128-bit vector
registers, a single-issue microarchitecture, and vector instructions
spreading their work over multiple cycles while allowing overlapping.
For Toom-Cook/Karatsuba based multiplication, we obtain a 5x speedup,
which makes sense since it's a 16-bit workload and we're comparing it
to an implementation leveraging the 16-bit DSP extension for
Cortex-M4.

The details can be found in our paper https://eprint.iacr.org/2021/998.

We refer to https://developer.arm.com/architectures/instruction-sets/simd-isas/helium
and https://developer.arm.com/ip-products/processors/cortex-m/cortex-m55
for numerous further resources, including a blog series and two
whitepapers on the design of the Helium vector extension and the
Cortex-M55 CPU. Our code for the Cortex-M55 is available on
https://gitlab.com/arm-research/security/pqmx. The repository also
contains instructions for how to setup and use a freely available
functional model for the Cortex-M55. Development boards and FPGA
images can be obtained from
https://developer.arm.com/tools-and-software/development-boards/fpga-prototyping-boards.

Please reach out if there are any questions.

Regards,
Angshuman, Hanno, Jose, Joseph, Ingrid

[1]: Daniel Bernstein, Multidigit Multiplication for Mathematicians:
https://cr.yp.to/papers/m3.pdf

hanno....@arm.com

unread,

Oct 20, 2021, 4:43:00 AM10/20/21

to pqc-forum, angs...@gmail.com

Speaking for myself, some additions to what Angshuman said:

1) Researchers can get access to Cortex-M55 RTL and cycle accurate models

(as well as other Arm IP) as part of the Arm Academic Access program.

I encourage everyone who's interested to check if your institution is already

enrolled, and reach out if not. It's free and a one-off effort to sign up.

Link: https://www.arm.com/resources/research/enablement/academic-access

2) Technically: The paper focuses on Saber because its primary goal is to introduce

the Helium vector extension and the Cortex-M55 CPU in one example of interest and

because Saber benefits from the algorithmic improvement for Toom-Cook.

Further research on using Helium for other candidates is required.

For Dilithium, most of the 32-bit NTT code is expected to be the same.

For Kyber, some new ideas will be needed since we use the "doubling rounding" Montgomery

multiplication technique from [1] which benefits from a larger gap between prime and

(half-)word size boundary, and because some code is specific to 32-bit.

Moreover, the lower number of 8 vector registers in Helium appears to work best

with merging two NTT layers a time, hence an even total number of layers.

So, some interesting implementation challenges to think about.

Regards,

Hanno

[1]: https://eprint.iacr.org/2021/986

Reply all

Reply to author

Forward

0 new messages