Dear all,
In the context of the performance of PQC on embedded systems, we would
like to bring two advances to your attention:
- A "striding" variant of Toom-Cook/Karatsuba based negacyclic
polynomial multiplication, reducing the memory usage of
Toom-Cook/Karatsuba by 50%
Striding Toom-Cook/Karatsuba is known (e.g. [1]) but doesn't appear to
have been applied in the context of lattice-based cryptography, and
Saber in particular. In essence, by using an even/odd instead of
lower/upper half decomposition for Karatsuba, the base ring remains
negacyclic, and thus size expansion can be avoided during the base
multiplication. The same applies to k-way Toom-Cook. We have
implemented a striding degree-256 negacyclic polynomial multiplication
on the Cortex-M4 CPU.
- The study of negacyclic polynomial multiplication on the Cortex-M55
processor, using the Helium vector extension
We give an overview over the challenges of bringing a vector extension
to resource-constrained embedded systems and how the Helium vector
extension and the Cortex-M55 CPU address those. We then implement
NTT-based and striding Toom-Cook/Karatsuba based negacyclic polynomial
multiplications on the Cortex-M55 CPU.
For the 32-bit NTT, we obtain a near-optimal speedup of 3.4x compared
to highly optimized scalar assembly on Cortex-M4 with DSP. This 3.4x
speedup is achieved with only a 2x increase in data-path width, and
maintaining numerous cost-saving measures such as 8 128-bit vector
registers, a single-issue microarchitecture, and vector instructions
spreading their work over multiple cycles while allowing overlapping.
For Toom-Cook/Karatsuba based multiplication, we obtain a 5x speedup,
which makes sense since it's a 16-bit workload and we're comparing it
to an implementation leveraging the 16-bit DSP extension for
Cortex-M4.
The details can be found in our paper
https://eprint.iacr.org/2021/998.
We refer to
https://developer.arm.com/architectures/instruction-sets/simd-isas/helium
and
https://developer.arm.com/ip-products/processors/cortex-m/cortex-m55
for numerous further resources, including a blog series and two
whitepapers on the design of the Helium vector extension and the
Cortex-M55 CPU. Our code for the Cortex-M55 is available on
https://gitlab.com/arm-research/security/pqmx. The repository also
contains instructions for how to setup and use a freely available
functional model for the Cortex-M55. Development boards and FPGA
images can be obtained from
https://developer.arm.com/tools-and-software/development-boards/fpga-prototyping-boards.
Please reach out if there are any questions.
Regards,
Angshuman, Hanno, Jose, Joseph, Ingrid
[1]: Daniel Bernstein, Multidigit Multiplication for Mathematicians:
https://cr.yp.to/papers/m3.pdf