New paper: Neon NTT - Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1

385 views
Skip to first unread message

hanno....@arm.com

unread,
Jul 27, 2021, 9:17:29 AM7/27/21
to pqc-forum
Dear all,

We would like to share some results on NTT-based polynomial and matrix-vector multiplication on Arm from our recent work

"Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1",

The core technical novelty of the paper is a simple -- yet, to the best of our knowledge, unknown -- relation between Barrett and Montgomery reduction. Moreover, this relation extends to a relation between Montgomery multiplication and a new algorithm for modular multiplication with a known constant which we call "Barrett multiplication". We demonstrate that vectorized Barrett multiplication has a 3-instruction implementation on Armv8-A + Neon, shorter than all previously known sequences for vectorized one-known-factor modular multiplications. We also introduce a number of new variants for two-unknown-factor Montgomery multiplications on Armv8-A + Neon, leveraging doubling and rounding high-multiply instructions originating from fixed-point arithmetic.

Finally, we introduce an improved caching technique for matrix-vector multiplication based on the incomplete NTT which we call "asymmetric multiplication": The incomplete NTT of the vector-operand is expanded to include additional precomputation, reducing the complexity of subsequent base multiplication from F_q[X]/(X^4-zeta) to F_q[X]/(X^4-1), but without requiring a 4-th root of zeta.

We apply our findings in implementations of Saber, Kyber, and Dilithium, and benchmark their performance on the Cortex-A72 processor and the Apple M1, improving performance of matrix-vector multiplication by 1.7x-2.1x compared to the implementations in [Nguyen-Gaj 2021].

Regards,
The team

hanno....@arm.com

unread,
Nov 16, 2021, 8:48:00 AM11/16/21
to pqc-forum, hanno....@arm.com, vic...@shoup.net, d.ha...@unsw.edu.au
Hi all, We have updated [1] which now contains results for all parameter sets of Kyber, Saber, and Dilithium. In addition, we added a clarification regarding > ... a new algorithm for modular multiplication with a known constant which we call "Barrett multiplication". We have just found that the resulting "Barrett multiplication" was already known in the unsigned context and attributed to Shoup -- see e.g. [2]. The relation to Montgomery multiplication and the extension to signed arithmetic and arbitrary approximations (which is relevant for mapping the algorithm to Armv8-A fixed point instructions)
remain new to the best of our knowledge, but we appreciate clarification in case there's further prior art. Kind regards, The team, [1] https://eprint.iacr.org/2021/986 [2]: David Harvey: Faster Arithmetic for Number Theoretic Transforms https://arxiv.org/abs/1205.2926

Reply all
Reply to author
Forward
0 new messages