Dear all,
We would like to share some results on NTT-based polynomial and matrix-vector multiplication on Arm from our recent work
"Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1",
The core technical novelty of the paper is a simple -- yet, to the best of our knowledge, unknown -- relation between Barrett and Montgomery reduction. Moreover, this relation extends to a relation between Montgomery multiplication and a new algorithm for modular multiplication with a known constant which we call "Barrett multiplication". We demonstrate that vectorized Barrett multiplication has a 3-instruction implementation on Armv8-A + Neon, shorter than all previously known sequences for vectorized one-known-factor modular multiplications. We also introduce a number of new variants for two-unknown-factor Montgomery multiplications on Armv8-A + Neon, leveraging doubling and rounding high-multiply instructions originating from fixed-point arithmetic.
Finally, we introduce an improved caching technique for matrix-vector multiplication based on the incomplete NTT which we call "asymmetric multiplication": The incomplete NTT of the vector-operand is expanded to include additional precomputation, reducing the complexity of subsequent base multiplication from F_q[X]/(X^4-zeta) to F_q[X]/(X^4-1), but without requiring a 4-th root of zeta.
We apply our findings in implementations of Saber, Kyber, and Dilithium, and benchmark their performance on the Cortex-A72 processor and the Apple M1, improving performance of matrix-vector multiplication by 1.7x-2.1x compared to the implementations in [Nguyen-Gaj 2021].
Regards,
The team