Fast and Clean: Auditable high-performance assembly via constraint solving

228 views

Skip to first unread message

Matthias Kannwischer

unread,

Mar 26, 2023, 9:08:29 PM3/26/23

to pqc-forum, amin.abd...@mpi-sp.org, beck...@amazon.co.uk, fabien...@arm.com

We would like to make the community aware of our work

"Fast and Clean: Auditable high-performance assembly via constraint solving"

https://eprint.iacr.org/2022/1303

(Abdulrahman, Becker, Kannwischer, Klein)

The paper proposes a development workflow for implementors of (PQ)C which frees them from the often-faced trade-off between performance and 'cleanliness' of code, thereby facilitating the transition of PQC from research prototypes into practice: Developers use symbolic registers and macros to write 'clean' assembly in full control of instruction selection, while the time-consuming and quickly inscrutable application of register allocation and (microarchitecture-specific) instruction scheduling is left to automated superoptimization.

Concretely, we implement SLOTHY -- [S]uper ([L]azy) [O]ptimization of [T]tricky [H]andwritten assembl[Y] -- an architecture- and microarchitecture-agnostic assembly optimizer based on constraint-solving (using CP-SAT as the current backend). As a start, we instantiate SLOTHY with prototypes of architecture models for Armv8.1-M + Helium and AArch64 + Neon, and microarchitecture models for the Cortex-M55, Cortex-M85, Cortex-A55, and Cortex-A72 CPUs.

We demonstrate the power of SLOTHY by optimizing three fundamental workloads: (a) the FFT (on Cortex-M55/M85), (b) the NTTs underlying Kyber and Dilithium (on Cortex-M55/M85/A55/A72), and (c) the scalar multiplication of X25519 (on Cortex-A55). In all cases, our SLOTHY-optimized implementations match or beat prior art -- even highly optimized and microarchitecture-specific handwritten assembly -- while being automatically derived from readable base implementations, and retaining compact code (no unrolling).

For example, our SLOTHY-optimized Cortex-A72 Dilithium NTT is 31% faster than the previously best implementation [1]. Moreover, as this example shows, the use of SLOTHY is not confined to in-order microarchitectures. In fact, even microarchitectures with simple pipelines can benefit from SLOTHY's ability to minimize register usage through suitable instruction reordering and register renaming.

Finally, we outline how the optimizations found by SLOTHY can be checked without knowledge of or trust in the approach or the implementation. Turning this outline into a mechanized formal verification is left as a promising avenue for future work, alongside the exploration of the use of SLOTHY for other architectures and microarchitectures.