A question for SNOVA team (and other teams using binary Galois fields).

niux_d...@icloud.com

Sep 16, 2025, 11:24:09 PM

to pqc-forum

The question is simple: has anyone used the carry-less multiplication instructions available on x86 and ARM chips? Arithmetic over F_2 and F_16 can be easily accelerated by these instructions.

These instructions were intended for GCM and XTS, but can in fact be applied to any scheme that uses binary polynomials.
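To make the idea concrete, here is a minimal portable sketch of what a carry-less multiply computes and how it yields a GF(16) product. The loop in `clmul4` stands in for the hardware instruction (PCLMULQDQ on x86, PMULL on ARM), which does the same thing on much wider operands. The function names are hypothetical, and the reduction polynomial x^4 + x + 1 is an assumption for illustration; a given scheme may fix a different modulus.

```c
#include <stdint.h>

/* Carry-less (polynomial) product of two 4-bit values over F_2.
 * Portable stand-in for a hardware carry-less multiply: partial
 * products are combined with XOR instead of addition, so no carries
 * propagate. The result has degree at most 6 and fits in 7 bits. */
static uint8_t clmul4(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 4; i++) {
        if ((b >> i) & 1) {
            r ^= (uint8_t)(a << i);
        }
    }
    return r;
}

/* Reduce a polynomial of degree <= 6 modulo x^4 + x + 1 (0x13).
 * The modulus is an assumption made for this sketch. */
static uint8_t gf16_reduce(uint8_t p) {
    for (int i = 6; i >= 4; i--) {
        if ((p >> i) & 1) {
            p ^= (uint8_t)(0x13 << (i - 4));
        }
    }
    return p & 0x0F;
}

/* GF(16) multiplication: carry-less multiply, then reduce. */
static uint8_t gf16_mul(uint8_t a, uint8_t b) {
    return gf16_reduce(clmul4(a & 0x0F, b & 0x0F));
}
```

For example, `gf16_mul(2, 8)` multiplies x by x^3, giving x^4, which reduces to x + 1, i.e. the value 3.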

Thanks.

Jan Adriaan Leegwater

Sep 17, 2025, 5:13:45 AM

to pqc-forum

The short answer is no. The version of SNOVA submitted to round 2 uses AVX2 shuffle instructions (_mm256_shuffle_epi8) to vectorize the multiplication.
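For readers unfamiliar with the shuffle approach, the sketch below is a scalar model of it, not SNOVA's code. A byte shuffle such as _mm256_shuffle_epi8 performs a 16-entry table lookup per byte lane, so multiplying many GF(16) elements by one constant reduces to building a 16-byte table of products and looking up each nibble. The function names, the two-elements-per-byte packing, and the modulus x^4 + x + 1 are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference GF(16) multiply by shift-and-add, reducing modulo
 * x^4 + x + 1 (0x13) after each shift. The modulus is assumed. */
static uint8_t gf16_mul_ref(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    a &= 0x0F;
    b &= 0x0F;
    for (int i = 0; i < 4; i++) {
        if (b & 1) {
            r ^= a;
        }
        b >>= 1;
        a = (uint8_t)(a << 1);
        if (a & 0x10) {
            a ^= 0x13;
        }
    }
    return r & 0x0F;
}

/* table[x] = c * x for every field element x. In a vectorized
 * version this 16-byte table would be the shuffle's lookup vector. */
static void gf16_make_table(uint8_t table[16], uint8_t c) {
    for (uint8_t x = 0; x < 16; x++) {
        table[x] = gf16_mul_ref(c, x);
    }
}

/* Multiply n bytes by c, where each byte packs two GF(16) elements
 * (low and high nibble). Each loop iteration does the two lookups
 * that a byte shuffle would do for every lane at once. */
static void gf16_scale_packed(uint8_t *out, const uint8_t *in,
                              size_t n, uint8_t c) {
    uint8_t table[16];
    gf16_make_table(table, c);
    for (size_t i = 0; i < n; i++) {
        uint8_t lo = table[in[i] & 0x0F];
        uint8_t hi = table[in[i] >> 4];
        out[i] = (uint8_t)(lo | (hi << 4));
    }
}
```

The design point is that the per-element cost becomes a lookup rather than a multiply, which is why the table-building step, rather than the multiplication itself, is where other instructions can help.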

Next week at the 6th PQC Standardization Conference we will present benchmark results for a faster version of SNOVA that uses regular 16-bit multiplications (l=4 only). This version can be found in the oddqsrc directory at https://github.com/PQCLAB-SNOVA/SNOVA. (You will probably want to use the minor update, which is available at https://github.com/vacuas/SNOVA.)

While this version uses only C statements, the compiler will actually vectorize the code on x86 into the instruction behind the _mm256_mullo_epi16 intrinsic (vpmullw). Interestingly, when using gcc 15.2.1, the plain-C version is faster than the version using explicit AVX2 instructions.
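As an illustration of the kind of plain-C loop that compilers can turn into packed 16-bit multiplies, here is a minimal sketch. It is not taken from the SNOVA sources; the function name and signature are hypothetical. Whether a given compiler actually vectorizes it depends on the version and flags (for gcc, something like -O3 -mavx2), so it is worth checking the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise product of two uint16_t arrays, keeping the low 16 bits
 * of each product, which is what a packed 16-bit "multiply low"
 * computes per lane.
 *
 * The operands are widened to uint32_t before multiplying: without the
 * casts, integer promotion would multiply as signed int, and large
 * inputs such as 65535 * 65535 would overflow, which is undefined
 * behavior. The restrict qualifiers promise the compiler that the
 * arrays do not overlap, which makes vectorization easier. */
void mul_lo_u16(uint16_t *restrict out, const uint16_t *restrict a,
                const uint16_t *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint16_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}
```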

Jan Adriaan Leegwater

SNOVA team

Matthias Kannwischer

Sep 17, 2025, 6:46:49 AM

to niux_d...@icloud.com, pqc-forum

Dear Danny,

Yes, carry-less multiplications have been used to accelerate GF(16) arithmetic, for example in [1] for UOV.

Multiplications themselves are usually done using shuffles/table instructions, but carry-less multiplication is useful for generating the required multiplication tables using Neon on AArch64 - see [2].

I'm not an x86 expert, but my understanding is that with AVX2 only 64-bit carry-less multiplications are available, which do not appear that useful for generating GF(16) and GF(256) multiplication tables.

Btw, GFNI is of course very useful for both GF(16) and GF(256) - we have UOV implementations using GFNI [3].
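For reference, the sketch below is a portable scalar model of what one byte lane of the GFNI multiply instruction (GF2P8MULB) computes: a product in GF(256) reduced modulo x^8 + x^4 + x^3 + x + 1. The function name is hypothetical, and this says nothing about how the implementations in [3] actually use GFNI, which may rely on other instructions of the extension as well.

```c
#include <stdint.h>

/* Scalar model of a GF(256) multiply with the reduction polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11B), the modulus used by GF2P8MULB.
 * Shift-and-add: for each set bit of b, XOR in the current multiple
 * of a, then multiply a by x and reduce if it overflowed 8 bits. */
static uint8_t gf256_mul_model(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) {
            r ^= a;
        }
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) {
            a ^= 0x1B; /* low 8 bits of 0x11B */
        }
        b >>= 1;
    }
    return r;
}
```

A quick sanity check is the worked example from the AES specification: `gf256_mul_model(0x57, 0x83)` evaluates to `0xC1`.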

Kind regards,

Matthias

[1] https://eprint.iacr.org/2023/059

[2] https://github.com/pqov/pqov/tree/main/src/neon

[3] https://github.com/pqov/pqov/tree/main/src/gfni
