FIPS204 ML-DSA: Sharing on use of Montgomery reduction

Bo Lin

unread,

Jul 5, 2025, 9:26:08 AMJul 5

to pqc-forum

Dear all,

One way to understand the standard is to code it. I recently coded FIPS 204 ML-DSA (Dilithium) and have some observations on the Montgomery reduction (Algorithm 49) and NTT (Algorithm 41 and Algorithm 42).

1. If Montgomery reduction is used, then the following table,

zetasR[0..255] = {
0, 25847, 5771523, 7861508, 237124, 7602457, 7504169, 466468,
1826347, 2353451, 8021166, 6288512, 3119733, 5495562, 3111497, 2680103,
2725464, 1024112, 7300517, 3585928, 7830929, 7260833, 2619752, 6271868,
6262231, 4520680, 6980856, 5102745, 1757237, 8360995, 4010497, 280005,
2706023, 95776, 3077325, 3530437, 6718724, 4788269, 5842901, 3915439,
4519302, 5336701, 3574422, 5512770, 3539968, 8079950, 2348700, 7841118,
6681150, 6736599, 3505694, 4558682, 3507263, 6239768, 6779997, 3699596,
811944, 531354, 954230, 3881043, 3900724, 5823537, 2071892, 5582638,
4450022, 6851714, 4702672, 5339162, 6927966, 3475950, 2176455, 6795196,
7122806, 1939314, 4296819, 7380215, 5190273, 5223087, 4747489, 126922,
3412210, 7396998, 2147896, 2715295, 5412772, 4686924, 7969390, 5903370,
7709315, 7151892, 8357436, 7072248, 7998430, 1349076, 1852771, 6949987,
5037034, 264944, 508951, 3097992, 44288, 7280319, 904516, 3958618,
4656075, 8371839, 1653064, 5130689, 2389356, 8169440, 759969, 7063561,
189548, 4827145, 3159746, 6529015, 5971092, 8202977, 1315589, 1341330,
1285669, 6795489, 7567685, 6940675, 5361315, 4499357, 4751448, 3839961,
2091667, 3407706, 2316500, 3817976, 5037939, 2244091, 5933984, 4817955,
266997, 2434439, 7144689, 3513181, 4860065, 4621053, 7183191, 5187039,
900702, 1859098, 909542, 819034, 495491, 6767243, 8337157, 7857917,
7725090, 5257975, 2031748, 3207046, 4823422, 7855319, 7611795, 4784579,
342297, 286988, 5942594, 4108315, 3437287, 5038140, 1735879, 203044,
2842341, 2691481, 5790267, 1265009, 4055324, 1247620, 2486353, 1595974,
4613401, 1250494, 2635921, 4832145, 5386378, 1869119, 1903435, 7329447,
7047359, 1237275, 5062207, 6950192, 7929317, 1312455, 3306115, 6417775,
7100756, 1917081, 5834105, 7005614, 1500165, 777191, 2235880, 3406031,
7838005, 5548557, 6709241, 6533464, 5796124, 4656147, 594136, 4603424,
6366809, 2432395, 2454455, 8215696, 1957272, 3369112, 185531, 7173032,
5196991, 162844, 1616392, 3014001, 810149, 1652634, 4686184, 6581310,
5341501, 3523897, 3866901, 269760, 2213111, 7404533, 1717735, 472078,
7953734, 1723600, 6577327, 1910376, 6712985, 7276084, 8119771, 4546524,
5441381, 6144432, 7959518, 6094090, 183443, 7403526, 1612842, 4834730,
7826001, 3919660, 8332111, 7018208, 3937738, 1400424, 7534263, 1976782}

can be used, where zetasR[i] = zetas[i]*R mod q. Here, we take R, the Montgomery Radix (See Peter L. Montgomery (1985). "Modular multiplication without trial division". Mathematics of Computation. 44 (170): 519–521.) as 2^32.

If the zetasR table is used, the result in Line 12 of Algorithm 41 will be in q-residue (say the subtraction version of Montgomery reduction is used. The no-subtraction version is discussed later). As a result, the conversion between q-residue and Montgomery domain can be avoided. The same goes to Line 15 of Algorithm 42 but the f value (or the fR value) should be 16382.

2. Use of No-Subtraction of Montgomery reduction as suggested by the standard (page 50) is advantageous. I made a performance comparison and it appears the no-subtraction version could be 30% to 50% faster than the subtraction version. The following is from a screen-shot:

--------------------------------------------------------------------
test information -
CPU: Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
OS: Microsoft Windows 11, 10.0.26621 Build 26621
Loops: 5000000, Warm cache, Overhead removed

overhead 203 clicks (0.203000 seconds), 5000000 loops

perf - GM 309 clicks (0.309000 seconds), 5000000 loops

perf - mulmod 31 clicks (0.031000 seconds), 5000000 loops

perf - Mont 24 clicks (0.024000 seconds), 5000000 loops

perf - MontNoSub 14 clicks (0.014000 seconds), 5000000 loops
---------------------------------------------------------------------

The above GM is for Generalised Mersenne prime because the q = 2^23 - (2^13 - 1), mulmod for a * b mod q straight, mont for montmul(a, b, q) = a*b*R^-1 mod q (R = 2^32, the Montgomery Radix as in the Montgomery paper) and MontNoSub self-explanatory.

The above performance figures are only for reference because it was run on an office laptop with all the background applications turned on. The point is, even with the powerful x64 architecture with multi-issue, multi-stage pipeline and hardware divider, Montgomery multiplication prevail.

3. As mentioned in page 50, when the no-subtraction version Montgomery reduction is used, the result would be in [0, 2*q - 1] instead of in the q-residual [0, q - 1]. The operations in Line 13 and Line 14 may make the result being accumulated to a multiple of q parasite, i.e., resulting value = actual value + k*q for a small k. There is no concern to use the resulting value for Algorithm 49 because as long as the value is smaller than 2^32 * q (the T < RN, in the Montgomery paper mentioned above).

To remove the k*q, a montmul(wHAT[i], 2^32 mod q, q) *with* subtraction can be used.

3. When Montgomery multiplication is used for dot multiplication, such as Lines 12, 18, 19, and 25 of Algorithm 7, the conversion between q-residue and Montgomery domain is inevitable. To remove the Montgomery parasite, the 2^-32 mod q, a montmul(wHAT[i], 2^64 mod q, q) *with* subtraction can be used.

The above observations apply to other lattice-based scheme and I hope they are helpful for some practitioners.

Kind regards,

Bo LIN

Message has been deleted

Bo Lin

unread,

Sep 7, 2025, 4:50:02 PMSep 7

to pqc-forum, Bo Lin

Dear all,

I addition to applying Montgomery multiplication for polynomial multiplication (NTT), I added its application to calculate modulo-5 and work out modulo-2gamma2 with the Chinese Remainder Theory. They can be used to avoid a hardware divider or a microcode division routine to save
some silicon area for ML-DSA hardware or microcode implementations.

APPLYING MONTGOMERY MULTIPLICATION TO IMPLEMENTING FIPS 204 (ML-DSA)

Best regards,

Reply all

Reply to author

Forward