Bo Lin
unread,Jul 5, 2025, 9:26:08 AMJul 5Sign in to reply to author
Sign in to forward
You do not have permission to delete messages in this group
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to pqc-forum
Dear all,
One way to understand the standard is to code it. I recently coded FIPS 204 ML-DSA (Dilithium) and have some observations on the Montgomery reduction (Algorithm 49) and NTT (Algorithm 41 and Algorithm 42).
1. If Montgomery reduction is used, then the following table,
zetasR[0..255] = {
0, 25847, 5771523, 7861508, 237124, 7602457, 7504169, 466468,
1826347, 2353451, 8021166, 6288512, 3119733, 5495562, 3111497, 2680103,
2725464, 1024112, 7300517, 3585928, 7830929, 7260833, 2619752, 6271868,
6262231, 4520680, 6980856, 5102745, 1757237, 8360995, 4010497, 280005,
2706023, 95776, 3077325, 3530437, 6718724, 4788269, 5842901, 3915439,
4519302, 5336701, 3574422, 5512770, 3539968, 8079950, 2348700, 7841118,
6681150, 6736599, 3505694, 4558682, 3507263, 6239768, 6779997, 3699596,
811944, 531354, 954230, 3881043, 3900724, 5823537, 2071892, 5582638,
4450022, 6851714, 4702672, 5339162, 6927966, 3475950, 2176455, 6795196,
7122806, 1939314, 4296819, 7380215, 5190273, 5223087, 4747489, 126922,
3412210, 7396998, 2147896, 2715295, 5412772, 4686924, 7969390, 5903370,
7709315, 7151892, 8357436, 7072248, 7998430, 1349076, 1852771, 6949987,
5037034, 264944, 508951, 3097992, 44288, 7280319, 904516, 3958618,
4656075, 8371839, 1653064, 5130689, 2389356, 8169440, 759969, 7063561,
189548, 4827145, 3159746, 6529015, 5971092, 8202977, 1315589, 1341330,
1285669, 6795489, 7567685, 6940675, 5361315, 4499357, 4751448, 3839961,
2091667, 3407706, 2316500, 3817976, 5037939, 2244091, 5933984, 4817955,
266997, 2434439, 7144689, 3513181, 4860065, 4621053, 7183191, 5187039,
900702, 1859098, 909542, 819034, 495491, 6767243, 8337157, 7857917,
7725090, 5257975, 2031748, 3207046, 4823422, 7855319, 7611795, 4784579,
342297, 286988, 5942594, 4108315, 3437287, 5038140, 1735879, 203044,
2842341, 2691481, 5790267, 1265009, 4055324, 1247620, 2486353, 1595974,
4613401, 1250494, 2635921, 4832145, 5386378, 1869119, 1903435, 7329447,
7047359, 1237275, 5062207, 6950192, 7929317, 1312455, 3306115, 6417775,
7100756, 1917081, 5834105, 7005614, 1500165, 777191, 2235880, 3406031,
7838005, 5548557, 6709241, 6533464, 5796124, 4656147, 594136, 4603424,
6366809, 2432395, 2454455, 8215696, 1957272, 3369112, 185531, 7173032,
5196991, 162844, 1616392, 3014001, 810149, 1652634, 4686184, 6581310,
5341501, 3523897, 3866901, 269760, 2213111, 7404533, 1717735, 472078,
7953734, 1723600, 6577327, 1910376, 6712985, 7276084, 8119771, 4546524,
5441381, 6144432, 7959518, 6094090, 183443, 7403526, 1612842, 4834730,
7826001, 3919660, 8332111, 7018208, 3937738, 1400424, 7534263, 1976782}
can be used, where zetasR[i] = zetas[i]*R mod q. Here, we take R, the Montgomery Radix (See Peter L. Montgomery (1985). "Modular multiplication without trial division". Mathematics of Computation. 44 (170): 519–521.) as 2^32.
If the zetasR table is used, the result in Line 12 of Algorithm 41 will be in q-residue (say the subtraction version of Montgomery reduction is used. The no-subtraction version is discussed later). As a result, the conversion between q-residue and Montgomery domain can be avoided. The same goes to Line 15 of Algorithm 42 but the f value (or the fR value) should be 16382.
2. Use of No-Subtraction of Montgomery reduction as suggested by the standard (page 50) is advantageous. I made a performance comparison and it appears the no-subtraction version could be 30% to 50% faster than the subtraction version. The following is from a screen-shot:
--------------------------------------------------------------------
test information -
CPU: Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
OS: Microsoft Windows 11, 10.0.26621 Build 26621
Loops: 5000000, Warm cache, Overhead removed
overhead 203 clicks (0.203000 seconds), 5000000 loops
perf - GM 309 clicks (0.309000 seconds), 5000000 loops
perf - mulmod 31 clicks (0.031000 seconds), 5000000 loops
perf - Mont 24 clicks (0.024000 seconds), 5000000 loops
perf - MontNoSub 14 clicks (0.014000 seconds), 5000000 loops
---------------------------------------------------------------------
The above GM is for Generalised Mersenne prime because the q = 2^23 - (2^13 - 1), mulmod for a * b mod q straight, mont for montmul(a, b, q) = a*b*R^-1 mod q (R = 2^32, the Montgomery Radix as in the Montgomery paper) and MontNoSub self-explanatory.
The above performance figures are only for reference because it was run on an office laptop with all the background applications turned on. The point is, even with the powerful x64 architecture with multi-issue, multi-stage pipeline and hardware divider, Montgomery multiplication prevail.
3. As mentioned in page 50, when the no-subtraction version Montgomery reduction is used, the result would be in [0, 2*q - 1] instead of in the q-residual [0, q - 1]. The operations in Line 13 and Line 14 may make the result being accumulated to a multiple of q parasite, i.e., resulting value = actual value + k*q for a small k. There is no concern to use the resulting value for Algorithm 49 because as long as the value is smaller than 2^32 * q (the T < RN, in the Montgomery paper mentioned above).
To remove the k*q, a montmul(wHAT[i], 2^32 mod q, q) *with* subtraction can be used.
3. When Montgomery multiplication is used for dot multiplication, such as Lines 12, 18, 19, and 25 of Algorithm 7, the conversion between q-residue and Montgomery domain is inevitable. To remove the Montgomery parasite, the 2^-32 mod q, a montmul(wHAT[i], 2^64 mod q, q) *with* subtraction can be used.
The above observations apply to other lattice-based scheme and I hope they are helpful for some practitioners.
Kind regards,
Bo LIN