Hi David,
Interesting, thanks for the suggestion. In the <=82 bit range, the speedup on the primality test when using cmp3 instead of cmp1 is around 1-3% for random input.
We can actually do even better, potentially.
Suppose instead of comparing (ahi * 2^64 + alo) <= (bhi * 2^64 + blo), we just do ahi < bhi. This is now up to 5% faster than the primality test using cmp1.
A priori this is now an inexact divisibility test which might allow some composites to slip through, though this isn't a disaster as long as such composites are sufficiently rare, since they will be caught by the base-2 sprp test later anyway.
Testing this, it seems that failures only become frequent when the high word is very close to UWORD_MAX. Empirically, for input nhi * 2^64 + n_randlimb(), the probability that this inexact trial division by 64 primes doesn't give the same result as the exact trial division by 64 primes seems to be:
nhi = UWORD_MAX -> 80%
nhi = (UWORD_MAX - 1) -> 30%
nhi = (UWORD_MAX - 2) -> 22%
...
nhi = (UWORD_MAX - 264) -> 0.071%
For nhi <= UWORD_MAX - 265 I don't get *any* failures after trying millions of random inputs.
Let's see what happens if we set an artificially small limb size 2^bits and check by brute force where testing (odd) double-limb x for divisibility by (odd) single-limb d gives the wrong result:
#include "ulong_extras.h"
int main()
{
ulong bits, B, B2, d, x, inv1d, inv2d, minx, minxd;
int divis1, divis2;
bits = 4; // Bits in limb
B = UWORD(1) << bits; // Limb radix
B2 = UWORD(1) << (2 * bits); // Double limb radix
minx = B2;
minxd = 0;
// Check all odd single-limb divisors
for (d = 3; d < B; d += 2)
{
inv1d = n_binvert(d);
inv2d = (B2 - 1) / d;
// Check all odd double-limb integers
for (x = 3; x < B2; x += 2)
{
// Exact divisibility of x by d
divis1 = ((x * inv1d) & (B2 - 1)) <= inv2d;
// Inexact(?) divisibility of x by d comparing only the top limb
divis2 = (((x * inv1d) & (B2 - 1)) >> bits) < (inv2d >> bits);
if (divis1 != divis2)
{
if (x < minx)
{
minx = x;
minxd = d;
}
flint_printf("d = %wu fails at x = %wu (%.8f B^2)\n",
d, x, x / (double) B2);
break;
}
}
}
flint_printf("smallest failing x: %wu (%.8f B^2) (for d = %wu)\n",
minx, minx / (double) B2, minxd);
}
Example output for 4-bit limbs:
d = 3 fails at x = 243 (0.949219 B^2)
d = 5 fails at x = 245 (0.957031 B^2)
d = 7 fails at x = 231 (0.902344 B^2)
d = 9 fails at x = 153 (0.597656 B^2)
d = 11 fails at x = 187 (0.730469 B^2)
d = 13 fails at x = 221 (0.863281 B^2)
d = 15 fails at x = 255 (0.996094 B^2)
smallest failing x: 153 (0.597656 B^2) (for d = 9)
Conjecture 1: this sloppy version of the 2x1 limb divisibility test is actually exact for x < 2^(2*bits-1) (e.g. for all 127-bit integers when we have 64-bit limbs).
Numerical evidence:
bits = 12: smallest failing x: 8394753 (0.500366 B^2) (for d = 2049)
bits = 13: smallest failing x: 33566721 (0.500183 B^2) (for d = 4097)
bits = 14: smallest failing x: 134242305 (0.500092 B^2) (for d = 8193)
...
Conjecture 2: for bounded d <= D, the test is exact for all x < (1 - epsilon) * 2^(2*bits), where epsilon converges (rapidly) to zero as a function of the bit size (of order 2^(-bits)?).
Numerical evidence:
8-bit limbs:
d = 3 fails at x = 65283 (0.99613953 x B^2)
d = 5 fails at x = 65285 (0.99617004 x B^2)
d = 7 fails at x = 64519 (0.98448181 x B^2)
d = 9 fails at x = 64521 (0.98451233 x B^2)
d = 11 fails at x = 64779 (0.98844910 x B^2)
d = 13 fails at x = 63245 (0.96504211 x B^2)
d = 15 fails at x = 65295 (0.99632263 x B^2)
d = 17 fails at x = 65297 (0.99635315 x B^2)
...
16-bit limbs:
d = 3 fails at x = 4294901763 (0.99998474 x B^2)
d = 5 fails at x = 4294901765 (0.99998474 x B^2)
d = 7 fails at x = 4294836231 (0.99996948 x B^2)
d = 9 fails at x = 4294508553 (0.99989319 x B^2)
d = 11 fails at x = 4294377483 (0.99986267 x B^2)
d = 13 fails at x = 4294770701 (0.99995423 x B^2)
d = 15 fails at x = 4294901775 (0.99998474 x B^2)
d = 17 fails at x = 4294901777 (0.99998475 x B^2)
...
20-bit limbs:
d = 3 fails at x = 1099510579203 (0.99999905 x B^2)
d = 5 fails at x = 1099510579205 (0.99999905 x B^2)
d = 7 fails at x = 1099507433479 (0.99999619 x B^2)
...
Question: can one work out some effective, explicit bound for epsilon as a function of bits and D? For example, can we give an explicit bound for epsilon when bits = 64 and D = 2^8 or D = 2^16?
Neither conjecture needs a proof to be used in a primality test, since we ultimately fall back on other methods as noted above. However, proofs would allow using this slightly faster divisibility test in other settings where exactness does matter.
Fredrik