On Tuesday, January 20th, 2026 at 9:26 AM, Fredrik Johansson <
fredrik....@gmail.com> wrote:
> On Mon, Jan 19, 2026 at 7:11 PM David Sparks <
spar...@proton.me> wrote:
>> A couple of questions about the recent commit e9c1838b541
>> "Clean up and optimize n_root, n_rootrem"
>>
>> Why is the inv_table[] initialized with decimal constants?
>> Wouldn't "1.0f/4, 1.0f/5, 1.0f/6, ..., 1.0f/40" be clearer and less
>> error-prone?
>
> That or hex float literals, though it doesn't really matter for this particular table.
The two issues are
- making the intent clear to the reader (readable code is better
than the best comment), and
- making the table impossible to corrupt with a typo. Are you sure
the 6th decimal place in 1/23 is correct?
While hex float literals protect against rounding errors in the
compiler's decimal parsing, an expression like 1.0/5 solves both
problems at once.
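For concreteness, here is roughly what I have in mind (the name
inv_table and the 4..40 range come from the commit under discussion;
the exact layout is illustrative). Each 1.0f/n is folded to a
correctly rounded constant at compile time, so there is no sixth
decimal place to get wrong:

    static const float inv_table[] = {
        1.0f / 4,  1.0f / 5,  1.0f / 6,  1.0f / 7,  1.0f / 8,
        1.0f / 9,  1.0f / 10, 1.0f / 11, 1.0f / 12, 1.0f / 13,
        /* ... same pattern through ... */
        1.0f / 38, 1.0f / 39, 1.0f / 40
    };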
>> If root == 4, you're allowed to use "sqrtf" rather than "sqrt";
>> you only need 16 bits of precision, and float gives 24.
>
> At least on my machine, sqrtf isn't faster than sqrt.
Huh! According to Agner Fog, here are the x86 timings:
                SQRTSS (float)         SQRTSD (double)
                Latency  Throughput    Latency  Throughput
Sandy Bridge    10-14    10-14         10-21    10-21
Ivy Bridge      11       7             16       8-14
Haswell         (same)                 (same)
Broadwell       11       4             15-16    4-8
Skylake         12       3             15-16    4-6
Coffee Lake     (same)                 (same)
Cannon Lake     (same)                 16       4-6   (probably just a measurement glitch)
Ice Lake        (same)                 15-16    4-6
Steamroller     12-13    4-9           26-29    4-18
Excavator       13       4             16       4-9
Zen1            9-10     4-5           14-15    4-8
Zen2            14       6             20       9
Zen3            14       5             20       9
Zen4            15       5             21       8
Do you have an Intel processor? They have a smaller
double-precision penalty.
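To make the root == 4 suggestion concrete, here is a hedged sketch
(the function name and correction style are mine, not FLINT's actual
code). Two nested float square roots give ~24 good bits, the true
answer needs at most 16, so a short exact-integer correction mops up
the last ULP or two:

    #include <math.h>
    #include <stdint.h>

    static uint64_t n_root4_sketch(uint64_t x)
    {
        uint64_t r = (uint64_t) sqrtf(sqrtf((float) x));

        /* 4th root of 2^64-1 is 65535; clamping also keeps r^4 from
           overflowing below */
        if (r > 65535)
            r = 65535;
        while (r * r * r * r > x)
            r--;
        while (r < 65535 && (r + 1) * (r + 1) * (r + 1) * (r + 1) <= x)
            r++;
        return r;
    }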
> Unfortunately, the libm logf and expf don't come with any accuracy guarantees, so we need to loop in both directions to verify the result. We could of course do this kind of thing if we shipped our own versions with known error bounds.
Actually, we do know some things about logf & expf. Most importantly,
they didn't exist until C99; K&R and C89 only defined log() and exp(),
and early implementations were just wrappers around the existing double-
precision code. Only later were the implementations tuned for speed
without sacrificing too much accuracy.
In general, it's far easier to produce a high-precision float function
if you have double available internally, so it is not credible that
there has *ever* been a logf or expf implementation with more than a
few ULPs of error (except maybe bugs with extreme values like
denormalized numbers, but that doesn't apply here).
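That history is easy to see in code; the classic wrapper shape is just

    #include <math.h>

    /* early-style logf: evaluate in double, round once to float */
    float logf_via_double(float x)
    {
        return (float) log((double) x);
    }

and anything of that shape is automatically good to essentially half a
float ULP, since log()'s own error is negligible at float precision.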
Here are some surveys of various implementations:
"Accuracy of Mathematical Functions in Single Precision"
https://members.loria.fr/PZimmermann/papers/glibc232-20200917.pdf
"Accuracy of Mathematical Functions in Single, Double, Double Extended, and Quadruple Precision"
https://members.loria.fr/PZimmermann/papers/glibc238-20230921.pdf
Note that the surveys don't find any implementation of logf or expf
with more than 1.0 ULP of error. glibc is migrating to the core-math
transcendental functions, which are correctly rounded (< 0.5 ULP
of error).
On the other hand, consider the largest case, n_root(UWORD_MAX,5) = 7131.
log(7132) = 8.87234697898... = 0x8.df522
log(7131) = 8.87220675602... = 0x8.df48f
a difference of 0x93 = 147 ULPs!
log(7132^5) = 5*log(7132) = 44.36173489491... = 0xb.1726bp+2
log(7131^5) = 5*log(7131) = 44.36103378014... = 0xb.171b3p+2
here, the difference is 0xb8 = 184 ULPs, because the mantissas are
larger.
In the exp() output, the difference between 7132 and 7131 is
even larger (0xd.ee000p+9 vs. 0xd.ed800p+9 is 0x800 = 2048 ULPs)
because more of the information is in the exponent.
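(Anyone who wants to double-check those constants can do so in a few
lines of standard C; note that %a prints the same bits in normalized
0x1.xxxxxxp+e form rather than the 0x8.xxxxx form above, and a libm
whose logf/expf round differently may be off by an ULP:)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        printf("logf(7132)         = %a\n", logf(7132.0f));
        printf("logf(7131)         = %a\n", logf(7131.0f));
        printf("5*logf(7132)       = %a\n", 5.0f * logf(7132.0f));
        printf("5*logf(7131)       = %a\n", 5.0f * logf(7131.0f));
        printf("expf(5*logf(7132)) = %a\n", expf(5.0f * logf(7132.0f)));
        printf("expf(5*logf(7131)) = %a\n", expf(5.0f * logf(7131.0f)));
        return 0;
    }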
The key point is that these errors are more than 100x larger than the
error of any plausible implementation, so we can easily make a
reasonable-but-conservative assumption of, say, 10 ULPs, and add a
test case to verify it. There are only a few thousand cases, basically
n^k-1 and n^k, and you could test the much tighter expected bounds
directly: examine n - expf(logf(n_pow(n,k)) * (1.0/k)) and generate a
self-test failure (and hopefully a bug report) on anything much over
1 ULP.
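A sketch of what that self-test could look like (standalone C, with
uint64_t standing in for FLINT's ulong, a local powering loop standing
in for n_pow, the n^k side only, and a placeholder tolerance of a few
ULPs of n):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    void root_estimate_selftest(void)
    {
        for (unsigned k = 3; k <= 40; k++)
        {
            for (uint64_t n = 2; ; n++)
            {
                uint64_t p = 1;
                int overflow = 0;
                unsigned i;
                for (i = 0; i < k; i++)
                {
                    if (p > UINT64_MAX / n) { overflow = 1; break; }
                    p *= n;          /* exact n^k, as n_pow would compute */
                }
                if (overflow)
                    break;           /* n^k no longer fits; next k */

                float est = expf(logf((float) p) * (1.0f / (float) k));
                /* flag anything beyond ~3 ULPs of n (3 * 2^-23 relative) */
                if (fabsf(est - (float) n) > (float) n * (3.0f / 8388608.0f))
                    printf("suspicious: k=%u n=%llu est=%.9g\n",
                           k, (unsigned long long) n, est);
            }
        }
    }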
(The reason we care is that it affects how much we have to round up to
ensure the answer is never too low, which in turn affects how often a
correction is applied and a second exponentiation must be performed to
recompute the remainder.)
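For reference, the correction structure I mean looks something like
this (illustrative only, not FLINT's actual n_root; the 4e-5 fudge is
my translation of the 10-ULP assumption: 10 ULPs of logf near
log(2^64) ~ 44 is about 4e-5 absolute in the log, which expf turns
into about 4e-5 relative):

    #include <math.h>
    #include <stdint.h>

    /* does r^k exceed x, without forming r^k?
       uses r^k > x  <=>  r^(k-1) > floor(x/r) */
    static int pow_exceeds(uint64_t r, unsigned k, uint64_t x)
    {
        uint64_t p = 1;
        unsigned i;
        for (i = 0; i + 1 < k; i++)
        {
            if (p > UINT64_MAX / r)
                return 1;            /* r^(k-1) > x already, so r^k > x */
            p *= r;
        }
        return p > x / r;
    }

    static uint64_t n_root_sketch(uint64_t x, unsigned k)  /* k >= 3, x >= 1 */
    {
        float est = expf(logf((float) x) * (1.0f / (float) k));
        /* round up conservatively so r is never below the true root;
           the slack costs at most a handful of correction iterations */
        uint64_t r = (uint64_t) (est * (1.0f + 4e-5f)) + 1;
        while (pow_exceeds(r, k, x))   /* each pass redoes the powering */
            r--;
        return r;
    }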
>> (Don't get me started on the fact that n_pow is an iterative loop
>> and starts with a multiply by 1. Or the fact that the one divide
>> inv_table saves is lost in the noise compared to logf() & expf().)
>
> logf and expf in e.g. glibc are very fast; a division isn't negligible in comparison. However, instead of a one-off table just for n_root it would make sense to have a table of small floating-point reciprocals global to FLINT as it has various other uses.
I just looked at the glibc implementations of logf and expf, and was
surprised (but gratified) to see that they use no divisions; they break
each octave into 16 or 32 intervals, and use a polynomial approximation
within each.
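The structure is easy to show in miniature. Here is a toy version of
the scheme (nothing like glibc's actual tables or coefficients, just
the shape: split the mantissa range [1,2) into N intervals, store
1/c_i and log2(c_i) for each interval midpoint, and evaluate a short
polynomial in r = m/c_i - 1, which is at most ~1/(2N) in magnitude):

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    #define N 16

    static double inv_c[N], log2_c[N];

    static void init_tables(void)   /* call once before toy_log2f */
    {
        int i;
        for (i = 0; i < N; i++)
        {
            double c = 1.0 + (i + 0.5) / N;   /* interval midpoint */
            inv_c[i]  = 1.0 / c;
            log2_c[i] = log2(c);
        }
    }

    static float toy_log2f(float x)   /* assumes x > 0 and normal */
    {
        uint32_t u, mu;
        float m;
        memcpy(&u, &x, sizeof u);
        int e = (int)(u >> 23) - 127;        /* unbiased exponent */
        int i = (u >> 19) & (N - 1);         /* top 4 mantissa bits */
        mu = (u & 0x007fffff) | 0x3f800000;  /* mantissa, in [1,2) */
        memcpy(&m, &mu, sizeof m);
        double r = (double) m * inv_c[i] - 1.0;    /* |r| <= ~1/33 */
        /* log2(1+r) ~= r/ln2 - r^2/(2 ln2) + r^3/(3 ln2); good to a
           few float ULPs at this interval width */
        double p = r * (1.4426950408889634
                 + r * (-0.7213475204444817
                 + r * 0.4808983469629878));
        return (float)(e + log2_c[i] + p);
    }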
The core-math correctly rounded versions are a bit more complex, but
still not bad; log2m uses 2*65*8 = 1040 bytes of tables, plus a few
words of polynomial coefficients. (Contrast that with the serious hair
in glibc/sysdeps/ieee754/flt-32/s_exp2m1f.c.)
One thing I just realized... C99 introduced both logf *and* log2f.
In practice (including glibc), log and logf are implemented in terms
of log2 and log2f, because the first step is to extract the binary
exponent. So using log2f/exp2f should be no less portable and might
be a tiny bit faster.
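In code, the substitution is just (same hedges as before; illustrative
name):

    #include <math.h>
    #include <stdint.h>

    /* same estimate as expf(logf(x) * (1.0f/k)), minus one scaling
       step on implementations that reduce through log2/exp2 anyway */
    static float kth_root_estimate(uint64_t x, unsigned k)
    {
        return exp2f(log2f((float) x) * (1.0f / (float) k));
    }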