Any interest in nat.mulRange simplification/optimization?

John Jannotti

unread,

Jan 7, 2024, 1:46:23 AMJan 7

to golang-nuts

I enjoy bignum implementations, so I was looking through nat.go and saw that `mulRange` is implemented in a surprising, recursive way,. In the non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) * mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).

That's fine, but I didn't see any advantage over the straightforward (and simpler?) for loop.

```

z = z.setUint64(a)
for m := a + 1; m <= b; m++ {
z = z.mul(z, nat(nil).setUint64(m))
}
return z

```

In fact, I suspected the existing code was slower, and allocated a lot more. That seems true. A quick benchmark, using the existing unit test as the benchmark, yields

BenchmarkRecusiveMulRangeN-10 169417 6856 ns/op 9452 B/op 338 allocs/op

BenchmarkIterativeMulRangeN-10 265354 4269 ns/op 2505 B/op 196 allocs/op

I doubt `mulRange` is a performance bottleneck in anyone's code! But it is exported as `int.MulRange` so I guess it's viewed with some value. And seeing as how the for-loop seems even easier to understand that the recursive version, maybe it's worth submitting a PR? (If so, should I create an issue first?)

Rob Pike

unread,

Jan 7, 2024, 2:42:11 AMJan 7

to John Jannotti, Robert Griesemer, golang-nuts

It seems reasonable but first I'd like to understand why the recursive method is used. I can't deduce why, but the CL that adds it, by gri, does Karatsuba multiplication, which implies something deep is going on. I'll add him to the conversation.

-rob

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com.

John Jannotti

unread,

Jan 7, 2024, 6:35:30 AMJan 7

to Rob Pike, Robert Griesemer, golang-nuts

I'm equally curious.

FWIW, I realized the loop should perhaps be

```

mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on 32-bit arch

for m := a + 1; m <= b; m++ {

mb.setUint64(m)
z = z.mul(z, mb)
}

```

to avoid allocating repeatedly for `m`, which yields:

BenchmarkIterativeMulRangeN-10 354685 3032 ns/op 2129 B/op 48 allocs/op

John Jannotti

unread,

Jan 7, 2024, 7:47:36 AMJan 7

to Rob Pike, Robert Griesemer, golang-nuts

Actually, both implementations have bugs!

The recursive implementation ends with:

```

m := (a + b) / 2
return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))

```

That's a bug whenever `(a+b)` overflows, making `m` small.

FIX: `m := a + (b-a)/2`

My iterative implementation went into an infinite loop here:

`for m := a + 1; m <= b; m++ {`

if b is `math.MaxUint64`

FIX: add `&& m > a` to the exit condition is an easy fix, but pays a small penalty for the vast majority of calls that don't have b=MaxUint64

I would add these to `mulRangesN` of the unit test:

```

{math.MaxUint64 - 3, math.MaxUint64 - 1, "6277101735386680760773248120919220245411599323494568951784"},
{math.MaxUint64 - 3, math.MaxUint64, "115792089237316195360799967654821100226821973275796746098729803619699194331160"}

```

Robert Griesemer

unread,

Jan 8, 2024, 2:48:51 PMJan 8

to John Jannotti, Rob Pike, golang-nuts

Hello John;

Thanks for your interest in this code.

In a (long past) implementation of the factorial function, I noticed that computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed in a recursive fashion than when computed iteratively: the reason (I believed) was that the iterative approach seemed to produce a lot more "internal fragmentation", that is medium-size intermediate results where the most significant word (or "limb" as is the term in other implementations) is only marginally used, resulting in more work than necessary if those words were fully used.

I never fully investigated, it was enough at the time that the recursive approach was much faster. In retrospect, I don't quite believe my own theory. Also, that implementation didn't have Karatsuba multiplication, it just used grade-school multiplication.

Since a, b are uint64 values (words), this could probably be implemented in terms of mulAddVWW directly, with a suitable initial allocation for the result - ideally this should just need one allocation (not sure how close we can get to the right size). That would cut down the allocations massively.

In a next step, one should benchmark the implementation again.

But at the very least, the overflow bug should be fixed, thanks for finding it! I will send out a CL to fix that today.

Thanks,

- gri

Robert Griesemer

unread,

Jan 8, 2024, 4:56:24 PMJan 8

to John Jannotti, Rob Pike, golang-nuts

The overflow fix is pending: CL 554617

I filed https://github.com/golang/go/issues/65027 for a possibly faster mulRange implementation.

- gri

Bakul Shah

unread,

Jan 8, 2024, 10:22:17 PMJan 8

to Robert Griesemer, John Jannotti, Rob Pike, golang-nuts

Perhaps you were thinking of this?

At iteration number k, the value x_k contains O(klog(k)) digits, thus the computation of x_k+1 = kx_k has cost O(klog(k)). Finally, the total cost with this basic approach is O(2log(2)+¼+n log(n)) = O(n²log(n)).
A better approach is the binary splitting : it just consists in recursively cutting the product of m consecutive integers in half. It leads to better results when products on large integers are performed with a fast method.

http://numbers.computation.free.fr/Constants/Algorithms/splitting.html

I think you can do recursive splitting without using function recursion by allocating N/2 array (where b = a+N-1) and iterating over it; each time the array "shrinks" by half. A "cleverer" algorithm would allocate an array of *words* of a bignum, as you know that the upper limit on size is N*64 (for 64 bit numbers) so you can just reuse the same space for each outer iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration onwards. Not sure if this is easy in Go.

To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAKy0tf7Lcd8hiF2Qv3NFfjGcfvXDn%2BA%2BxJ1bfKta1w9P-OAs%3Dw%40mail.gmail.com.

Rob Pike

unread,

Jan 9, 2024, 12:23:49 AMJan 9

to Bakul Shah, Robert Griesemer, John Jannotti, golang-nuts

Here's an example where it's the bottleneck: ivy factorial

!1e7
1.20242340052e+65657059

)cpu
1m10s (1m10s user, 167.330ms sys)

-rob

Bakul Shah

unread,

Jan 9, 2024, 12:54:28 AMJan 9

to Rob Pike, Robert Griesemer, John Jannotti, golang-nuts

For that you may wish to explore Peter Luschny's "prime swing" factorial algorithm and variations!

https://oeis.org/A000142/a000142.pdf

And implementations in various languages including go:

https://github.com/PeterLuschny/Fast-Factorial-Functions

John Jannotti

unread,

Jan 9, 2024, 9:38:16 AMJan 9

to Bakul Shah, Robert Griesemer, Rob Pike, golang-nuts

I wrote the single allocation version and added it to the issue, but Bakul is absolutely right that the recursive solution has merit when the input is large enough. By building up large values on each "side" of the recursion, Karatsuba gets used for the larger multiplies and the recursive version begins to win. On `mulRange(1_000, 200_000)`, it's more than 20 times faster than the single allocation iterative version.

I can put together a hybrid that uses iterative for shorter spans, and recursive on larger, if that sounds interesting.

To get this off the mailing list, I will add this note to the issue, and discussion can continue there. https://github.com/golang/go/issues/65027

Rob Pike

unread,

Jan 13, 2024, 12:32:39 AMJan 13

to Bakul Shah, Robert Griesemer, John Jannotti, golang-nuts

Thanks for the tip. A fairly straightforward implementation of this algorithm gives me about a factor of two speedup for pretty much any value. I went up to 1e8!, which took about half an hour compared to nearly an hour for MulRange.

I'll probably stick in ivy after a little more tuning. I may even try parallelization.

-rob

Bakul Shah

unread,

Jan 13, 2024, 3:04:32 AMJan 13

to Rob Pike, Robert Griesemer, John Jannotti, golang-nuts

FYI Julia (on M1 MBP) seems much faster:

julia> @which factorial(big(100000000))
factorial(x::BigInt) in Base.GMP at gmp.jl:645

julia> @time begin; factorial(big(100000000)); 1; end
27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)

Probably they use Schönhage-Strassen multiplication algorithm for very large numbers as the 1E8! result will have over a 3/4 billion digits. I should try this in Gambit-Scheme (which has an excellent multiply implementation).

Rob Pike

unread,

Jan 13, 2024, 5:32:02 PMJan 13

to Bakul Shah, Robert Griesemer, John Jannotti, golang-nuts

Oh, I did say my implementation was straightforward. It's free of any clever multiplication algorithms or mathematical delights. It could easily be giving up 10x or more for that reason alone. And I haven't even profiled it yet.

-rob

Rob Pike

unread,

Jan 13, 2024, 7:08:35 PMJan 13

to Bakul Shah, Robert Griesemer, John Jannotti, golang-nuts

This is getting interesting but maybe only for me, so I'll stop here.

I did some profiling and the majority of the time is spent in math/big.addMulVVW. Thinking I might be doing something stupid, I compared it with the Go implementation at http://www.luschny.de/math/factorial/scala/FactorialScalaCsharp.htm, and found that my version (although "straightforward"), is about the same amount of code but without special casing and support tables, yet about 25% faster - and yes, they get the same answer.

So I believe the factor of 60 comes from comparing an implementation in a language and libraries designed for numerical computation against a much less specialized and optimized world. Or perhaps from Julia using a different and dramatically more efficient algorithm. It does seem like a big gap, but I am no expert in this area.

Maybe worth investigating further but not by me.

-rob

Bakul Shah

unread,

Jan 13, 2024, 8:05:00 PMJan 13

to Rob Pike, golang-nuts

Fair enough!

I replied as I found a factor of 60 a bit surprising. Though I shouldn't have been -- my n digits of pi scheme program was consistently about 16-17 times slower than gmp based one (at least up to 1E9), using the same algorithm (Chudnovsky's), which is also highly dependent on multiplication speed. And both (gambit-scheme and gmp) use faster algorithms for very large numbers. Adding these faster algorithms to math/big is probably not justifiable....

Julia seems to use libgmp. "info gmp" reveals it implements Luchny's algorithm.

Reply all

Reply to author

Forward