Fast integer square root in Go


Job van der Zwan

Jul 16, 2012, 12:16:44 AM
to golan...@googlegroups.com
TL;DR: Challenge: try to beat these integer square root algorithms using pure Go!

So, because I started working on a little (at the moment highly unreliable) math library, I've been trying out a bunch of different integer square root algorithms. This is the fastest code I've been able to come up with:

func Sqrt(x uint64) uint64 {
        // The most significant bit of the square root of x sits at bit position
        // log2(x)/2. Using uint instead of uint64 guarantees native word size,
        // which is more than 60% faster on my x86 Netbook.
        r := uint(1 << (u64.Log2(x) >> 1))
        s := r >> 1
        t := r + s
        // Try to set a bit in the intermediate result and check whether the
        // square of the resulting number still fits within the input. If it
        // does, keep it as the new intermediate result. Shift the bit one
        // position to the right, repeat.
        for s > 0 {
                if uint64(t)*uint64(t) <= x {
                        r = t
                }
                s >>= 1
                t = r + s
        }
        return uint64(r)
}

This kind of surprised me, since it's such a simple algorithm and it involves a multiplication. It's the fastest option on x86 simply because it avoids using uint64 most of the time. When that optimisation is taken out (or in the int32/uint32 case) it still appears to tie with Jack Crenshaw's algorithm, which I adapted as follows:

func Sqrt(x uint64) (r uint64) {
        // p starts at the highest power of four less than or equal to x.
        p := uint64(1 << ((Log2(x) >> 1) << 1))
        for p > 0 {
                if x >= r+p {
                        x -= r + p
                        r += p << 1
                }
                r >>= 1
                p >>= 2
        }
        return
}

However, using the standard math package with casts to and from float64 is still two to three times faster (although I'm not entirely sure my benchmarks are representative of real-world situations - if not, the difference is probably even bigger). The FPU is hard to beat when it comes to square roots. This kind of frustrates me, because I've managed to beat it for everything else so far :P. Any ideas?
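For reference, the float-based version I'm comparing against amounts to something like the sketch below; the plain uint64(math.Sqrt(float64(x))) one-liner is what I actually benchmark, and the clamp and fix-up loops are only needed if you want exact results over the full uint64 range.

import "math"

// Baseline: cast to float64, let the FPU do the square root, cast back.
// The cast and the floating point rounding can leave the estimate off by
// one for large inputs (and for x close to 2^64 the cast can even come
// back as 1<<32), so the clamp and the two loops below correct that.
func SqrtFloat(x uint64) uint64 {
        r := uint64(math.Sqrt(float64(x)))
        if r > 0xFFFFFFFF {
                r = 0xFFFFFFFF
        }
        for r*r > x {
                r--
        }
        for r < 0xFFFFFFFF && (r+1)*(r+1) <= x {
                r++
        }
        return r
}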

PS: Also, I'm not entirely sure if using u64.Log2 is really effective as an optimisation with the last algorithm: with evenly distributed numbers, the highest bit is set in 50% of the cases, right? So is it worth it? Should I assume even distribution when designing these algorithms?
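For comparison, the Log2-free alternative would be to find the starting power of four with a plain shift-down loop, roughly:

// Instead of p := 1 << ((Log2(x) >> 1) << 1): start at the highest power
// of four that fits in a uint64 and shift down until it no longer exceeds x.
// With uniformly distributed inputs the top bits are almost always set, so
// this loop usually exits after zero or one iterations.
p := uint64(1) << 62
for p > x {
        p >>= 2
}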

Sonia Keys

Jul 16, 2012, 1:48:46 PM
to golan...@googlegroups.com
This raises some interesting questions for an integer math library.  It does seem likely that different algorithms would be advantageous in different situations, on different machine architectures, for example, or with different input distributions.  A math library that offered different algorithms would seem valuable.  In cases where it had different algorithms, it might provide a benchmark function where you could supply data typical of your application and it would show the relative speeds of the algorithms, for that data set and on your machine.
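Roughly what I have in mind, as a sketch (all of the names here are made up; the registered implementations would be whatever the library actually ships):

import "time"

// CompareSqrt times each registered implementation on caller-supplied data
// and reports how long each one took on that particular data set.
func CompareSqrt(data []uint64) map[string]time.Duration {
        impls := map[string]func(uint64) uint64{
                "bitwise":  SqrtBitwise, // hypothetical names
                "crenshaw": SqrtCrenshaw,
                "float64":  SqrtFloat,
        }
        times := make(map[string]time.Duration)
        for name, sqrt := range impls {
                start := time.Now()
                for _, x := range data {
                        sqrt(x)
                }
                times[name] = time.Since(start)
        }
        return times
}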

Job van der Zwan

Jul 19, 2012, 7:20:46 AM
to golan...@googlegroups.com
On Monday, 16 July 2012 19:48:46 UTC+2, Sonia Keys wrote:
This raises some interesting questions for an integer math library.  It does seem likely that different algorithms would be advantageous in different situations, on different machine architectures, for example, or with different input distributions.  A math library that offered different algorithms would seem valuable.

Well, case in point, I found the Hacker's Delight code page (while looking for a cube root algorithm, actually). It has a slightly more efficient variant of Jack Crenshaw's algorithm, which I converted as follows:

func SqrtHDU64(x uint64) uint64 {
        var b, r uint64
        for p := uint64(1 << ((uint(u64.Log2(x)) >> 1) << 1)); p != 0; p >>= 2 {
                b = r | p
                r >>= 1
                if x >= b {
                        x -= b
                        r |= p
                }
        }
        return r
}

func SqrtHDU32(x uint32) uint32 {
        // Using uint guarantees native word width.
        var t, b, r uint
        t = uint(x)
        p := uint(1 << 30)
        for p > t {
                p >>= 2
        }
        for ; p != 0; p >>= 2 {
                b = r | p
                r >>= 1
                if t >= b {
                        t -= b
                        r |= p
                }
        }
        return uint32(r)
}

It's faster for uint32, slower for uint64 - again, probably only because the uint64 version can't fall back to native-word uint on my 32-bit machine. However, I'm pretty sure it would be faster on 64-bit architectures.

Job van der Zwan

Jul 19, 2012, 7:27:29 AM
to golan...@googlegroups.com
On Monday, 16 July 2012 19:48:46 UTC+2, Sonia Keys wrote:
A math library that offered different algorithms would seem valuable.  In cases where it had different algorithms, it might provide a benchmark function where you could supply data typical of your application and it would show the relative speeds of the algorithms, for that data set and on your machine.

Actually, I would like some input on this: how do you properly benchmark math algorithms? For 32-bit integer functions that take a single input I guess I can just feed in every number once to get a uniform distribution - 2^32 seems like a good minimum number of tests anyway. But in the case of, say, Pow(x, y) that would already result in 2^64 tests. And then there's 64-bit integers...

Should I benchmark for uniform distribution in the first place? Or are low numbers more likely than big numbers in most real-life situations?

Kevin Gillette

Jul 20, 2012, 9:21:23 AM
to golan...@googlegroups.com
2^32 tests is a lot already. On my netbook, a for loop that does nothing takes 8 seconds to iterate 2^32 times. I may be way off, but 2^64 iterations in an empty loop would take more than 1000 years on my machine -- 2^64 tests that actually do something on your machine would still probably take, say, 125 years if parallelized over 8 cores.
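(To put numbers on that: 8 s * 2^32 is about 3.4 * 10^10 seconds, or roughly 1090 years, so even spread over 8 cores an exhaustive 64-bit run stays in century territory.)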

It's my impression that most Go testing is pragmatic, not exhaustive -- find exceptional cases, or just a reasonable but small distribution (0, 1, 2, 3, 4, 9, 17, and 2^32-1) -- if all of those work, you can be reasonably sure that everything else works. Besides which, your algorithms are all entirely small enough to trace and prove by hand, making exhaustive testing pointless ;-)
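For example, a table-driven test along these lines would cover that small set (Sqrt standing in for whichever implementation is under test, and the expected values being floor square roots worked out by hand):

import "testing"

func TestSqrt(t *testing.T) {
        cases := []struct{ in, want uint64 }{
                {0, 0}, {1, 1}, {2, 1}, {3, 1}, {4, 2},
                {9, 3}, {17, 4}, {1<<32 - 1, 65535},
        }
        for _, c := range cases {
                if got := Sqrt(c.in); got != c.want {
                        t.Errorf("Sqrt(%d) = %d, want %d", c.in, got, c.want)
                }
        }
}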

Job van der Zwan

Jul 20, 2012, 12:06:07 PM
to golan...@googlegroups.com
On Friday, 20 July 2012 15:21:23 UTC+2, Kevin Gillette wrote:
2^32 tests is a lot already. On my netbook, a for loop that does nothing takes 8 seconds to iterate 2^32 times. I may be way off, but 2^64 iterations in an empty loop would take more than 1000 years on my machine -- 2^64 tests that actually do something on your machine would still probably take, say, 125 years if parallelized over 8 cores.

It's my impression that most Go testing is pragmatic, not exhaustive -- find exceptional cases, or just a reasonable but small distribution (0, 1, 2, 3, 4, 9, 17, and 2^32-1) -- if all of those work, you can be reasonably sure that everything else works. Besides which, your algorithms are all entirely small enough to trace and prove by hand, making exhaustive testing pointless ;-)

Well, I was thinking more of benchmarking than of testing: to benchmark properly I should use input that simulates a uniform number distribution, at least for those algorithms that get slower as the numbers get bigger (like Sqrt). In the end I figured that simply ignoring the lower bits was the way to go, since the smaller numbers make up an insignificant fraction of the whole number range:

func BenchmarkSqrtUint64(b *testing.B) {
        for i := 0; i < b.N; i++ {
                for j := uint64(0x1000000000000); j < 0x400000000000000; j += 0x1000000000000 {
                        _ = u64.Sqrt(j)
                }
        }
}

Kevin Gillette

Jul 20, 2012, 10:02:27 PM
to golan...@googlegroups.com
In those cases, I like to bench the 32-bit and 64-bit implementations separately for a single reasonably small value (0 or 1), then 2^32-1, and for the 64-bit implementation, 2^64-1. Those values should be representative of the upper and lower time bounds, and they allow you to compare the two sizes easily :-)
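Sketched out, the 64-bit side would look something like this (Sqrt again standing in for the implementation under test; the package-level sink is just there so the call can't be optimized away):

import "testing"

var sink uint64

func benchSqrt(b *testing.B, x uint64) {
        for i := 0; i < b.N; i++ {
                sink = Sqrt(x)
        }
}

func BenchmarkSqrt64Small(b *testing.B) { benchSqrt(b, 1) }
func BenchmarkSqrt64Mid(b *testing.B)   { benchSqrt(b, 1<<32 - 1) }
func BenchmarkSqrt64Max(b *testing.B)   { benchSqrt(b, 1<<64 - 1) }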

Job van der Zwan

Jul 21, 2012, 6:30:17 PM
to golan...@googlegroups.com
On Saturday, 21 July 2012 04:02:27 UTC+2, Kevin Gillette wrote:
In those cases, I like to bench the 32-bit and 64-bit implementations separately for a single reasonably small value (0 or 1), then 2^32-1, and for the 64-bit implementation, 2^64-1. Those values should be representative of the upper and lower time bounds, and they allow you to compare the two sizes easily :-)

Well, some functions necessarily special-case 0 or 1 as an input (a number to the power of zero, for example), so that would mess up the benchmark, but I get your point :-). I think I'll add all three versions (upper bound, lower bound, "uniform distribution") and leave interpretation of the results to the end users.