Re: [go-nuts] Slow big integer performance in Go


Dave Cheney

unread,
May 7, 2013, 6:57:48 PM5/7/13
to pca...@gmail.com, golang-nuts
Are you using an amd64 machine ?

On Wed, May 8, 2013 at 5:47 AM, <pca...@gmail.com> wrote:
> Hello Everyone,
>
> I've started to learn Go so I was interested in comparing Go performance
> against some other languages.
> I have a simple microbenchmark that I use to get a quick idea of relative
> performance.
> I was disappointed to see that the performance of Go was not as good as I
> expected.
>
> My benchmark factors a couple of big integers using Pollard's Rho method:
>
> N5 = 766150476015982127183457373
> N6 = 62823675885202669644104829577
>
> Both are the products of just two primes.
>
> -------------------------------------------
> Go 1.1rc2 :
>
> C:\Projects\GoCode\src\rho>go test -v
> === RUN: ExampleN5
> --- PASS: ExampleN5 (10.053s)
> === RUN: ExampleN6
> --- PASS: ExampleN6 (15.348s)
> PASS
> ok rho 25.452s
>
> C:\Projects\GoCode\src\rho>go test -cpuprofile=rho.prof -bench="Both"
> -benchmem=true
> testing: warning: no tests to run
> PASS
> BenchmarkBoth 1 25958000000 ns/op 1105711392 B/op
> 25343461 allocs/op
> ok rho 26.018s
>
> -------------------------------------------
> Python 3.3 :
> rho(N5) = 1178524040059
> 9.34 s per run
> rho(N6) = 663410067979
> 15.11 s per run
>
> ------------------------------------------
> Scala 2.10.1 :
> scala> TimeRho(N5)
> rho(766150476015982127183457373) = 1178524040059
> Time = 2112 ms
> scala> TimeRho(N6)
> rho(62823675885202669644104829577) = 663410067979
> Time = 3104 ms
> -------------------------------------------
>
> As you can see above, Go and Python have about the same speed, while Scala
> (and Java) are much faster.
> This surprised me: I had thought that because Go is statically typed it
> would be faster than Python, and I expected its performance to be closer
> to Scala or Java.
>
> I've attached the Go files if anyone is interested.
>
> After some editing to make pprof work on Windows, I managed to collect a
> profile.
> It seems that there is a lot of time spent on memory allocation, but that's
> about all I can determine at this point,
> given my unfamiliarity with Go.
> The output from pprof is below.
>
> Any ideas on how to make big.Int faster in Go?
>
> -- Peter
>
> -------------------------------------------------------------
> C:\Projects\GoCode\src\rho>go tool pprof rho.test.exe rho.prof
> The system cannot find the path specified.
> Welcome to pprof! For help, type 'help'.
> (pprof) top50 --cum
> Total: 2593 samples
> 0 0.0% 0.0% 2370 91.4% gosched0
> 0 0.0% 0.0% 2370 91.4% rho.BenchmarkBoth
> 3 0.1% 0.1% 2370 91.4% rho.Rho
> 0 0.0% 0.1% 2370 91.4% testing.(*B).launch
> 0 0.0% 0.1% 2370 91.4% testing.(*B).runN
> 0 0.0% 0.1% 1906 73.5% math/big.(*Int).GCD
> 144 5.6% 5.7% 1902 73.4% math/big.(*Int).binaryGCD
> 276 10.6% 16.3% 606 23.4% math/big.(*Int).Sub
> 156 6.0% 22.3% 596 23.0% math/big.(*Int).Rsh
> 104 4.0% 26.3% 431 16.6% math/big.(*Int).Set
> 240 9.3% 35.6% 384 14.8% math/big.nat.sub
> 79 3.0% 38.6% 364 14.0% math/big.nat.set
> 7 0.3% 38.9% 346 13.3% math/big.(*Int).Mod
> 110 4.2% 43.2% 326 12.6% math/big.nat.make
> 13 0.5% 43.7% 309 11.9% math/big.(*Int).QuoRem
> 8 0.3% 44.0% 297 11.5% math/big.nat.div
> 65 2.5% 46.5% 289 11.1% math/big.nat.divLarge
> 59 2.3% 48.7% 267 10.3% runtime.makeslice
> 54 2.1% 50.8% 238 9.2% runtime.mallocgc
> 8 0.3% 51.1% 207 8.0% makeslice1
> 61 2.4% 53.5% 205 7.9% runtime.copy
> 106 4.1% 57.6% 198 7.6% math/big.nat.shr
> 22 0.8% 58.4% 196 7.6% math/big.(*Int).Neg
> 84 3.2% 61.7% 163 6.3% math/big.nat.add
> 152 5.9% 67.5% 152 5.9% math/big.nat.norm
> 143 5.5% 73.0% 143 5.5% runtime.memmove
> 125 4.8% 77.9% 125 4.8% math/big.nat.cmp
> 64 2.5% 80.3% 120 4.6% math/big.nat.trailingZeroBits
> 10 0.4% 80.7% 80 3.1% math/big.(*Int).Mul
> 0 0.0% 80.7% 78 3.0% gc
> 24 0.9% 81.6% 78 3.0% runtime.MCache_Alloc
> 0 0.0% 81.6% 78 3.0% runtime.gc
> 7 0.3% 81.9% 70 2.7% math/big.nat.mul
> 66 2.5% 84.5% 66 2.5% math/big.trailingZeroBits
> 64 2.5% 86.9% 64 2.5% math/big.subVV
> 0 0.0% 86.9% 62 2.4% runtime.parfordo
> 6 0.2% 87.2% 51 2.0% runtime.MCentral_AllocList
> 4 0.2% 87.3% 50 1.9% runtime.new
> 34 1.3% 88.6% 44 1.7% sweepspan
> 5 0.2% 88.8% 42 1.6% MCentral_Grow
> 34 1.3% 90.1% 34 1.3% runtime.markallocated
> 33 1.3% 91.4% 33 1.3% math/big.shrVU
> 27 1.0% 92.4% 30 1.2% scanblock
> 14 0.5% 93.0% 28 1.1% math/big.basicMul
> 22 0.8% 93.8% 22 0.8% math/big.divWW
> 7 0.3% 94.1% 19 0.7% math/big.(*Int).Add
> 0 0.0% 94.1% 18 0.7% markroot
> 17 0.7% 94.8% 17 0.7% runtime.markspan
> 0 0.0% 94.8% 16 0.6% runtime.MHeap_Alloc
> 12 0.5% 95.2% 12 0.5% math/big.addVV
>
> (pprof) top50
> Total: 2593 samples
> 276 10.6% 10.6% 606 23.4% math/big.(*Int).Sub
> 240 9.3% 19.9% 384 14.8% math/big.nat.sub
> 156 6.0% 25.9% 596 23.0% math/big.(*Int).Rsh
> 152 5.9% 31.8% 152 5.9% math/big.nat.norm
> 144 5.6% 37.3% 1902 73.4% math/big.(*Int).binaryGCD
> 143 5.5% 42.8% 143 5.5% runtime.memmove
> 125 4.8% 47.7% 125 4.8% math/big.nat.cmp
> 110 4.2% 51.9% 326 12.6% math/big.nat.make
> 106 4.1% 56.0% 198 7.6% math/big.nat.shr
> 104 4.0% 60.0% 431 16.6% math/big.(*Int).Set
> 84 3.2% 63.2% 163 6.3% math/big.nat.add
> 79 3.0% 66.3% 364 14.0% math/big.nat.set
> 66 2.5% 68.8% 66 2.5% math/big.trailingZeroBits
> 65 2.5% 71.3% 289 11.1% math/big.nat.divLarge
> 64 2.5% 73.8% 120 4.6% math/big.nat.trailingZeroBits
> 64 2.5% 76.3% 64 2.5% math/big.subVV
> 61 2.4% 78.6% 205 7.9% runtime.copy
> 59 2.3% 80.9% 267 10.3% runtime.makeslice
> 54 2.1% 83.0% 238 9.2% runtime.mallocgc
> 34 1.3% 84.3% 34 1.3% runtime.markallocated
> 34 1.3% 85.6% 44 1.7% sweepspan
> 33 1.3% 86.9% 33 1.3% math/big.shrVU
> 27 1.0% 87.9% 30 1.2% scanblock
> 24 0.9% 88.9% 78 3.0% runtime.MCache_Alloc
> 22 0.8% 89.7% 196 7.6% math/big.(*Int).Neg
> 22 0.8% 90.6% 22 0.8% math/big.divWW
> 17 0.7% 91.2% 17 0.7% runtime.markspan
> 14 0.5% 91.7% 28 1.1% math/big.basicMul
> 13 0.5% 92.2% 309 11.9% math/big.(*Int).QuoRem
> 12 0.5% 92.7% 12 0.5% math/big.addVV
> 12 0.5% 93.2% 12 0.5% math/big.nat.clear
> 11 0.4% 93.6% 11 0.4% runtime.memclr
> 11 0.4% 94.0% 11 0.4% runtime.settype_flush
> 10 0.4% 94.4% 80 3.1% math/big.(*Int).Mul
> 9 0.3% 94.8% 9 0.3% math/big.addVW
> 8 0.3% 95.1% 207 8.0% makeslice1
> 8 0.3% 95.4% 8 0.3% math/big.addMulVVW
> 8 0.3% 95.7% 297 11.5% math/big.nat.div
> 8 0.3% 96.0% 8 0.3% runtime.SizeToClass
> 8 0.3% 96.3% 8 0.3% runtime.casp
> 7 0.3% 96.6% 19 0.7% math/big.(*Int).Add
> 7 0.3% 96.8% 346 13.3% math/big.(*Int).Mod
> 7 0.3% 97.1% 70 2.7% math/big.nat.mul
> 6 0.2% 97.3% 6 0.2% math/big.subVW
> 6 0.2% 97.6% 51 2.0% runtime.MCentral_AllocList
> 5 0.2% 97.8% 42 1.6% MCentral_Grow
> 5 0.2% 98.0% 11 0.4% math/big.(*Int).Lsh
> 4 0.2% 98.1% 4 0.2% math/big.mulWW
> 4 0.2% 98.3% 50 1.9% runtime.new
> 3 0.1% 98.4% 3 0.1% math/big.(*Int).Cmp
>

Peter Caven

unread,
May 7, 2013, 7:23:46 PM5/7/13
to golan...@googlegroups.com, pca...@gmail.com
Yes, on Vista.

So, I used:

Go:      go1.1rc2.windows-amd64
Python:  3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Scala:   JVM 1.7.0_21, 64-Bit Server VM (build 23.21-b01, mixed mode)

-- Peter

agl

unread,
May 7, 2013, 7:55:15 PM5/7/13
to golan...@googlegroups.com
On Tuesday, May 7, 2013 3:47:19 PM UTC-4, Peter Caven wrote:
It seems that there is a lot of time spent on memory allocation, but that's about all I can determine at this point, 
given my unfamiliarity with Go.
The output from pprof is below.

Any ideas on how to make big.Int faster in Go?

You can save a few allocations and some time by avoiding some aliasing. For example:

a.Mul(a, a)

will end up allocating, because Mul cannot write the result into a as it goes. By using a temp variable and swapping afterwards, some allocations can be saved:

t1.Mul(a, a)
t1, a = a, t1

(Yes, this is a little ugly.)

But this doesn't get you too much in this test because 75% of the time is spent in binaryGCD[1].

A much clearer (to me, at least) profile output results from 'gv', but that might not work on Windows without Ghostscript installed.

I suspect that some optimisation work in binaryGCD would pay dividends here. For example, 30 seconds of work and the following change resulted in a modest speedup:

diff -r 93752156e2b5 src/pkg/math/big/int.go
--- a/src/pkg/math/big/int.go   Fri May 03 15:24:05 2013 -0400
+++ b/src/pkg/math/big/int.go   Tue May 07 19:51:22 2013 -0400
@@ -703,14 +703,15 @@
                // reduce t
                t.Rsh(t, t.abs.trailingZeroBits())
                if t.neg {
-                       v.Neg(t)
+                       v, t = t, v
+                       v.neg = len(v.abs) > 0 && !v.neg // 0 has no sign
                } else {
-                       u.Set(t)
+                       u, t = t, u
                }
                t.Sub(u, v)
        }
 
-       return u.Lsh(u, k)
+       return z.Lsh(u, k)
 }
 
 // ProbablyPrime performs n Miller-Rabin tests to check whether x is prime.

The combination of those two tricks got a 20% speedup. Tuning Rsh for negative numbers may also pay dividends.

Nonetheless, the Java BigInteger speed is impressive.



Cheers

AGL

Robert Griesemer

unread,
May 7, 2013, 8:47:14 PM5/7/13
to pca...@gmail.com, golang-nuts
agl already gave good answers. I haven't looked at this in detail, but there are a few things to keep in mind:

1) Go's math/big library is native Go with some core assembly routines. It's written from scratch, and while it has reasonable performance, it simply cannot compete with, say, the GMP library, which has had years of development by big-integer-arithmetic specialists adding the latest mathematical discoveries in fast multiplication and division.

2) Using a.Op(a, b) instead of c.Op(a, b) should provide a significant performance improvement in many cases where one of the operands (a) can be re-used for the result.

3) Most other languages have very fast bignum arithmetic, but they typically just call the GMP (or equivalent) C routines. One has to be careful when comparing languages this way: it's not the language, it's the underlying library. I'm actually surprised that Python is not much faster than Go.

4) I suspect the main culprit is our div/mod implementation, which is textbook (Knuth) and could use improvements.

- gri

Peter Caven

unread,
May 7, 2013, 11:51:27 PM5/7/13
to golan...@googlegroups.com
Thanks for pointing out the aliasing problems.  
I see in the big.Int code that there are a lot of checks for aliasing, and that many of the functions also return a pointer to the receiver in addition to updating it.

Yes, I don't think there's very much we can gain by optimizing binaryGCD alone.
After looking more closely at the lower level arithmetic functions in big.nat, I think that may be where most of the performance goes.

Consider that the Python code I have is just these two simple high-level functions:

def gcd(a,b):
    while b:
        a,b = b,(a % b)
    return a

def rho(n):
    a = 2
    b = 2
    d = 1
    while d == 1:
        a = (a**2 + 1) % n
        b = (b**2 + 1) % n
        b = (b**2 + 1) % n
        d = gcd(a-b,n)
    return d

Compare that to binaryGCD.
It looks like it would be a big win to put the core functions (add, sub, mul, div, divmod) and the rest of big.nat in a lower-level C package.

-- Peter

Rémy Oudompheng

unread,
May 8, 2013, 2:49:59 AM5/8/13
to Peter Caven, golang-nuts
I don't understand why you think using C will make anything faster.
A few points:
* we may have a few issues with the divmod algorithm
* binaryGCD is not always more effective than the ordinary GCD
* your numbers are very small, so ordinary arithmetic (sum, product)
is dwarfed by various overheads in function calls

A C library will not help for these issues.

For large integer multiplication you can try using
github.com/remyoudompheng/bigfft which is faster than math/big for
numbers with more than 40k digits, and "only" 1.5x-2x slower than GMP
but that's not your use case. It's written in Go and calls the same
assembly routines as math/big when necessary.

Rémy.

Peter Caven

unread,
May 8, 2013, 3:35:17 AM5/8/13
to golan...@googlegroups.com, Peter Caven
Hello Rémy,

Lol! You've got some big numbers.  Your package is very impressive!


On Wednesday, May 8, 2013 2:49:59 AM UTC-4, Rémy Oudompheng wrote:

I don't understand why you think using C will make anything faster.
We may:
* have a few issues with divmod algorithm
* binaryGCD is not always more effective than the ordinary GCD

I was guessing of course, but here are my reasons: 
As you can see, the Python code is just as fast as the Go code, but the Python code is completely high level, even the gcd function, and this is all in a dynamic language.
The low level arithmetic functions inside the Python VM are what makes the Python code fast, which means that tinkering with divmod or gcd is unlikely to make the Go code faster. 
There is also the problem of the unusually large number of memory allocations in the Go code. This looks like a fundamental problem with the implementation of big.nat.
 
* your numbers are very small, so ordinary arithmetic (sum, product)
is dwarfed by various overheads in function calls 
A C library will not help for these issues.

  
So, I suppose you mean that the reason for the JVM code being 5 times faster is that Hotspot inlines everything?

 
For large integer multiplication you can try using
github.com/remyoudompheng/bigfft which is faster than math/big for
numbers with more than 40k digits, and "only" 1.5x-2x slower than GMP
but that's not your use case. It's written in Go and calls the same
assembly routines as math/big when necessary.


That doesn't surprise me actually - with 40k digits isn't the computation time determined less by the arithmetic than by the handling of the memory for those numbers?

-- Peter

David DENG

unread,
May 8, 2013, 4:18:50 AM5/8/13
to golan...@googlegroups.com
I didn't check the correctness of the result, but using the following gcd (a direct translation of your Python code) is much faster:

func gcd(a, b *big.Int) *big.Int {
    for b.Sign() != 0 {
        a, b = b, new(big.Int).Mod(a, b)
    }
    return a
}

David

Rémy Oudompheng

unread,
May 8, 2013, 4:23:34 AM5/8/13
to Peter Caven, golang-nuts
2013/5/8 Peter Caven <pca...@gmail.com>:
> Hello Rémy,
>
> Lol! You've got some big numbers. Your package is very impressive!
>
>
> On Wednesday, May 8, 2013 2:49:59 AM UTC-4, Rémy Oudompheng wrote:
>>
>>
>> I don't understand why you think using C will make anything faster.
>> We may:
>> * have a few issues with divmod algorithm
>> * binaryGCD is not always more effective than the ordinary GCD
>
>
> I was guessing of course, but here are my reasons:
> As you can see, the Python code is just as fast as the Go code, but the
> Python code is completely high level, even the gcd function, and this is all
> in a dynamic language.
> The low level arithmetic functions inside the Python VM are what makes the
> Python code fast, which means that tinkering with divmod or gcd is unlikely
> to make the Go code faster.

divmod *is* a low-level arithmetic primitive, and it's the most
complex one. As I said, your numbers are just 2 uint64 blocks wide, so
the arithmetic itself will be a handful of instructions, which is
already more than the cost of a function call, except for divmod.

> There is also the problem of the unusually large number of memory
> allocations in the Go code. This looks like a fundamental problem with the
> implementation of big.nat.

For carefully written code (see agl's comments), the amortized cost is
visible but usually small.

>> * your numbers are very small, so ordinary arithmetic (sum, product)
>> is dwarfed by various overheads in function calls
>>
>> A C library will not help for these issues.
>>
>
> So, I suppose you mean that the reason for the JVM code being 5 times faster
> is that Hotspot inlines everything?
>
>
>>
>> For large integer multiplication you can try using
>> github.com/remyoudompheng/bigfft which is faster than math/big for
>> numbers with more than 40k digits, and "only" 1.5x-2x slower than GMP
>> but that's not your use case. It's written in Go and calls the same
>> assembly routines as math/big when necessary.
>>
>
> That doesn't surprise me actually - with 40k digits isn't the computation
> time determined less by the arithmetic than by the handling of the memory
> for those numbers?

There's a lot of cache-unfriendliness issues that arise but the choice
of algorithm can also make a noticeable difference.

Rémy.

David DENG

unread,
May 8, 2013, 4:34:11 AM5/8/13
to golan...@googlegroups.com
An ad-hoc optimization to gcd can make it even faster (suppose a can be modified within gcd(), b cannot).

var newB *big.Int = new(big.Int)
func gcd(a, b *big.Int) *big.Int {
    b = newB.Set(b)
    for b.Sign() != 0 {
        a, b = b, a.Mod(a, b)
    }
    return a
}

David

Peter Caven

unread,
May 8, 2013, 1:39:32 PM5/8/13
to golan...@googlegroups.com
I'm seeing times that are twice as long with your code ...
Can you post your benchmark results?

-- Peter

Uli Kunitz

unread,
May 8, 2013, 7:00:44 PM5/8/13
to golan...@googlegroups.com, pca...@gmail.com
In the sources I find only the Python and the Go code. Is the same algorithm used in Scala and Java? There are modifications to Pollard's rho method that don't require computing the GCD on every cycle of the loop.

Could you provide the Java or Scala source that you are using?

Peter Caven

unread,
May 8, 2013, 8:52:59 PM5/8/13
to golan...@googlegroups.com, pca...@gmail.com
Sure, it's essentially just a literal translation of the Python code, and makes use of the Java BigInteger gcd():


val N5 = BigInt("766150476015982127183457373")
val N6 = BigInt("62823675885202669644104829577")

def rho(n : BigInt) : BigInt = {
    var a  = BigInt(2)
    var b  = BigInt(2)
    var d  = BigInt(1)

    while (d == 1)
    {
        a = ((a * a) + 1) % n
        b = ((b * b) + 1) % n
        b = ((b * b) + 1) % n

        d = n.gcd(a - b)
    }
    d
}

def TimeRho(n : BigInt) = {
    val t1 = System.currentTimeMillis()
    val r = rho(n)
    val t2 = System.currentTimeMillis()
    println("rho(" + n + ") = " + r + " \nTime = " + (t2 - t1) + " ms")
}

Enjoy,
-- Peter

David DENG

unread,
May 9, 2013, 10:02:31 PM5/9/13
to golan...@googlegroups.com
Maybe it's because I was using Go 1.0.3 (Windows). I'll try it later on Go 1.1.

David