math/big performance is bad - are there any plans to address it?

Piotr Narewski

Sep 12, 2013, 9:01:03 AM
to golan...@googlegroups.com
Are there any plans on the road map to do something about the performance of the big number package?

Since public-key cryptography is very important these days, I would like to appeal to the devs to address this problem as soon as possible.
I understand that it might require going down to the assembly level, but well... it wouldn't be the only package.

And yes: I do need better big-number performance, in an actual Go application.
So far I have been dealing with it using cgo wrappers, but that isn't a solution I would like to stick with forever.

Michael Jones

Sep 12, 2013, 9:07:16 AM
to Piotr Narewski, golang-nuts
I'm intimately familiar with the nat aspect of big, and this is surely the slow part of which you speak. I have my own timings against GMP, but would very much like to know:

a. What is slow for you? (I am expecting that you will say multiply, which means more special cases, but please share your experience.)

b. How slow is it? In particular, is there more than a 2x difference between what you expect and what you get?

Not being defensive. Performance is itself an enabling technology. Just looking for focus.

Michael



--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Jan Mercl

Sep 12, 2013, 9:14:43 AM
to Michael Jones, Piotr Narewski, golang-nuts
On Thu, Sep 12, 2013 at 3:07 PM, Michael Jones <m...@google.com> wrote:
> a. What is slow for you, (I am expecting that you will say multiply, which
> means more special cases, but please share your experience.)

Wrt multiply, I'd like to mention https://github.com/remyoudompheng/bigfft,
even though I don't know the status of that package.
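If it exposes a drop-in Mul — I believe the API is simply Mul(x, y *big.Int) *big.Int, but check the package before relying on it — trying it would look something like this:

package main

import (
	"fmt"
	"math/big"

	"github.com/remyoudompheng/bigfft"
)

func main() {
	// Two large operands; FFT multiplication only pays off for big inputs.
	x := new(big.Int).Lsh(big.NewInt(1), 100000)
	y := new(big.Int).Sub(x, big.NewInt(1))

	z1 := new(big.Int).Mul(x, y) // math/big's multiply
	z2 := bigfft.Mul(x, y)       // FFT-based multiply (assumed signature)
	fmt.Println(z1.Cmp(z2) == 0) // should print true
}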

-j

Piotr Narewski

Sep 12, 2013, 9:18:24 AM
to golan...@googlegroups.com, Piotr Narewski
Thanks for your answer,

So my application does a lot of ECDSA operations - the verify function in particular.

My rough measurements are that the ecdsa.Verify function provided by the Go lib is almost 20 times slower than a cgo wrapper based on OpenSSL.
I did some further research and it seems to be more a problem with the performance of math/big than with how crypto/ecdsa and crypto/elliptic are implemented.

If it were only 2x slower - that would be really nice :)

Ian Lance Taylor

Sep 12, 2013, 11:44:53 AM
to Piotr Narewski, golang-nuts
On Thu, Sep 12, 2013 at 6:01 AM, Piotr Narewski <piot...@gmail.com> wrote:
> Are there any plans on the road map to do something about the performance of
> the big number package?
>
> Since a public key cryptography is very important these days, I would like
> to appeal to the devs to address this problem as soon as possible.
> I understand that it might require bringing it down to the assembly level,
> but well... it wouldn't be the only package.

To be clear, math/big already uses assembly code.

I would encourage you to open an issue with some benchmarks (see the
existing BenchmarkXXX in math/big/*_test.go), and if possible point to
the specific code in math/big that needs to be sped up.
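For example, a benchmark along those lines might look like this (a sketch with an arbitrary operand size, not an existing test in math/big):

package big_test

import (
	"math/big"
	"testing"
)

// BenchmarkIntMul times multiplication of two 2048-bit integers,
// roughly the operand sizes public-key code tends to use.
func BenchmarkIntMul(b *testing.B) {
	x := new(big.Int).Lsh(big.NewInt(1), 2048)
	y := new(big.Int).Sub(x, big.NewInt(1))
	z := new(big.Int)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		z.Mul(x, y)
	}
}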

Ian

Brad Fitzpatrick

Sep 12, 2013, 11:51:48 AM
to Ian Lance Taylor, Piotr Narewski, golang-nuts
I don't see much CPU time being spent in math/big.

      85  35.6%  35.6%      113  47.3% crypto/elliptic.p224Mul
      56  23.4%  59.0%       56  23.4% crypto/elliptic.p224ReduceLarge
      28  11.7%  70.7%       56  23.4% crypto/elliptic.p224Square
      18   7.5%  78.2%       18   7.5% runtime.usleep
      12   5.0%  83.3%       12   5.0% crypto/elliptic.p224Reduce
       6   2.5%  85.8%        6   2.5% crypto/elliptic.p224Contract
       6   2.5%  88.3%       66  27.6% crypto/elliptic.p224DoubleJacobian
       4   1.7%  90.0%      135  56.5% crypto/elliptic.p224AddJacobian
       4   1.7%  91.6%        4   1.7% crypto/elliptic.p224CopyConditional
       3   1.3%  92.9%        4   1.7% MHeap_AllocLocked
       3   1.3%  94.1%        3   1.3% crypto/elliptic.p224Sub
       2   0.8%  95.0%        2   0.8% crypto/elliptic.p224Add
       2   0.8%  95.8%        8   3.3% crypto/elliptic.p224IsZero
       2   0.8%  96.7%        2   0.8% math/big.nat.set
       2   0.8%  97.5%        5   2.1% runtime.makeslice
       1   0.4%  97.9%        3   1.3% math/big.(*Int).Set
       1   0.4%  98.3%        3   1.3% math/big.nat.div
       1   0.4%  98.7%        1   0.4% runtime.nanotime
       1   0.4%  99.2%        1   0.4% runtime.settype_flush
       1   0.4%  99.6%        1   0.4% stkbucket
       1   0.4% 100.0%        1   0.4% testing.BenchmarkResult.String
       0   0.0% 100.0%        4   1.7% MCentral_Grow
       0   0.0% 100.0%        1   0.4% MHeap_FreeLocked
       0   0.0% 100.0%       19   7.9% System
       0   0.0% 100.0%        3   1.3% cnew
       0   0.0% 100.0%      219  91.6% crypto/ecdsa.BenchmarkVerify
       0   0.0% 100.0%      219  91.6% crypto/ecdsa.Verify

From:

diff -r c6a1a1c4051a src/pkg/crypto/ecdsa/ecdsa_test.go
--- a/src/pkg/crypto/ecdsa/ecdsa_test.go	Thu Sep 12 22:03:53 2013 +1000
+++ b/src/pkg/crypto/ecdsa/ecdsa_test.go	Thu Sep 12 16:50:55 2013 +0100
@@ -189,3 +189,21 @@
 		}
 	}
 }
+
+func BenchmarkVerify(b *testing.B) {
+	c := elliptic.P224()
+	priv, _ := GenerateKey(c, rand.Reader)
+
+	hashed := []byte("testing")
+	r, s, err := Sign(rand.Reader, priv, hashed)
+	if err != nil {
+		b.Fatal(err)
+	}
+	b.ResetTimer()
+
+	for i := 0; i < b.N; i++ {
+		if !Verify(&priv.PublicKey, hashed, r, s) {
+			b.Fatal("Verify failed")
+		}
+	}
+}


Piotr, is that representative of what your application does?

Michael Gehring

Sep 12, 2013, 12:39:24 PM
to Brad Fitzpatrick, Ian Lance Taylor, Piotr Narewski, golang-nuts
On Thu, Sep 12, 2013 at 04:51:48PM +0100, Brad Fitzpatrick wrote:
> I don't see much CPU time being spent in math/big.
[..]
> + c := elliptic.P224()

elliptic.P224 has a separate (constant-time) implementation that doesn't really
use math/big (elliptic.p224Curve).

The profile looks different for curves using the generic implementation
(elliptic.CurveParams).

For elliptic.P256():

     128  13.4%  13.4%      452  47.2% math/big.nat.divLarge
      76   7.9%  21.3%       76   7.9% math/big.divWW
      54   5.6%  26.9%      262  27.3% runtime.makeslice
      53   5.5%  32.5%       53   5.5% math/big.addMulVVW
      53   5.5%  38.0%      260  27.1% runtime.mallocgc
      40   4.2%  42.2%       40   4.2% math/big.mulAddVWW
      37   3.9%  46.0%       37   3.9% runtime.markallocated
      33   3.4%  49.5%       33   3.4% runtime.memclr
      28   2.9%  52.4%       34   3.5% sweepspan
      27   2.8%  55.2%      197  20.6% cnew
      27   2.8%  58.0%       27   2.8% math/big.subVV
      25   2.6%  60.6%       85   8.9% math/big.basicMul
      24   2.5%  63.2%       33   3.4% scanblock
      21   2.2%  65.3%       21   2.2% math/big.nat.norm
      20   2.1%  67.4%       98  10.2% runtime.MCache_Alloc
      18   1.9%  69.3%      482  50.3% math/big.(*Int).QuoRem
      17   1.8%  71.1%      191  19.9% math/big.nat.mul
      17   1.8%  72.9%       19   2.0% runtime.settype_flush
      14   1.5%  74.3%       57   5.9% math/big.(*Int).Sub
      14   1.5%  75.8%       14   1.5% math/big.nat.clear
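To compare the two code paths side by side, BenchmarkVerify from the diff above can be parameterized over the curve — a sketch that would sit alongside it in crypto/ecdsa's ecdsa_test.go, not code from this thread:

func benchmarkVerify(b *testing.B, c elliptic.Curve) {
	priv, _ := GenerateKey(c, rand.Reader)

	hashed := []byte("testing")
	r, s, err := Sign(rand.Reader, priv, hashed)
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if !Verify(&priv.PublicKey, hashed, r, s) {
			b.Fatal("Verify failed")
		}
	}
}

// P224 has its own constant-time implementation; P256 (in Go 1.1.x) and P384
// go through the generic big.Int-based elliptic.CurveParams code path.
func BenchmarkVerifyP224(b *testing.B) { benchmarkVerify(b, elliptic.P224()) }
func BenchmarkVerifyP256(b *testing.B) { benchmarkVerify(b, elliptic.P256()) }
func BenchmarkVerifyP384(b *testing.B) { benchmarkVerify(b, elliptic.P384()) }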

Piotr Narewski

Sep 12, 2013, 12:59:46 PM
to golan...@googlegroups.com
Thanks everyone for your input.

Please find attached a package with a built-in OpenSSL wrapper and two benchmarks: Native and OpenSSL.
Hope you won't have problems building it.

I don't have access to any Linux system ATM, but these are the results from my Windows 7 (64 bit):
* BenchmarkNative      100          13810789 ns/op
* BenchmarkOpenSSL            5000            708040 ns/op

If the ~20x difference does not come from the big numbers lib, then I would be even more confused.

Cheers
bigbench.zip

Brad Fitzpatrick

Sep 12, 2013, 1:11:35 PM
to Piotr Narewski, golang-nuts
Piotr, Michael,

Perhaps this improved recently in Go tip, because I see:

BenchmarkNative     500   3074883 ns/op
BenchmarkOpenSSL    2000   1411538 ns/op

Which at 2.17x the time is much better than the 19.5x reported above.

Also, in my profile I see a very different picture, with divLarge at only 1%.



Brad Fitzpatrick

Sep 12, 2013, 1:19:33 PM
to Piotr Narewski, golang-nuts
There have been a bunch of math/big speed-ups in tip since Go 1.1 came out.

andrey mirtchovski

Sep 12, 2013, 1:38:48 PM
to Brad Fitzpatrick, Piotr Narewski, golang-nuts
I don't have much to contribute to the discussion, but running
benchmarks is always fun. I see a 9x slowdown, but I'm not seeing
divLarge in the top 20:

Linux, Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

$ go test -test.bench B*
PASS
BenchmarkNative 500 3745324 ns/op
BenchmarkOpenSSL 5000 408609 ns/op
ok _/tmp/bigbench 4.348s
$ go test -test.bench B*Native -cpuprofile cpu.prof
PASS
BenchmarkNative 500 3752920 ns/op
ok _/tmp/bigbench 2.268s
$ go tool pprof bigbench.test cpu.prof
(pprof) top20
Total: 226 samples
     133  58.8%  58.8%      135  59.7% crypto/elliptic.p256ReduceDegree
      12   5.3%  64.2%       12   5.3% crypto/elliptic.p256Diff
      10   4.4%  68.6%       76  33.6% crypto/elliptic.p256Mul
       9   4.0%  72.6%       78  34.5% crypto/elliptic.p256Square
       7   3.1%  75.7%       88  38.9% crypto/elliptic.p256PointDouble
       6   2.7%  78.3%       52  23.0% crypto/elliptic.p256PointAdd
       6   2.7%  81.0%        6   2.7% crypto/elliptic.p256SelectAffinePoint
       5   2.2%  83.2%       31  13.7% crypto/elliptic.p256PointAddMixed
       4   1.8%  85.0%        4   1.8% crypto/elliptic.p256ReduceCarry
       4   1.8%  86.7%        6   2.7% runtime.makeslice
       3   1.3%  88.1%        5   2.2% crypto/elliptic.p256Sum
       3   1.3%  89.4%        5   2.2% math/big.(*Int).Set
       2   0.9%  90.3%        2   0.9% crypto/elliptic.p256Scalar8
       2   0.9%  91.2%        2   0.9% crypto/elliptic.p256SelectJacobianPoint
       2   0.9%  92.0%        9   4.0% math/big.nat.div
       1   0.4%  92.5%        1   0.4% MHeap_FreeLocked
       1   0.4%  92.9%        2   0.9% cnew
       1   0.4%  93.4%        1   0.4% crypto/elliptic.p256CopyConditional
       1   0.4%  93.8%        1   0.4% crypto/elliptic.p256Scalar4
       1   0.4%  94.2%        2   0.9% crypto/elliptic.p256ToBig

Piotr Narewski

Sep 12, 2013, 1:44:39 PM
to golan...@googlegroups.com, Piotr Narewski
Cool - that's actually good news.
It means my wishes were coming true even as I was asking for them... :)

I will be looking forward to 1.2 then.

Cheers!

Piotr Narewski

Sep 12, 2013, 2:13:49 PM
to golan...@googlegroups.com, Piotr Narewski
I've just built the latest sources and indeed the results are... very surprising (in a positive way).

So with "go1.1.2 windows/amd64" I have:
 * BenchmarkNative      100          14400824 ns/op
 * BenchmarkOpenSSL            5000            719641 ns/op

.. while with "devel +5694c6095fc7 Thu Sep 12 09:31:07 2013 -0700 windows/amd64":
 * BenchmarkNative      500           3240185 ns/op
 * BenchmarkOpenSSL            5000            683839 ns/op

Seems like a huge improvement, though when I look at the changes in math/big I don't quite see anything that could have caused it.

Let me rebuild 1.1.2 with my own mingw... hold on!

Michael Gehring

Sep 12, 2013, 2:14:35 PM
to Brad Fitzpatrick, Piotr Narewski, golang-nuts
I probably should have mentioned I ran the benchmark/profile on go
1.1.2.

Using tip (5694c6095fc7) I get

BenchmarkNative 500 3797857 ns/op
BenchmarkOpenSSL 5000 449084 ns/op

I'm not sure the improvements can really be attributed to the
math/big changes. We are benchmarking elliptic.P256, and there is now a
constant-time P256 implementation in tip (which again doesn't really use
math/big).

For elliptic.P384 I only see a small (-3.3%) improvement in tip:

go version go1.1.2 linux/amd64
BenchmarkVerify 100 33087171 ns/op

go version devel +5694c6095fc7 Thu Sep 12 09:31:07 2013 -0700 linux/amd64
BenchmarkVerify 100 31992077 ns/op

Piotr Narewski

Sep 12, 2013, 2:19:07 PM
to golan...@googlegroups.com, Brad Fitzpatrick, Piotr Narewski, m...@ebfe.org
We should also keep in mind that there are a number of ways to build the OpenSSL lib - performance-wise they can make a big difference.
Not to mention that there are plenty of different versions of it.

Piotr Narewski

Sep 12, 2013, 2:24:40 PM
to golan...@googlegroups.com, Piotr Narewski


On Thursday, September 12, 2013 8:13:49 PM UTC+2, Piotr Narewski wrote:
[..]
Let me rebuild 1.1.2 with my own mingw... hold on!

FYI, rebuilding 1.1.2 with my own toolset does not change a thing, which means that the ~4.5x performance improvement (that I observe) must have been added after 1.1.2.




Michael Gehring

Sep 12, 2013, 2:47:35 PM
to Brad Fitzpatrick, Piotr Narewski, golang-nuts
go tip:

BenchmarkNative          500     3788556 ns/op
BenchmarkNativeGeneric   100    17064095 ns/op
BenchmarkOpenSSL        5000      500617 ns/op

BenchmarkNative is the new constant-time P256 implementation,
BenchmarkNativeGeneric is the generic one as seen in 1.1.2.
bigbench.diff

Piotr Narewski

Sep 12, 2013, 3:09:26 PM
to golan...@googlegroups.com, Brad Fitzpatrick, Piotr Narewski, m...@ebfe.org


On Thursday, September 12, 2013 8:47:35 PM UTC+2, Michael Gehring wrote:
[..]
BenchmarkNative is the new constant-time P256 implementation,
BenchmarkNativeGeneric is the generic one as seen in 1.1.2.

Right - that totally explains the difference, thanks!
I totally missed the fact that there is a new file: p256.go.

Though, the bad news for me is that this improvement does not help me at all, since in my app I use a different curve.


agl

Sep 12, 2013, 4:12:48 PM
to golan...@googlegroups.com, Brad Fitzpatrick, Piotr Narewski, m...@ebfe.org

Though, the bad news for me is that this improvement does not help me at all, since in my app I use a different curve.

You're welcome to write an optimised secp256k1 :) But I don't personally play with Bitcoin, and that's the only thing I know of that uses Koblitz curves, so I'm unlikely to do so, I'm afraid.


Cheers

AGL 

Piotr Narewski

Sep 12, 2013, 4:43:42 PM
to golan...@googlegroups.com
Sounds like a challenge. I have no idea what these calculations do, but I very much want to find out... Therefore expect more questions about it in the future :-)
Cheers

Rémy Oudompheng

Sep 13, 2013, 4:43:28 AM
to Jan Mercl, Michael Jones, Piotr Narewski, golang-nuts
As far as I know it is compatible with Go 1.1, and I have no plans to
work further on the package. I think it works quite correctly,
but it might need more testing.

Rémy.

Ignacio Grande

Sep 13, 2013, 6:00:39 AM
to golan...@googlegroups.com, Jan Mercl, Michael Jones, Piotr Narewski
Hi all,

Good timing with the question, as I'm having a similar problem with some weird behavior.

I was trying (for fun) to solve Project Euler problem 216 (http://projecteuler.net/problem=216) with Go, which is very easy to implement and parallelize, when I noticed that the concurrent/parallel version with 4 cores was running slower than the 1-thread version, with total CPU usage of about 30-40%. Doing some checks, I think the bottleneck is the garbage collector.

The test program (I'm sure it can be implemented much better) is here: http://play.golang.org/p/22Tj823pKr (it cannot be run in the playground).

I'm using go version go1.1.2 windows/386, on Windows 7 with a 4-core Intel CPU.

When running the code with GOMAXPROCS=1, the important line with the timing says:

Final results: 17187 278ms 2.604s

It means 278 ms in the garbage collector, 2.604 s in total.

When running it with GOMAXPROCS=4 it says:

Final results: 17187 3.676s 3.97s

The garbage collector time has increased dramatically, and the total running time has gone up instead of down.

I'm rather new to Go, so maybe my code is flawed, but I find these results strange; however, I cannot say where the problem is (my fault or a bug?).
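For reference, a GC figure like the 278 ms above can be read from the runtime's own statistics; a minimal sketch, assuming that's essentially what the test program prints:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// ... run the workload here ...

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("GC: %v total pause over %d collections\n",
		time.Duration(ms.PauseTotalNs), ms.NumGC)
}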

Thanks a lot for your help

andrey mirtchovski

Sep 13, 2013, 11:26:33 AM
to Ignacio Grande, golang-nuts, Jan Mercl, Michael Jones, Piotr Narewski
Tip improves the situation somewhat, although I can't speak for
Windows. Without modifications:

$ for i in 1 2 4 8; do GOMAXPROCS=$i ./t | grep Final; done # Linux
Final results: 17187 114.941708ms 1.354362265s
Final results: 17187 143.089324ms 778.95197ms
Final results: 17187 155.106826ms 618.619641ms
Final results: 17187 110.605596ms 558.827463ms
$

$ for i in 1 2 4 8; do GOMAXPROCS=$i ./t | grep Final; done # OSX
Final results: 17187 73.940457ms 834.480812ms
Final results: 17187 70.742853ms 465.854495ms
Final results: 17187 59.786305ms 274.526691ms
Final results: 17187 70.464003ms 313.129448ms
$

In general, though, the question "Why does using GOMAXPROCS>1 sometimes
make my program slower?" is answered in the FAQ:
http://golang.org/doc/faq#Why_GOMAXPROCS

Ignacio Grande

Sep 13, 2013, 5:03:24 PM
to golan...@googlegroups.com
Thank you for the test!
I thought this particular problem could be easily parallelized. I've tried it at home (linux/amd64 with 2 cores) and it looks more like your results, although the GC is using much more time with GOMAXPROCS=2 than with 1; I'm not sure why.

I've modified the code to send slices of 1000 elements to be computed in parallel, instead of one int64 at a time (roughly the pattern sketched below), and it seems to be a slight improvement, but not much. I hope it will run better with Go 1.2 :)
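A minimal sketch of that batching pattern, with a made-up placeholder in place of the actual problem-216 primality test:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

const (
	limit     = 1000000 // upper bound for n (the real problem uses 50,000,000)
	batchSize = 1000    // numbers sent per channel operation
)

// interesting stands in for the real per-number work; for problem 216 it
// would be a primality test of 2*n*n - 1. This condition is only a placeholder.
func interesting(n int64) bool {
	return (2*n*n-1)%7 == 1
}

// worker counts matching numbers in each batch it receives.
func worker(batches <-chan []int64, results chan<- int) {
	for batch := range batches {
		count := 0
		for _, n := range batch {
			if interesting(n) {
				count++
			}
		}
		results <- count
	}
}

func main() {
	nworkers := runtime.GOMAXPROCS(0)
	batches := make(chan []int64, nworkers)
	results := make(chan int, nworkers)

	var wg sync.WaitGroup
	for i := 0; i < nworkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			worker(batches, results)
		}()
	}

	// Producer: send slices of batchSize numbers instead of one int64 at a time,
	// which cuts channel traffic and per-item overhead.
	go func() {
		batch := make([]int64, 0, batchSize)
		for n := int64(2); n <= limit; n++ {
			batch = append(batch, n)
			if len(batch) == batchSize {
				batches <- batch
				batch = make([]int64, 0, batchSize)
			}
		}
		if len(batch) > 0 {
			batches <- batch
		}
		close(batches)
	}()

	// Close results once all workers are done.
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for c := range results {
		total += c
	}
	fmt.Println("count:", total)
}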

Regards.
 