slow Mod function in math/big, a better way?


jeff.temp...@gmail.com

unread,
Jul 7, 2017, 8:12:25 AM7/7/17
to golang-nuts
Hi

Exploring Go; my first project is to port over my favourite playground (prime number factorisation) from Python to Go.

Aside: I was a big Oberon-2 fan back in the day, so I'm pretty excited about Go making possible what never was in Oberon.  

Back to business: I was disappointed that Go is only a factor of 3 or so faster than Python.  The numbers I'm dealing with are big enough that they don't fit into a uint64, so I'm using math/big ... it turns out that about 70% of the runtime is spent in the Mod function of math/big, and mostly this is inside the underlying function divLarge.  See profile info below.

The naive explanation would be that Mod in Go's math/big is about as fast as the % primitive in Python (Python switches to big integers automatically where needed), while the rest of the program is a factor of 10 or so faster in Go than in Python.  Still, I wonder whether there is a better way to take a Mod.  The numbers are not THAT large (less than 100 decimal digits in most cases) ...

(pprof) top10 -cum
13.01s of 41.11s total (31.65%)
Dropped 86 nodes (cum <= 0.21s)
Showing top 10 nodes out of 74 (cum >= 6.66s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     40.45s 98.39%  runtime.goexit
         0     0%     0%     40.20s 97.79%  main.main
         0     0%     0%     40.20s 97.79%  runtime.main
     0.28s  0.68%  0.68%     40.19s 97.76%  _/Users/templon/gofact/factoring.Mybrent
         0     0%  0.68%     40.19s 97.76%  _/Users/templon/gofact/factoring.Trial_n_X
     0.44s  1.07%  1.75%     29.42s 71.56%  math/big.(*Int).Mod
     0.77s  1.87%  3.62%     28.98s 70.49%  math/big.(*Int).QuoRem
     0.51s  1.24%  4.86%     28.18s 68.55%  math/big.nat.div
    10.39s 25.27% 30.14%     27.47s 66.82%  math/big.nat.divLarge
     0.62s  1.51% 31.65%      6.66s 16.20%  runtime.makeslice

Andy Balholm

unread,
Jul 7, 2017, 12:48:01 PM7/7/17
to jeff.temp...@gmail.com, golang-nuts
That’s normal for languages like Python. The code that is actually running in Python is slow, but library functions are fast, because they are written in C.

Andy


Jeff Templon

unread,
Jul 8, 2017, 4:23:38 AM7/8/17
to golang-nuts, jeff.temp...@gmail.com
Hi,


On Friday, July 7, 2017 at 6:48:01 PM UTC+2, Andy Balholm wrote:
That’s normal for languages like Python. The code that is actually running in Python is slow, but library functions are fast, because they are written in C.

Sure ... that's why I wrote the 'naive explanation', which says basically that :-)  The question remains whether there is a better way to do it.  I first used DivMod, then tried Mod in the hope that it would be faster (I don't need the quotient) ... I even tried Exp with an exponent of 1, which didn't quite work for some reason I don't really understand;

a.Mod(stufffff, ...)  seems to set "a" to one value, but
a.Exp(stuff, one, ....) sets it to a different value, or not, depending on exactly where in the loop I do it.  My conclusion from that is that I don't really understand pointers / receivers in Golang yet :-)

JT
 

Andy Balholm

unread,
Jul 8, 2017, 11:58:59 AM7/8/17
to Jeff Templon, golang-nuts
I noticed your “naive explanation” after I sent my message. But I think it is the real explanation.

Is there a better way to do it in Go? Probably not. The math/big library isn’t quite as fast as gmp, but it’s probably faster than anything you’d write yourself. If the numbers were really huge, you’d probably gain some by calling gmp with cgo. But for moderate-sized numbers, the cgo overhead would probably cancel out your gains.

Andy


Jeff Templon

unread,
Jul 17, 2017, 9:56:47 AM7/17/17
to golang-nuts, jeff.temp...@gmail.com
Coming back to this, now that my Go knowledge is increasing:


On Saturday, July 8, 2017 at 5:58:59 PM UTC+2, Andy Balholm wrote:
I noticed your “naive explanation” after I sent my message. But I think it is the real explanation.

it turns out that 77% of the program's time is spent in runtime.mallocgc :-)  I think the reason is code like this:

func (z nat) shr(x nat, s uint) nat {
	m := len(x)
	n := m - int(s/_W)
	if n <= 0 {
		return z[:0]
	}
	// n > 0

	z = z.make(n)
	shrVU(z, x[m-n:], s%_W)

	return z.norm()
}
 
If I understand it right, there are two "make slice" operations in this code, and this bit is being run billions of times (it's at the heart of a tight loop).  Given the way Go works, the GC is unavoidable.  I've managed to make the code slightly faster by swapping in a right shift instead of a Mod (Barrett reduction) and by avoiding my own object creation inside the loop, and there is 10 to 20% to be won by playing with GOGC, but the only way to do better is probably to rewrite the math/big library.  I tried using GMP, but given all the temporaries created, it was even worse than using math/big!

Jan Mercl

unread,
Jul 17, 2017, 10:02:29 AM7/17/17
to Jeff Templon, golang-nuts
On Mon, Jul 17, 2017 at 3:57 PM Jeff Templon <jeff.temp...@gmail.com> wrote:

> Coming back to this, now that my Go knowledge is increasing:

Out of curiosity, how big are the numbers you want to factorize?
--

-j

Jeff Templon

unread,
Jul 17, 2017, 10:09:42 AM7/17/17
to golang-nuts, jeff.temp...@gmail.com
Hi

Generally I'm factoring numbers that I can do in less than a day, so on the order of 60 to 100 decimal digits.

JT

Jan Mercl

unread,
Jul 17, 2017, 10:20:21 AM7/17/17
to Jeff Templon, golang-nuts
On Mon, Jul 17, 2017 at 4:09 PM Jeff Templon <jeff.temp...@gmail.com> wrote:

> Generally I'm factoring numbers that I can do in less than a day, so on the order of 60 to 100 decimal digits.

Just for fun: Please provide a sample in that range.

--

-j

Jeff Templon

unread,
Jul 17, 2017, 1:18:05 PM7/17/17
to golang-nuts, jeff.temp...@gmail.com
Hi Jan,

On Monday, July 17, 2017 at 4:20:21 PM UTC+2, Jan Mercl wrote:

Just for fun: Please provide a sample in that range.
  
Sure:

medina:gofact> time ./factorize -r -M 191
factoring 3138550867693340381917894711603833208051177722232017256447
factors [ 383  7.068.569.257  39.940.132.241  332.584.516.519.201  87.274.497.124.602.996.457  ]
./factorize -r -M 191  29.06s user 0.48s system 106% cpu 27.773 total

The -M 191 means the 191st Mersenne number ( 2^191 - 1 )

That one's 58 digits.   Here's a 66 digit case ...

medina:gofact> time ./factorize -r -M 219
factoring 842498333348457493583344221469363458551160763204392890034487820287
factors [ 7  439  3.943  2.298.041  9.361.973.132.609  671.165.898.617.413.417  4.815.314.615.204.347.717.321  ]
./factorize -r -M 219  1207.98s user 19.46s system 106% cpu 19:10.47 total

The algorithm being used is Brent's modification of the Pollard rho.  What limits the speed of that algorithm is basically how big the second-largest factor is: a 90-digit number that is the product of 10-, 12-, 14-, and 54-digit factors (second largest = 14 digits) would factor considerably faster than a 60-digit number composed of 18-digit and 42-digit factors (second largest = 18 digits).

I am aware that there are faster methods.  This one was easy to code and is challenging enough to give a realistic feel of how well a given language might do for scientific code.


Jan Mercl

unread,
Jul 17, 2017, 4:20:00 PM7/17/17
to Jeff Templon, golang-nuts

On Mon, Jul 17, 2017 at 7:18 PM Jeff Templon <jeff.temp...@gmail.com> wrote:

> Sure:

I hoped for a number whose factors you don't know; I wanted to cheat by linking to factordb.com ;-)

Earlier you also wrote:

it turns out that 77% of the time of the program is spent in runtime.mallocgc :-)  I think the reason why is stuff like this:

I wonder if you happen to use Linux/amd64 and if you want to try (published right now): https://github.com/cznic/minigmp. If so, please share the performance changes, if any, thank you.

--

-j

Rémy Oudompheng

unread,
Jul 17, 2017, 4:39:56 PM7/17/17
to Jeff Templon, golang-nuts
How are you writing your code? Using variables wisely should cause
only minimal memory allocation.

Rémy.

Jeff Templon

unread,
Jul 18, 2017, 3:09:54 AM7/18/17
to golang-nuts, jeff.temp...@gmail.com
Hi Remy,

On Monday, July 17, 2017 at 10:39:56 PM UTC+2, Rémy Oudompheng wrote:
2017-07-17 15:56 GMT+02:00 Jeff Templon <jeff.temp...@gmail.com>:
 
> it turns out that 77% of the time of the program is spent in
> runtime.mallocgc :-)  I think the reason why is stuff like this:
>
> func (z nat) shr(x nat, s uint) nat {
> 	m := len(x)
> 	n := m - int(s/_W)
> 	if n <= 0 {
> 		return z[:0]
> 	}
> 	// n > 0
>
> 	z = z.make(n)
> 	shrVU(z, x[m-n:], s%_W)
>
> 	return z.norm()
> }
>

How are you writing your code? Using variables wisely should cause
only minimal memory allocation.


It's not my code that is taking all the time ... it's math/big itself.  The way I understand it, in the code above, x[m-n:] creates a new slice header and z.make(n) allocates a fresh backing array.  In the tightest part of my loop this function is called twice, so a million loop iterations already produce megabytes that need to be garbage collected, and this is not the only function being called.

I include the profile for a short run; you can see that most of the time is spent in the routine "Barrett", and almost all of that time is spent in functions in math/big.  Tracing through the graph, you can see that ultimately most of it ends up inside runtime.mallocgc ... is there a way to call math/big functions that does not result in so much garbage collection?

Jeff Templon

unread,
Jul 18, 2017, 3:10:57 AM7/18/17
to golang-nuts, jeff.temp...@gmail.com

Oops, I forgot the attachment.

webbar.png

Jeff Templon

unread,
Jul 18, 2017, 3:58:30 AM7/18/17
to golang-nuts, jeff.temp...@gmail.com
Hi Remy,

It turns out that you are at least partially on the right track :-)  I changed the way Barrett is written so as to use as few temporary variables as possible.  In particular, the Rsh doesn't use any at all any more; the 60% of the Rsh time that was being spent in runtime.mallocgc disappeared, resulting in a bit more than a 10% overall performance improvement in the entire code.  Now I understand better the comments about reuse of arguments in the math/big documentation!  Thanks for your suggestion.

Jeff Templon

unread,
Jul 18, 2017, 7:51:19 AM7/18/17
to golang-nuts, jeff.temp...@gmail.com
I managed to do quite a bit better (30% savings in execution time) proceeding along these lines.  Eventually I got to the point where runtime.makeslice, and underneath it runtime.mallocgc, are responsible for 50% of the run time, with no other real hotspots.  Almost all of the makeslice time is coming from math/big.nat.mul, in this section:

         .          .    420: // use basic multiplication if the numbers are small
     100ms      100ms    421: if n < karatsubaThreshold {
     100ms      9.13s    422: z = z.make(m + n)
     180ms      3.13s    423: basicMul(z, x, y)
     150ms      390ms    424: return z.norm()
         .          .    425: }

This z.make call is what is doing it, and I don't see any way to control it: it's buried deep in the math/big package, and n depends on the length of a passed big.Int argument.  That length "is what it is" and is apparently always under karatsubaThreshold, so the make *always* gets called.

Any other suggestions? :-)

Jeff Templon

unread,
Jul 18, 2017, 8:23:19 AM7/18/17
to golang-nuts, jeff.temp...@gmail.com
One more bit of progress which shaved off more time: following this page,


I introduced a pair of global cache big.Int variables for the function that was consuming most of the time: one for the smaller intermediate results and one for the larger.  It turns out that the z.make function shown earlier inspects the capacity of z before doing anything; if the capacity is sufficient, it just returns the slice as is.  The problem I was having was that the intermediate results were of mixed size, so the slice was being recreated once or twice per call.  With the cache variables in place, mallocgc takes only 13% of the total time instead of 46%.  Thanks for all your suggestions.

