performance on pi calculation with monte carlo


christophe petit

unread,
Oct 31, 2011, 10:50:16 PM10/31/11
to golang-nuts
Hello,

I have a problem with a little program that computes pi with the Monte
Carlo method. I use goroutines to parallelize the loop, and each of
them generates its own random numbers. The goal is, for example, to
have 10 goroutines each doing 1,000 iterations instead of a single
goroutine doing 10,000.

The code works fine, but I don't get the expected gain in execution
time. I always get the same time whether I use 1 goroutine, 2, 10 or
100...

I tried setting the environment variable GOMAXPROCS (to 2, 10,
100...), and it doesn't lower the execution time; worse, it increases
it.

I suspect the problem comes from the rand.Float64 function, which may
not be thread safe, much like POSIX, where rand_r is thread safe and
rand is not.

Here is the code:
----------------------------------------------------------------------------------------------------------


package main

import (
    "fmt"
    "math"
    "flag"
    "rand"
    "strconv"
    "time"
    // "runtime"
)

func main() {

    var pi_eval float64;
    flag.Parse();

    if (flag.NArg() != 2) {
        fmt.Println("Arguments Error : ./6.out nb_process nb_iterations");
    } else {
        nb_process, err := strconv.Atoi(flag.Arg(0));
        if err != nil {
            // Invalid string
            fmt.Println("Error");
        }
        nb_iter, err := strconv.Atoi(flag.Arg(1))
        if err != nil {
            // Invalid string
            fmt.Println("Error");
        }
        // runtime.GOMAXPROCS(nb_process);
        time1 := time.UTC();
        pi_eval = random_thread(nb_process, nb_iter);
        time2 := time.UTC();
        fmt.Printf("Pi Estimation: %f\n", pi_eval);
        fmt.Printf("Elapsed Time: %f seconds\n",
            (float64)((float64)(time2.Nanoseconds()-time1.Nanoseconds())/(float64)(math.Pow(10, 9))));
    }
}

func random_thread(nb_process int, nb_iter int) float64 {

    var k, l int;
    var pi float64 = 0;
    var nb_eff int = nb_iter / nb_process;
    ch := make(chan float64);
    for k = 1; k <= nb_process; k++ {
        go pi_process(ch, k, nb_eff);
    }
    for l = 1; l <= nb_process; l++ {
        pi += <-ch;
    }
    pi = pi / (float64)(nb_iter);
    pi = pi * 4;

    return pi;
}

func pi_process(ch chan float64, k int, nb_eff int) {

    var x, y, z float64;
    var nb_in int = 0;

    rand.Seed(rand.Int63());

    for i := 0; i < nb_eff; i++ {
        x = rand.Float64();
        y = rand.Float64();

        z = math.Pow(x, 2) + math.Pow(y, 2);
        if z <= 1.0 {
            nb_in++;
        }
    }
    ch <- (float64)(nb_in);
}

---------------------------------------------------------------------------------------------------------
For example, here are two executions, one with 1 goroutine and one
with 10 (the second argument is the number of iterations):


$ ./6.out 1 100000000
Pi Estimation: 3.141822
Elapsed Time: 22.086858 seconds

----------------------------------------------------------

$./6.out 10 100000000
Pi Estimation: 3.141577
Elapsed Time: 21.882880 seconds

----------------------------------------------------------


Could you please help me get an actual reduction in execution time
when using more than one goroutine?

Thanks.

andrey mirtchovski

unread,
Oct 31, 2011, 10:54:50 PM10/31/11
to christophe petit, golang-nuts
Others have hit exactly the same problem with exactly the same code.
The issue is that the Go pseudorandom number generator uses a global
state and protects it with a mutex. All the goroutines you spawn are
waiting on that single mutex.
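
For illustration, here is a minimal, self-contained sketch of that pattern. The lockedSource type below is my own stand-in for the global, mutex-protected state being described, not the actual stdlib code, but it shows why every goroutine calling the package-level functions ends up queuing on one lock:

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// lockedSource imitates (in spirit only) a single shared generator
// guarded by one mutex, used by every caller.
type lockedSource struct {
    mu  sync.Mutex
    src rand.Source
}

func (s *lockedSource) Int63() int64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.src.Int63()
}

func (s *lockedSource) Seed(seed int64) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.src.Seed(seed)
}

func main() {
    shared := rand.New(&lockedSource{src: rand.NewSource(1)})
    var wg sync.WaitGroup
    for g := 0; g < 4; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            sum := 0.0
            for i := 0; i < 1000000; i++ {
                sum += shared.Float64() // every call contends on the same mutex
            }
            _ = sum
        }()
    }
    wg.Wait()
    fmt.Println("done: four goroutines serialized on a single lock")
}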

roger peppe

unread,
Oct 31, 2011, 11:47:02 PM10/31/11
to christophe petit, golang-nuts
andrey's explanation sounds plausible to me.

use rand.New (the generator created this way is not
mutexed) and use crypto/rand.Int to seed (using rand to seed itself
is probably not going to give you much more randomness!)
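
A minimal sketch of that suggestion, using today's import paths (this thread predates Go 1, when the package was plain "rand"); the newLocalRand helper is my own name, not an API from the thread:

package main

import (
    crand "crypto/rand"
    "fmt"
    "math"
    "math/big"
    "math/rand"
)

// newLocalRand returns a per-goroutine generator (no shared mutex) seeded
// from crypto/rand, so concurrent workers neither contend nor share a seed.
func newLocalRand() (*rand.Rand, error) {
    seed, err := crand.Int(crand.Reader, big.NewInt(math.MaxInt64))
    if err != nil {
        return nil, err
    }
    return rand.New(rand.NewSource(seed.Int64())), nil
}

func main() {
    r, err := newLocalRand()
    if err != nil {
        panic(err)
    }
    fmt.Println(r.Float64(), r.Float64())
}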

Kyle Consalus

unread,
Nov 1, 2011, 12:14:24 AM11/1/11
to roger peppe, christophe petit, golang-nuts
A lot of the time is probably also spent in math.Pow.
I did a quick rework of the code (http://pastie.org/2791625). Without the rand contention and some of the inefficiencies like math.Pow, it runs much faster, and performance improves with more goroutines.

$ gorun oldpi.go 1 10000000
Pi Estimation: 3.142061
Elapsed Time: 3.323649 seconds
$ gorun oldpi.go 2 10000000
Pi Estimation: 3.141887
Elapsed Time: 5.099973 seconds
$ gorun pi.go -n 10000000 -g 1
Pi Estimation: 3.140931
Elapsed Time: 0.513560 seconds
$ gorun pi.go -n 10000000 -g 2
Pi Estimation: 3.141690
Elapsed Time: 0.381710 seconds

However, I didn't really test my version other than to see that it generates a number that looks something like pi, so my benchmark may mean nothing.
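
The pastie link no longer resolves, but the math.Pow point is easy to check with a micro-benchmark. This is my own sketch, not Kyle's code; save it as a _test.go file and run go test -bench=. :

package pisim

import (
    "math"
    "testing"
)

var sink float64

// BenchmarkPowSquare squares a value with the general-purpose math.Pow.
func BenchmarkPowSquare(b *testing.B) {
    x, s := 1.2345, 0.0
    for i := 0; i < b.N; i++ {
        s += math.Pow(x, 2)
    }
    sink = s
}

// BenchmarkMulSquare squares the same value with a plain multiplication.
func BenchmarkMulSquare(b *testing.B) {
    x, s := 1.2345, 0.0
    for i := 0; i < b.N; i++ {
        s += x * x
    }
    sink = s
}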

Kyle Consalus

unread,
Nov 1, 2011, 12:21:58 AM11/1/11
to roger peppe, christophe petit, golang-nuts
.. and to be clear, those numbers are using Go tip (96df3afe1c8a+) on a 2.53 GHz Intel Core 2 Duo running OS X 10.5.8.
Also, the last "Elapsed time" isn't terribly representative; I've gotten 0.25 on all subsequent runs.
But that's what I get for microbenchmarking a short-running application sloppily. 

Michael Jones

unread,
Nov 1, 2011, 5:55:17 AM11/1/11
to Kyle Consalus, roger peppe, christophe petit, golang-nuts
Nice!

Adding 
    "runtime" 
to the import list, and
    runtime.GOMAXPROCS(*nb_process)
just after flag.Parse() makes a program that scales nicely (inversely linear) with CPUs for me.

Also, 1e8 might be a better default iteration count. (~3 seconds)
--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Michael Jones

unread,
Nov 1, 2011, 9:09:29 AM11/1/11
to Kyle Consalus, roger peppe, christophe petit, golang-nuts
Interesting observation. I ran this with a range of goroutine counts on my 2-processor x 4-core Xeon Mac Pro (8 CPUs) and gauged the efficiency simply by watching the CPU utilization bar chart in the system's Activity Monitor. Here are the results:

babar:big mtj$ ./pi -n 1000000000 -g 1
Pi Estimation: 3.141606
Elapsed Time: 42.328063 seconds

babar:big mtj$ ./pi -n 1000000000 -g 2
Pi Estimation: 3.141491
Elapsed Time: 21.222779 seconds

babar:big mtj$ ./pi -n 1000000000 -g 3
Pi Estimation: 3.141594
Elapsed Time: 14.112936 seconds

babar:big mtj$ ./pi -n 1000000000 -g 4
Pi Estimation: 3.141547
Elapsed Time: 21.210744 seconds
babar:big mtj$ #4 runs on 2 CPUs only

babar:big mtj$ ./pi -n 1000000000 -g 5
Pi Estimation: 3.141589
Elapsed Time: 16.952144 seconds
babar:big mtj$ #5 runs on 1 until half done, then 5
 
babar:big mtj$ ./pi -n 1000000000 -g 6
Pi Estimation: 3.141640
Elapsed Time: 14.128896 seconds
babar:big mtj$ #6 runs on 1 until mostly done, then 6
 
babar:big mtj$ ./pi -n 1000000000 -g 7
Pi Estimation: 3.141571
Elapsed Time: 12.116916 seconds
babar:big mtj$ #7 runs on 1 for a while then spreads out
 
babar:big mtj$ ./pi -n 1000000000 -g 8
Pi Estimation: 3.141514
Elapsed Time: 10.607424 seconds
babar:big mtj$ #8 runs on 1 briefly then spreads out
 
babar:big mtj$ ./pi -n 1000000000 -g 9
Pi Estimation: 3.141717
Elapsed Time: 9.614367 seconds

babar:big mtj$ ./pi -n 1000000000 -g 10
Pi Estimation: 3.141628
Elapsed Time: 9.138594 seconds

babar:big mtj$ ./pi -n 1000000000 -g 11
Pi Estimation: 3.141611
Elapsed Time: 8.764789 seconds

babar:big mtj$ ./pi -n 1000000000 -g 16
Pi Estimation: 3.141627
Elapsed Time: 7.993622 seconds

babar:big mtj$ ./pi -n 1000000000 -g 24
Pi Estimation: 3.141527
Elapsed Time: 6.986677 seconds

babar:big mtj$ ./pi -n 1000000000 -g 32
Pi Estimation: 3.141585
Elapsed Time: 7.075722 seconds

Odd behavior. Here's the precise code (Kyle's, modified by me as stated):

package main

import (
    "fmt"
    "flag"
    "rand"
    "runtime"
    "time"
)

func main() {
    nb_process := flag.Int("g", 1, "goroutines")
    nb_iter := flag.Int("n", 1e8, "number of iterations")
    flag.Parse()

    runtime.GOMAXPROCS(*nb_process)

    t0 := time.Nanoseconds()
    pi_eval := random_thread(*nb_process, *nb_iter)
    t1 := time.Nanoseconds()
    fmt.Printf("Pi Estimation: %f\n", pi_eval)
    fmt.Printf("Elapsed Time: %f seconds\n", float64(t1-t0)/1e9)
}

func random_thread(nb_process int, nb_iter int) float64 {
    nb_eff := nb_iter / nb_process
    ch := make(chan float64, nb_process)
    for k := 0; k < nb_process; k++ {
        go pi_process(ch, nb_eff)
    }
    pi := float64(0)
    for l := 0; l < nb_process; l++ {
        pi += <-ch
    }
    return 4 * (pi / (float64)(nb_iter))
}

func pi_process(ch chan float64, nb_eff int) {
    r := rand.New(rand.NewSource(time.Nanoseconds()))
    nb_in := 0
    for i := 0; i < nb_eff; i++ {
        x, y := r.Float64(), r.Float64()
        if x*x+y*y <= 1.0 {
            nb_in++
        }
    }
    ch <- float64(nb_in)
}

Dmitry Vyukov

unread,
Nov 1, 2011, 2:45:13 PM11/1/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Looks OK. What is odd here?
 

Dmitry Vyukov

unread,
Nov 1, 2011, 2:47:31 PM11/1/11
to christophe petit, golang-nuts
Here is a version I prepared some time ago. It does some load-balancing and handles parallel random number generation the right way.

package main

import (
    "flag"
    "fmt"
    "runtime"
)

var (
    nThrow = flag.Int("n", 1e6, "number of throws")
    nCPU   = flag.Int("cpu", 1, "number of CPUs to use")
)

func main() {
    flag.Parse()
    runtime.GOMAXPROCS(*nCPU) // Set number of OS threads to use.

    // Split all throws to that number of parallel tasks.
    // Constant factor of 64 is to ensure sufficient dynamic load balancing.
    nTasks := *nCPU * 64
    // Channel to collect partial results.
    parts := make(chan int, nTasks)
    for i := 0; i < nTasks; i++ {
        // Kick off a parallel task.
        go func(me int) {
            hits := 0
            // Create task-local PRNG to avoid synchronization issues.
            r := MakeLeapFrogRand(nTasks, me)
            // Calculate number of throws for this task.
            n := *nThrow / nTasks
            if me < (*nThrow % nTasks) {
                n++
            }
            // Do the throws.
            for i := 0; i < n; i++ {
                x := r.Float64()
                y := r.Float64()
                if x*x+y*y < 1 {
                    hits++
                }
            }
            parts <- hits // Send the result back.
        }(i)
    }

    // Aggregate partial results.
    hits := 0
    for i := 0; i < nTasks; i++ {
        hits += <-parts
    }
    pi := 4 * float64(hits) / float64(*nThrow)
    fmt.Printf("PI = %g\n", pi)
}

// LeapFrogRand ensures that all parallel tasks
// receive non-overlapping regions of random space.
// That is to ensure absence of oversampled regions.
type LeapFrogRand struct {
    mult, next uint64
    pad        [64]uint8
}

const (
    LeapFrogMult = 764261123
    LeapFrogPmod = 2147483647
)

func MakeLeapFrogRand(total, me int) LeapFrogRand {
    var r LeapFrogRand
    mult := uint64(LeapFrogMult)
    seed := uint64(1)
    for i := 0; i < total; i++ {
        mult = (mult * LeapFrogMult) % LeapFrogPmod
        seed = (seed * LeapFrogMult) % LeapFrogPmod
        if i == me {
            r.next = seed
        }
    }
    r.mult = mult
    return r
}

func (r *LeapFrogRand) Float64() float64 {
    r.next = (r.next * r.mult) % LeapFrogPmod
    return float64(r.next) / float64(LeapFrogPmod)
}

 

Michael Jones

unread,
Nov 1, 2011, 3:16:23 PM11/1/11
to Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
The parts in bold (the annotated runs, -g 4 through -g 7, where the work did not spread across all the CPUs).

Dmitry Vyukov

unread,
Nov 2, 2011, 12:22:46 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
I think it is due to the absence of load balancing.

Dmitry Vyukov

unread,
Nov 2, 2011, 5:53:43 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 1:50 PM, Michael Jones <m...@google.com> wrote:
I changed Kyle's version to add your "and also split the task 64 times more than expected" change:
    pi_eval := random_thread(*nb_process * 64, *nb_iter)

The results are now nicely linear, with CPU usage looking steady: it jumps immediately to N CPUs and runs steadily until the end.

babar:big mtj$ ./pi -n 1000000000 -g 1
Pi Estimation: 3.141580
Elapsed Time: 42.173500 seconds
babar:big mtj$ ./pi -n 1000000000 -g 2
Pi Estimation: 3.141636
Elapsed Time: 21.433001 seconds
babar:big mtj$ ./pi -n 1000000000 -g 3
Pi Estimation: 3.141605
Elapsed Time: 14.288153 seconds
babar:big mtj$ ./pi -n 1000000000 -g 4
Pi Estimation: 3.141662
Elapsed Time: 10.962000 seconds
babar:big mtj$ ./pi -n 1000000000 -g 5
Pi Estimation: 3.141560
Elapsed Time: 8.907828 seconds
babar:big mtj$ ./pi -n 1000000000 -g 6
Pi Estimation: 3.141544
Elapsed Time: 7.273181 seconds
babar:big mtj$ ./pi -n 1000000000 -g 7
Pi Estimation: 3.141570
Elapsed Time: 6.314425 seconds
babar:big mtj$ ./pi -n 1000000000 -g 8
Pi Estimation: 3.141534
Elapsed Time: 5.559293 seconds
babar:big mtj$ ./pi -n 1000000000 -g 9
Pi Estimation: 3.141607
Elapsed Time: 5.541590 seconds
babar:big mtj$ ./pi -n 1000000000 -g 10
Pi Estimation: 3.141619
Elapsed Time: 5.508037 seconds
babar:big mtj$ ./pi -n 1000000000 -g 11
Pi Estimation: 3.141611
Elapsed Time: 5.497377 seconds
babar:big mtj$ ./pi -n 1000000000 -g 12
Pi Estimation: 3.141615
Elapsed Time: 5.486157 seconds
babar:big mtj$ ./pi -n 1000000000 -g 13
Pi Estimation: 3.141604
Elapsed Time: 5.493715 seconds
babar:big mtj$ ./pi -n 1000000000 -g 14
Pi Estimation: 3.141674
Elapsed Time: 5.464580 seconds
babar:big mtj$ ./pi -n 1000000000 -g 15
Pi Estimation: 3.141544
Elapsed Time: 5.460681 seconds
babar:big mtj$ ./pi -n 1000000000 -g 16
Pi Estimation: 3.141614
Elapsed Time: 5.508967 seconds

Thank you for the advice. This seems like an area for further development of the runtime code.

Ensuring proper granularity is the user's responsibility. It's impossible to execute a single goroutine on several processors.

Dmitry Vyukov

unread,
Nov 2, 2011, 6:23:06 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 2:01 PM, Michael Jones <m...@google.com> wrote:
Indeed, but what you say suggests that we misunderstand each other.

I have 8 CPUs. I ran the program with the number of goroutines ranging from 1 to N and explicitly set runtime.GOMAXPROCS to that number each time.

Physical observation showed (as I described carefully case by case):

1 on 1
2 on 2
3 on 3
4 on 2
5 on 1 then 5
6 on 1 then 6
7 on 1 then 7
8 on 8
9 on 9 shared on 8
10 in 10 shared on 8
:

As I said then and again now, it was the cases in bold, such as 4 max procs, 4 goroutines, and 100% CPU on only two CPUs for the whole execution, that puzzled me. It still does. Why are 4 threads on 4 CPUs not granular enough?

Because then the execution time is max(T0, T1, T2, T3), which can be arbitrarily longer than sum(T0, T1, T2, T3)/n (for example, if three tasks take 10 s each and the fourth takes 40 s, the run lasts 40 s even though the balanced ideal on 4 CPUs is 17.5 s).

 
Seems a bug. A round-robin scheduler would not have this problem. A greedy scheduler would not have this problem. A "just launch them all" null scheduler would not have this problem. So, I conjecture that it is in fact a problem.


All types of schedulers have the "problem" (so I would not even call it a problem in the scheduler; it is a problem in user code). Ideally, task granularity is fixed and it is the number of tasks that varies.
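
A minimal sketch of "fixed granularity, variable number of tasks"; the chunk size of one million throws is an assumed value for illustration, not something from the thread:

package main

import "fmt"

// chunkSize is an assumed, fixed amount of work per task.
const chunkSize = 1000000

// splitWork keeps the task granularity fixed and lets the number of tasks
// grow with the total amount of work, rather than with the CPU count.
func splitWork(totalThrows int) []int {
    var tasks []int
    for remaining := totalThrows; remaining > 0; remaining -= chunkSize {
        n := chunkSize
        if remaining < chunkSize {
            n = remaining
        }
        tasks = append(tasks, n)
    }
    return tasks
}

func main() {
    fmt.Println(len(splitWork(1e9)), "tasks of up to", chunkSize, "throws each")
}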

 

Dmitry Vyukov

unread,
Nov 2, 2011, 6:26:39 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Well, after thinking some more, I think the problem may be due to non-preemptive execution.
Could you please try the following modification:


for i := 0; i < nb_eff; i++ {
    x, y := r.Float64(), r.Float64()
    if x*x+y*y <= 1.0 {
        nb_in++
    }
    // vvvvv added vvvvv
    if i%1e4 == 0 {
        runtime.Gosched()
    }
    // ^^^^^ added ^^^^^
}

Dmitry Vyukov

unread,
Nov 2, 2011, 6:56:18 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 2:43 PM, Michael Jones <m...@google.com> wrote:
Hi! That works fine, and it is the same with the Gosched frequency reduced greatly (to every 1e6 iterations).

It does not work in the other extreme case of just one call after each goroutine is dispatched.

I am concerned that this demonstrates a problem.


OK, I think I see what is happening.
The first goroutine starts its calculations. Then the second goroutine allocates the rand object, and at that point the allocation procedure requests a garbage collection. The third and fourth goroutines are successfully parked (or not yet started), but the first goroutine can't stop (it never checks for a pending GC during its calculations), so it just continues until completion. Now the first goroutine finishes its calculations and the GC starts (although at this point all goroutines might already have finished). The 3 remaining goroutines start calculating after the GC. So the total execution time is 2x what it could be. Moreover, the problematic sequence may repeat two more times, and then the execution time is 4x: the goroutines effectively run sequentially.
Such a situation is unlikely to happen in a server-like application, but it can quite easily happen in parallel computation programs.

Dmitry Vyukov

unread,
Nov 2, 2011, 6:59:34 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
To make it clear, fine-grained work splitting is still important (static scheduling, i.e. starting 4 goroutines with N/4 iterations each and expecting them to finish at exactly the same time, almost never works in practice), and it alleviates the problem (a pending GC is checked for "between" goroutines). So it looks like a good solution.

Carl

unread,
Nov 2, 2011, 7:26:02 AM11/2/11
to golang-nuts
I didn't quite get it, so I googled around and found some additional
background information which I feel I should not withhold.

Some keywords: efficient operating system scheduling, multi-core
architectures, performance-asymmetric kernels.

I don't know if this has any impact here, but the activities of the
underlying OS probably deserve some additional consideration.

One possibility: the odd behavior may be due to asymmetric core
performance, attributable to advanced power management. Some parts of
the machine could be powered down to save energy; check the
configuration, set everything to maximum performance, and turn off any
OS energy-efficiency management.

In my experience it is common in load testing to run into this kind of
odd behavior in the first round. In some advanced hardware and software
designs the "turbo" does not kick in before a specific load threshold.
A standard best practice is to ramp the load up in successive steps,
possibly with a warm-up phase, and to sustain the full load for a
meaningful length of time. Then, when graphing the performance results
later, one can easily spot any nonlinear or asymmetric behavior and
correlate the key events over time to get deeper insight. By the way,
it would be a big help if there were some way to count and measure the
"memory management" (GC) events, so that they show up somewhere as
diagnostic info or as a key performance counter with a specific
signature; that would make it easy to correlate this particular event
with the rest of the system's performance during subsequent analysis.

I don't know what the current state of the Linux kernel is with regard
to advanced scheduling (asymmetric core performance etc.) or which
parts of it have been implemented, but I understand there has been some
ongoing research over the last few years.

Hope that helps someone,
cheers.
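
On the wish to count and measure GC events: with today's runtime package, one simple way to surface that information is to snapshot runtime.MemStats around a workload. The helper below is an illustrative sketch, not an existing diagnostic tool:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// reportGC prints how many collections ran and how much pause time they
// added while the given workload executed, which makes it possible to
// correlate GC activity with observed performance.
func reportGC(work func()) {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    start := time.Now()
    work()
    elapsed := time.Since(start)
    runtime.ReadMemStats(&after)
    fmt.Printf("elapsed=%v GCs=%d pause=%v\n",
        elapsed,
        after.NumGC-before.NumGC,
        time.Duration(after.PauseTotalNs-before.PauseTotalNs))
}

func main() {
    reportGC(func() {
        var keep []byte
        for i := 0; i < 100000; i++ {
            keep = make([]byte, 1024) // allocate so the collector has work to do
        }
        _ = keep
    })
}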



Michael Jones

unread,
Nov 2, 2011, 7:05:21 AM11/2/11
to Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
I am sold, and I appreciate your "subdivide a lot more" plan. My experience in this area is with assembler/C/C++ and various fork/thread/pthread approaches, which do indeed manage to run parallel computational tasks in parallel; I do it all the time. That's why I have an 8-core Xeon at home. ;-) For now my lesson will just be "subdivide the work much more than the number of cores when using Go." Thank you for demonstrating that.

Ian Lance Taylor

unread,
Nov 2, 2011, 12:25:02 PM11/2/11
to Dmitry Vyukov, Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Dmitry Vyukov <dvy...@google.com> writes:

> The first goroutine starts its calculations. Then the second goroutine
> allocates the rand object, and at that point the allocation procedure
> requests a garbage collection. The third and fourth goroutines are
> successfully parked (or not yet started), but the first goroutine
> can't stop (it never checks for a pending GC during its calculations),
> so it just continues until completion. Now the first goroutine
> finishes its calculations and the GC starts (although at this point
> all goroutines might already have finished). The 3 remaining
> goroutines start calculating after the GC. So the total execution time
> is 2x what it could be. Moreover, the problematic sequence may repeat
> two more times, and then the execution time is 4x: the goroutines
> effectively run sequentially.
> Such a situation is unlikely to happen in a server-like application,
> but it can quite easily happen in parallel computation programs.

Thanks for the analysis. This is definitely a known problem with the
current scheduler: a goroutine which executes a long running
calculation, one which makes no system calls, is not preempted. With
the current scheduler, you will get better parallelism in a long running
calculation if you occasionally call runtime.Gosched. But that is
really a bug which needs to be fixed.

Ian
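
A minimal sketch of the workaround Ian describes: a tight, syscall-free loop that yields occasionally so other goroutines (and a pending GC) can run under the cooperative scheduler of that era. The 1e4 interval follows Dmitry's earlier suggestion; sumSquares is just a stand-in workload:

package main

import (
    "fmt"
    "runtime"
)

// sumSquares stands in for any long-running, syscall-free calculation.
// Without the periodic Gosched, the cooperative scheduler of that era
// never preempted such a loop.
func sumSquares(n int) float64 {
    s := 0.0
    for i := 0; i < n; i++ {
        x := float64(i)
        s += x * x
        if i%1e4 == 0 {
            runtime.Gosched() // yield so other goroutines (and GC) can run
        }
    }
    return s
}

func main() {
    fmt.Println(sumSquares(1e7))
}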

Michael Jones

unread,
Nov 2, 2011, 2:07:10 PM11/2/11
to Ian Lance Taylor, Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
yes, and the next "go" something is a natural rallying point!

Carl

unread,
Nov 4, 2011, 5:38:08 AM11/4/11
to golang-nuts
On 2 Nov., 17:25, Ian Lance Taylor <i...@google.com> wrote:


> With the current scheduler, you will get better parallelism in a long running
> calculation if you occasionally call runtime.Gosched.  But that is
> really a bug which needs to be fixed.
>
> Ian

I thought about this some more, and for me one question remains: is
this an easy bug to fix, or does it require more substantial work,
making it a longer-term problem for folks doing benchmarks and
measuring performance? If the latter, I think the issue should be
documented in a FAQ somewhere, along with the workaround.

Ian Lance Taylor

unread,
Nov 4, 2011, 9:11:45 AM11/4/11
to Carl, golang-nuts
Carl <2ca...@googlemail.com> writes:

> On 2 Nov., 17:25, Ian Lance Taylor <i...@google.com> wrote:
>
>
>> With the current scheduler, you will get better parallelism in a long running
>> calculation if you occasionally call runtime.Gosched.  But that is
>> really a bug which needs to be fixed.
>

> I thought about this some more, and for me one question remains: is
> this an easy bug to fix, or does it require more substantial work,
> making it a longer-term problem for folks doing benchmarks and
> measuring performance? If the latter, I think the issue should be
> documented in a FAQ somewhere, along with the workaround.

It is not a simple bug to fix.

http://golang.org/doc/go_faq.html#Why_GOMAXPROCS

Ian
