performance on pi calculation with monte carlo


christophe petit

unread,
Oct 31, 2011, 10:50:16 PM10/31/11
to golang-nuts
Hello,

I have a problem with a little program that computes pi with the Monte
Carlo method. I use goroutines to parallelize the loop, and each of
them generates its own random numbers. The goal is, for example, to
have 10 goroutines each doing 1,000 iterations instead of a single
goroutine doing 10,000.

The code works fine, but I don't get the expected gain in execution
time. I always get the same time whether I use 1 goroutine, 2, 10 or
100...

I tried setting the environment variable GOMAXPROCS (to 2, 10,
100...), and it doesn't lower the execution time; worse, it increases
it.

I suspect the problem comes from the rand.Float64 function, which may
not be thread safe, much like POSIX, where rand_r is thread safe and
rand is not.

Here is the code:
----------------------------------------------------------------------------------------------------------


package main

import (
    "fmt"
    "math"
    "flag"
    "rand"
    "strconv"
    "time"
    // "runtime"
)

func main() {

    var pi_eval float64;
    flag.Parse();

    if (flag.NArg() != 2) {
        fmt.Println("Arguments Error : ./6.out nb_process nb_iterations");
    } else {
        nb_process, err := strconv.Atoi(flag.Arg(0));
        if err != nil {
            // Invalid string
            fmt.Println("Error");
        }
        nb_iter, err := strconv.Atoi(flag.Arg(1))
        if err != nil {
            // Invalid string
            fmt.Println("Error");
        }
        // runtime.GOMAXPROCS(nb_process);
        time1 := time.UTC();
        pi_eval = random_thread(nb_process, nb_iter);
        time2 := time.UTC();
        fmt.Printf("Pi Estimation: %f\n", pi_eval);
        fmt.Printf("Elapsed Time: %f seconds\n",
            (float64)((float64)(time2.Nanoseconds()-time1.Nanoseconds())/(float64)(math.Pow(10, 9))));
    }
}

func random_thread(nb_process int, nb_iter int) float64 {

    var k, l int;
    var pi float64 = 0;
    var nb_eff int = nb_iter / nb_process;
    ch := make(chan float64);
    for k = 1; k <= nb_process; k++ {
        go pi_process(ch, k, nb_eff);
    }
    for l = 1; l <= nb_process; l++ {
        pi += <-ch;
    }
    pi = pi / (float64)(nb_iter);
    pi = pi * 4;

    return pi;
}

func pi_process(ch chan float64, k int, nb_eff int) {

    var x, y, z float64;
    var nb_in int = 0;

    rand.Seed(rand.Int63());

    for i := 0; i < nb_eff; i++ {
        x = rand.Float64();
        y = rand.Float64();

        z = math.Pow(x, 2) + math.Pow(y, 2);
        if z <= 1.0 {
            nb_in++;
        }
    }
    ch <- (float64)(nb_in);
}

---------------------------------------------------------------------------------------------------------
For example, here are two executions, one with 1 goroutine and one
with 10 (the second argument is the number of iterations):


$ ./6.out 1 100000000
Pi Estimation: 3.141822
Elapsed Time: 22.086858 seconds

----------------------------------------------------------

$./6.out 10 100000000
Pi Estimation: 3.141577
Elapsed Time: 21.882880 seconds

----------------------------------------------------------


Could you please help me get an actual reduction in execution time
when using more than one goroutine?

Thanks.

andrey mirtchovski

unread,
Oct 31, 2011, 10:54:50 PM10/31/11
to christophe petit, golang-nuts
Others have hit exactly the same problem with exactly the same code.
The issue is that the Go pseudorandom number generator uses a global
state and protects it with a mutex. All the goroutines you spawn are
waiting on that single mutex.
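
For illustration, here is a minimal, self-contained sketch of that pattern. The lockedSource type below is my own stand-in for the global, mutex-protected state being described, not the actual stdlib code, but it shows why every goroutine calling the package-level functions ends up queuing on one lock:

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// lockedSource imitates (in spirit only) a single shared generator
// guarded by one mutex, used by every caller.
type lockedSource struct {
    mu  sync.Mutex
    src rand.Source
}

func (s *lockedSource) Int63() int64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.src.Int63()
}

func (s *lockedSource) Seed(seed int64) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.src.Seed(seed)
}

func main() {
    shared := rand.New(&lockedSource{src: rand.NewSource(1)})
    var wg sync.WaitGroup
    for g := 0; g < 4; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            sum := 0.0
            for i := 0; i < 1000000; i++ {
                sum += shared.Float64() // every call contends on the same mutex
            }
            _ = sum
        }()
    }
    wg.Wait()
    fmt.Println("done: four goroutines serialized on a single lock")
}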

roger peppe

unread,
Oct 31, 2011, 11:47:02 PM10/31/11
to christophe petit, golang-nuts
andrey's explanation sounds plausible to me.

use rand.New (the generator created this way is not
mutexed) and use crypto/rand.Int to seed (using rand to seed itself
is probably not going to give you much more randomness!)
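
A minimal sketch of that suggestion, using today's import paths (this thread predates Go 1, when the package was plain "rand"); the newLocalRand helper is my own name, not an API from the thread:

package main

import (
    crand "crypto/rand"
    "fmt"
    "math"
    "math/big"
    "math/rand"
)

// newLocalRand returns a per-goroutine generator (no shared mutex) seeded
// from crypto/rand, so concurrent workers neither contend nor share a seed.
func newLocalRand() (*rand.Rand, error) {
    seed, err := crand.Int(crand.Reader, big.NewInt(math.MaxInt64))
    if err != nil {
        return nil, err
    }
    return rand.New(rand.NewSource(seed.Int64())), nil
}

func main() {
    r, err := newLocalRand()
    if err != nil {
        panic(err)
    }
    fmt.Println(r.Float64(), r.Float64())
}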

Kyle Consalus

unread,
Nov 1, 2011, 12:14:24 AM11/1/11
to roger peppe, christophe petit, golang-nuts
A lot of the time is probably also spent in math.Pow.
I did a quick rework of the code (http://pastie.org/2791625). Without the rand contention and some of the inefficiencies like math.Pow, it runs much faster, and performance improves with more goroutines.

$ gorun oldpi.go 1 10000000
Pi Estimation: 3.142061
Elapsed Time: 3.323649 seconds
$ gorun oldpi.go 2 10000000
Pi Estimation: 3.141887
Elapsed Time: 5.099973 seconds
$ gorun pi.go -n 10000000 -g 1
Pi Estimation: 3.140931
Elapsed Time: 0.513560 seconds
$ gorun pi.go -n 10000000 -g 2
Pi Estimation: 3.141690
Elapsed Time: 0.381710 seconds

However, I didn't really test my version other than to see that it generates a number that looks something like pi, so my benchmark may mean nothing.
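
The pastie link no longer resolves, but the math.Pow point is easy to check with a micro-benchmark. This is my own sketch, not Kyle's code; save it as a _test.go file and run go test -bench=. :

package pisim

import (
    "math"
    "testing"
)

var sink float64

// BenchmarkPowSquare squares a value with the general-purpose math.Pow.
func BenchmarkPowSquare(b *testing.B) {
    x, s := 1.2345, 0.0
    for i := 0; i < b.N; i++ {
        s += math.Pow(x, 2)
    }
    sink = s
}

// BenchmarkMulSquare squares the same value with a plain multiplication.
func BenchmarkMulSquare(b *testing.B) {
    x, s := 1.2345, 0.0
    for i := 0; i < b.N; i++ {
        s += x * x
    }
    sink = s
}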

Kyle Consalus

unread,
Nov 1, 2011, 12:21:58 AM11/1/11
to roger peppe, christophe petit, golang-nuts
.. and to be clear, those numbers are using Go tip (96df3afe1c8a+) on a 2.53 GHz Intel Core 2 Duo running OS X 10.5.8.
Also, the last "Elapsed time" isn't terribly representative; I've gotten 0.25 on all subsequent runs.
But that's what I get for microbenchmarking a short-running application sloppily. 

Michael Jones

unread,
Nov 1, 2011, 5:55:17 AM11/1/11
to Kyle Consalus, roger peppe, christophe petit, golang-nuts
Nice!

Adding 
    "runtime" 
to the import list, and
    runtime.GOMAXPROCS(*nb_process)
just after flag.Parse() makes a program that scales nicely (inversely linear) with CPUs for me.

Also, 1e8 might be a better default iteration count. (~3 seconds)
--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Michael Jones

unread,
Nov 1, 2011, 9:09:29 AM11/1/11
to Kyle Consalus, roger peppe, christophe petit, golang-nuts
Interesting observation. I ran this with a range of goroutine counts on my 2-processor x 4-core Xeon Mac Pro (8 CPUs) and gauged the efficiency simply by watching the CPU utilization bar chart in the system's Activity Monitor. Here are the results:

babar:big mtj$ ./pi -n 1000000000 -g 1
Pi Estimation: 3.141606
Elapsed Time: 42.328063 seconds

babar:big mtj$ ./pi -n 1000000000 -g 2
Pi Estimation: 3.141491
Elapsed Time: 21.222779 seconds

babar:big mtj$ ./pi -n 1000000000 -g 3
Pi Estimation: 3.141594
Elapsed Time: 14.112936 seconds

babar:big mtj$ ./pi -n 1000000000 -g 4
Pi Estimation: 3.141547
Elapsed Time: 21.210744 seconds
babar:big mtj$ #4 runs on 2 CPUs only

babar:big mtj$ ./pi -n 1000000000 -g 5
Pi Estimation: 3.141589
Elapsed Time: 16.952144 seconds
babar:big mtj$ #5 runs on 1 until half done, then 5
 
babar:big mtj$ ./pi -n 1000000000 -g 6
Pi Estimation: 3.141640
Elapsed Time: 14.128896 seconds
babar:big mtj$ #6 runs on 1 until mostly done, then 6
 
babar:big mtj$ ./pi -n 1000000000 -g 7
Pi Estimation: 3.141571
Elapsed Time: 12.116916 seconds
babar:big mtj$ #7 runs on 1 for a while then spreads out
 
babar:big mtj$ ./pi -n 1000000000 -g 8
Pi Estimation: 3.141514
Elapsed Time: 10.607424 seconds
babar:big mtj$ #8 runs on 1 briefly then spreads out
 
babar:big mtj$ ./pi -n 1000000000 -g 9
Pi Estimation: 3.141717
Elapsed Time: 9.614367 seconds

babar:big mtj$ ./pi -n 1000000000 -g 10
Pi Estimation: 3.141628
Elapsed Time: 9.138594 seconds

babar:big mtj$ ./pi -n 1000000000 -g 11
Pi Estimation: 3.141611
Elapsed Time: 8.764789 seconds

babar:big mtj$ ./pi -n 1000000000 -g 16
Pi Estimation: 3.141627
Elapsed Time: 7.993622 seconds

babar:big mtj$ ./pi -n 1000000000 -g 24
Pi Estimation: 3.141527
Elapsed Time: 6.986677 seconds

babar:big mtj$ ./pi -n 1000000000 -g 32
Pi Estimation: 3.141585
Elapsed Time: 7.075722 seconds

Odd behavior. Here's the precise code (Kyle's, modified by me as stated):

package main

import (
    "fmt"
    "flag"
    "rand"
    "runtime"
    "time"
)

func main() {
    nb_process := flag.Int("g", 1, "goroutines")
    nb_iter := flag.Int("n", 1e8, "number of iterations")
    flag.Parse()

    runtime.GOMAXPROCS(*nb_process)

    t0 := time.Nanoseconds()
    pi_eval := random_thread(*nb_process, *nb_iter)
    t1 := time.Nanoseconds()
    fmt.Printf("Pi Estimation: %f\n", pi_eval)
    fmt.Printf("Elapsed Time: %f seconds\n", float64(t1-t0)/1e9)
}

func random_thread(nb_process int, nb_iter int) float64 {
    nb_eff := nb_iter / nb_process
    ch := make(chan float64, nb_process)
    for k := 0; k < nb_process; k++ {
        go pi_process(ch, nb_eff)
    }
    pi := float64(0)
    for l := 0; l < nb_process; l++ {
        pi += <-ch
    }
    return 4 * (pi / (float64)(nb_iter))
}

func pi_process(ch chan float64, nb_eff int) {
    r := rand.New(rand.NewSource(time.Nanoseconds()))
    nb_in := 0
    for i := 0; i < nb_eff; i++ {
        x, y := r.Float64(), r.Float64()
        if x*x+y*y <= 1.0 {
            nb_in++
        }
    }
    ch <- float64(nb_in)
}

Dmitry Vyukov

unread,
Nov 1, 2011, 2:45:13 PM11/1/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Looks OK. What is odd here?
 

Dmitry Vyukov

unread,
Nov 1, 2011, 2:47:31 PM11/1/11
to christophe petit, golang-nuts
Here is a version I prepared some time ago. It does some load-balancing and handles parallel random number generation the right way.

package main

import (
    "flag"
    "fmt"
    "runtime"
)

var (
    nThrow = flag.Int("n", 1e6, "number of throws")
    nCPU   = flag.Int("cpu", 1, "number of CPUs to use")
)

func main() {
    flag.Parse()
    runtime.GOMAXPROCS(*nCPU) // Set number of OS threads to use.

    // Split all throws to that number of parallel tasks.
    // Constant factor of 64 is to ensure sufficient dynamic load balancing.
    nTasks := *nCPU * 64
    // Channel to collect partial results.
    parts := make(chan int, nTasks)
    for i := 0; i < nTasks; i++ {
        // Kick off a parallel task.
        go func(me int) {
            hits := 0
            // Create task-local PRNG to avoid synchronization issues.
            r := MakeLeapFrogRand(nTasks, me)
            // Calculate number of throws for this task.
            n := *nThrow / nTasks
            if me < (*nThrow % nTasks) {
                n++
            }
            // Do the throws.
            for i := 0; i < n; i++ {
                x := r.Float64()
                y := r.Float64()
                if x*x+y*y < 1 {
                    hits++
                }
            }
            parts <- hits // Send the result back.
        }(i)
    }

    // Aggregate partial results.
    hits := 0
    for i := 0; i < nTasks; i++ {
        hits += <-parts
    }
    pi := 4 * float64(hits) / float64(*nThrow)
    fmt.Printf("PI = %g\n", pi)
}

// LeapFrogRand ensures that all parallel tasks
// receive non-overlapping regions of random space.
// That is to ensure absence of oversampled regions.
type LeapFrogRand struct {
    mult, next uint64
    pad        [64]uint8
}

const (
    LeapFrogMult = 764261123
    LeapFrogPmod = 2147483647
)

func MakeLeapFrogRand(total, me int) LeapFrogRand {
    var r LeapFrogRand
    mult := uint64(LeapFrogMult)
    seed := uint64(1)
    for i := 0; i < total; i++ {
        mult = (mult * LeapFrogMult) % LeapFrogPmod
        seed = (seed * LeapFrogMult) % LeapFrogPmod
        if i == me {
            r.next = seed
        }
    }
    r.mult = mult
    return r
}

func (r *LeapFrogRand) Float64() float64 {
    r.next = (r.next * r.mult) % LeapFrogPmod
    return float64(r.next) / float64(LeapFrogPmod)
}

 

Michael Jones

unread,
Nov 1, 2011, 3:16:23 PM11/1/11
to Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
The parts in bold (the annotated runs, -g 4 through -g 7, where the work did not spread across all the CPUs).

Dmitry Vyukov

unread,
Nov 2, 2011, 12:22:46 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
I think it is due to the absence of load balancing.

Dmitry Vyukov

unread,
Nov 2, 2011, 5:53:43 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 1:50 PM, Michael Jones <m...@google.com> wrote:
I changed Kyle's version to add your "and also split the task 64 times more than expected" change:
    pi_eval := random_thread(*nb_process * 64, *nb_iter)

The results are now nicely linear, with CPU usage looking steady: it jumps immediately to N CPUs and runs steadily until the end.

babar:big mtj$ ./pi -n 1000000000 -g 1
Pi Estimation: 3.141580
Elapsed Time: 42.173500 seconds
babar:big mtj$ ./pi -n 1000000000 -g 2
Pi Estimation: 3.141636
Elapsed Time: 21.433001 seconds
babar:big mtj$ ./pi -n 1000000000 -g 3
Pi Estimation: 3.141605
Elapsed Time: 14.288153 seconds
babar:big mtj$ ./pi -n 1000000000 -g 4
Pi Estimation: 3.141662
Elapsed Time: 10.962000 seconds
babar:big mtj$ ./pi -n 1000000000 -g 5
Pi Estimation: 3.141560
Elapsed Time: 8.907828 seconds
babar:big mtj$ ./pi -n 1000000000 -g 6
Pi Estimation: 3.141544
Elapsed Time: 7.273181 seconds
babar:big mtj$ ./pi -n 1000000000 -g 7
Pi Estimation: 3.141570
Elapsed Time: 6.314425 seconds
babar:big mtj$ ./pi -n 1000000000 -g 8
Pi Estimation: 3.141534
Elapsed Time: 5.559293 seconds
babar:big mtj$ ./pi -n 1000000000 -g 9
Pi Estimation: 3.141607
Elapsed Time: 5.541590 seconds
babar:big mtj$ ./pi -n 1000000000 -g 10
Pi Estimation: 3.141619
Elapsed Time: 5.508037 seconds
babar:big mtj$ ./pi -n 1000000000 -g 11
Pi Estimation: 3.141611
Elapsed Time: 5.497377 seconds
babar:big mtj$ ./pi -n 1000000000 -g 12
Pi Estimation: 3.141615
Elapsed Time: 5.486157 seconds
babar:big mtj$ ./pi -n 1000000000 -g 13
Pi Estimation: 3.141604
Elapsed Time: 5.493715 seconds
babar:big mtj$ ./pi -n 1000000000 -g 14
Pi Estimation: 3.141674
Elapsed Time: 5.464580 seconds
babar:big mtj$ ./pi -n 1000000000 -g 15
Pi Estimation: 3.141544
Elapsed Time: 5.460681 seconds
babar:big mtj$ ./pi -n 1000000000 -g 16
Pi Estimation: 3.141614
Elapsed Time: 5.508967 seconds

Thank you for the advice. This seems like an area for further development of the runtime code.

Ensuring proper granularity is the user's responsibility. It's impossible to execute a single goroutine on several processors.

Dmitry Vyukov

unread,
Nov 2, 2011, 6:23:06 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 2:01 PM, Michael Jones <m...@google.com> wrote:
Indeed, but what you say suggests that we misunderstand each other.

I have 8 CPUs. I ran the program with the number of goroutines ranging from 1 to N and explicitly set runtime.GOMAXPROCS to that number each time.

Physical observation showed (as I described carefully case by case):

1 on 1
2 on 2
3 on 3
4 on 2
5 on 1 then 5
6 on 1 then 6
7 on 1 then 7
8 on 8
9 on 9 shared on 8
10 in 10 shared on 8
:

As I said then and again now, it was the cases in bold, such as 4 max procs, 4 goroutines, and 100% CPU on only two CPUs for the whole execution, that puzzled me. It still does. Why are 4 threads on 4 CPUs not granular enough?

Because then the execution time is max(T0, T1, T2, T3), which can be arbitrarily longer than sum(T0, T1, T2, T3)/n (for example, if three tasks take 10 s each and the fourth takes 40 s, the run lasts 40 s even though the balanced ideal on 4 CPUs is 17.5 s).

 
Seems a bug. A round-robin scheduler would not have this problem. A greedy scheduler would not have this problem. A "just launch them all" null scheduler would not have this problem. So, I conjecture that it is in fact a problem.


All types of schedulers have the "problem" (so I would not even call it a problem in the scheduler; it is a problem in user code). Ideally, task granularity is fixed and it is the number of tasks that varies.
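
A minimal sketch of "fixed granularity, variable number of tasks"; the chunk size of one million throws is an assumed value for illustration, not something from the thread:

package main

import "fmt"

// chunkSize is an assumed, fixed amount of work per task.
const chunkSize = 1000000

// splitWork keeps the task granularity fixed and lets the number of tasks
// grow with the total amount of work, rather than with the CPU count.
func splitWork(totalThrows int) []int {
    var tasks []int
    for remaining := totalThrows; remaining > 0; remaining -= chunkSize {
        n := chunkSize
        if remaining < chunkSize {
            n = remaining
        }
        tasks = append(tasks, n)
    }
    return tasks
}

func main() {
    fmt.Println(len(splitWork(1e9)), "tasks of up to", chunkSize, "throws each")
}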

 

Dmitry Vyukov

unread,
Nov 2, 2011, 6:26:39 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Well, after thinking some more, I think the problem may be due to non-preemptive execution.
Could you please try the following modification:


for i := 0; i < nb_eff; i++ {
    x, y := r.Float64(), r.Float64()
    if x*x+y*y <= 1.0 {
        nb_in++
    }
    // vvvvv added vvvvv
    if i%1e4 == 0 {
        runtime.Gosched()
    }
    // ^^^^^ added ^^^^^
}

Dmitry Vyukov

unread,
Nov 2, 2011, 6:56:18 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
On Wed, Nov 2, 2011 at 2:43 PM, Michael Jones <m...@google.com> wrote:
Hi! That works fine, and it is the same with the Gosched frequency reduced greatly (to every 1e6 iterations).

It does not work in the other extreme case of just one call after each goroutine is dispatched.

I am concerned that this demonstrates a problem.


OK, I think I see what is happening.
The first goroutine starts its calculations. Then the second goroutine allocates the rand object, and at that point the allocation procedure requests a garbage collection. The third and fourth goroutines are successfully parked (or not yet started), but the first goroutine can't stop (it never checks for a pending GC during its calculations), so it just continues until completion. Now the first goroutine finishes its calculations and the GC starts (although at this point all goroutines might already have finished). The 3 remaining goroutines start calculating after the GC. So the total execution time is 2x what it could be. Moreover, the problematic sequence may repeat two more times, and then the execution time is 4x: the goroutines effectively run sequentially.
Such a situation is unlikely to happen in a server-like application, but it can quite easily happen in parallel computation programs.

Dmitry Vyukov

unread,
Nov 2, 2011, 6:59:34 AM11/2/11
to Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
To make it clear, fine-grained work splitting is still important (static scheduling, i.e. starting 4 goroutines with N/4 iterations each and expecting them to finish at exactly the same time, almost never works in practice), and it alleviates the problem (a pending GC is checked for "between" goroutines). So it looks like a good solution.

Carl

unread,
Nov 2, 2011, 7:26:02 AM11/2/11
to golang-nuts
I didn't quite get it, so I googled around and found some additional
background information which I feel I should not withhold.

Some keywords: efficient operating system scheduling, multi-core
architectures, performance-asymmetric kernels.

I don't know if this has any impact here, but the activities of the
underlying OS probably deserve some additional consideration.

One possibility: the odd behavior may be due to asymmetric core
performance, attributable to advanced power management. Some parts of
the machine could be powered down to save energy; check the
configuration, set everything to maximum performance, and turn off any
OS energy-efficiency management.

In my experience it is common in load testing to run into this kind of
odd behavior in the first round. In some advanced hardware and software
designs the "turbo" does not kick in before a specific load threshold.
A standard best practice is to ramp the load up in successive steps,
possibly with a warm-up phase, and to sustain the full load for a
meaningful length of time. Then, when graphing the performance results
later, one can easily spot any nonlinear or asymmetric behavior and
correlate the key events over time to get deeper insight. By the way,
it would be a big help if there were some way to count and measure the
"memory management" (GC) events, so that they show up somewhere as
diagnostic info or as a key performance counter with a specific
signature; that would make it easy to correlate this particular event
with the rest of the system's performance during subsequent analysis.

I don't know what the current state of the Linux kernel is with regard
to advanced scheduling (asymmetric core performance etc.) or which
parts of it have been implemented, but I understand there has been some
ongoing research over the last few years.

Hope that helps someone,
cheers.
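
On the wish to count and measure GC events: with today's runtime package, one simple way to surface that information is to snapshot runtime.MemStats around a workload. The helper below is an illustrative sketch, not an existing diagnostic tool:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// reportGC prints how many collections ran and how much pause time they
// added while the given workload executed, which makes it possible to
// correlate GC activity with observed performance.
func reportGC(work func()) {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    start := time.Now()
    work()
    elapsed := time.Since(start)
    runtime.ReadMemStats(&after)
    fmt.Printf("elapsed=%v GCs=%d pause=%v\n",
        elapsed,
        after.NumGC-before.NumGC,
        time.Duration(after.PauseTotalNs-before.PauseTotalNs))
}

func main() {
    reportGC(func() {
        var keep []byte
        for i := 0; i < 100000; i++ {
            keep = make([]byte, 1024) // allocate so the collector has work to do
        }
        _ = keep
    })
}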



Michael Jones

unread,
Nov 2, 2011, 7:05:21 AM11/2/11
to Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
I am sold, and I appreciate your "subdivide a lot more" plan. My experience in this area is with assembler/C/C++ and various fork/thread/pthread approaches, which do indeed manage to run parallel computational tasks in parallel; I do it all the time. That's why I have an 8-core Xeon at home. ;-) For now my lesson will just be "subdivide the work much more than the number of cores when using Go." Thank you for demonstrating that.

Ian Lance Taylor

unread,
Nov 2, 2011, 12:25:02 PM11/2/11
to Dmitry Vyukov, Michael Jones, Kyle Consalus, roger peppe, christophe petit, golang-nuts
Dmitry Vyukov <dvy...@google.com> writes:

> The first goroutine starts its calculations. Then the second goroutine
> allocates the rand object, and at that point the allocation procedure
> requests a garbage collection. The third and fourth goroutines are
> successfully parked (or not yet started), but the first goroutine
> can't stop (it never checks for a pending GC during its calculations),
> so it just continues until completion. Now the first goroutine
> finishes its calculations and the GC starts (although at this point
> all goroutines might already have finished). The 3 remaining
> goroutines start calculating after the GC. So the total execution time
> is 2x what it could be. Moreover, the problematic sequence may repeat
> two more times, and then the execution time is 4x: the goroutines
> effectively run sequentially.
> Such a situation is unlikely to happen in a server-like application,
> but it can quite easily happen in parallel computation programs.

Thanks for the analysis. This is definitely a known problem with the
current scheduler: a goroutine which executes a long running
calculation, one which makes no system calls, is not preempted. With
the current scheduler, you will get better parallelism in a long running
calculation if you occasionally call runtime.Gosched. But that is
really a bug which needs to be fixed.

Ian
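
A minimal sketch of the workaround Ian describes: a tight, syscall-free loop that yields occasionally so other goroutines (and a pending GC) can run under the cooperative scheduler of that era. The 1e4 interval follows Dmitry's earlier suggestion; sumSquares is just a stand-in workload:

package main

import (
    "fmt"
    "runtime"
)

// sumSquares stands in for any long-running, syscall-free calculation.
// Without the periodic Gosched, the cooperative scheduler of that era
// never preempted such a loop.
func sumSquares(n int) float64 {
    s := 0.0
    for i := 0; i < n; i++ {
        x := float64(i)
        s += x * x
        if i%1e4 == 0 {
            runtime.Gosched() // yield so other goroutines (and GC) can run
        }
    }
    return s
}

func main() {
    fmt.Println(sumSquares(1e7))
}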

Michael Jones

unread,
Nov 2, 2011, 2:07:10 PM11/2/11
to Ian Lance Taylor, Dmitry Vyukov, Kyle Consalus, roger peppe, christophe petit, golang-nuts
yes, and the next "go" something is a natural rallying point!

Carl

unread,
Nov 4, 2011, 5:38:08 AM11/4/11
to golang-nuts
On 2 Nov., 17:25, Ian Lance Taylor <i...@google.com> wrote:


> With the current scheduler, you will get better parallelism in a long running
> calculation if you occasionally call runtime.Gosched.  But that is
> really a bug which needs to be fixed.
>
> Ian

I thought about this some more, and for me one question remains: is
this an easy bug to fix, or does it require more substantial work,
making it a longer-term problem for folks doing benchmarks and
measuring performance? If the latter, I think the issue should be
documented in a FAQ somewhere, along with the workaround.

Ian Lance Taylor

unread,
Nov 4, 2011, 9:11:45 AM11/4/11
to Carl, golang-nuts
Carl <2ca...@googlemail.com> writes:

> On 2 Nov., 17:25, Ian Lance Taylor <i...@google.com> wrote:
>
>
>> With the current scheduler, you will get better parallelism in a long running
>> calculation if you occasionally call runtime.Gosched.  But that is
>> really a bug which needs to be fixed.
>

> I thought about this some more, and for me one question remains: is
> this an easy bug to fix, or does it require more substantial work,
> making it a longer-term problem for folks doing benchmarks and
> measuring performance? If the latter, I think the issue should be
> documented in a FAQ somewhere, along with the workaround.

It is not a simple bug to fix.

http://golang.org/doc/go_faq.html#Why_GOMAXPROCS

Ian
