Go scheduler update


Dmitry Vyukov

Feb 8, 2013, 9:09:58 AM
to Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
Good news everyone!

I've updated my scheduler patch to tip:
https://codereview.appspot.com/7314062
(I had to create a new changelist due to some issues with the codereview site)

It passed all possible tests on the std lib. The race detector also
seems to be happy. And I discovered a bug in the current runtime
thanks to the patch:
https://codereview.appspot.com/7307073/diff/5001/src/pkg/runtime/parfor.c

Brad Fitzpatrick

Feb 8, 2013, 10:30:55 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
That CL description (and this email) is somewhat lacking in detail.  :-)



Ugorji

Feb 8, 2013, 12:13:15 PM
to golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Hi Dmitry,

Thanks so much for the update. 

Is this back on track for Go 1.1? Would love to know before getting excited about it. 
Also, would this help with CPU usage and scalability on high-volume servers?

For the information of others: it seems this is a continuation of the old CL: https://codereview.appspot.com/6441097/

Patrick Mylund Nielsen

Feb 8, 2013, 12:27:42 PM
to Brad Fitzpatrick, Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
Yes. It's an interesting feeling of excitement and absolute cluelessness about what this does.

Leo Ferres

Feb 8, 2013, 1:22:31 PM
to Dmitry Vyukov, Peter A. Buhr, Keith Randall, Sean Bauer, golang-dev, Javier Rodriguez

Great job, Dmitry! Did you manage to run some benchmarks against the round-robin scheduler? Should we now implement the number-crunching scheduler against this version?



Dmitry Vyukov

Feb 12, 2013, 11:28:48 AM
to Leo Ferres, Peter A. Buhr, Keith Randall, Sean Bauer, golang-dev, Javier Rodriguez
On Fri, Feb 8, 2013 at 10:22 PM, Leo Ferres <leof...@gmail.com> wrote:
> Great job, Dmitry! Did you manage to run some benchmarks against the round
> robin scheduler? Should we now implement the number crunching scheduler
> against this version?


Do you mean the current scheduler?
No, I have not run any benchmarks; this is just a code sync.

Dmitry Vyukov

Feb 19, 2013, 8:58:59 AM
to Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi,

I've done a lot of cleaning/tuning/refactoring over the last few days.
And also updated the net poller patch:
https://codereview.appspot.com/7326051
(now you can apply them together, or only the scheduler patch)

I have some great numbers:

net:
benchmark                           old ns/op    new ns/op    delta
BenchmarkTCPPersistent                  42107         7145  -83.03%
BenchmarkTCPPersistent-2                24177         5579  -76.92%
BenchmarkTCPPersistent-4                14245         3880  -72.76%
BenchmarkTCPPersistent-8                14653         2470  -83.14%
BenchmarkTCPPersistent-16               17408         2199  -87.37%
BenchmarkTCPPersistent-32                7134         2193  -69.26%
BenchmarkTCPPersistentTimeout           48570         7387  -84.79%
BenchmarkTCPPersistentTimeout-2         23810         6227  -73.85%
BenchmarkTCPPersistentTimeout-4         16876         3800  -77.48%
BenchmarkTCPPersistentTimeout-8         16072         2465  -84.66%
BenchmarkTCPPersistentTimeout-16        17017         1943  -88.58%
BenchmarkTCPPersistentTimeout-32         6894         2155  -68.74%
BenchmarkTCPOneShot                    124367        46281  -62.79%
BenchmarkTCPOneShot-2                   60165        31802  -47.14%
BenchmarkTCPOneShot-4                   36388        18533  -49.07%
BenchmarkTCPOneShot-8                   26048        14382  -44.79%
BenchmarkTCPOneShot-16                  19560        16013  -18.13%
BenchmarkTCPOneShot-32                  25307        15862  -37.32%
BenchmarkTCPOneShotTimeout             118402        58643  -50.47%
BenchmarkTCPOneShotTimeout-2            63786        34907  -45.27%
BenchmarkTCPOneShotTimeout-4            37546        19545  -47.94%
BenchmarkTCPOneShotTimeout-8            23137        13881  -40.01%
BenchmarkTCPOneShotTimeout-16           19154        15478  -19.19%
BenchmarkTCPOneShotTimeout-32           18417        16006  -13.09%

rpc (not parallel, so not much speedup expected):
benchmark                        old ns/op    new ns/op    delta
BenchmarkEndToEnd                   119289        24226  -79.69%
BenchmarkEndToEnd-2                  47349        31605  -33.25%
BenchmarkEndToEnd-4                  23231        18750  -19.29%
BenchmarkEndToEnd-8                  17798        14293  -19.69%
BenchmarkEndToEnd-16                 17328        13443  -22.42%
BenchmarkEndToEnd-32                 17552        12907  -26.46%
BenchmarkEndToEndHTTP               119738        24540  -79.51%
BenchmarkEndToEndHTTP-2              37950        30588  -19.40%
BenchmarkEndToEndHTTP-4              22885        18516  -19.09%
BenchmarkEndToEndHTTP-8              16167        14363  -11.16%
BenchmarkEndToEndHTTP-16             16985        13872  -18.33%
BenchmarkEndToEndHTTP-32             17423        13071  -24.98%
BenchmarkEndToEndAsync               59772        16941  -71.66%
BenchmarkEndToEndAsync-2             29662        11943  -59.74%
BenchmarkEndToEndAsync-4             16570        10582  -36.14%
BenchmarkEndToEndAsync-8             17036        13197  -22.53%
BenchmarkEndToEndAsync-16            17669        12718  -28.02%
BenchmarkEndToEndAsync-32            17629        13154  -25.38%
BenchmarkEndToEndAsyncHTTP           62488        16905  -72.95%
BenchmarkEndToEndAsyncHTTP-2         28810        12126  -57.91%
BenchmarkEndToEndAsyncHTTP-4         17742        10417  -41.29%
BenchmarkEndToEndAsyncHTTP-8         17258        12745  -26.15%
BenchmarkEndToEndAsyncHTTP-16        16881        13349  -20.92%
BenchmarkEndToEndAsyncHTTP-32        17340        12926  -25.46%

test/bench/shootout/threadring.go -n=10000000
                     old     new
GOMAXPROCS=1         1.965s  1.532s
GOMAXPROCS=2        42.090s  5.190s

vitess server (with memcached stub):
GOMAXPROCS=8, 64 connections
old: 50k qps, 890% CPU
new: 88k qps, 765% CPU
GOMAXPROCS=8, 3000 connections
old: 50k qps, 890% CPU
new: 70k qps, 775% CPU
GOMAXPROCS=32, 64 connections
old: 47k qps, 1350% CPU
new: 130k qps, 1850% CPU
GOMAXPROCS=32, 3000 connections
old: 48k qps, 1400% CPU
new: 110k qps, 1700% CPU

An interesting difference is that the old scheduler tends to use GOMAXPROCS+1 CPUs under load, while the new one uses a bit less than GOMAXPROCS CPUs.

The bad news is that the http tests sometimes hang. This seems to happen only with IPv6. The hang goes away if I comment out the "fd.rblock = true" line in https://codereview.appspot.com/7326051/diff/5001/src/pkg/net/fd_linux_amd64.go. The line basically says: if you try to read N bytes from a TCP socket and get fewer bytes, then next time do not try to read, but wait for the epoll signal first. The docs briefly mention that this is a correct optimization, but I can't find more detailed info. What happens is that a socket read returns fewer than the requested bytes, and then the epoll edge-triggered notification is not delivered; however, if I read one additional time just to get a return value of 0, the epoll notification is delivered normally.
I can't reproduce the misbehavior in a simple C test. I know how that sounds, but for now it looks like a kernel IPv6 bug...
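
In code, the optimization in question looks roughly like the sketch below. This is illustrative only, not the actual net poller change; the name readUntilBlocked and the callback are made up:

package sketch

import "syscall"

// readUntilBlocked drains an edge-triggered (EPOLLET) socket after
// epoll has reported EPOLLIN for fd, calling handle for each chunk.
func readUntilBlocked(fd int, buf []byte, handle func([]byte)) error {
	for {
		n, err := syscall.Read(fd, buf)
		if n > 0 {
			handle(buf[:n])
		}
		switch {
		case err == syscall.EAGAIN:
			// Socket drained; epoll will signal when more data arrives.
			return nil
		case err != nil:
			return err
		case n == 0:
			return nil // EOF
		case n < len(buf):
			// The "fd.rblock = true" optimization: treat a short read
			// as "socket buffer empty" and skip the extra Read that
			// would just return EAGAIN, waiting for the next EPOLLIN
			// instead. In the hang described above, the edge-triggered
			// notification sometimes never arrives on IPv6 after this
			// early return, whereas issuing the extra Read first makes
			// it arrive normally.
			return nil
		}
	}
}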


Russ Cox

Feb 19, 2013, 11:31:03 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
How do you propose to get these CLs submitted? They're a bit large right now.

Russ

Dmitry Vyukov

Feb 20, 2013, 4:24:33 AM
to Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Tue, Feb 19, 2013 at 8:31 PM, Russ Cox <r...@golang.org> wrote:
How do you propose to get these CLs submitted? They're a bit large right now.


I can submit changes to the bulk of the changed files in separate CLs (by introducing runtime.entersyscallblock() and runtime·gsignalstk).
This will leave only proc.c and mgc0.c. Then I can submit some leaf code in separate CLs (e.g. introduce the P struct, implement the per-P and global run queues); that will be dead code initially. Then the bulk of the changes to proc.c will be a single CL, still big, but I do not see a way to split it (it's all very tightly coupled).
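
For readers following along, the pieces named above fit together roughly as in this Go-flavored sketch. The real code is C inside the runtime; the names, the single mutex, and the slice-based queues here are simplifications, not the actual implementation:

package sketch

import "sync"

// G is a goroutine; P is a scheduling context. There are up to
// GOMAXPROCS Ps, each owning a local run queue.
type G struct{ id int }

type P struct {
	runq []*G // per-P run queue
}

// Sched is the global state: a shared run queue plus all Ps.
type Sched struct {
	mu      sync.Mutex
	globalq []*G // global run queue
	allp    []*P
}

// findRunnable shows the intended lookup order: local queue first,
// then the global queue, then stealing half of another P's queue.
// (The real design accesses the local queue without a lock; one big
// mutex keeps this sketch short and race-free.)
func (s *Sched) findRunnable(p *P) *G {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n := len(p.runq); n > 0 {
		g := p.runq[n-1] // LIFO pop keeps cache-hot work local
		p.runq = p.runq[:n-1]
		return g
	}
	if n := len(s.globalq); n > 0 {
		g := s.globalq[0] // FIFO from the global queue
		s.globalq = s.globalq[1:]
		return g
	}
	for _, victim := range s.allp {
		if victim == p || len(victim.runq) == 0 {
			continue
		}
		half := (len(victim.runq) + 1) / 2 // steal half to amortize
		stolen := victim.runq[:half]
		victim.runq = victim.runq[half:]
		p.runq = append(p.runq, stolen[1:]...)
		return stolen[0]
	}
	return nil // nothing runnable; the worker thread would park
}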
 

Dmitry Vyukov

Feb 20, 2013, 4:28:56 AM
to Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Does that sound acceptable?
And my plan is to start sending it straight after the Go 1.1 branch. Btw, what is the ETA? I see 153 open Go 1.1 bugs...



Brad Fitzpatrick

Feb 20, 2013, 10:12:15 AM
to Dmitry Vyukov, Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I hope it makes it into Go 1.1. There are relevant open 1.1 bugs for it: 2933, for instance.

Russ Cox

Feb 20, 2013, 10:14:06 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
If you think it is ready now, we should start reviewing it and then it should be able to go into Go 1.1. The performance gains are very impressive, so if we can get it in, we should.

Russ

Russ Cox

Feb 20, 2013, 10:14:24 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Oh, and your plan for splitting it up sounds good to me.

Russ

Leo Ferres

Feb 20, 2013, 11:46:20 AM
to Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Wow, Dmitry. Those are great numbers. In the interest of completeness,
it would be worth comparing before-and-after numbers for a "number
crunching" (divide-and-conquer) algorithm, like parallel matrix
multiplication. Is Sean still doing that? Even though I agree that Go
should be primarily an I/O language with a networking focus, it would
be interesting to know what Go has to offer for more mundane tasks...
like MatMult.
Leo

On Wed, Feb 20, 2013 at 12:14 PM, Russ Cox <r...@golang.org> wrote:
> Oh, and your plan for splitting it up sounds good to me.
>
> Russ
>



--
Leo Ferres, Ph.D.
Assistant Professor
Department of Computer Science
University of Concepción, Concepción,
Chile

Michael Jones

Feb 20, 2013, 12:01:29 PM
to Leo Ferres, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I have very extensive, fully parallel, no-I/O mathematical code running now. I'm happy to try the test. Hopefully others will too.



--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Leo Ferres

Feb 20, 2013, 12:14:06 PM
to Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi, Michael,
great!! Please post the results! Just in case, the issue is the
following: work-stealing schedulers (Dmitry coded an I/O-customized
version) are provably within a constant factor of optimal, at least
for divide-and-conquer algorithms. Dmitry made modifications so that
it would optimize for I/O tasks. It would therefore be interesting to
see how the I/O optimizations fare against non-I/O, highly parallel
code. My students (currently on summer holidays) and I are coding the
original version of the work-stealing scheduler (based on Dmitry's
design), which would optimize for what Dmitry and I call "number
crunching" (NC) algorithms (e.g. Strassen matrix multiplication).
Ultimately, and this is granting the (rather large) assumption that
Go is a general-purpose language, the question is: should Go provide
two schedulers (one for I/O, the other for NC, chosen by the user at
compile time), or is there a provably good scheduler that does both
tasks optimally (up to constant factors)? I think we all agree that
the round-robin scheduler is usable, but far from optimal (at least
theoretically).
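
For concreteness, the classic discipline behind those bounds is a per-worker deque: the owner works depth-first at one end, thieves steal breadth-first at the other. A minimal sketch (illustrative Go, not our actual code; production versions such as the Chase-Lev deque are lock-free):

package sketch

import "sync"

type Task func()

// Deque is a work-stealing deque: the owning worker pushes and pops
// at the bottom (LIFO, staying on the most recently split, cache-hot
// subproblem, i.e. depth-first), while idle workers steal from the
// top (FIFO, taking the oldest and typically largest-grained task,
// which is the end the constant-factor analyses assume).
type Deque struct {
	mu    sync.Mutex
	tasks []Task
}

// PushBottom and PopBottom are used by the owner.
func (d *Deque) PushBottom(t Task) {
	d.mu.Lock()
	d.tasks = append(d.tasks, t)
	d.mu.Unlock()
}

func (d *Deque) PopBottom() (Task, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return nil, false
	}
	t := d.tasks[len(d.tasks)-1]
	d.tasks = d.tasks[:len(d.tasks)-1]
	return t, true
}

// StealTop is used by idle workers (thieves).
func (d *Deque) StealTop() (Task, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return nil, false
	}
	t := d.tasks[0]
	d.tasks = d.tasks[1:]
	return t, true
}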
Leo

Dmitry Vyukov

Feb 20, 2013, 12:38:27 PM
to Leo Ferres, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Here are some results on parallel divide and conquer matmult:

It seems to scale pretty well, even though it runs tasks in FIFO rather than LIFO (depth-first) order:

matmult -size=1200 -mode=parallel-chan, -proc as below:

proc    new real   new user   new sys     old real    old user    old sys
1        23.857s    23.700s    0.110s      23.163s     22.980s      0.120s
2        12.183s    23.940s    0.220s      22.182s     42.470s      1.440s
4         6.557s    25.410s    0.270s      25.568s     70.910s     23.500s
8         4.338s    31.770s    1.200s      31.372s    105.350s     96.270s
16        3.478s    46.700s    1.440s      27.275s     83.190s    132.920s
32        2.385s    56.240s    1.420s      26.074s     77.070s    145.030s

Dmitry Vyukov

Feb 20, 2013, 12:39:46 PM
to Leo Ferres, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
This is a machine with 2 Intel Xeon E5-2690 CPUs (@ 2.90GHz); each CPU has 8 hyper-threaded cores (32 HW threads total).

Brad Fitzpatrick

Feb 20, 2013, 1:00:21 PM
to Leo Ferres, Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 9:14 AM, Leo Ferres <leof...@gmail.com> wrote:
the question is: should Go provide 2 schedulers (one for IO, the other
for NC, chosen by the user at compile time), or is there a provably
good scheduler that does both tasks optimally

I hope we end up with only 1 scheduler.

Knobs are distracting, confusing and annoying.  Personally, I'd rather things be 90% good 100% of the time than see 90 knobs.

Leo Ferres

Feb 20, 2013, 1:26:05 PM
to Brad Fitzpatrick, Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I agree, Brad. Simplicity is best. But what if one scheduler
outperforms the other *significantly* for their given tasks? Then the
engineering decision of simplicity will start leaking. In any case, I
agree with you that we should aim for a single optimal IO/NC
scheduler, if such a beast exists (mathematically). That way both
simplicity and efficiency would rule :) I just need to know if there
is room for that work to be done, or whether, as you say, the
difference in efficiency between the 2 schedulers is negligible.

In any case, I'm attaching a speedup graph (time on 1 proc over time
on P procs) of Dmitry's MatMult benchmarks. WS=work-stealing,
RR=round-robin, which is what (I think) Dmitry meant by "new" vs.
"old", respectively. Speedup for WS maxes out at ~10 (e.g. 23.857s /
2.385s ≈ 10.0 at 32 procs; optimal would be 32), while RR, well,
doesn't scale. I don't know what that means, though; it doesn't look
right to me. In any case, Dmitry tested this with HyperThreading
enabled... I've found that it sometimes hinders performance, but that
is a technicality.

Regards,

Leo
Dmitry.png

Russ Cox

Feb 20, 2013, 2:42:49 PM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 1:26 PM, Leo Ferres <leof...@gmail.com> wrote:
I agree, Brad. Simplicity is best. But what if one scheduler
outperforms the other *significantly* for their given tasks?

Then the people in charge of the runtime have some work to do.

Russ

Leo Ferres

Feb 20, 2013, 3:43:11 PM
to Russ Cox, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi, Russ,
Sorry, I don't understand your sentence. What work is that?
L.

Russ Cox

Feb 20, 2013, 3:48:18 PM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 3:43 PM, Leo Ferres <leof...@gmail.com> wrote:
Sorry, I don't understand your sentence. What work is that?

It is the job of the runtime to make reasonable scheduling decisions based on the offered workload. If one approach is vastly better for one kind of job and another approach is vastly better for a second kind of job, then the runtime should probably figure out what the setting is and do the right thing. That's far better than new knobs.

Russ

Leo Ferres

Feb 20, 2013, 4:07:21 PM
to Russ Cox, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Yes, I agree, but again, at the cost of (maybe crazy) heuristics in
the runtime to "figure out what the setting is and do the right
thing". But this is not something that is worth discussing with me,
really. I'm not that knowledgeable (or interested) in runtime
development, I'm interested in scheduling dynamically unfolding
computations :) I'm sure you guys will do a great job with the
runtime, as you have so far.

Michael Jones

Feb 20, 2013, 6:24:02 PM
to Leo Ferres, Russ Cox, Brad Fitzpatrick, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Quick results of a few hours' computation on a simple (dual-CPU, 1-4 threads, no I/O) exercise:

tip: 178%-182% of CPU.

+Dmitry's patch set:
196% of CPU, finishes sooner, less system time.

A serious benchmark must wait until I get home to my 8-core Xeon system. But I already see gains in all tests so far.

Wow!

Dmitry Vyukov

Feb 21, 2013, 6:12:02 AM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Here are results with grain size 16x16 (the previous was 8x8):

matmult -size=1500 -mode=parallel-chan -threshold=16, -proc as below:

proc    new real    old real
1        19.091s     18.955s
2         9.368s     10.676s
4         4.904s      5.885s
8         2.648s      3.761s
16        1.422s      3.658s
32        1.138s      3.560s


The current (old) scheduler achieves only a 2.5786x speedup, while the new one reaches 16.7759x. Note that with 16 threads the speedup is already 13.4254x, which is close to ideal; it seems that hyper-threading provides less additional speedup beyond that.



Leo Ferres

Feb 21, 2013, 6:39:32 AM
to Dmitry Vyukov, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Wow, fantastic. 

Yes, as I was saying: since they are not "real" processors, when hyperthreading is on, hardware threads share HW resources. I've tested problems for which HT is not a problem, but for this kind of non-vector parallelism it is detrimental (though these numbers seem a bit too high to pin on HT, IMO). Maybe the grain size has something to do with it? Which leads me to...

Another thing: if by "grain size" you mean that the 16x16 matrices are multiplied sequentially, then although you get very good scalability in your # of processors, there will be a point at which the parallelism of the application dies with the grain size. The higher the grain size, the longer the critical path of the program, and the less the potential parallelism. This still begs the question of what the best grain size is, *for each different application given a number of processors*.
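
To make the tradeoff concrete, here is a sketch of a divide-and-conquer matmult with an explicit grain. It is illustrative only, not Dmitry's benchmark code; the threshold parameter mirrors the -threshold flag above:

package sketch

import "sync"

// mult computes c += a*b over rows [i0,i1), columns [j0,j1), and
// inner dimension [k0,k1). It splits the largest dimension until all
// are at most threshold (the grain), then loops sequentially. A
// larger grain means fewer, cheaper tasks but a longer critical
// path, so less parallelism is available to the scheduler.
func mult(c, a, b [][]float64, i0, i1, j0, j1, k0, k1, threshold int) {
	di, dj, dk := i1-i0, j1-j0, k1-k0
	switch {
	case di >= dj && di >= dk && di > threshold:
		// Split rows of c: the halves write disjoint rows, so they
		// can run in parallel.
		m := i0 + di/2
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			mult(c, a, b, i0, m, j0, j1, k0, k1, threshold)
		}()
		mult(c, a, b, m, i1, j0, j1, k0, k1, threshold)
		wg.Wait()
	case dj >= dk && dj > threshold:
		// Split columns of c: also disjoint, also parallel.
		m := j0 + dj/2
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			mult(c, a, b, i0, i1, j0, m, k0, k1, threshold)
		}()
		mult(c, a, b, i0, i1, m, j1, k0, k1, threshold)
		wg.Wait()
	case dk > threshold:
		// Split the inner dimension: both halves accumulate into the
		// same c block, so they must run sequentially.
		m := k0 + dk/2
		mult(c, a, b, i0, i1, j0, j1, k0, m, threshold)
		mult(c, a, b, i0, i1, j0, j1, m, k1, threshold)
	default:
		// The grain: a plain triple loop on a threshold-sized block.
		for i := i0; i < i1; i++ {
			for k := k0; k < k1; k++ {
				aik := a[i][k]
				for j := j0; j < j1; j++ {
					c[i][j] += aik * b[k][j]
				}
			}
		}
	}
}

A full multiply of n x n matrices is mult(c, a, b, 0, n, 0, n, 0, n, grain); raising the grain shortens the task tree but lengthens the sequential leaves.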

I wish Keith would jump in and check my facts here.

L.


Dmitry Vyukov

Feb 21, 2013, 6:54:16 AM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Thu, Feb 21, 2013 at 3:39 PM, Leo Ferres <leof...@gmail.com> wrote:
>
> Wow, fantastic.
>
> Yes, as I was saying, since they are not "real" processors, if hyperthreading is on, hardware threads share hw resources. I've tested problems in which HT is not a problem, but for these non-vector parallelism, it is detrimental (though these numbers seem a bit too high to pin on HT, IMO). Maybe the grain-size has something to do with it? Which leads me to...
>
> Another thing, if by "grain-size" you mean that the 16x16 matrices are multiplied sequentially then although you get very good scalability in your # of processors, there will be a point in which the parallelism of the application will die with the grain size. The higher the grainsize the higher the critical path of the program, the less the potential parallelism. This is still begging the question of what the best grain-size is, *for every different application given a number of processors*.
>
> I wish Keith would jump in and check my facts here.
>


With grain size 8x8 the profile looks like:
50.14% matmult matmult [.] main.matmultParallelImplChan
8.32% matmult matmult [.] runtime.xchg
8.30% matmult matmult [.] schedule
5.82% matmult matmult [.] runtime.chanrecv
5.20% matmult matmult [.] runtime.mallocgc
4.68% matmult matmult [.] runtime.chansend
3.11% matmult matmult [.] runtime.markallocated
2.41% matmult matmult [.] runtime.xadd64
0.81% matmult matmult [.] runtime.lock
0.71% matmult matmult [.] dequeue
0.61% matmult matmult [.] sweepspan

with grain size 16x16:
96.71% matmult matmult [.] main.matmultParallelImplChan
0.44% matmult matmult [.] runtime.xchg
0.34% matmult [kernel.kallsyms] [k] smp_invalidate_interrupt
0.18% matmult matmult [.] runtime.xadd64
0.16% matmult matmult [.] schedule
0.16% matmult matmult [.] runtime.chanrecv
0.08% matmult matmult [.] runtime.mallocgc
0.07% matmult matmult [.] runtime.chansend
0.06% matmult matmult [.] runtime.markallocated

So according to the profile there is no overhead in the runtime.
However, it may be exactly the FIFO execution order that breaks the
whole idea of cache-oblivious recursive decomposition: LIFO
(depth-first) execution keeps a worker on the freshly split,
cache-hot subproblems, while FIFO walks the task tree breadth-first
and loses that locality.

Leo Ferres

Feb 21, 2013, 7:04:08 AM
to Dmitry Vyukov, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Right.
This is great work, Dmitry. With these numbers, Brad is right and
there may be no need for a "true" (as in "original") LIFO
work-stealing scheduler... I don't know, I will test anyway and see
what comes up when we compare them.
L.

Patrick Mylund Nielsen

Feb 21, 2013, 1:06:39 PM
to Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
x.0 = major release, breaks backward compatibility, big changes to language

1.x = major release

1.0.x = minor release


On Thu, Feb 21, 2013 at 7:01 PM, <2pau...@googlemail.com> wrote:
This is good news indeed; thanks for the update. Looks like excellent work. Along with all the other performance tunings and enhancements, the upcoming release should make some waves.

@All ... it does not seem to do the upcoming release justice anymore to call it just "1.1", because the ".1" suggests a minor bug-fix release. However, we are looking at some major enhancements: parallel and precise garbage collection, and now also a new scheduler. That would justify bumping it up to at least 1.5. Those are just my humble thoughts; I have no idea if it is even possible to actually do this, so feel free to ignore the suggestion if it is too unrealistic.


On Friday, February 8, 2013 3:09:58 PM UTC+1, Dmitry Vyukov wrote:
Good news everyone!

I've updated my scheduler patch to tip:
https://codereview.appspot.com/7314062
(I had to create a new changelist due to some issues with the codereview site)

It passed all possible tests on the std lib. The race detector also
seems to be happy. And I discovered a bug in the current runtime
thanks to the patch:
https://codereview.appspot.com/7307073/diff/5001/src/pkg/runtime/parfor.c

--

Michael Jones

Feb 21, 2013, 1:06:17 PM
to 2pau...@googlemail.com, golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Maybe the next version could be "Go 0.9" and successive versions could converge on the Euler–Mascheroni constant.

Rémy Oudompheng

Feb 21, 2013, 1:10:48 PM
to Michael Jones, 2pau...@googlemail.com, golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
On 2013/2/21 Michael Jones <m...@google.com>:
> Maybe the next version could be "Go 0.9" and successive versions could
> converge on the Euler–Mascheroni constant.

I suggest it grow to the left and converge 10-adically to the number
of possible Go games.

Rémy.

Dmitry Vyukov

Feb 22, 2013, 2:29:03 AM
to Rémy Oudompheng, Michael Jones, Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Hey, we already have great unique identifiers; e.g., if the release were branched now it would be named 083759101bc9.
Well, actually the URL itself features the revision id 083759101bc9743e3dbf9b2aefc7ca0e5bd5f000.




Keith Randall

Feb 25, 2013, 8:12:25 PM
to Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
Sorry I haven't chimed in; I've been on vacation all last week.

Dmitry, those numbers look great.

I'm with Brad, I don't think we want knobs on our scheduler.  I don't think there's a fundamental incompatibility between the two types of scheduling - they are just two ends of a continuum.  We should be able to make the scheduler operate gracefully (gracefully = small constant factor from optimal?) across the continuum.  At least, I'm optimistic that's achievable.

Don't assume this will get into 1.1 just yet. My concern is that we don't really have a good feel for how lots of other applications fare under this scheduler. Relatedly, there are a lot of things changing in this patch, and I don't think we understand where exactly the enhanced performance is coming from. But it's still early; if we can get some confidence that we're not breaking scheduling for other types of programs, I'm happy to work on getting it into 1.1.

Ugorji

Feb 25, 2013, 8:36:28 PM
to golan...@googlegroups.com, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
This is a major downer (that this may not go into Go 1.1). 

I think I (and many other folks) have been downright giddy that this is going in for Go 1.1. It will suck if that doesn't happen. 

My own wish list for Go 1.1 is:
- escape analysis for string <-> []byte conversion (where runtime/compiler can determine that they are read-only and non-escaping)
- improved scheduler (that takes cores into account, and is somewhat more pre-emptive)
- precise and better-performing GC 
- improved segmented stack growth algorithms (especially for recursive functions) 
- possibly cheaper closures 
- cache (per processor)

It seems many of these are being worked on now (and I understand we can't get all we want).

In particular, I'm hoping that the improved scheduler will make the CPU utilization for my web application much better. 

Russ Cox

Feb 25, 2013, 9:52:52 PM
to Ugorji, golang-dev, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
No one is making any promises one way or the other about what will be in Go 1.1. And if something is not in Go 1.1, there is always Go 1.2. Don't worry too much about it.

Russ

Gustavo Niemeyer

Feb 26, 2013, 8:56:14 AM
to Steve Wang, golan...@googlegroups.com, Ugorji, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
On Tue, Feb 26, 2013 at 3:27 AM, Steve Wang <steve....@gmail.com> wrote:
> Has Go 1.1 been feature-frozen yet?
> Is there a relatively formal feature list?

Nothing formal that I know of. You can have a look at the issue
tracker, though, and get an idea of what has been accomplished, and
what is still pending.


gustavo @ http://niemeyer.net

Patrick Mylund Nielsen

Feb 26, 2013, 12:04:26 PM
to Steve Wang, golang-dev, Ugorji, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
Yes, most new feature tickets won't be accepted for Go 1.1: https://groups.google.com/d/msg/golang-dev/ajifK81ZUbg/hCIfuaWlSf4J


On Tue, Feb 26, 2013 at 7:27 AM, Steve Wang <steve....@gmail.com> wrote:
Has Go 1.1 been feature-frozen yet?
Is there a relatively formal feature list?


On Tuesday, February 26, 2013 10:52:52 AM UTC+8, rsc wrote:
No one is making any promises one way or the other about what will be in Go 1.1. And if something is not in Go 1.1, there is always Go 1.2. Don't worry too much about it.

Russ
