Go scheduler update


Dmitry Vyukov

Feb 8, 2013, 9:09:58 AM
to Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
Good news everyone!

I've updated my scheduler patch to tip:
https://codereview.appspot.com/7314062
(I had to create a new changelist due to some issues with the codereview site)

It passed all possible tests on the std lib. The race detector also
seems to be happy. And I discovered a bug in the current runtime
thanks to the patch:
https://codereview.appspot.com/7307073/diff/5001/src/pkg/runtime/parfor.c

Brad Fitzpatrick

Feb 8, 2013, 10:30:55 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
That CL description (and this email) is somewhat lacking in detail.  :-)



Ugorji

Feb 8, 2013, 12:13:15 PM
to golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Hi Dmitry,

Thanks so much for the update. 

Is this back on track for Go 1.1? Would love to know before getting excited about it. 
Also, would this help with CPU usage and scalability on high-volume servers?

For the information of others: it seems this is a continuation of the old CL: https://codereview.appspot.com/6441097/

Patrick Mylund Nielsen

Feb 8, 2013, 12:27:42 PM
to Brad Fitzpatrick, Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, golang-dev
Yes. It's an interesting feeling of excitement and absolute cluelessness about what this does.

Leo Ferres

Feb 8, 2013, 1:22:31 PM
to Dmitry Vyukov, Peter A. Buhr, Keith Randall, Sean Bauer, golang-dev, Javier Rodriguez

Great job, Dmitry! Did you manage to run some benchmarks against the round-robin scheduler? Should we now implement the number-crunching scheduler against this version?



Dmitry Vyukov

Feb 12, 2013, 11:28:48 AM
to Leo Ferres, Peter A. Buhr, Keith Randall, Sean Bauer, golang-dev, Javier Rodriguez
On Fri, Feb 8, 2013 at 10:22 PM, Leo Ferres <leof...@gmail.com> wrote:
> Great job, Dmitry! Did you manage to run some benchmarks against the round
> robin scheduler? Should we now implement the number crunching scheduler
> against this version?


Do you mean the current scheduler?
No, I have not run any benchmarks; this is just a code sync.

Dmitry Vyukov

Feb 19, 2013, 8:58:59 AM
to Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi,

I've done a lot of cleaning/tuning/refactoring over the last few days.
And also updated the net poller patch:
https://codereview.appspot.com/7326051
(now you can apply them together, or only the scheduler patch)

I have some great numbers:

net:
benchmark                           old ns/op    new ns/op    delta
BenchmarkTCPPersistent                  42107         7145  -83.03%
BenchmarkTCPPersistent-2                24177         5579  -76.92%
BenchmarkTCPPersistent-4                14245         3880  -72.76%
BenchmarkTCPPersistent-8                14653         2470  -83.14%
BenchmarkTCPPersistent-16               17408         2199  -87.37%
BenchmarkTCPPersistent-32                7134         2193  -69.26%
BenchmarkTCPPersistentTimeout           48570         7387  -84.79%
BenchmarkTCPPersistentTimeout-2         23810         6227  -73.85%
BenchmarkTCPPersistentTimeout-4         16876         3800  -77.48%
BenchmarkTCPPersistentTimeout-8         16072         2465  -84.66%
BenchmarkTCPPersistentTimeout-16        17017         1943  -88.58%
BenchmarkTCPPersistentTimeout-32         6894         2155  -68.74%
BenchmarkTCPOneShot                    124367        46281  -62.79%
BenchmarkTCPOneShot-2                   60165        31802  -47.14%
BenchmarkTCPOneShot-4                   36388        18533  -49.07%
BenchmarkTCPOneShot-8                   26048        14382  -44.79%
BenchmarkTCPOneShot-16                  19560        16013  -18.13%
BenchmarkTCPOneShot-32                  25307        15862  -37.32%
BenchmarkTCPOneShotTimeout             118402        58643  -50.47%
BenchmarkTCPOneShotTimeout-2            63786        34907  -45.27%
BenchmarkTCPOneShotTimeout-4            37546        19545  -47.94%
BenchmarkTCPOneShotTimeout-8            23137        13881  -40.01%
BenchmarkTCPOneShotTimeout-16           19154        15478  -19.19%
BenchmarkTCPOneShotTimeout-32           18417        16006  -13.09%

rpc (not parallel, so not much speedup expected):
benchmark                        old ns/op    new ns/op    delta
BenchmarkEndToEnd                   119289        24226  -79.69%
BenchmarkEndToEnd-2                  47349        31605  -33.25%
BenchmarkEndToEnd-4                  23231        18750  -19.29%
BenchmarkEndToEnd-8                  17798        14293  -19.69%
BenchmarkEndToEnd-16                 17328        13443  -22.42%
BenchmarkEndToEnd-32                 17552        12907  -26.46%
BenchmarkEndToEndHTTP               119738        24540  -79.51%
BenchmarkEndToEndHTTP-2              37950        30588  -19.40%
BenchmarkEndToEndHTTP-4              22885        18516  -19.09%
BenchmarkEndToEndHTTP-8              16167        14363  -11.16%
BenchmarkEndToEndHTTP-16             16985        13872  -18.33%
BenchmarkEndToEndHTTP-32             17423        13071  -24.98%
BenchmarkEndToEndAsync               59772        16941  -71.66%
BenchmarkEndToEndAsync-2             29662        11943  -59.74%
BenchmarkEndToEndAsync-4             16570        10582  -36.14%
BenchmarkEndToEndAsync-8             17036        13197  -22.53%
BenchmarkEndToEndAsync-16            17669        12718  -28.02%
BenchmarkEndToEndAsync-32            17629        13154  -25.38%
BenchmarkEndToEndAsyncHTTP           62488        16905  -72.95%
BenchmarkEndToEndAsyncHTTP-2         28810        12126  -57.91%
BenchmarkEndToEndAsyncHTTP-4         17742        10417  -41.29%
BenchmarkEndToEndAsyncHTTP-8         17258        12745  -26.15%
BenchmarkEndToEndAsyncHTTP-16        16881        13349  -20.92%
BenchmarkEndToEndAsyncHTTP-32        17340        12926  -25.46%

test/bench/shootout/threadring.go -n=10000000
                     old     new
GOMAXPROCS=1         1.965s  1.532s
GOMAXPROCS=2        42.090s  5.190s

vitess server (with memcached stub):
GOMAXPROCS=8, 64 connections
old: 50k qps, 890% CPU
new: 88k qps, 765% CPU
GOMAXPROCS=8, 3000 connections
old: 50k qps, 890% CPU
new: 70k qps, 775% CPU
GOMAXPROCS=32, 64 connections
old: 47k qps, 1350% CPU
new: 130k qps, 1850% CPU
GOMAXPROCS=32, 3000 connections
old: 48k qps, 1400% CPU
new: 110k qps, 1700% CPU

An interesting difference is that the old scheduler tends to use GOMAXPROCS+1 CPUs under load, while the new one uses a bit less than GOMAXPROCS CPUs.

The bad news is that the http tests sometimes hang. This seems to happen only with IPv6. The hang goes away if I comment out the "fd.rblock = true" line in https://codereview.appspot.com/7326051/diff/5001/src/pkg/net/fd_linux_amd64.go. The line basically says: if you try to read N bytes from a TCP socket and get fewer bytes, then next time do not try to read, but wait for the epoll signal first. The docs briefly mention that this is a correct optimization, but I can't find more detailed info. What happens is that a socket read returns fewer than the requested bytes, and then the epoll edge-triggered notification is not delivered; however, if I read one additional time just to get a return value of 0, the epoll notification is delivered normally.
I can't reproduce the misbehavior in a simple C test. I know how that sounds, but for now it looks like a kernel IPv6 bug...
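
In code, the optimization in question looks roughly like the sketch below. This is illustrative only, not the actual net poller change; the name readUntilBlocked and the callback are made up:

package sketch

import "syscall"

// readUntilBlocked drains an edge-triggered (EPOLLET) socket after
// epoll has reported EPOLLIN for fd, calling handle for each chunk.
func readUntilBlocked(fd int, buf []byte, handle func([]byte)) error {
	for {
		n, err := syscall.Read(fd, buf)
		if n > 0 {
			handle(buf[:n])
		}
		switch {
		case err == syscall.EAGAIN:
			// Socket drained; epoll will signal when more data arrives.
			return nil
		case err != nil:
			return err
		case n == 0:
			return nil // EOF
		case n < len(buf):
			// The "fd.rblock = true" optimization: treat a short read
			// as "socket buffer empty" and skip the extra Read that
			// would just return EAGAIN, waiting for the next EPOLLIN
			// instead. In the hang described above, the edge-triggered
			// notification sometimes never arrives on IPv6 after this
			// early return, whereas issuing the extra Read first makes
			// it arrive normally.
			return nil
		}
	}
}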


Russ Cox

Feb 19, 2013, 11:31:03 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
How do you propose to get these CLs submitted? They're a bit large right now.

Russ

Dmitry Vyukov

Feb 20, 2013, 4:24:33 AM
to Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Tue, Feb 19, 2013 at 8:31 PM, Russ Cox <r...@golang.org> wrote:
How do you propose to get these CLs submitted? They're a bit large right now.


I can submit changes to the bulk of the changed files in separate CLs (by introducing runtime.entersyscallblock() and runtime·gsignalstk).
This will leave only proc.c and mgc0.c. Then I can submit some leaf code in separate CLs (e.g. introduce the P struct, implement the per-P and global run queues); that will be dead code initially. Then the bulk of the changes to proc.c will be a single CL, still big, but I do not see a way to split it (it's all very tightly coupled).
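
For readers following along, the pieces named above fit together roughly as in this Go-flavored sketch. The real code is C inside the runtime; the names, the single mutex, and the slice-based queues here are simplifications, not the actual implementation:

package sketch

import "sync"

// G is a goroutine; P is a scheduling context. There are up to
// GOMAXPROCS Ps, each owning a local run queue.
type G struct{ id int }

type P struct {
	runq []*G // per-P run queue
}

// Sched is the global state: a shared run queue plus all Ps.
type Sched struct {
	mu      sync.Mutex
	globalq []*G // global run queue
	allp    []*P
}

// findRunnable shows the intended lookup order: local queue first,
// then the global queue, then stealing half of another P's queue.
// (The real design accesses the local queue without a lock; one big
// mutex keeps this sketch short and race-free.)
func (s *Sched) findRunnable(p *P) *G {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n := len(p.runq); n > 0 {
		g := p.runq[n-1] // LIFO pop keeps cache-hot work local
		p.runq = p.runq[:n-1]
		return g
	}
	if n := len(s.globalq); n > 0 {
		g := s.globalq[0] // FIFO from the global queue
		s.globalq = s.globalq[1:]
		return g
	}
	for _, victim := range s.allp {
		if victim == p || len(victim.runq) == 0 {
			continue
		}
		half := (len(victim.runq) + 1) / 2 // steal half to amortize
		stolen := victim.runq[:half]
		victim.runq = victim.runq[half:]
		p.runq = append(p.runq, stolen[1:]...)
		return stolen[0]
	}
	return nil // nothing runnable; the worker thread would park
}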
 

Dmitry Vyukov

Feb 20, 2013, 4:28:56 AM
to Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Does that sound acceptable?
And my plan is to start sending it straight after the Go 1.1 branch. Btw, what is the ETA? I see 153 open Go 1.1 bugs...



Brad Fitzpatrick

Feb 20, 2013, 10:12:15 AM
to Dmitry Vyukov, Russ Cox, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I hope it makes it into Go 1.1. There are relevant open 1.1 bugs for it: 2933, for instance.

Russ Cox

Feb 20, 2013, 10:14:06 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
If you think it is ready now, we should start reviewing it and then it should be able to go into Go 1.1. The performance gains are very impressive, so if we can get it in, we should.

Russ

Russ Cox

Feb 20, 2013, 10:14:24 AM
to Dmitry Vyukov, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Oh, and your plan for splitting it up sounds good to me.

Russ

Leo Ferres

Feb 20, 2013, 11:46:20 AM
to Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Wow, Dmitry. Those are great numbers. In the interest of completeness,
it would be worth comparing before-and-after numbers for a "number
crunching" (divide-and-conquer) algorithm, like parallel matrix
multiplication. Is Sean still doing that? Even though I agree that Go
should be primarily an I/O language with a networking focus, it would
be interesting to know what Go has to offer for more mundane tasks...
like MatMult.
Leo

On Wed, Feb 20, 2013 at 12:14 PM, Russ Cox <r...@golang.org> wrote:
> Oh, and your plan for splitting it up sounds good to me.
>
> Russ
>



--
Leo Ferres, Ph.D.
Assistant Professor
Department of Computer Science
University of Concepción, Concepción,
Chile

Michael Jones

Feb 20, 2013, 12:01:29 PM
to Leo Ferres, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I have very extensive, fully parallel, no-I/O mathematical code running now. I'm happy to try the test. Hopefully others will too.



--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Leo Ferres

Feb 20, 2013, 12:14:06 PM
to Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi, Michael,
great!! Please post the results! Just in case, the issue is the
following: work-stealing schedulers (Dmitry coded an I/O-customized
version) are provably within a constant factor of optimal, at least
for divide-and-conquer algorithms. Dmitry made modifications so that
it would optimize for I/O tasks. It would therefore be interesting to
see how the I/O optimizations fare against non-I/O, highly parallel
code. My students (currently on summer holidays) and I are coding the
original version of the work-stealing scheduler (based on Dmitry's
design), which would optimize for what Dmitry and I call "number
crunching" (NC) algorithms (e.g. Strassen matrix multiplication).
Ultimately, and this is granting the (rather large) assumption that
Go is a general-purpose language, the question is: should Go provide
two schedulers (one for I/O, the other for NC, chosen by the user at
compile time), or is there a provably good scheduler that does both
tasks optimally (up to constant factors)? I think we all agree that
the round-robin scheduler is usable, but far from optimal (at least
theoretically).
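
For concreteness, the classic discipline behind those bounds is a per-worker deque: the owner works depth-first at one end, thieves steal breadth-first at the other. A minimal sketch (illustrative Go, not our actual code; production versions such as the Chase-Lev deque are lock-free):

package sketch

import "sync"

type Task func()

// Deque is a work-stealing deque: the owning worker pushes and pops
// at the bottom (LIFO, staying on the most recently split, cache-hot
// subproblem, i.e. depth-first), while idle workers steal from the
// top (FIFO, taking the oldest and typically largest-grained task,
// which is the end the constant-factor analyses assume).
type Deque struct {
	mu    sync.Mutex
	tasks []Task
}

// PushBottom and PopBottom are used by the owner.
func (d *Deque) PushBottom(t Task) {
	d.mu.Lock()
	d.tasks = append(d.tasks, t)
	d.mu.Unlock()
}

func (d *Deque) PopBottom() (Task, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return nil, false
	}
	t := d.tasks[len(d.tasks)-1]
	d.tasks = d.tasks[:len(d.tasks)-1]
	return t, true
}

// StealTop is used by idle workers (thieves).
func (d *Deque) StealTop() (Task, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return nil, false
	}
	t := d.tasks[0]
	d.tasks = d.tasks[1:]
	return t, true
}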
Leo

Dmitry Vyukov

Feb 20, 2013, 12:38:27 PM
to Leo Ferres, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Here are some results on parallel divide and conquer matmult:

It seems to scale pretty well, even though it runs tasks in FIFO rather than LIFO (depth-first) order:

matmult -size=1200 -mode=parallel-chan, -proc as below:

proc    new real   new user   new sys     old real    old user    old sys
1        23.857s    23.700s    0.110s      23.163s     22.980s      0.120s
2        12.183s    23.940s    0.220s      22.182s     42.470s      1.440s
4         6.557s    25.410s    0.270s      25.568s     70.910s     23.500s
8         4.338s    31.770s    1.200s      31.372s    105.350s     96.270s
16        3.478s    46.700s    1.440s      27.275s     83.190s    132.920s
32        2.385s    56.240s    1.420s      26.074s     77.070s    145.030s

Dmitry Vyukov

Feb 20, 2013, 12:39:46 PM
to Leo Ferres, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
This is a machine with 2 Intel Xeon E5-2690 CPUs (@ 2.90GHz); each CPU has 8 hyper-threaded cores (32 HW threads total).

Brad Fitzpatrick

Feb 20, 2013, 1:00:21 PM
to Leo Ferres, Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 9:14 AM, Leo Ferres <leof...@gmail.com> wrote:
the question is: should Go provide 2 schedulers (one for IO, the other
for NC, chosen by the user at compile time), or is there a provably
good scheduler that does both tasks optimally

I hope we end up with only 1 scheduler.

Knobs are distracting, confusing and annoying.  Personally, I'd rather things be 90% good 100% of the time than see 90 knobs.

Leo Ferres

Feb 20, 2013, 1:26:05 PM
to Brad Fitzpatrick, Michael Jones, Russ Cox, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
I agree, Brad. Simplicity is best. But what if one scheduler
outperforms the other *significantly* for their given tasks? Then the
engineering decision of simplicity will start leaking. In any case, I
agree with you that we should aim for a single optimal IO/NC
scheduler, if such a beast exists (mathematically). That way both
simplicity and efficiency would rule :) I just need to know if there
is room for that work to be done, or whether, as you say, the
difference in efficiency between the 2 schedulers is negligible.

In any case, I'm attaching a speedup graph (time on 1 proc over time
on P procs) of Dmitry's MatMult benchmarks. WS=work-stealing,
RR=round-robin, which is what (I think) Dmitry meant by "new" vs.
"old", respectively. Speedup for WS maxes out at ~10 (e.g. 23.857s /
2.385s ≈ 10.0 at 32 procs; optimal would be 32), while RR, well,
doesn't scale. I don't know what that means, though; it doesn't look
right to me. In any case, Dmitry tested this with HyperThreading
enabled... I've found that it sometimes hinders performance, but that
is a technicality.

Regards,

Leo
Dmitry.png

Russ Cox

Feb 20, 2013, 2:42:49 PM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 1:26 PM, Leo Ferres <leof...@gmail.com> wrote:
I agree, Brad. Simplicity is best. But what if one scheduler
outperforms the other *significantly* for their given tasks?

Then the people in charge of the runtime have some work to do.

Russ

Leo Ferres

Feb 20, 2013, 3:43:11 PM
to Russ Cox, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Hi, Russ,
Sorry, I don't understand your sentence. What work is that?
L.

Russ Cox

Feb 20, 2013, 3:48:18 PM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Wed, Feb 20, 2013 at 3:43 PM, Leo Ferres <leof...@gmail.com> wrote:
Sorry, I don't understand your sentence. What work is that?

It is the job of the runtime to make reasonable scheduling decisions based on the offered workload. If one approach is vastly better for one kind of job and another approach is vastly better for a second kind of job, then the runtime should probably figure out what the setting is and do the right thing. That's far better than new knobs.

Russ

Leo Ferres

Feb 20, 2013, 4:07:21 PM
to Russ Cox, Brad Fitzpatrick, Michael Jones, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Yes, I agree, but again, at the cost of (maybe crazy) heuristics in
the runtime to "figure out what the setting is and do the right
thing". But this is not something that is worth discussing with me,
really. I'm not that knowledgeable (or interested) in runtime
development, I'm interested in scheduling dynamically unfolding
computations :) I'm sure you guys will do a great job with the
runtime, as you have so far.

Michael Jones

Feb 20, 2013, 6:24:02 PM
to Leo Ferres, Russ Cox, Brad Fitzpatrick, Dmitry Vyukov, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Quick results of a few hours' computation on a simple (dual-CPU, 1-4 threads, no I/O) exercise:

tip: 178%-182% of CPU.

+Dmitry's patch set:
196% of CPU, finishes sooner, less system time.

A serious benchmark must wait until I get home to my 8-core Xeon system. But I already see gains in all tests so far.

Wow!

Dmitry Vyukov

Feb 21, 2013, 6:12:02 AM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Here are results with grain size 16x16 (the previous was 8x8):

matmult -size=1500 -mode=parallel-chan -threshold=16, -proc as below:

proc    new real    old real
1        19.091s     18.955s
2         9.368s     10.676s
4         4.904s      5.885s
8         2.648s      3.761s
16        1.422s      3.658s
32        1.138s      3.560s


The current (old) scheduler achieves only a 2.5786x speedup, while the new one reaches 16.7759x. Note that with 16 threads the speedup is already 13.4254x, which is close to ideal; it seems that hyper-threading provides less additional speedup beyond that.



Leo Ferres

Feb 21, 2013, 6:39:32 AM
to Dmitry Vyukov, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Wow, fantastic. 

Yes, as I was saying: since they are not "real" processors, when hyperthreading is on, hardware threads share HW resources. I've tested problems for which HT is not a problem, but for this kind of non-vector parallelism it is detrimental (though these numbers seem a bit too high to pin on HT, IMO). Maybe the grain size has something to do with it? Which leads me to...

Another thing: if by "grain size" you mean that the 16x16 matrices are multiplied sequentially, then although you get very good scalability in your # of processors, there will be a point at which the parallelism of the application dies with the grain size. The higher the grain size, the longer the critical path of the program, and the less the potential parallelism. This still begs the question of what the best grain size is, *for each different application given a number of processors*.
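
To make the tradeoff concrete, here is a sketch of a divide-and-conquer matmult with an explicit grain. It is illustrative only, not Dmitry's benchmark code; the threshold parameter mirrors the -threshold flag above:

package sketch

import "sync"

// mult computes c += a*b over rows [i0,i1), columns [j0,j1), and
// inner dimension [k0,k1). It splits the largest dimension until all
// are at most threshold (the grain), then loops sequentially. A
// larger grain means fewer, cheaper tasks but a longer critical
// path, so less parallelism is available to the scheduler.
func mult(c, a, b [][]float64, i0, i1, j0, j1, k0, k1, threshold int) {
	di, dj, dk := i1-i0, j1-j0, k1-k0
	switch {
	case di >= dj && di >= dk && di > threshold:
		// Split rows of c: the halves write disjoint rows, so they
		// can run in parallel.
		m := i0 + di/2
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			mult(c, a, b, i0, m, j0, j1, k0, k1, threshold)
		}()
		mult(c, a, b, m, i1, j0, j1, k0, k1, threshold)
		wg.Wait()
	case dj >= dk && dj > threshold:
		// Split columns of c: also disjoint, also parallel.
		m := j0 + dj/2
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			mult(c, a, b, i0, i1, j0, m, k0, k1, threshold)
		}()
		mult(c, a, b, i0, i1, m, j1, k0, k1, threshold)
		wg.Wait()
	case dk > threshold:
		// Split the inner dimension: both halves accumulate into the
		// same c block, so they must run sequentially.
		m := k0 + dk/2
		mult(c, a, b, i0, i1, j0, j1, k0, m, threshold)
		mult(c, a, b, i0, i1, j0, j1, m, k1, threshold)
	default:
		// The grain: a plain triple loop on a threshold-sized block.
		for i := i0; i < i1; i++ {
			for k := k0; k < k1; k++ {
				aik := a[i][k]
				for j := j0; j < j1; j++ {
					c[i][j] += aik * b[k][j]
				}
			}
		}
	}
}

A full multiply of n x n matrices is mult(c, a, b, 0, n, 0, n, 0, n, grain); raising the grain shortens the task tree but lengthens the sequential leaves.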

I wish Keith would jump in and check my facts here.

L.


Dmitry Vyukov

Feb 21, 2013, 6:54:16 AM
to Leo Ferres, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
On Thu, Feb 21, 2013 at 3:39 PM, Leo Ferres <leof...@gmail.com> wrote:
>
> Wow, fantastic.
>
> Yes, as I was saying, since they are not "real" processors, if hyperthreading is on, hardware threads share hw resources. I've tested problems in which HT is not a problem, but for these non-vector parallelism, it is detrimental (though these numbers seem a bit too high to pin on HT, IMO). Maybe the grain-size has something to do with it? Which leads me to...
>
> Another thing, if by "grain-size" you mean that the 16x16 matrices are multiplied sequentially then although you get very good scalability in your # of processors, there will be a point in which the parallelism of the application will die with the grain size. The higher the grainsize the higher the critical path of the program, the less the potential parallelism. This is still begging the question of what the best grain-size is, *for every different application given a number of processors*.
>
> I wish Keith would jump in and check my facts here.
>


With grain size 8x8 the profile looks like:
50.14% matmult matmult [.] main.matmultParallelImplChan
8.32% matmult matmult [.] runtime.xchg
8.30% matmult matmult [.] schedule
5.82% matmult matmult [.] runtime.chanrecv
5.20% matmult matmult [.] runtime.mallocgc
4.68% matmult matmult [.] runtime.chansend
3.11% matmult matmult [.] runtime.markallocated
2.41% matmult matmult [.] runtime.xadd64
0.81% matmult matmult [.] runtime.lock
0.71% matmult matmult [.] dequeue
0.61% matmult matmult [.] sweepspan

with grain size 16x16:
96.71% matmult matmult [.] main.matmultParallelImplChan
0.44% matmult matmult [.] runtime.xchg
0.34% matmult [kernel.kallsyms] [k] smp_invalidate_interrupt
0.18% matmult matmult [.] runtime.xadd64
0.16% matmult matmult [.] schedule
0.16% matmult matmult [.] runtime.chanrecv
0.08% matmult matmult [.] runtime.mallocgc
0.07% matmult matmult [.] runtime.chansend
0.06% matmult matmult [.] runtime.markallocated

So according to the profile there is no overhead in the runtime.
However, it may be exactly the FIFO execution order that breaks the
whole idea of cache-oblivious recursive decomposition: LIFO
(depth-first) execution keeps a worker on the freshly split,
cache-hot subproblems, while FIFO walks the task tree breadth-first
and loses that locality.

Leo Ferres

Feb 21, 2013, 7:04:08 AM
to Dmitry Vyukov, Brad Fitzpatrick, Michael Jones, Russ Cox, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall, Sugu Sougoumarane, Mike Solomon, golang-dev
Right.
This is great work, Dmitry. With these numbers, Brad is right and
there may be no need for a "true" (as in "original") LIFO
work-stealing scheduler... I don't know, I will test anyway and see
what comes up when we compare them.
L.

Patrick Mylund Nielsen

Feb 21, 2013, 1:06:39 PM
to Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
x.0 = major release, breaks backward compatibility, big changes to language

1.x = major release

1.0.x = minor release


On Thu, Feb 21, 2013 at 7:01 PM, <2pau...@googlemail.com> wrote:
This is good news indeed; thanks for the update. Looks like excellent work. Along with all the other performance tunings and enhancements, the upcoming release should make some waves.

@All ... it does not seem to do the upcoming release justice anymore to call it just "1.1", because the ".1" suggests a minor bug-fix release. However, we are looking at some major enhancements: parallel and precise garbage collection, and now also a new scheduler. That would justify bumping it up to at least 1.5. Those are just my humble thoughts; I have no idea if it is even possible to actually do this, so feel free to ignore the suggestion if it is too unrealistic.


On Friday, February 8, 2013 3:09:58 PM UTC+1, Dmitry Vyukov wrote:
Good news everyone!

I've updated my scheduler patch to tip:
https://codereview.appspot.com/7314062
(I had to create a new changelist due to some issues with the codereview site)

It passed all possible tests on the std lib. The race detector also
seems to be happy. And I discovered a bug in the current runtime
thanks to the patch:
https://codereview.appspot.com/7307073/diff/5001/src/pkg/runtime/parfor.c

--

Michael Jones

Feb 21, 2013, 1:06:17 PM
to 2pau...@googlemail.com, golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Maybe the next version could be "Go 0.9" and successive versions could converge on the Euler–Mascheroni constant.

Rémy Oudompheng

Feb 21, 2013, 1:10:48 PM
to Michael Jones, 2pau...@googlemail.com, golan...@googlegroups.com, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
On 2013/2/21 Michael Jones <m...@google.com>:
> Maybe the next version could be "Go 0.9" and successive versions could
> converge on the Euler–Mascheroni constant.

I suggest it grow to the left and converge 10-adically to the number
of possible Go games.

Rémy.

Dmitry Vyukov

Feb 22, 2013, 2:29:03 AM
to Rémy Oudompheng, Michael Jones, Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr, Keith Randall
Hey, we already have great unique identifiers; e.g., if the release were branched now it would be named 083759101bc9.
Well, actually the URL itself features the revision id 083759101bc9743e3dbf9b2aefc7ca0e5bd5f000.




Keith Randall

Feb 25, 2013, 8:12:25 PM
to Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, golang-dev, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
Sorry I haven't chimed in; I've been on vacation all last week.

Dmitry, those numbers look great.

I'm with Brad, I don't think we want knobs on our scheduler.  I don't think there's a fundamental incompatibility between the two types of scheduling - they are just two ends of a continuum.  We should be able to make the scheduler operate gracefully (gracefully = small constant factor from optimal?) across the continuum.  At least, I'm optimistic that's achievable.

Don't assume this will get into 1.1 just yet. My concern is that we don't really have a good feel for how lots of other applications fare under this scheduler. Relatedly, there are a lot of things changing in this patch, and I don't think we understand where exactly the enhanced performance is coming from. But it's still early; if we can get some confidence that we're not breaking scheduling for other types of programs, I'm happy to work on getting it into 1.1.

Ugorji

Feb 25, 2013, 8:36:28 PM
to golan...@googlegroups.com, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
This is a major downer (that this may not go into Go 1.1). 

I think I (and many other folks) have been downright giddy that this is going in for Go 1.1. It will suck if that doesn't happen. 

My own wish list for Go 1.1 is:
- escape analysis for string <-> []byte conversion (where runtime/compiler can determine that they are read-only and non-escaping)
- improved scheduler (that takes cores into account, and is somewhat more pre-emptive)
- precise and better-performing GC 
- improved segmented stack growth algorithms (especially for recursive functions) 
- possibly cheaper closures 
- cache (per processor)

It seems many of these are being worked on now (and I understand we can't get all we want).

In particular, I'm hoping that the improved scheduler will make the CPU utilization for my web application much better. 

Russ Cox

Feb 25, 2013, 9:52:52 PM
to Ugorji, golang-dev, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
No one is making any promises one way or the other about what will be in Go 1.1. And if something is not in Go 1.1, there is always Go 1.2. Don't worry too much about it.

Russ

Gustavo Niemeyer

Feb 26, 2013, 8:56:14 AM
to Steve Wang, golan...@googlegroups.com, Ugorji, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
On Tue, Feb 26, 2013 at 3:27 AM, Steve Wang <steve....@gmail.com> wrote:
> Has Go 1.1 been feature-frozen yet?
> Is there a relatively formal feature list?

Nothing formal that I know of. You can have a look at the issue
tracker, though, and get an idea of what has been accomplished, and
what is still pending.


gustavo @ http://niemeyer.net

Patrick Mylund Nielsen

Feb 26, 2013, 12:04:26 PM
to Steve Wang, golang-dev, Ugorji, Dmitry Vyukov, Rémy Oudompheng, Michael Jones, Paul de, Leo Ferres, Sean Bauer, Javier Rodriguez, Peter A. Buhr
Yes, most new feature tickets won't be accepted for Go 1.1: https://groups.google.com/d/msg/golang-dev/ajifK81ZUbg/hCIfuaWlSf4J


On Tue, Feb 26, 2013 at 7:27 AM, Steve Wang <steve....@gmail.com> wrote:
Has Go 1.1 been feature-frozen yet?
Is there a relatively formal feature list?


On Tuesday, February 26, 2013 10:52:52 AM UTC+8, rsc wrote:
No one is making any promises one way or the other about what will be in Go 1.1. And if something is not in Go 1.1, there is always Go 1.2. Don't worry too much about it.

Russ
