NUMA-aware scheduler for Go


Dmitry Vyukov

09:10:03, Oct 10, 2014
to golang-dev
Hi,

I've written up my thoughts about the future of the Go scheduler in this doc:

https://docs.google.com/document/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub

The main challenge is that it needs to work for a wide variety of
different Go programs and the runtime has absolutely no hints from
user regarding goroutine longevity, communication patterns, data
access patterns, etc. There are lots of open questions with respect to
exact policies, but I am sure that the high-level idea is good.
Suggestions and comments are welcome.

Brendan Tracey

13:26:26, Oct 10, 2014
to Dmitry Vyukov, golang-dev
Thanks for the very nice writeup. It seems like it could bring some major speed improvements for concurrent programs, which is exciting.

Couple of comments/questions:

1) Does a NUMA scheduler extend to an RDMA-like scheduler in high-reliability clusters?

2) Most of my concurrency (as of right now) falls under two styles. I’m either a) setting up a bunch of worker goroutines, and objects will be passed between them on channels to do certain computations on them b) trying to do some form of (advanced) parallel for loop where one goroutine allocates a lot of goroutines, and then waits for the result of the computation. This latter pattern seems to agree with "For example, if a goroutine creates a bunch of new goroutines in quick succession, they may need to be distributed across P's and nodes.”.

3) With such parallel for loops, there will often be some shared data. For example, the following program:
http://play.golang.org/p/haRAISUUSF
It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine)? A rough sketch of what I mean follows.
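
A minimal sketch of the pattern (the playground link above has the actual program; the sizes and variable names here are only illustrative):

package main

import (
	"fmt"
	"sync"
)

func main() {
	const rows, cols = 200, 10000
	data := make([][]float64, rows) // the matrix
	for i := range data {
		data[i] = make([]float64, cols)
	}
	weights := make([]float64, cols) // shared, read-only weight vector
	rowSums := make([]float64, rows) // shared result slice; goroutine i writes only rowSums[i]

	var wg sync.WaitGroup
	for i := range data {
		wg.Add(1)
		go func(i int) { // one goroutine per row
			defer wg.Done()
			var sum float64
			for j, v := range data[i] {
				sum += v * weights[j]
			}
			rowSums[i] = sum
		}(i)
	}
	wg.Wait()
	fmt.Println(len(rowSums), rowSums[0])
}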

Thanks.

Henrik Johansson

14:33:45, Oct 10, 2014
to Brendan Tracey, Dmitry Vyukov, golang-dev
There are some interesting talks about this for GHC in https://groups.google.com/forum/#!topic/parallel-haskell/fsHrxz3ei70
That was a while ago, I don't know what came of it.

brainman

17:32:00, Oct 10, 2014
to golan...@googlegroups.com
Nice pictures!

s/if/is/

in "What's left if scheduling policy.".

Alex

Dmitry Vyukov

05:47:27, Oct 11, 2014
to brainman, golang-dev
On Sat, Oct 11, 2014 at 1:31 AM, brainman <alex.b...@gmail.com> wrote:
> Nice pictures!

Thanks!
Surprisingly, they were super-easy to create in Google Docs.


> s/if/is/
>
> in "What's left if scheduling policy.".

Done

Dmitry Vyukov

05:48:20, Oct 11, 2014
to Henrik Johansson, Brendan Tracey, golang-dev
On Fri, Oct 10, 2014 at 10:33 PM, Henrik Johansson <dahan...@gmail.com> wrote:
> There are some interesting talks about this for GHC in
> https://groups.google.com/forum/#!topic/parallel-haskell/fsHrxz3ei70
> That was a while ago, i don't know what came of it.

I've read as much as I could. What is the gist?

Jens Frederich

05:53:29, Oct 11, 2014
to golan...@googlegroups.com
I'd like to see your concept happen.

Jens 

Henrik Johansson

06:09:13, Oct 11, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev

I think it is basically discussing NUMA for GHC, but also promising results from work-stealing allocation, which seems to rhyme nicely with your proposal, although it is just a small facet.

I like the idea a lot.

Russ Cox

14:35:51, Oct 11, 2014
to Dmitry Vyukov, golang-dev
We are in the feature freeze and are supposed to be fixing bugs. I am not going to read plans like this until the release is in better shape.

Russ

Michael Jones

17:47:01, Oct 11, 2014
to Russ Cox, golan...@googlegroups.com, Dmitry Vyukov

Russ, please don't wait too long! Parallel performance at tip has been only 40% of release for at least three weeks. Something drastic needs to happen during the freeze...


Rob Pike

19:01:24, Oct 11, 2014
to Michael Jones, Russ Cox, golan...@googlegroups.com, Dmitry Vyukov
Can you be a little more specific please?

-rob

Michael Jones

19:32:18, Oct 11, 2014
to Rob 'Commander' Pike, Russ Cox, Dmitry Vyukov, golan...@googlegroups.com

I have a dozen or so concurrent computation programs in Go. Few of them are important in a work/business way, just sophisticated play: solve 100x100 sudokus quickly, discover the first million highly composite numbers, solve all polyomino problems in Golomb's books, and so on. They use a variety of means for concurrency, from 'go-ing' leaves in recursion to pools of workers, to pipelines of pools of workers.

I update most days and benchmark them. For the last few weeks I have seen 50% or less aggregate performance. I've been away for a week and just tested again this morning. As an example from memory, take my generalized Skolem-Langford solver. (Which generates world record results.) Using Release I test 120 million candidates per second on a single core. (Bravo Go!) On tip, that is 57 million per second. In concurrent mode (4 core MBP) I get 5x this on release and 3.5x to 4x this on tip. (4x means no SMT bonus). This same overall slowdown exists across the range of my fast codes. I saw a similar slowdown at work with my 16 core Xeon.

It concerns me.

brainman

19:43:39, Oct 11, 2014
to golan...@googlegroups.com, r...@golang.org, dvy...@google.com
Also this http://build.golang.org/perf is all red. I am not sure if that matters ...

Alex

Dmitry Vyukov

03:34:26, Oct 12, 2014
to Michael Jones, Rob 'Commander' Pike, Russ Cox, golang-dev
On Sun, Oct 12, 2014 at 3:32 AM, Michael Jones <m...@google.com> wrote:
> I have a dozen or so concurrent computation programs in Go. Few of them are
> important in a work/business way, just sophisticated play: solve 100x100
> sudokus quickly, discover the first million highly composite numbers, solve
> all polyomino problems in Golomb's books, and so on. They use a variety of
> means for concurrency, from 'go-ing' leaves in recursion to pools of
> workers, to pipelines of pools of workers.
>
> I update most days and benchmark them. For the last few weeks I have seen
> 50% or less aggregate performance. I've been away for a week and just tested
> again this morning. As an example from memory, take my generalized
> Skolem-Langford solver. (Which generates world record results.) Using
> Release I test 120 million candidates per second on a single core. (Bravo
> Go!) On tip, that is 57 million per second. In concurrent mode (4 core MBP)
> I get 5x this on release and 3.5x to 4x this on tip. (4x means no SMT
> bonus). This same overall slowdown exists across the range of my fast codes.
> I saw a similar slowdown at work with my 16 core Xeon.
>
> It concerns me.


What is "Release I"? And what is MBP?
In "get 5x this on release and 3.5x to 4x this on tip" what do you
mean by "this"? 120M or 57M?
Can you give us a reproducer? On what revision did it start happening?

Dmitry Vyukov

03:35:54, Oct 12, 2014
to brainman, golang-dev, Rob 'Commander' Pike
On Sun, Oct 12, 2014 at 3:43 AM, brainman <alex.b...@gmail.com> wrote:
> Also this http://build.golang.org/perf is all red. I am not sure if that
> matters ...


For the purpose of making the perf dashboard better, what makes you
think that it may not matter?

brainman

03:58:40, Oct 12, 2014
to golan...@googlegroups.com, alex.b...@gmail.com, r...@golang.org
On Sunday, 12 October 2014 18:35:54 UTC+11, Dmitry Vyukov wrote:
..., what makes you
think that it may not matter?

No one else seems to be concerned about it. So, perhaps, my concerns might be baseless. I don't consider myself an expert in this area. I understand a lot of runtime code changed recently. But still, it would be nice if we were all clear about what the perf effect is before the release is cut.

Alex

Dmitry Vyukov

04:16:52, Oct 12, 2014
to brainman, golang-dev, Rob 'Commander' Pike
Ah, I see.
Other people are concerned.

Dmitry Vyukov

09:47:18, Oct 12, 2014
to Brendan Tracey, golang-dev
On Fri, Oct 10, 2014 at 9:26 PM, Brendan Tracey
<tracey....@gmail.com> wrote:
> Thanks for the very nice writeup. It seems like it could bring some major speed improvements for concurrent programs, which is exciting.
>
> Couple of comments/questions:
>
> 1) Does a NUMA scheduler extend to an RDMA-like scheduler in high-reliability clusters?

Never worked with RDMA.
As far as I understand any shared-memory program will continue to work
on RDMA. So Go supports RDMA in this sense.
As for performance, I don't actually know. Any locality- and NUMA-like
optimization will definitely help RDMA as well. But I am not sure
whether it will be enough to make Go on RDMA efficient enough for
practical use.


> 2) Most of my concurrency (as of right now) falls under two styles. I’m either a) setting up a bunch of worker goroutines, and objects will be passed between them on channels to do certain computations on them b) trying to do some form of (advanced) parallel for loop where one goroutine allocates a lot of goroutines, and then waits for the result of the computation. This latter pattern seems to agree with "For example, if a goroutine creates a bunch of new goroutines in quick succession, they may need to be distributed across P's and nodes.”.

Ack

> 3) With such parallel for loops, there will often be some shared data. For example, the following program:
> http://play.golang.org/p/haRAISUUSF
> It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine?


Regarding weights. Its size is 80K, so it will fit into caches, and
won't be affected by NUMA. I meant significantly larger data
structures. In either case the new scheduler won't behave worse than
the current scheduler.
As for rowSums, if per-element work is large enough, then a single store
to the shared array is OK. If per-element work is not large enough,
then you can coarsen the work by giving each goroutine a range
rowSums[i..j].
Or you could use cache-oblivious recursive decomposition:
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
The current scheduler does not support it well (it requires LIFO
scheduling). Hopefully the new scheduler will do better.
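
To illustrate the coarsening, a minimal sketch (the function name, signature and block size are arbitrary, just for illustration; it assumes the same data/weights/rowSums layout as the playground example):

package rowsum

import "sync"

// weightedRowSums computes the weighted sum of each row of data, giving every
// goroutine a contiguous block of rows instead of a single row, so per-goroutine
// work is larger and each goroutine touches only its own block of rowSums.
func weightedRowSums(data [][]float64, weights []float64, block int) []float64 {
	if block < 1 {
		block = 1
	}
	rowSums := make([]float64, len(data))
	var wg sync.WaitGroup
	for lo := 0; lo < len(data); lo += block {
		hi := lo + block
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				var sum float64
				for j, v := range data[i] {
					sum += v * weights[j]
				}
				rowSums[i] = sum
			}
		}(lo, hi)
	}
	wg.Wait()
	return rowSums
}

The cache-oblivious variant would instead split the [lo, hi) range recursively in half until the blocks are small; that is the case that benefits from LIFO scheduling.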

Michael Jones

10:24:52, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
Hi Dmitry,

I'm trying to make a tiny, synthetic example to show the problem. My first effort (tiny numerical kernel) does not support my statement. That's good for Tip in that the problem is not a universal one, but it does not disprove the data that I see in my real programs (that do big computations, do more than zero allocations, etc.). I will continue to seek a small and shareable test program that clearly shows the problem. (If I give you a big one you'll spend more time on the program itself than on the performance problem it indicates.) That said, here is my quick non-example, taken from the first test program in the first programming class in BASIC taught by John Kemeny & Tom Kurtz. This one approximates Pi by polygonal subdivision. Code here:


On Tip on my 4-core MacBook Pro ("MBP")

mtj-macbookpro:slow mtj$ go build
mtj-macbookpro:slow mtj$ export GOMAXPROCS=1
mtj-macbookpro:slow mtj$ ./slow
    2.019800 (     2.020+    0.002) 100.109%     0.059 MiB   1796476/sec (    1796613/sec)
    2.018182 (     2.018+    0.002) 100.096%     0.008 MiB   1798114/sec (    1798054/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=2
mtj-macbookpro:slow mtj$ ./slow
    2.019929 (     2.020+    0.003) 100.153%     0.082 MiB   1796030/sec (    1796499/sec)
    1.029032 (     2.058+    0.001) 200.079%     0.047 MiB   1763442/sec (    3526420/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=3
mtj-macbookpro:slow mtj$ ./slow
    2.024293 (     2.025+    0.002) 100.152%     0.055 MiB   1792043/sec (    1792625/sec)
    0.705275 (     2.113+    0.001) 299.763%     0.020 MiB   1717591/sec (    5145228/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=4
mtj-macbookpro:slow mtj$ ./slow
    2.017216 (     2.016+    0.004) 100.157%     0.055 MiB   1799781/sec (    1798915/sec)
    0.576599 (     2.257+    0.001) 391.613%     0.016 MiB   1608041/sec (    6293451/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=5
mtj-macbookpro:slow mtj$ ./slow
    2.018553 (     2.019+    0.002) 100.156%     0.047 MiB   1797042/sec (    1797724/sec)
    0.484054 (     2.354+    0.001) 486.485%     0.012 MiB   1541840/sec (    7496681/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=6
mtj-macbookpro:slow mtj$ ./slow
    2.025593 (     2.026+    0.003) 100.157%     0.055 MiB   1791449/sec (    1791475/sec)
    0.411758 (     2.390+    0.001) 580.847%     0.066 MiB   1518112/sec (    8812934/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=7
mtj-macbookpro:slow mtj$ ./slow
    2.028218 (     2.029+    0.002) 100.150%     0.055 MiB   1788607/sec (    1789157/sec)
    0.353464 (     2.418+    0.001) 684.444%     0.121 MiB   1500725/sec (   10266382/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=8
mtj-macbookpro:slow mtj$ ./slow
    2.075799 (     2.076+    0.002) 100.134%     0.090 MiB   1747902/sec (    1748146/sec)
    0.324338 (     2.472+    0.002) 762.783%     0.172 MiB   1468065/sec (   11188317/sec)
mtj-macbookpro:slow mtj$

on Release:

mtj-macbookpro:slow mtj$ go clean
mtj-macbookpro:slow mtj$ go build
mtj-macbookpro:slow mtj$ export GOMAXPROCS=1
mtj-macbookpro:slow mtj$ ./slow
    2.040765 (     2.039+    0.004) 100.107%     0.051 MiB   1779683/sec (    1778157/sec)
    2.039777 (     2.038+    0.004) 100.098%     0.020 MiB   1780672/sec (    1779018/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=2
mtj-macbookpro:slow mtj$ ./slow
    2.038229 (     2.037+    0.004) 100.135%     0.035 MiB   1781605/sec (    1780369/sec)
    1.039575 (     2.079+    0.001) 200.083%     0.020 MiB   1745732/sec (    3490658/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=3
mtj-macbookpro:slow mtj$ ./slow
    2.052183 (     2.052+    0.003) 100.150%     0.027 MiB   1768072/sec (    1768263/sec)
    0.703978 (     2.110+    0.002) 300.003%     0.027 MiB   1719594/sec (    5154704/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=4
mtj-macbookpro:slow mtj$ ./slow
    2.034749 (     2.035+    0.003) 100.149%     0.035 MiB   1782993/sec (    1783414/sec)
    0.591446 (     2.257+    0.002) 382.062%     0.027 MiB   1607559/sec (    6135471/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=5
mtj-macbookpro:slow mtj$ ./slow
    2.034675 (     2.035+    0.003) 100.151%     0.027 MiB   1783086/sec (    1783479/sec)
    0.477781 (     2.319+    0.002) 485.797%     0.070 MiB   1565078/sec (    7595110/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=6
mtj-macbookpro:slow mtj$ ./slow
    2.037266 (     2.038+    0.003) 100.151%     0.027 MiB   1780752/sec (    1781210/sec)
    0.409199 (     2.417+    0.003) 591.432%     0.191 MiB   1501088/sec (    8868050/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=7
mtj-macbookpro:slow mtj$ ./slow
    2.033536 (     2.033+    0.004) 100.154%     0.039 MiB   1784815/sec (    1784478/sec)
    0.354096 (     2.432+    0.002) 687.527%     0.184 MiB   1492080/sec (   10248075/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=8
mtj-macbookpro:slow mtj$ ./slow
    2.034722 (     2.033+    0.005) 100.151%     0.047 MiB   1785255/sec (    1783438/sec)
    0.321146 (     2.486+    0.003) 775.165%     0.230 MiB   1459721/sec (   11299531/sec)

No meaningful difference here.

Michael Jones

11:13:41, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
Dmitry, here is a single-CPU result from a "real" program to show you what I am seeing generally:

Tip:

=== RUN TestSkolem12
# Skolem(12):           227968 solutions          2434825 nodes         35682939 updates         0.703 s
--- PASS: TestSkolem12 (0.70s)
exact_test.go:677: # Skolem(12):       227968 solutions      2434825 nodes     35682939 updates         0.703 s
=== RUN TestGap
# Gap( 1, 0):                1 solutions                1 nodes                5 updates         0.000 s 
# Gap( 3, 1):                1 solutions                4 nodes               31 updates         0.000 s 
# Gap( 4, 1):                1 solutions                8 nodes               92 updates         0.000 s 
# Gap( 4, 0):                3 solutions               14 nodes              169 updates         0.000 s 
# Gap( 5, 2):                3 solutions               19 nodes              220 updates         0.000 s 
# Gap( 5, 0):                5 solutions               38 nodes              522 updates         0.000 s 
# Gap( 7, 3):               14 solutions              115 nodes             1539 updates         0.000 s 
# Gap( 7, 1):               26 solutions              352 nodes             5055 updates         0.000 s 
# Gap( 8, 3):               56 solutions              603 nodes             8091 updates         0.001 s 
# Gap( 8, 2):               90 solutions             1010 nodes            14123 updates         0.001 s 
# Gap( 8, 1):              150 solutions             1514 nodes            22468 updates         0.002 s 
# Gap( 8, 0):              252 solutions             2260 nodes            33173 updates         0.003 s 
# Gap( 9, 4):              122 solutions             1146 nodes            15970 updates         0.002 s 
# Gap( 9, 2):              346 solutions             4642 nodes            67317 updates         0.006 s 
# Gap( 9, 0):             1328 solutions            10332 nodes           152978 updates         0.007 s 
# Gap(11, 5):             1272 solutions            14886 nodes           212023 updates         0.007 s 
# Gap(11, 3):             5560 solutions            87750 nodes          1270392 updates         0.036 s 
# Gap(11, 1):            17792 solutions           210876 nodes          3033477 updates         0.063 s 
# Gap(12, 5):             9052 solutions           141676 nodes          1949120 updates         0.042 s 
# Gap(12, 4):            21028 solutions           357049 nodes          5048941 updates         0.109 s  46.481 Mu/s
# Gap(12, 3):            32744 solutions           536269 nodes          7768788 updates         0.155 s  49.965 Mu/s
# Gap(12, 2):            55696 solutions           818550 nodes         11935776 updates         0.237 s  50.381 Mu/s
# Gap(12, 1):           108144 solutions          1351543 nodes         19494378 updates         0.382 s  51.053 Mu/s
# Gap(12, 0):           227968 solutions          2137237 nodes         31496482 updates         0.610 s  51.662 Mu/s
# Gap(13, 6):            17800 solutions           276931 nodes          4032013 updates         0.085 s 
# Gap(13, 4):           129220 solutions          2406091 nodes         34510000 updates         0.692 s  49.892 Mu/s
# Gap(13, 2):           360876 solutions          5586798 nodes         81469260 updates         1.582 s  51.504 Mu/s
# Gap(13, 0):          1520280 solutions         14940920 nodes        219953588 updates         4.216 s  52.167 Mu/s
# Gap(15, 7):           329816 solutions          6433598 nodes         94719176 updates         1.874 s  50.539 Mu/s
# Gap(15, 5):          3912272 solutions         82944237 nodes       1180126632 updates        23.040 s  51.220 Mu/s

Release (1.3.3):

=== RUN TestSkolem12
# Skolem(12):           227968 solutions          2434825 nodes         35682939 updates         0.308 s
--- PASS: TestSkolem12 (0.31 seconds)
exact_test.go:677: # Skolem(12):       227968 solutions      2434825 nodes     35682939 updates         0.308 s
=== RUN TestGap
# Gap( 1, 0):                1 solutions                1 nodes                5 updates         0.000 s 
# Gap( 3, 1):                1 solutions                4 nodes               31 updates         0.000 s 
# Gap( 4, 1):                1 solutions                8 nodes               92 updates         0.000 s 
# Gap( 4, 0):                3 solutions               14 nodes              169 updates         0.000 s 
# Gap( 5, 2):                3 solutions               19 nodes              220 updates         0.000 s 
# Gap( 5, 0):                5 solutions               38 nodes              522 updates         0.000 s 
# Gap( 7, 3):               14 solutions              115 nodes             1539 updates         0.000 s 
# Gap( 7, 1):               26 solutions              352 nodes             5055 updates         0.000 s 
# Gap( 8, 3):               56 solutions              603 nodes             8091 updates         0.000 s 
# Gap( 8, 2):               90 solutions             1010 nodes            14123 updates         0.001 s 
# Gap( 8, 1):              150 solutions             1514 nodes            22468 updates         0.001 s 
# Gap( 8, 0):              252 solutions             2260 nodes            33173 updates         0.001 s 
# Gap( 9, 4):              122 solutions             1146 nodes            15970 updates         0.001 s 
# Gap( 9, 2):              346 solutions             4642 nodes            67317 updates         0.003 s 
# Gap( 9, 0):             1328 solutions            10332 nodes           152978 updates         0.003 s 
# Gap(11, 5):             1272 solutions            14886 nodes           212023 updates         0.003 s 
# Gap(11, 3):             5560 solutions            87750 nodes          1270392 updates         0.013 s 
# Gap(11, 1):            17792 solutions           210876 nodes          3033477 updates         0.027 s 
# Gap(12, 5):             9052 solutions           141676 nodes          1949120 updates         0.019 s 
# Gap(12, 4):            21028 solutions           357049 nodes          5048941 updates         0.049 s 
# Gap(12, 3):            32744 solutions           536269 nodes          7768788 updates         0.071 s 
# Gap(12, 2):            55696 solutions           818550 nodes         11935776 updates         0.106 s 112.309 Mu/s
# Gap(12, 1):           108144 solutions          1351543 nodes         19494378 updates         0.169 s 115.404 Mu/s
# Gap(12, 0):           227968 solutions          2137237 nodes         31496482 updates         0.268 s 117.713 Mu/s
# Gap(13, 6):            17800 solutions           276931 nodes          4032013 updates         0.038 s 
# Gap(13, 4):           129220 solutions          2406091 nodes         34510000 updates         0.310 s 111.319 Mu/s
# Gap(13, 2):           360876 solutions          5586798 nodes         81469260 updates         0.717 s 113.560 Mu/s
# Gap(13, 0):          1520280 solutions         14940920 nodes        219953588 updates         1.815 s 121.187 Mu/s
# Gap(15, 7):           329816 solutions          6433598 nodes         94719176 updates         0.845 s 112.131 Mu/s
# Gap(15, 5):          3912272 solutions         82944237 nodes       1180126632 updates        10.352 s 114.001 Mu/s

Unfortunately, this is the typical comparison across a range of programs. It would be great if I've done something wrong and the problem is with me.

--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Brendan Tracey

11:54:16, Oct 12, 2014
to Dmitry Vyukov, golang-dev
>
>> 3) With such parallel for loops, there will often be some shared data. For example, the following program:
>> http://play.golang.org/p/haRAISUUSF
>> It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine?
>
>
> Regarding weights. Its size is 80K, so it will fit into caches, and
> won't be affected by NUMA. I meant significantly larger data
> structures. In either case the new scheduler won't behave worse than
> the current scheduler.
> As for rowSums, if per-element work is large enough, then single store
> to the shared array is OK. If per-element work is not large enough,
> then you can coarsen work by giving each goroutine a range
> rowSums[i..j].
> Or you could use cache-oblivious recursive decomposition:
> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
> http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
> Current scheduler does not support it well (it requires LIFO
> scheduling). Hopefully the new scheduler will do better.

Okay. In my actual use case, the data is much larger, but there are playground restrictions. I didn’t know if such patterns would get worse with the NUMA scheduler, or if such patterns would be bad to begin with because one should avoid “remote” accesses to the same slice. Thanks for the comments and the links.


ron minnich

12:15:54, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
On Sun, Oct 12, 2014 at 6:46 AM, 'Dmitry Vyukov' via golang-dev
<golan...@googlegroups.com> wrote:

> Never worked with RDMA.
> As far as I understand any shared-memory program will continue to work
> on RDMA. So Go supports RDMA in this sense.

no, RDMA is not NUMA at all. To make data move in RDMA you have to
take explicit action, which is why it's used as the basis of message
passing protocols. You don't share address spaces in RDMA systems and
you do in NUMA.

> As for performance, I don't actually know. Any locality- and NUMA-like
> optimization will definitely help RDMA as well.

I don't see how. RDMA is a shared-nothing, message-passing type of network.

ron

Dmitry Vyukov

04:15:01, Oct 13, 2014
to Michael Jones, Brendan Tracey, golang-dev
What does the profile say?

Dmitry Vyukov

04:16:10, Oct 13, 2014
to Brendan Tracey, golang-dev
It won't get worse.

Dmitry Vyukov

04:19:18, Oct 13, 2014
to ron minnich, Brendan Tracey, golang-dev
Ah, OK, I mixed up RDMA with DSM
(https://en.wikipedia.org/wiki/Distributed_shared_memory).
Then, no, Go won't work on a cluster w/o shared memory. But it could
work on a DSM cluster.

Liam

20:52:14, Oct 13, 2014
to golan...@googlegroups.com
Dmitry,

Is there NUMA-savvy C code showing expected performance for certain benchmarks?
Is there Go code for the same benchmarks?
If so, what's the current performance gap on NUMA machines?
And on non-NUMA machines?

Also, I think Michael's performance topic deserves its own thread...

Dmitry Vyukov

02:52:24, Oct 14, 2014
to Liam, golang-dev
On Tue, Oct 14, 2014 at 4:52 AM, Liam <networ...@gmail.com> wrote:
> Dmitry,
>
> Is there NUMA-savvy C code showing expected performance for certain
> benchmarks?
> Is there Go code for the same benchmarks?
> If so, what's the current performance gap on NUMA machines?
> And on non-NUMA machines?


There is no comprehensive comparison analysis.

QuickThread is a good example of a NUMA-aware scheduler for C++:
http://quickthreadprogramming.com/
But it's based on user hints all the way down.



> Also, I think Michael's performance topic deserves its own thread...