NUMA-aware scheduler for Go


Dmitry Vyukov

09:10:03, Oct 10, 2014
to golang-dev
Hi,

I've written up my thoughts about the future of the Go scheduler in this doc:

https://docs.google.com/document/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub

The main challenge is that it needs to work for a wide variety of
different Go programs and the runtime has absolutely no hints from
user regarding goroutine longevity, communication patterns, data
access patterns, etc. There are lots of open questions with respect to
exact policies, but I am sure that the high-level idea is good.
Suggestions and comments are welcome.

Brendan Tracey

13:26:26, Oct 10, 2014
to Dmitry Vyukov, golang-dev
Thanks for the very nice writeup. It seems like it could bring some major speed improvements for concurrent programs, which is exciting.

Couple of comments/questions:

1) Does a NUMA scheduler extend to an RDMA-like scheduler in high-reliability clusters?

2) Most of my concurrency (as of right now) falls under two styles. I’m either a) setting up a bunch of worker goroutines, and objects will be passed between them on channels to do certain computations on them b) trying to do some form of (advanced) parallel for loop where one goroutine allocates a lot of goroutines, and then waits for the result of the computation. This latter pattern seems to agree with "For example, if a goroutine creates a bunch of new goroutines in quick succession, they may need to be distributed across P's and nodes.”.

3) With such parallel for loops, there will often be some shared data. For example, the following program:
http://play.golang.org/p/haRAISUUSF
It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine)? A rough sketch of what I mean follows.
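
A minimal sketch of the pattern (the playground link above has the actual program; the sizes and variable names here are only illustrative):

package main

import (
	"fmt"
	"sync"
)

func main() {
	const rows, cols = 200, 10000
	data := make([][]float64, rows) // the matrix
	for i := range data {
		data[i] = make([]float64, cols)
	}
	weights := make([]float64, cols) // shared, read-only weight vector
	rowSums := make([]float64, rows) // shared result slice; goroutine i writes only rowSums[i]

	var wg sync.WaitGroup
	for i := range data {
		wg.Add(1)
		go func(i int) { // one goroutine per row
			defer wg.Done()
			var sum float64
			for j, v := range data[i] {
				sum += v * weights[j]
			}
			rowSums[i] = sum
		}(i)
	}
	wg.Wait()
	fmt.Println(len(rowSums), rowSums[0])
}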

Thanks.

Henrik Johansson

14:33:45, Oct 10, 2014
to Brendan Tracey, Dmitry Vyukov, golang-dev
There are some interesting talks about this for GHC in https://groups.google.com/forum/#!topic/parallel-haskell/fsHrxz3ei70
That was a while ago, I don't know what came of it.

brainman

17:32:00, Oct 10, 2014
to golan...@googlegroups.com
Nice pictures!

s/if/is/

in "What's left if scheduling policy.".

Alex

Dmitry Vyukov

05:47:27, Oct 11, 2014
to brainman, golang-dev
On Sat, Oct 11, 2014 at 1:31 AM, brainman <alex.b...@gmail.com> wrote:
> Nice pictures!

Thanks!
Surprisingly, they were super-easy to create in Google Docs.


> s/if/is/
>
> in "What's left if scheduling policy.".

Done

Dmitry Vyukov

05:48:20, Oct 11, 2014
to Henrik Johansson, Brendan Tracey, golang-dev
On Fri, Oct 10, 2014 at 10:33 PM, Henrik Johansson <dahan...@gmail.com> wrote:
> There are some interesting talks about this for GHC in
> https://groups.google.com/forum/#!topic/parallel-haskell/fsHrxz3ei70
> That was a while ago, i don't know what came of it.

I've read as much as I could. What is the gist?

Jens Frederich

05:53:29, Oct 11, 2014
to golan...@googlegroups.com
I'd like to see your concept happen.

Jens 

Henrik Johansson

06:09:13, Oct 11, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev

I think it is basically discussing NUMA for GHC, but also promising results from work-stealing allocation, which seems to rhyme nicely with your proposal, although it is just a small facet.

I like the idea a lot.

Russ Cox

14:35:51, Oct 11, 2014
to Dmitry Vyukov, golang-dev
We are in the feature freeze and are supposed to be fixing bugs. I am not going to read plans like this until the release is in better shape.

Russ

Michael Jones

17:47:01, Oct 11, 2014
to Russ Cox, golan...@googlegroups.com, Dmitry Vyukov

Russ, please don't wait too long! Parallel performance at tip has been only 40% of release for at least three weeks. Something drastic needs to happen during the freeze...


Rob Pike

19:01:24, Oct 11, 2014
to Michael Jones, Russ Cox, golan...@googlegroups.com, Dmitry Vyukov
Can you be a little more specific please?

-rob

Michael Jones

19:32:18, Oct 11, 2014
to Rob 'Commander' Pike, Russ Cox, Dmitry Vyukov, golan...@googlegroups.com

I have a dozen or so concurrent computation programs in Go. Few of them are important in a work/business way, just sophisticated play: solve 100x100 sudokus quickly, discover the first million highly composite numbers, solve all polyomino problems in Golomb's books, and so on. They use a variety of means for concurrency, from 'go-ing' leaves in recursion to pools of workers, to pipelines of pools of workers.

I update most days and benchmark them. For the last few weeks I have seen 50% or less aggregate performance. I've been away for a week and just tested again this morning. As an example from memory, take my generalized Skolem-Langford solver. (Which generates world record results.) Using Release I test 120 million candidates per second on a single core. (Bravo Go!) On tip, that is 57 million per second. In concurrent mode (4 core MBP) I get 5x this on release and 3.5x to 4x this on tip. (4x means no SMT bonus). This same overall slowdown exists across the range of my fast codes. I saw a similar slowdown at work with my 16 core Xeon.

It concerns me.

brainman

19:43:39, Oct 11, 2014
to golan...@googlegroups.com, r...@golang.org, dvy...@google.com
Also this http://build.golang.org/perf is all red. I am not sure if that matters ...

Alex

Dmitry Vyukov

03:34:26, Oct 12, 2014
to Michael Jones, Rob 'Commander' Pike, Russ Cox, golang-dev
On Sun, Oct 12, 2014 at 3:32 AM, Michael Jones <m...@google.com> wrote:
> I have a dozen or so concurrent computation programs in Go. Few of them are
> important in a work/business way, just sophisticated play: solve 100x100
> sudokus quickly, discover the first million highly composite numbers, solve
> all polyomino problems in Golomb's books, and so on. They use a variety of
> means for concurrency, from 'go-ing' leaves in recursion to pools of
> workers, to pipelines of pools of workers.
>
> I update most days and benchmark them. For the last few weeks I have seen
> 50% or less aggregate performance. I've been away for a week and just tested
> again this morning. As an example from memory, take my generalized
> Skolem-Langford solver. (Which generates world record results.) Using
> Release I test 120 million candidates per second on a single core. (Bravo
> Go!) On tip, that is 57 million per second. In concurrent mode (4 core MBP)
> I get 5x this on release and 3.5x to 4x this on tip. (4x means no SMT
> bonus). This same overall slowdown exists across the range of my fast codes.
> I saw a similar slowdown at work with my 16 core Xeon.
>
> It concerns me.


What is "Release I"? And what is MBP?
In "get 5x this on release and 3.5x to 4x this on tip" what do you
mean by "this"? 120M or 57M?
Can you give us a reproducer? On what revision did it start happening?

Dmitry Vyukov

03:35:54, Oct 12, 2014
to brainman, golang-dev, Rob 'Commander' Pike
On Sun, Oct 12, 2014 at 3:43 AM, brainman <alex.b...@gmail.com> wrote:
> Also this http://build.golang.org/perf is all red. I am not sure if that
> matters ...


For the purpose of making the perf dashboard better, what makes you
think that it may not matter?

brainman

03:58:40, Oct 12, 2014
to golan...@googlegroups.com, alex.b...@gmail.com, r...@golang.org
On Sunday, 12 October 2014 18:35:54 UTC+11, Dmitry Vyukov wrote:
..., what makes you
think that it may not matter?

No one else seems to be concerned about it. So, perhaps, my concerns might be baseless. I don't consider myself an expert in this area. I understand a lot of runtime code changed recently. But still, it would be nice if we were all clear about what the perf effect is before the release is cut.

Alex

Dmitry Vyukov

04:16:52, Oct 12, 2014
to brainman, golang-dev, Rob 'Commander' Pike
Ah, I see.
Other people are concerned.

Dmitry Vyukov

09:47:18, Oct 12, 2014
to Brendan Tracey, golang-dev
On Fri, Oct 10, 2014 at 9:26 PM, Brendan Tracey
<tracey....@gmail.com> wrote:
> Thanks for the very nice writeup. It seems like it could bring some major speed improvements for concurrent programs, which is exciting.
>
> Couple of comments/questions:
>
> 1) Does a NUMA scheduler extend to an RDMA-like scheduler in high-reliability clusters?

Never worked with RDMA.
As far as I understand any shared-memory program will continue to work
on RDMA. So Go supports RDMA in this sense.
As for performance, I don't actually know. Any locality- and NUMA-like
optimization will definitely help RDMA as well. But I am not sure
whether it will be enough to make Go on RDMA efficient enough for
practical use.


> 2) Most of my concurrency (as of right now) falls under two styles. I’m either a) setting up a bunch of worker goroutines, and objects will be passed between them on channels to do certain computations on them b) trying to do some form of (advanced) parallel for loop where one goroutine allocates a lot of goroutines, and then waits for the result of the computation. This latter pattern seems to agree with "For example, if a goroutine creates a bunch of new goroutines in quick succession, they may need to be distributed across P's and nodes.”.

Ack

> 3) With such parallel for loops, there will often be some shared data. For example, the following program:
> http://play.golang.org/p/haRAISUUSF
> It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine?


Regarding weights. Its size is 80K, so it will fit into caches, and
won't be affected by NUMA. I meant significantly larger data
structures. In either case the new scheduler won't behave worse than
the current scheduler.
As for rowSums, if per-element work is large enough, then a single store
to the shared array is OK. If per-element work is not large enough,
then you can coarsen the work by giving each goroutine a range
rowSums[i..j].
Or you could use cache-oblivious recursive decomposition:
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
The current scheduler does not support it well (it requires LIFO
scheduling). Hopefully the new scheduler will do better.
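
To illustrate the coarsening, a minimal sketch (the function name, signature and block size are arbitrary, just for illustration; it assumes the same data/weights/rowSums layout as the playground example):

package rowsum

import "sync"

// weightedRowSums computes the weighted sum of each row of data, giving every
// goroutine a contiguous block of rows instead of a single row, so per-goroutine
// work is larger and each goroutine touches only its own block of rowSums.
func weightedRowSums(data [][]float64, weights []float64, block int) []float64 {
	if block < 1 {
		block = 1
	}
	rowSums := make([]float64, len(data))
	var wg sync.WaitGroup
	for lo := 0; lo < len(data); lo += block {
		hi := lo + block
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				var sum float64
				for j, v := range data[i] {
					sum += v * weights[j]
				}
				rowSums[i] = sum
			}
		}(lo, hi)
	}
	wg.Wait()
	return rowSums
}

The cache-oblivious variant would instead split the [lo, hi) range recursively in half until the blocks are small; that is the case that benefits from LIFO scheduling.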

Michael Jones

10:24:52, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
Hi Dmitry,

I'm trying to make a tiny, synthetic example to show the problem. My first effort (tiny numerical kernel) does not support my statement. That's good for Tip in that the problem is not a universal one, but it does not disprove the data that I see in my real programs (that do big computations, do more than zero allocations, etc.). I will continue to seek a small and shareable test program that clearly shows the problem. (If I give you a big one you'll spend more time on the program itself than on the performance problem it indicates.) That said, here is my quick non-example, taken from the first test program in the first programming class in BASIC taught by John Kemeny & Tom Kurtz. This one approximates Pi by polygonal subdivision. Code here:


On Tip on my 4-core MacBook Pro ("MBP")

mtj-macbookpro:slow mtj$ go build
mtj-macbookpro:slow mtj$ export GOMAXPROCS=1
mtj-macbookpro:slow mtj$ ./slow
    2.019800 (     2.020+    0.002) 100.109%     0.059 MiB   1796476/sec (    1796613/sec)
    2.018182 (     2.018+    0.002) 100.096%     0.008 MiB   1798114/sec (    1798054/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=2
mtj-macbookpro:slow mtj$ ./slow
    2.019929 (     2.020+    0.003) 100.153%     0.082 MiB   1796030/sec (    1796499/sec)
    1.029032 (     2.058+    0.001) 200.079%     0.047 MiB   1763442/sec (    3526420/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=3
mtj-macbookpro:slow mtj$ ./slow
    2.024293 (     2.025+    0.002) 100.152%     0.055 MiB   1792043/sec (    1792625/sec)
    0.705275 (     2.113+    0.001) 299.763%     0.020 MiB   1717591/sec (    5145228/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=4
mtj-macbookpro:slow mtj$ ./slow
    2.017216 (     2.016+    0.004) 100.157%     0.055 MiB   1799781/sec (    1798915/sec)
    0.576599 (     2.257+    0.001) 391.613%     0.016 MiB   1608041/sec (    6293451/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=5
mtj-macbookpro:slow mtj$ ./slow
    2.018553 (     2.019+    0.002) 100.156%     0.047 MiB   1797042/sec (    1797724/sec)
    0.484054 (     2.354+    0.001) 486.485%     0.012 MiB   1541840/sec (    7496681/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=6
mtj-macbookpro:slow mtj$ ./slow
    2.025593 (     2.026+    0.003) 100.157%     0.055 MiB   1791449/sec (    1791475/sec)
    0.411758 (     2.390+    0.001) 580.847%     0.066 MiB   1518112/sec (    8812934/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=7
mtj-macbookpro:slow mtj$ ./slow
    2.028218 (     2.029+    0.002) 100.150%     0.055 MiB   1788607/sec (    1789157/sec)
    0.353464 (     2.418+    0.001) 684.444%     0.121 MiB   1500725/sec (   10266382/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=8
mtj-macbookpro:slow mtj$ ./slow
    2.075799 (     2.076+    0.002) 100.134%     0.090 MiB   1747902/sec (    1748146/sec)
    0.324338 (     2.472+    0.002) 762.783%     0.172 MiB   1468065/sec (   11188317/sec)
mtj-macbookpro:slow mtj$

on Release:

mtj-macbookpro:slow mtj$ go clean
mtj-macbookpro:slow mtj$ go build
mtj-macbookpro:slow mtj$ export GOMAXPROCS=1
mtj-macbookpro:slow mtj$ ./slow
    2.040765 (     2.039+    0.004) 100.107%     0.051 MiB   1779683/sec (    1778157/sec)
    2.039777 (     2.038+    0.004) 100.098%     0.020 MiB   1780672/sec (    1779018/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=2
mtj-macbookpro:slow mtj$ ./slow
    2.038229 (     2.037+    0.004) 100.135%     0.035 MiB   1781605/sec (    1780369/sec)
    1.039575 (     2.079+    0.001) 200.083%     0.020 MiB   1745732/sec (    3490658/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=3
mtj-macbookpro:slow mtj$ ./slow
    2.052183 (     2.052+    0.003) 100.150%     0.027 MiB   1768072/sec (    1768263/sec)
    0.703978 (     2.110+    0.002) 300.003%     0.027 MiB   1719594/sec (    5154704/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=4
mtj-macbookpro:slow mtj$ ./slow
    2.034749 (     2.035+    0.003) 100.149%     0.035 MiB   1782993/sec (    1783414/sec)
    0.591446 (     2.257+    0.002) 382.062%     0.027 MiB   1607559/sec (    6135471/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=5
mtj-macbookpro:slow mtj$ ./slow
    2.034675 (     2.035+    0.003) 100.151%     0.027 MiB   1783086/sec (    1783479/sec)
    0.477781 (     2.319+    0.002) 485.797%     0.070 MiB   1565078/sec (    7595110/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=6
mtj-macbookpro:slow mtj$ ./slow
    2.037266 (     2.038+    0.003) 100.151%     0.027 MiB   1780752/sec (    1781210/sec)
    0.409199 (     2.417+    0.003) 591.432%     0.191 MiB   1501088/sec (    8868050/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=7
mtj-macbookpro:slow mtj$ ./slow
    2.033536 (     2.033+    0.004) 100.154%     0.039 MiB   1784815/sec (    1784478/sec)
    0.354096 (     2.432+    0.002) 687.527%     0.184 MiB   1492080/sec (   10248075/sec)
mtj-macbookpro:slow mtj$ export GOMAXPROCS=8
mtj-macbookpro:slow mtj$ ./slow
    2.034722 (     2.033+    0.005) 100.151%     0.047 MiB   1785255/sec (    1783438/sec)
    0.321146 (     2.486+    0.003) 775.165%     0.230 MiB   1459721/sec (   11299531/sec)

No meaningful difference here.

Michael Jones

11:13:41, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
Dmitry, here is a single-CPU result from a "real" program to show you what I am seeing generally:

Tip:

=== RUN TestSkolem12
# Skolem(12):           227968 solutions          2434825 nodes         35682939 updates         0.703 s
--- PASS: TestSkolem12 (0.70s)
exact_test.go:677: # Skolem(12):       227968 solutions      2434825 nodes     35682939 updates         0.703 s
=== RUN TestGap
# Gap( 1, 0):                1 solutions                1 nodes                5 updates         0.000 s 
# Gap( 3, 1):                1 solutions                4 nodes               31 updates         0.000 s 
# Gap( 4, 1):                1 solutions                8 nodes               92 updates         0.000 s 
# Gap( 4, 0):                3 solutions               14 nodes              169 updates         0.000 s 
# Gap( 5, 2):                3 solutions               19 nodes              220 updates         0.000 s 
# Gap( 5, 0):                5 solutions               38 nodes              522 updates         0.000 s 
# Gap( 7, 3):               14 solutions              115 nodes             1539 updates         0.000 s 
# Gap( 7, 1):               26 solutions              352 nodes             5055 updates         0.000 s 
# Gap( 8, 3):               56 solutions              603 nodes             8091 updates         0.001 s 
# Gap( 8, 2):               90 solutions             1010 nodes            14123 updates         0.001 s 
# Gap( 8, 1):              150 solutions             1514 nodes            22468 updates         0.002 s 
# Gap( 8, 0):              252 solutions             2260 nodes            33173 updates         0.003 s 
# Gap( 9, 4):              122 solutions             1146 nodes            15970 updates         0.002 s 
# Gap( 9, 2):              346 solutions             4642 nodes            67317 updates         0.006 s 
# Gap( 9, 0):             1328 solutions            10332 nodes           152978 updates         0.007 s 
# Gap(11, 5):             1272 solutions            14886 nodes           212023 updates         0.007 s 
# Gap(11, 3):             5560 solutions            87750 nodes          1270392 updates         0.036 s 
# Gap(11, 1):            17792 solutions           210876 nodes          3033477 updates         0.063 s 
# Gap(12, 5):             9052 solutions           141676 nodes          1949120 updates         0.042 s 
# Gap(12, 4):            21028 solutions           357049 nodes          5048941 updates         0.109 s  46.481 Mu/s
# Gap(12, 3):            32744 solutions           536269 nodes          7768788 updates         0.155 s  49.965 Mu/s
# Gap(12, 2):            55696 solutions           818550 nodes         11935776 updates         0.237 s  50.381 Mu/s
# Gap(12, 1):           108144 solutions          1351543 nodes         19494378 updates         0.382 s  51.053 Mu/s
# Gap(12, 0):           227968 solutions          2137237 nodes         31496482 updates         0.610 s  51.662 Mu/s
# Gap(13, 6):            17800 solutions           276931 nodes          4032013 updates         0.085 s 
# Gap(13, 4):           129220 solutions          2406091 nodes         34510000 updates         0.692 s  49.892 Mu/s
# Gap(13, 2):           360876 solutions          5586798 nodes         81469260 updates         1.582 s  51.504 Mu/s
# Gap(13, 0):          1520280 solutions         14940920 nodes        219953588 updates         4.216 s  52.167 Mu/s
# Gap(15, 7):           329816 solutions          6433598 nodes         94719176 updates         1.874 s  50.539 Mu/s
# Gap(15, 5):          3912272 solutions         82944237 nodes       1180126632 updates        23.040 s  51.220 Mu/s

Release (1.3.3):

=== RUN TestSkolem12
# Skolem(12):           227968 solutions          2434825 nodes         35682939 updates         0.308 s
--- PASS: TestSkolem12 (0.31 seconds)
exact_test.go:677: # Skolem(12):       227968 solutions      2434825 nodes     35682939 updates         0.308 s
=== RUN TestGap
# Gap( 1, 0):                1 solutions                1 nodes                5 updates         0.000 s 
# Gap( 3, 1):                1 solutions                4 nodes               31 updates         0.000 s 
# Gap( 4, 1):                1 solutions                8 nodes               92 updates         0.000 s 
# Gap( 4, 0):                3 solutions               14 nodes              169 updates         0.000 s 
# Gap( 5, 2):                3 solutions               19 nodes              220 updates         0.000 s 
# Gap( 5, 0):                5 solutions               38 nodes              522 updates         0.000 s 
# Gap( 7, 3):               14 solutions              115 nodes             1539 updates         0.000 s 
# Gap( 7, 1):               26 solutions              352 nodes             5055 updates         0.000 s 
# Gap( 8, 3):               56 solutions              603 nodes             8091 updates         0.000 s 
# Gap( 8, 2):               90 solutions             1010 nodes            14123 updates         0.001 s 
# Gap( 8, 1):              150 solutions             1514 nodes            22468 updates         0.001 s 
# Gap( 8, 0):              252 solutions             2260 nodes            33173 updates         0.001 s 
# Gap( 9, 4):              122 solutions             1146 nodes            15970 updates         0.001 s 
# Gap( 9, 2):              346 solutions             4642 nodes            67317 updates         0.003 s 
# Gap( 9, 0):             1328 solutions            10332 nodes           152978 updates         0.003 s 
# Gap(11, 5):             1272 solutions            14886 nodes           212023 updates         0.003 s 
# Gap(11, 3):             5560 solutions            87750 nodes          1270392 updates         0.013 s 
# Gap(11, 1):            17792 solutions           210876 nodes          3033477 updates         0.027 s 
# Gap(12, 5):             9052 solutions           141676 nodes          1949120 updates         0.019 s 
# Gap(12, 4):            21028 solutions           357049 nodes          5048941 updates         0.049 s 
# Gap(12, 3):            32744 solutions           536269 nodes          7768788 updates         0.071 s 
# Gap(12, 2):            55696 solutions           818550 nodes         11935776 updates         0.106 s 112.309 Mu/s
# Gap(12, 1):           108144 solutions          1351543 nodes         19494378 updates         0.169 s 115.404 Mu/s
# Gap(12, 0):           227968 solutions          2137237 nodes         31496482 updates         0.268 s 117.713 Mu/s
# Gap(13, 6):            17800 solutions           276931 nodes          4032013 updates         0.038 s 
# Gap(13, 4):           129220 solutions          2406091 nodes         34510000 updates         0.310 s 111.319 Mu/s
# Gap(13, 2):           360876 solutions          5586798 nodes         81469260 updates         0.717 s 113.560 Mu/s
# Gap(13, 0):          1520280 solutions         14940920 nodes        219953588 updates         1.815 s 121.187 Mu/s
# Gap(15, 7):           329816 solutions          6433598 nodes         94719176 updates         0.845 s 112.131 Mu/s
# Gap(15, 5):          3912272 solutions         82944237 nodes       1180126632 updates        10.352 s 114.001 Mu/s

Unfortunately, this is the typical comparison across a range of programs. It would be great if I've done something wrong and the problem is with me.

--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Brendan Tracey

11:54:16, Oct 12, 2014
to Dmitry Vyukov, golang-dev
>
>> 3) With such parallel for loops, there will often be some shared data. For example, the following program:
>> http://play.golang.org/p/haRAISUUSF
>> It computes a weighted sum of each of the rows in a matrix, computing the sum of each row concurrently. The weight vector and final row storage are shared across all of the goroutines. This seems to fall under section 2 of “Risks”. This is a common pattern I use, so I’m somewhat concerned. I don’t know how well founded my concerns are. Does this type of code need to be rewritten to avoid explicit memory sharing as much as possible (i.e. sending the specific row slice into the goroutine, and then passing a struct{i int; sum float64} over a channel back to the main goroutine?
>
>
> Regarding weights. Its size is 80K, so it will fit into caches, and
> won't be affected by NUMA. I meant significantly larger data
> structures. In either case the new scheduler won't behave worse than
> the current scheduler.
> As for rowSums, if per-element work is large enough, then single store
> to the shared array is OK. If per-element work is not large enough,
> then you can coarsen work by giving each goroutine a range
> rowSums[i..j].
> Or you could use cache-oblivious recursive decomposition:
> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
> http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
> Current scheduler does not support it well (it requires LIFO
> scheduling). Hopefully the new scheduler will do better.

Okay. In my actual use case, the data is much larger, but there are playground restrictions. I didn’t know if such patterns would get worse with the NUMA scheduler, or if such patterns would be bad to begin with because one should avoid “remote” accesses to the same slice. Thanks for the comments and the links.


ron minnich

12:15:54, Oct 12, 2014
to Dmitry Vyukov, Brendan Tracey, golang-dev
On Sun, Oct 12, 2014 at 6:46 AM, 'Dmitry Vyukov' via golang-dev
<golan...@googlegroups.com> wrote:

> Never worked with RDMA.
> As far as I understand any shared-memory program will continue to work
> on RDMA. So Go supports RDMA in this sense.

no, RDMA is not NUMA at all. To make data move in RDMA you have to
take explicit action, which is why it's used as the basis of message
passing protocols. You don't share address spaces in RDMA systems and
you do in NUMA.

> As for performance, I don't actually know. Any locality- and NUMA-like
> optimization will definitely help RDMA as well.

I don't see how. RDMA is a shared-nothing, message-passing type of network.

ron

Dmitry Vyukov

04:15:01, Oct 13, 2014
to Michael Jones, Brendan Tracey, golang-dev
What does the profile say?

Dmitry Vyukov

04:16:10, Oct 13, 2014
to Brendan Tracey, golang-dev
It won't get worse.

Dmitry Vyukov

04:19:18, Oct 13, 2014
to ron minnich, Brendan Tracey, golang-dev
Ah, OK, I mixed up RDMA with DSM
(https://en.wikipedia.org/wiki/Distributed_shared_memory).
Then, no, Go won't work on a cluster w/o shared memory. But it could
work on a DSM cluster.

Liam

20:52:14, Oct 13, 2014
to golan...@googlegroups.com
Dmitry,

Is there NUMA-savvy C code showing expected performance for certain benchmarks?
Is there Go code for the same benchmarks?
If so, what's the current performance gap on NUMA machines?
And on non-NUMA machines?

Also, I think Michael's performance topic deserves its own thread...

Dmitry Vyukov

02:52:24, Oct 14, 2014
to Liam, golang-dev
On Tue, Oct 14, 2014 at 4:52 AM, Liam <networ...@gmail.com> wrote:
> Dmitry,
>
> Is there NUMA-savvy C code showing expected performance for certain
> benchmarks?
> Is there Go code for the same benchmarks?
> If so, what's the current performance gap on NUMA machines?
> And on non-NUMA machines?


There is no comprehensive comparison analysis.

QuickThread is a good example of a NUMA-aware scheduler for C++:
http://quickthreadprogramming.com/
But it's based on user hints all the way down.



> Also, I think Michael's performance topic deserves its own thread...