Workload semantics of new scheduler queues


Devon H. O'Dell

Mar 2, 2013, 12:22:21 PM
to golang-nuts
I was going through the scheduler changes to try to understand them
better. I came across a couple of comments from Dmitry hinting at the
idea of experimenting with lock-free queues in the scheduler, and I
figured it might be fun to try. But I wanted to clarify my
understanding of the access semantics of the run queues first.

Basically, it seems there are 1+N queues. There is a single global
queue, and then GOMAXPROCS additional local queues, associated with a
"P." The access semantics around the global queue require the
scheduler lock to be held. I haven't changed this assumption, so I'm
rolling with it for now.

What I have done is make each P have a fixed-size ring buffer. If the
ring can't be inserted into (i.e. it is full), we insert the G into
the global run queue. If we can't fetch from the ring, we pull from
the ring of some other P. I haven't implemented "stealing" yet --
right now, I'll just pull a single G from someone else's ring -- but I
don't imagine it would be very difficult to implement a lock-free
version of that. I imagine we could just save the head pointer,
advance it towards the tail, and then copy from old->new-1.
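
Roughly the shape I have in mind (a hypothetical Go sketch for
discussion -- the real change is C in the runtime, and every name
below is made up):

package sched

import "sync/atomic"

// g stands in for the runtime's goroutine descriptor.
type g struct{}

const ringSize = 256 // power of two

type pRunq struct {
    head uint32 // next slot to consume; advanced by owner and thieves
    tail uint32 // next free slot; advanced only by the owning P
    ring [ringSize]*g
}

// globrunqput stands in for insertion into the global run queue
// (with the scheduler lock held).
func globrunqput(gp *g) {}

// runqput is only ever called by the owning P, so there is exactly
// one producer per queue.
func runqput(q *pRunq, gp *g) {
    h := atomic.LoadUint32(&q.head)
    t := q.tail
    if t-h < ringSize { // room in the ring
        q.ring[t%ringSize] = gp
        atomic.StoreUint32(&q.tail, t+1) // publish the slot to thieves
        return
    }
    globrunqput(gp) // ring full: overflow into the global queue
}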

From what I can tell, the workload is SPMC; there is only one thread
associated with contributing G's to a P's queue.

Since Pn can steal work from Pm, the semantics of access require
assuming multiple consumers. But I can't tell if I'm right about
understanding the producer workload. If I'm wrong, it needs to be
MPMC, which reduces the producer side wait-freedom guarantee to simply
lock-free, but I guess that's fine. For now I just want to see how it
compares, especially on highly contended workloads. Getting this
working on ARM will require adding the DMB instruction (possibly DSB,
too) to the assembler.
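
The consumer side, continuing the same hypothetical sketch: the owner
and a thief pulling a single G both run this loop, and the CAS on
head arbitrates between them -- that's the multi-consumer part:

// runqget dequeues one G. Owner and thieves run the same loop; the
// CAS on head decides who wins.
func runqget(q *pRunq) *g {
    for {
        h := atomic.LoadUint32(&q.head)
        t := atomic.LoadUint32(&q.tail)
        if h == t {
            return nil // empty; caller falls back to another P's ring
        }
        gp := q.ring[h%ringSize]
        // While head is still h the producer cannot have reused this
        // slot, so a successful CAS also validates the read above.
        if atomic.CompareAndSwapUint32(&q.head, h, h+1) {
            return gp
        }
        // Lost the race to another consumer; retry.
    }
}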

I'd appreciate any information on these points and thoughts on this
general approach. I've got most of the code written (there's some
stupid bug in there somewhere, but it "works" anyway), so I don't
think that finishing it would take too much longer.

--dho

Devon H. O'Dell

Mar 2, 2013, 1:54:59 PM
to golang-nuts
I think it's a net win. This is just the threadring benchmark, but on
my MacBook Pro with a 4-core 2.9GHz i7:

Current:
Devons-MacBook-Pro:shootout dho$ for i in 1 2 4; do time ./6.out
-n=20000000; done > /dev/null
real 0m3.025s user 0m3.018s sys 0m0.006s
real 0m3.057s user 0m3.050s sys 0m0.006s
real 0m3.036s user 0m3.031s sys 0m0.005s

New:
Devons-MacBook-Pro:shootout dho$ for i in 1 2 4; do time ./6.out
-n=20000000; done > /dev/null
real 0m2.633s user 0m2.627s sys 0m0.005s
real 0m2.623s user 0m2.618s sys 0m0.005s
real 0m2.645s user 0m2.639s sys 0m0.005s

The pkg tests I ran don't show really great improvements but there are
some, especially when there's higher contention over the P locks
(which is usually true when GOMAXPROCS=1). I suspect I can make some
additional improvements. Are there suggestions on which benchmarks I
should run for this sort of thing?

I'm not sure that the improvements would warrant trying to get this in
to 1.1, depending on how close we are to that.

--dho

Dmitry Vyukov

Mar 3, 2013, 1:30:56 AM
to Devon H. O'Dell, golang-nuts
On Sat, Mar 2, 2013 at 7:22 PM, Devon H. O'Dell <devon...@gmail.com> wrote:
> I was going through the scheduler changes to try to understand them
> better. I came across a couple of comments from Dmitry hinting at the
> idea of experimenting with lock-free queues in the scheduler, and I
> figured it might be fun to try. But I wanted to clarify my
> understanding of the access semantics of the run queues first.

Hi Devon,

I have not implemented lock-free queues, because that's additional
complexity that I did not want to introduce in the first patch, and I
expect it to provide limited improvement (the main thing is the
distribution of work).


> Basically, it seems there are 1+N queues. There is a single global
> queue, and then GOMAXPROCS additional local queues, associated with a
> "P." The access semantics around the global queue require the
> scheduler lock to be held. I haven't changed this assumption, so I'm
> rolling with it for now.

Right.


> What I have done is make each P have a fixed-size ring buffer. If the
> ring can't be inserted into (i.e. it is full), we insert the G into
> the global run queue. If we can't fetch from the ring, we pull from
> the ring of some other P. I haven't implemented "stealing" yet --
> right now, I'll just pull a single G from someone else's ring -- but I
> don't imagine it would be very difficult to implement a lock-free
> version of that. I imagine we could just save the head pointer,
> advance it towards the tail, and then copy from old->new-1.

Sounds good. But note that you must ensure that the elements are not
overwritten between "advance it towards the tail" and "copy from
old->new-1".


> From what I can tell, the workload is SPMC; there is only one thread
> associated with contributing G's to a P's queue.

Correct.


> Since Pn can steal work from Pm, the semantics of access require
> assuming multiple consumers. But I can't tell if I'm right about
> understanding the producer workload. If I'm wrong, it needs to be
> MPMC, which reduces the producer side wait-freedom guarantee to simply
> lock-free, but I guess that's fine. For now I just want to see how it
> compares, especially on highly contended workloads. Getting this
> working on ARM will require adding the DMB instruction (possibly DSB,
> too) to the assembler.

There are some portable atomic operations available in runtime, see
runtime.atomicload/store/xchg/cas/xadd.
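
(In terms of the hypothetical sketch upthread, they behave like the
user-level sync/atomic operations -- roughly:)

func atomicsDemo() {
    var x uint32
    v := atomic.LoadUint32(&x)  // runtime.atomicload
    atomic.StoreUint32(&x, v+1) // runtime.atomicstore
    if atomic.CompareAndSwapUint32(&x, v+1, 7) { // runtime.cas
        atomic.AddUint32(&x, 1) // runtime.xadd (returns the new value)
    }
    // runtime.xchg has no direct sync/atomic counterpart.
}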


> I'd appreciate any information on these points and thoughts on this
> general approach. I've got most of the code written (there's some
> stupid bug in there somewhere, but it "works" anyway), so I don't
> think that finishing it would take too much longer.


Yes, only one thread enqueues to the local queue -- the current P
owner. The owner thread dequeues one element at a time from the queue,
which must be FIFO. Other threads steal half of the elements; this can
be either FIFO or LIFO, I don't think it's important.

I don't have an exact design for the queue, but here are some
thoughts. Thieves can be synchronized with a lightweight mutex
(dedicate 1 bit as "locked" in either the head or tail position); most
of the time thieves can use "trylock", that is, if the queue is
already locked by another thief, choose another queue. However, during
the final check in findrunnable() all queues must be checked.
It would be nice to implement local enqueue without atomic RMW
operations (just a store-release).
It's important to ensure that elements are not overwritten by the
queue owner while a thief copies them. Perhaps the mutex can help --
the owner can lock it once in a while to ensure that it does not step
on thieves.
Transfers to/from the global queue must be done in batches, because if
an owner spawns a lot of new goroutines you do not want to hit the
global queue each time. E.g. if the queue is full, transfer half of
the elements to the global queue.
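
A sketch of that batched overflow path, again in terms of the
hypothetical pRunq from upthread (real code would take the scheduler
lock once for the whole batch):

// runqputslow-style sketch: the ring is full, so move half of it,
// plus the new G, to the global queue in one batch. Returns false if
// we raced with a thief; the ring then has room again and the caller
// just retries the fast path.
func runqputslow(q *pRunq, gp *g) bool {
    const n = ringSize / 2
    h := atomic.LoadUint32(&q.head)
    if q.tail-h < ringSize {
        return false // a thief freed space; use the fast path instead
    }
    var batch [n + 1]*g
    for i := uint32(0); i < n; i++ {
        batch[i] = q.ring[(h+i)%ringSize]
    }
    if !atomic.CompareAndSwapUint32(&q.head, h, h+n) {
        return false // raced with a thief; retry the fast path
    }
    batch[n] = gp
    // A real globrunqputbatch would take the scheduler lock once here
    // rather than per element.
    for _, b := range batch {
        globrunqput(b)
    }
    return true
}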

Dmitry Vyukov

Mar 3, 2013, 1:35:36 AM
to Devon H. O'Dell, golang-nuts
On Sat, Mar 2, 2013 at 8:54 PM, Devon H. O'Dell <devon...@gmail.com> wrote:
> I think it's a net win. This is just the threadring benchmark, but on
> my MacBook Pro with a 4-core 2.9GHz i7:
>
> Current:
> Devons-MacBook-Pro:shootout dho$ for i in 1 2 4; do time ./6.out
> -n=20000000; done > /dev/null
> real 0m3.025s user 0m3.018s sys 0m0.006s
> real 0m3.057s user 0m3.050s sys 0m0.006s
> real 0m3.036s user 0m3.031s sys 0m0.005s
>
> New:
> Devons-MacBook-Pro:shootout dho$ for i in 1 2 4; do time ./6.out
> -n=20000000; done > /dev/null
> real 0m2.633s user 0m2.627s sys 0m0.005s
> real 0m2.623s user 0m2.618s sys 0m0.005s
> real 0m2.645s user 0m2.639s sys 0m0.005s

How does it look with GOMAXPROCS=2 and 4?


> The pkg tests I ran don't show really great improvements but there are
> some, especially when there's higher contention over the P locks
> (which is usually true when GOMAXPROCS=1). I suspect I can make some
> additional improvements. Are there suggestions on which benchmarks I
> should run for this sort of thing?
>
> I'm not sure that the improvements would warrant trying to get this in
> to 1.1, depending on how close we are to that.

But this also resolves the other TODO, to have a fixed-size local
queue, which I think is more important (it prevents excessive growth
of local queues).

Devon H. O'Dell

Mar 3, 2013, 1:51:59 AM
to golang-nuts
Forgot to CC list


---------- Forwarded message ----------
From: Devon H. O'Dell <devon...@gmail.com>
Date: 2013/3/3
Subject: Re: [go-nuts] Workload semantics of new scheduler queues
To: Dmitry Vyukov <dvy...@google.com>


More in a bit, but...

>> Since Pn can steal work from Pm, the semantics of access require
>> assuming multiple consumers. But I can't tell if I'm right about
>> understanding the producer workload. If I'm wrong, it needs to be
>> MPMC, which reduces the producer side wait-freedom guarantee to simply
>> lock-free, but I guess that's fine. For now I just want to see how it
>> compares, especially on highly contended workloads. Getting this
>> working on ARM will require adding the DMB instruction (possibly DSB,
>> too) to the assembler.
>
> There are some portable atomic operations available in runtime, see
> runtime.atomicload/store/xchg/cas/xadd.

I'm worried about this. ARM has relaxed memory ordering, and the
compilers don't generate memory fences on ARM, which means that
lock-free data structures will probably break in weird ways if they're
implemented on ARM. If these are working, it might be by accident...
:S
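
To make the concern concrete (same Go-flavored sketch; in the runtime
the barrier would have to be a DMB in the ARM assembly):

// The queue relies on the slot write becoming visible before the
// tail advance. x86 (TSO) gives plain stores that ordering; ARM does
// not, so without a barrier between these two stores a consumer can
// see the new tail and then read a stale slot.
func publish(q *pRunq, gp *g) {
    t := q.tail
    q.ring[t%ringSize] = gp
    // ARM: DMB needed here (store/store ordering).
    atomic.StoreUint32(&q.tail, t+1)
}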

--dho

Dmitry Vyukov

Mar 3, 2013, 1:54:09 AM
to Devon H. O'Dell, golang-nuts
Current atomic operations provide sequential consistency, that is,
they include all possible memory barriers.

Devon H. O'Dell

Mar 3, 2013, 1:56:40 AM
to Dmitry Vyukov, golang-nuts
2013/3/3 Dmitry Vyukov <dvy...@google.com>:
I haven't looked at the generated assembly, but really? I don't see
DSB/DMB support in 5a, though I haven't inspected compiled ARM
binaries yet. Just want to make sure, because "just using an atomic"
isn't enough for ARM, since it can reorder loads and stores with each
other, irrespective of whether or not they deal with the same address.

--dho

Dmitry Vyukov

Mar 3, 2013, 2:06:17 AM
to Devon H. O'Dell, golang-nuts
It's a good question. Now that I am actually looking at the code, I
think the ARM atomics indeed lack proper memory barriers. That could
be the reason why my scheduler patch initially deadlocked on ARM, and
why using atomicload/store instead of plain loads and stores fixed it
(atomicload/store are implemented as an ldrex/strex loop).

Dmitry Vyukov

Mar 3, 2013, 2:28:23 AM
to Devon H. O'Dell, golang-nuts
The Linux version uses the kernel-provided CAS implementation, which
must (presumably) include memory barriers.

Devon H. O'Dell

Mar 3, 2013, 4:39:47 AM
to golang-nuts
I...


---------- Forwarded message ----------
From: Devon H. O'Dell <devon...@gmail.com>
Date: 2013/3/3
Subject: Re: [go-nuts] Workload semantics of new scheduler queues
To: Dmitry Vyukov <dvy...@google.com>


I think that ldrex/strex might help, but on multicore ARM, if this
works properly, it works because it happens to race properly.

Devon H. O'Dell

Mar 3, 2013, 4:40:07 AM
to golang-nuts
... am terrible at reply-all.


---------- Forwarded message ----------
From: Devon H. O'Dell <devon...@gmail.com>
Date: 2013/3/3
Subject: Re: [go-nuts] Workload semantics of new scheduler queues
To: Dmitry Vyukov <dvy...@google.com>


Ah. That probably explains it. We can implement it natively. I will
try to do this (I'm also working on getting Concurrency Kit[1] working
on ARM) as part of a CL. Maybe.

Given what I've seen from such heavyweight CLs as this, what should I
submit as a minimal set of changes? What I have now is relatively
minimal, but will still need some ARM ASM changes. I can make / test
ARM changes, but I want to be clear on what we're looking for.

Thanks for the input!

--dho

Dmitry Vyukov

Mar 3, 2013, 10:42:17 AM
to Devon H. O'Dell, golang-nuts
Upload your current changes to codereview; then it will be easier to
suggest how to split them.

Devon H. O'Dell

Mar 4, 2013, 12:24:27 PM
to Dmitry Vyukov, golang-nuts
2013/3/3 Dmitry Vyukov <dvy...@google.com>:
> upload your current changes to codereview, then it will be easier to
> suggest splitting

I'll get to this today. Re-reading this thread, I just realized that I
hadn't actually set GOMAXPROCS when I was testing threadring. Oops.

This is threadring with GOMAXPROCS={1,2,4,8}:
Devons-MacBook-Pro:shootout dho$ cat old.txt
3.02 real 3.01 user 0.00 sys
10.21 real 11.51 user 4.07 sys
6.73 real 7.91 user 3.08 sys
6.11 real 7.27 user 2.74 sys
Devons-MacBook-Pro:shootout dho$ cat new.txt
2.67 real 2.67 user 0.00 sys
5.87 real 6.30 user 2.54 sys
4.66 real 4.88 user 2.17 sys
4.42 real 4.81 user 1.98 sys

Running the benchmarks under pkg right now; I'll tack on those results
as well once they're done and submit a CL.

--dho