On Sat, Mar 2, 2013 at 7:22 PM, Devon H. O'Dell <
devon...@gmail.com> wrote:
> I was going through the scheduler changes to try to understand them
> better. I came across a couple of comments from Dmitry hinting at the
> idea of experimenting with lock-free queues in the scheduler, and I
> figured it might be fun to try. But I wanted to clarify my
> understanding of the access semantics of the run queues first.
Hi Devon,
I have not implemented lock-free queues, because that's additional
complexity that I did not want to introduce in the first patch, and I
expect it to provide only a limited improvement (the main thing is the
distribution across per-P queues).
> Basically, it seems there are 1+N queues. There is a single global
> queue, and then GOMAXPROCS additional local queues, associated with a
> "P." The access semantics around the global queue require the
> scheduler lock to be held. I haven't changed this assumption, so I'm
> rolling with it for now.
Right.
> What I have done is made each P have a fixed-size ring buffer. If the
> ring can't be inserted into (i.e. it is full), we insert the G into
> the global run queue. If we can't fetch from the ring, we pull from
> the ring of some other P. I haven't implemented "stealing" yet --
> right now, I'll just pull a single G from someone else's ring -- but I
> don't imagine it would be very difficult to implement a lock-free
> version of that. I imagine we could just save the head pointer,
> advance it towards the tail, and then copy from old->new-1.
Sounds good. But note that you must ensure that the elements are not
overwritten between "advance it towards the tail" and "copy from
old->new-1".
> From what I can tell, the workload is SPMC; there is only one thread
> associated with contributing G's to a P's queue.
Correct.
> Since Pn can steal work from Pm, the semantics of access require
> assuming multiple consumers. But I can't tell if I'm right about
> understanding the producer workload. If I'm wrong, it needs to be
> MPMC, which reduces the producer-side guarantee from wait-freedom to mere
> lock-freedom, but I guess that's fine. For now I just want to see how it
> compares, especially on highly contended workloads. Getting this
> working on ARM will require adding the DMB instruction (possibly DSB,
> too) to the assembler.
There are some portable atomic operations available in runtime, see
runtime.atomicload/store/xchg/cas/xadd.
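(For reference, the sketches in this reply use the exported sync/atomic
package as a stand-in for those; a rough correspondence -- the internal
helpers' exact signatures may differ:)

var word uint32

func atomicsCheatSheet() {
	_ = atomic.LoadUint32(&word)                 // ~ runtime.atomicload
	atomic.StoreUint32(&word, 1)                 // ~ runtime.atomicstore
	_ = atomic.SwapUint32(&word, 2)              // ~ runtime.xchg
	_ = atomic.CompareAndSwapUint32(&word, 2, 3) // ~ runtime.cas
	_ = atomic.AddUint32(&word, 1)               // ~ runtime.xadd
}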
> I'd appreciate any information on these points and thoughts on this
> general approach. I've got most of the code written (there's some
> stupid bug in there somewhere, but it "works" anyway), so I don't
> think that finishing it would take too much longer.
Yes, only one thread enqueues to the local queue -- the current P owner.
The owner thread dequeues one element at a time from the queue, and that
must be FIFO. Other threads steal half of the elements; this can be
either FIFO or LIFO, I don't think it's important.
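Continuing the sketch above, the owner-side operations could look roughly
like this (runqput/runqget are invented names, not the actual code):

// runqput is called only by the owner P (single producer): a load of head,
// a plain store of the element, and a store-release of tail -- no atomic
// RMW needed. It reports false when the ring is full, and the caller then
// falls back to the global queue (see the batching sketch at the end).
func (q *pRunq) runqput(gp *g) bool {
	h := atomic.LoadUint32(&q.head)
	t := q.tail // only the owner writes tail, so a plain read is fine here
	if t-h >= runqSize {
		return false // full
	}
	q.buf[t%runqSize] = gp
	atomic.StoreUint32(&q.tail, t+1) // publish the element to thieves
	return true
}

// runqget dequeues a single G in FIFO order. Only the owner calls it, but
// head is shared with thieves, so the dequeue still commits with a CAS.
func (q *pRunq) runqget() *g {
	for {
		h := atomic.LoadUint32(&q.head)
		if h == atomic.LoadUint32(&q.tail) {
			return nil // empty
		}
		gp := q.buf[h%runqSize]
		if atomic.CompareAndSwapUint32(&q.head, h, h+1) {
			return gp
		}
	}
}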
I don't have an exact design for the queue, but here are some thoughts.
Thieves can be synchronized with a lightweight mutex (dedicate 1 bit as a
"locked" flag in either the head or the tail word); most of the time
thieves can use "trylock", that is, if the queue is already locked by
another thief, move on to another queue. However, during the final check
in findrunnable() all queues must be checked.
It would be nice to implement local enqueue without atomic RMW
operations (just store-release).
It's important to ensure that elements are not overwritten by the
queue owner while a thief is copying them. Perhaps the mutex can help --
the owner can lock it once in a while to ensure that it does not step
on the thieves.
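Sketching the "locked bit" idea on top of the same structure (the bit
placement, names, and the owner-side handling are assumptions, not a
worked-out design):

const runqLockBit = 1 << 31 // top bit of head doubles as the thief lock

// stealHalf is the thief path: trylock the bit, copy half of the elements,
// then publish the new head, which also releases the lock. dst must have
// room for runqSize/2 elements. The owner's fast paths would have to mask
// the bit out of head and occasionally take the lock themselves, so that
// they never reuse slots a thief is still copying (not spelled out here).
// During the final check in findrunnable() the thief would wait or retry
// instead of giving up.
func (q *pRunq) stealHalf(dst []*g) (int, bool) {
	h := atomic.LoadUint32(&q.head)
	if h&runqLockBit != 0 ||
		!atomic.CompareAndSwapUint32(&q.head, h, h|runqLockBit) {
		return 0, false // another thief owns this queue; try a different P
	}
	t := atomic.LoadUint32(&q.tail)
	n := (t - h) / 2
	for i := uint32(0); i < n; i++ {
		dst[i] = q.buf[(h+i)%runqSize]
	}
	atomic.StoreUint32(&q.head, h+n) // advance past stolen elements and unlock
	return int(n), true
}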
Transfers to/from the global queue must be done in batches, because if an
owner spawns a lot of new goroutines, you do not want to touch the global
queue for each one. E.g. if the local queue is full, transfer half of its
elements to the global queue in one go.
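And a sketch of the batched overflow path on the same structure (schedLock
and globalRunq are stand-ins for the scheduler lock and the global run
queue; assume "sync" is imported as well):

var (
	schedLock  sync.Mutex // stand-in for the scheduler lock
	globalRunq []*g       // stand-in for the global run queue
)

// runqputSlow is called by the owner when runqput finds the ring full: it
// moves half of the local elements plus the new G to the global queue under
// a single lock acquisition, instead of touching the global queue for every
// spawned goroutine.
func (q *pRunq) runqputSlow(gp *g) bool {
	h := atomic.LoadUint32(&q.head)
	t := q.tail
	n := (t - h) / 2
	var batch [runqSize/2 + 1]*g
	for i := uint32(0); i < n; i++ {
		batch[i] = q.buf[(h+i)%runqSize]
	}
	// A thief may advance head concurrently, so commit the removal with a
	// CAS and let the caller retry the fast path if the race is lost.
	if !atomic.CompareAndSwapUint32(&q.head, h, h+n) {
		return false
	}
	batch[n] = gp
	schedLock.Lock()
	globalRunq = append(globalRunq, batch[:n+1]...) // one global access per batch
	schedLock.Unlock()
	return true
}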