Scalable Go Scheduler Design Doc


Dmitry Vyukov

Jun 1, 2012, 12:09:47 PM
to golang-dev
Hi,

Here are some thoughts on subj:
https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit
I may start working on it in the near future.
Feedback is welcome.

Kyle Lemons

Jun 1, 2012, 1:25:48 PM
to Dmitry Vyukov, golang-dev
Sweet!

Ian Lance Taylor

Jun 1, 2012, 2:45:31 PM
to Dmitry Vyukov, golang-dev
I'm glad you are thinking about this.

It sounds like you are saying that an M corresponds to a system thread,
as is the case today. A P loosely represents a processor, and we
require that every M attach to a P while running.

In the "Scheduling" section you say that a P picks a new G to run. But
presumably the P has to pick an M to run it. And elsewhere we see that
an M has to pick a P to run. So I guess I don't understand this. When
does a P pick a new G? How does it pick the M?

In the "Parking and Unparking" section you say that there are most
GOMAXPROCS spinning M's. But we could have a lot of M's simultaneously
come out of a syscall, with an associated G but without an associated P.
Won't they all be spinning waiting for a P?

Your initial point 4 says that syscalls lead to lots of thread blocking
and unblocking. Does your plan address that?

You speak of the P's as executing code. But as I think about this I'm
finding it easier to think of the M's as executing code. The P's become
a shared resource to manage memory, so that rather than the memory cache
being per-M, it is shared in such a way that each M has exclusive access
to a memory cache, and we only have as many memory caches as we have
simultaneously executing M's.

You have arranged for an affinity for G's to P's. But what matters more
for performance purposes is an affinity for G's to CPU's. And while a P
loosely represents a CPU, the kernel doesn't know about that. It seems
that we also need an affinity for M's to P's, so that G's tend to wind
up executing on the same set of M's. As you note in point 2, currently
G's often move between M's unnecessarily. Can we avoid that?

Is there some reasonable way that we could permit a G entering a syscall
to stay on the same M and P, and only have the P start running a new M
and G if the syscall does not return immediately?

Minor note: "Some variables form M are" s/form/from/ .

Ian

Ingo Oeser

Jun 1, 2012, 5:12:05 PM
to golan...@googlegroups.com
Great article. But I have a very stupid suggestion: Let M do what you plan for P. 

The Linux kernel already tries to schedule threads in a way you suggest for P, 
as long as it can see them as related. Don't know for xBSD and Windows.

This also takes care of related issues like hyperthreading, far-away NUMA nodes and other topology concerns,
because your kernel has already solved these.

Just removing the global scheduling lock and implementing the work stealing scheduler might already be enough.

Devon H. O'Dell

Jun 1, 2012, 5:43:39 PM
to Ingo Oeser, golan...@googlegroups.com
2012/6/1 Ingo Oeser <night...@googlemail.com>:
> Great article. But I have a very stupid suggestion: Let M do what you plan
> for P.
>
> The Linux kernel already tries to schedule threads in a way you suggest for
> P,
> as long as it can see them as related. Don't know for xBSD and Windows.

M:N scheduling is a hard problem to solve (as is indicated by decades
of people trying to implement it correctly with little success). I see
this document as a reasonably good proposal. The different BSDs have
different scheduling mechanisms, but they're all generally 1:1. (To my
knowledge, Linux is 1:1, too). Because we're trying to support an M:N
model, it seems to me that this extra "P" actually makes a lot of
sense.

I do agree with Ian, though. There should be some more mention of P
processor affinity. This "solves" the migrating G and memory / cache
locality problem pretty simply assuming there is a more explicit
mention of G affinity to P. I think all of our supported operating
systems have a way to pin a user thread to a particular processor, so
this should be possible. I'm happy to do this for FreeBSD when the
time is right.

It might be possible to do the same sort of thing with just G and M,
but I'm not sure. I think Dmitry is probably a much better person than
I to have a good idea about that.

--dho

Ingo Oeser

Jun 1, 2012, 6:02:06 PM
to Devon H. O'Dell, golan...@googlegroups.com

Once you lock each M that is not waiting in a syscall to the CPU assigned to it, and unlock it while it waits, P is not needed anymore.

Basically you have NumCPU active Ms plus a number of Ms which are pure waiters. Since there can be only one active OS thread per CPU at any given time, this should suffice. And once your M is idle, just steal some runnable Gs from foreign Ms. No P needed.

Ian Lance Taylor

Jun 1, 2012, 7:02:00 PM
to Ingo Oeser, Devon H. O'Dell, golan...@googlegroups.com
Ingo Oeser <night...@googlemail.com> writes:

> Once you lock each M that is not waiting in a syscall to the CPU assigned
> to it, and unlock it while it waits, P is not needed anymore.
>
> Basically you have NumCPU active Ms plus a number of Ms which are pure
> waiters. Since there can be only one active OS thread per CPU at any given
> time, this should suffice. And once your M is idle, just steal some
> runnable Gs from foreign Ms. No P needed.

The P is still needed to hold a CPU-specific MCache.

Ian

Dmitry Vyukov

Jun 4, 2012, 2:24:00 PM
to Ian Lance Taylor, golang-dev
On Fri, Jun 1, 2012 at 2:45 PM, Ian Lance Taylor <ia...@google.com> wrote:
Dmitry Vyukov <dvy...@google.com> writes:

> Here are some thoughts on subj:
> https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit
> I may start working on it in the near future.
> Feedback is welcome.

I'm glad you are thinking about this.

It sounds like you are saying that an M corresponds to a system thread,
as is the case today.  A P loosely represents a processor, and we
require that every M attach to a P while running.

Yes. P is a resource required to execute Go code; that is, an M does not need a P while in a syscall.
I've tried to clarify this in the doc:
M represents an OS thread (as it is now). P represents a resource that is required to execute Go code. When M executes Go code, it has an associated P. When M is idle or in a syscall, it does not need a P.
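
To make the roles concrete, here is a rough sketch in Go; the type and field names are illustrative only, not the actual runtime definitions:

// Illustrative sketch only; names and fields do not match the real runtime.
type MCache struct {
        // per-size-class free lists, elided
}

type G struct { // goroutine
        stack  []byte
        status uint32 // runnable, running, waiting, in syscall, ...
}

type P struct { // "processor": a token required to run Go code
        mcache *MCache // memory allocator cache, one per P
        runq   []*G    // local queue of runnable goroutines
}

type M struct { // OS thread
        curg *G // goroutine currently being executed, if any
        p    *P // attached P; nil while idle or blocked in a syscall
}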

 

In the "Scheduling" section you say that a P picks a new G to run.  But
presumably the P has to pick an M to run it.  And elsewhere we see that
an M has to pick a P to run.  So I guess I don't understand this.  When
does a P pick a new G?  How does it pick the M?

It's M's that execute code. When I say P executes G, I want to emphasize that M can't execute Go code w/o being attached to P.
So M picks up a P when it wants to execute Go code (e.g. when it returns from a syscall), then picks up a G and executes it.
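
Roughly, as a sketch (every function name here is hypothetical, not the real scheduler code):

// Hypothetical sketch of the per-M loop; all helpers are illustrative.
func mloop(m *M) {
        for {
                if m.p == nil {
                        m.p = acquirep() // wait (spin, then park) until a P is free
                }
                g := runqget(m.p) // the P's local run queue first
                if g == nil {
                        g = findrunnable(m.p) // global queue, then steal from other P's
                }
                execute(m, g) // run g until it blocks, yields or enters a syscall
        }
}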


 

In the "Parking and Unparking" section you say that there are most
GOMAXPROCS spinning M's.  But we could have a lot of M's simultaneously
come out of a syscall, with an associated G but without an associated P.
Won't they all be spinning waiting for a P?

The idea is that there are at most GOMAXPROCS spinning M's. Excess M's (returning from a syscall) are promptly parked (as it is now).
Spinning M's eventually block as well (if they can't find any work for some time). Priority is given to spinning M's with an associated P: M's without an associated P block first, and a spinning M with a P blocks only when there are no spinning M's without a P.
I will try to clarify this in the doc.
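
As a rough sketch of that parking order (the counter and helpers below are made up for illustration):

// Hypothetical sketch: which spinning M gives up first.
func stopSpinning(m *M) {
        if m.p == nil {
                park(m) // M's without a P block first
                return
        }
        // An M that still holds a P parks (releasing the P) only when
        // no spinning M's without a P remain.
        if atomic.LoadInt32(&nSpinningWithoutP) == 0 {
                releasep(m)
                park(m)
        }
}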



Your initial point 4 says that syscalls lead to lots of thread blocking
and unblocking.  Does your plan address that?

Yes, spinning is going to address this.
For example, consider the following scenario.
1. M1 returns from a syscall, finds that all P's are busy and spins waiting for a P.
2. M2 enters a syscall, returns its P to the list of idle P's, and checks that there is already a spinning M (M1), so it does not need to wake anybody.
3. M1 discovers the idle P, picks it up and continues executing Go code.
This does not involve any thread blocking/unblocking.
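
A sketch of that hand-off (idlep, spinningMs and the helper names are all made up for illustration):

// Hypothetical sketch of the scenario above; all names are illustrative.
func entersyscall(m *M) {
        idlep.push(m.p) // step 2: return the P to the idle list
        m.p = nil
        if atomic.LoadInt32(&spinningMs) > 0 {
                return // a spinning M (M1) will notice the idle P; nobody to wake
        }
        wakeOrStartM() // otherwise make sure someone will pick the P up
}

func exitsyscall(m *M) {
        if p := idlep.pop(); p != nil {
                m.p = p // got a P right away; keep running Go code
                return
        }
        startSpinning(m) // step 1: all P's busy, spin waiting for one to free up
}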


You speak of the P's as executing code.  But as I think about this I'm
finding it easier to think of the M's as executing code.  The P's become
a shared resource to manage memory, so that rather than the memory cache
being per-M, it is shared in such a way that each M has exclusive access
to a memory cache, and we only have as many memory caches as we have
simultaneously executing M's.

You are right. I will try to rephrase the doc.
 

You have arranged for an affinity for G's to P's.  But what matters more
for performance purposes is an affinity for G's to CPU's.  And while a P
loosely represents a CPU, the kernel doesn't know about that.  It seems
that we also need an affinity for M's to P's, so that G's tend to wind
up executing on the same set of M's.

The purpose of G to P affinity is merely to eliminate synchronization on some fast paths.
A deeper support for affinity would be useful, I've noted it in "Potential Further Improvements":
4. Better locality of G-to-P. Try to enqueue an unblocked G to a P on which it was last running.
5. Better locality of P-to-M. Try to execute P on the same M it was last running.
 But I do not want to tackle it right now.


As you note in point 2, currently
G's often move between M's unnecessarily.  Can we avoid that?

I think to some degree, yes. But it may be a trade-off decision. Consider that M1 makes G1 runnable, and G1 was previously running on M2. Where do we schedule G1? On M1 or on M2? It may be that M1 also has data for G1 in its cache (it woke G1 up, so they are related in some way). Moreover, scheduling on M1 is plain cheaper, because it's local.

 

Is there some reasonable way that we could permit a G entering a syscall
to stay on the same M and P, and only have the P start running a new M
and G if the syscall does not return immediately?

It's somewhat problematic. If we do not reuse the P immediately, then we waste resources. And how do we determine that the syscall is not going to return immediately? Using timers? I need to think more on this. But what would clearly work is to mark some syscalls as non-blocking, so that we do not need to do anything for them.

 
Minor note: "Some variables form M are" s/form/from/ .

Done.


Btw, Ian, how does gccgo runtime relate to gc runtime? Is there some semi-automatic process of porting gc runtime changes to gccgo runtime? Is it done manually? Do they live completely independently?

Ian Lance Taylor

Jun 4, 2012, 3:28:02 PM
to Dmitry Vyukov, golang-dev
Dmitry Vyukov <dvy...@google.com> writes:

> Btw, Ian, how does gccgo runtime relate to gc runtime? Is there some
> semi-automatic process of porting gc runtime changes to gccgo runtime? Is
> it done manually? Do they live completely independently?

There is a semi-automatic process of porting gc changes to gccgo
changes. I have a script that merges changes made to the gc runtime
into the gccgo runtime. I run it periodically, and clean up the merge
conflicts.

Ian

Daniel Morsing

Jun 5, 2012, 3:53:36 AM
to golan...@googlegroups.com, Ian Lance Taylor
On Monday, June 4, 2012 8:24:00 PM UTC+2, Dmitry Vyukov wrote:
Yes, spinning is going to address this.
For example, consider the following scenario.
1. M1 returns from a syscall, finds that all P's are busy and spins waiting for a P.
2. M2 enters a syscall, returns its P to the list of idle P's, and checks that there is already a spinning M (M1), so it does not need to wake anybody.
3. M1 discovers the idle P, picks it up and continues executing Go code.
This does not involve any thread blocking/unblocking.

What happens in this case if no other M is going to enter a syscall for a while? Does it put the G back on the runnable list? Would this affect how long a syscall takes?

Regards,
Daniel Morsing

Dmitry Vyukov

Jul 3, 2012, 3:40:52 AM
to golang-dev
A quick update on this.

Here is what I am working on now:
It introduces a notion of P (Processors) into the runtime, requires each M (thread) running Go code to have an associated P, and moves MCache to P (so that there are exactly GOMAXPROCS MCaches instead of an unbounded number).

Once it is ready I will ask you to test it. But it's not yet ready.
As for review, I need to wait for Russ anyway, right?

The next significant step will be to introduce per-P mutexes and make runqueues distributed across P's. Everything else is trivial.

There will be some deviations in the design from the original plan.
Instead of having a single list of idle P's (where a thread pushes its P onto the list when it enters a syscall and pops one off when it exits), each P will just have a state: idle or busy. On syscall entry the thread marks its P as idle, and on exit it tries to re-acquire the same P (so there is some notion of affinity). Of course, in some cases a thread will just have to iterate over all P's to find an idle one.
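
A sketch of those state transitions under this scheme (the status field and helpers are illustrative, not the actual code):

// Hypothetical sketch; assumes P has an int32 status field.
const (
        pBusy int32 = iota
        pIdle
)

func entersyscall(m *M) {
        atomic.StoreInt32(&m.p.status, pIdle) // the P is now up for grabs
}

func exitsyscall(m *M) {
        // Fast path with affinity: try to take back the same P.
        if atomic.CompareAndSwapInt32(&m.p.status, pIdle, pBusy) {
                return
        }
        // Someone else took it; scan all P's for another idle one.
        m.p = nil
        for _, p := range allp {
                if atomic.CompareAndSwapInt32(&p.status, pIdle, pBusy) {
                        m.p = p
                        return
                }
        }
        handoffAndPark(m) // no idle P: queue the G and park this M
}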
Instead of several spinning threads, there will be exactly one spinning (and looking for work) thread. But it will be present at all times: when it finds some work to do, it re-spawns another thread to play its role. This spinning thread may/will also help with goroutine preemption: basically, it will scan all P's and check whether each one has been in a syscall for too long (so presumably blocked) or has been executing the same goroutine for too long (time to send a signal to preempt).
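
A sketch of what that background thread might look like (all names hypothetical):

// Hypothetical sketch of the single spinning/monitoring thread.
func monitor() {
        for {
                if g := findWork(); g != nil {
                        startSpinner() // hand the monitoring role to a fresh thread
                        execute(g)     // ...and become an ordinary worker
                        return
                }
                for _, p := range allp {
                        switch {
                        case inSyscallTooLong(p):
                                retake(p) // owning M presumably blocked: hand the P to another M
                        case runningSameGTooLong(p):
                                preempt(p) // signal the goroutine to yield
                        }
                }
                osyield()
        }
}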

Jim Whitehead II

Jul 17, 2012, 4:42:23 AM
to golan...@googlegroups.com
Dmitry,

Everything that I see so far looks great. I'm wondering if you've taken a look at CCSP, the portable CSP-based runtime that lies behind the KRoC occam-pi compiler. From a quick glance at one of the papers [1] it seems there is quite a lot in common between their approach and yours. I just wanted to bring it to your attention in case you hadn't seen it before.

Although occam is a more rigorous language than Go is, they have managed to create a fast scheduler that works quite well for large systems. Right now the Go scheduler is an order of magnitude slower in switching between goroutines/processes:

Commstime(occam) starting ...

Last value received = 1000015
Time = 185660 microsecs
Time per loop = 185 nanosecs
Context switch = 23 nanosecs

Commstime(Go) starting...

Last value received = 1000015
Time = 1.709724s
Time per loop = 1709 nanosec
Context switch = 213 nanosec

Perhaps there is something more they can bring to the table. If you'd like to get in touch with any of the authors, please let me know and I'll hunt them down for you.

- Jim
