Concurrency and task scheduler scalability


Jesse Towner

Jul 21, 2011, 8:06:23 PM
to golan...@googlegroups.com
Hello folks,

Recently, it was brought to my attention by a fellow compatriot that Go
didn't detect the number of logical processor cores available to the
host process, and instead relied on the end-user or administrator to
specify the maximum number of concurrent goroutines with the GOMAXPROCS
environment variable. Now, don't get me wrong, configuration is good,
but requiring it when a good default can be determined automatically
struck me as rather silly. So I decided to take a closer look at the
source code. I then saw that the existing task scheduler merely uses a
single global job queue and does a lot of locking and serialization,
forks a whole new thread for each goroutine, doesn't create additional
worker threads when existing ones are blocking, and doesn't do
load-balancing, among other dastardly things.

We got started on thinking about improving the state of things there,
and decided we'd just start on doing a proof of concept to detect the
number of available logical cpu cores and use that when GOMAXPROCS isn't
defined. I have it running on Linux x86/x64. You can see the changeset
in my clone here:
http://code.google.com/r/jesset-go-concurrency/source/detail?r=d3d6a55a48fce98f9fff62915f11ec01a9f30945

Now, the way we've done it doesn't really fix much of the overall
problem, because of the way the current scheduler is implemented. It has
no way of ensuring that, say, all 8 cores on an 8-core machine are
working all of the time, because some of the goroutines on those threads
may block for a time, and once the scheduler hits runtime·gomaxprocs, it
stops creating new workers. So I can see GOMAXPROCS actually being
assigned a much larger value in practice to ensure over-subscription of
workers to cores.

Ideally, you'd want a locality-aware scheduler capable of load-balancing
using work-stealing and/or work-requesting, in which case you'd want to
detect a lot more than simply the number of available cores--you'd want
to detect all of the NUMA nodes available on the machine as well, which
my current approach doesn't do.

I then thought I'd post about it on the mailing list, to see if our
general approach is somewhat valid, what other people might be working
on, where we could take this, etc. So it was to my surprise that I saw
none other than Dmitriy Vyukov already starting to work on some of
these problems!

In the interest of not duplicating any work, I'd be interested in
helping out on this effort, if at all possible. I'm not sure where
you're at with things, Dmitry, but perhaps I could continue with
implementing machine CPU and NUMA node configuration/detection on
various platforms, if that's not yet complete. Does that make sense?

- Jesse

Ian Lance Taylor

Jul 21, 2011, 9:33:35 PM
to Jesse Towner, golan...@googlegroups.com
Jesse Towner <jes...@lavabit.com> writes:

> So
> I decided to take a closer look at the source code. I then saw that
> the existing task scheduler merely uses a single global job queue and
> does a lot of locking and serialization, it forks a whole new thread
> for each goroutine, doesn't create additional worker threads when
> there are ones which are blocking, doesn't do load-balancing, among
> other dastardly things.

We plan to improve the scheduler, but your description isn't wholly
accurate. The scheduler does not fork a new thread for each goroutine.
It multiplexes goroutines onto a pool of threads. GOMAXPROCS sets the
number of goroutines which will run simultaneously on threads, not
counting goroutines waiting on a system call or running C code.


> Now, the way we've done it doesn't really fix too much of the overall
> problem because of the way the current scheduler is implemented. It
> has no way of ensuring that say all 8 cores on an 8 core machine are
> working all of the time, because some of the goroutines on those
> threads may block for a time, and once the scheduler hits
> runtime·gomaxprocs, it stops creating new workers. So I can see
> GOMAXPROCS actually being assigned a much larger value in practice to
> ensure over-subscription of workers to cores.

Goroutines which are blocked don't count against GOMAXPROCS.


> In the interest of not duplicating any work, I'd be interested in
> helping out on this effort if at all possible? I'm not sure where
> you're at with things, Dmitry, but perhaps I could continue with
> implementing machine CPU and NUMA node configuration/detection on
> various platforms if that's not yet complete, if that makes sense?

I'm not sure what Dmitry is going to work on, but I think that would be
helpful in any case. Thanks.

Ian

Kyle Lemons

Jul 21, 2011, 9:36:42 PM
to Jesse Towner, golan...@googlegroups.com
Recently, it was brought to my attention by a fellow compatriot that Go didn't detect the number of logical processor cores available to the host process
I don't think this was an oversight.  As supported by your own points, the runtime does not perform better with GOMAXPROCS>1 in a majority of cases.
 
and rather relied on the end-user or administrator to specify the maximum number of concurrent goroutines with the GOMAXPROCS environment variable. Now, don't get me wrong configuration is good, but requiring it when it's possible to determine a good default automatically I thought was rather silly. So I decided to take a closer look at the source code. I then saw that the existing task scheduler merely uses a single global job queue and does a lot of locking and serialization, it forks a whole new thread for each goroutine
Oh?  I am fairly certain it does not fork a new thread for each goroutine.  Blocking system calls, I believe, are another story, but I don't think they obey GOMAXPROCS anyway...  I could, however, be wrong.  My understanding of the various parts of the runtime varies widely, and even then I've only looked at gc.
 
doesn't create additional worker threads when there are ones which are blocking, doesn't do load-balancing, among other dastardly things.
Go is being developed by a small team, and I (for one) think their efforts are nothing short of heroic.  The current runtime still allows you to get a lot of the benefits of multithreading without actually running on multiple threads (10 http requests in goroutines go take almost 1/10 the time of those same 10 requests in series), and so I don't think this has been high on anyone's priority list when compared to things like getting a fully-featured standard library, bringing Go to AppEngine, improving the garbage collector, wrangling bugs, and pelting one another with small, furry, blue, goggle-clad gophers.
 
We got started on thinking about improving the state of things there, and decided we'd just start on doing a proof of concept to detect the number of available logical cpu cores and use that when GOMAXPROCS isn't defined. I have it running on Linux x86/x64. You can see the changeset in my clone here: http://code.google.com/r/jesset-go-concurrency/source/detail?r=d3d6a55a48fce98f9fff62915f11ec01a9f30945
The GOMAXPROCS autodetection issue is probably the smallest piece of the "next generation" runtime that can do all of the things you list above.  As the documentation itself says, GOMAXPROCS is just there because the scheduler can't make such decisions itself quite yet, and will go away when it can.
 
Now, the way we've done it doesn't really fix too much of the overall problem because of the way the current scheduler is implemented. It has no way of ensuring that say all 8 cores on an 8 core machine are working all of the time, because some of the goroutines on those threads may block for a timk, and once the scheduler hits runtime·gomaxprocs, it stops creating new workers. So I can see GOMAXPROCS actually being assigned a much larger value in practice to ensure over-subscription of workers to cores.
I think talking about oversubscription is putting the cart before the horse a bit...  The current scheduler will move a goroutine around willy-nilly, which almost always results in a slower program than running single-threaded; in the cases where it doesn't (parallel long-running, unyielding computations), you don't really benefit from the oversubscription (and may in fact be hurt by it).  The ideal scheduler you mention would probably also be hurt by oversubscription.  Whether whatever next-generation scheduler develops will require it will have to be evaluated when we get to that point.
 
Ideally, you'd want a locality-aware scheduler capable of load-balancing using work-stealing and/or work-requesting, in which case you'd want to detect a lot more than simply the number of available cores--you'd want to detect all of the NUMA nodes available on the machine as well, which my current approach doesn't do.
I am sure we would be using one if we had it :).  From my (probably skewed) understanding, there aren't a whole lot of runtimes that can do this well, and the ones that can have been around far longer than Go.  Just making "a locality-aware scheduler" isn't a piece of cake, let alone tuning the scheduling algorithms for a new language with "new" paradigms.
 
I then thought I'd post about it on the mailing list, see if our general approach is somewhat valid, what other people might be working on, and where we could take this, etc. So it was to my surprise that I saw none other than Dmitriy Vyukov already starting to work on some of these problems!
It seems like every day I check the changelists and he's managed to make my channels even faster still.  +1000   I don't know if it's possible for the scheduler to have incremental (as opposed to wholesale) improvements, but I'm sure nobody would complain if you took a crack at it.

Russ Cox

Jul 22, 2011, 12:35:55 AM
to Jesse Towner, golan...@googlegroups.com
> it forks a whole new thread for each goroutine,

Ian pointed out that this was wrong.

> doesn't create additional worker
> threads when there are ones which are blocking,

This is wrong too.

Russ

Dmitry Vyukov

Jul 22, 2011, 8:58:04 AM
to Jesse Towner, golan...@googlegroups.com
 
Hi Jesse,

Regarding the current gc scheduler: as others noted, it's not all that bad, except for one thing - support for fine-grained concurrency, which is IMHO quite important to the programming style Go encourages.

As for machine configuration, I think it's important and it would be great if you could accomplish it. I am thinking about providing a public API for querying system parameters (think sysconf()); it should provide info about the number of hardware threads, cores, processors, NUMA nodes, LLC, etc. I am skeptical that GOMAXPROCS will completely die out, well, at least not in this decade. For example, if I am creating a simple utility I may want to limit it to GOMAXPROCS=1. If I am creating a memory-throughput-bound application then I want to limit it to the number of NUMA nodes, perhaps even:
GOMAXPROCS(Sysconf(NumaNodes), AffinityScatter).
For a network server with a fair amount of inter-goroutine communication I may want to limit it to a single processor:
GOMAXPROCS(Sysconf(ThreadsPerProcessor), AffinityGather).
For a video encoding utility I may want to limit it to GOMAXPROCS=2 in order to be able to compile on another core and browse the web on the last.
Many applications would want just to say
GOMAXPROCS(Sysconf(HwThreadCount)) in main().

As for my progress, I had studied the current implementation, investigated all the requirements, and worked out a design, but had written no code. I decided to start with smaller things, including improving gotest's benchmarking ability and writing some scheduler benchmarks. Moreover, I cleaned out a lot of write-sharing in other parts of the runtime with which scaling was not possible regardless of the scheduler. Also I have a CL in flight that improves the Linux mutex implementation: it should partially fix scheduler scalability. For example, for a synthetic producer-consumer benchmark with a buffered channel of size 100 and 16 worker goroutines, it improves performance from 3394 to 448 cycles/op.
Of course it should use distributed work queues (and other resources/pools/caches), and of course it should use work-stealing. Most likely it should use LIFO in the common case (while preserving practical fairness, of course); FIFO is the best way to cool data in all caches and blow up the system with an exponential number of goroutines. As for locality awareness, I would prefer not to do it initially. There are a lot of other tricky moments to work out first.
I am not sure whether I will be able to accomplish it in the nearest future, since my manager must eventually say WTF you are doing?!!1!


Russ Cox

Jul 22, 2011, 9:05:16 PM
to golan...@googlegroups.com
It is important to note that the goal here is
not to write an operating system.

Martin Capitanio

Jul 23, 2011, 1:03:53 AM
to golan...@googlegroups.com, r...@golang.org
{{range ·nuts}}
>It is important to note that the goal here is
>not to write an operating system.
Affirmative. A question remains: who is doing what.
{{end}}

Dmitry Vyukov

Jul 26, 2011, 9:34:43 AM
to golan...@googlegroups.com, r...@golang.org
You may assume that I am doing nothing on the scheduler right now. And my next victim is going to be channels.

Dmitry Vyukov

Jul 26, 2011, 9:35:25 AM
to r...@golang.org, golan...@googlegroups.com
On Sat, Jul 23, 2011 at 5:05 AM, Russ Cox <r...@golang.org> wrote:
It is important to note that the goal here is
not to write an operating system.

Could you please elaborate on what exactly you mean?

John Arbash Meinel

Jul 26, 2011, 9:40:34 AM
to Dmitry Vyukov, r...@golang.org, golan...@googlegroups.com

Operating systems have a long history of tweaking the scheduling
routines to handle lots of concurrent processes. I'm guessing Martin's
comment is that yes, you can do a lot for handling scheduling, but maybe
you should let your OS do its job as much as possible.

John
=:->


John Asmuth

Jul 26, 2011, 9:48:12 AM
to golan...@googlegroups.com, Dmitry Vyukov, r...@golang.org


On Tuesday, July 26, 2011 9:40:34 AM UTC-4, John Arbash Meinel wrote:

Operating systems have a long history of tweaking the scheduling
routines to handle lots of concurrent processes. I'm guessing Martin's
comment is that yes, you can do a lot for handling scheduling, but maybe
you should let your OS do its job as much as possible.

The OS has no jurisdiction over the assignment of goroutines to processes, and can never do anything there.

It makes a lot of sense to, for instance, group goroutines based on common memory access, especially for multicore systems, since then the common data can be loaded into that cpu's cache.

I wonder if it'd be possible for the runtime to detect common memory overlap between goroutines, and use that to assign them to processes?

I could be wrong - my understanding of this area of computer science is pretty limited.

Russ Cox

Jul 26, 2011, 11:01:28 AM
to Dmitry Vyukov, golan...@googlegroups.com

It's easy to let these things get arbitrarily complicated,
and you wake up one day and runtime/proc.c has turned
into its own very complex operating system that runs on
top of another very complex operating system.
One of my goals is to keep that from happening,
to keep the Go implementation working well with the
host operating system instead of trying to replace it.

Russ

Carl

Jul 27, 2011, 5:40:07 AM
to golang-nuts


On 26 Jul., 17:01, Russ Cox <r...@golang.org> wrote:
I had to think about that one...

I don't know for sure, but I get the impression that most of the
development effort today in operating systems is going into just
barely supporting whatever new hardware is hitting the market. And
scalability continues to be an issue.

IMO, if someone can vastly improve the performance of a core system
by developing some code that augments the way that core system works,
then I would say by all means go for it!

I agree, though, that it's a question of economics: how much bang you
get for your development buck matters, whether someone else has already
done it and that work can be reused, avoiding duplication of effort,
etc.

Dmitry Vyukov

Jul 28, 2011, 5:37:45 AM
to r...@golang.org, golan...@googlegroups.com
Then gc should just use a thread per goroutine and wait for OS thread performance, scalability, and memory consumption to get better. Why doesn't it?

Sindre Myren

Sep 3, 2011, 2:00:33 PM
to golang-nuts


On Jul 22, 2:58 pm, Dmitry Vyukov <dvyu...@google.com> wrote:

> As for my progress, I had studied current implementation, investigate all
> the requirements, worked out a design, but had written no code.

Hi, Dmitry. Do you have any details on this design lying around? Or
could you share some insight into how exactly the current Google Go
scheduler works, maybe a bit deeper than what can be easily extracted
from the source code comments? The reason I ask is that I have a
student project (from now until December) where I would very much like
to investigate how exactly the scheduler works, and what pieces of it
would have to be replaced/patched in order to achieve better real-time
functionality and control. I.e., get a scheduler more suited for
real-time computing (http://en.wikipedia.org/wiki/Real-time_computing).

My ultimate goal is to write a project report describing how Google
Go does as a real-time programming language. The project is due at
the end of December.

-- Sindre

bflm

Sep 4, 2011, 2:43:35 AM
to golan...@googlegroups.com
On Saturday, September 3, 2011 8:00:33 PM UTC+2, Sindre Myren wrote:
My ultimate goal, is to write a project report, describing how Google
Go does it as a real-time programming language.

Go is not a real-time programming language. End of report.
 
The project is due in
the end of December.

Nearly 3 months saved. 

Russel Winder

Sep 4, 2011, 3:32:54 AM
to golan...@googlegroups.com
On Sat, 2011-09-03 at 23:43 -0700, bflm wrote:
> On Saturday, September 3, 2011 8:00:33 PM UTC+2, Sindre Myren wrote:
> My ultimate goal, is to write a project report, describing how
> Google
> Go does it as a real-time programming language.

> Go is not a real-time programming language. End of report.

And Java isn't a real-time programming language?

Is C a real-time programming language?

> The project is due in
> the end of December.
>
>
> Nearly 3 months saved.

Project failed, though, due to lack of content. Remember, the OP is a
student on a course, not a contract programmer trying to complete a
fixed-price contract.

--
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel...@ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: rus...@russel.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder


Sindre Myren

Sep 4, 2011, 5:04:55 AM
to Russel Winder, golan...@googlegroups.com
2011/9/4 Russel Winder <rus...@russel.org.uk>:

> On Sat, 2011-09-03 at 23:43 -0700, bflm wrote:
>> On Saturday, September 3, 2011 8:00:33 PM UTC+2, Sindre Myren wrote:
>>         My ultimate goal, is to write a project report, describing how
>>         Google
>>         Go does it as a real-time programming language.
>
>> Go is not a real-time programming language. End of report.
>
> And Java isn't a real-time programming language?
>
> Is C a real-time programming language?
>
>>         The project is due in
>>         the end of December.
>>
>>
>> Nearly 3 months saved.
>
> Project failed though due to lack of content.  Remember OP is a student
> on a course, not a contract programmer trying to complete a fixed price
> contract.
>

That's true, Russel. If Google Go is worse suited for RT programming
than, say, C, Java, and Ada, I would still need to explain why that is
the case. Even I understand that using a language as young as Go in an
industrial hard RT-computing case is not a good idea. But as an
academic, I can still study its potential :)

> --
> Russel.

-- Sindre

John Arbash Meinel

Sep 4, 2011, 5:15:28 AM
to Sindre Myren, Russel Winder, golan...@googlegroups.com

...

>> Project failed though due to lack of content. Remember OP is a
>> student on a course, not a contract programmer trying to complete
>> a fixed price contract.
>>
>
> That's true, Russel. If Google Go is worse suited for RT programming
> than, say, C, Java, and Ada, I would still need to explain why that
> is the case. Even I understand that using a language as young as Go
> in an industrial hard RT-computing case is not a good idea. But as
> an academic, I can still study its potential :)
>

At a minimum, you have a garbage-collected language, which makes it
hard to be hard RT. Java has a similar problem. It is possible to
switch out which collector runs, and there are certainly a lot of
possibilities.
There is a pretty interesting series about the "Recycler", which uses
reference counting with lazy evaluation and generations to enable
minimal interruptions:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.1065

I'm not sure how much of it applies to a language without a VM and
with cooperative coroutines. They could be made preemptive, but really,
how much do you have to change before it is good enough? And how much
can you call the final thing "Go"?

I wouldn't say it is "worse" than Java at a theory level, but there
has certainly been a lot more *implementation*-level effort put into
making Java work better on RT systems.

John
=:->


Dmitry Vyukov

Sep 4, 2011, 5:30:43 AM
to Sindre Myren, golang-nuts
On Sat, Sep 3, 2011 at 10:00 PM, Sindre Myren <smy...@gmail.com> wrote:


On Jul 22, 2:58 pm, Dmitry Vyukov <dvyu...@google.com> wrote:

> As for my progress, I had studied current implementation, investigate all
> the requirements, worked out a design, but had written no code.

Hi, Dmitry. Do you have any details on this design lying around?

No, I don't.
 
Or
could you share some insight on how exactly the current google go
scheduler works, maybe a bit deeper then what can be easily extracted
from the source code comments?

I believe the source code is the best documentation.
However, I am ready to validate your understanding and/or answer some specific questions.


Paulo Pinto

Sep 4, 2011, 5:37:59 AM
to golang-nuts
In Java's case there is a specification for RT support and quite
a few implementations of it. How good they are I cannot say.

Currently that is not the case for Go. I would imagine that we would
need to have a Go version where more control over the GC and
scheduling is made available.

GC languages can be used in RT situations, as long as they offer the
tools to meet the RT constraints of the environment.

--
Paulo

ritesh

Sep 8, 2011, 2:08:17 PM
to golan...@googlegroups.com, Dmitry Vyukov, r...@golang.org
IMHO, there are a few reasons why language-supported task scheduling can be better than OS-managed task scheduling:
  - the OS has little idea of the amount of CPU context that needs to be saved with every task (goroutine here). The compiler knows more and can reduce the saved machine state when switching between tasks running on the same thread. It's hard to reason about the savings, but CPUs these days have a tremendous amount of state, and saving all of it would push programmers to work around it with larger amounts of compute per task, so they may not use tasks for the purpose of easing development, which is what they are intended for.
  - a thread switch needs a context switch into the kernel, which a user-space task switch can avoid.
  - threads on some platforms can take up a lot of memory resources. It is possible to make tasks more lightweight without depending on the OS implementation to make threads lightweight.
  - the OS has little idea of how tasks may be related to each other. The language/programmer can, on the other hand, organize tasks so that they have more temporal locality. Distributing such tasks to FIFO work-stealing queues is a common method of achieving such locality.

As Dmitry mentioned in an email below, distributed task queues with work-stealing seem to be a popular option among the various threadpool work-queue implementations.

_r

Sindre Myren

Sep 27, 2011, 12:41:16 PM
to Dmitry Vyukov, golang-nuts
Hi again, Dmitry.

2011/9/4 Dmitry Vyukov <dvy...@google.com>
I hope you still have the time to do this, and thank you in advance for your help -- if somebody else wants to answer, that is cool too :-)

The following explanation of how the scheduler works is not purely based on my understanding of the code, but also partly on assumptions and things I have randomly read on the mailing list - what is correct, and what is just totally wrong? To make it simple, you could just write C (correct), W (wrong) and D (don't know) after each sentence. If you feel like explaining something in detail, though, please do.

"""
\section{The Go Scheduler}

The comments in the Go scheduler code (src/runtime/proc.c) talk about Ms and
Gs. Each G is a goroutine, and each M is an OS thread. The number of Ms is
limited by the environment variable GOMAXPROCS, or by the GOMAXPROCS(int)
function. If the number of Gs is less than GOMAXPROCS, then the number of Ms
will be equal to the number of Gs (unless the number of Ms has previously been
higher - i.e. Ms are not terminated when the number of Gs declines). The default
of GOMAXPROCS is 1.  Usually, you would want to set GOMAXPROCS to the number of
CPU cores.  The Go scheduler schedules ready-to-run goroutines onto free Ms.

The ready-to-run Gs are scheduled according to the simplest algorithm there
is: cyclic execution. A goroutine is requeued only when it can no longer continue
to run - e.g. when it hits a synchronizing function (channels, mutex.wait,
etc.), a sleep function, or a system call. There is no timeout after which
goroutines are requeued.

"""

Questions:

1. What happens if you set GOMAXPROCS to something much larger than the number
of cores? E.g. to 100. Given you have 100 goroutines, would you actually get
100 OS threads (Ms)? And is this a stupid thing to do?

2. Where would one modify the code if one were to make the scheduler round-robin?
(I.e. add a timeout for how long a G could run on one M.)

3. If one wanted the scheduler to do something completely different, i.e. use
priorities and some fancy scheduling algorithm, where would one want to make
these changes?

4. Is there some other file(s) in http://golang.org/src/pkg/runtime/ besides proc.c that I should be looking at?


Thanks again for your time; I hope you find the time to answer.

-- Sindre

Dmitry Vyukov

Sep 27, 2011, 1:07:53 PM
to Sindre Myren, golang-nuts
On Tue, Sep 27, 2011 at 8:41 PM, Sindre Myren <smy...@gmail.com> wrote:
Hi again, Dmitry.

2011/9/4 Dmitry Vyukov <dvy...@google.com>
On Sat, Sep 3, 2011 at 10:00 PM, Sindre Myren <smy...@gmail.com> wrote:


On Jul 22, 2:58 pm, Dmitry Vyukov <dvyu...@google.com> wrote:

> As for my progress, I had studied current implementation, investigate all
> the requirements, worked out a design, but had written no code.

Hi, Dmitry. Do you have any details on this design lying around?

No, I don't.
 
Or
could you share some insight on how exactly the current google go
scheduler works, maybe a bit deeper then what can be easily extracted
from the source code comments?

I believe the source code is the best documentation.
However, I am ready to validate your understanding and/or answer some specific questions.


I hope you still have the time to do this, and thank you in advance for your help -- if somebody else wants to answer, that is cool too :-)

The following explanation of how the scheduler works is not purely based on my understanding of the code, but also partly on assumptions and things I have randomly read on the mailing list - what is correct, and what is just totally wrong? To make it simple, you could just write C (correct), W (wrong) and D (don't know) after each sentence. If you feel like explaining something in detail, though, please do.

"""
\section{The Go Scheduler}

The comments in the Go scheduler code (src/runtime/proc.c) talk about Ms and
Gs. Each G is a goroutine, and each M is an OS thread.

It is basically impossible to figure out from the code, and I don't remember how I figured it out. But it seems that M refers to Machine. You may provide that expansion in brackets; M was always confusing for me.

 
The number of Ms is
limited by the environment variable GOMAXPROCS, or by the GOMAXPROCS(int)
function.

Not quite. What GOMAXPROCS limits is the number of Ms actively running Go code. You can have 1000 additional Ms sitting in blocking syscalls or running cgo code.


 
If the number of Gs is less than GOMAXPROCS, then the number of Ms
will be equal to the number of Gs (unless the number of Ms has previously been
higher - i.e. Ms are not terminated when the number of Gs decline). The default
of GOMAXPROCS is 1.  Usually, you would want to set GOMAXPROCS to the number of
CPU cores.  The Go scheduler schedules ready-to-run goroutines to free Ms.

The ready-to-run Gs are scheduled according to the simplest algorithm there
is: cyclic execution. A goroutine is requeued only when it can no longer continue
to run - e.g. when it hits a synchronizing function (channels, mutex.wait,
etc.), a sleep function or a system call. There is no timeout where the
goroutines are requeued.

Not quite. Goroutines are requeued when they are ready to run. E.g. if a goroutine blocks on a mutex, then it is not in the queue (it is somewhere else, like the mutex wait list). When the mutex is unlocked, the G is re-enqueued onto the scheduler's runnable queue.

 

"""

Questions:

1 What happens if you set GOMAXPROCS to something much larger than the number
of cores? E.g. to 100. Given you have 100 goroutines, would you actually get
100 OS Threads (Ms)?

yes
 
And is this a stupid thing to do?

It depends. Slight oversubscription of processors with threads usually has no visible negative effect on performance; on the other hand, it can help mask page faults, etc.


 

2 Where would one modify the code if one was to make the scheduler Round-robin?
(I.e. add a timeout for how long a G could run on one M).

I think the compiler must emit periodic checks (whether the G has exhausted its timeslice), so the main place to modify is the compiler.
 

3 If one wanted the scheduler to do something completely different, i.e. use
priorities, and some fancy scheduling algorithm - where would one want to make
these changes?

src/pkg/runtime/proc.c
 

4. Is there some other file(s) in http://golang.org/src/pkg/runtime/ besides proc.c that I should be looking at?

The scheduler is mostly situated in proc.c. Potentially you could look into ./linux/thread.c, but that is mostly low-level stuff, like how the runtime creates threads, how the runtime provides mutual exclusion, etc.

Kyle Lemons

Sep 27, 2011, 1:16:28 PM
to Sindre Myren, Dmitry Vyukov, golang-nuts
My answers are also not based on expert knowledge, but things I've picked up from reading the mailing list and CLs and some poking in the code.  I figure I'll provide my insight, though Dmitry probably has more grounded answers.
 
\section{The Go Scheduler}

The comments in the Go scheduler code (src/pkg/runtime/proc.c) talk about Ms and
Gs. Each G is a goroutine, and each M is an OS thread.
I believe that's the case.
 
The number of Ms is
limited by the environment variable GOMAXPROCS, or by the GOMAXPROCS(int)
function.
Hmm, I think this is incorrect.  The number of OS threads can (and usually does) exceed GOMAXPROCS, because blocking syscalls are allowed to execute on their own thread.

If the number of Gs is less than GOMAXPROCS
This is rare in any non-trivial, idiomatic Go program.
 
then the number of Ms will be equal to the number of Gs (unless the number of Ms has previously been
higher - i.e. Ms are not terminated when the number of Gs decline).
The number of Ms and the number of Gs should be largely irrelevant to you as a programmer.  The thing that matters is that GOMAXPROCS controls how many Ms are allowed
to *execute* at a time, and the scheduler makes very simplistic decisions about which available G it chooses to put on each available M.
 
The default of GOMAXPROCS is 1.
Only for now.  I think the goal is to ditch GOMAXPROCS entirely when the scheduler becomes smart enough to not only detect the number of cores/cpus it has to run, but to make good decisions about when to utilize them.
 
Usually, you would want to set GOMAXPROCS to the number of CPU cores.
This is a dangerous generalization.  There are many programs which will run faster on GOMAXPROCS=1 than GOMAXPROCS>1.  In general, you should only start futzing with GOMAXPROCS and GOGC and all of those things when you're trying to improve performance, and only when you have an idea of how they'll affect the program.
 
The Go scheduler schedules ready-to-run goroutines to free Ms.
Up to GOMAXPROCS, yeah.

The ready-to-run Gs are scheduled according to the simplest algorithm there
is: cyclic execution.
I'd call it a queue.
 
A goroutine is requeued only when it can no longer continue
to run - e.g. when it hits a synchronizing function (channels, mutex.wait,
etc.), a sleep function or a system call.
Sleep is a system call :).  It's also important to note that there are things some people might not realize use system calls, like printf.
 
There is no timeout after which
goroutines are requeued.
I'm not entirely sure what this means... but there is no preemption, and no attempt by the scheduler to reawaken deadlocked goroutines.
 

"""

Questions:

1 What happens if you set GOMAXPROCS to something much larger than the number
of cores? E.g. to 100. Given you have 100 goroutines, would you actually get
100 OS Threads (Ms)? And is this a stupid thing to do?
The same thing that happens when you spawn more threads than cores in a normal programming language: the OS gets to multiplex them (the OS is still multiplexing Go's threads too, of course, hence the M > GOMAXPROCS trick).

HTH,
~K

Alexey Borzenkov

Sep 27, 2011, 3:23:01 PM
to golang-nuts
On Tue, Sep 27, 2011 at 10:11 PM, Alexey Borzenkov <sna...@gmail.com> wrote:

> On Sep 27, 2011 9:08 PM, "Dmitry Vyukov" <dvy...@google.com> wrote:
>>> 2 Where would one modify the code if one was to make the scheduler
>>> Round-robin?
>>> (I.e. add a timeout for how long a G could run on one M).
>> I think the compiler must emit periodic checks (whether the G has
>> exhausted its timeslice), so the main place to modify is the compiler.
> I don't think it's a job for the compiler. What I think should be done is
> for a thread (if my memory serves correctly there is some additional thread
> for something, otherwise there needs to be) to periodically check queues for
> stalling, by checking e.g. last scheduling timestamp. If there's no
> scheduling for some time, it should effectively increase gomaxprocs.
> Compiler checking would need checkpoints, and obvious checkpoints like
> function prologue would not only be wasteful, but miss tight busy loops
> without any function calls. What do you think about such a watchdog thread?

Or maybe not. It turns out there's no spare thread, and creating a
thread just for the watchdog would not be easy; it would be hard to make
it anything but a fifth wheel with a hit on performance. At
least on unix one could probably use SIGALRM for making these checks,
except user code might want to use it too. Bummer.

Dmitry Vyukov

Sep 27, 2011, 4:45:09 PM
to Alexey Borzenkov, golang-nuts
On Tue, Sep 27, 2011 at 11:23 PM, Alexey Borzenkov <sna...@gmail.com> wrote:
On Tue, Sep 27, 2011 at 10:11 PM, Alexey Borzenkov <sna...@gmail.com> wrote:
> On Sep 27, 2011 9:08 PM, "Dmitry Vyukov" <dvy...@google.com> wrote:
>>> 2 Where would one modify the code if one was to make the scheduler
>>> Round-robin?
>>> (I.e. add a timeout for how long a G could run on one M).
>> I think the compiler must emit periodic checks (whether the G has
>> exhausted its timeslice), so the main place to modify is the compiler.
 
> I don't think it's a job for the compiler. What I think should be done is
> for a thread (if my memory serves correctly there is some additional thread
> for something, otherwise there needs to be) to periodically check queues for
> stalling, by checking e.g. last scheduling timestamp. If there's no
> scheduling for some time, it should effectively increase gomaxprocs.
> Compiler checking would need checkpoints, and obvious checkpoints like
> function prologue would not only be wasteful, but miss tight busy loops
> without any function calls. What do you think about such a watchdog thread?

Well, I understand the negative impact of the checks on performance (the checks definitely must be placed in loops as well).

However, note that we want preemption not only to provide fair scheduling, but also to stop all threads for garbage collection (otherwise you get the same deadlocks: one thread requests garbage collection, another thread is in an infinite busy loop, and the system is stalled).
Moreover, that would increase the number of threads (and thus big thread stacks) without bound.

The solution may be as follows.
Send a signal to hung threads. The signal handler checks that the thread is not in a critical region (e.g. m->locks==0), then saves the current goroutine context and starts a new goroutine on the same thread. If GC is pending, the handler just stops the thread.
On Windows, signals can be emulated as follows:
SuspendThread(), save the context, modify the context so that RIP points to the handler, ResumeThread(); the handler does its work and restores the context.

 
Or maybe not. Turns out there's no thread wasting, and creating a
thread just for the watchdog would not be easy, would be hard to make
it anything but a sticking out fifth leg with a hit on performance. At
least on unix one could probably use SIGALRM for making these checks,
except user code might want to use them too. Bummer.

I don't think that additional thread is a problem. Or we can reuse profiling signal.

Anthony Martin

Sep 27, 2011, 6:44:14 PM
to Dmitry Vyukov, Sindre Myren, golang-nuts
Dmitry Vyukov <dvy...@google.com> once said:
> It is basically impossible to figure out from the code, and I don't
> remember how I figured that out. But it seems that M refers to
> Machine. You may provide that transcript in brackets, M was always
> confusing for me.

In the Plan 9 kernel each processor has an associated Mach
structure at a specific address. This structure contains,
among other things, a queue of readied procs and a pointer
to the currently executing proc. When the kernel wants to
refer to "this processor's Mach structure", it indirects
through a per-cpu global named "m".

This is where the name comes from.

For the curious, the Go runtime and the Plan 9 kernel use
the same mechanism for "per-cpu" variables. The C compilers
implement a special storage class (on machines with enough
registers) known as "extern register".

Cheers,
Anthony
