Recently, it was brought to my attention by a fellow compatriot that Go
didn't detect the number of logical processor cores available to the
host process, and instead relied on the end user or administrator to
specify the maximum number of concurrent goroutines with the GOMAXPROCS
environment variable. Now, don't get me wrong, configuration is good,
but requiring it when it's possible to determine a good default
automatically struck me as rather silly. So I decided to take a closer
look at the source code. I then saw that the existing task scheduler
merely uses a single global job queue and does a lot of locking and
serialization; it forks a whole new thread for each goroutine, doesn't
create additional worker threads when there are ones which are blocking,
doesn't do load-balancing, among other dastardly things.
We started thinking about improving the state of things there, and
decided to begin with a proof of concept that detects the number of
available logical CPU cores and uses that when GOMAXPROCS isn't defined.
I have it running on Linux x86/x64. You can see the changeset in my
clone here:
http://code.google.com/r/jesset-go-concurrency/source/detail?r=d3d6a55a48fce98f9fff62915f11ec01a9f30945
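(For illustration only, here is a minimal user-level sketch of the same
idea, not the runtime changeset linked above; it uses runtime.NumCPU and
runtime.GOMAXPROCS as they exist in the standard library today: if the
GOMAXPROCS environment variable isn't set, fall back to the detected
logical CPU count.)

package main

import (
    "fmt"
    "os"
    "runtime"
)

func main() {
    // Only apply a detected default when nothing was configured explicitly.
    if os.Getenv("GOMAXPROCS") == "" {
        runtime.GOMAXPROCS(runtime.NumCPU())
    }
    // GOMAXPROCS(0) reports the current setting without changing it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}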
Now, the way we've done it doesn't really fix too much of the overall
problem, because of the way the current scheduler is implemented. It has
no way of ensuring that, say, all 8 cores on an 8-core machine are
working all of the time, because some of the goroutines on those threads
may block for a time, and once the scheduler hits runtime·gomaxprocs, it
stops creating new workers. So I can see GOMAXPROCS actually being
assigned a much larger value in practice to ensure over-subscription of
workers to cores.
Ideally, you'd want a locality-aware scheduler capable of load-balancing
using work-stealing and/or work-requesting, in which case you'd want to
detect a lot more than simply the number of available cores--you'd want
to detect all of the NUMA nodes available on the machine as well, which
my current approach doesn't do.
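To make the work-stealing part concrete, here is a deliberately tiny,
generic sketch of the pattern (this is not the Go runtime's scheduler,
and it ignores locality and NUMA entirely; the worker/push/take names
are made up for the example): each worker prefers its own queue and,
when that runs dry, takes work from the front of a random victim's
queue.

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// worker owns a private queue of tasks; an idle worker steals from others.
type worker struct {
    mu    sync.Mutex
    tasks []func()
}

func (w *worker) push(t func()) {
    w.mu.Lock()
    w.tasks = append(w.tasks, t)
    w.mu.Unlock()
}

// take removes a task from one end of the queue: the back for the owner,
// the front for a thief. It reports false when the queue is empty.
func (w *worker) take(front bool) (func(), bool) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if len(w.tasks) == 0 {
        return nil, false
    }
    var t func()
    if front {
        t, w.tasks = w.tasks[0], w.tasks[1:]
    } else {
        t, w.tasks = w.tasks[len(w.tasks)-1], w.tasks[:len(w.tasks)-1]
    }
    return t, true
}

func main() {
    const nworkers = 4
    workers := make([]*worker, nworkers)
    for i := range workers {
        workers[i] = &worker{}
    }

    // Deliberately unbalanced load: every task starts on worker 0.
    for i := 0; i < 16; i++ {
        i := i
        workers[0].push(func() { fmt.Println("ran task", i) })
    }

    var wg sync.WaitGroup
    for id := 0; id < nworkers; id++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for {
                t, ok := workers[id].take(false) // local work first
                if !ok {
                    t, ok = workers[rand.Intn(nworkers)].take(true) // then steal
                }
                if !ok {
                    return // nothing local and the steal failed: stop (toy termination)
                }
                t()
            }
        }(id)
    }
    wg.Wait()
}

A real scheduler would use cheaper per-worker deques and steal
preferentially from nearby cores, which is where the NUMA information
mentioned above would come in.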
I then thought I'd post about it on the mailing list, to see if our
general approach is somewhat valid, what other people might be working
on, where we could take this, etc. So it was to my surprise that I saw
none other than Dmitriy Vyukov already starting to work on some of these
problems!
In the interest of not duplicating any work, I'd be interested in
helping out on this effort if at all possible. I'm not sure where you're
at with things, Dmitriy, but perhaps I could continue with implementing
machine CPU and NUMA node configuration/detection on various platforms
if that's not yet complete, if that makes sense?
- Jesse
> So
> I decided to take a closer look at the source code. I then saw that
> the existing task scheduler merely uses a single global job queue and
> does a lot of locking and serialization, it forks a whole new thread
> for each goroutine, doesn't create additional worker threads when
> there are ones which are blocking, doesn't do load-balancing, among
> other dastardly things.
We plan to improve the scheduler, but your description isn't wholly
accurate. The scheduler does not fork a new thread for each goroutine.
It multiplexes goroutines onto a pool of threads. GOMAXPROCS sets the
number of goroutines which will run simultaneously on threads, not
counting goroutines waiting on a system call or running C code.
> Now, the way we've done it doesn't really fix too much of the overall
> problem, because of the way the current scheduler is implemented. It
> has no way of ensuring that, say, all 8 cores on an 8-core machine are
> working all of the time, because some of the goroutines on those
> threads may block for a time, and once the scheduler hits
> runtime·gomaxprocs, it stops creating new workers. So I can see
> GOMAXPROCS actually being assigned a much larger value in practice to
> ensure over-subscription of workers to cores.
Goroutines which are blocked don't count against GOMAXPROCS.
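A toy demonstration of that point (mine, not from the thread): with
GOMAXPROCS forced to 1, a large number of goroutines parked on a channel
doesn't stop another goroutine from being scheduled.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1) // at most one goroutine executing at a time

    // Park many goroutines on a channel that nothing ever sends on.
    block := make(chan struct{})
    for i := 0; i < 1000; i++ {
        go func() { <-block }()
    }

    // The parked goroutines don't occupy the single execution slot,
    // so this goroutine still runs and the program finishes.
    done := make(chan string)
    go func() { done <- "still making progress" }()
    fmt.Println(<-done)
}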
> In the interest of not duplicating any work, I'd be interested in
> helping out on this effort if at all possible. I'm not sure where
> you're at with things, Dmitriy, but perhaps I could continue with
> implementing machine CPU and NUMA node configuration/detection on
> various platforms if that's not yet complete, if that makes sense?
I'm not sure what Dmitriy is going to work on, but I think that would be
helpful in any case. Thanks.
Ian
> So I decided to take a closer look at the source code. I then saw
> that the existing task scheduler merely uses a single global job queue
> and does a lot of locking and serialization, it forks a whole new
> thread for each goroutine
Ian pointed out that this was wrong.
> doesn't create additional worker
> threads when there are ones which are blocking,
This is wrong too.
Russ
It is important to note that the goal here is
not to write an operating system.
Operating systems have a long history of tweaking the scheduling
routines to handle lots of concurrent processes. I'm guessing Martin's
comment is that yes, you can do a lot for handling scheduling, but maybe
you should let your OS do its job as much as possible.
John
=:->
> Operating systems have a long history of tweaking the scheduling
> routines to handle lots of concurrent processes. I'm guessing Martin's
> comment is that yes, you can do a lot for handling scheduling, but
> maybe you should let your OS do its job as much as possible.
It's easy to let these things get arbitrarily complicated,
and you wake up one day and runtime/proc.c has turned
into its own very complex operating system that runs on
top of another very complex operating system.
One of my goals is to keep that from happening,
to keep the Go implementation working well with the
host operating system instead of trying to replace it.
Russ
My ultimate goal is to write a project report describing how Google Go
works as a real-time programming language. The project is due at the end
of December.
> Go is not a real-time programming language. End of report.
And Java isn't a real-time programming language?
Is C a real-time programming language?
> The project is due at
> the end of December.
>
>
> Nearly 3 months saved.
Project failed, though, due to lack of content. Remember the OP is a
student on a course, not a contract programmer trying to complete a
fixed-price contract.
--
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel...@ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: rus...@russel.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
That's true, Russel. If Google Go is worse suited for RT programming
than, say, C, Java and Ada, I would still need to explain why that is
the case. Even I understand that using a language as young as Go in an
industrial hard RT-computing case is not a good idea. But as an
academic, I can still study its potential :)
> --
> Russel.
-- Sindre
...
>> Project failed, though, due to lack of content. Remember the OP is a
>> student on a course, not a contract programmer trying to complete
>> a fixed-price contract.
>>
>
> That's true, Russel. If Google Go is worse suited for RT programming
> than, say, C, Java and Ada, I would still need to explain why that
> is the case. Even I understand that using a language as young as Go
> in an industrial hard RT-computing case is not a good idea. But as
> an academic, I can still study its potential :)
>
At a minimum, you have a garbage-collected language, which makes it
hard to be hard RT. Java has a similar problem. It is possible to
switch out which collector runs, so there are certainly a lot of
possibilities.
There is a pretty interesting series about the "Recycler", using
reference counting with lazy evaluation and generations to enable
minimal interruptions:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.1065
I'm not sure how much of it applies to a language without a VM and with
cooperative coroutines. They could be made preemptive, but really, how
much do you have to change before it is good enough? And how much can
you still call the final thing "Go"?
I wouldn't say it is "worse" than Java at a theory level, but there has
certainly been a lot more *implementation*-level effort trying to make
Java work better on RT systems.
John
=:->
Hi, Dmitry. Do you have any details on this design lying around?
On Jul 22, 2:58 pm, Dmitry Vyukov <dvyu...@google.com> wrote:
> As for my progress, I had studied the current implementation,
> investigated all the requirements, worked out a design, but had
> written no code.
Or could you share some insight on how exactly the current Google Go
scheduler works, maybe a bit deeper than what can be easily extracted
from the source code comments?
Hi again, Dmitry.

2011/9/4 Dmitry Vyukov <dvy...@google.com>:
> On Sat, Sep 3, 2011 at 10:00 PM, Sindre Myren <smy...@gmail.com> wrote:
>> Hi, Dmitry. Do you have any details on this design lying around?
>>
>> On Jul 22, 2:58 pm, Dmitry Vyukov <dvyu...@google.com> wrote:
>>> As for my progress, I had studied the current implementation,
>>> investigated all the requirements, worked out a design, but had
>>> written no code.
>
> No, I don't.
>
>> Or could you share some insight on how exactly the current Google Go
>> scheduler works, maybe a bit deeper than what can be easily extracted
>> from the source code comments?
>
> I believe the source code is the best documentation. However, I am
> ready to validate your understanding and/or answer some specific
> questions.

I hope you still have the time to do this, and thank you in advance for
your help -- if somebody else wants to answer, that is cool too :-)

The following explanation of how the scheduler works is not purely based
on my understanding of the code, but also partly on assumptions and
things I have randomly read on the mailing list - what is correct, and
what is just totally wrong? To make it simple, you could just write C
(correct), W (wrong) or D (don't know) after each sentence. If you feel
like explaining something in detail though, please do.

"""
\section{The Go Scheduler}
The comments in the Go scheduler code (src/runtime/proc.c) talk about Ms
and Gs. Each G is a goroutine, and each M is an OS thread. The number of
Ms is limited by the environment variable GOMAXPROCS, or by the
GOMAXPROCS(int) function. If the number of Gs is less than GOMAXPROCS,
then the number of Ms will be equal to the number of Gs (unless the
number of Ms has previously been higher - i.e. Ms are not terminated
when the number of Gs declines). The default of GOMAXPROCS is 1.
Usually, you would want to set GOMAXPROCS to the number of CPU cores.
The Go scheduler schedules ready-to-run goroutines onto free Ms. The
ready-to-run Gs are scheduled according to the simplest algorithm there
is: cyclic execution. A goroutine is requeued only when it can no longer
continue to run - e.g. when it hits a synchronizing function (channels,
mutex.wait, etc.), a sleep function or a system call. There is no
timeout after which goroutines are requeued.
"""

Questions:

1. What happens if you set GOMAXPROCS to something much larger than the
number of cores? E.g. to 100. Given you have 100 goroutines, would you
actually get 100 OS threads (Ms)? And is this a stupid thing to do?

2. Where would one modify the code if one wanted to make the scheduler
round-robin? (I.e. add a timeout for how long a G could run on one M.)

3. If one wanted the scheduler to do something completely different,
e.g. use priorities and some fancy scheduling algorithm - where would
one want to make these changes?

4. Is there some other file(s) in http://golang.org/src/pkg/runtime/
besides proc.c that I should be looking at?
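As a small aside on the "requeued only when it can no longer continue to
run" sentence above and on question 2 (a toy illustration of mine, not
an answer): channel operations are exactly the kind of blocking points
at which the scheduler gets a chance to switch goroutines, even with
GOMAXPROCS set to 1.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1)

    ping := make(chan int)
    done := make(chan struct{})

    go func() {
        for v := range ping {
            // Receiving blocks, which is what lets main run in between.
            fmt.Println("worker got", v)
        }
        close(done)
    }()

    for i := 0; i < 3; i++ {
        ping <- i // sending blocks too: another point where a switch can happen
    }
    close(ping)
    <-done
}

With a purely cooperative scheduler like the one described, a tight loop
that never blocks would keep its M until it finished, which is the
behaviour a round-robin timeout would change.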
On Tue, Sep 27, 2011 at 10:11 PM, Alexey Borzenkov <sna...@gmail.com> wrote:
> On Sep 27, 2011 9:08 PM, "Dmitry Vyukov" <dvy...@google.com> wrote:
>>> 2 Where would one modify the code if one was to make the scheduler
>>> Round-robin?
>>> (I.e. add a timeout for how long a G could run on one M).
>> I think the compiler must emit periodic checks (whether the G has
>> exhausted its timeslice), so the main place to modify is the compiler.
> I don't think it's a job for the compiler. What I think should be done is
> for a thread (if my memory serves correctly there is some additional thread
> for something, otherwise there needs to be) to periodically check queues for
> stalling, by checking e.g. last scheduling timestamp. If there's no
> scheduling for some time, it should effectively increase gomaxprocs.
> Compiler checking would need checkpoints, and obvious checkpoints like
> function prologue would not only be wasteful, but miss tight busy loops
> without any function calls. What do you think about such a watchdog thread?
Or maybe not. Turns out there's no thread wasting, and creating a
thread just for the watchdog would not be easy, would be hard to make
it anything but a sticking out fifth leg with a hit on performance. At
least on unix one could probably use SIGALRM for making these checks,
except user code might want to use them too. Bummer.
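For completeness, the watchdog pattern being discussed can be sketched
at the user level roughly like this (purely illustrative, not a runtime
proposal; the noteProgress name and the "bump GOMAXPROCS" reaction are
placeholders I made up): a ticker goroutine watches a last-progress
timestamp and reacts when it hasn't moved for a while.

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
    "time"
)

var lastProgress atomic.Int64 // unix nanoseconds of the last observed progress

func noteProgress() { lastProgress.Store(time.Now().UnixNano()) }

// watchdog checks how long it has been since noteProgress was called and,
// if things look stalled, raises GOMAXPROCS a little (a crude stand-in for
// "effectively increase gomaxprocs" from the quoted message).
func watchdog(interval, stallAfter time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        idle := time.Since(time.Unix(0, lastProgress.Load()))
        if idle > stallAfter {
            n := runtime.GOMAXPROCS(0)
            runtime.GOMAXPROCS(n + 1)
            fmt.Println("watchdog: stalled for", idle, "- GOMAXPROCS now", n+1)
        }
    }
}

func main() {
    noteProgress()
    go watchdog(50*time.Millisecond, 200*time.Millisecond)

    // Pretend to do work, reporting progress only for the first half.
    for i := 0; i < 20; i++ {
        time.Sleep(50 * time.Millisecond)
        if i < 10 {
            noteProgress()
        }
    }
}

Note this only works around a stalled worker by adding capacity; it
never preempts the goroutine that is hogging its thread, which is the
limitation the compiler-inserted checks were meant to address.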
In the Plan 9 kernel each processor has an associated Mach
structure at a specific address. This structure contains,
among other things, a queue of readied procs and a pointer
to the currently executing proc. When the kernel wants to
refer to "this processor's Mach structure", it indirects
through a per-cpu global named "m".
This is where the name comes from.
For the curious, the Go runtime and the Plan 9 kernel use
the same mechanism for "per-cpu" variables. The C compilers
implement a special storage class (on machines with enough
registers) known as "extern register".
Cheers,
Anthony