scheduler for go: doubts in design approach


anshul makkar

Sep 25, 2017, 5:32:18 PM
to golang-dev

Hi Dmitry,


Sorry, I am new to this group. I am a contributor to the Xen scheduler and recently got exposed to Go and became interested in its scheduler. I was going through the Go scheduler design doc and found its design quite different from Xen's: https://docs.google.com/document/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub


I have a few doubts, which I am posting below; could anyone clarify these?


"P’s (logical processors) are strictly partitioned between NUMA nodes. That is, if we want to start a P bound to NODE0, we need to get an M from NODE0 M pool. Each P has own local RunQ as now. Global RunQ and pool of idle G descriptors become partitioned between NUMA nodes."


---> Why is there a need for a runq per P? Couldn't there be a runq per group of Ps in a node, or per core?


"ensure P <-> M affinity, lastm field is added to P. When Truntime wants to start a P it first tries to get p->lastm M to run it."


---> I take this to mean that if a P has to run, it will check the last M where it executed; if that lastM is not idle, it will look for other Ms. (But from where? Does it pick any random M? Do we define a set of Ms where a P can run, or can it pick any one?)


"What's left is scheduling policy. Current scheduling policy is very simple: new and unblocked goroutines go to local RunQ; network poller injects work into global queue; work stealing is completely random; GC reshuffles goroutines to balance work."


---> Do you mean work is picked from a global queue instead of the independent runqs? (I am a bit confused; please clarify.)


I am looking at https://github.com/golang/go/blob/e97209515ad8c4042f5a3ef32068200366892fc2/src/runtime/proc.go for scheduler implementation.


Also, could you please let me know if the code is open for public contribution?


Thanks

Anshul Makkar

www.justkernel.com

Ian Lance Taylor

Sep 25, 2017, 6:54:52 PM
to anshul makkar, golang-dev
On Mon, Sep 25, 2017 at 11:53 AM, anshul makkar <anshul...@gmail.com> wrote:
>
> Sorry, I am new to this group. I am contributor to Xen Scheduler and
> recently got exposed to golang and became interested in its scheduler. I was
> going through the go scheduler design doc and and found its design quite
> different from Xen's.
> https://docs.google.com/document/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub.

Note that that was a proposal by Dmitry that, as far as I know, was never implemented.



> "P’s (logical processors) are strictly partitioned between NUMA nodes. That
> is, if we want to start a P bound to NODE0, we need to get an M from NODE0 M
> pool. Each P has own local RunQ as now. Global RunQ and pool of idle G
> descriptors become partitioned between NUMA nodes."
>
>
> ---> Why there is need of runq per P. There can be Runq per groups of Ps in
> a node, or Ps per core ?

I don't think there is a need for a runq per P.  As the doc says, that
is how it is currently implemented. The advantage of a runq per P is
that the P can switch between those goroutines without any locking.

I don't know what Dmitry was thinking regarding your other questions.

Ian

anshul makkar

Sep 27, 2017, 6:25:57 PM
to golang-dev

> "P’s (logical processors) are strictly partitioned between NUMA nodes. That
> is, if we want to start a P bound to NODE0, we need to get an M from NODE0 M
> pool. Each P has own local RunQ as now. Global RunQ and pool of idle G
> descriptors become partitioned between NUMA nodes."
>
>
> ---> Why there is need of runq per P. There can be Runq per groups of Ps in
> a node, or Ps per core ?

> I don't think there is a need for a runq per P.  As the doc says, that
> is how it is currently implemented.  The advantage of a runq per P is
> that the P can switch between those goroutines without any locking.

Yeah, from that perspective it's good to have a runq per P, but it's really overkill that can easily be optimized.

> I don't know what Dmitry was thinking regarding your other questions.

Dmitry's thoughts would be useful.

> Ian

Anshul

Dmitry Vyukov

Sep 30, 2017, 8:49:16 AM
to golang-dev
On Monday, September 25, 2017 at 11:32:18 PM UTC+2, anshul makkar wrote:

> Hi Dmitry,



Hi,

I found this by pure accident (you did not add me to CC).

 

> Sorry, I am new to this group. I am a contributor to the Xen scheduler and recently got exposed to Go and became interested in its scheduler. I was going through the Go scheduler design doc and found its design quite different from Xen's: https://docs.google.com/document/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub
>
> I have a few doubts, which I am posting below; could anyone clarify these?
>
> "P’s (logical processors) are strictly partitioned between NUMA nodes. That is, if we want to start a P bound to NODE0, we need to get an M from NODE0 M pool. Each P has own local RunQ as now. Global RunQ and pool of idle G descriptors become partitioned between NUMA nodes."
>
> ---> Why is there a need for a runq per P? Couldn't there be a runq per group of Ps in a node, or per core?




A local runq makes some operations cheaper, e.g. a push can be done with only a release-store. It also provides better locality, as work in the P-local queue is likely cached in L1.
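
For illustration, here is a minimal sketch of why the owner's push onto a per-P run queue is so cheap: with a single producer, publishing a goroutine only needs one atomic store of the tail. The types are hypothetical (the real code is runtime.runqput), and sync/atomic here is stronger than the runtime's internal release-store.

package main

import "sync/atomic"

// g is a stand-in for a goroutine descriptor.
type g struct{}

// runq is a fixed-size, single-producer ring: only the owning P writes tail,
// while stealers advance head with atomic operations.
type runq struct {
	head uint32
	tail uint32
	buf  [256]*g
}

// push is called only by the P that owns q, so it needs no lock: it fills a
// slot and publishes it with a single atomic store of the tail.
func (q *runq) push(gp *g) bool {
	h := atomic.LoadUint32(&q.head)
	t := q.tail
	if t-h >= uint32(len(q.buf)) {
		return false // full; the runtime would overflow to the global runq
	}
	q.buf[t%uint32(len(q.buf))] = gp
	atomic.StoreUint32(&q.tail, t+1) // a release-store in the real runtime
	return true
}

func main() {
	var q runq
	q.push(&g{})
}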


 


"ensure P <-> M affinity, lastm field is added to P. When Truntime wants to start a P it first tries to get p->lastm M to run it."




> ---> I take this to mean that if a P has to run, it will check the last M where it executed; if that lastM is not idle, it will look for other Ms.


Yes.

 
> (But from where? Does it pick any random M?

From sched.midle list.

 
> Do we define a set of Ms where a P can run, or can it pick any one?)



Currently M picks any P.
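
Roughly, the proposed mechanism could look like the sketch below (hypothetical types, not the runtime's actual structures): prefer p->lastm, and otherwise fall back to the idle-M list (sched.midle in the runtime), where today the choice carries no locality.

package main

import "fmt"

type mach struct {
	id   int
	idle bool
}

type pdesc struct {
	lastm *mach // the M this P last ran on, per the proposal
}

// midle is a stand-in for sched.midle, the runtime's list of idle Ms.
var midle []*mach

// acquireM returns an M to run pp on, preferring the M it last ran on.
func acquireM(pp *pdesc) *mach {
	if pp.lastm != nil && pp.lastm.idle {
		pp.lastm.idle = false
		return pp.lastm // warm caches, likely the same node as last time
	}
	// Otherwise take any idle M.
	if n := len(midle); n > 0 {
		mm := midle[n-1]
		midle = midle[:n-1]
		mm.idle = false
		pp.lastm = mm
		return mm
	}
	return nil // the runtime would create a new M here
}

func main() {
	midle = append(midle, &mach{id: 0, idle: true})
	fmt.Println("picked M", acquireM(&pdesc{}).id)
}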
 



 

"What's left is scheduling policy. Current scheduling policy is very simple: new and unblocked goroutines go to local RunQ; network poller injects work into global queue; work stealing is completely random; GC reshuffles goroutines to balance work."




> ---> Do you mean work is picked from a global queue instead of the independent runqs? (I am a bit confused; please clarify.)


I mean that when goroutines are added to or removed from run queues, we can choose the queue based on some notion of locality, e.g. put a G onto a non-local queue, or steal from a hyperthread sibling first instead of from a random P.
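
As a toy illustration of such a locality-aware policy (this is not runtime code, and the topology fields are assumed to be known somehow), steal victims could be ordered hyperthread sibling first, then same NUMA node, then the rest:

package main

import "fmt"

// proc is a toy P descriptor; sibling and node are assumed known.
type proc struct {
	id      int
	sibling int   // id of the hyperthread-sibling P
	node    int   // NUMA node this P is bound to
	runq    []int // stand-in for the local run queue
}

// stealOrder returns victim Ps for "me", most local first.
func stealOrder(me *proc, all []*proc) []*proc {
	var sibling, sameNode, remote []*proc
	for _, v := range all {
		switch {
		case v.id == me.id:
			// skip self
		case v.id == me.sibling:
			sibling = append(sibling, v)
		case v.node == me.node:
			sameNode = append(sameNode, v)
		default:
			remote = append(remote, v)
		}
	}
	return append(append(sibling, sameNode...), remote...)
}

func main() {
	ps := []*proc{
		{id: 0, sibling: 1, node: 0},
		{id: 1, sibling: 0, node: 0, runq: []int{7}},
		{id: 2, sibling: 3, node: 1},
		{id: 3, sibling: 2, node: 1},
	}
	for _, v := range stealOrder(ps[0], ps) {
		fmt.Println("would try P", v.id, "holding", len(v.runq), "goroutines")
	}
}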
> I am looking at https://github.com/golang/go/blob/e97209515ad8c4042f5a3ef32068200366892fc2/src/runtime/proc.go for scheduler implementation.


This is the correct place.


> Also, could you please let me know if the code is open for public contribution?


Go is fully open, see:
https://golang.org/doc/contribute.html


anshul makkar

Oct 2, 2017, 6:48:19 PM
to Dmitry Vyukov, golang-dev
On Sat, Sep 30, 2017 at 1:49 PM, 'Dmitry Vyukov' via golang-dev
<golan...@googlegroups.com> wrote:
>> On Monday, September 25, 2017 at 11:32:18 PM UTC+2, anshul makkar wrote:
>>
>> Hi Dmitry,
>>
>>
>
> Hi,
>
> I found this by pure accident (you did not add me to CC).
>
>
>>
>>
>>
>> "P’s (logical processors) are strictly partitioned between NUMA nodes.
>> That is, if we want to start a P bound to NODE0, we need to get an M from
>> NODE0 M pool. Each P has own local RunQ as now. Global RunQ and pool of idle
>> G descriptors become partitioned between NUMA nodes."
>>
>>
>>
>>
>> ---> Why there is need of runq per P. There can be Runq per groups of Ps
>> in a node, or Ps per core ?
>>
>>
>
>
> A local runq makes some operations cheaper, e.g. push can be done with only
> a release-store. This also provides better locality as work in P-local queue
> is likely cached in L1.
>
It's not always performance-efficient to have a runq per P. Depending on
the workload, a runq per core, per node, or per CPU (logical processor)
can give better performance. We can discuss this in more detail.
>
>
>>
>>
>>
>> "ensure P <-> M affinity, lastm field is added to P. When Truntime wants
>> to start a P it first tries to get p->lastm M to run it."
>>
>>
>>
>>
>> ---> I hope here you mean that if P has to run then it will check the last
>> M where it executed, if lastM is not idle it will look for other Ms.
>
>
>
> Yes.
>
>
As the Go scheduler uses only a 1:1 correspondence between P and M, there
is a slight disadvantage here in that the options are limited. If the
lastM is not available, then we look for other Ms. Based on the code, I am
not sure how many CPU cycles are wasted here. If we could have a concept
of affinity of a P to multiple Ms (maybe Ms in the same core or node, or
sharing the same L1 cache; that can be decided), then to run a P we would
have multiple options (a simple bitmask comparison would be sufficient).
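
For illustration, a sketch of that bitmask check might look like this (hypothetical fields; nothing like this exists in the runtime today):

package main

import (
	"fmt"
	"math/bits"
)

// pDesc carries a hypothetical affinity mask: bit i set means M i is an
// acceptable host for this P (same core, node, or shared L1, as decided).
type pDesc struct {
	affinity uint64
}

// pickM returns the index of an idle M this P has affinity with, or -1.
// The whole decision is one AND plus a find-first-set.
func pickM(p *pDesc, idleMs uint64) int {
	candidates := p.affinity & idleMs
	if candidates == 0 {
		return -1 // fall back to any idle M, as today
	}
	return bits.TrailingZeros64(candidates)
}

func main() {
	p := &pDesc{affinity: 0b0110} // Ms 1 and 2 are acceptable
	fmt.Println(pickM(p, 0b1100)) // M 2 is idle and acceptable -> prints 2
}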
>>
>> (but from where, does it pick any random M.
>
>
> From sched.midle list.
>
>
>>
>> Do we define any set of Ms where P can run or it can pick any P).
>>
>
>
> Currently M picks any P.
>
>
>
>
>
>>
>>
>> "What's left is scheduling policy. Current scheduling policy is very
>> simple: new and unblocked goroutines go to local RunQ; network poller
>> injects work into global queue; work stealing is completely random; GC
>> reshuffles goroutines to balance work."
>>
>>
Again, this is something that can give quite skewed results depending upon
the workload. It's better to have a policy where different coefficients
and their weights are considered and then the best scheduling decision is
made. Again, we can discuss this in more detail.
>>
>>
>> ---> Do you mean work is picked form a global queue instead of independent
>> Runqs. (Bit confused please, can you clarify).
>
>
>
> I mean that when goroutines are added/removed to/from run queues, we can
> choose queue based on some notion of locality. E.g. put G to a non-local
> queue, or steal from hyperthread sibling first instead of random.
>
>
>>
>>
>> I am looking at
>> https://github.com/golang/go/blob/e97209515ad8c4042f5a3ef32068200366892fc2/src/runtime/proc.go
>> for scheduler implementation.
>>
>
>
> This is the correct place.
>
>>
>>
>>
>> Also, please can you let me know if the code is open for public
>> contribution.
>>
>>
>
> Go is fully open, see:
> https://golang.org/doc/contribute.html
>
Thanks for clarifying.

Yeah, as hinted above, there are various areas where I would like to
contribute and start a design discussion. Let me see how I can manage it
with my current work schedule.

Thanks
Anshul Makkar


Dmitry Vyukov

Oct 10, 2017, 8:40:56 AM
to anshul makkar, golang-dev
On Tue, Oct 3, 2017 at 12:48 AM, anshul makkar <anshul...@gmail.com> wrote:
> On Sat, Sep 30, 2017 at 1:49 PM, 'Dmitry Vyukov' via golang-dev
> <golan...@googlegroups.com> wrote:
>>> On Monday, September 25, 2017 at 11:32:18 PM UTC+2, anshul makkar wrote:
>>>
>>> Hi Dmitry,
>>>
>>>
>>
>> Hi,
>>
>> I found this by pure accident (you did not add me to CC).
>>
>>
>>>
>>>
>>>
>>> "P’s (logical processors) are strictly partitioned between NUMA nodes.
>>> That is, if we want to start a P bound to NODE0, we need to get an M from
>>> NODE0 M pool. Each P has own local RunQ as now. Global RunQ and pool of idle
>>> G descriptors become partitioned between NUMA nodes."
>>>
>>>
>>>
>>>
>>> ---> Why there is need of runq per P. There can be Runq per groups of Ps
>>> in a node, or Ps per core ?
>>>
>>>
>>
>>
>> A local runq makes some operations cheaper, e.g. push can be done with only
>> a release-store. This also provides better locality as work in P-local queue
>> is likely cached in L1.
>>
> Its not always performance efficient to have runq per P. Depending on
> the workload, runq per core
> or runq per node or runq per cpu (logical processor) can give better
> performance. We
> can discuss this in more details.


Yes. But the problem is that we don't know the workload. In lots of
cases it's also certainly not optimal to do FIFO scheduling and LIFO
would do much better.


>>> "ensure P <-> M affinity, lastm field is added to P. When Truntime wants
>>> to start a P it first tries to get p->lastm M to run it."
>>>
>>>
>>>
>>>
>>> ---> I hope here you mean that if P has to run then it will check the last
>>> M where it executed, if lastM is not idle it will look for other Ms.
>>
>>
>>
>> Yes.
>>
>>
> As go scheduler uses only 1:1 correspondence between P and M, here
> there is a bit disadvantage as
> the options are limited.If the lastM is not available, then we look
> for other Ms. Based on the code, I am not
> sure how many cpucycles are getting wasted here. If we can have a
> concept of affinity of P to
> multiple Ms, (may be Ms are in same core or node or share same L1
> cache, that can be decided), then
> to run P we will have multiple options (a simple bitmask comparison
> will be sufficient).


Yes, sure we can have more Ms associated with a P.
But I think the main problem overall is thread affinity. If we don't
do thread affinity, then we have no notion of actual/physical locality
(e.g. we can have 2 Ms associated with a P, but the OS actually decides
to run these Ms on different NUMA nodes). If we do thread affinity,
then we take over OS global load balancing, but we don't have enough
information to do it properly (we don't know what the load of the CPUs
is, what other processes are running, whether they also have affinity
or not, etc.).
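
For illustration only, this is roughly what pinning the current OS thread (an M) to a set of CPUs would look like on Linux via golang.org/x/sys/unix; the runtime does not do this today, and doing it means taking over balancing from the OS:

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Make sure the affinity call applies to a known OS thread (an M).
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	var set unix.CPUSet
	set.Zero()
	set.Set(0) // e.g. CPUs belonging to NODE0
	set.Set(1)

	// pid 0 means the calling thread.
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Println("SchedSetaffinity failed:", err)
		return
	}
	fmt.Println("thread pinned to CPUs 0-1")
}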



>>> (but from where, does it pick any random M.
>>
>>
>> From sched.midle list.
>>
>>
>>>
>>> Do we define any set of Ms where P can run or it can pick any P).
>>>
>>
>>
>> Currently M picks any P.
>>
>>
>>
>>
>>
>>>
>>>
>>> "What's left is scheduling policy. Current scheduling policy is very
>>> simple: new and unblocked goroutines go to local RunQ; network poller
>>> injects work into global queue; work stealing is completely random; GC
>>> reshuffles goroutines to balance work."
>>>
>>>
> Again this is something that can give quite skewed results depending upon
> the workload. Its better to have a policy where different coefficients
> and their weights are considered
> and then the best scheduling decisions are made. Again, we can discuss
> this in more
> details.


Agreed. But it's quite difficult to choose these policies automatically.
For informed decisions we would need to collect more info, but the gains
may not pay back the cost of the collection in the end.

anshul makkar

Oct 12, 2017, 12:57:40 PM
to Dmitry Vyukov, golang-dev
On Tue, Oct 10, 2017 at 1:40 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Tue, Oct 3, 2017 at 12:48 AM, anshul makkar <anshul...@gmail.com> wrote:
>> On Sat, Sep 30, 2017 at 1:49 PM, 'Dmitry Vyukov' via golang-dev
>> <golan...@googlegroups.com> wrote:
>>>> On Monday, September 25, 2017 at 11:32:18 PM UTC+2, anshul makkar wrote:
>>>>
>>>> Hi Dmitry,
>>>>
>>>>
>>>
>>> Hi,
>>>
>>> I found this by pure accident (you did not add me to CC).
>>>
>>>
>> Its not always performance efficient to have runq per P. Depending on
>> the workload, runq per core
>> or runq per node or runq per cpu (logical processor) can give better
>> performance. We
>> can discuss this in more details.
>
>
> Yes. But the problem is that we don't know the workload. In lots of
> cases it's also certainly not optimal to do FIFO scheduling and LIFO
> would do much better.
>
You are right. But at least we can provide different configurations
suitable for different workloads, so that at least a user who is aware of
his use case can run the scheduler with a suitable default configuration.

On a side note, I am working on implementing a neural network which will
adapt the scheduler configuration based on the workload. But it's still in
its nascent stages.


>>>
>>> Yes.
>>>
>>>
>> As go scheduler uses only 1:1 correspondence between P and M, here
>> there is a bit disadvantage as
>> the options are limited.If the lastM is not available, then we look
>> for other Ms. Based on the code, I am not
>> sure how many cpucycles are getting wasted here. If we can have a
>> concept of affinity of P to
>> multiple Ms, (may be Ms are in same core or node or share same L1
>> cache, that can be decided), then
>> to run P we will have multiple options (a simple bitmask comparison
>> will be sufficient).
>
>
> Yes, sure we can have more M's associated with a P.
> But I think the main problem overall is thread affinity. If as don't
> do thread affinity, then we have no notion of actual/physical locality
> (e.g. we can have 2 M's associated with a P, but OS actually decides
> to run these Ms on different NUMA nodes). If we do thread affinity,
> then we take over OS global load balancing but we don't have enough
> information to do it properly (don't know what's the load of CPUs and
> what other processes are running and if they also have affinity or
> not, etc).
>
Hmm, I get your point. I was thinking from the point of view of Xen, where
the scheduler runs on bare metal and there is no OS in between.
>
>>>> (but from where, does it pick any random M.
>>>

>>>>
>>>> "What's left is scheduling policy. Current scheduling policy is very
>>>> simple: new and unblocked goroutines go to local RunQ; network poller
>>>> injects work into global queue; work stealing is completely random; GC
>>>> reshuffles goroutines to balance work."
>>>>
>>>>
>> Again this is something that can give quite skewed results depending upon
>> the workload. Its better to have a policy where different coefficients
>> and their weights are considered
>> and then the best scheduling decisions are made. Again, we can discuss
>> this in more
>> details.
>
>
> Agree. But it's quite difficult to chose these policies automatically.
> For informed decisions we would need to collect more info, but we may
> not pack back for the cost of the collection in the end.

Hmm, again my thought process is from the point of view of a scheduler
that is the main scheduler in the system and that even schedules the OS
threads. There we have all the information regarding the workload on each
runq and the processing time of each thread, and based on this information
we have implemented a generic algorithm to get maximum performance from
the system.

I am not sure how much low-level information about the physical CPU we can
retrieve at the Go level. That is something I have no idea about right now
and need to explore.
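
For example, the standard library readily exposes only coarse information (a quick sketch of what I mean); NUMA and cache topology are not exposed and would need OS-specific sources such as /sys/devices/system/node on Linux, or cgo:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("logical CPUs visible to the process:", runtime.NumCPU())
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("OS/arch:", runtime.GOOS, runtime.GOARCH)
}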

Thanks
Anshul Makkar

Dmitry Vyukov

Oct 12, 2017, 1:03:30 PM
to anshul makkar, golang-dev
On Thu, Oct 12, 2017 at 6:57 PM, anshul makkar <anshul...@gmail.com> wrote:
> On Tue, Oct 10, 2017 at 1:40 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>> On Tue, Oct 3, 2017 at 12:48 AM, anshul makkar <anshul...@gmail.com> wrote:
>>> On Sat, Sep 30, 2017 at 1:49 PM, 'Dmitry Vyukov' via golang-dev
>>> <golan...@googlegroups.com> wrote:
>>>>> On Monday, September 25, 2017 at 11:32:18 PM UTC+2, anshul makkar wrote:
>>>>>
>>>>> Hi Dmitry,
>>>>>
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> I found this by pure accident (you did not add me to CC).
>>>>
>>>>
>>> Its not always performance efficient to have runq per P. Depending on
>>> the workload, runq per core
>>> or runq per node or runq per cpu (logical processor) can give better
>>> performance. We
>>> can discuss this in more details.
>>
>>
>> Yes. But the problem is that we don't know the workload. In lots of
>> cases it's also certainly not optimal to do FIFO scheduling and LIFO
>> would do much better.
>>
> You are right. But atleast we can provide different configuration
> suitable for different workload so that atleast user, who might be
> aware of his use case, can run scheduler with default configuration.


Go tries to avoid knobs as much as possible. Manual tuning is difficult,
time-consuming and tends to become outdated as the workload changes.
That's definitely not the case for the majority of Go uses. Go programs
need to cohabit with whatever other workload is running on the machine.