On Thu, Sep 8, 2016 at 4:21 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Thu, Sep 8, 2016 at 10:40 AM, Bulatova, Maria <maria.b...@intel.com> wrote:
>> Hi!
>>
>> I am trying to implement some of your suggestions about the NUMA-aware
>> scheduler, and I want to understand how best to evaluate runtime
>> performance after such changes. Currently we run benchmarks (Debian
>> shootout and some synthetic benches with lots of memory accesses) and
>> measure their completion time.
>>
>> Looking for a better way, I found http://golang.org/cl/21503 - could you
>> please explain in more detail how the measurements were done there?
>
>
> Hello Maria,
>
> If you mean the histograms in http://golang.org/cl/21503: I added manual
> instrumentation to the runtime that printed latency to the console and then
> used some standard linux utility to turn that into histograms (I don't
> remember the name; it was something I found with apt-cache search).
>
> The other results are obtained on the HTTP benchmark from
> https://github.com/golang/benchmarks
> Results are processed with the benchstat utility
> (https://godoc.org/rsc.io/benchstat).
>
>
> Do we have any good benchmarks with large memory consumption for NUMA testing?
> I can think only of the garbage benchmark with the -benchmem=32768 flag:
> https://github.com/golang/benchmarks/blob/master/garbage/garbage.go
Maria,
We are talking about this proposal, right?
https://docs.google.com/document/u/1/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub
What part of the proposal are you implementing?
I still don't have a good answer for the following problem listed at the
end of the doc:
"Several processes can decide to schedule threads on the same NUMA
node. If each process has only one runnable goroutine, the NUMA node
will be over-subscribed, while other nodes will be idle. To partially
alleviate the problem, we can randomize node numbering within each
process. Then the starting NODE0 refers to different physical nodes
across [Go] processes".
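The node-numbering randomization quoted above could be sketched roughly as follows. This is only an illustration, not runtime code: numNodes, the seed source, and the nodeMap table are all assumptions for the sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// numNodes is a hypothetical NUMA node count; a real runtime would
// query the topology from the OS (e.g. /sys/devices/system/node on Linux).
const numNodes = 4

// nodeMap is a per-process random permutation: logical node i is backed
// by physical node nodeMap[i], so "NODE0" in different Go processes
// refers to different physical nodes.
var nodeMap [numNodes]int

func initNodeMap(seed int64) {
	r := rand.New(rand.NewSource(seed))
	for i, phys := range r.Perm(numNodes) {
		nodeMap[i] = phys
	}
}

func physicalNode(logical int) int { return nodeMap[logical] }

func main() {
	initNodeMap(42) // a real runtime would seed from e.g. PID or startup time
	fmt.Println("logical->physical node map:", nodeMap)
}
```

With a per-process seed, two processes that each schedule onto their "first" node tend to land on different physical nodes, which only partially mitigates the over-subscription problem described above.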
Hard binding of threads to hardware cores has a number of problems that
are unavoidable with current OS APIs. Besides the above issue:
- Consider that the OS or the user wants to conserve power by moving running
threads to a single package and shutting down the other one. If we do
hard binding, we will interfere with that.
- Some container management systems periodically reset the affinity masks
of processes. That will interfere with the runtime trying to control where
threads run. The container manager will win, so the runtime will need to
somehow adapt. That's pretty unpleasant from an implementation point of
view.
A better solution may be not to fight the operating system, but rather
to adapt to its decisions.
The general idea is to tie the mcache (stack cache, goroutine descriptor
cache, runqueue, etc.) to the physical CPU id rather than to the logical P
id. Namely, malloc uses the current CPU id to find the right mcache, while
GC threads query the current CPU id and start scanning node-local stacks.
If we have something like RSEQ support
(https://lwn.net/Articles/697979/) we could directly use
per-physical-CPU data structures. But it's not particularly portable
(I'm not sure whether RSEQ can be emulated with UMS on Windows).
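A toy sketch of the "key the mcache by physical CPU id" idea might look like this. Everything here is an assumption for illustration: the mcache struct is a stand-in for the runtime's real one, and getcpu is a hard-coded stub for a fast current-CPU query (sched_getcpu(3) on Linux, GetCurrentProcessorNumber on Windows).

```go
package main

import "fmt"

// mcache stands in for the runtime's per-thread allocation cache.
type mcache struct{ cpu int }

// mcacheForCPU is a hypothetical table of caches indexed by physical
// CPU id; the runtime would size it to the machine's CPU count and
// would need atomic access (real per-CPU access needs RSEQ or similar).
var mcacheForCPU [8]*mcache

// getcpu is a stub for a current-CPU query. Hard-coded here only so
// the sketch runs anywhere.
func getcpu() int { return 3 }

// mallocCache shows the lookup malloc would do: key the cache by where
// the thread is running right now, not by which logical P it holds.
func mallocCache() *mcache {
	cpu := getcpu()
	if mcacheForCPU[cpu] == nil {
		mcacheForCPU[cpu] = &mcache{cpu: cpu}
	}
	return mcacheForCPU[cpu]
}

func main() {
	c := mallocCache()
	fmt.Println("using mcache for CPU", c.cpu)
}
```

The sketch glosses over the hard part that RSEQ solves: without restartable sequences, a thread can be migrated between reading the CPU id and touching the per-CPU slot, so the access is only approximately node-local.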
Another option would be to still cache the mcache pointer in the P, but
periodically (say, when we start running a new goroutine) check whether we
are still on the same NUMA node. If we were moved to a different NUMA
node, release the old mcache and acquire a new one associated with the
current node. The only platform-specific part here is obtaining the
current CPU id, which is available at least on Linux and Windows.
However, the problem here is that we will need more than GOMAXPROCS
mcaches (a preempted/non-running thread can hold an mcache, and we
don't have a way to retake the cache from that thread).
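The second option, re-checking the node when a new goroutine starts running, could be sketched like this. The types, the uniform cpusPerNode topology, and the acquire/release pool stubs are all assumptions; the real runtime's P and mcache are of course richer.

```go
package main

import "fmt"

// cpusPerNode assumes a uniform topology for the sketch; a real
// implementation would read the CPU-to-node map from the OS.
const cpusPerNode = 4

type mcache struct{ node int }

// p mirrors the runtime's P with its cached mcache pointer plus a
// record of which NUMA node that cache belongs to.
type p struct {
	cache *mcache
	node  int
}

func nodeOf(cpu int) int { return cpu / cpusPerNode }

// acquireMcache/releaseMcache stand in for hypothetical per-node
// mcache pools.
func acquireMcache(node int) *mcache { return &mcache{node: node} }
func releaseMcache(c *mcache)        {}

// execute would run when a P starts a new goroutine: if the OS has
// migrated us to another node, trade the mcache for a node-local one.
func (pp *p) execute(currentCPU int) {
	if n := nodeOf(currentCPU); pp.cache == nil || n != pp.node {
		if pp.cache != nil {
			releaseMcache(pp.cache)
		}
		pp.cache = acquireMcache(n)
		pp.node = n
	}
}

func main() {
	var pp p
	pp.execute(1) // running on CPU 1 -> node 0
	fmt.Println("mcache node:", pp.node)
	pp.execute(6) // OS migrated us to CPU 6 -> node 1: swap caches
	fmt.Println("mcache node:", pp.node)
}
```

Note that this sketch sidesteps the GOMAXPROCS problem described above: a preempted thread still holds its mcache until it next runs a goroutine, so the per-node pools must hold more than GOMAXPROCS caches in total.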