On Thu, Sep 8, 2016 at 4:21 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Thu, Sep 8, 2016 at 10:40 AM, Bulatova, Maria <maria.b...@intel.com> wrote:
>> Hi!
>>
>> I am trying to implement some of your suggestions about the NUMA-aware
>> scheduler, and I want to understand how best to evaluate runtime
>> performance after such changes. Currently we run benchmarks (Debian
>> shootout and some synthetic benches with lots of memory accesses) and
>> measure their completion time.
>>
>> Looking for a better way, I found http://golang.org/cl/21503 - could you
>> please explain in more detail how the measurements were done there?
>
>
> Hello Maria,
>
> If you mean the histograms in http://golang.org/cl/21503: I added manual
> instrumentation to the runtime that printed latency to the console and then
> used some standard linux utility to turn that into histograms (I don't
> remember the name; it was something I found with apt-cache search).
>
> The other results are obtained on the HTTP benchmark from
> https://github.com/golang/benchmarks
> Results are processed with the benchstat utility
> (https://godoc.org/rsc.io/benchstat).
>
>
> Do we have any good benchmarks with large memory consumption for NUMA testing?
> I can think only of the garbage benchmark with the -benchmem=32768 flag:
> https://github.com/golang/benchmarks/blob/master/garbage/garbage.go
Maria,
We are talking about this proposal, right?
https://docs.google.com/document/u/1/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub
What part of the proposal are you implementing?
I still don't have a good answer for the following problem listed at the
end of the doc:
"Several processes can decide to schedule threads on the same NUMA
node. If each process has only one runnable goroutine, the NUMA node
will be over-subscribed, while other nodes will be idle. To partially
alleviate the problem, we can randomize node numbering within each
process. Then the starting NODE0 refers to different physical nodes
across [Go] processes".
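The node-numbering randomization quoted above could be sketched roughly as follows. This is only an illustration, not runtime code: numNodes, the seed source, and the nodeMap table are all assumptions for the sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// numNodes is a hypothetical NUMA node count; a real runtime would
// query the topology from the OS (e.g. /sys/devices/system/node on Linux).
const numNodes = 4

// nodeMap is a per-process random permutation: logical node i is backed
// by physical node nodeMap[i], so "NODE0" in different Go processes
// refers to different physical nodes.
var nodeMap [numNodes]int

func initNodeMap(seed int64) {
	r := rand.New(rand.NewSource(seed))
	for i, phys := range r.Perm(numNodes) {
		nodeMap[i] = phys
	}
}

func physicalNode(logical int) int { return nodeMap[logical] }

func main() {
	initNodeMap(42) // a real runtime would seed from e.g. PID or startup time
	fmt.Println("logical->physical node map:", nodeMap)
}
```

With a per-process seed, two processes that each schedule onto their "first" node tend to land on different physical nodes, which only partially mitigates the over-subscription problem described above.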
Hard binding of threads to hardware cores has a number of problems that
are unavoidable with current OS APIs. Besides the above issue:
- Consider that the OS or the user wants to conserve power by moving running
threads to a single package and shutting down the other one. If we do
hard binding, we will interfere with that.
- Some container management systems periodically reset the affinity masks
of processes. That will interfere with the runtime trying to control where
threads run. The container manager will win, so the runtime will need to
somehow adapt. That's pretty unpleasant from an implementation point of
view.
A better solution may be not to fight the operating system, but rather
to adapt to its decisions.
The general idea is to tie the mcache (stack cache, goroutine descriptor
cache, runqueue, etc.) to the physical CPU id rather than to the logical P
id. Namely, malloc uses the current CPU id to find the right mcache, while
GC threads query the current CPU id and start scanning node-local stacks.
If we have something like RSEQ support
(https://lwn.net/Articles/697979/) we could directly use
per-physical-CPU data structures. But it's not particularly portable
(I'm not sure whether RSEQ can be emulated with UMS on Windows).
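A toy sketch of the "key the mcache by physical CPU id" idea might look like this. Everything here is an assumption for illustration: the mcache struct is a stand-in for the runtime's real one, and getcpu is a hard-coded stub for a fast current-CPU query (sched_getcpu(3) on Linux, GetCurrentProcessorNumber on Windows).

```go
package main

import "fmt"

// mcache stands in for the runtime's per-thread allocation cache.
type mcache struct{ cpu int }

// mcacheForCPU is a hypothetical table of caches indexed by physical
// CPU id; the runtime would size it to the machine's CPU count and
// would need atomic access (real per-CPU access needs RSEQ or similar).
var mcacheForCPU [8]*mcache

// getcpu is a stub for a current-CPU query. Hard-coded here only so
// the sketch runs anywhere.
func getcpu() int { return 3 }

// mallocCache shows the lookup malloc would do: key the cache by where
// the thread is running right now, not by which logical P it holds.
func mallocCache() *mcache {
	cpu := getcpu()
	if mcacheForCPU[cpu] == nil {
		mcacheForCPU[cpu] = &mcache{cpu: cpu}
	}
	return mcacheForCPU[cpu]
}

func main() {
	c := mallocCache()
	fmt.Println("using mcache for CPU", c.cpu)
}
```

The sketch glosses over the hard part that RSEQ solves: without restartable sequences, a thread can be migrated between reading the CPU id and touching the per-CPU slot, so the access is only approximately node-local.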
Another option would be to still cache the mcache pointer in the P, but
periodically (say, when we start running a new goroutine) check whether we
are still on the same NUMA node. If we were moved to a different NUMA
node, release the old mcache and acquire a new one associated with the
current node. The only platform-specific part here is obtaining the
current CPU id, which is available at least on Linux and Windows.
However, the problem here is that we will need more than GOMAXPROCS
mcaches (a preempted/non-running thread can hold an mcache, and we
don't have a way to retake the cache from that thread).
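The second option, re-checking the node when a new goroutine starts running, could be sketched like this. The types, the uniform cpusPerNode topology, and the acquire/release pool stubs are all assumptions; the real runtime's P and mcache are of course richer.

```go
package main

import "fmt"

// cpusPerNode assumes a uniform topology for the sketch; a real
// implementation would read the CPU-to-node map from the OS.
const cpusPerNode = 4

type mcache struct{ node int }

// p mirrors the runtime's P with its cached mcache pointer plus a
// record of which NUMA node that cache belongs to.
type p struct {
	cache *mcache
	node  int
}

func nodeOf(cpu int) int { return cpu / cpusPerNode }

// acquireMcache/releaseMcache stand in for hypothetical per-node
// mcache pools.
func acquireMcache(node int) *mcache { return &mcache{node: node} }
func releaseMcache(c *mcache)        {}

// execute would run when a P starts a new goroutine: if the OS has
// migrated us to another node, trade the mcache for a node-local one.
func (pp *p) execute(currentCPU int) {
	if n := nodeOf(currentCPU); pp.cache == nil || n != pp.node {
		if pp.cache != nil {
			releaseMcache(pp.cache)
		}
		pp.cache = acquireMcache(n)
		pp.node = n
	}
}

func main() {
	var pp p
	pp.execute(1) // running on CPU 1 -> node 0
	fmt.Println("mcache node:", pp.node)
	pp.execute(6) // OS migrated us to CPU 6 -> node 1: swap caches
	fmt.Println("mcache node:", pp.node)
}
```

Note that this sketch sidesteps the GOMAXPROCS problem described above: a preempted thread still holds its mcache until it next runs a goroutine, so the per-node pools must hold more than GOMAXPROCS caches in total.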