Below are some possible avenues of investigation based on the
information available. Your CPU is a 10-core part; you said you have 80
CPUs, so that is 8 processors (sockets). If you place the 25 JVMs in
eight 10-CPU containers (cpuset[2] cgroups) and align them along
processor boundaries, you'll likely minimise remote NUMA memory costs
and kernel scheduling[3][4] costs (a cpuset sketch follows the quote
below):
"
By default, there is one sched domain covering all CPUs, [...]
This default load balancing across all CPUs is not well suited for
the following two situations:
1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
2) [...]
When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwised pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
"[1]
To make the link between scheduler overhead and Java work you can use
Linux perf, either sampling the CPU or, as below, tracing events like
'sched:sched_wakeup':
java 26969 [000] 1368598.128643: sched:sched_wakeup: java:26945 [120] success=1 CPU:006
        ffffffff810a6732 ttwu_do_wakeup ([kernel.kallsyms])
        ffffffff810a682d ttwu_do_activate.constprop.85 ([kernel.kallsyms])
        ffffffff810a93bd try_to_wake_up ([kernel.kallsyms])
        ffffffff810a94f0 wake_up_state ([kernel.kallsyms])
        ffffffff810d2396 wake_futex ([kernel.kallsyms])
        ffffffff810d2757 futex_wake_op ([kernel.kallsyms])
        ffffffff810d4f52 do_futex ([kernel.kallsyms])
        ffffffff810d5410 sys_futex ([kernel.kallsyms])
        ffffffff81614209 system_call_fastpath ([kernel.kallsyms])
        7f6dddb29dc6 pthread_cond_signal@@GLIBC_2.3.2 (/usr/lib64/libpthread-2.17.so)
        7f6ddcc40095 Monitor::IWait (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddcc40746 Monitor::wait (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce42cfa WorkGang::run_task (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddcca175a ParNewGeneration::collect (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddc94fdca GenCollectedHeap::do_collection (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddc7ee4c4 GenCollectorPolicy::satisfy_failed_allocation (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce237f4 VM_GenCollectForAllocation::doit (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce2bc35 VM_Operation::evaluate (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce2a00a VMThread::evaluate_operation (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce2a38e VMThread::loop (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddce2a800 VMThread::run (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6ddcc85b88 java_start (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
        7f6dddb25df5 start_thread (/usr/lib64/libpthread-2.17.so)
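A trace like the one above can be captured with something along these
lines (a sketch; the pid and duration are placeholders):
# record scheduler wakeups for one JVM, with call stacks
perf record -e sched:sched_wakeup -g -p <java-pid> -- sleep 10
# print the events and stacks
perf script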
In terms of gathering more information:
Aleksey mentioned using perf in its CPU sampling mode. You can also
run it system-wide:
java MyApp
# wait for problem
perf record -ga -- sleep 20
./run_perf_map_agent
# check the output/symbols/events
perf script
# post process to flamegraphs
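For the flamegraph step, one option is Brendan Gregg's FlameGraph
scripts (https://github.com/brendangregg/FlameGraph), e.g. assuming the
repo is checked out next to the perf.data file:
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg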
CPU sampling indicates where CPU time is spent. Tracing events, on the
other hand, shows how threads proceed and interact at certain points in
the code. Knowing that this is something to do with futexes, you could
trace signal generation/delivery or scheduler wakeup/switch, etc.
The perf events for wakeup/switch are 'sched:sched_wakeup' and
'sched:sched_switch' (caution: these may be high-frequency events and
cause high tracing overhead). Use perf-map-agent [8] to get visibility
of compiled Java stack frames if application code is provoking
contended locks, JIT compilation, failed allocations requiring GC, etc.
All of these involve the kernel scheduler coordinating with other
threads.
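As a sketch of the Java symbol side (the script name/path comes from
the perf-map-agent build, so treat it as an assumption): run the JVM
with -XX:+PreserveFramePointer (8u60+) so perf can walk the Java
frames, then generate the symbol map before running perf script:
java -XX:+PreserveFramePointer MyApp
# ... record with perf as above, then attach the map agent:
./perf-map-agent/bin/create-java-perf-map.sh <java-pid>
perf script    # Java frames should now resolve via /tmp/perf-<pid>.map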
Perf sends all its events from the kernel to userspace, which isn't
very efficient for high-frequency events like the scheduler's, both in
terms of CPU and of disk space for the trace files. If you have a more
recent 4.x kernel you can use eBPF programs to trace and summarise in
the kernel, and only copy the aggregated results up to userspace, which
is more efficient.
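For example the bcc tools (if installed; the path below is typical but
an assumption) do the histogramming in kernel and only print summaries:
# run-queue (scheduler) latency histogram, 10s interval, once
/usr/share/bcc/tools/runqlat 10 1
# distribution of on-CPU time between context switches
/usr/share/bcc/tools/cpudist 10 1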
You could also monitor scheduler stats[5], numastat and vmstat. Linux
perf counters (perf stat) may give some high-level insight into CPU
efficiency and counts of events. One question for the counters is
whether the number of scheduler events is similar for container vs. no
container; if the counts are similar but the cost differs, the
scheduler is working harder per event.
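For instance (a sketch; run it once inside the containers and once
without, and compare):
# context switches, migrations and overall IPC, system wide for 30s
perf stat -a -e context-switches,cpu-migrations,cycles,instructions -- sleep 30
# per-process NUMA allocation spread
numastat -p <java-pid>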
Finally, and loosely related, some wakeups may not be necessary. You
are probably not hitting this, but there have been enhancements in
Java 9 [7] to eliminate spurious/unnecessary wakeups of threads
participating in safepoints and GC with certain configs. With such a
large number of JVMs, "guaranteed safepoints"[9] could also become a
material overhead.
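If you want to see whether safepoint pauses are a factor, HotSpot can
report them; a sketch with JDK 8 flags (MyApp is a placeholder):
java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
     MyApp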
Thanks,
Alex
[1] https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
[2] http://man7.org/linux/man-pages/man7/cpuset.7.html
[3] https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt
[4] https://lwn.net/Articles/252533/
[5] https://www.kernel.org/doc/Documentation/scheduler/sched-stats.txt
[7] https://bugs.openjdk.java.net/browse/JDK-8151670
[8] https://github.com/jrudolph/perf-map-agent
[9] https://www.lmax.com/blog/staff-blogs/2015/08/05/jvm-guaranteed-safepoints/
On Mon, Jan 23, 2017 at 8:47 AM, Дмитрий Пеньков <dmit...@gmail.com> wrote:
> Separating server by containers with 10 CPU per container ratio solves
> problem, but it explains almost nothing. As I said, I configured number of
> GC threads, and if in this case nothing happens, I think problem somewhere
> in interJVMs (or their GCs' threads) communication. But I can't find
> information about behavior of JVMs and their GC's in my JVMs/CPU's ratio
> situation.