
Дмитрий Пеньков

Jan 22, 2017, 11:42:54 PM
to mechanical-sympathy
Hello everyone. I have the following situation: multiple HotSpot JVMs, each with multiple GC (ParallelGC) threads, on a multi-CPU host. I've seen recommendations to have at least 2 cores per JVM, but in my case it's 80 CPUs and about 200 JVMs. After running for about an hour, more than 95% of the time goes to the futex_wait syscall. I tried fixing the number of parallel GC threads at 4, but it had no effect. Maybe changing the GC would help? How can I solve this?
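
(A minimal sketch of how such a per-process syscall breakdown is usually measured, assuming the PID of one of the JVMs is known; strace adds noticeable overhead, so only attach briefly:)

# Attach to one JVM and summarise time per syscall; <pid> is a placeholder.
$ strace -c -f -p <pid>
# ...let it run for a minute, then Ctrl-C; the summary reports the share of
# time per syscall, with futex dominating in the scenario described above.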

Gil Tene

Jan 23, 2017, 12:48:16 AM
to mechanical-sympathy
What OS distro and kernel version are you running? I ask because any time someone mentions "interesting" behavior with futex_wait, my knee-jerk reaction is to first rule out the issue described in this previous posting before spending too much time investigating other things. Ruling it out is quick and easy (make sure your kernel is not in the version ranges affected by the bug described), so it is well worth doing as a first step.
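
(Gathering the information for that first check is a one-liner or two on a RHEL-style box; the affected version ranges themselves live in the linked posting and are not repeated here:)

# Record the distro and kernel version, then compare the kernel against the
# ranges listed in the futex_wait bug posting referenced above.
$ cat /etc/redhat-release
$ uname -r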

Assuming your issue is NOT the one linked to above, you have some analysis and digging to do. I'd say that when 200 actually active JVMs run on 80 cpus, seeing significant time spent in futex_wait is not necessarily a surprising situation. It [heavily] depends on what the applications are actually doing and what their activity patterns look like. You'd need to study that and start drilling down for explanations from there. Speculating about the cause (and potential solutions) based on the information given so far would be premature.

Kirk Pepperdine

Jan 23, 2017, 2:56:32 AM
to mechanica...@googlegroups.com
I would suggest that with 200 JVMs running on 80 cores you should consider using the serial collector.
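
(For reference, that is a single HotSpot flag per JVM; the heap sizes and application jar below are placeholders:)

$ java -XX:+UseSerialGC -Xms512m -Xmx512m -jar collector.jar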

Kind regards,
Kirk

Дмитрий Пеньков

Jan 23, 2017, 3:36:02 AM
to mechanical-sympathy
Hello, Gil! I'm surprised by your fast response.
Yes, I found the topic about the Linux futex issue and checked my OS version and CPU architecture against what is described there; no matches.
I have RHEL Server release 7.2, an Intel(R) Xeon(R) CPU E7-8860 @ 2.27GHz, JDK 1.8.0_65, HotSpot 64-bit Server VM. The JVMs are HP IUM collectors and process some data.
I know that my JVMs/CPUs ratio is not good, but I don't think that's the root of the problem. As far as I know, GC needs to stop all running threads to be able to walk the heap/stack. For this, it uses a particular signal which is sent to all threads; the signal handler calls sigwait to stop the thread.
Based on that, I thought decreasing the number of GC threads (HotSpot uses ParallelGC here, as far as I know) with -XX:ParallelGCThreads=N would help, but it didn't. So I have to conclude that I don't fully understand the relationship between GC threads and futex syscalls.
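
(One hedged way to confirm the cap actually took effect is to print the final flag values; note it only caps each JVM individually, so ~200 JVMs x 4 threads is still on the order of 800 GC threads contending for 80 CPUs:)

$ java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+PrintFlagsFinal -version | grep ParallelGCThreads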

On Monday, January 23, 2017 at 12:48:16 UTC+7, Gil Tene wrote:

Дмитрий Пеньков

Jan 23, 2017, 3:47:10 AM
to mechanica...@googlegroups.com
Separating the server into containers with 10 CPUs per container solves the problem, but it explains almost nothing. As I said, I configured the number of GC threads, and since that changed nothing, I think the problem lies somewhere in the interaction between the JVMs (or their GC threads). But I can't find information about the behavior of JVMs and their GCs at my JVMs/CPUs ratio.

On Monday, January 23, 2017 at 14:56:32 UTC+7, Kirk Pepperdine wrote:

Kirk Pepperdine

Jan 23, 2017, 3:56:59 AM
to mechanica...@googlegroups.com
So, if my understanding is correct, you are trying to give 200 JVMs 10 cores each. That means you are trying to distribute the work of 2000 virtual cores amongst 80. That is a 25:1 virtual-to-real core ratio, or each virtual core seeing 4% of a real core. Let's not even talk about the number of threads running in each JVM or the effects of 8 of the 200 JVMs having overlapping GC cycles.

Kind regards,
Kirk


Aleksey Shipilev

Jan 23, 2017, 4:14:51 AM
to mechanica...@googlegroups.com
$ perf record -g java ...

...and then

$ perf report

...to see where those futexes are called from.

It might be relatively benign, like the waits from the GC worker task queue, but maybe there are actual Java locks involved.

Thanks,
-Aleksey



Alex Bagehot

Jan 23, 2017, 6:15:29 PM
to mechanica...@googlegroups.com
Below are some possible avenues of investigation based on the information available. Your CPU is a 10-core part; you said you have 80 CPUs, so you have 8 processors. If you put 25 JVMs into each of 8 ten-CPU containers (cpuset[2] cgroups) and align them along processor boundaries, you'll likely minimise remote NUMA memory costs and kernel scheduling[3][4] costs (a minimal cpuset sketch follows the quoted excerpt below):

"
By default, there is one sched domain covering all CPUs, [...]

This default load balancing across all CPUs is not well suited for
the following two situations:
1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
2) [...]

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwised pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
"[1]

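(A minimal sketch of that partitioning with cgroup-v1 cpusets; the mount path, CPU range and NUMA node below are assumptions for illustration, so check the real topology first with lscpu / numactl -H:)

# Create one cpuset per processor and move a group of JVMs into it.
$ mkdir /sys/fs/cgroup/cpuset/jvmbox0
$ echo 0-9 > /sys/fs/cgroup/cpuset/jvmbox0/cpuset.cpus
$ echo 0 > /sys/fs/cgroup/cpuset/jvmbox0/cpuset.mems
$ echo <jvm-pid> > /sys/fs/cgroup/cpuset/jvmbox0/tasks    # repeat for each JVM in the group
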

To make the link between scheduler overhead and Java work you can use
linux perf either sampling the cpu or as below tracing events like
'sched:sched_wakeup' :

java 26969 [000] 1368598.128643: sched:sched_wakeup: java:26945 [120] success=1 CPU:006
    ffffffff810a6732 ttwu_do_wakeup ([kernel.kallsyms])
    ffffffff810a682d ttwu_do_activate.constprop.85 ([kernel.kallsyms])
    ffffffff810a93bd try_to_wake_up ([kernel.kallsyms])
    ffffffff810a94f0 wake_up_state ([kernel.kallsyms])
    ffffffff810d2396 wake_futex ([kernel.kallsyms])
    ffffffff810d2757 futex_wake_op ([kernel.kallsyms])
    ffffffff810d4f52 do_futex ([kernel.kallsyms])
    ffffffff810d5410 sys_futex ([kernel.kallsyms])
    ffffffff81614209 system_call_fastpath ([kernel.kallsyms])
    7f6dddb29dc6 pthread_cond_signal@@GLIBC_2.3.2 (/usr/lib64/libpthread-2.17.so)
    7f6ddcc40095 Monitor::IWait (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddcc40746 Monitor::wait (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce42cfa WorkGang::run_task (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddcca175a ParNewGeneration::collect (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddc94fdca GenCollectedHeap::do_collection (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddc7ee4c4 GenCollectorPolicy::satisfy_failed_allocation (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce237f4 VM_GenCollectForAllocation::doit (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce2bc35 VM_Operation::evaluate (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce2a00a VMThread::evaluate_operation (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce2a38e VMThread::loop (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddce2a800 VMThread::run (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6ddcc85b88 java_start (/usr/java/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so)
    7f6dddb25df5 start_thread (/usr/lib64/libpthread-2.17.so)

In terms of gathering more information:

Aleksey mentioned using Perf in its cpu sampling mode. You can also
run it system wide:

java MyApp
# wait for problem
perf record -ga -- sleep 20
./run_perf_map_agent
# check the output/symbols/events
perf script
# post process to flamegraphs
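
(The last step is typically done with Brendan Gregg's FlameGraph scripts; the repo path below is an assumption:)

$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > futex.svg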

CPU sampling indicates where the CPU time is spent. Tracing events, on the other hand, shows how threads proceed and interact at certain points in the code. Knowing that this has something to do with futexes, you could trace signal generation/delivery or scheduler wakeup/switch, etc.

The perf events for wakeup/switch are 'sched:sched_wakeup' and 'sched:sched_switch' (caution - these may be high-frequency events and cause high tracing overhead). Use perf-map-agent [8] to get visibility into compiled Java stack frames if application code is provoking contended locks, JIT compilation, failed allocations requiring GC, etc. All of these involve the kernel scheduler in coordinating with other threads.
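
(A short hedged example of that kind of trace, kept brief because of the overhead warning above:)

# Trace scheduler wakeups and switches system-wide for 10 seconds, with call graphs.
$ perf record -e sched:sched_wakeup -e sched:sched_switch -ag -- sleep 10
$ perf script | less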

Perf sends all its events from kernel to userspace, which isn't very efficient for high-frequency events like the scheduler's, both in terms of CPU and of disk space for the trace files. If you have a more recent 4.x kernel, you can use eBPF programs to trace and summarise in-kernel, and only copy the aggregated results up to userspace, which is more efficient.
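
(For example, with the bcc toolkit on a 4.x kernel; the tool paths are as packaged on most distros, and this is an assumption since the RHEL 7.2 box here ships an older kernel:)

# Run-queue latency histogram, aggregated in-kernel, printed after 10 seconds.
$ /usr/share/bcc/tools/runqlat 10 1
# Off-CPU time summarised by stack for 10 seconds (includes futex blocking).
$ /usr/share/bcc/tools/offcputime 10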

You could monitor scheduler stats[5], numastat, and vmstat. Linux perf counters (perf stat) may also give some high-level insight into CPU efficiency and counts of events. One question for the counters might be whether the number of scheduler events is similar with and without containers, or whether the scheduler is simply working harder per event.
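
(A hedged example of that counter comparison, run once with and once without the containers over the same workload window:)

# System-wide scheduler-related counters for 60 seconds.
$ perf stat -a -e context-switches -e cpu-migrations -e sched:sched_wakeup sleep 60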

Finally, and loosely related, some wakeups may not be necessary. You are probably not hitting this, but there have been some enhancements in Java 9 [7] to eliminate spurious/unnecessary wakeups of threads participating in safepoints and GC with certain configs. With such a large number of JVMs, "guaranteed safepoints"[9] could also potentially become a material overhead.
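
(On JDK 8, per-JVM safepoint and pause overhead can be made visible with standard HotSpot flags; the application jar is a placeholder:)

$ java -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -jar collector.jar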

Thanks,
Alex

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
[2] http://man7.org/linux/man-pages/man7/cpuset.7.html
[3] https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt
[4] https://lwn.net/Articles/252533/
[5] https://www.kernel.org/doc/Documentation/scheduler/sched-stats.txt
[7] https://bugs.openjdk.java.net/browse/JDK-8151670
[8] https://github.com/jrudolph/perf-map-agent
[9] https://www.lmax.com/blog/staff-blogs/2015/08/05/jvm-guaranteed-safepoints/



Дмитрий Пеньков

Jan 29, 2017, 8:36:51 PM
to mechanical-sympathy
> you are trying to give 200 JVMs 10 cores each.

Before separating, all 200 JVMs "see" all 80 CPUs. After separating, I have 8 boxes, each with some fixed number of JVMs (a share of those 200) and 10 CPUs per box.
