Long time to safepoint


Ross Black

Dec 5, 2018, 7:28:11 PM
to mechanical-sympathy
Hi,

When running Kafka servers on Linux, the ZooKeeper sessions have been expiring frequently.
After lots of investigation, I discovered some threads (not the ZooKeeper threads)
taking a very long time to reach the safepoint (frequently > 2 seconds).

The output from a pause looks like:


[2018-12-05T22:54:58.034+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:54:58.034+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:54:58.034+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
... (repeated continually for approx 2 seconds)

[2018-12-05T22:55:00.548+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:55:00.549+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:55:00.550+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _running _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:55:00.551+0000][safepoint] ... found polling page return exception at pc = 0x00007f3397c7deff, stub =0x00007f338fb4c900
[2018-12-05T22:55:00.551+0000][safepoint] Thread: 0x00007f23840d1800  [0x17e4] State: _call_back _has_called_back 0 _at_poll_safepoint 0
[2018-12-05T22:55:00.551+0000][safepoint] Waiting for 3 thread(s) to block
[2018-12-05T22:55:00.551+0000][safepoint] Entering safepoint region: G1IncCollectionPause
[2018-12-05T22:55:00.566+0000][safepoint] Leaving safepoint region
[2018-12-05T22:55:00.566+0000][safepoint] Total time for which application threads were stopped: 2.5331635 seconds, Stopping threads took: 2.5180973 seconds
=> BOOT @ 2018-12-05T03:12:26+00:00, epoch 1543979546
=> 2018-12-05T22:55:00+00:00: epoch 1544050500, boot offset 70954

thread 0x17e4 = 6116

Running 'perf' to sample threads:
  sudo perf record -F 99 --proc-map-timeout 5000 --sample-cpu --switch-events

  sudo perf script --tid 6116 > thread-6116.txt

 ReplicaFetcherT  6116 [007] 70951.936265:   47425865 cycles:ppp:      7f338fc9ad8a register_finalizer Runtime1 stub (/tmp/perf-3239.map)
 
ReplicaFetcherT  6116 [007] 70951.987993:   61736469 cycles:ppp:      7f33977d3165 Ljava/util/ArrayList;::add (/tmp/perf-3239.map)
** big time gap here
 
ReplicaFetcherT  6116 [007] 70954.862422:   70423359 cycles:ppp:      7f3397b26fe0 Ljava/util/LinkedHashMap;::afterNodeInsertion (/tmp/perf-3239.map)
 
ReplicaFetcherT  6116 [007] 70954.955571:   58601064 cycles:ppp:      7f3397782118 Ljava/lang/String;::charAt (/tmp/perf-3239.map)
 
ReplicaFetcherT  6116 [007] 70955.010857:   56837522 cycles:ppp:      7f33977347b6 Ljava/util/concurrent/ConcurrentHashMap;::tabAt (/tmp/perf-3239.map)



  sudo perf script --cpu 7 > cpu-7.txt

 ReplicaFetcherT  6116 [007] 70951.864747:   52073629 cycles:ppp:      7f3397f784f4 Lorg/apache/kafka/common/TopicPartition;::equals (/tmp/perf-3239.map)
         swapper     0 [007] 70951.889555:   52939697 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70951.914775:   45114901 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
 
ReplicaFetcherT  6116 [007] 70951.936265:   47425865 cycles:ppp:      7f338fc9ad8a register_finalizer Runtime1 stub (/tmp/perf-3239.map)
         swapper     0 [007] 70951.958858:   61162448 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
 
ReplicaFetcherT  6116 [007] 70951.987993:   61736469 cycles:ppp:      7f33977d3165 Ljava/util/ArrayList;::add (/tmp/perf-3239.map)
         swapper     0 [007] 70952.017399:   66816847 cycles:ppp:  ffffffff9f39ccb4 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70952.049231:   63438867 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
... (repeated continually for approx 2 seconds)

         swapper     0 [007] 70954.228216:  201536401 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.324191:  216277840 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.427185:  227564188 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.535554:  236205951 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.648051:  137191740 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.713384:  159810740 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.789490:  153089075 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
 
ReplicaFetcherT  6116 [007] 70954.862422:   70423359 cycles:ppp:      7f3397b26fe0 Ljava/util/LinkedHashMap;::afterNodeInsertion (/tmp/perf-3239.map)
         swapper     0 [007] 70954.895969:   65525177 cycles:ppp:  ffffffff9f39cca9 cpu_idle_poll ([kernel.kallsyms])
         swapper     0 [007] 70954.927183:   59590550 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
 
ReplicaFetcherT  6116 [007] 70954.955571:   58601064 cycles:ppp:      7f3397782118 Ljava/lang/String;::charAt (/tmp/perf-3239.map)
         swapper     0 [007] 70954.983485:   57455478 cycles:ppp:  ffffffff9f39ccab cpu_idle_poll ([kernel.kallsyms])
 
ReplicaFetcherT  6116 [007] 70955.010857:   56837522 cycles:ppp:      7f33977347b6 Ljava/util/concurrent/ConcurrentHashMap;::tabAt (/tmp/perf-3239.map)




It looks like the thread (and CPU?) has been descheduled for approx 2 seconds!

I am struggling to find what is causing this behaviour.
It is occurring on 3 Kafka servers running a moderate load (all with the same configuration).
No other applications are running on the machine (except for fail2ban-server, dnsmasq, and node_exporter).

I have made numerous changes to the JDK, OS, and BIOS settings, but none of them have solved the problem.
The problem occurred on multiple versions of JDK 8  - I updated to JDK 11 just to get better safepoint diagnostics.
I have been hunting this problem for weeks now, and was hoping somebody could provide some help
(suggestions, tools, diagnostic params, tracing tips, anything!)

Thanks,
Ross

---
The current state of the system is:

Running Ubuntu JDK 11 + debug symbols:
  openjdk version "10.0.2" 2018-07-17
  OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.3)
  OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.3, mixed mode)

/usr/lib/jvm/java-11-openjdk-amd64/bin/java -Xmx4G -Xms4G \
  -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent \
  -Djava.awt.headless=true \
  -XX:+PreserveFramePointer \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+DebugNonSafepoints \
  -Xlog:safepoint=trace:safepoint.log:utctime,tags:filecount=20,filesize=102400000 \
  -XX:+UseCountedLoopSafepoints \
  -XX:LoopStripMiningIter=10 \
  -XX:-Inline \
  ...


Description:    Ubuntu 18.04.1 LTS
Release:    18.04

BOOT_IMAGE=/vmlinuz-4.15.0-36-generic intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll nomodeset net.ifnames=0

Motherboard: SuperMicro X10DRU-i+_R1.02b
Backplane: SuperMicro BPN-SAS-815TQ
Processor: 2.1GHz Intel Xeon-Broadwell (E5-2620-V4)
Firmware Version 3.1 6-8-2018

I have disabled everything power related that I could find in BIOS
(following https://www.thomas-krenn.com/en/wiki/Disable_CPU_Power_Saving_Management_in_BIOS)

$ cat /sys/module/intel_idle/parameters/max_cstate
0
$ cat /sys/devices/system/cpu/cpuidle/current_driver
none
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance


Alex Bagehot

Dec 6, 2018, 6:45:43 AM
to mechanica...@googlegroups.com
Hi Ross,

Take a look at the safepoint profiler set-up analysis I did on this here:
"JVM detection of thread at safepoint"

Here is the detail on determining what the thread that was not coming to safepoint was doing:

Hopefully it gives you some commands to determine what is happening during the TTSP "big time gap here".

Thanks,
Alex



Avi Kivity

Dec 6, 2018, 6:59:11 AM
to mechanica...@googlegroups.com, Ross Black
On 12/6/18 2:28 AM, Ross Black wrote:
Hi,

When running Kafka servers on Linux, the ZooKeeper sessions have been expiring frequently.
After lots of investigation, I discovered some threads (not the ZooKeeper threads)
taking a very long time to reach the safepoint (frequently > 2 seconds).


A quick google shows that Kafka is using writeable mmaps. By itself this is already terrible if you have a large amount of data to write, but in combination with the JVM it is horrendous. An innocent memory access can cause the thread to be blocked on I/O, and since the dirtying rate can be much higher than the cleaning rate, you can end up with the thread blocked for large amounts of time. The JVM doesn't know anything about it, so as far as it knows the thread is at a non-safe point.


If Kafka has a setting that disables mmap, I recommend trying it. The thread will still block, but at least it won't prevent Java from reaching a safepoint.
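
A minimal sketch of the mechanism (the file path and size below are made up, and this is illustrative rather than Kafka's actual index code): a plain store into a writeable mapping can fault and block while the thread is still executing JIT-compiled Java code, which is exactly the window in which it cannot respond to a safepoint request.

  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class MmapWriteStall {
      public static void main(String[] args) throws Exception {
          try (RandomAccessFile raf = new RandomAccessFile("/tmp/demo-index.bin", "rw");
               FileChannel ch = raf.getChannel()) {
              // Map 1 GiB writeable, roughly the way an index file would be mapped.
              MappedByteBuffer index = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1L << 30);
              for (int pos = 0; pos + 4 <= (1 << 30); pos += 4) {
                  // Each putInt() is a plain store emitted by the JIT. If the page is not
                  // resident, or dirty-page writeback is throttling the process, the store
                  // page-faults and the thread blocks while it is still "in Java" as far as
                  // the VM is concerned, so it cannot reach the safepoint poll until the
                  // kernel resolves the fault.
                  index.putInt(pos, pos);
              }
          }
      }
  }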


Margaret Figura

Dec 6, 2018, 9:49:58 AM
to mechanica...@googlegroups.com
Hi Ross,

Do your servers have plenty of free memory that the kernel can use for the filesystem cache? Ideally, all data that might be requested from Kafka will be in the cache; otherwise accessing the data will cause a page fault that has to go to disk.

I'm not sure about your use-case, but for my project, we've done some work to make our usage of Kafka a bit more sympathetic to the system. We are in control of both the producers and the consumers, so our consumers publish their presence into ZooKeeper so that the producers for the corresponding topics can find them. The producers then use their consumers' current offsets (position within the Kafka topic (queue)) to determine whether they should slow down or pause and wait for the consumers to catch up. This back pressure prevents the gap between producers and consumers from widening, which would require more filesystem cache space in order to keep the requested data in memory.
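
A rough sketch of this idea (using the Kafka AdminClient to read the consumer group's committed offsets rather than ZooKeeper; the group id, lag threshold, and helper names here are hypothetical, not from our actual code) might look like:

  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.common.TopicPartition;

  import java.util.Collections;
  import java.util.Map;

  public class ProducerBackPressure {
      static final String GROUP = "downstream-consumers";   // hypothetical group id
      static final long MAX_LAG = 1_000_000;                 // hypothetical threshold (messages)

      // How far the consumer group is behind the log-end offset for this partition.
      static long lag(AdminClient admin, KafkaConsumer<?, ?> consumer, TopicPartition tp) throws Exception {
          Map<TopicPartition, OffsetAndMetadata> committed =
                  admin.listConsumerGroupOffsets(GROUP).partitionsToOffsetAndMetadata().get();
          OffsetAndMetadata om = committed.get(tp);
          long consumed = (om == null) ? 0L : om.offset();
          long logEnd = consumer.endOffsets(Collections.singleton(tp)).get(tp);
          return logEnd - consumed;
      }

      // Producer loop: back off while the consumers are too far behind, so the data they
      // still need stays in the page cache instead of being evicted and re-read from disk.
      static void produceWithBackPressure(AdminClient admin, KafkaConsumer<?, ?> consumer,
                                          TopicPartition tp, Runnable sendBatch) throws Exception {
          while (true) {
              while (lag(admin, consumer, tp) > MAX_LAG) {
                  Thread.sleep(500);
              }
              sendBatch.run();
          }
      }
  }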

Hope that helps,
Meg

Kodewerk

Dec 6, 2018, 10:06:02 AM
to mechanica...@googlegroups.com
Hi Ross,

The only thing I would like to know is how many cores are you using and what is the total thread count?

Kind regards,
Kirk


Ross Black

Dec 7, 2018, 4:15:18 AM
to mechanical-sympathy

Take a look at the safepoint profiler set-up analysis I did on this here:
"JVM detection of thread at safepoint"
 
Hi Alex.

If I understand this correctly, it grabs the thread stack at the time the thread invokes the safepoint poll?

I am quite new to using 'perf' so I did not manage to get your commands to work.
$ sudo perf probe -d probe_libjvm:*
"probe_libjvm:*" does not hit any event.

Are you able to explain in more detail, please? (I am presumably missing something obvious.)

I think that the JDK 11 safepoint logging is already showing the PC when it reaches the safepoint (if I can trust it?)

[2018-12-05T22:55:00.551+0000][safepoint] ... found polling page return exception at pc = 0x00007f3397c7deff, stub =0x00007f338fb4c900

from the perf-map generated for the Kafka process:
7f3397c7db40 80 Lkafka/server/ReplicaFetcherThread;::replicaId
7f3397c7dea0 80 Ljdk/internal/misc/Unsafe;::putIntUnaligned
7f3397c7e1c0 c0 Ljdk/internal/misc/Unsafe;::putIntUnaligned

which puts the PC somewhere within Unsafe::putIntUnaligned?

Thanks,
Ross

Ross Black

Dec 7, 2018, 4:25:27 AM
to mechanical-sympathy

The only thing I would like to know is how many cores are you using and what is the total thread count?


Hi Kirk,

It is running bare metal, with 32 CPUs showing under /sys/devices/system/cpu/

jstack shows ~80 threads excluding GC threads, and 136 threads including GC.
> [_java_thread_list=0x00007f41fc000b20, length=80]

This matches the safepoint logging that shows the thread count:
> Safepoint synchronization initiated. (81)


Ross Black

Dec 7, 2018, 4:41:47 AM
to mechanical-sympathy
Hi Meg,

Thanks for the response.

The info about how consumer lag may impact mapped memory is extremely interesting.  I had read similar about lag control elsewhere but never even considered it could be a problem here.
I control both producers and consumers for all of the topics, so I will take notice of whether the problem occurs more when some consumers are lagging.

As @Avi suggested, it seems extremely likely that Kafka's use of mmap is the cause of the safepoint delays.

Part of my problem is trying to prove it. I cannot see anything in the stack traces of blocked threads (but with sampling it is unlikely to be caught).
We are (probably unwisely) using ext4 + RAID - this might amplify any mmap problems.

We are running with 64G memory, and 4G allocated to Kafka heap.
It looks like Kafka has consumed most of the memory for cache.

              total        used        free      shared  buff/cache   available
Mem:       65849884     5526480      395740        8644    59927664    59640260
Swap:       1999868           0     1999868

%Cpu(s):  6.8 us,  0.4 sy,  0.0 ni, 91.6 id,  1.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65849884 total,   351272 free,  5523400 used, 59975212 buff/cache
KiB Swap:  1999868 total,  1999868 free,        0 used. 59642964 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12805 kafka     20   0 61.843g 4.523g  59444 S 312.5  7.2   3704:57 java


Thanks,
Ross

Anthony Maire

Dec 7, 2018, 4:53:43 AM
to mechanica...@googlegroups.com
What file system are you using?
EXT4 journalling can be a huge perf killer in such a use case and could cause this TTSP issue if Kafka uses mmap (cf. http://epickrram.blogspot.com/2015/12/journalling-revisited.html: "After much detailed investigation, we tracked the problem down to interaction between the writer code and the ext4 journalling daemon (jbd2)").
If you are using EXT4, you might want to do some tests with XFS or with journalling disabled; it might help.


Ismael Juma

Jan 5, 2019, 10:04:16 PM
to mechanica...@googlegroups.com, Ross Black
On Thu, Dec 6, 2018 at 3:59 AM Avi Kivity <a...@scylladb.com> wrote:

A quick google shows that Kafka is using writeable mmaps. By itself this is already terrible if you have a large amount of data to write, but in combination with the JVM it is horrendous.

Quick clarification: Kafka only uses writeable mmaps for the offset and time indexes.

Ismael

Ross Black

Jan 20, 2019, 10:31:28 PM
to mechanical-sympathy
Hi Anthony.

We were using RAID + EXT4, with journalling enabled.

Thanks for providing the link to the Mark Price blog. Those articles were fantastic in helping me understand the issue.
We moved several of the Kafka servers to XFS + JBOD and no longer get the large pauses.

Thanks for your help,
Ross

nigr...@gmail.com

Apr 28, 2019, 4:30:47 PM
to mechanical-sympathy
Sorry to join the party late, but... isn't a read-only mmap equally dangerous?
If the page cache doesn't contain its mapping, reading from it sequentially (and madvise is just... advice) won't save us from major/minor page faults anyway.
Just one major page fault with transparent huge pages + the JVM safepoint machinery could lead to a long TTSP... or am I missing something?
FileChannel::read seems to me the only way to go to avoid long TTSP, unless we are sure of what the OS page cache contains.
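
To make the comparison concrete, here is a small sketch of the two access paths as I understand them (the file name is made up, and the file is assumed to exist and be non-empty):

  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class MmapVsExplicitRead {
      public static void main(String[] args) throws Exception {
          try (RandomAccessFile raf = new RandomAccessFile("/tmp/segment.log", "r");
               FileChannel ch = raf.getChannel()) {

              // Path 1: read-only mmap. A page-cache miss turns this get() into a major
              // fault taken while the thread is executing JIT-compiled code, so a pending
              // safepoint has to wait until the fault is serviced.
              MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
              byte viaMmap = mapped.get(0);

              // Path 2: explicit read. The same miss is taken inside the read(2) syscall;
              // the thread is in native code, which the VM already treats as safepoint-safe,
              // so the remaining threads can still be stopped on time.
              ByteBuffer buf = ByteBuffer.allocateDirect(4096);
              ch.read(buf, 0);
              byte viaRead = buf.get(0);

              System.out.println(viaMmap + " / " + viaRead);
          }
      }
  }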