vmop                     [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
298.236: no vm operation [ 385 0 0 ] [ 0 0 21638 1 0 ] 0
320.877: no vm operation [ 336 0 0 ] [ 0 0 1978 1 0 ] 0
336.859: no vm operation [ 309 1 2 ] [ 0 0 6497 1 0 ] 0
351.362: no vm operation [ 294 0 0 ] [ 0 0 32821 0 0 ] 0
401.189: no vm operation [ 296 3 2 ] [ 0 0 3471 0 0 ] 0
427.675: no vm operation [ 303 0 0 ] [ 0 0 10414 1 0 ] 0
446.098: no vm operation [ 346 2 1 ] [ 0 0 48768 1 0 ] 0
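Picking the problem entries out of these statistics is mechanical. Below is a minimal sketch that
scans a log for safepoint-statistics lines and flags any whose "sync" time exceeds the 10000 ms
SafepointTimeoutDelay configured below; it assumes the exact column layout shown in the header
above, and the log path is just the first program argument. Against the excerpt above it would
flag the 351.362 (32821 ms) and 446.098 (48768 ms) entries.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Stream;

    public class SafepointSyncScan {
        // Column layout from the header above:
        //   <ts>: <vmop> [ total initially_running wait_to_block ] [ spin block sync cleanup vmop ] page_trap_count
        private static final Pattern LINE = Pattern.compile(
            "([\\d.]+):\\s+(.+?)\\s+\\[\\s*\\d+\\s+\\d+\\s+\\d+\\s*\\]"
            + "\\s+\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]");

        public static void main(String[] args) throws IOException {
            long timeoutMs = 10_000; // -XX:SafepointTimeoutDelay=10000 (ms)
            try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
                lines.forEach(line -> {
                    Matcher m = LINE.matcher(line);
                    if (m.find()) {
                        long syncMs = Long.parseLong(m.group(5)); // "sync" column, ms
                        if (syncMs > timeoutMs) {
                            System.out.printf("%ss  %s  sync=%d ms (over the %d ms timeout)%n",
                                m.group(1), m.group(2), syncMs, timeoutMs);
                        }
                    }
                });
            }
        }
    }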
2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
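The stopped-time lines tell the same story from the GC-log side: the 268.104 pause is almost
entirely time-to-safepoint (1.5089 of 1.5127 seconds, ~99.7%), while the 272.662 pause corresponds
to the genuine 1.80 s young GC logged just above it. A small sketch (same assumptions as the one
above) that computes the TTSP share per stopped-time line:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Stream;

    public class TtspShare {
        // Matches the -XX:+PrintGCApplicationStoppedTime lines above.
        private static final Pattern LINE = Pattern.compile(
            "stopped: ([\\d.]+) seconds, Stopping threads took: ([\\d.]+) seconds");

        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
                lines.forEach(line -> {
                    Matcher m = LINE.matcher(line);
                    if (m.find()) {
                        double total = Double.parseDouble(m.group(1));
                        double ttsp = Double.parseDouble(m.group(2));
                        if (total > 0.1) { // arbitrary cutoff: only pauses over 100 ms
                            System.out.printf("total=%.4fs ttsp=%.4fs (%.1f%% TTSP)%n",
                                total, ttsp, 100.0 * ttsp / total);
                        }
                    }
                });
            }
        }
    }

For reference, the JVM flags on these workers: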
-Xmx32g
-Xms32g
-XX:MaxDirectMemorySize=24g
-XX:+PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=10000
-XX:-UseBiasedLocking
-Xloggc:<path>/logs/gc_worker.log
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=7093
-Dcom.sun.management.jmxremote.rmi.port=7093
-Djava.net.preferIPv4Stack=true
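In case it helps, a quick way to confirm from inside a running worker that these flags actually
took effect is the standard HotSpotDiagnosticMXBean. A minimal sketch (note the Print* safepoint
flags only exist on JDK 8-era JVMs like these, and getVMOption() throws if a flag is unknown):

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class CheckSafepointFlags {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean diag = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // Read the safepoint-related flags back from the live JVM.
            for (String flag : new String[] {
                    "SafepointTimeout", "SafepointTimeoutDelay",
                    "PrintSafepointStatistics", "PrintSafepointStatisticsCount",
                    "UseBiasedLocking" }) {
                System.out.println(flag + " = " + diag.getVMOption(flag).getValue());
            }
        }
    }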
I have already read through most of the discussion at the following links, which cover somewhat
similar behavior:
- https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
- https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
- https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
- https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
- http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
- http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606...@webmail.messagingengine.com%3e
The application itself does a lot of memory-mapped file IO against spinning disks. Each mapping is
usually just a few MiB, and the files are spread across 12 enterprise-class disk drives or a
ramdisk. However, we have previously been able to max out 10Gbit connections with the same
application on much smaller machines (8-16 cores, 64GiB RAM) without any issue. On this hardware
there are 144 threads doing the mmapped IO at any given time.
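For concreteness, each IO thread does essentially the following (an illustrative sketch only, not
our actual code; the path and the 4 MiB cap are made up). The relevance to safepoints is the read
itself: a get() that page-faults against a slow disk leaves the thread blocked in the kernel,
which is exactly when it cannot reach a safepoint.

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Illustrative only: many threads each map a few-MiB region and read it
    // sequentially. A read that misses the page cache can stall in the kernel
    // (state D), during which the thread cannot respond to a safepoint poll.
    public class MappedReadWorker implements Runnable {
        private final String path; // hypothetical data file

        MappedReadWorker(String path) { this.path = path; }

        @Override
        public void run() {
            try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(ch.size(), 4 << 20));
                long sum = 0;
                while (buf.hasRemaining()) {
                    sum += buf.get(); // may page-fault; blocks in the kernel on a miss
                }
                System.out.println(path + " checksum " + sum);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }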
One of the points that stands out most in this situation is that -XX:+SafepointTimeout never
actually prints any information about threads failing to reach the safepoint. As I understand it,
whenever a safepoint statistics entry shows a "sync" time larger than 10000 ms, the names of the
threads that failed to reach the safepoint should be printed to stdout. My logs contain no such
output.
Any ideas for debugging this would be greatly appreciated. One thing I have not yet tried is using
perf and kernel uprobes to capture the stacks (as suggested in one of the threads above). The
reason I have not tried this is that the software runs on a cluster of ~90 nodes (all with the
same hardware and software stack), and the pauses occur seemingly at random across all of the
nodes. Setting up the uprobes and collecting the data will be a complex task which I haven't had
time for yet.
I'd appreciate anything anyone has to offer.
Thanks,
Zac
Detected TTSP issue: start: 41.768 wait: 1748.384
Dumping stack for thread 0x0000440019600000
"elasticsearch[localhost][search][T#49]" id: 171575 prio: 5 os_prio: 0 sched: SCHED_OTHER allowed_cpus: 000000000ffffffffffffffff
lock_release: 43.516 last_cpu: 9 cpu_time: 25
0 0x00007fc63245f5bc SafepointProfilerBuf::record_sync_stack(JavaThread*)
1 0x00007fc6324817e5 JavaThread::poll_at_safepoint_static(JavaThread*)
2 0x000000003000eae9 StubRoutines::safepoint_handler
3 0x00000000300dfc73 org.apache.lucene.store.DataInput.readVInt()I
4 0x0000000031fa91df org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V
5 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
6 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/lucene/index/TermContext;
7 0x00000000319a40f6 interpreter
8 0x00000000319a3cba interpreter
signal_sent: 42.768 signal_responded: 43.516 state: D wchan: wait_on_page_bit_killable last_cpu: 9 cpu_time: 25
0 0x00000000300926f2 org.apache.lucene.store.ByteBufferIndexInput.readByte()B
1 0x00000000300dfacd org.apache.lucene.store.DataInput.readVInt()I
2 0x0000000031fa9119 org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V
3 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
4 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/lucene/index/TermContext;
5 0x00000000319a40f6 interpreter
6 0x00000000319a3cba interpreter
7 0x00000000319a3cba interpreter
8 0x00000000319a3cba interpreter