I appreciate the responses. We looked into switching JVMs. We plan to
upgrade to the latest OpenJDK 8u252, but that won't happen for a bit -
and the Zing VM was unfortunately ruled out for now. If I can reproduce
the issue in an isolated environment outside the cluster I may try the
newer JVMs myself, but so far I haven't had luck isolating and
reproducing it anywhere else.
I brought up in the original post that the application performs IO via
mmap because, from most of my research, mmap can cause the JVM to pause
for an extended period of time on a page fault. However, based on our
system instrumentation, nothing suggests that we're fetching or writing
pages out to disk for tens of seconds to multiple minutes at a time. The
most data we mmap at a time is probably about 1MB.
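For context, here's a minimal sketch of the kind of access pattern I
mean (the file name and sizes are made up; our actual code is more
involved):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"),
                StandardOpenOption.READ)) {
            // map() itself is cheap; it only sets up the mapping. The
            // cost shows up on first access, when touching an unmapped
            // page triggers a fault the kernel services from disk.
            long size = Math.min(ch.size(), 1 << 20); // ~1MB, as in our workload
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get(); // may block on a major page fault
            }
            System.out.println("checksum: " + sum);
        }
    }
}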
According to our monitoring system, disk utilization is relatively low.
Here's an example of the system behavior we are seeing.
Below are some pauses we saw in the last week, with timestamps (the
columns represent the same fields as in the safepoint output):
operation        timestamp                          spin  block  sync       cleanup  vmop  page-trap-count
no-vm-operation  2020-06-18 05:47:14.972000+00:00   0ms   0ms    1058341ms  12ms     0ms   0ms
no-vm-operation  2020-06-18 06:05:06.332000+00:00   0ms   0ms    463072ms   15ms     0ms   0ms
Here is the monitoring system info for the ~1.5 hours around the time of
the events above (timestamps above are GMT, but monitoring is PST). The
spikes in utilization and disk IO don't seem to have a strong
correlation with the times that the JVM paused.
If mmap were at the root of the issue I would have expected more IO
utilization. From my understanding, for mmap to take an extended period
of time it would likely be evicting pages from the page cache and then
reading the data into memory. That would take a long time if either
(1) the only evict-able pages are dirty, since those need to be written
to disk first, (2) a read from disk took a long time, or some
combination of (1) and (2).
If the evict-able pages are not dirty, then a page cache miss should
only take approximately the time needed to read the data from disk,
because I would expect evicting a clean page to be a quick operation.
Maybe I have some misconceptions about how the page cache functions,
though. Either way, based on the system resource utilization, the
machines don't seem to be under any serious stress during these time
periods.
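To sanity-check that reasoning, I'm thinking of something like the
following probe: first-touch one byte per page of a mapped file and log
any access that takes much longer than a single disk read should. This
is just a rough sketch; the file path argument, 4KB page size, and 10ms
threshold are my assumptions, and readahead will hide some faults.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FaultLatencyProbe {
    private static final int PAGE_SIZE = 4096;             // assumed page size
    private static final long SLOW_THRESHOLD_US = 10_000;  // flag faults > 10ms

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                StandardOpenOption.READ)) {
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // Touch one byte per page and time it; the first touch of a
            // page not in the page cache should cost roughly one disk read.
            for (int pos = 0; pos < buf.limit(); pos += PAGE_SIZE) {
                long t0 = System.nanoTime();
                buf.get(pos);
                long us = (System.nanoTime() - t0) / 1_000;
                if (us > SLOW_THRESHOLD_US) {
                    System.out.printf("page %d: %d us%n", pos / PAGE_SIZE, us);
                }
            }
        }
    }
}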
I am skeptical that it is due to mmap because
of the observed system stats,
but I can't come up with another culprit. One of the pauses above is
1058s - nearly 18 minutes. It's hard for me to reason about how a
single (or even a few) mmap calls could take so long.
The 1000s+ pause is definitely the longest I've seen. I wonder if it
could somehow be attributed to faulty disks that take inexplicably long
to read data? The problem with that theory, though, is that we see
multi-minute pauses pretty consistently across most, if not all,
machines in the cluster. The probability that so many disks would
behave in such similar ways seems low.
I'm going to continue investigating the issue. My next steps include:
- Grabbing kernel and userspace stacks during a pause (see the watchdog sketch after this list)
- Collecting and analyzing stats on syscall latencies
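For catching pauses in the first place, I may embed something like the
in-process watchdog below. Caveat: a watchdog thread gets stopped at
safepoints too, so it can only report after the pause ends and dump
Java-level stacks; kernel stacks during the pause will still need
external tooling. The 100ms period and 1s threshold are arbitrary
choices on my part.

import java.util.Map;

public class PauseWatchdog {
    public static void main(String[] args) throws InterruptedException {
        startWatchdog();
        // In the real application the workload would run here; for this
        // standalone sketch we just keep the JVM alive.
        Thread.sleep(Long.MAX_VALUE);
    }

    static void startWatchdog() {
        Thread watchdog = new Thread(() -> {
            long last = System.nanoTime();
            while (true) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    return;
                }
                long now = System.nanoTime();
                long overshootMs = (now - last) / 1_000_000 - 100;
                if (overshootMs > 1_000) {
                    // We woke up far later than requested: either the JVM
                    // was at a safepoint or the OS descheduled us.
                    System.err.println("~" + overshootMs
                            + "ms pause detected; dumping stacks");
                    for (Map.Entry<Thread, StackTraceElement[]> e
                            : Thread.getAllStackTraces().entrySet()) {
                        System.err.println(e.getKey());
                        for (StackTraceElement frame : e.getValue()) {
                            System.err.println("    at " + frame);
                        }
                    }
                }
                last = now;
            }
        }, "pause-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
    }
}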
Thanks again for the suggestions,