Detecting cause of java app freezing

Denis Orlov

Apr 4, 2014, 2:35:17 AM

to mechanica...@googlegroups.com

We have a problem with our Java application.

Sometimes (once every several months) the app pauses for 2-5 seconds. We observe it as a loss of heartbeats for a set of our network clients. The app's log has no records during the freeze, and the GC logs don't show anything that could cause the application to freeze for a full 5 seconds.

Brief application profile:

- about 50 incoming and outgoing network connections (LBM over TCP/IP and UDP, JMS and JMX/RMI over TCP/IP),
- 3 GB of memory,
- about 100 threads

Could you suggest some tools or approaches that would allow us to identify the root cause?

As I understand it, jHiccup will just show us that we have pauses but will not help us understand the reasons.

If this list is the wrong place for this kind of question, pointers to the proper places would be much appreciated.

Richard Warburton

Apr 4, 2014, 4:17:20 AM

to mechanica...@googlegroups.com

Hi,

I'm sure other people will have helpful suggestions to make, but here's a couple of observations.

>>Sometimes (once every several months) the app pauses for 2-5 seconds. We observe it as a loss of heartbeats for a set of our network clients. The app's log has no records during the freeze, and the GC logs don't show anything that could cause the application to freeze for a full 5 seconds.

GC isn't the only thing that causes STW pauses. Even before a GC occurs, it takes time to reach a safepoint at which your app is fully paused. Lock inflation/deflation also still causes an STW pause on HotSpot. You'll see both of these in your GC logs only if you use -XX:+PrintGCApplicationStoppedTime. I would say it's unlikely that they're causing a 5-second pause, but it's still worth measuring.

>>As I understand it, jHiccup will just show us that we have pauses but will not help us understand the reasons.

I've not used jHiccup recently, but I remember it having a mode where it would spin off another JVM, run jHiccup in that JVM and then correlate the resulting pauses. That won't tell you the root cause of your problem, but it will help with the diagnosis. If you see long pauses within the application JVM but not in the jHiccup-only JVM, then you should look at issues specific to that JVM instance, such as GC, lock deflation/inflation, etc. If you see correlated pauses, or pauses of equal length in both the application JVM and the jHiccup JVM, then you can infer that you have a machine-level problem. For example, your threads may appear to be paused because they've been context-switched out, or because you're running in, say, a virtualised environment with a noisy neighbour. So jHiccup can be a helpful diagnostic tool, but it won't give you a complete diagnosis.
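The measurement idea behind this kind of tool can be sketched in a few lines of Java: a thread asks to sleep for a short, fixed interval and records how much later than requested it actually wakes up. This is only an illustration of the principle, not jHiccup's actual implementation; the class name `HiccupMeter` and its method are made up for the example. Running one copy inside the application JVM and one in a separate idle JVM on the same host allows the comparison described above.

```java
import java.util.concurrent.locks.LockSupport;

/** Minimal hiccup detector: sleeps for a fixed interval and records how late it wakes up. */
final class HiccupMeter {

    /**
     * Runs {@code samples} sleep cycles of {@code intervalNanos} each and returns the
     * largest overshoot (elapsed time minus requested sleep) in nanoseconds.
     * Early wake-ups are ignored, so the result is never negative.
     */
    static long maxOvershootNanos(int samples, long intervalNanos) {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            long overshoot = System.nanoTime() - start - intervalNanos;
            worst = Math.max(worst, overshoot);
        }
        return worst;
    }

    public static void main(String[] args) {
        // 1000 samples, 1 ms apart: roughly one second of observation.
        long worst = maxOvershootNanos(1000, 1_000_000L);
        System.out.printf("worst hiccup: %.3f ms%n", worst / 1e6);
    }
}
```

A real deployment would run the loop continuously and log each overshoot above some threshold with a timestamp, so that it can be lined up against the heartbeat losses.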

regards,

Richard Warburton

http://insightfullogic.com

@RichardWarburto

Martin Thompson

Apr 4, 2014, 4:38:50 AM

to mechanica...@googlegroups.com

As Richard has pointed out, you need to confirm whether it is a JVM issue or a wider platform issue. If the JVM is not the issue, I would turn attention to the virtual memory sub-system of Linux. I've seen similar-sized pauses related to page cache activity or the use of transparent huge pages (THP). In a low-latency environment you should turn off THP. Failure to allocate a THP can result in the kernel having to compact pages to obtain sufficient contiguous space for the huge page, and it does so as a stop-the-world event, just like GC!

A page cache behaviour to look out for is "page allocation failure" in the logs, which can be helped by tuning vm.min_free_kbytes so that page reclamation starts sooner and you don't get caught out. Linux is very aggressive at caching pages for file systems. The kernel also needs memory to handle "atomic" operations for actions such as IO transfers without waiting on page reclamation.
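As a rough way of checking these two settings from Java, the sketch below reads them from the Linux pseudo-filesystems. It assumes a mainline Linux kernel, where the THP mode is exposed at /sys/kernel/mm/transparent_hugepage/enabled with the active value in square brackets; some distributions use a different path, and none of this applies on other operating systems. The class name `ThpCheck` is made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Reads the THP mode and vm.min_free_kbytes on Linux, if the files are present. */
final class ThpCheck {
    // Assumed locations on a mainline Linux kernel; distributions may differ.
    static final Path THP_ENABLED = Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");
    static final Path MIN_FREE_KBYTES = Paths.get("/proc/sys/vm/min_free_kbytes");

    /** Extracts the bracketed (active) value from a line such as "always [madvise] never". */
    static String activeSetting(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(']', open + 1);
        return (open >= 0 && close > open) ? line.substring(open + 1, close) : "unknown";
    }

    /** Returns the trimmed file content, or null if the file is missing or unreadable. */
    static String readOrNull(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String thp = readOrNull(THP_ENABLED);
        String minFree = readOrNull(MIN_FREE_KBYTES);
        System.out.println("THP mode: " + (thp == null ? "not available" : activeSetting(thp)));
        System.out.println("vm.min_free_kbytes: " + (minFree == null ? "not available" : minFree));
    }
}
```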

It is also worth considering whether, in times of burst activity, your 100+ threads can starve out your heart-beating mechanism. Have you designed to prevent this, or do you have some means of knowing if it has happened?

Martin...

Denis Orlov

Apr 4, 2014, 7:37:53 AM

to mechanica...@googlegroups.com

Thanks a lot for the suggestions. We will start by setting up jHiccup.

I forgot to mention that we are on Intel, running Solaris with Solaris zones.

On 04.04.2014 at 12:38, "Martin Thompson" <mjp...@gmail.com> wrote:

--
You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/bEelGUQSZmA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Denis Orlov

Apr 7, 2014, 1:47:06 AM

to mechanica...@googlegroups.com

>>It is also worth considering whether, in times of burst activity, your 100+ threads can starve out your heart-beating mechanism. Have you designed to prevent this, or do you have some means of knowing if it has happened?

Yes, such a situation is possible. As far as I know, we don't try to prevent it (we use a network session library from another internal group). We just set the 'heartbeat loss' threshold to a large enough value (it was 1 sec; after the last app freeze we raised it to 4 sec).

Are there any well-known techniques for coping with, or identifying, thread starvation?

We noticed another thing that could be the reason for our app freezes. Each time before a freeze, another network library writes to the app log that it is trying to connect to a remote host (which doesn't respond, since we shut down that downstream system some time ago). The freeze happens right after that.

Is it possible in a Java app for I/O in one thread to block all I/O in every other thread?

alex bagehot

Apr 8, 2014, 12:04:03 PM

to mechanica...@googlegroups.com

>>Are there any well-known techniques for coping with, or identifying, thread starvation?

You need a lightweight way of detecting it, given that it's only a few seconds out of two months.

If the absence of a heartbeat event can be detected locally, for example by a looping script that checks the last event in a log, then as soon as it detects the absence it can collect diagnostics on the system and the JVM while the pause is still active.
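One possible in-process variant of this idea is sketched below: the heartbeat sender records a timestamp on every beat, and a separate watchdog thread captures a thread dump as soon as the timestamp goes stale. Note the limitation: a watchdog inside the same JVM cannot run while the whole JVM is stopped at a safepoint, so this only helps with starvation or lock contention; an external script, as described above, is still needed for JVM-wide or machine-wide stalls. The class `HeartbeatWatchdog` and its methods are hypothetical names for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/** Tracks the time of the last heartbeat and can dump all thread stacks when it goes stale. */
final class HeartbeatWatchdog {
    private final AtomicLong lastBeatNanos = new AtomicLong(System.nanoTime());
    private final long thresholdNanos;

    HeartbeatWatchdog(long thresholdMillis) {
        this.thresholdNanos = thresholdMillis * 1_000_000L;
    }

    /** To be called by the heartbeat sender every time a heartbeat goes out. */
    void beat() {
        lastBeatNanos.set(System.nanoTime());
    }

    /** True if no heartbeat has been recorded within the threshold. */
    boolean isStale() {
        return System.nanoTime() - lastBeatNanos.get() > thresholdNanos;
    }

    /** Snapshot of every live thread: name, state, lock being waited on, its owner, and the stack. */
    static String threadDump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(), bean.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) continue;
            sb.append('"').append(info.getThreadName()).append("\" state=")
              .append(info.getThreadState())
              .append(" waitingOn=").append(info.getLockName())
              .append(" ownedBy=").append(info.getLockOwnerName()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        HeartbeatWatchdog watchdog = new HeartbeatWatchdog(100); // 100 ms threshold for the demo
        LockSupport.parkNanos(200_000_000L);                     // simulate a missed heartbeat
        if (watchdog.isStale()) {
            System.out.println(threadDump());
        }
    }
}
```

In the dump, threads in state BLOCKED that all name the same `ownedBy` thread point at the lock holder that is stalling the rest.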

It would be worth setting up system and JVM monitoring on the environment to get a picture of what it looks like when working normally.

If the heartbeats are coming from clients, how do you know that they are reaching the server? That is, how do you know that it is the server/JVM that is freezing, and not the network or the clients?

>>Is it possible in a Java app for I/O in one thread to block all I/O in every other thread?

Yes. If the thread that is blocked on I/O holds monitor locks that the other threads must obtain at some point without a timeout, and the I/O blocks for long enough, then all the other threads will eventually become blocked too. It doesn't have to be the I/O itself; it can be the locks that the I/O thread is holding, or something else!
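The effect is easy to reproduce in isolation. In the sketch below, one thread enters a monitor and then blocks (a sleep stands in for a connect to an unresponsive host), while a second "heartbeat" thread needs the same monitor and is delayed for as long as the first one holds it. The shared lock and the class name `SharedLockStall` are assumptions for the demonstration; whether the libraries in question actually share a lock is something only a thread dump taken during the freeze can confirm.

```java
import java.util.concurrent.CountDownLatch;

/** Shows how one thread blocking inside a monitor delays an unrelated "heartbeat" thread. */
final class SharedLockStall {
    // Hypothetical monitor shared by the connecting library and the heartbeat sender.
    private static final Object SESSION_LOCK = new Object();

    /**
     * Starts a thread that holds SESSION_LOCK for {@code ioMillis}, then measures how long
     * the calling thread has to wait to enter the same monitor. Returns the wait in ms.
     */
    static long heartbeatDelayMillis(long ioMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread slowIo = new Thread(() -> {
            synchronized (SESSION_LOCK) {
                lockHeld.countDown();
                try {
                    Thread.sleep(ioMillis); // stands in for a blocking connect()
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "slow-io");
        slowIo.setDaemon(true);
        slowIo.start();
        try {
            lockHeld.await(); // make sure the other thread owns the lock first
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long start = System.nanoTime();
        synchronized (SESSION_LOCK) { // the "heartbeat" needs the same lock
            return (System.nanoTime() - start) / 1_000_000L;
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat delayed by about " + heartbeatDelayMillis(500) + " ms");
    }
}
```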

Rüdiger Möller

Apr 10, 2014, 4:54:06 PM

Just as a blind shot: I have had similar issues caused by stream redirection to an external Python script which caught everything written to System.out and System.err. Every time this single-threaded script did a log rollover, the Java process could not write to stdout, which caused the whole VM to stall. You can also get similar behaviour with very stupid synchronous logging combined with logfile rollover, or with logging to an NFS drive.
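One common way to keep application threads from stalling on a slow log destination is to hand messages to a bounded queue and let a single background thread do the writing, dropping messages when the queue is full. The sketch below is an illustration of that idea under stated assumptions, not something taken from the setup described above; `NonBlockingLog` is a made-up class, and production logging frameworks offer their own asynchronous appenders.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Logger whose callers never block: messages are queued, and dropped when the queue is full. */
final class NonBlockingLog {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    NonBlockingLog(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    sink.accept(queue.take()); // only this thread ever waits on the sink
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treated as a shutdown request
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Returns immediately; false means the message was dropped because the queue was full. */
    boolean log(String message) {
        if (queue.offer(message)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Number of messages dropped so far. */
    long dropped() {
        return dropped.get();
    }

    public static void main(String[] args) {
        NonBlockingLog log = new NonBlockingLog(1024, System.out::println);
        log.log("application thread is not blocked by the writer");
        try {
            Thread.sleep(100); // give the daemon writer a moment before the JVM exits
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("dropped so far: " + log.dropped());
    }
}
```

The trade-off is explicit: during a stall of the destination, log lines are lost rather than the application being paused, so the drop counter should itself be monitored.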
