Putting regular jstack snapshots into production.

Kevin Burton

Mar 16, 2015, 5:31:28 PM
to mechanica...@googlegroups.com
My app has a bit of jitter: from time to time it does 'weird' things that I'm trying to track down.

I just don't have retroactive telemetry to figure out what the code was doing at the time.

So what I think I want to do is have a 1 minute cron job that takes snapshots of the JVM via jstack and writes them to a 'ring buffer' or rolling log of snapshot files.

This way I can keep around about 2 hours of jstack snapshots and if something fails I can analyze these for any potential issues.
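The rolling-snapshot idea can also be sketched in-process, as a rough illustration only. The class name `ThreadDumpRing` and the `jstack-NNN.txt` file naming are made up for this example, and it uses `Thread.getAllStackTraces()` rather than the external jstack tool. Walking every thread's stack from inside the JVM most likely still needs the threads stopped, so this does not sidestep the safepoint concern below.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical helper: keeps the most recent `slots` thread dumps in a fixed set of files. */
class ThreadDumpRing {
    private final Path dir;
    private final int slots;
    private long sequence = 0;

    ThreadDumpRing(Path dir, int slots) {
        this.dir = dir;
        this.slots = slots;
    }

    /** Writes one dump, overwriting the oldest slot once the ring is full. */
    synchronized Path snapshot() throws IOException {
        Files.createDirectories(dir);
        // Slot index wraps around, so at most `slots` files ever exist.
        Path target = dir.resolve(String.format("jstack-%03d.txt", sequence % slots));
        sequence++;

        StringBuilder out = new StringBuilder();
        out.append("# taken at ").append(System.currentTimeMillis()).append('\n');
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        Files.write(target, out.toString().getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

With 120 slots and one call to `snapshot()` per minute (for example from a `ScheduledExecutorService`), the directory would hold roughly the last 2 hours of dumps.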

One thing I'm concerned about is holding long safepoints regularly.

It looks like it's 200-400ms to dump the whole jstack. I'm not sure what % of this is due to actually reading the data versus printing/formatting it.

But I think 200ms every 60000ms (1 minute) is probably fine.

Think I'm being paranoid here? It seems like the telemetry data is probably worth it.  

Greg Young

Mar 16, 2015, 5:32:52 PM
to mechanica...@googlegroups.com
Do you have multiple production servers, e.g. for backup? Can you
run it on a secondary node?
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-symp...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Studying for the Turing test

Kevin Burton

Mar 16, 2015, 6:05:52 PM
to mechanica...@googlegroups.com
Yes, that's a good point. I think we *could*, but what if the problem is in the production systems? Maybe I could have a paranoid mode where I just turn it on for all systems for a while.

Another idea is to not have it running all the time and only enable it when I need to track down an error.  

Greg Young

Mar 16, 2015, 6:08:29 PM
to mechanica...@googlegroups.com
So we used to do something like this where our "back up" system was a
fully live system e.g. fully duplicated. We ran such metrics normally
only on the back up and not on the primary (even though they both did
the same work). It worked for many things but not all.

The problem with retrospective stuff is that often when you turn it on
is not when you need it :)

Trask Stalnaker

Mar 16, 2015, 7:18:05 PM
to mechanica...@googlegroups.com
You can find out how long it takes the threads to reach safepoint and how long the safepoint is held by running with

-XX:+PrintSafepointStatistics -XX:+PrintGCApplicationStoppedTime -XX:PrintSafepointStatisticsCount=1

Then when you take a thread dump it will print something like

38.070: ThreadDump                       [      27          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0004191 seconds, Stopping threads took: 0.0000197 seconds
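If you end up collecting these lines from a log, a small parser makes it easy to compare pause times across dumps. This is only a sketch: the class name `StoppedTimeLine` is made up, and the pattern assumes the line has exactly the wording shown in the sample output above, which may differ between JVM versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "Total time for which application threads were stopped" line. */
class StoppedTimeLine {
    // Assumes the exact wording of the sample line above.
    private static final Pattern LINE = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds, "
        + "Stopping threads took: ([0-9.]+) seconds");

    final double stoppedMillis;   // first number on the line, converted to ms
    final double stoppingMillis;  // second number on the line, converted to ms

    private StoppedTimeLine(double stoppedMillis, double stoppingMillis) {
        this.stoppedMillis = stoppedMillis;
        this.stoppingMillis = stoppingMillis;
    }

    /** Returns null when the input is not a stopped-time line. */
    static StoppedTimeLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new StoppedTimeLine(
            Double.parseDouble(m.group(1)) * 1000.0,
            Double.parseDouble(m.group(2)) * 1000.0);
    }
}
```

For the sample line above this gives about 0.42 ms stopped in total, of which about 0.02 ms was spent stopping the threads.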

200-400ms sounds like a lot. If you have a chance to try this in your environment, please post back your findings.

Btw, it sounds like Flight Recorder is exactly what you're looking for (low overhead non-safepoint rolling stack trace sampling), minus the (unknown?) commercial cost of course.

Thanks,
Trask



Todd Lipcon

Mar 16, 2015, 8:02:12 PM
to mechanica...@googlegroups.com
Funny you bring this up -- I was having the exact discussion with colleagues this morning.

From my reading of the code a while back, jstack without the "-l" option doesn't invoke a global safepoint. Each thread has to come to a safepoint, but they don't need to wait for the other threads to also reach safepoint. So, although there might be some minor latency hit to each thread in turn, there isn't a global pause.

Do others who know more about this stuff disagree with my interpretation of the code?

-Todd

Vitaly Davidovich

Mar 16, 2015, 8:19:39 PM
to mechanica...@googlegroups.com

There's an old-ish thread on this: https://groups.google.com/forum/m/#!topic/mechanical-sympathy/Mb3OiLACer8

Gil says HotSpot doesn't support bringing just one thread to a safepoint; the entire VM needs to come to it. The double whammy is that jstack apparently causes this to be done per thread.

sent from my phone

Kevin Burton

Mar 16, 2015, 9:08:43 PM
to mechanica...@googlegroups.com
Wow. That isn't a double whammy... that's an N whammy, because it needs a global safepoint PER THREAD?

Kevin Burton

Mar 16, 2015, 9:19:41 PM
to mechanica...@googlegroups.com


On Monday, March 16, 2015 at 4:18:05 PM UTC-7, Trask Stalnaker wrote:
> You can find out how long it takes the threads to reach safepoint and how long the safepoint is held by running with
>
> -XX:+PrintSafepointStatistics -XX:+PrintGCApplicationStoppedTime -XX:PrintSafepointStatisticsCount=1

Cool. I'll try to play with that.

peter royal

Mar 16, 2015, 10:07:44 PM
to mechanica...@googlegroups.com
I read about https://github.com/riemann/riemann-jvm-profiler today which might be useful. (concept-wise at least)
-pete 

-- 
(peter.royal|osi)@pobox.com - http://fotap.org/~osi

Nitsan Wakart

Mar 18, 2015, 2:38:28 AM
to mechanica...@googlegroups.com
If you only care about executing threads you can use Honest Profiler. No safepoint is required, but only on-CPU threads are profiled.

Lifey

Mar 29, 2015, 10:26:18 AM
to mechanica...@googlegroups.com
BTW, if you want to analyze your thread dumps all together, you may consider using mjprof's "merge" and "group".

Kevin Burton

Mar 29, 2015, 7:00:50 PM
to mechanica...@googlegroups.com
That actually is pretty cool. I have some internal code called 'stackreport' (which I should probably OSS) that does grouping of all the threads. I find that this solves 99% of my requirements.

But I'll definitely take a look at this in the future.
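For anyone curious what grouping threads by stack can look like, here is a minimal sketch. It is not the 'stackreport' code mentioned above (that is internal and not shown in this thread); the class name `StackGrouper` and the report layout are invented for illustration. It treats two threads as belonging together only when their frames are identical and in the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of stack grouping; not the 'stackreport' tool itself. */
class StackGrouper {
    /** Groups thread names by identical stack trace (same frames in the same order). */
    static Map<List<StackTraceElement>, List<String>> group(Map<Thread, StackTraceElement[]> dump) {
        Map<List<StackTraceElement>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
            // List equality is element-wise, so identical stacks map to the same key.
            List<StackTraceElement> key = Arrays.asList(e.getValue());
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey().getName());
        }
        return groups;
    }

    /** Renders one block per distinct stack: thread count, thread names, then frames. */
    static String report(Map<List<StackTraceElement>, List<String>> groups) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<List<StackTraceElement>, List<String>> e : groups.entrySet()) {
            out.append(e.getValue().size()).append(" thread(s): ").append(e.getValue()).append('\n');
            for (StackTraceElement frame : e.getKey()) {
                out.append("\tat ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

A quick way to try it on a live JVM is `System.out.print(StackGrouper.report(StackGrouper.group(Thread.getAllStackTraces())));`, which tends to collapse a large idle thread pool into a single entry.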