Continuous performance monitoring with Java Flight Recorder (JFR)?


zeo...@gmail.com

Dec 1, 2016, 3:18:18 PM
to mechanical-sympathy, mar...@hirt.se
Hi all,

Does anyone know a good way to do continuous performance monitoring using JFR (JDK 8)? I am interested in using this on some Apache data pipeline projects (Spark, Flink, etc.). I have used JFR for fixed-duration performance profiling before, but continuous monitoring would be quite different.

The ideal scenario would be to set up JFR to write to UDP <ip:port> destinations with configurable update frequencies. Obviously that is not supported by JFR as it stands today. So I tried setting up a continuous recording with maxage=30s and running JFR.dump every 30s. To my surprise, the time range covered by the dumped .jfr files does NOT correspond to the maxage parameter I gave. Instead, the time ranges (FlightRecordingLoader.loadFile(new File("xyz.jfr")).timeRange) from successive JFR.dump calls can overlap and can be much longer than maxage.
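
One possible workaround for the overlapping dumps, sketched in Java: remember the newest event timestamp already processed and drop anything at or before it when reading the next dump. `DumpDeduplicator` and `newEvents` are hypothetical names, and the plain `long` timestamps stand in for whatever a JFR parser would return; none of this is JFR API. The sketch also assumes that no event with an older timestamp turns up in a later dump, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of JFR): filters out events that were
// already present in an earlier, overlapping dump.
class DumpDeduplicator {
    // Newest timestamp handed out so far; nothing has been seen initially.
    private long lastSeen = Long.MIN_VALUE;

    // Returns the timestamps strictly newer than anything returned before,
    // then advances the high-water mark to the newest one in this dump.
    List<Long> newEvents(List<Long> dumpTimestamps) {
        List<Long> fresh = new ArrayList<>();
        long newest = lastSeen;
        for (long t : dumpTimestamps) {
            if (t > lastSeen) {
                fresh.add(t);
                newest = Math.max(newest, t);
            }
        }
        lastSeen = newest;
        return fresh;
    }
}
```

Feeding it two dumps with timestamps 10, 20, 30 and then 20, 30, 40, 50 yields all three events the first time and only 40 and 50 the second time.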

So, a couple of questions for experienced JFR users:

-- What exactly are the semantics of maxage?
I imagined that maxage has 2 effects: discarding events older than maxage and aggregating certain metrics (like stacktrace sample counts) over the time interval. It appears my understanding was way off.

-- How does the event pool/buffer under consideration for next JFR.dump get reset?
I was hoping every JFR.dump would reset the pool and allow the next JFR.dump to output a non-overlapping time range. I was also wrong here.

-- Is there any way to do continuous perf monitoring with JFR with a configured aggregation and output interval?
One thing I did notice is that JFR periodically (the default seems to be 60s) flushes to chunk files and then rotates chunk files according to the maxchunksize param. I could use that mechanism to inotify-watch the repository dir and just read and parse the chunk files. However, a few things are missing if I wanted to go down this route: there is no way to set a "maxchunkage" (I would like to be able to set one as low as 10s), I would need to write a custom chunk file parser, and I am not sure whether chunk files contain all the symbols needed to resolve the typeids.
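
The repository-watching idea can be sketched in Java as a simple poll (plain directory listing rather than inotify, to keep it short). `ChunkCollector` and `collectOnce` are hypothetical names. The sketch assumes that finished chunks end in `.jfr` and that the chunk still being written has a different suffix; that naming should be checked against the JDK in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: copies finished chunk files out of the JFR repository
// directory, each file at most once.
class ChunkCollector {
    private final Path repository;
    private final Path destination;
    private final Set<String> copied = new HashSet<>();

    ChunkCollector(Path repository, Path destination) {
        this.repository = repository;
        this.destination = destination;
    }

    // Copies every "*.jfr" file not copied by an earlier call and returns the
    // names copied this time, sorted. Assumption: in-progress chunks do not
    // match "*.jfr", so they are skipped until they are complete.
    List<String> collectOnce() {
        List<String> fresh = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(repository, "*.jfr")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (copied.add(name)) {
                    Files.copy(file, destination.resolve(name), StandardCopyOption.REPLACE_EXISTING);
                    fresh.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(fresh);
        return fresh;
    }
}
```

Calling `collectOnce()` from a scheduled task every few seconds would approximate the inotify approach. It does not address the missing "maxchunkage" setting or the need for a chunk parser.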

Thanks!

Gil Tene

Dec 1, 2016, 7:06:59 PM
to mechanical-sympathy, mar...@hirt.se
Virtually all the benefits of monitoring come in production environments (by definition, I think), and that's probably why you don't see this scenario (as) commonly used with JFR.

Basically, using JFR in production [currently at least] requires a commercial Java SE Advanced license. How/if this is enforced technically is irrelevant; the click-through license that allows you to use it for free is specifically restricted to non-production use. This is spelled out in the Oracle Binary Code License Agreement for the Java SE Platform Products and JavaFX (http://www.oracle.com/technetwork/java/javase/terms/license/index.html), under SUPPLEMENTAL LICENSE TERMS... A. COMMERCIAL FEATURES. and B. SOFTWARE INTERNAL USE FOR DEVELOPMENT LICENSE GRANT. And since JFR is clearly marked as a "Commercial Feature" (you literally have to use the -XX:+UnlockCommercialFeatures -XX:+FlightRecorder flags to use it), it's impossible to claim ignorance of this fact. See e.g. https://www.infoq.com/news/2013/10/misson-control-flight-recorder, http://www.adam-bien.com/roller/abien/entry/java_mission_control_development_pricing, and https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/run.htm#JFRUH164 for some discussion and mentions around it.

So while JFR can and may do some cool (and even semi-unique) things for production monitoring, you'd have to clear the commercial pricing terms first, and those seem pretty steep, as in a list price of $5000 per 2 x86 cores according to the Oracle price list (http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf), which would equate to e.g. $40K per instance for EC2 m3.2xlarge instances, and $80K-160K per server for modern 2-socket servers (those with "only" 16-32 cores). While I'm sure the actual production pricing could end up much lower once purchasing departments finish hand-wrestling with Oracle's sales folks, it would probably still be way more than what other commercial monitoring and JVM-knowledgeable APM solutions that are much more feature-rich and focused would cost (e.g. Dynatrace, AppDynamics, NewRelic, etc.), all of which list at a tiny fraction of the Oracle Java SE Advanced list price levels (and are massively used in production systems).

Marcus Hirt

Dec 1, 2016, 7:44:38 PM
to zeo...@gmail.com, mechanical-sympathy
Hi there,

The max age tells the recording engine to keep data around until the max age has passed. Since the smallest unit of data to reason about in the file repository is a chunk, that usually means a little bit more. It’s only for controlling what data to keep in the repo - a simple mechanism to limit it.
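
A toy model in Java of why chunk-granular retention keeps more than maxage. This is only an illustration of the description above, not the actual JFR implementation; `ChunkRepositoryModel` and its methods are hypothetical, and the rule that a chunk is dropped only once all of its data is older than maxage is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model (not JFR code): a repository of finished chunks with a max age.
class ChunkRepositoryModel {
    // One finished chunk covering the time range [startSec, endSec).
    private static final class Chunk {
        final long startSec;
        final long endSec;

        Chunk(long startSec, long endSec) {
            this.startSec = startSec;
            this.endSec = endSec;
        }
    }

    private final Deque<Chunk> chunks = new ArrayDeque<>();
    private final long maxAgeSec;

    ChunkRepositoryModel(long maxAgeSec) {
        this.maxAgeSec = maxAgeSec;
    }

    // Chunks are expected to be added in time order.
    void addChunk(long startSec, long endSec) {
        chunks.addLast(new Chunk(startSec, endSec));
    }

    // Drops a chunk only when even its newest data is older than maxAge.
    // A chunk that is only partly too old is kept whole.
    void expire(long nowSec) {
        while (!chunks.isEmpty() && nowSec - chunks.peekFirst().endSec > maxAgeSec) {
            chunks.removeFirst();
        }
    }

    // Time span a dump taken at nowSec would cover: from the start of the
    // oldest retained chunk up to now (in-progress data included).
    long retainedSpanSec(long nowSec) {
        return chunks.isEmpty() ? 0 : nowSec - chunks.peekFirst().startSec;
    }
}
```

With maxage=30s and 60-second chunks covering 0-60 and 60-120, expiring at t=130 drops only the first chunk, so a dump at that moment would span 70 seconds rather than 30.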

Every chunk is self-contained. Even though it is not supported, there are people who simply copy chunks from the repo continuously. I know one example where this is done in a massive deployment, and where hundreds of thousands of recordings from select instances are passed through an automated script to look for certain patterns. A developer estimated that 95% of the issues encountered are solved using nothing but the recording data.

Kind regards,
Marcus 

zeo...@gmail.com

Dec 2, 2016, 5:08:37 AM
to mechanical-sympathy, mar...@hirt.se
Hi Gil,

Thanks for the heads up and price references! I was certainly wary of the license aspect even though the project I am planning is for open source development.

Would there be anything of similar capability in OpenJDK? Looking at the OpenJDK source repo, it seems that some more JEP 167 (http://openjdk.java.net/jeps/167) oriented changes have been introduced recently into JDK 9. The event tracing logic in the JDK 8 tracing code seems to already cover the core feature set of native (as opposed to BCI) JFR: stack trace samples, monitor waits, alloc/GC events, compiler events. However, judging by the small volume of changes between 2013 and 2015, I am guessing the tracing feature is not used much in OpenJDK 7/8 and might be of uncertain reliability (e.g. see this: https://bugs.openjdk.java.net/browse/JDK-8145788). Maybe I should look more into using those OpenJDK tracing capabilities instead of JFR for JDK 9. The runtime configurability and resource management (like JFR's buffer/chunk/checkpoint) isn't quite there yet, and I might need to write some hacks to enable an output destination other than stdout/stderr.

As for the commercial APMs out there, no doubt they will have lots of custom BCI to cover app server use cases. I wonder how well (low overhead and high accuracy) they do on the JVM native instrumentation side (stack samples, alloc events, monitor waits). The same goes for profilers like YourKit/JProfiler.

Zee

zeo...@gmail.com

Dec 2, 2016, 6:13:56 AM
to mechanical-sympathy, zeo...@gmail.com
Hi Marcus,

Thanks for the clarification; it is encouraging to hear about real deployment cases!

Do you know if there is a way to trigger chunk rotation at a higher frequency than every 60s? There doesn't seem to be any JFR option param that will modify that behavior. BTW, in JDK 8 jcmd doesn't accept most of the options (maxchunksize etc.) for JFR.start. Is it possible to call the VMJFR.rotate() method in the target JVM directly (maybe via a BTrace script)?

Zee

Gil Tene

Dec 2, 2016, 11:45:21 AM
to mechanical-sympathy, mar...@hirt.se
I agree with the need/wish for a common way to get such information from JVMs. A common & standard way for JVMs and OpenJDK to provide event tracing as well as low-runtime-cost JVM instrumentation details (for information not currently covered by JVMTI and not cheaply [enough] gleaned via BCI) would be very useful but is not (yet?) part of the platform. JFR is very capable but is custom to Oracle. Zing has similar capabilities (and in some cases even more detailed information, such as viewing down-to-the-generated-machine-instruction hotness and stack traces) but those capabilities are also custom (to Zing). And IBM's J9 has its own very detailed instrumentation capabilities.

When looking at overlaps with APM and profiling tools, BCI, JVMTI, and other standard and semi-standard instrumentation levels do provide some overlapping capabilities for most things, and often at very affordable and practical runtime costs [that's why these production-time API tools are so popular]. But there are certainly some JVM-based instrumentation capabilities that current JVMs (HotSpot, Zing, J9) can fundamentally do "better" than the spec'ed standard things that profilers and APMs have access to, and/or in much cheaper ways, leaving their information in the custom JVM tooling arena for now. Specific examples of this include things like (A) tracking and examining wait times on monitor and j.u.c lock instances (for which the JVM has first-hand knowledge that is better and more useful than the information that can be gleaned by external tools in a clean/cheap-enough way); (B) the ability to use tick-based stack tracing outside of safepoints to profile code behavior. This tick-based stack tracing [outside of safepoints] is important not only because it is "cheap enough" to provide practical profiles with near-zero runtime overhead, but because it is accurate enough to make that information useful (as opposed to tick-based at-safepoint or at-BCI-instrumented-point instrumentations, which will often skew profiles dramatically). And (C) the ability to track and report on very useful heap content stats (e.g. by-type heap occupancy and by-type occupancy velocities) that the JVM can fundamentally measure cheaply [as a nearly-free part of GC scanning] but that are not available to common tools via defined interfaces or log formats [forcing tools to re-instrument the heap to extract this data if they want it, often at a prohibitively high runtime cost].
Knowledge of generated code behavior, including information about compilations and deoptimizations, as well as the ability to express stack traces in terms of generated code locations (which makes stack traces both much cheaper and much more accurate in the non-Heisenberg-ing sense) is also an area for which JVMTI capabilities could be greatly extended.

But even without those we-could-do-better capabilities, monitoring JVM behaviors in production seems to be doing pretty well. It can always be better, of course, but the state of these production-monitoring tools for Java is generally well ahead of what is available in almost all other languages and/or runtimes.

Nitsan Wakart

Dec 7, 2016, 6:10:04 AM
to mechanica...@googlegroups.com
For profiling information you have the option of using honest-profiler.
It works on OpenJDK, Oracle (going back to before JMC days) and Zing (recent releases), though it does rely on an unofficial profiling API.
It does not cover anything else though (monitors/allocations etc)...