How badly do JFR stack traces lie?

233 views
Skip to first unread message

Remko Popma

unread,
Dec 2, 2017, 12:29:44 AM12/2/17
to mechanical-sympathy
For background, see https://stackoverflow.com/q/47590263/1446916

Apangin’s answer seems plausible, would like to hear insights from people on this list.

Remko

Oleg Mazurov

unread,
Dec 2, 2017, 10:43:34 PM12/2/17
to mechanical-sympathy
Indeed, JFR (or AsyncGetCallTrace) callstacks can not be blindly trusted. What likely happens here is that StringBuilder.setLength() is inlined along with its callees (Array.fills among them) and callers into some dynamically compiled method. When that method is asynchronously interrupted at an arbitrary point, in order to reconstruct the full Java callstack JFR has to rely on the PC to Java callstack fragment mapping provided by the JIT compiler for this method. As JDK-8022893 demonstrates, the compiler is not particularly good at that. To test this theory one would need to map events attributed to Arrays.fill() in the provided context to PC's in the actual compiled method, figure out what Java callstack fragments were provided by JIT for those PC's and see whether those fragments were correct (not trivial).
Why isn't this mapping problem revealed in exception callstacks generated by runtime? Exceptions like NPE do not happen at arbitrary points in code and, presumably, were easier to report and escalate to developers (it's runtime behavior after all). More subtle corner cases visible only with sophisticated tools get lower priority and remain in the backlog for years, if not merely discarded.

    -- Oleg

Nitsan Wakart

unread,
Dec 3, 2017, 6:38:53 AM12/3/17
to mechanica...@googlegroups.com
So, as apangin points out there's an issue where JFR cannot walk the stack safely. To add insult to injury, JFR does not report failed samples at all, which results in a systematic omission of certain methods from the profile. This is a massive reporting issue in my opinion, and has not been fixed in JDK 9. I have discussed it with members of the JFR team, hopefully it will be fixed in the near future. Honest Profiler and Async-Profiler are both significantly better in that regard.
On top of that, JFR and other Java level profilers, rely on the debug information provided by the JVM to help translate the sampled program counter(PC)to a Java bytecode and relevant stack trace. The debug information provided by default is quite sparse, and can be greatly improved by -XX:+DebugNonSafepoints. Even after that, the translation can be lacking, or misleading. This is due in part to certain compiler optimisations not creating the relevant mapping information. In any case the information is often incomplete, and where a mapping is not available the nearest mapping is taken(e.g. No bytecode is associated with PC, but there's a mapping available for PC+10, so report that BCI). The mapping of a single instruction to a BCI is also at times incorrect as the PC is in fact the combined result of many bytecodes.
Further more, instruction profiling itself suffers from certain inaccuracies leading to the reported PC to be a few(normally 1-10, but on rare occasions allot more) instructions after the instruction where most of the actual cost is.
The above complications are compounded by method in lining, which results in the compiler mixing up code from several methods together to generate a single 'real method'. So where before you could skid a few instructions, map to the wrong BCI, but still end up looking at the right method, with inlining you can easily skid between lines in different methods.
This issue exists in all profilers when reporting a Java line of code or method. Instruction level profilers will show the skid within the context of the real method, but will provide enough context IME for you to find the root cause.
The potential for error is quite large, and does happen. It does not render JFR useless, but it helps to be aware of the above and reach for other tools(honest-profiler, async-profiler, perf, Oracle Studio, VTune etc) when the data seems suspect. Definitely start by enabling DebugNonSafepoints.
> --
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Richard Warburton

unread,
Dec 3, 2017, 7:15:17 AM12/3/17
to mechanica...@googlegroups.com
Hi,

I'm glad being lazy in replying to this email thread has lead to Nitsan making most of the points I was going to make and some more that I would have missed ;)

I will also add to this that the underlying code that JFR uses I believe is a bit different to AsyncGetCallTrace, even though its taking a similar approach of using sigprof signals to interrupt the program and collect stack traces without safepointing. I've noticed a couple of occassions when profiling things where JFR has misattributed time taken even with -XX:+DebugNonSafepoints switched on that Honest Profiler didn't. I suspect that async-profiler and Oracle Studio would both do as well as Honest Profiler in this regard, though I didn't try them. I've not seen it enough times to really nail down when/why it happens but if you try to profile something network IO heavy it can sometimes miss a lot of the time sending/receiving from the network. I've also seen it happen when profiling an application that was doing large quantities of compression where the Java code was calling into a native library to perform the compression. It underestimated the time in this code path.

> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
regards,

  Richard Warburton

Reply all
Reply to author
Forward
0 new messages