JVM behavior switches between compiled and interpreted modes?


Animesh Trivedi

Nov 16, 2017, 4:40:57 AM
to JITWatch
Hi all, 

I am having a peculiar issue with one of my Java/Scala projects. Let me elaborate on what is happening. At some point during a run, the JVM switches from compiled into interpreted mode. At least, this is what I gather from my understanding of the JVM internals and from the profile (captured using the approach described in https://medium.com/netflix-techblog/java-in-flames-e763b3d32166). When this happens, the CPU profile looks something like this:

65.28%  java  perf-2030.map  [.] Interpreter
20.41%  java  perf-2030.map  [.] I2C/C2I adapters
 8.84%  java  perf-2030.map  [.] jbyte_disjoint_arraycopy
 1.42%  java  libjvm.so      [.] CodeHeap::find_start
 0.54%  java  libjvm.so      [.] TypeArrayKlass::allocate_common
[...]

This state persists until the end of the run, and naturally the performance collapses. However, there are runs (roughly 30% of them) where this switch does not happen; everything is properly JITed and I get good performance. I have seen this behavior while profiling some other Apache/Java projects as well, but never this severely. My code is a standalone mixed Java/Scala project with some HDFS dependencies. I have tested it with both the Oracle and OpenJDK JVMs, on both Java 8 and 9. So, my questions are:

1. Am I correct in assuming that symbols like "Interpreter" and "I2C/C2I adapters" mean the JVM is running in interpreted mode?
2. Can someone explain what is happening? I suspect some heuristic kicks in, but which one?
3. I have profiled it using JITWatch, but I do not see anything obvious. What kind of signals should I be looking for?
4. Is there a way I can see what code the "Interpreter" is executing?

Let me know if more information is needed.

Many thanks for your help! 
--
Animesh

Chris Newland

Nov 16, 2017, 4:46:13 PM
to JITWatch
Hi Animesh,

It sounds like you could be experiencing a high number of decompilations/recompilations and possibly hitting the "give-up" state in the JIT, where it stops attempting to optimise a method and leaves it in interpreted execution mode.

You can probably narrow this down to a few hot methods in your code; when you select those in JITWatch, you can use the timeline view or just the main window to see how many compilations there were for each method.

Unless you've overridden it, you will be using TieredCompilation (C1 attempts to produce a partly optimised version while gathering more profiling data, and then C2 performs the full optimisation using that profile). I've seen several applications where this has caused de-opt thrashing and even code-cache exhaustion.

If this is proprietary code I won't ask you to post a LogCompilation file, as they can expose a lot about your application, but if you're happy to send me one directly (ch...@chrisnewland.com) I could take a look and see if I spot anything obvious.

You could also try disabling TieredCompilation with -XX:-TieredCompilation or increasing the code cache with -XX:ReservedCodeCacheSize=BIGNUMBERm.

Cheers,

Chris

Animesh Trivedi

Nov 17, 2017, 7:29:05 AM
to Chris Newland, JITWatch
Hi Chris, 

Many thanks for helping me out. Based on your recommendation, I have narrowed the suspects down to these two functions [1] in the source code [2]:

1. FastIteratorRow.hasNext() - in a good run [4] this function is compiled 4 times, whereas in a bad run it is compiled 10 times! This function is the condition of the benchmark loop, so I can imagine that when it is interpreted, the performance collapses. You can see this happening around the 6-second mark in the bad-run log [3].
2. FastIteratorRow.readNext() - it also has some compilations around the 6-second mark, but I suspect they are related to the execution of the hasNext() function.

The code is open sourced at [1]. It reads an HDFS file that has an <int,payload> data pattern written to it; the int indicates the size of the variable-length byte-array payload. There are 16 parallel threads reading 16 HDFS files. The code of interest is in the SFFReadPattern.scala file. The private TestFastIterator class's run() method [5] contains the benchmarking loop, which calls the FastIteratorRow.hasNext() method.

I also tried disabling TieredCompilation and increasing the code cache size up to 2 GiB, but it did not make a difference. Some further questions I have:

1. How can I check why a particular decision was taken regarding the compilation/de-optimization of a function?
2. Between the C1 and C2 compilers and the different tiers (L1-L4), is there a preference order, such that one can say that C2/L4 code is better than C1/L3 code?
3. Are OSR compilations considered harmful or good?
4. Is Scala code not supported by the TriView and source commenting?

Meanwhile, I am going to try to isolate the condition that forces the compiler to de-optimize the hasNext() function.

Thanks,
--
Animesh

[1] Benchmarking project code: https://github.com/animeshtrivedi/filebench



Chris Newland

Nov 19, 2017, 5:43:38 PM
to JITWatch
Hi Animesh,

Answering in reverse order:

4) I know almost zero Scala and am relying on the bytecode line number table to map LogCompilation BCI references back to source lines. Scala source code mapping worked on the very basic examples I wrote for the sandbox but may not work at all for anything more complicated.

3) OSR is a good thing in that it gets you out of the interpreter and into (partially) optimised code faster than waiting for a fully optimised C2 version of the code to be emitted by the JIT.

2) I did a talk at JavaZone 2016 which hopefully answers this and some other introductory JIT questions https://2016.javazone.no/program/understanding-hot-spot-jvm-performance-with-jit-watch

1) Reasons for the JIT to compile a method: method invocation or loop back edge counters hit a threshold.

Reasons for the JIT to deoptimise a method:

a) Speculative optimisation invalidated (I think this one is happening)
b) a monomorphic or bimorphic method call site sees a 3rd type used and becomes megamorphic (see the sketch after this list)
c) class loading invalidates an assumption about the class hierarchy
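
To make (b) concrete, here is a rough, self-contained Scala sketch. It is not taken from your project and all the names (Shape, Circle, totalArea, and so on) are invented: a virtual call site that only sees one or two receiver types during warm-up can be speculatively devirtualised and inlined, and when a third type shows up the compiled code may have to be thrown away.

trait Shape { def area: Double }
class Circle(r: Double) extends Shape { def area: Double = math.Pi * r * r }
class Square(s: Double) extends Shape { def area: Double = s * s }
class Triangle(b: Double, h: Double) extends Shape { def area: Double = 0.5 * b * h }

object MegamorphicSketch {
  // The type profile gathered at the shapes(i).area call site drives inlining decisions.
  def totalArea(shapes: Array[Shape]): Double = {
    var sum = 0.0
    var i = 0
    while (i < shapes.length) {
      sum += shapes(i).area
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    // Warm-up only ever sees Circle and Square, so the call site is at most bimorphic.
    val warm = Array.fill[Shape](200000)(if (scala.util.Random.nextBoolean()) new Circle(1.0) else new Square(1.0))
    totalArea(warm)
    // A third receiver type appears; if area was devirtualised speculatively,
    // the JIT may deoptimise here (visible as an increasing decompiles='n' in the log).
    totalArea(Array[Shape](new Circle(1.0), new Square(1.0), new Triangle(1.0, 2.0)))
  }
}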

In your case, here are the compilations for FastIteratorRow.hasNext:

BAD RUN

grep "FastIteratorRow hasNext" jvm-slow.log | grep "<task "

<task compile_id='1814' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='119652' iicount='119651' decompiles='2' level='3' stamp='1.918'>

<task compile_id='1873' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='159223' iicount='159224' decompiles='3' level='3' stamp='1.972'>

<task compile_id='1766' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='7120' iicount='7120' level='2' stamp='1.792'>

<task compile_id='1797' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='49718' iicount='49719' decompiles='1' level='2' stamp='1.811'>

<task compile_id='2431' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='184882' iicount='184883' decompiles='4' level='3' stamp='6.471'>

<task compile_id='1815' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='125175' iicount='125176' decompiles='2' stamp='1.925'>

<task compile_id='2432' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='185532' iicount='185533' decompiles='4' stamp='6.472'>

<task compile_id='1798' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='52136' iicount='52137' decompiles='1' stamp='1.812'>

<task compile_id='1878' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='171984' iicount='171982' decompiles='3' stamp='1.985'>

<task compile_id='1787' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='13589' iicount='13591' stamp='1.796'>


GOOD RUN


grep "FastIteratorRow hasNext" jvm-ok.log | grep "<task "

<task compile_id='1763' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='1337' iicount='1337' level='3' stamp='2.028'>

<task compile_id='1792' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='23799' iicount='23800' decompiles='1' level='3' stamp='2.053'>

<task compile_id='1795' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='29256' iicount='29258' decompiles='1' stamp='2.057'>

<task compile_id='1782' method='com/github/animeshtrivedi/FileBench/FastIteratorRow hasNext ()Z' bytes='17' count='8742' iicount='8743' stamp='2.037'>

 
So you can see that in a bad run there are significantly more deoptimisations, which is likely the cause of the performance drop; the question is why?

I'm going to speculate now about what might be happening, and I could be completely wrong, but:

  override def hasNext(): Boolean = {
    readNext()
    !done
  }


hasNext() calls readNext() and I can see from the JIT logs that readNext is inlined into hasNext as a hot method.

I believe that a speculative optimisation in readNext is getting invalidated and therefore invalidating hasNext where it was inlined.

Here is the code for readNext()

  def readNext(): Unit = {
    incomingSize = stream.readInt()
    if (incomingSize == -1) {
      done = true
      this.stream.close()
    } else {
      if (buffer.length < incomingSize) {
        /* we resize the buffer */
        buffer = new Array[Byte](incomingSize)
      }
      /* then we read the next value */
      readFullByteArray(stream, buffer, incomingSize)
      unsafeRow.pointTo(buffer, incomingSize)
    }
  }

There are two if statements:

if (incomingSize == -1)
if (buffer.length < incomingSize)

The first if is an end-of-stream test, which will most often be false. The second if tests whether the buffer needs to be resized.

I think that both of these conditions are unlikely and the JIT has dead-code-eliminated their if-bodies and left a trap in place for deoptimisation if they do occur.

Can you confirm whether you are running your tests with repeatable input data? If the data varies, I would expect you to deoptimise when the input causes either one of these if branches to be taken after they have been DCE'd.
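
As a rough illustration of that last point, here is a hypothetical, self-contained Scala driver (not your code; ResizeBranchDemo, encode, drain, and the 192-byte record size are all invented for the sketch). It warms up a readNext-style loop with records that never take the resize branch and then feeds it one record that does; if the compiler has speculated that branch away, you would expect the LogCompilation output to show a deoptimisation around the point where the oversized record arrives.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object ResizeBranchDemo {
  // Write <int,payload> records followed by a -1 end-of-stream marker, as in readNext().
  def encode(sizes: Seq[Int]): DataInputStream = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    sizes.foreach { s => out.writeInt(s); out.write(new Array[Byte](s)) }
    out.writeInt(-1)
    new DataInputStream(new ByteArrayInputStream(bos.toByteArray))
  }

  // Same shape as the hot loop: a rarely-taken resize branch inside a tight read loop.
  def drain(in: DataInputStream): Int = {
    var buffer = new Array[Byte](192)
    var records = 0
    var size = in.readInt()
    while (size != -1) {
      if (buffer.length < size) buffer = new Array[Byte](size) // rarely-taken branch
      in.readFully(buffer, 0, size)
      records += 1
      size = in.readInt()
    }
    records
  }

  def main(args: Array[String]): Unit = {
    drain(encode(Seq.fill(200000)(192)))       // warm-up: the resize branch is never taken
    drain(encode(Seq.fill(1000)(192) :+ 4096)) // one oversized record finally takes it
  }
}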

I also see you are testing with 16 reader threads. It's possible that the OS thread scheduling is causing small variations in the JIT behaviour due to the compilation counters triggering in different orders. Can you try your test with 1 thread and let me know if the behaviour becomes predictable?

Cheers,

Chris

Animesh Trivedi

Nov 20, 2017, 1:03:42 PM
to Chris Newland, JITWatch
Hi Chris, 

Thanks a lot for the detailed explanation and the link to your talk. It was very helpful!

Over the weekend, with some brute-force exploration and a more systematic look at the key functions, I managed (I think!) to narrow it down to the fact that hasNext()'s deoptimization has a really bad effect on JITing, especially because it is used as the condition in while(itr.hasNext()). I looked at some sample implementations of Scala's Iterator[T] interface, and almost all of them keep hasNext() as a lightweight function, with next() doing the heavy lifting, whereas in my code it is the other way around. So I suspect that every time hasNext() is de-optimized, it really confuses the JVM around the conditional branching and the loop. I have therefore changed the code to something like: https://github.com/animeshtrivedi/filebench/blob/dev/src/main/scala/com/github/animeshtrivedi/FileBench/FastIteratorRow.scala#L61
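
Roughly, the reshaped iterator now looks something like this (a simplified sketch only, not the exact code: the real version reads from HDFS and fills an UnsafeRow, and the LightIterator name and the 192-byte default are just for illustration). hasNext() becomes a cheap state check and all of the reading moves into next():

import java.io.DataInputStream

class LightIterator(stream: DataInputStream) extends Iterator[Array[Byte]] {
  private var buffer = new Array[Byte](192)     // pre-sized for the expected payload
  private var incomingSize = stream.readInt()   // prefetch the first record header

  // No I/O and no allocation here, so the loop condition stays cheap and predictable.
  override def hasNext: Boolean = incomingSize != -1

  override def next(): Array[Byte] = {
    if (buffer.length < incomingSize) {
      buffer = new Array[Byte](incomingSize)    // resize only if the record is larger
    }
    stream.readFully(buffer, 0, incomingSize)
    val row = java.util.Arrays.copyOf(buffer, incomingSize)
    incomingSize = stream.readInt()             // prefetch the next header (-1 = end of stream)
    row
  }
}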

With this change, I have not run into the issue so far (let's see, though!).

To answer your questions: I am running the code with a repeatable input data set. The incoming payload was around 192 bytes, so I also pre-allocated the buffer and eliminated the resizing logic, but the error persisted. The error was hard to reproduce with 1, 2, or 4 cores, but with 8 and 16 cores it was more deterministic, so indeed there must be something with the OS scheduling in the mix as well. The challenging part was (and is) that the same logic sometimes worked, so it was hard to pin down a particular logic-wise condition that might be influencing the execution. I am still not sure what has happened or, more importantly, how to figure out from the logs which condition/environment/setting triggered this behavior, but for now I can continue. I will try to confirm in a couple of days whether this is indeed the root cause and how to identify and avoid such conditions ;)

Many thanks Chris, 
--
Animesh

Jason Zaugg

Nov 20, 2017, 9:55:46 PM
to JITWatch
Animesh: how long did your benchmark run in total? Did you have a warmup before you started measuring?

(I tried to run it myself but it seems to require `mvn install` on a set of downstream projects)

Regards

Jason Zaugg


Animesh Trivedi

Nov 22, 2017, 4:01:02 AM
to Jason Zaugg, JITWatch
Hi Jason, 

The benchmark runs for about 5-6 seconds (when it manages to run without problems). During this run, each thread reads about a couple of GBs of data from HDFS. Otherwise, the fastest thread (out of 16) takes about the same time, and then the others gradually take about 10, 15, or 20 seconds. These runs are without any warm-up.

The general challenge is that this behavior is often triggered in large distributed data-processing runs (with Spark, Hadoop, and YARN) on multiple machines, where some machines are affected and others are not. Hence, it is more difficult to analyze and pin down the issue in that environment. I still have to figure out what happened in the distributed runs where this behavior was triggered. I will try to re-write the Scala Iterator[T] interface implementation there as well.

For the issue I reported, I managed to make a sort of micro-benchmark with just HDFS file reading. Maybe it is possible to make a standalone Java program that triggers this behavior as well.

Let me know if I can help you reproduce the error, or if I can give you more details from my run.

Thanks,
--
Animesh

