HotSpot code cache and instruction prefetching


Chris Newland

Feb 3, 2017, 4:26:22 AM
to mechanical-sympathy
Hi,

I've been looking into how HotSpot arranges JIT-compiled native code in the code cache, and the allocation approach appears to be:

1) Search the free-list (a linked list of blocks freed up by old methods removed from the code cache)

2) If there is a large enough free block then use it. If not, allocate a new block at the end of the currently used region (until you reach the code cache size limit). A rough sketch of this first-fit scheme is below.
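In rough Java terms, my mental model of that scheme looks like this (a minimal sketch only: it ignores block splitting and coalescing, and the names are mine, not HotSpot's):

import java.util.Iterator;
import java.util.LinkedList;

class CodeCacheSketch {
    static final class Block {
        final long offset;
        final int size;
        Block(long offset, int size) { this.offset = offset; this.size = size; }
    }

    private final LinkedList<Block> freeList = new LinkedList<>();
    private long top = 0;     // high-water mark of the used region
    private final long limit; // total code cache size

    CodeCacheSketch(long limit) { this.limit = limit; }

    Block allocate(int size) {
        // 1) first-fit search of the free-list
        Iterator<Block> it = freeList.iterator();
        while (it.hasNext()) {
            Block b = it.next();
            if (b.size >= size) {
                it.remove();
                return b; // reuse a freed block (possibly over-sized)
            }
        }
        // 2) otherwise carve a new block off the end of the used region
        if (top + size > limit) {
            return null; // code cache is full
        }
        Block fresh = new Block(top, size);
        top += size;
        return fresh;
    }

    void free(Block b) {
        freeList.add(b); // blocks from evicted nmethods go back on the list
    }
}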

I've added a visualisation of this to JITWatch (https://www.youtube.com/watch?v=XeTgtS3xdcc), using the information found in the LogCompilation nmethod tags.
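(For anyone who wants to generate the same data: the log is produced with something like the command below, where MyApp stands in for your own main class. The nmethod tags carry each compiled method's address and size, which is the information the visualisation uses.)

java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log MyApp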

My question is: is HotSpot's placement of compiled methods optimal with regard to the CPU's instruction prefetching?

Once methods start being removed from the code cache and nmethods are placed less sequentially (in blocks taken from the free-list), will this make the layout worse for prefetching?

Do you think the HotSpot designers took this into account but found empirically that the simple algorithm is adequate (the cost/complexity outweighs the gains and hot methods are generally JIT-compiled together)?

Could there be any benefit in relocating blocks for hot call chains to match the call pattern once the program has reached a steady state? (assuming inlining has already succeeded as much as possible).

Since tiered compilation became the default, do you think the many (possibly unconnected) intermediate compilations have made prefetching worse?

Sorry for so many questions! Just interested in whether this matters or not to modern CPUs.

Many thanks,

Chris
@chriswhocodes

Aleksey Shipilev

Feb 3, 2017, 4:54:35 AM
to mechanica...@googlegroups.com
On 02/03/2017 10:26 AM, Chris Newland wrote:
> Do you think the HotSpot designers took this into account but found empirically
> that the simple algorithm is adequate (the cost/complexity outweighs the gains and hot
> methods are generally JIT-compiled together)?

Let's ask another question: do you have an example where that matters?

> Could there be any benefit in relocating blocks for hot call chains to match the
> call pattern once the program has reached a steady state? (assuming inlining has
> already succeeded as much as possible).

Well, the "hot path" is supposed to be inlined and critical path laid out
sequentially within the compilation unit, so it is not catastrophic.

> Since tiered compilation became the default, do you think the many (possibly
> unconnected) intermediate compilations have made prefetching worse?

...but yes, for tiered, there are versions of the code that are known to be
temporary (e.g. compilations at levels 1, 2, 3), while the final compilation stays
around for longer (level 4). This is why the Segmented Code Cache was implemented in
JDK 9: http://openjdk.java.net/jeps/197

IIRC, there were improvements on torturous workloads, and improvements in
nmethod scans.
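If you want to experiment with it, the JDK 9 flags look roughly like this (the
heap sizes here are purely illustrative, and YourApp is a placeholder):

java -XX:+SegmentedCodeCache \
     -XX:NonNMethodCodeHeapSize=8m \
     -XX:ProfiledCodeHeapSize=128m \
     -XX:NonProfiledCodeHeapSize=128m \
     YourApp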

Thanks,
-Aleksey


Chris Newland

Feb 4, 2017, 5:41:38 AM
to mechanical-sympathy
Hi Aleksey,


On Friday, 3 February 2017 09:54:35 UTC, Aleksey Shipilev wrote:
> On 02/03/2017 10:26 AM, Chris Newland wrote:
> > Do you think the HotSpot designers took this into account but found empirically
> > that the simple algorithm is adequate (the cost/complexity outweighs the gains and hot
> > methods are generally JIT-compiled together)?
>
> Let's ask another question: do you have an example where that matters?

No, this was just pure curiosity :) My gut feeling was that the CPU instruction cache would smooth out any hiccups in the prefetcher.

> > Could there be any benefit in relocating blocks for hot call chains to match the
> > call pattern once the program has reached a steady state? (assuming inlining has
> > already succeeded as much as possible).
>
> Well, the "hot path" is supposed to be inlined and the critical path laid out
> sequentially within the compilation unit, so it is not catastrophic.
>
> > Since tiered compilation became the default, do you think the many (possibly
> > unconnected) intermediate compilations have made prefetching worse?
>
> ...but yes, for tiered, there are versions of the code that are known to be
> temporary (e.g. compilations at levels 1, 2, 3), while the final compilation stays
> around for longer (level 4). This is why the Segmented Code Cache was implemented in
> JDK 9: http://openjdk.java.net/jeps/197


This JEP makes a lot more sense now that I have an understanding of the current code cache.

Thanks,

Chris

Nitsan Wakart

Feb 8, 2017, 10:18:12 AM
to mechanica...@googlegroups.com
We've implemented a code cache allocation scheme to this effect in recent versions of Zing. Zing's code cache was similarly naive, and since Zing has been doing tiered compilation for a while now, we started from a similar point to the one you describe.
The hypothesis (supported by some evidence) was that in workloads with a sufficiently large volume of compiled code and sufficiently numerous hot methods (one of those flat profiles where the list of methods taking > 1% of cycles is long), given long enough for late compiles to kick in, you can end up with a dispersed set of code blobs in your code cache. Ignoring the risk of code cache exhaustion, the cost we were seeing for some workloads was in iTLB misses.
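If you want to check whether your own workload suffers from this, Linux perf gives a rough indication; the event names vary by kernel and CPU, so treat this as illustrative:

perf stat -e iTLB-loads,iTLB-load-misses -p <jvm-pid> -- sleep 30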
The scheme we ended up with is different from JEP 197: we still have one code heap, and the segmentation is internal and follows a pretty simple scheme, but it seems to help :)
The change improved some workloads (client code, a finance application) by up to 4%; the impact varied by CPU. As these things go, a modest win.
Relocating observed hot paths together is complex and, as Aleksey points out, if they are very strongly correlated then inlining already helps this case. I can imagine a workload where it would help, but I doubt it justifies the work.
So, your intuition was to a large extent correct, and vendors are actively pursuing it, with some solutions already in the field and some around the corner. At least for Zing, we saw a measurable positive effect on certain real-world workloads, and I expect the OpenJDK solution will deliver similarly on those workloads.
A further, and more significant, boost was achieved by putting the code cache on larger pages. This is internal to Zing and does not require OS configuration. Increasing the page size to 2MB improved certain workloads by more than 10% (a large number of compilation units + memory pressure). A similar improvement should be possible on OpenJDK by enabling -XX:+UseLargePages; I believe Sergey Kuksenko describes such a case in one of his talks. I've not used this option myself so I can't comment on its suitability.
Both optimizations are enabled by default in the latest Zing versions.
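For reference, the OpenJDK large-pages setup would look something like the following; as I said, I haven't run this myself, so take it as a sketch (large pages usually have to be reserved at the OS level first, and YourApp is a placeholder):

# reserve 2MB huge pages at the OS level, e.g.:
echo 512 | sudo tee /proc/sys/vm/nr_hugepages

java -XX:+UseLargePages -XX:ReservedCodeCacheSize=256m YourApp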
Hope this helps,
Nitsan

Sergey Melnikov

Feb 9, 2017, 3:21:14 PM
to mechanical-sympathy
Nitsan, just out of curiosity: in Zing, do you have any optimizations for code size? I mean, most advanced performance optimizations require additional code (code layout, aggressive versioning, unrolling, ...), so the code size growth caused by these optimizations may exhaust the code cache.

--Sergey



Nitsan Wakart

Feb 10, 2017, 7:22:49 AM
to mechanica...@googlegroups.com
We have no availability-based heuristics, nor does anyone else AFAIK, for dynamically enabling/disabling/tuning code generation to fit into smaller code cache pools, or for reacting to low-space indications. Running out of code cache space is pretty rare, and you can have a bigger cache if you like.
Zing and OpenJDK, as well as other compilers, apply aggressive and optimistic optimizations that minimize code size, such as implicit null checks, branch/exception elimination, constant folding, and Class Hierarchy Analysis, to name a few. The general way these work is that the compiler either proves some code to be redundant or optimistically refrains from generating some unlikely code path (with a de-optimization fallback).
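As a toy illustration of the Class Hierarchy Analysis case (my sketch, not Zing or HotSpot internals):

// While Circle is the only loaded implementor of Shape, CHA lets the JIT
// treat s.area() as a direct call and typically inline it, shrinking the
// generated code. Loading a second implementor later triggers
// deoptimization and recompilation with the virtual dispatch restored.
interface Shape {
    double area();
}

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

class Hot {
    static double sum(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area(); // devirtualized under CHA
        }
        return total;
    }
}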
As you point out, some optimizations bloat the code (unrolling etc.) or result in code duplication (inlining). The compilers have different heuristics for how much to inline, with a lot of seemingly arbitrary weights for the different parameters. I'll leave it to the people more involved in determining these heuristics to say how much they worry about code size. I recommend looking at the GA sample in JMH for an interesting approach to exploring the parameter space for inlining.
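For concreteness, a few of the HotSpot inlining knobs involved, with what I remember the defaults to be (approximate, so double-check before relying on them; YourApp is a placeholder):

java -XX:MaxInlineLevel=9 -XX:MaxInlineSize=35 -XX:FreqInlineSize=325 YourApp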