Tuning / Monitoring / Understanding Java GC Promotion Pressure


dk.h...@gmail.com

Aug 25, 2015, 8:28:27 AM
to mechanical-sympathy
Hello,

I have a couple of questions / doubts regarding the impact on latency of the GC promotion process from new generation to old generation.

1. Is there any way to log / monitor when this is happening? Does -verbose:gc help in any way here?

2. What is the best GC algorithm and tuning strategy to minimize these promotions during the program execution?

3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?

4. Is the latency cost of promoting really relevant?

Thx!

== Derek

Kirk Pepperdine

Aug 25, 2015, 8:57:12 AM
to mechanica...@googlegroups.com

> On Aug 25, 2015, at 2:28 PM, dk.h...@gmail.com wrote:
>
> Hello,
>
> I have a couple of questions / doubts regarding the impact on latency of the GC promotion process from new generation to old generation.

I have a number of GC-related talks and presentations scattered all over the web, as do Martin, Peter, and a few others on this list; you may find them helpful.

>
> 1. Is there any way to log / monitor when this is happening? Does -verbose:gc help in any way here?

Yes, I recommend -Xloggc:<some log file name> -XX:+PrintGCDetails -XX:+PrintTenuringDistribution as a minimal set of flags to log GC activity. I have written my own (commercial) tooling that I use to analyze GC logs to help make tuning decisions.
>
> 2. What is the best GC algorithm and tuning strategy to minimize these promotions during the program execution?

There are three aspects to tuning: application tuning, choosing a GC algorithm, and memory pool sizing. The first thing to focus on is making the application more memory efficient, which includes minimizing retained (live) set sizes as well as allocation rates. As far as collectors go, there is no one size fits all. There are throughput collectors and mostly-concurrent collectors. The throughput collectors tend to pause your application for longer periods of time than the mostly-concurrent collectors... but not always. Again, how the different collectors will perform depends on the type and strength of the memory pressure your application subjects your system to.
>
> 3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?

Yes, but that is generally not desirable with generational collectors. Collections of nursery spaces tend to be very cheap, whereas collections of other pools tend to be much more expensive. Promoting too early pushes data into a memory pool that is more expensive to collect, and tends to cause that pool to be collected much more frequently. Frequent expensive collections will generally hurt application throughput.

>
> 4. Is the latency cost of promoting really relevant?

No, generally not unless you promote volumes of small objects.

Kind regards,
Kirk


Gil Tene

Aug 25, 2015, 11:40:55 AM
to mechanical-sympathy


On Tuesday, August 25, 2015 at 5:28:27 AM UTC-7, dk.h...@gmail.com wrote:
Hello,

I have a couple of questions / doubts regarding the impact on latency of the GC promotion process from new generation to old generation.

1. Is there any way to log / monitor when this is happening? Does -verbose:gc help in any way here?

You can compute the amount of promotion and its rate over time directly from GC logs by calculating the growth of the oldgen at newgen collections. -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps are both useful for doing this.

For example, in the below output:

% java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx1g -jar HeapFragger.jar

1.365: [GC (Allocation Failure) [PSYoungGen: 65536K->10741K(76288K)] 65536K->46194K(251392K), 0.0315796 secs] [Times: user=0.18 sys=0.04, real=0.03 secs] 

3.911: [GC (Allocation Failure) [PSYoungGen: 76277K->10747K(141824K)] 111730K->46968K(316928K), 0.0123873 secs] [Times: user=0.07 sys=0.01, real=0.02 secs] 

8.965: [GC (Allocation Failure) [PSYoungGen: 141819K->10746K(141824K)] 178040K->48336K(316928K), 0.0123209 secs] [Times: user=0.08 sys=0.01, real=0.01 secs] 

13.945: [GC (Allocation Failure) [PSYoungGen: 141818K->10747K(141824K)] 179408K->49712K(316928K), 0.0104540 secs] [Times: user=0.08 sys=0.01, real=0.01 secs] 

18.922: [GC (Allocation Failure) [PSYoungGen: 141819K->10723K(137728K)] 180784K->51128K(312832K), 0.0107468 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] 

 
You can deduce that there were 774KB (46968K - 46194K) promoted during the young generation collection that happened 3.911 seconds into the run. You can similarly deduce that 1368KB was promoted at 8.965 (at an avg. of 270KB/sec), 1376KB was promoted at 13.945 (at an avg. of 276KB/sec), and 1416KB was promoted at 18.922 (at an avg. of 284KB/sec).

You can obviously build scripts to do this math for you, but there are some nice tools out there that do this too. Censum is a pretty cool one.
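That arithmetic is easy to script. Here is a minimal sketch (my own, not one of the tools mentioned), assuming ParallelGC log lines of the shape shown above and using the growth of total occupied heap after each collection as the oldgen-growth proxy:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: estimates promotion per young GC from ParallelGC log lines.
// Uses growth of total-heap-after-GC across collections as a proxy for
// oldgen growth (valid here because young-after occupancy is ~constant).
public class PromotionEstimator {

    // Matches "<time>: [GC ... (<youngCap>K)] <heapBefore>K-><heapAfter>K"
    private static final Pattern GC_LINE =
        Pattern.compile("(\\d+\\.\\d+): \\[GC .*?\\)\\] (\\d+)K->(\\d+)K");

    public static void main(String[] args) {
        List<String> log = List.of(
            "1.365: [GC (Allocation Failure) [PSYoungGen: 65536K->10741K(76288K)] 65536K->46194K(251392K), 0.0315796 secs]",
            "3.911: [GC (Allocation Failure) [PSYoungGen: 76277K->10747K(141824K)] 111730K->46968K(316928K), 0.0123873 secs]",
            "8.965: [GC (Allocation Failure) [PSYoungGen: 141819K->10746K(141824K)] 178040K->48336K(316928K), 0.0123209 secs]");

        double prevTime = Double.NaN;
        long prevHeapAfterK = -1;
        for (String line : log) {
            Matcher m = GC_LINE.matcher(line);
            if (!m.find()) continue;
            double time = Double.parseDouble(m.group(1));
            long heapAfterK = Long.parseLong(m.group(3));
            if (prevHeapAfterK >= 0) {
                long promotedK = heapAfterK - prevHeapAfterK;     // oldgen growth
                double rateKps = promotedK / (time - prevTime);   // avg since last GC
                System.out.printf("t=%.3fs promoted=%dK avg=%.0fK/sec%n",
                                  time, promotedK, rateKps);
            }
            prevTime = time;
            prevHeapAfterK = heapAfterK;
        }
    }
}
```

Running it over the sample lines reproduces the numbers above (774K promoted at 3.911s, 1368K at 8.965s).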

2. What is the best GC algorithm and tuning strategy to minimize these promotions during the program execution?

This will dramatically depend on your application's behavior. And you need to make sure you are asking the right question: Do you want to minimize promotion, or do you want to minimize the impact of promotion on your application's performance metrics (latency behavior, throughput, etc.)?

Taking the question as is ("... strategy to minimize these promotions..."):

There are certainly some reductions that can be achieved through GC tuning, and for the part of promotion that can be affected by tuning, promotion amount (and rate) is not really a function of the GC algorithm; it is more a matter of tuning within an algorithm (e.g. reducing young gen collection frequency by growing the young gen size, or increasing the number of young gen cycles an object has to survive before being promoted, by tuning things like tenuring thresholds). But once you get through the simple filters, you get to the inherent promotion behavior of your actual application, which is simply a side effect of the application work being performed.

Promotion is an inevitable side effect of "churn" in relatively long-lived objects, where "relatively" is in relation to the frequency of young generation collections. There are certainly some applications for which this churn rate is zero, but applications that hold some state for a relatively long (more than a few seconds) amount of time and eventually get that state replaced by other state cannot avoid promotion. Past some simple GC tuning that can be seen as a "filter", you are left with the result of the natural object lifecycle of your application's domain.

That (inherent) promotion tends to have two key behaviors: one whose rate is semi-linear with the application's throughput, and another that comes in spikes with phase changes in the application's execution. The semi-linear-to-throughput part is common for patterns like caching and stream processing (among many others), where some interesting amount of state is kept for a while but is inherently replaced by other state over time. The spiky behavior is often associated with large amounts of relatively stable data being suddenly replaced (e.g. catalog updates, cache flushes and refreshes, re-indexing, node compaction in databases, etc.). If your actual application includes these behavior patterns, then promotion is simply part of its behavior.

But if we modify the question (to "... strategy to minimize the impact of these promotions..."):

The choice of GC mechanism is critical to the impact of promotion on your application's behavior metrics.

E.g. in HotSpot (Oracle JDK and OpenJDK), ParallelGC (the default collector in HotSpot) will deal with promotion fairly efficiently from an overall throughput and CPU consumption perspective, but will tend to exhibit larger pauses, at a higher frequency, in both newgen and oldgen than other collectors do. Some of the larger pauses are due to inherent behavior (oldgen is pure stop-the-world), and some are due to default tuning choices (it tends to use larger newgen sizes, which result in larger newgen pauses when promotion spikes occur, but this can be controlled with flags). CMS will tend to do better on the frequency of large pauses (but not on the absolute size of how big those large pauses can get), and G1 has yet another mix.
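For context, the collectors being compared can be selected explicitly on JDK 8 era HotSpot; the heap sizes and app.jar below are placeholders, not recommendations:

```shell
# Throughput collector (the JDK 8 default):
java -XX:+UseParallelGC      -Xms4g -Xmx4g -jar app.jar

# Mostly-concurrent CMS:
java -XX:+UseConcMarkSweepGC -Xms4g -Xmx4g -jar app.jar

# G1:
java -XX:+UseG1GC            -Xms4g -Xmx4g -jar app.jar
```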

Of course, you have other choices, like Zing/C4, where the pausing effects of promotion are completely eliminated, with no tuning or tradeoffs needed...
 

3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?


Only if your application makes zero use of long-lived-but-eventually-dying objects. Certainly possible for some applications, but impossible for e.g. most caching patterns, and for stream processors that use a window of state spanning a time significantly longer than the time between young generation GC cycles.
 
4. Is the latency cost of promoting really relevant?

That depends on what latencies are relevant to you. The latency costs of promotion come in two flavors:

1. Longer young gen collections (which means longer pauses if your young gen is stop-the-world).

2. More frequent old gen collections (which means more frequent "larger" pauses if your old gen collection is stop-the-world, or has stop-the-world portions whose pause lengths are larger than your young gen pauses).

Both of these can be tuned to some degree.

For (1), promotion contributes to the length of an individual young gen pause (in pausing young gens). This contribution can be anywhere from milliseconds to hundreds of milliseconds (and maybe even more) depending on its pattern. E.g. promoting a few MB per young gen GC cycle will usually only add milliseconds to your young gen pause time, but promoting a newly computed index or a freshly re-loaded cache will often take 10s or 100s of msec. GC tuning can be used to trade off worst-case young gen pause size vs. frequency here. By capping the newgen size you cap the amount of promotion work that will be done in a given pause, but increase the frequency of the young gen pauses. This is common practice in some latency-sensitive applications. Conversely, increasing the young gen size reduces the frequency of young gen pauses, and can often reduce overall promotion rates, but increases the worst-case size of a young gen pause (which will show up in phase changes, when a large % of the young gen size can end up being promoted in a single pause).
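That newgen-size tradeoff, sketched in flag form (the sizes and app.jar are illustrative only, not recommendations):

```shell
# Cap newgen to bound per-pause promotion work (cost: more frequent young GCs):
java -XX:+UseParallelGC -Xmn256m -Xmx4g -jar app.jar

# Larger newgen: fewer young pauses and often less promotion overall, but a
# bigger worst-case pause when a phase change promotes a large fraction of it:
java -XX:+UseParallelGC -Xmn2g -Xmx4g -jar app.jar

# Watch survivor ages to see how close objects get to being promoted:
java -XX:+PrintTenuringDistribution -XX:MaxTenuringThreshold=8 -jar app.jar
```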

For (2), promotion contributes to the frequency of oldgen pauses more than to their length (for pausing oldgen collectors). Oldgen pause lengths have to do with the amount of live material in the oldgen and the type of work that is needed in the oldgen collection. These considerations will vary by collector. E.g. in CMS, you can sometimes get away with a mark/sweep and avoid compaction, which will leave you with longer-than-younggen-but-shorter-than-FullGC CMS pauses, but you will sometimes have to compact the oldgen because it has gotten too fragmented, which will result in a "much larger than typical" pause. Promotion rate itself does not change the size of the pause needed in either case, or the need to do the pause, but it can significantly affect how often the pauses (of whatever size) are needed. A higher promotion rate will lead to a higher oldgen GC rate (for a given oldgen heap size). A larger oldgen heap will lead to a lower oldgen GC rate (for a given promotion rate). But on pausing collectors, a larger heap will also mean longer pauses when they happen (for both oldgen and newgen). So within an application's inherent promotion behavior, you have things to trade off against each other (e.g. heap size vs. pause length and pause frequency).


Thx!

== Derek

dk.h...@gmail.com

Aug 25, 2015, 1:29:56 PM
to mechanical-sympathy

>> 3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?
> Only if your application makes zero use of long-lived-but-eventually-dying objects.

Applications that resort to object pooling and create no garbage for the GC would guarantee zero use of long-lived-but-eventually-dying objects as you mentioned, correct?

Gil Tene

Aug 25, 2015, 3:53:59 PM
to mechanical-sympathy


On Tuesday, August 25, 2015 at 10:29:56 AM UTC-7, dk.h...@gmail.com wrote:

>> 3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?
> Only if your application makes zero use of long-lived-but-eventually-dying objects.

Applications that resort to object pooling and create no garbage for the GC would guarantee zero use of long-lived-but-eventually-dying objects as you mentioned, correct?

Yes. But only if they did so for ALL Java objects (or all long-lived-but-eventually-dying ones). Pooling only part of those objects will only reduce the rate of promotion, but won't eliminate it and its eventual side effects.

While pooling (or otherwise avoiding the dynamic heap allocation of) all objects is certainly achievable (several people on this list have personally achieved it), it is a pretty big/hard thing to do. It usually comes with some pretty strict rules, like "don't use anyone else's Java code, including the core libraries, for anything" (because normal 3rd-party Java code, including the core Java libraries and collections, tends to allocate objects on the heap without pooling).

Partial pooling is a lot more common than 100% pooling. And as such it tends to delay, but not resolve, the issue. If that delay is enough to live until your next nightly reboot, maybe that's enough, but...

There is also the off-heaping option, which amounts to a combination of pooling (off heap memory) and removal of the pooled material from the oldgen heap. When you successfully do that you actually reduce the size of Oldgen GC pauses (and probably their frequency, too). But there too you have the 100% solutions vs. the partial solutions. Both exist, but the 100% ones are much more rare and harder to achieve.

Michael Barker

Aug 25, 2015, 5:18:38 PM
to mechanica...@googlegroups.com
Partial pooling solutions will often be worse than a more Java-idiomatic allocate-and-die-early approach to allocation.  We've witnessed this in some of our performance testing.  The cost often shows up in promotion, as you have more objects surviving the young generation.  This is particularly bad if your pools grow lazily; it can be mitigated somewhat by sizing/allocating the pools upfront and forcing GC early, but that wasn't hugely successful in our experiments.  Note that this is specifically with the collectors in HotSpot.  Now we run with Zing/C4 and this is not an issue we care about any more - we still have the same partial pooling implementation, it just doesn't have any impact on application pause time.

Mike.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

Aug 25, 2015, 5:59:57 PM
to mechanical-sympathy

You have to size your young gen such that you don't GC until next downtime; if that's achievable, then even lazily growing pools are fine.  However, it's still better to preallocate intelligently so you don't waste time on initial allocs and you can keep the pooled objects close in memory (assuming that's beneficial in the use case).
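A back-of-envelope feasibility check for that strategy (the allocation rate and downtime window below are hypothetical numbers, not measurements):

```java
// Rough feasibility check for "never young-GC until the next downtime":
// required Eden ≈ allocation rate × time until next scheduled restart.
public class EdenSizing {
    public static void main(String[] args) {
        double allocMBPerSec = 2.0;   // hypothetical steady allocation rate
        double windowHours = 12.0;    // hypothetical time between restarts
        double edenMB = allocMBPerSec * windowHours * 3600.0;
        System.out.printf("Eden needed: %.0f MB (~%.1f GB)%n",
                          edenMB, edenMB / 1024.0);
    }
}
```

Even a modest 2 MB/sec allocation rate over a 12-hour window implies roughly 84 GB of Eden, which is why this only works for applications with very low allocation rates or frequent restarts.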

sent from my phone

ricardo.dan...@gmail.com

Aug 25, 2015, 7:01:14 PM
to mechanical-sympathy

The best GC strategy, in my opinion, is not to GC at all, by pooling every object. Of course that pretty much means you will be using Java as a syntax language, with a set of libraries that also produce zero garbage. As Gil mentioned, the Java collections themselves produce garbage, so you have to use your own, or a good third-party real-time and garbage-free set of data structures. If you need inter-thread communication with zero garbage, you have to use CoralQueue or Disruptor, because even java.util.concurrent.ConcurrentLinkedQueue produces garbage.
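For illustration, the core idea of pooling fits in a few lines. This single-threaded sketch is mine (not CoralQueue/Disruptor code); real garbage-free pools additionally handle concurrency, strict preallocation, and leak detection:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal single-threaded object pool sketch: borrow() is allocation-free
// once the pool is warm; release() returns objects for reuse so they never
// become garbage (and never get promoted, once tenured).
final class ObjectPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    ObjectPool(Supplier<T> factory, int preallocate) {
        this.factory = factory;
        for (int i = 0; i < preallocate; i++) {
            free.push(factory.get());   // allocate up front, during warmup
        }
    }

    T borrow() {
        T obj = free.poll();
        return obj != null ? obj : factory.get(); // slow path: pool grows lazily
    }

    void release(T obj) {
        free.push(obj);                 // object is retained for reuse
    }

    public static void main(String[] args) {
        ObjectPool<StringBuilder> pool = new ObjectPool<>(StringBuilder::new, 4);
        StringBuilder sb = pool.borrow();
        sb.append("reused");
        pool.release(sb);
        System.out.println(pool.borrow() == sb);  // same instance comes back
    }
}
```

Note that callers must reset pooled objects before reuse (e.g. clear the StringBuilder); forgetting to do so is a classic pooling bug.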

Even if you are using a "bad" third-party library that produces garbage you can use some instrumentation techniques to spot where the "garbage leak" is and fix the source code if it is available. Not many people are aware that it is possible to intercept any Java allocation done with the 'new' keyword anywhere in your code through the use of a Java Agent (-javaagent JVM command-line option). If interested in this technique you can check CoralBits' MemorySampler class.

If you pool all objects and have a big enough young generation it is possible to write a network client in Java that sends/receives trillions of messages without ever triggering any GC activity (either promotion activity or straight garbage collection). That's what we have accomplished with CoralReactor.

If you can't or don't want to use these techniques, you must at least use a low-latency GC algo (such as Zing) otherwise your system will have to cope with big and random GC latencies.
Vitaly Davidovich

Aug 25, 2015, 7:47:18 PM
to mechanical-sympathy

I agree except I don't think it's necessary to pool *everything*.  One needs to consider available young gen (budget), allocation rate, and time span between down times.

sent from my phone

Gil Tene

Aug 25, 2015, 8:37:10 PM
to mechanica...@googlegroups.com
While "fit within the younggen until next scheduled downtime and never GC" applications certainly exist, they represent a tiny fraction even of low-latency applications, and a much tinier fraction of the applications you'd just call "latency sensitive". I've encountered only a handful or so of such applications. For the vast-vast-vast majority of applications built in Java, it is completely impractical to live in a working set static enough that no GC of any sort would be required for the lifetime of the process.

Then there is the next class of "fit within the oldgen until the next scheduled downtime and never incur an oldgen GC of any kind (but do incur young gen GCs)" applications. In the low latency space, there are certainly more of these than there are in the first class, and I've certainly encountered several tens of them. But the vast majority of latency-sensitive applications don't fit into this category either, and will encounter multiple old gen GCs between scheduled downtimes.

A telltale sign is the choice of garbage collector in HotSpot. When an application chooses CMS over ParallelGC, it is *always* (or maybe "should always be") because oldgen collections are expected to be encountered during normal operation. If no oldgen GCs are expected, ParallelGC's newgen tends to do better in terms of both throughput and latency compared to CMS's ParNew (they are both parallel and monolithic STW newgens, but ParallelGC's seems to do more try-to-be-shorter tricks, e.g. capping the card scanning range so it only covers the currently used part of oldgen; the same trick would be ineffective in CMS past the first oldgen collection, because CMS does not compact the oldgen...).

So for the vast-vast-vast majority of Java applications young gen GC cycles are very real, and for the smaller but still vast majority of applications old gen GC cycles are also very real...


ricardo.dan...@gmail.com

Aug 25, 2015, 8:37:44 PM
to mechanical-sympathy

> I agree except I don't think it's necessary to pool *everything*.  One needs to consider available young gen (budget), allocation rate, and time span between down times.

Totally agree, Vitaly. You must pool any allocation happening at a high rate inside the critical loop (reactor/selector thread), one per loop iteration, say. Otherwise you can be flexible, as you said, depending on the young gen available and the frequency of restarts.

Vitaly Davidovich

Aug 25, 2015, 9:35:08 PM
to mechanical-sympathy

It's harder to write zero-GC Java apps, sure, but there are advantages to doing that over "half-way" pseudo-no-GC implementations.  At the end of the day, if you want stable performance that you control (with respect to memory), you treat the memory manager with utmost care, whether it's GC or native.  The quality and implementation details of the manager dictate what you can get away with, but they don't change the big picture.

sent from my phone


Vitaly Davidovich

Aug 25, 2015, 10:18:05 PM
to mechanical-sympathy

Right.  Invariably, every app will have long-lived eternal data, a ton of short-lived per-request data, and then some that's medium-term (the worst kind).  Pooling the short-lived and medium-term objects is worthwhile; these drive your old gen sizing, and the short-lived pooled objects + slow-path allocations drive the young gen size.  One can certainly try to tune the GC, but having spent sufficient time doing this myself, and seeing the traffic on the hotspot-gc-use mailing list, made me turn away from that.  You'll ramp up quicker by not being as frugal/strict with allocations, but then you may spend an inordinate amount of time playing the "turn the knob and see what happens" game: looking at GC impl code, becoming intimately familiar with each GC's log info, implementation details, and tuning parameters (some not documented all that well), checking whether throughput degrades, etc.  It's much easier in some sense to put more control in your hands and then only decide on your young and old gen sizes.  From an operational and capacity planning standpoint, it's easier to reason about memory usage that's constant or near-constant rather than zig-zags.  GC is good in some ways, but it doesn't remove the need to be frugal with allocations (if latency/perf/jitter is a concern).

sent from my phone


Gil Tene

Aug 25, 2015, 10:35:34 PM
to mechanical-sympathy
Vitaly, that's a good summary of life before Zing. It's also a list of things that people who've moved to it no longer worry about. Plus pooling of course, which they also no longer have to worry about...

Vitaly Davidovich

Aug 25, 2015, 10:41:31 PM
to mechanical-sympathy

What is the throughput hit from C4?

sent from my phone


ricardo.dan...@gmail.com

unread,
Aug 25, 2015, 10:51:11 PM8/25/15
to mechanica...@googlegroups.com
> Vitaly, that's a good summary of life before Zing. It's also a list of things that people who've moved to it no longer worry about. Plus pooling of course, which they also no longer have to worry about...

Zing helps to solve the GC problem. But it is best if you don't have a GC problem to solve. I agree that Zing can help a lot as not everybody can focus on producing garbage-free code, but the less garbage you produce the better. The no-GC strategy is possible and it is the approach used by CoralBlocks and the top quant shops that use Java, have a real need for every microsecond and can't tolerate any GC latency.

Michael Barker

unread,
Aug 25, 2015, 11:38:24 PM8/25/15
to mechanica...@googlegroups.com

What is the throughput hit from C4?

From our (LMAX) experience, it varies (both above and below 0).  There is a small cost (single-digit %) as a result of the loaded value barrier, the size of which depends on the shape of your code.  For us, the difference in performance between Hotspot and Zing mostly centred on the sorts of code patterns the JIT had been optimised to handle, which made it hard to separate the cost of C4 from the rest of the VM.  E.g. in one area of our code Hotspot was better at BigDecimals and at eliminating costs associated with exceptions, such that we saw a ~20% slowdown.  But using Zing highlighted that bad code; we rewrote it to use scaled longs, removed the exceptions, and it then went even faster than the original code on Hotspot.  In another case we saw that Zing was much faster at JNI upcalls, which was a pretty crucial path from our (third-party) messaging bus to our application.

That small throughput cost was, for us, worthwhile to get improved predictability without investing significant time rewriting large portions of our gateway, exchange and broker code to be garbage-free.

Mike.

Kirk Pepperdine

unread,
Aug 26, 2015, 3:46:46 AM8/26/15
to mechanica...@googlegroups.com
Going for zero allocations or pooling everything is simply not a practical technique; in the vast majority of applications it is overly complex and simply unnecessary. Say what you want, but GC works for most applications and in fact takes a huge burden off of developers. It’s one of the few features that I believe makes the widespread development of large complex distributed applications possible. When I look at allocation rates it’s not for the purpose of avoiding GC. Allocation rates affect GC frequency, but they don’t affect GC pause times; data retention affects GC pause times, and data retention is a function of application behavior. Allocation rates are an indicator of memory efficiency, and I will work hard, but no harder than needed, to improve the memory efficiency of an application, as it is another form of strength reduction. But I will only take that so far, because beyond a point you start to see diminished returns for your efforts: going off-heap, caching everything, not being able to take advantage of 3rd-party frameworks that may be very difficult, or simply not practical, to write in the first place, let alone replicate with zero allocations.

I’m not saying that the collectors we have today are the be-all and end-all of what can be achieved. Zing clearly demonstrates that we can do better in OpenJDK and that there are still many problems with the current implementations. There are even changes, currently rejected by Oracle, that improve CMS throughput quite significantly. However, GC works, and trying to avoid it is like climbing Mount Everest: few of us will do it, because it’s very fucking hard to do and it’s cool to say afterwards “I climbed Everest” (for no practical reason). Focus on memory efficiency and let the rest of the stack do what it does.

Regards,
Kirk


Richard Warburton

unread,
Aug 26, 2015, 6:34:51 AM8/26/15
to mechanica...@googlegroups.com
Hi,

Even if you are using a "bad" third-party library that produces garbage, you can use instrumentation techniques to spot where the "garbage leak" is and fix the source code if it is available. Not many people are aware that it is possible to intercept any Java allocation done with the 'new' keyword anywhere in your code through the use of a Java agent (the -javaagent JVM command-line option). If you're interested in this technique you can check CoralBits' MemorySampler class.

You have to be exceedingly careful when taking this approach to ensure that you don't pass a reference to the object in question to another method, otherwise you're disabling escape analysis and misreporting the allocation rates of the application under measurement.

regards,

  Richard Warburton

Vitaly Davidovich

unread,
Aug 26, 2015, 6:44:26 AM8/26/15
to mechanical-sympathy

Mike,

Which Hotspot GC is the single digit % in reference to?

sent from my phone


Vitaly Davidovich

unread,
Aug 26, 2015, 6:45:56 AM8/26/15
to mechanical-sympathy

I hope nobody is *relying* on escape analysis to eliminate allocations in fast paths.

sent from my phone


Richard Warburton

unread,
Aug 26, 2015, 6:56:54 AM8/26/15
to mechanica...@googlegroups.com
Hi,

I hope nobody is *relying* on escape analysis to eliminate allocations in fast paths.

I agree you shouldn't rely on it for your critical path, but that's orthogonal to the issue of accurate measurement.

Vitaly Davidovich

unread,
Aug 26, 2015, 7:01:31 AM8/26/15
to mechanical-sympathy

Sure, but I think the original point you were replying to was about detecting allocations in an application that's either trying to avoid them entirely or, more likely, avoiding them in fast paths.  Accurate measurement isn't necessary in that case as it's more boolean "yes, allocates on fast path" or "no allocs on fast path".

sent from my phone


Vitaly Davidovich

unread,
Aug 26, 2015, 7:06:45 AM8/26/15
to mechanical-sympathy

And while we're discussing this, the problem isn't passing a ref to another method, but passing a ref to a method that is not inlined.  This is part of the "brittleness" of EA.

sent from my phone

Chris Newland

unread,
Aug 26, 2015, 7:40:20 AM8/26/15
to mechanical-sympathy
Exactly. If the method receiving the reference is inlined then what would have been an ArgEscape becomes a NoEscape and the allocation can be eliminated.
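As a concrete (hedged) sketch of that point — class and method names here are illustrative, and whether the allocation is actually eliminated depends on the JIT's inlining decisions, which is exactly the brittleness being discussed:

```java
// EscapeDemo: the Point passed to manhattan() is an ArgEscape at the bytecode
// level; once manhattan() is inlined it becomes a NoEscape, and C2 can
// scalar-replace it, eliminating the heap allocation entirely.
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Tiny and hot: a near-certain inlining candidate.
    static int manhattan(Point p) { return p.x + p.y; }

    static long sum(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1); // candidate for elimination
            s += manhattan(p);             // ref is passed, but callee inlines away
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000));
    }
}
```

Running something like this with LogCompilation enabled should show eliminate_allocation entries for the Point type once the loop is compiled; break the inlining (e.g. with -XX:MaxInlineSize=0) and the entries disappear.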

FYI eliminated heap allocs are reported by LogCompilation:

-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation

and look for output like:

<eliminate_allocation type="820">
  <jvms method="813" bci="47"/>
</eliminate_allocation>
<eliminate_allocation type="816">
  <jvms method="813" bci="35"/>
</eliminate_allocation>


Cross-reference the BCIs with the disassembled bytecode from javap, or use JITWatch to link it back to your source code.

Regards,

Chris
@chriswhocodes

Richard Warburton

unread,
Aug 26, 2015, 8:07:47 AM8/26/15
to mechanica...@googlegroups.com
Hi Vitaly,

And while we're discussing this, the problem isn't passing a ref to another method but passing a ref to a method  not inlined.  This is part of the "brittleness" of EA.

I'm aware of that, but thanks for raising the issue. The reason I didn't mention this before is that, from the memory profiler's point of view, it won't know at instrumentation time whether the methods the java agent instruments will get inlined. Or at least I can't see any way to do so that's not even more brittle than EA. If I'm missing something then I'm happy to be corrected on the matter.

So in practice you just have to avoid passing a reference of the object under measurement to any other method when instrumenting.

Vitaly Davidovich

unread,
Aug 26, 2015, 8:16:05 AM8/26/15
to mechanical-sympathy
I'm aware of that, but thanks for raising the issue. The reason I didn't mention this before is that from the memory profiler's point of view it won't know whether the methods it instruments in get inlined at the time that a java agent instruments. Or at least I can't see any way to do so that's not even more brittle than EA. If I'm missing something then I'm happy to be corrected on the matter.

Yes, that's true.  Your original email said this disables escape analysis, but I see what you meant now.


Kirk Pepperdine

unread,
Aug 26, 2015, 10:00:35 AM8/26/15
to mechanica...@googlegroups.com
I think you’d want the instrumentation code to be inlined quickly so that you’d get a no-escape. But yes, you do need to be very careful.

— Kirk


ricardo.dan...@gmail.com

unread,
Aug 26, 2015, 10:09:17 AM8/26/15
to mechanica...@googlegroups.com

> Accurate measurement isn't necessary in that case as it's more boolean "yes, allocates on fast path" or "no allocs on fast path".

Yes, this only happens during the debugging phase, when the -javaagent is enabled. It is also a matter of "where is it?": by investigating the call stack you can log not just "yes, it allocates memory on the fast path" but also "yes, it allocates memory on the fast path, in that class and on that line".

Michael Barker

unread,
Aug 26, 2015, 7:57:22 PM8/26/15
to mechanica...@googlegroups.com

Which Hotspot GC is the single digit % in reference to?

We compared with ParNew + iCMS.  Although the difference probably doesn't have much to do with the actual GC and more to do with how the VM handles dereferencing.

As a bit more background, the single-digit % result was from our only legitimate case of running our system flat out, which is a test of how long it takes to restore the system after a crash.  It would read in a day's worth of journal files to recover its state.  We measured the total time taken to complete the restore and there was very little difference.

Mike.

Vitaly Davidovich

unread,
Aug 26, 2015, 10:00:55 PM8/26/15
to mechanical-sympathy

Thanks Mike.  I'd be interested in a throughput comparison between Parallel GC and Zing; reason being is parallel will have the cheapest write barrier in hotspot, and I'm curious what tax the load and store barriers in Zing impose.

Also, how many cpus did you allocate to Zing and Hotspot GCs in your experiment?

sent from my phone


Michael Barker

unread,
Aug 26, 2015, 10:45:09 PM8/26/15
to mechanica...@googlegroups.com
Hi Vitaly,

I'm mostly working from memory at the moment.  The test was done a few years ago, when we were deciding whether or not to move to Zing, so the test harness has probably gotten out of date and been deleted.  Also worth mentioning that Azul continues to optimise the LVB, so it is probably faster now than when I tested.

As for CPUs: in Zing we allocated 2 threads for new collections and 2 for old, to prevent them from contending with the application for CPU resource (to this day, that is the only tuning option we've applied to the GC).  With Hotspot, I think we just used the defaults.  Worth noting that our Hotspot setup was tuned such that we wouldn't run into old GCs, so most of the time C4 was actually competing against ParNew.  Given that ParNew is a STW collector, the default CPU count (#threads == #cores IIRC) was probably the most appropriate thing.

Mike.

Vitaly Davidovich

unread,
Aug 26, 2015, 10:52:34 PM8/26/15
to mechanical-sympathy

From memory is fine :).

Was your app promoting anything of consequence even though you weren't causing full GCs? Was the concurrent GC running and just keeping up? I'm not entirely certain what exactly you meant.  CMS promotion can be more expensive than parallel old because it's free-list based - I don't know how Zing manages old and young regions (what *does* it do?)

sent from my phone

Michael Barker

unread,
Aug 26, 2015, 11:18:11 PM8/26/15
to mechanica...@googlegroups.com
Our old-gen usage after initial start-up is very flat (in the systems we really care about).  When we first wrote the system, our general approach was: don't allocate where you don't need to; if you must, either die young or live forever.  With ParNew and a smallish eden (16-32MB) we would have short (<10ms) but frequent (2-3 per second) pauses.  We would run for ~24 hours without an old GC, then force a full GC during our overnight close window to prevent fragmentation and avoid the dreaded "promotion failed" GC monster.  So with Hotspot the CMS old collector only ran during the overnight manual trigger.  With Zing all of the GC is concurrent, but our allocation rates are fairly low, so it tends only to be running 1-2% of the time, i.e. it is mostly idle.
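A HotSpot setup along those lines might look roughly like the following (a hedged sketch: heap sizes, the log file and the jar name are illustrative assumptions, not the actual command line):

```shell
# -XX:+UseConcMarkSweepGC selects the ParNew young collector on these JDKs;
# -XX:+CMSIncrementalMode is the "iCMS" variant. The small young gen (-Xmn)
# gives the short-but-frequent minor pauses described above.
java -Xms4g -Xmx4g -Xmn32m \
     -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
     -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
     -jar exchange.jar

# Overnight close window: force a full collection to compact oldgen and
# head off the "promotion failed" failure mode.
jcmd <pid> GC.run
```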

As to how Zing manages old and new: it runs the same GC algorithm (C4) in both regions and applies the generational GC model as an optimisation.  I'm not sure how promotion is handled - I think I can guess, but Gil can probably give much more accurate information.

Mike.

Gil Tene

unread,
Aug 26, 2015, 11:44:19 PM8/26/15
to mechanica...@googlegroups.com


On Wednesday, August 26, 2015 at 7:52:34 PM UTC-7, Vitaly Davidovich wrote:

From memory is fine :).

Was your app promoting anything of consequence even though you weren't causing full gc? Was concurrent GC running and just keeping up? I'm not entirely certain what exactly you meant.  CMS promotion can be more expensive than parallel old because it's free list based - I don't know how Zing manages old and young regions (what *does* it do?)


Zing uses a pure mark/compact collector for both young and old generations, so promotion is done into nice contiguous allocation blocks (normally 2MB blocks). No free lists involved. In that sense it's probably similar to ParallelGC in the total amount of promotion work per promoted unit.

A key difference in promotion logic is that Zing does not use promotion thresholds counted in terms of # of newgen cycles. Instead, its promotion threshold is time-based. [By default] objects younger than 2 seconds are new, and ones that are older than 2 seconds are old. While this age boundary is configurable, I don't know of anyone who has had to configure it in 2+ years...

This simpler time-based age decision is possible because Zing does not split the heap between the new and old generations. Newgen gets to use the entire heap, and as a result newgen frequencies are usually significantly reduced compared to STW newgens, especially those that try to cap their pause times by keeping the Eden size to a few tens or hundreds of MB. This translates into even more things dying in newgen and avoiding both promotion and copying work, which in turn translates to less copying and less promotion work than ParallelGC would incur on a similar workload, along with a reduced rate of oldgen GCs as well.

With all that said, we usually don't bother much with characterizing the efficiency side of things, or looking too much at promotion rates, because most production configs have the GC active for only 1-5% of the overall time... I don't think we've ever had anyone (using Zing) try to tune for a change in promotion rate or analyze it too closely, because it just doesn't matter. Since each extra empty GB adds efficiency to the collectors, most people just add enough to keep the collectors (both newgen and oldgen) relatively idle and stop where they feel comfortable with the % of time that GC is active. This is usually driven by a wish for "GC headroom" rather than CPU consumption concerns, since (with Zing) the collectors would generally need to be active 100% of the time before application delays start popping up.
 


Vitaly Davidovich

unread,
Aug 27, 2015, 9:15:53 AM8/27/15
to mechanical-sympathy
Thanks Gil (and Mike).

Gil, so what are the costs of the LVB and write barriers in Zing? What does an LVB look like in pseudo-assembly? Are write barriers using card marking? Are they susceptible to false sharing (like Hotspot) on the card? Are there cpu fences after the write barrier (given that the GC is concurrent and not STW)?


Gil Tene

unread,
Aug 27, 2015, 12:11:34 PM8/27/15
to mechanica...@googlegroups.com
Some details below.


On Thursday, August 27, 2015 at 6:15:53 AM UTC-7, Vitaly Davidovich wrote:
Thanks Gil (and Mike).

Gil, so what are the costs of the LVB and write barriers in Zing? What does an LVB look like in pseudo-assembly?

As you can imagine, we've gone through many optimizations of the LVB, and of what its fast-path test looks like, over the years. Much of that has to do with delicately designing collector state representations and phase transitions to make the fast-path LVB test as cheap as possible. We've probably gone through 15+ implementations of the same logical LVB over the past decade.

While the logical LVB test always enforces the LVB invariants that the C4 paper describes, in current x86 implementations we've managed to devolve the fast path to a simple TEST and JMP combination. Depending on register allocation decisions made by the JIT, the test is either reg vs. reg or reg vs. a thread-local memory location (which is hot and always L1-hitting). This translates to a single u-op (in the reg vs. reg test) or two u-ops (in the reg vs. mem test), and a jump that is (literally) 99.9999999% predictable and (in the reg vs. mem case) L1-local. If the fast path triggers (that 0.000000001%-of-the-time thing), the slow path is still "fast" but has some real work to do, depending on the triggering conditions and GC phase (it actually has multiple "fast slow path" levels before devolving to the slowest thing).

Summary: The LVB fast path is a single ultimately-predictable branch on a test that never incurs a cache miss.

As far as impact goes, LVB "cost" varies with the program's IPC (which is orthogonal to the LVB). The two instructions, the resulting 1 or 2 u-ops, and the branch certainly consume processor resources. The cost of consuming those resources "grows" at high IPCs, when the processor would otherwise be able to keep its pipeline and execution units entirely full, and "shrinks" at low IPCs, e.g. where cache misses come into play. It is basically undetectable in pointer-chasing situations, and can show a handful of % in u-benchmarks with tight numeric L1-hitting loops. Most applications fall somewhere in the middle.
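Based on that description, the reg-vs-reg form of the fast path might lower to something like this pseudo-assembly (an illustrative sketch only - the register choices and the exact meaning of the mask are assumptions, not Zing's actual emitted code):

```
mov   rax, [rbx + 0x10]   ; load the reference to be LVB-checked
test  rax, r15            ; single u-op: check reference bits against the
                          ; collector-state mask the JIT keeps in a register
jnz   lvb_slow_path       ; ~never taken, so effectively perfectly predicted
; ... fall through and use the reference in rax ...
```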
 
Are write barriers using card marking?

For clarity, I like to refer to these barriers as "reference store barriers" to avoid ambiguity in the term "write barrier". They are barriers that are executed when a reference is stored to a memory location. They exist in all generational collectors, but some collectors also need them for non-generational purposes (e.g. G1 uses them to enforce SATB invariants and to track cross-region remembered sets, and includes tests both before and after the actual reference store). Zing's reference store barriers are there purely for generational remembered-set tracking, and apply ahead of the reference store itself.

Reference-store barriers in Zing/C4 do card mark, but the card table is a bit different: due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (it only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty), as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing; or -XX:+UseCondCardMark, which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores; or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.

While it's hard to compare, Zing's reference-store barrier cost for single-threaded execution probably falls somewhere between the "blind" and -XX:+UseCondCardMark HotSpot barrier costs, while its multi-threaded execution cost is probably somewhat lower than both (if/when false sharing in the card table were an issue for HotSpot). Zing's memory-bandwidth cost is significantly lower (up to 2x less write bandwidth to memory in streaming cases), but modern x86 sockets tend to have memory bandwidth to spare, so this may not matter as much. All of these statements become very application-behavior-dependent, though...

The reason Zing's reference-store barrier can be faster in the presence of false sharing is that the fast-path generational-test condition (which HotSpot doesn't do in the -XX:+UseCondCardMark test, AFAIK) reduces the cost of conditional testing on the fast path because it involves no memory access (it is based purely on the value of the reference being stored and the target address it is being stored to). The memory access, and the "is it already dirty?" test that goes with it, are only needed if the store creates an oldgen->newgen reference, which is dynamically rare - typically down to a handful of % (or less) of reference stores.

Note that due to the precise nature (1 bit per heap word) of the card table, actual dirtying stores to the card table are atomic (an atomic OR), but these stores are dynamically much rarer than the cheaper blind store in HotSpot.
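The double-conditional scheme described above can be sketched as follows (an illustrative simulation, not Zing's implementation; the word-indexed "address" model and the oldgen boundary are assumptions made for the sketch):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Simulation of a precise (1 bit per heap word), double-conditional card mark.
// Word indices stand in for addresses; words >= oldGenBoundary are "oldgen".
public class PreciseCardTable {
    private final AtomicLongArray bits;   // 1 bit per heap word
    private final int oldGenBoundary;

    public PreciseCardTable(int heapWords, int oldGenBoundary) {
        this.bits = new AtomicLongArray((heapWords + 63) / 64);
        this.oldGenBoundary = oldGenBoundary;
    }

    /** Barrier run before storing the reference 'refWord' into the field at 'fieldWord'. */
    public void refStoreBarrier(int fieldWord, int refWord) {
        // Condition 1 (no memory access): only oldgen fields receiving newgen refs matter.
        if (fieldWord >= oldGenBoundary && refWord < oldGenBoundary) {
            int idx = fieldWord >>> 6;
            long mask = 1L << (fieldWord & 63);
            // Condition 2: skip the store entirely if the bit is already dirty.
            if ((bits.get(idx) & mask) == 0) {
                // Atomic OR: neighbouring bits of the same long belong to other heap words.
                bits.getAndAccumulate(idx, mask, (cur, m) -> cur | m);
            }
        }
    }

    public boolean isDirty(int fieldWord) {
        return (bits.get(fieldWord >>> 6) & (1L << (fieldWord & 63))) != 0;
    }
}
```

The point of condition 1 is that the common case touches no memory at all; the atomic OR is needed only because neighbouring bits of the same 64-bit table word cover other heap words.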
 
Are they susceptible to false sharing (like Hotspot) on the card?

As explained above, Zing's card marking is not susceptible to false-sharing contention. C4 was initially developed for machines with several hundred CPU cores, and for environments where both cache-coherency bandwidth and memory bandwidth could become the real bottleneck (even with 64 memory controllers humming in parallel), so we had to deal with that one very early on...
 
Are there cpu fences after the write barrier (given that the GC is concurrent and not STW)?

If the barrier chooses to dirty a card, there is a logical StoreStore fence between the barrier's card dirtying store operation and the actual reference store that follows it. What this translates to would depend on the CPU involved. On x86 it's a no-op.
 

sent from my phone

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


Vitaly Davidovich

unread,
Aug 27, 2015, 12:54:54 PM8/27/15
to mechanical-sympathy
Thanks for the details Gil.

The LVB lowering sounds roughly equivalent to doing an array range check on each memory reference (modulo an LVB possibly doing reg vs reg, whereas range check is always loading the length from memory, barring other optimizations in scope).  Is the LVB done on each access to a reference field or only the first one and then uses register? E.g.:

if (someObject.ref != null) { // LVB here, I assume
      System.out.println(someObject.ref); // is there LVB here or no?
      System.out.println(someObject.ref.hashCode()); // how about here
}

Basically, does the JIT common out the reads and the LVB? I assume so(!) but wanted to double check.  

Reference-store barriers in Zing/C4 do card mark. But the card table is a bit different: Due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty) as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing, or -XX:+UseCondCardMark which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores, or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.

So a 32GB heap will use a 256MB card table (assuming min object is 16 bytes)? Hotspot would use 64MB.  What's the reason for such precision? Also, I believe Hotspot doesn't check whether a store is from old gen object or not for performance reasons -- you did not find this to be a problem in Zing? Or you feel you make up for it by reducing writeback traffic?

Note that due to the precise nature (1 bit per heap word) of the card table, actual dirtying stores to the card table are atomic (an atomic OR), but that these stores are dynamically much more rare than the cheaper blind store in HotSpot.

Why much more rare? This implies there aren't many oldgen->younggen references, but why is that rare? I'd expect this to depend on application, and not some generalized thing.

As explained above, Zing's card marking is not susceptible to false-sharing contention. C4 was initially developed for machines with several hundred CPU cores, and for environments where both cache coherency bandwidth and memory bandwidth could become the real bottleneck (even with 64 memory controllers humming in parallel), so we had to deal with that one very early on...

I'm not sure I see how the precise card table avoids false sharing.  Or you mean due to reduced dirtying to begin with? Or because 64 byte cacheline of the card table "addresses" fewer objects in your implementation?

If the barrier chooses to dirty a card, there is a logical StoreStore fence between the barrier's card dirtying store operation and the actual reference store that follows it. What this translates to would depend on the CPU involved. On x86 it's a no-op.

Ok, but the dirtying uses an atomic instruction though, right?


On Thu, Aug 27, 2015 at 12:11 PM, Gil Tene <g...@azulsystems.com> wrote:
Some details below.

On Thursday, August 27, 2015 at 6:15:53 AM UTC-7, Vitaly Davidovich wrote:
Thanks Gil (and Mike).

Gil, so what are the costs of the LVB and write barriers in Zing? What does an LVB look like in pseudo-assembly?

As you can imagine, we've gone through many optimizations of the LVB and what its fast path test looks like over the years. Much of that has to do with delicately designing collector state representations and phase transitions to make the fast path LVB test as cheap as possible. We've probably gone through 15+ implementations of the same logical LVB over the past decade.

While the logical LVB test always enforces the LVB invariants that the C4 paper describes, in current x86 implementations we've managed to devolve the fast path to a simple TEST and JMP combination. Depending on register allocation decisions made by the JIT, the test is either reg vs. reg or reg vs. thread-local memory location (which is hot and always L1-hitting). This translates to a single u-op (in the reg vs. reg test) or two u-ops (in the reg vs. mem test), a jump that is (literally) 99.9999999% predictable and (in the reg vs. mem case) L1-local. If the fast path triggers (that 0.000000001%-of-the-time thing), the slow path is still "fast" but has some real work to do depending on the triggering conditions and GC phase (it actually has multiple "fast slow path" levels before devolving to the slowest thing).
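As a rough sketch of what that fast path amounts to (all names, the bit position, and the slow-path behavior below are invented for illustration; this is not Zing's actual code or reference layout):

```java
// Hypothetical sketch of an LVB fast path, NOT Zing's actual implementation.
// The metadata bit position and the slow-path behavior are made up.
public class LvbSketch {
    // Assumed metadata bit embedded in reference values (position is invented).
    static final long NMT_BIT = 1L << 62;

    // The "expected" value the loaded reference is tested against. In a real
    // JVM this is thread-local and register-resident, and changes only once
    // per GC phase shift (i.e. once every many billions of instructions).
    static long expectedNmt = 0L;

    // Fast path: one compare plus one (almost) never-taken branch.
    static long lvb(long loadedRef) {
        if ((loadedRef & NMT_BIT) != expectedNmt) {  // TEST + JMP
            loadedRef = slowPath(loadedRef);         // rare: heal the reference
        }
        return loadedRef;
    }

    // Stand-in slow path: a real one would mark through and/or remap, then
    // return a reference value that no longer triggers the fast-path test.
    static long slowPath(long ref) {
        return (ref & ~NMT_BIT) | expectedNmt;
    }
}
```

The key property the sketch tries to convey: the comparison value is loop-invariant for billions of instructions, so the branch is essentially perfectly predicted and the test never misses cache.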

Summary: The LVB fast path is a single ultimately-predictable branch on a test that never incurs a cache miss.

As far as impact goes, LVB "cost" varies with the program's (orthogonal-to-LVB) IPC. The two instructions and resulting 1 or 2 u-ops (and branch) certainly consume processor resources. The cost of consuming these resources "grows" at high IPCs, when the processor would otherwise be able to keep its pipeline and execution units entirely full, and "shrinks" at low IPCs, e.g. where cache misses come into play. It is basically undetectable in pointer-chasing situations, and can show a handful of % in u-benchmarks with tight numeric L1-hitting loops. Most applications fall somewhere in the middle.
 
Are write barriers using card marking?

For clarity, I like to refer to these barriers as "reference store barriers" to avoid ambiguity in the term "write barrier". They are barriers that are executed when a reference is stored to a memory location. They exist in all generational collectors, but some collectors also need them for non-generational purposes (e.g. G1 uses them to enforce SATB invariants and track cross-region remembered sets, and includes tests both before and after the actual reference store). Zing's reference store barriers are there purely for generational remembered set tracking, and apply ahead of the reference store itself.

Reference-store barriers in Zing/C4 do card mark. But the card table is a bit different: Due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty) as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing, or -XX:+UseCondCardMark which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores, or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.

While it's hard to compare, Zing's reference-store barrier cost for single-threaded execution probably falls somewhere between the "blind" and -XX:+UseCondCardMark HotSpot barrier costs, while its multi-threaded execution cost is probably somewhat lower than both (if/when false sharing in the card table were an issue for HotSpot). Zing's memory bandwidth cost is significantly lower (up to 2x less write bandwidth to memory in streaming cases), but modern x86 sockets tend to have memory bandwidth to spare, so this may not matter as much. All of these statements become very application-behavior-dependent though...

The reason Zing's reference-store barrier can be faster in the presence of false sharing is that the fast path generational-test condition (which HotSpot doesn't do in the -XX:+UseCondCardMark test AFAIK) reduces the cost of conditional testing on the fast path because it involves no memory access (it is based purely on the values of the reference being stored and the target address it is being stored to). The memory access and the "is it already dirty?" test that goes with it are only needed if the store creates an oldgen->newgen reference, which is dynamically rare, typically down to a handful of % (or less) of reference stores.


Gil Tene

unread,
Aug 27, 2015, 8:02:41 PM8/27/15
to mechanica...@googlegroups.com


On Thursday, August 27, 2015 at 9:54:54 AM UTC-7, Vitaly Davidovich wrote:
Thanks for the details Gil.

The LVB lowering sounds roughly equivalent to doing an array range check on each memory reference (modulo an LVB possibly doing reg vs reg, whereas range check is always loading the length from memory, barring other optimizations in scope). 

Sort of, but the value that the loaded reference value is being tested against is "extremely stable". It does not depend on anything except for the GC phase, so it changes only once per GC phase shift. That's why it can live for a long time in registers (and in the L1 cache): it usually gets changed only once every many billions of instructions.
 
Is the LVB done on each access to a reference field or only the first one and then uses register? E.g.:

if (someObject.ref != null) { // LVB here, I assume
      System.out.println(someObject.ref); // is there LVB here or no?
      System.out.println(someObject.ref.hashCode()); // how about here
}

The LVB sits between loading a reference value from memory and the first use of that reference value. Hence the name Loaded Value Barrier... You can think of it more as part of the load of a reference than as part of its use, but it can be decoupled and scheduled anywhere between the load and the first use (subsequent uses of the same loaded value do not need additional checks, as a triggered LVB will "fix" the value to a non-triggering one). Certain optimizations can also bypass the LVB in some uses of the value. E.g. depending on LVB and GC implementation, null checks don't need to have the checked value LVB'ed before the check. Some comparison uses (== and !=) also don't require an LVB ahead of use because of strictly maintained GC invariants.

So for the above code, it's actually:

if (someObject.ref != null) { // No LVB here, null checks don't require an LVB
      System.out.println(someObject.ref); // LVB here (between reading someObject.ref and using it)
      System.out.println(someObject.ref.hashCode()); // No LVB here (someObject.ref is already LVB'ed).
}
 
Basically, does the JIT common out the reads and the LVB? I assume so(!) but wanted to double check.

Sort of yes. It's not really a "common out" optimization, since there is only one LVB per reference getfield(). If the compiler is able to avoid re-loading the same reference from memory multiple times (e.g. hoist it out of a loop), LVBs just go away with the loads.
 
Reference-store barriers in Zing/C4 do card mark. But the card table is a bit different: Due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty) as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing, or -XX:+UseCondCardMark which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores, or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.

So a 32GB heap will use a 256MB card table (assuming min object is 16 bytes)? Hotspot would use 64MB.

It's actually 1 bit per word, so for a 32GB heap Zing would use 512MB of card table material (1.563%), compared to HotSpot's 64MB (0.2%). In Zing, we account for card tables as part of the heap (i.e. they come out of Xmx), so you can think of it as "Zing objects have a 1.36% larger footprint than HotSpot objects." Shrug. [It's actually more than that, because we also maintain a separate liveness bit per word, but so does HotSpot (in some collectors).]
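For anyone checking the arithmetic behind those figures (assuming 8-byte heap words):

```java
// Card table sizing for a 32GB heap, assuming 8-byte heap words.
public class CardTableSizes {
    public static void main(String[] args) {
        long heapBytes = 32L << 30;            // 32 GB heap
        long zingBytes = heapBytes / 8 / 8;    // 1 bit per 8-byte word
        long hotspotBytes = heapBytes / 512;   // 1 byte per 512 heap bytes
        System.out.println(zingBytes >> 20);   // 512 (MB)
        System.out.println(hotspotBytes >> 20); // 64 (MB)
        System.out.printf("%.3f%%%n", 100.0 * zingBytes / heapBytes);    // 1.563%
        System.out.printf("%.3f%%%n", 100.0 * hotspotBytes / heapBytes); // 0.195%
    }
}
```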
 
What's the reason for such precision?

There are multiple benefits in both simplicity and avoiding concurrency issues. Here are two examples:

- Precise card tables make for very simple card scanning, since (unlike imprecise variants) you don't need to go back to find the beginning of the object that spans your card and scan forwards from there, and you therefore don't need to maintain an object start array to be able to support that functionality. With the object start array gone, lots of other hairy things go too.

- A "bigger picture" benefit is that card scanning never needs to linearly scan the heap looking for the "next object start", which results in the subtle but very powerful quality: Newgen does not need to be able to find out the size of any non-live object. This quality allows us to perform concurrent class unloading without disallowing newgen collections in the middle (the PermGen in Zing is simply collected concurrently as part of OldGen, and there are no more PermGen pauses).
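The scanning simplification can be pictured with a toy loop (my illustration, not Zing's code): with a precise table, each dirty bit identifies exactly one heap word to examine, so scanning is just bit iteration, with no "find the start of the spanning object" step.

```java
// Toy sketch: scanning a precise card table (1 bit per heap word).
// Each set bit maps directly to one heap word that may hold an interesting
// reference; no object-start array or scan-back is needed.
public class PreciseCardScan {
    static java.util.List<Long> dirtyWordIndexes(long[] cardTable) {
        java.util.List<Long> words = new java.util.ArrayList<>();
        for (int i = 0; i < cardTable.length; i++) {
            long bits = cardTable[i];
            while (bits != 0) {
                int bit = Long.numberOfTrailingZeros(bits);
                words.add((long) i * 64 + bit);  // heap word index to scan
                bits &= bits - 1;                // clear the lowest set bit
            }
        }
        return words;
    }
}
```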
 
Also, I believe Hotspot doesn't check whether a store is from old gen object or not for performance reasons -- you did not find this to be a problem in Zing? Or you feel you make up for it by reducing writeback traffic?

The write back traffic reduction is pretty massive, but as I noted, it only matters when memory bandwidth is an issue, and that's rare in today's x86 processors.

HotSpot's blind stores are certainly efficient, as long as they do not run into false sharing issues. However, since HotSpot uses a 1-byte card for every 512 bytes of heap, a single cache line (64 bytes on x86) contains card table material for a 32KB region of the heap. And in multithreaded workloads any two stores from separate threads into the same 32KB region will result in a card table cache line collision. That's why HotSpot added the -XX:+UseCondCardMark option, which is often used by people tuning multi-threaded workloads.
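To put numbers on that false-sharing window (straightforward arithmetic from the figures above; the Zing number simply follows from its 1-bit-per-8-byte-word encoding):

```java
// Heap region "covered" by one 64-byte card-table cache line under each scheme.
public class CardLineCoverage {
    public static void main(String[] args) {
        int cacheLine = 64;                     // bytes of card table per cache line
        long hotspot = (long) cacheLine * 512;  // 1 byte covers 512 heap bytes -> 32 KB
        long zing = (long) cacheLine * 8 * 8;   // 64 bytes = 512 bits, 1 bit per 8-byte word -> 4 KB
        System.out.println(hotspot / 1024 + " KB");  // 32 KB
        System.out.println(zing / 1024 + " KB");     // 4 KB
    }
}
```

So with blind stores, any two threads storing references anywhere within the same 32KB of heap will collide on a card table cache line; conditional dirtying (in either JVM) is what removes the collision, not the table's precision.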

Once you add a conditional check ahead of the card table store, you start looking at its cost. And a generational check (only trigger on creating an oldgen->newgen reference in the heap) is cheaper to perform than an "is card already dirty?" check, for the simple reason that the generational check requires no memory access (while the other check includes a potential cache miss). We find that the generational filter ahead of the "is dirty?" check results in an overall win. Of course, we were careful to make our generational check cheap (cmp, shift, jmp) by carefully applying invariants to the possible relationships between oldgen and newgen reference values.
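A hedged Java sketch of that double-conditional barrier (the NEWGEN_BASE address-range test and the card-table layout below are invented stand-ins; the real generational check is the cmp/shift/jmp over carefully arranged reference values described above):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a double-conditional card mark barrier. The
// NEWGEN_BASE boundary and table layout are invented for illustration only.
public class RefStoreBarrier {
    static final long NEWGEN_BASE = 1L << 40;  // assumed: newgen lives above this address
    final AtomicLongArray cardTable;           // 1 bit per 8-byte heap word

    RefStoreBarrier(long heapBytes) {
        cardTable = new AtomicLongArray((int) (heapBytes / 8 / 64));
    }

    void refStore(long storeAddr, long newRef) {
        // First condition: generational filter, no memory access at all.
        if (storeAddr < NEWGEN_BASE && newRef >= NEWGEN_BASE) {
            long word = storeAddr >>> 3;       // heap word index
            int idx = (int) (word >>> 6);      // which 64-bit card table word
            long bit = 1L << (word & 63);
            // Second condition: only now touch the card table, and skip
            // the store entirely if the card is already dirty.
            if ((cardTable.get(idx) & bit) == 0) {
                cardTable.getAndAccumulate(idx, bit, (a, b) -> a | b);  // atomic OR
            }
        }
        // ... the actual reference store would follow here ...
    }
}
```

The ordering of the two conditions is the point: the cheap register-only generational test filters out the ~95%+ of stores that can never need a card mark, so the potentially cache-missing "is dirty?" load runs only on the rare remainder.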
 
Note that due to the precise nature (1 bit per heap word) of the card table, actual dirtying stores to the card table are atomic (an atomic OR), but that these stores are dynamically much more rare than the cheaper blind store in HotSpot.

Why much more rare? This implies there aren't many oldgen->younggen references, but why is that rare? I'd expect this to depend on application, and not some generalized thing.

This rarity (of dynamically creating oldgen->newgen references in the heap) can't be proven, but it tends to apply wherever the weak generational hypothesis applies, or at least wherever generational collection is useful (which tends to be nearly universal in Java, usually with a 10-20x efficiency benefit). Think of it this way: the efficiency of generational collection relies not only on the quality of "most objects die young". It also relies on the quality of "the remembered set tracking potential oldgen->newgen references is small", which equates to "the card table is very sparse"... If creation of oldgen->newgen references were dynamically common, the card table would end up being very dirty, and the remembered set would be very large. This would in turn lead to very long card table scans (due to the dramatically higher work needed for a dirty card compared to a clean one). And we just don't see those in the wild much, regardless of JVM and application...

We don't collect numbers for this regularly, and I'm not making claims about ratios being the same across applications, but when we modeled this several years ago across several apps, we were seeing a 20:1 ratio between reference stores in general and reference stores that were creating an oldgen->newgen reference in the heap.

As explained above, Zing's card marking is not susceptible to false-sharing contention. C4 was initially developed for machines with several hundred CPU cores, and for environments where both cache coherency bandwidth and memory bandwidth could become the real bottleneck (even with 64 memory controllers humming in parallel), so we had to deal with that one very early on...

I'm not sure I see how the precise card table avoids false sharing.  Or you mean due to reduced dirtying to begin with? Or because 64 byte cacheline of the card table "addresses" fewer objects in your implementation?

It's not the precise card table that avoids false sharing. It's the fact that we don't dirty already-dirty cards (precise or not wouldn't matter for this). Same thing that HotSpot's -XX:+UseCondCardMark does, but at a somewhat lower dynamic cost due to the generational filter that precedes it.
 
If the barrier chooses to dirty a card, there is a logical StoreStore fence between the barrier's card dirtying store operation and the actual reference store that follows it. What this translates to would depend on the CPU involved. On x86 it's a no-op.

Ok, but the dirtying uses an atomic instruction though, right?

Yes. Actual dirtying is an atomic OR.

And after chatting with our compiler guys, I need to correct the fence statement above: There is no StoreStore fence involved (or none any more, anyway). The only instruction scheduling requirement is for the two stores (the reference store and the potential card table dirtying store associated with it) to not have a safepoint taken between them. So no CPU fence of any kind; only a compiler scheduling requirement that has to deal with safepoint boundaries in the code. [This is pretty much the same requirement that HotSpot has to maintain for its card table store.]
 



Jean-Philippe BEMPEL

unread,
Aug 28, 2015, 2:43:21 AM8/28/15
to mechanical-sympathy
On Friday, August 28, 2015 at 2:02:41 AM UTC+2, Gil Tene wrote:


So for the above code, it's actually:

if (someObject.ref != null) { // No LVB here, null checks don't require an LVB
      System.out.println(someObject.ref); // LVB here (between reading someObject.ref and using it)
      System.out.println(someObject.ref.hashCode()); // No LVB here (someObject.ref is already LVB'ed).
}
 
Gil,

I surely missed something, but as C4 is concurrent with the application, could a relocation happen between the 2 ref accesses, thereby breaking the C4 invariants?

Thanks

Gil Tene

unread,
Aug 28, 2015, 11:54:45 AM8/28/15
to mechanical-sympathy
That would be considered "a bad thing" (tm).

The LVB  and the collector together enforce a very strong invariant that can basically be described as "it is impossible for the mutator to observe a reference to a from-space object, and it is impossible for the mutator to observe a reference that has not yet been guaranteed to be marked through (during marking)". A useful side effect of this invariant (and another invariant on its own) is that no such references (to from space objects or to not-yet-marked-through references) can ever be propagated to the heap, because to be propagated to memory they would first need to be observed...

A summarized notion of what from space and to space are (which pages are in which), and what "already marked through" means, is used for the fast path of LVB testing. This summarized notion can only change (from the point of view of each thread) at a safepoint in the thread's code. If it does change, the thread will repair any references held in the frame (as part of safepoint handling) before it returns to normal execution. (Note that "at a safepoint in the thread's code" doesn't have to mean "at a global safepoint" here. Zing has the ability to separately safepoint individual threads.)

So to answer the actual question above ("could a relocation happen between the 2 ref accesses..."): Relocations (of any portion of the heap) basically start by changing the notion of what pages (aka regions in other collectors) constitute from space and to space. This change is only visible to threads at safepoint locations in their code, and those safepoints (when taken) make sure existing references already in the frames are also corrected, maintaining the invariant mentioned above. So if a relocation starts between two reference accesses, the reference state acquired before the relocation started would be fixed to be consistent with new LVB expectations before the second access occurs.

Bottom line: any reference sitting in a register or on the stack that has been LVB'ed once does not have to be LVB'ed again. 
 

Thanks

Jean-Philippe BEMPEL

unread,
Aug 31, 2015, 3:34:36 AM8/31/15
to mechanical-sympathy
Gil,

Thanks again for the gory details, very much appreciated!