Hello, I have a couple of questions/doubts regarding the latency impact of the GC promotion process from the new generation to the old generation.
1. Is there any way to log/monitor when this is happening? Does -verbose:gc help in any way here?
% java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx1g -jar HeapFragger.jar
1.365: [GC (Allocation Failure) [PSYoungGen: 65536K->10741K(76288K)] 65536K->46194K(251392K), 0.0315796 secs] [Times: user=0.18 sys=0.04, real=0.03 secs]
3.911: [GC (Allocation Failure) [PSYoungGen: 76277K->10747K(141824K)] 111730K->46968K(316928K), 0.0123873 secs] [Times: user=0.07 sys=0.01, real=0.02 secs]
8.965: [GC (Allocation Failure) [PSYoungGen: 141819K->10746K(141824K)] 178040K->48336K(316928K), 0.0123209 secs] [Times: user=0.08 sys=0.01, real=0.01 secs]
13.945: [GC (Allocation Failure) [PSYoungGen: 141818K->10747K(141824K)] 179408K->49712K(316928K), 0.0104540 secs] [Times: user=0.08 sys=0.01, real=0.01 secs]
18.922: [GC (Allocation Failure) [PSYoungGen: 141819K->10723K(137728K)] 180784K->51128K(312832K), 0.0107468 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
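For what it's worth, the bytes promoted per young collection can be read straight off -XX:+PrintGCDetails lines like the above: old gen occupancy is the total heap figure minus the young gen figure, and its growth across the pause is what got promoted. A minimal sketch of that arithmetic (class and method names are mine, not from any tool):

```java
// Hedged sketch: derive KB promoted per young GC from -XX:+PrintGCDetails
// lines of the form  [PSYoungGen: Yb->Ya(Yc)] Tb->Ta(Tc)
// Promoted = (Ta - Ya) - (Tb - Yb), i.e. the growth of old gen across the pause.
public class PromotionFromGcLog {
    public static long promotedK(long youngBeforeK, long youngAfterK,
                                 long totalBeforeK, long totalAfterK) {
        long oldBeforeK = totalBeforeK - youngBeforeK;
        long oldAfterK  = totalAfterK - youngAfterK;
        return oldAfterK - oldBeforeK;
    }

    public static void main(String[] args) {
        // Numbers taken from the second log line above.
        System.out.println(promotedK(76277, 10747, 111730, 46968) + "K promoted");
    }
}
```

Applied to the second and third log lines above this gives 768K and 1369K promoted respectively, so promotion is already visible in this output without any extra flags.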
2. What is the best GC algorithm and tuning strategy to minimize these promotions during the program execution?
3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?
4. Is the latency cost of promoting really relevant?
Thx!
-- Derek
>> 3. Is it possible to somehow force all promotions to happen early in the game during the code warmup phase?
> Only if your application makes zero use of long-lived-but-eventually-dying objects.

Applications that resort to object pooling and create no garbage for the GC would guarantee zero use of long-lived-but-eventually-dying objects as you mentioned, correct?
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
You have to size your young gen such that you don't GC until next downtime; if that's achievable, then even lazily growing pools are fine. However, it's still better to preallocate intelligently so you don't waste time on initial allocs and you can keep the pooled objects close in memory (assuming that's beneficial in the use case).
sent from my phone
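To make the sizing rule above concrete, here's a hedged back-of-envelope sketch (all rates and intervals are hypothetical): if you want zero young GCs between scheduled downtimes, Eden must hold everything allocated in between.

```java
// Back-of-envelope sizing sketch for "don't GC until next downtime":
// Eden must be at least (allocation rate) x (time between downtimes).
public class YoungGenBudget {
    public static long requiredEdenBytes(long allocBytesPerSec, long secondsBetweenDowntimes) {
        return allocBytesPerSec * secondsBetweenDowntimes;
    }

    public static void main(String[] args) {
        long rate = 50 * 1024;   // 50 KB/s steady-state allocation (hypothetical)
        long day  = 24 * 3600;   // one day between restarts (hypothetical)
        System.out.printf("Need at least %d MB of Eden%n",
                requiredEdenBytes(rate, day) / (1024 * 1024));
    }
}
```

Even a modest slow-path allocation rate compounds over a day, which is why this class of application is rare.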
I agree except I don't think it's necessary to pool *everything*. One needs to consider available young gen (budget), allocation rate, and time span between down times.
On Aug 25, 2015, at 4:47 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
I agree except I don't think it's necessary to pool *everything*. One needs to consider available young gen (budget), allocation rate, and time span between down times.
It's harder to write zero GC Java apps, sure, but there are advantages to doing that over "half-way" pseudo no-GC implementations. At the end of the day, if you want stable performance that you control (with respect to memory), you treat the memory manager with utmost care, whether it's GC or native. The quality and implementation details of the manager dictate what you can get away with, but they don't change the big picture.
While "fit within the young gen until next scheduled downtime and never GC" applications certainly exist, they represent a tiny fraction even of low-latency applications, and a much tinier fraction of applications that you'd just call "latency sensitive". I've encountered only a handful or so of such applications. For the vast-vast-vast majority of applications built in Java, it is completely impractical to live in a static enough working set such that no GC of any sort would be required for the lifetime of the process.
Then there is the next class of "fit within the oldgen until the next scheduled downtime and never incur an oldgen GC of any kind (but do incur young gen GCs)" applications. In the low latency space, there are certainly more of these than there are in the first class, and I've certainly encountered several tens of them. But the vast majority of latency-sensitive applications don't fit into this category either, and will encounter multiple old gen GCs between scheduled downtimes.
A telltale sign is the choice of garbage collector in HotSpot. When an application chooses CMS over ParallelGC, it is *always* (or maybe "should always be") because oldgen collections are expected to be encountered during normal operation. If no oldgen GCs are expected, ParallelGC's newgen tends to do better in terms of both throughput and latency compared to CMS's ParNew (they are both parallel and monolithic STW newgens, but ParallelGC's seems to do more try-to-be-shorter tricks, e.g. capping the card scanning range so it only covers the currently used part of oldgen. The same trick would be ineffective in CMS past the first oldgen collection because CMS does not compact the oldgen...).
So for the vast-vast-vast majority of Java applications young gen GC cycles are very real, and for the smaller but still vast majority of applications old gen GC cycles are also very real...
Right. Invariably, every app will have long-lived eternal data, a ton of short-lived per-request type of data, and then some that's medium term (worst kind). Pooling the short lived and medium term objects is worthwhile; these drive your old gen sizing, and the short lived pooled + slow path allocs drive the young gen size.

One can certainly try to tune the GC, but having spent sufficient time doing this myself before, and seeing traffic on the hotspot-gc-use mailing list, made me turn away from that. You'll ramp up quicker by not being as frugal/strict with allocations, but then may spend an inordinate amount of time playing the "turn the knob and see what happens" game, looking at GC impl code, becoming intimately familiar with each GC's log info, implementation details, tuning parameters (some that aren't documented all that well), checking if throughput degrades, etc.

It's much easier in some sense to put more control in your hands and then only decide on your young and old gen sizes. From an operational and capacity planning standpoint, it's easier to reason about memory usage that's constant or near constant rather than zig-zags. GC is good in some ways, but it doesn't remove the need to be frugal with allocations (if latency/perf/jitter is a concern).
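A minimal sketch of the pooling idea described above (my own illustration, not code from the thread): preallocate up front so the steady state neither allocates nor promotes. A real pool needs thread-safety and a slow-path policy; this single-threaded version just shows the shape.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Toy object pool: eager preallocation so steady-state acquire/release
// performs no allocation (and therefore causes no promotion).
public class SimplePool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();

    public SimplePool(int size, Supplier<T> factory) {
        for (int i = 0; i < size; i++) free.push(factory.get()); // preallocate once
    }

    public T acquire() {
        return free.poll();    // null when exhausted: caller picks slow-path policy
    }

    public void release(T obj) {
        free.push(obj);
    }

    public int available() { return free.size(); }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(8, StringBuilder::new);
        StringBuilder sb = pool.acquire();  // no allocation on this path
        sb.setLength(0);                    // reset before reuse
        pool.release(sb);
    }
}
```

Note the pooled objects land in old gen once (after warmup) and stay there, which is exactly the "constant rather than zig-zag" memory profile described above.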
What is the throughput hit from C4?
Even if you are using a "bad" third-party library that produces garbage you can use some instrumentation techniques to spot where the "garbage leak" is and fix the source code if it is available. Not many people are aware that it is possible to intercept any Java allocation done with the 'new' keyword anywhere in your code through the use of a Java Agent (-javaagent JVM command-line option). If interested in this technique you can check CoralBits' MemorySampler class.
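For reference, the -javaagent hook looks roughly like this (a skeleton of my own, not CoralBits code): a real allocation profiler would pair premain with a bytecode library such as ASM to rewrite allocation bytecodes at the marked spot.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Skeleton of a -javaagent allocation interceptor. Returning null from the
// transformer means "leave the class unchanged"; real instrumentation would
// return a rewritten classfile instead.
public class AllocAgent {
    static final ClassFileTransformer TRANSFORMER = new ClassFileTransformer() {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // A real profiler rewrites allocation bytecodes (NEW, NEWARRAY, ...)
            // here to record the allocating call site; null means "no change".
            return null;
        }
    };

    // Called by the JVM before main() when launched with
    // -javaagent:alloc-agent.jar (the jar's manifest needs Premain-Class: AllocAgent).
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(TRANSFORMER);
    }
}
```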
Mike,
Which Hotspot GC is the single digit % in reference to?
I hope nobody is *relying* on escape analysis to eliminate allocations in fast paths.
I hope nobody is *relying* on escape analysis to eliminate allocations in fast paths.
Sure, but I think the original point you were replying to was about detecting allocations in an application that's either trying to avoid them entirely or, more likely, avoiding them in fast paths. Accurate measurement isn't necessary in that case as it's more boolean "yes, allocates on fast path" or "no allocs on fast path".
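On the "boolean" detection point: HotSpot's com.sun.management.ThreadMXBean extension (HotSpot-specific, not portable to every JVM) exposes a per-thread allocated-bytes counter that answers exactly that yes/no question without an agent. A hedged sketch:

```java
import java.lang.management.ManagementFactory;

// Hedged sketch of a boolean "does this path allocate?" check using HotSpot's
// com.sun.management.ThreadMXBean extension (a HotSpot-specific API).
public class AllocCheck {
    private static final com.sun.management.ThreadMXBean TMX =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Returns true if running 'task' allocated on the current thread. */
    public static boolean allocates(Runnable task) {
        long id = Thread.currentThread().getId();
        long before = TMX.getThreadAllocatedBytes(id);
        task.run();
        return TMX.getThreadAllocatedBytes(id) - before > 0;
    }

    public static void main(String[] args) {
        System.out.println(allocates(() -> { byte[] b = new byte[1024]; }));
    }
}
```

The counter is TLAB-granular and approximate, but for a "yes, allocates on fast path" / "no allocs on fast path" answer that's plenty.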
And while we're discussing this, the problem isn't passing a ref to another method but passing a ref to a method not inlined. This is part of the "brittleness" of EA.
And while we're discussing this, the problem isn't passing a ref to another method but passing a ref to a method not inlined. This is part of the "brittleness" of EA.
I'm aware of that, but thanks for raising the issue. The reason I didn't mention this before is that from the memory profiler's point of view, it won't know whether the methods it instruments get inlined at the time that the java agent instruments them. Or at least I can't see any way to do so that's not even more brittle than EA. If I'm missing something then I'm happy to be corrected on the matter.
Thanks Mike. I'd be interested in a throughput comparison between Parallel GC and Zing; reason being is parallel will have the cheapest write barrier in hotspot, and I'm curious what tax the load and store barriers in Zing impose.
Also, how many cpus did you allocate to Zing and Hotspot GCs in your experiment?
From memory is fine :).
Was your app promoting anything of consequence even though you weren't causing full gc? Was concurrent GC running and just keeping up? I'm not entirely certain what exactly you meant. CMS promotion can be more expensive than parallel old because it's free list based - I don't know how Zing manages old and young regions (what *does* it do?)
On Aug 26, 2015 10:45 PM, "Michael Barker" <mik...@gmail.com> wrote:
Hi Vitaly,

I'm mostly working from memory at the moment. The test was done a few years ago when we were making the decision whether or not to move to Zing, so the test harness has probably gotten out of date and been deleted. Also worth mentioning that Azul continue to optimise the LVB, so it is probably faster than when I tested.

As for CPUs, in Zing we allocated 2 threads for new collections and 2 for old, to prevent them from contending with the application for CPU resource (to this day, that is the only tuning option we've applied to the GC). With Hotspot, I think we just used the default. Worth noting that our Hotspot set up was tuned such that we wouldn't run into old GCs, so most of the time C4 was actually competing against ParNew. Given that ParNew is a STW collector, the default CPU count (#threads == #cores IIRC) was probably the most appropriate thing.

Mike.
On 27 August 2015 at 14:00, Vitaly Davidovich <vit...@gmail.com> wrote:
Thanks Mike. I'd be interested in a throughput comparison between Parallel GC and Zing; reason being is parallel will have the cheapest write barrier in hotspot, and I'm curious what tax the load and store barriers in Zing impose.
Also, how many cpus did you allocate to Zing and Hotspot GCs in your experiment?
sent from my phone
On Aug 26, 2015 7:57 PM, "Michael Barker" <mik...@gmail.com> wrote:
> Which Hotspot GC is the single digit % in reference to?
We compared with ParNew + iCMS, although the difference probably doesn't have much to do with the actual GC and more to do with how the VM handles dereferencing.

As a bit more background, the single digit % result was from our only legitimate case of running our system flat out, which is to test how long it takes to restore the system after a crash. It would read in a day's worth of journal files to recover its state. We measured the total time taken to complete the restore and there was very little difference.

Mike.
On Wednesday, August 26, 2015 at 7:52:34 PM UTC-7, Vitaly Davidovich wrote:
From memory is fine :).
Was your app promoting anything of consequence even though you weren't causing full gc? Was concurrent GC running and just keeping up? I'm not entirely certain what exactly you meant. CMS promotion can be more expensive than parallel old because it's free list based - I don't know how Zing manages old and young regions (what *does* it do?)
Zing is a pure mark/compact collector for both young and old generations, so promotion is done into nice contiguous allocation blocks (normally 2MB blocks). No free lists involved. In that sense it's probably similar to ParallelGC in the total amount of promotion work per promoted unit.

A key difference in promotion logic is that Zing does not use promotion thresholds counted in terms of # of newgen cycles. Instead, its promotion threshold is time based. [By default] objects younger than 2 seconds are new, and ones that are older than 2 seconds are old. This age boundary is configurable; I don't know of anyone who has had to configure it in 2+ years...

This simpler time-based age decision is possible because Zing does not split the heap between the new and old generations. Newgen gets to use the entire heap, and as a result newgen frequencies are usually significantly reduced when compared to STW newgens, and especially when compared to those that try to cap their pause times by keeping the Eden size to a few tens or hundreds of MB. This translates into even more things dying in newgen and avoiding both promotion and copying work, which in turn translates to less copying and less promotion work than ParallelGC would incur on a similar workload, along with a reduced rate of oldgen GCs as a result as well.

With all that said, we usually don't bother much with characterizing the efficiency side of things, or looking too much at promotion rates, because most production configs have the GC active for only 1-5% of the overall time... I don't think we've ever had anyone (using Zing) try to tune for a change in promotion rate or analyze it too closely, because it just doesn't matter. Since each extra empty GB adds efficiency to the collectors, most people just add enough to keep the collectors (both newgen and oldgen) relatively idle and stop where they feel comfortable with the % of time that GC is active.

This is usually driven by a wish for "GC headroom" rather than CPU consumption concerns, since (with Zing) the collectors would generally need to be active 100% of the time before application delays start popping up.
Thanks Gil (and Mike).

Gil, so what are the costs of the LVB and write barriers in Zing? What does an LVB look like in pseudo-assembly?
Are write barriers using card marking?
Are they susceptible to false sharing (like Hotspot) on the card?
Are there cpu fences after the write barrier (given that the GC is concurrent and not STW)?
Reference-store barriers in Zing/C4 do card mark. But the card table is a bit different: due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (it only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty) as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing; or -XX:+UseCondCardMark, which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores; or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.
Note that due to the precise nature (1 bit per heap word) of the card table, actual dirtying stores to the card table are atomic (an atomic OR), but that these stores are dynamically much more rare than the cheaper blind store in HotSpot.
As explained above, Zing's card marking is not susceptible to false-sharing contention. C4 was initially developed for machines with several hundreds of cpu cores, and for environments where both cache coherency bandwidth and memory bandwidth could become the real bottleneck (even with 64 memory controllers humming in parallel), so we had to deal with that one very early on...
If the barrier chooses to dirty a card, there is a logical StoreStore fence between the barrier's card dirtying store operation and the actual reference store that follows it. What this translates to would depend on the CPU involved. On x86 it's a no-op.
Some details below.
On Thursday, August 27, 2015 at 6:15:53 AM UTC-7, Vitaly Davidovich wrote:
> Thanks Gil (and Mike). Gil, so what are the costs of the LVB and write barriers in Zing? What does an LVB look like in pseudo-assembly?

As you can imagine, we've gone through many optimizations of the LVB and what its fast path test looks like over the years. Much of that has to do with delicately designing collector state representations and phase transitions to make the fast path LVB test as cheap as possible. We've probably gone through 15+ implementations of the same logical LVB over the past decade.
While the logical LVB test always enforces the LVB invariants that the C4 paper describes, in current x86 implementations we've managed to devolve the fast path to a simple TEST and JMP combination. Depending on register allocation decisions made by the JIT, the test is either reg vs. reg or reg vs. thread-local memory location (which is hot and always L1-hitting). This translates to a single u-op (in the reg vs. reg test) or two u-ops (in the reg vs. mem test), and a jump that is (literally) 99.9999999% predictable and (in the reg vs. mem case) L1-local. If the fast path triggers (that 0.000000001% of the time thing), the slow path is still "fast" but has some real work to do depending on the triggering conditions and GC phase (it actually has multiple "fast slow path" levels before devolving to the slowest thing).

Summary: the LVB fast path is a single ultimately-predictable branch on a test that never incurs a cache miss.

As far as impact, LVB "cost" varies with the program's (orthogonal to LVB) IPC. The two instructions and resulting 1 or 2 u-ops (and branch) certainly consume processor resources. The cost of consuming these resources "grows" at high IPCs, when the processor would otherwise be able to keep its pipeline and execution units entirely full, and "shrinks" at low IPCs, e.g. where cache misses come into play. It is basically undetectable in pointer-chasing situations, and can show a handful of % in u-benchmarks with tight numeric L1-hitting loops. Most applications fall somewhere in the middle.
> Are write barriers using card marking?

For clarity, I like to refer to these barriers as "reference store barriers" to avoid ambiguity in the term "write barrier". They are barriers that are executed when a reference is stored to a memory location. They exist in all generational collectors, but are also needed for non-generational purposes in some collectors (e.g. G1 uses them to enforce SATB invariants and track cross-region remembered sets, and includes tests both before and after the actual reference store). Zing's reference store barriers are there purely for generational remembered set tracking, and apply ahead of the reference store itself.

Reference-store barriers in Zing/C4 do card mark. But the card table is a bit different: due to various considerations, Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space). Zing also uses a double-conditional card mark barrier (it only dirties when storing newgen refs into oldgen fields, and only when the card is not already dirty) as opposed to HotSpot's variants (the default "blind" unconditional dirtying, which is susceptible to false sharing; or -XX:+UseCondCardMark, which does not dirty already-dirty cards and avoids false sharing, but still dirties on oldgen-to-oldgen stores; or the much more complicated G1 reference-store barrier). You can find some good detail on the HotSpot variants in Nitsan's blog post on the subject.

While it's hard to compare, Zing's reference-store barrier cost for single-threaded execution probably falls somewhere between the "blind" and -XX:+UseCondCardMark HotSpot barrier costs, while its multi-threaded execution cost is probably somewhat lower than both (if/when false sharing in the card table were an issue for HotSpot). Zing's memory bandwidth cost is significantly lower (up to 2x less write bandwidth to memory in streaming cases), but modern x86 sockets tend to have memory bandwidth to spare, so this may not matter as much.
All of these statements become very application-behavior-dependent though...
The reason Zing's reference-store barrier can be faster in the presence of false sharing is that the fast-path generational-test condition (which HotSpot doesn't do in the -XX:+UseCondCardMark test, AFAIK) reduces the cost of conditional testing on the fast path because it involves no memory access (it is based purely on the value of the reference being stored and the target address it is being stored to). The memory access, and the "is it already dirty?" test that goes with it, is only needed if the store creates an oldgen->newgen reference, which is dynamically rare, typically down to a handful of % (or less) of reference stores.
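A simplified Java model of the double-conditional barrier described above may help. The generation boundary, card geometry, and byte-per-card table here are invented for illustration (Zing's real table is 1 bit per heap word, dirtied with an atomic OR); the model only shows the ordering of the two conditions: a pure value/address test first, and a card-table access only when the store actually creates an oldgen->newgen reference.

```java
// Simplified model of a double-conditional card-mark reference-store barrier.
// Address ranges, card geometry, and the byte-per-card table are hypothetical.
public class CardMarkSketch {
    static final long OLD_GEN_BASE = 1L << 32; // pretend oldgen lives above 4 GB
    static final int CARD_SHIFT = 9;           // one card per 512 bytes (HotSpot-like)
    static final byte[] cardTable = new byte[1 << 16];
    static long cardWrites = 0;

    static boolean isOldGen(long addr) { return addr >= OLD_GEN_BASE; }
    static boolean isNewGen(long addr) { return addr < OLD_GEN_BASE; }

    // Executed before storing reference value newRef into the field at targetAddr.
    static void refStoreBarrier(long targetAddr, long newRef) {
        // Condition 1: purely value/address based -- no memory access at all.
        if (isOldGen(targetAddr) && isNewGen(newRef)) {
            int card = (int) ((targetAddr - OLD_GEN_BASE) >>> CARD_SHIFT) & 0xFFFF;
            // Condition 2: only dirty a card that is not already dirty,
            // avoiding redundant (and potentially false-sharing) writes.
            if (cardTable[card] == 0) {
                cardTable[card] = 1;
                cardWrites++;
            }
        }
    }

    public static void main(String[] args) {
        long oldField = OLD_GEN_BASE + 4096;
        refStoreBarrier(oldField, 0x1000L);          // oldgen <- newgen: dirties card
        refStoreBarrier(oldField, 0x2000L);          // same card already dirty: no write
        refStoreBarrier(0x1000L, 0x2000L);           // newgen target: filtered out
        refStoreBarrier(oldField, OLD_GEN_BASE + 8); // oldgen->oldgen: filtered out
        System.out.println(cardWrites);              // 1
    }
}
```

Because the first condition filters out the dynamically common cases (newgen targets and oldgen-to-oldgen stores) without touching the card table, the table is only read or written on the rare stores that actually matter to the remembered set.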
sent from my phone
Thanks for the details, Gil.

The LVB lowering sounds roughly equivalent to doing an array range check on each memory reference (modulo an LVB possibly doing reg vs. reg, whereas a range check always loads the length from memory, barring other optimizations in scope).
Is the LVB done on each access to a reference field, or only on the first one, with a register used thereafter? E.g.:

    if (someObject.ref != null) {                      // LVB here, I assume
        System.out.println(someObject.ref);            // is there an LVB here or no?
        System.out.println(someObject.ref.hashCode()); // how about here?
    }
Basically, does the JIT common out the reads and the LVB? I assume so(!) but wanted to double check.
> Zing uses a precise card table (1 bit per heap word) as opposed to HotSpot's imprecise table (1 byte per 512 bytes of heap space).

So a 32GB heap will use a 256MB card table (assuming min object is 16 bytes)? HotSpot would use 64MB.
What's the reason for such precision?
Also, I believe Hotspot doesn't check whether a store is from old gen object or not for performance reasons -- you did not find this to be a problem in Zing? Or you feel you make up for it by reducing writeback traffic?
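The arithmetic behind these estimates can be sketched as follows. The 512MB figure assumes a 64-bit (8-byte) heap word for "1 bit per heap word"; the 256MB figure reproduces the poster's 16-byte-granularity assumption; the 64MB figure is HotSpot's 1-byte-per-512-byte card.

```java
// Back-of-envelope card table sizes for a 32 GB heap, under the
// granularity assumptions discussed in the thread.
public class CardTableMath {
    // Precise table: 1 bit per granule of bytesPerBit heap bytes.
    static long preciseTableBytes(long heapBytes, long bytesPerBit) {
        return heapBytes / bytesPerBit / 8;
    }

    // HotSpot-style imprecise table: 1 byte per 512-byte card.
    static long hotspotTableBytes(long heapBytes) {
        return heapBytes / 512;
    }

    public static void main(String[] args) {
        long heap = 32L << 30; // 32 GB
        System.out.println(preciseTableBytes(heap, 8)  >> 20); // 512 MB: 1 bit per 8-byte word
        System.out.println(preciseTableBytes(heap, 16) >> 20); // 256 MB: 16-byte assumption
        System.out.println(hotspotTableBytes(heap)     >> 20); //  64 MB: HotSpot table
    }
}
```

In other words, with 8-byte heap words the precise table would be 1/64 of the heap (512MB for 32GB), which is the overhead the question is probing at.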
> Note that due to the precise nature (1 bit per heap word) of the card table, actual dirtying stores to the card table are atomic (an atomic OR), but these stores are dynamically much more rare than the cheaper blind store in HotSpot.

Why much more rare? This implies there aren't many oldgen->younggen references, but why is that rare? I'd expect this to depend on the application, and not be some generalized thing.
> As explained above, Zing's card marking is not susceptible to false-sharing contention. C4 was initially developed for machines with several hundreds of cpu cores, and for environments where both cache coherency bandwidth and memory bandwidth could become the real bottleneck (even with 64 memory controllers humming in parallel), so we had to deal with that one very early on...

I'm not sure I see how the precise card table avoids false sharing. Or do you mean due to reduced dirtying to begin with? Or because a 64-byte cacheline of the card table "addresses" fewer objects in your implementation?
> If the barrier chooses to dirty a card, there is a logical StoreStore fence between the barrier's card dirtying store operation and the actual reference store that follows it. What this translates to would depend on the CPU involved. On x86 it's a no-op.

Ok, but the dirtying uses an atomic instruction though, right?
sent from my phone
So for the above code, it's actually:

    if (someObject.ref != null) {                      // No LVB here; null checks don't require an LVB
        System.out.println(someObject.ref);            // LVB here (between reading someObject.ref and using it)
        System.out.println(someObject.ref.hashCode()); // No LVB here (someObject.ref is already LVB'ed)
    }
Thanks