Minimum realistic GC time for G1 collector on 10GB

Ivan Kelly

Dec 12, 2016, 7:09:01 AM
to mechanical-sympathy
Hi all,

Is it possible to get ~10ms minor collection times on HotSpot/OpenJDK JVMs with a 10GB heap? We've been doing some profiling on our application and we're at a stage where the major latency bottleneck is collection of short-lived objects. We're processing 10,000s of requests per second, and each of these requests creates a bunch of temporary objects, which go out of scope once processing finishes.

Using G1GC and a 10GB heap, we see ~100ms pauses for minor collections. Nothing is tenured in these collections. We know we're allocating a lot of junk (the test I ran was allocating ~300MB/s), but even with higher allocation rates the GC pause time is about the same, albeit less frequent. I'd like to know if it makes sense to try and reduce these pauses by reducing the amount of junk allocated, or whether a more fundamental rearchitecture is in order (the part of the application that needs low latency could probably fit in a 100MB heap).

So is it possible to have low pause times on a 10GB heap using any of the collectors in HotSpot/OpenJDK if nothing is tenured?

Cheers,

Ivan

Chris Newland

Dec 12, 2016, 7:58:09 AM
to mechanical-sympathy
Hi Ivan,

Without commenting on whether a 10ms pause time is achievable with HotSpot, I'd say getting an understanding of your allocation rates is always a good thing.

Once you've tackled the low hanging fruit you could dig deeper by getting HotSpot to output its JIT compiler logs (-XX:+UnlockDiagnosticVMOptions and -XX:+LogCompilation) to check that inlining and escape analysis are working as expected.

If HotSpot fails to inline methods to which you pass locally allocated objects then those allocations won't be eliminated by the escape analysis optimisations. This can make a big difference on allocation rates and minor collection frequency.
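A minimal sketch of the kind of allocation this refers to, assuming nothing about Ivan's actual code: the class and method names below are hypothetical. Each `Point` is created and consumed inside one loop iteration and is only passed to a small helper, so it is a candidate for elimination if the helper gets inlined. Whether that actually happens depends on the JVM version and on inlining decisions, which is what the compilation log is for.

```java
/** Hypothetical micro-example; class and method names are illustrative only. */
final class EscapeAnalysisSketch {
    /** Small value holder that never leaves the method that creates it. */
    private static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Small helper that receives the locally allocated object. */
    private static int manhattan(Point p) {
        return Math.abs(p.x) + Math.abs(p.y);
    }

    /** Allocates one Point per iteration; none is stored in a field, array, or queue. */
    static long sumOfDistances(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, -i); // only eliminable if manhattan() is inlined here
            total += manhattan(p);
        }
        return total;
    }

    public static void main(String[] args) {
        // Run with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation to inspect inlining,
        // and compare GC activity against a run with -XX:-DoEscapeAnalysis.
        long result = 0;
        for (int round = 0; round < 20_000; round++) {
            result += sumOfDistances(1_000);
        }
        System.out.println(result);
    }
}
```

If the `Point` were instead handed to another thread or stored in a shared structure, it would escape and the allocation would remain regardless of inlining.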

There's also a commercial JVM that is well-suited to low pause times at high allocation rates ;)

Cheers,

Chris

Wojciech Kudla

Dec 12, 2016, 8:19:32 AM
to mechanical-sympathy

It might also make sense to check the time to safepoint, paging (especially during the collection), and whether there are any NUMA-related stalls involved.


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martijn Verburg

Dec 12, 2016, 8:25:08 AM
to mechanica...@googlegroups.com
Hi Ivan/All,

In addition to JITWatch to check your JIT compilation, grab a copy of Censum (it's free for 7 days; disclaimer: I work there) and switch on -XX:+PrintGCDetails, -XX:+PrintTenuringDistribution and -Xloggc:<log location> to produce a GC log.

AFAIK Censum and Oracle's Flight Recorder / Mission Control are the only two products that give you all of the internal breakdown of where time is spent inside the G1 algorithm and time to safepoint.

All that said, the copy cost and reference processing of G1 for a heap as busy and as large as yours is unlikely to get under 10ms. As Chris mentioned, Gil will now be summoned ;-).



Cheers,
Martijn


Ivan Kelly

Dec 12, 2016, 9:30:21 AM
to mechanica...@googlegroups.com
Thanks all for the suggestions. I'll certainly check the safepoint
stuff. I suspect that a lot of EPA isn't happening where it could also
due to stuff being pushed through disruptor. Even if it is happening,
or I can make it happen, and also clear up all the low hanging fruit,
if I can't get below 100ms pauses on the 10G heap, then it would all
be for nothing.

Censum is interesting, I'll take a look. I have flight recordings
right now. Is it possible to find the time to safepoint in that?

> All that said, the copy cost and reference processing of G1 for a heap as
> busy / sized as yours is unlikely to get under 10ms, as Chris mentioned Gil
> will now be summoned ;-).
Yes, really what I'm looking for is someone to tell me "no, you're
nuts, not possible" so I can justify going down the rearchitecture
route. Unfortunately Zing isn't an option, since it would double the
cost of our product.

As I said in the original email though, the low latency part of the
application can probably fit in 100MB or less. The application takes
netlink notification, does some processing and caches the result for
invalidation later. The netlink notification and processing is small
but needs to be fast. The cache can pause, as long as entries for it
can be queued. It's a prime candidate for being moved to another
process.

Which brings me to another question: what are good Java shared-memory IPC
queues? Something like Chronicle Queue, but without the persistence.
I'd prefer not to roll my own. The road to hell has enough paving
stones.

-Ivan

Francesco Nigro

Dec 12, 2016, 10:56:37 AM
to mechanical-sympathy
Hi Ivan!
 
> What are good java shared mem IPC queues?

Agrona (many-to-one, one-to-one and broadcast, variable-sized messages) and JCTools (many-to-one, one-to-one, fixed-size) have off-heap implementations well suited to IPC. Bear in mind that, depending on how you've already implemented your system, you'll need to handle "new" failure cases such as dead publishers/receivers.
IMHO, designing for failure is worth it anyway...
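To make the idea of an off-heap IPC queue concrete, here is a rough sketch of a single-producer/single-consumer queue of `long` values over a memory-mapped file. It is not how Agrona or JCTools are implemented and is no substitute for them; the class name, layout and offsets are all hypothetical, it assumes Java 9+ (for `VarHandle`), and it deliberately ignores the dead-publisher/receiver cases mentioned above.

```java
import java.io.IOException;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical single-producer/single-consumer queue of longs backed by a memory-mapped file. */
final class MappedSpscQueueSketch {
    private static final VarHandle LONGS =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());
    private static final int HEAD_OFFSET = 0;   // next sequence to read, written only by the consumer
    private static final int TAIL_OFFSET = 64;  // next sequence to write, written only by the producer
    private static final int DATA_OFFSET = 128; // start of the slot array

    private final ByteBuffer buffer;
    private final int capacity;

    MappedSpscQueueSketch(Path file, int capacity) throws IOException {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            this.buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, DATA_OFFSET + 8L * capacity);
        }
    }

    /** Producer side: returns false when the queue is full. */
    boolean offer(long value) {
        long tail = (long) LONGS.get(buffer, TAIL_OFFSET);
        long head = (long) LONGS.getAcquire(buffer, HEAD_OFFSET);
        if (tail - head >= capacity) {
            return false;
        }
        LONGS.set(buffer, slotOffset(tail), value);
        LONGS.setRelease(buffer, TAIL_OFFSET, tail + 1); // publish the slot to the consumer
        return true;
    }

    /** Consumer side: returns ifEmpty when nothing is queued. */
    long poll(long ifEmpty) {
        long head = (long) LONGS.get(buffer, HEAD_OFFSET);
        long tail = (long) LONGS.getAcquire(buffer, TAIL_OFFSET);
        if (head == tail) {
            return ifEmpty;
        }
        long value = (long) LONGS.get(buffer, slotOffset(head));
        LONGS.setRelease(buffer, HEAD_OFFSET, head + 1); // hand the slot back to the producer
        return value;
    }

    private int slotOffset(long sequence) {
        return DATA_OFFSET + 8 * (int) (sequence & (capacity - 1));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("spsc-sketch", ".dat");
        file.toFile().deleteOnExit();
        // Two independent mappings of the same file stand in for two processes.
        MappedSpscQueueSketch producer = new MappedSpscQueueSketch(file, 8);
        MappedSpscQueueSketch consumer = new MappedSpscQueueSketch(file, 8);
        producer.offer(42L);
        producer.offer(43L);
        System.out.println(consumer.poll(-1L));
        System.out.println(consumer.poll(-1L));
        System.out.println(consumer.poll(-1L));
    }
}
```

Because the queue state lives outside the Java heap, neither side allocates per message, which is the property that matters for the GC discussion in this thread.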

About the GC pauses, consider the hints in this article and an old answer of Mr. G. (Mr T. is Martin T.!) about generational collectors.
AFAIK, and I'm not a GC expert at all, a "hidden" factor that plays a role in GC pauses is the card marking phase. Citing an article on G1:
> If you allocate more objects during concurrent marking than you end up collecting, you will eventually exhaust your heap. During the concurrent marking cycle, you will see young collections continue as it is not a stop-the-world event.

Hence, considering what could slow down the marking time (or its cleaning), it's worth checking (as others have suggested):
  • allocation rate
  • what are the most frequently collected data structures during minor GCs (e.g. linked lists?)
  • false sharing issues during card marking (I'm not sure about it)
Anyway, I've found this tool from Mr. G that could help.
About any measurement tool you're considering, please read this post to choose it properly.

Regards,
Francesco

ki...@kodewerk.com

Dec 12, 2016, 4:16:26 PM
to mechanica...@googlegroups.com
Yes, configure a smaller Eden and you will get there for this type of profile. I did this for a video system with Java 7 a few years ago. I had a 16.7ms time budget for video rendering along with GC.
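The trade-off of a smaller Eden can be estimated with simple arithmetic: at a steady allocation rate, Eden fills (and a young collection is triggered) roughly every Eden size divided by allocation rate. The sketch below uses the ~300MB/s figure from the original post; the Eden sizes are hypothetical, and the pause length itself is not modelled here. How Eden is actually sized (for example with -Xmn, or by letting G1 derive it from a pause goal) is outside this sketch.

```java
/** Back-of-the-envelope estimate; Eden sizes below are hypothetical. */
final class YoungCollectionInterval {
    /** Seconds between young collections if Eden fills at a steady allocation rate. */
    static double secondsBetweenCollections(double edenMb, double allocationMbPerSec) {
        if (edenMb <= 0 || allocationMbPerSec <= 0) {
            throw new IllegalArgumentException("sizes and rates must be positive");
        }
        return edenMb / allocationMbPerSec;
    }

    public static void main(String[] args) {
        double allocationMbPerSec = 300.0; // rate reported in the original post
        for (double edenMb : new double[] {6000.0, 600.0, 150.0}) {
            System.out.printf("Eden %.0f MB -> a young collection about every %.2f s%n",
                    edenMb, secondsBetweenCollections(edenMb, allocationMbPerSec));
        }
    }
}
```

So shrinking Eden trades fewer long pauses for many short ones, which only helps if the shorter pauses fit the latency budget.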

Regards,
Kirk


Ivan Kelly

Dec 12, 2016, 4:25:33 PM
to mechanica...@googlegroups.com
On Mon, Dec 12, 2016 at 10:16 PM, ki...@kodewerk.com <ki...@kodewerk.com> wrote:
> Yes, configure a smaller Eden and you will get there for this type of
> profile. I did this for a video system with Java 7 a few years ago. I had a
> 16.7ms time budget for video rendering along with GC.
I assume this was with CMS + parallel? I've been playing with the
MinorGC app Francesco mentioned, and so far parallel completely beats
out g1gc.

-Ivan

ki...@kodewerk.com

Dec 13, 2016, 3:30:28 AM
to mechanica...@googlegroups.com

> On Dec 12, 2016, at 10:25 PM, Ivan Kelly <iv...@midokura.com> wrote:
>
> On Mon, Dec 12, 2016 at 10:16 PM, ki...@kodewerk.com <ki...@kodewerk.com> wrote:
>> Yes, configure a smaller Eden and you will get there for this type of
>> profile. I did this for a video system with Java 7 a few years ago. I had a
>> 16.7ms time budget for video rendering along with GC.
> I assume this was with CMS + parallel?

It works for G1 also.

> I've been playing with the
> MinorGC app Francesco mentioned, and so far parallel completely beats
> out g1gc.

For small heaps parallel will always beat G1. G1 is not intended for small heaps and comes with some very high overheads.

Kind regards,
Kirk

ki...@kodewerk.com

Dec 13, 2016, 3:48:44 AM
to mechanica...@googlegroups.com
Hi,

Censum charts safe-point times. And if you collect extra safe-pointing data it will chart that also.

For reference processing ensure that parallel reference processing is enabled if you have a ton of references.
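One way to confirm which GC-related flags a running JVM actually ended up with is to ask it through the HotSpot diagnostic MXBean. This is only a sketch: the class name is hypothetical, it assumes a HotSpot-based JVM, and the set of available flags varies between JDK versions, so a missing flag is reported rather than treated as an error.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that reports the effective value of a HotSpot flag. */
final class GcFlagCheck {
    /** Returns "name=value (origin)", or a note when the flag cannot be read. */
    static String describe(String flagName) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return flagName + ": not readable on this JVM";
        }
        try {
            VMOption option = bean.getVMOption(flagName);
            return option.getName() + "=" + option.getValue() + " (" + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            return flagName + ": not available in this JVM";
        }
    }

    public static void main(String[] args) {
        // e.g. java -XX:+UseG1GC -XX:+ParallelRefProcEnabled GcFlagCheck
        for (String flag : new String[] {"UseG1GC", "ParallelRefProcEnabled", "MaxGCPauseMillis"}) {
            System.out.println(describe(flag));
        }
    }
}
```

The origin part of the output distinguishes a default from a value set on the command line, which helps when flags are scattered across launch scripts.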

Kind regards,
Kirk Pepperdine

ki...@kodewerk.com

Dec 13, 2016, 4:02:33 AM
to mechanica...@googlegroups.com

 
> Hence, considering what could slow down the marking time (or its cleaning), it's worth to check(as other have suggested):
>  • allocation rate

Mutation rates are important. You can see the effects of mutation rates in the RSet refinement phase. By default, 10% of RSet refinement costs (the white zone of the RSet refinement queue) are picked up by the GC threads. In Censum you can see this cost, and if it exceeds 10% then the RSet refinement threads can’t keep up. You can tune them to be more aggressive or lower the threshold for the red zone of the RSet refinement queue. This will have an impact on application throughput and will result in lower mutation rates.

>  • what are the most frequently collected data structures during minor GCs (eg: linked-lists?)
>  • false sharing issues during card marking (i'm not sure of it)

I don’t think false sharing is possible with RSets.

> Anyway I've found this tool from Mr. G that could help..

Do know that Mr. G tools are designed to break OpenJDK’s garbage collectors :-)  to demonstrate how Zing is more resilient. To be fair, Zing is more resilient but…

Regards,
Kirk

Ivan Kelly

Dec 13, 2016, 4:15:53 AM
to mechanica...@googlegroups.com
> Do know that Mr. G tools are designed to break OpenJDK’s garbage collectors
> :-) to demonstrate how Zing is more resilient. To be fair, Zing is more
> resilient but…
I've been playing with MinorGC for a few hours, and it's led me to
the conclusion that I just need to make my heap smaller if I want
sub-10ms pauses. Which is fine. Exactly the kind of guidance I was looking
for :)

-Ivan

Gil Tene

Dec 13, 2016, 11:02:24 PM
to mechanical-sympathy
Wow. I forgot I wrote that tool. Glad to see it helped you in your analysis, Ivan.

And Kirk, [most] of my GC-related tools are not designed to break OpenJDK's GCs. The GC-related tools I build are usually designed to exhibit normal GC behaviors more quickly in order to allow practical test-based observations (trying hard to not exacerbate the GC stall lengths, and only make them happen more frequently). This is true even for stressor tools like HeapFragger.

MinorGC is specifically designed to demonstrate OpenJDK collectors' BEST possible full lifecycle pause behaviors, not their "worst" or "broken" modes (which would have much worse pause times). Looking at the discussion that prompted the tool's writing (https://groups.google.com/d/msg/mechanical-sympathy/frVwfX8g6Gw/PWqDLoXTxdYJ), there is no mention of Zing or a comparison to its behavior. The original reason I wrote and posted MinorGC was to address questions and misconceptions about newgen pause times that were often projected from misleading short runs. Specifically, people would commonly run short (e.g. <1hr) tests and observe that newgen pause times were short (when oldgen was still at low occupancy), falsely believing that those results reflect the long term or typical newgen pause time behavior for their application. MinorGC lets you fast forward through [what would otherwise take] days of testing, to see the full lifecycle range of newgen pause times, which [for many collectors] is cyclical in nature. When used as a Java agent it lets you see what your actual application pause times are going to look like (it makes them happen more rapidly, but does not make them bigger) without having to wait days for the results. I think the README description is honest, and the tool is clearly useful.

That the tool can also be used to quickly compare actual GC pause times between OpenJDK and Zing is just a nice bonus... ;-) It was not what it was written for.

Gil Tene

Dec 13, 2016, 11:20:57 PM
to mechanical-sympathy


On Monday, December 12, 2016 at 9:30:21 AM UTC-5, Ivan Kelly wrote:
> Thanks all for the suggestions. I'll certainly check the safepoint
> stuff. I suspect that a lot of EPA isn't happening where it could also
> due to stuff being pushed through disruptor. Even if it is happening,
> or I can make it happen, and also clear up all the low hanging fruit,
> if I can't get below 100ms pauses on the 10G heap, then it would all
> be for nothing.
>
> Censum is interesting, I'll take a look. I have flight recordings
> right now. Is it possible to find the time to safepoint in that?
>
> > All that said, the copy cost and reference processing of G1 for a heap as
> > busy / sized as yours is unlikely to get under 10ms, as Chris mentioned Gil
> > will now be summoned ;-).
> Yes, really what I'm looking for is someone to tell me "no, you're
> nuts, not possible" so I can justify going down the rearchitecture
> route. Unfortunately Zing isn't an option, since it would double the
> cost of our product.

Did you actually check on the cost of OEM'ing Zing with your application? Or are you just assuming it will be expensive?

You might be surprised. Somehow Zing seems to be getting a bad rap, with people assuming it must be expensive. Maybe because the value seems "too high to be sold cheaply". Don't confuse high value with high price. Yes, Zing allows some very profitable high margin businesses to make even more money (think trading systems). But it is even more widely used in very low margin businesses (think online consumer retailers) with a reputation for penny-pinching.

Please note that I'm not making any claims that you can't find some other way to get to your target behaviors. I'm a great believer in the potential of duct tape in the hands of talented engineers.

ki...@kodewerk.com

Dec 14, 2016, 1:45:40 AM
to mechanica...@googlegroups.com
On Dec 14, 2016, at 5:02 AM, Gil Tene <g...@azul.com> wrote:

> Wow. I forgot I wrote that tool. Glad to see it helped you in your analysis Ivan.
>
> And Kirk, [most] of my GC related tools are not designed to break OpenJDK's GCs. The GC-related tools I build are usually designed to exhibit normal GC behaviors more quickly in order to allow practical test-based observations (trying hard to not exacerbate the GC stall lengths, and only make them happen more frequently). This is true even for stressor tools like HeapFragger.

Well, let’s just say they do a great job of quickly exhibiting “normal” GC behaviors to the point that I make use of them in my workshop because nothing else I’ve used messes things up to the extent that these tools do. So, I’m very happy for them.

Regards,
Kirk

Ivan Kelly

Dec 14, 2016, 4:05:19 AM
to mechanica...@googlegroups.com
> Did you actually check on the cost of OEM'ing Zing with your application? Or
> are you just assuming it will be expensive?
>
> You might be surprised. Somehow Zing seems to be getting a bad rap, with
> people assuming it must be expensive.
The sum of my research on pricing was going to
https://www.azul.com/products/zing/ and scanning for the first $ with
a number beside it. I guess the internet has destroyed my mind,
because I completely missed the next line about startups and ISV.

I think I'll give the duct tape a go first though. I'll definitely try
our product with the zing trial also, if for no other reason than to
see how good things could be.

-Ivan

Vitaly Davidovich

Dec 14, 2016, 9:19:07 AM
to mechanical-sympathy
Have you tried just setting -Xmx10g and -XX:MaxGCPauseMillis=10? This is typically a good baseline to start with for G1; it'll use the pause time goal to adaptively size the young gen based on evacuation cost statistics it maintains. With a 10ms goal, it'll size it pretty conservatively and you'll get frequent young evacs. If there are sufficient survivors, they'll start seeping into the old gen. You'll want to make sure concurrent marking can keep up with the promotion rate. All of this will be visible in the GC logs (make sure to turn on -XX:+PrintAdaptiveSizePolicy).
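As a quick sanity check alongside the GC logs, the standard GC MXBeans expose per-collector counts and accumulated collection times, from which a rough average can be derived. The sketch below is illustrative only: the class name and the garbage-churning loop are hypothetical, and an average hides exactly the outliers a latency-sensitive system cares about, so the logs remain the authority.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical report of average collection time per collector. */
final class GcPauseReport {
    /** Average collection time in ms, or 0 when the count is zero or undefined (-1). */
    static double averageMillis(long collectionCount, long collectionTimeMillis) {
        return collectionCount <= 0 ? 0.0 : (double) collectionTimeMillis / collectionCount;
    }

    public static void main(String[] args) {
        // e.g. java -XX:+UseG1GC -Xmx10g -XX:MaxGCPauseMillis=10 GcPauseReport
        // Churn some short-lived garbage so that young collections occur.
        byte[][] junk = new byte[1024][];
        for (int i = 0; i < 2_000_000; i++) {
            junk[i & 1023] = new byte[1024];
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long totalMillis = gc.getCollectionTime();
            System.out.printf("%s: count=%d, total=%d ms, avg=%.2f ms%n",
                    gc.getName(), count, totalMillis, averageMillis(count, totalMillis));
        }
    }
}
```

In a real application the same loop over the MXBeans could be sampled periodically instead of once at exit.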

Once you have results for that baseline, and if it's not hitting your goals, there may be ways to tune to them.

I would suggest using the hotspot-gc-use OpenJDK mailing list for this, though; I'm not sure this list is appropriate for back-and-forth G1 tuning advice.


