Shenandoah: A pauseless GC for OpenJDK

886 views
Skip to first unread message

Michał Warecki

unread,
Jun 14, 2013, 3:45:38 AM6/14/13
to mechanica...@googlegroups.com
Hi!

Interesting, guys from Redhat are working on a new "Pauseless" (there still will be pause times but short) GC for OpenJDK:
http://rkennke.wordpress.com/2013/06/10/shenandoah-a-pauseless-gc-for-openjdk/

Looking forward for the concept and source :-)

Cheers,
Michał

Michael Barker

unread,
Jun 14, 2013, 4:06:20 AM6/14/13
to Michał Warecki, mechanica...@googlegroups.com
I will be interested to see what they come up with. However, I would
like to see them be a bit more ambitious. For a low latency system a
10ms is quite a long pause, we're around 5ms consistently with iCMS
and hopefully MUCH less once we get Azul into production.

Mike.
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-symp...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Michał Warecki

unread,
Jun 14, 2013, 4:18:30 AM6/14/13
to Michael Barker, mechanica...@googlegroups.com
I hope 10 ms is the worst case taking into account application which is not GC friendly.

PS.
As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK 9.


2013/6/14 Michael Barker <mik...@gmail.com>

Michael Barker

unread,
Jun 14, 2013, 4:24:55 AM6/14/13
to Michał Warecki, mechanica...@googlegroups.com
> PS.
> As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK
> 9.

Yes and I think it is a mistake on the part of Oracle. One of the
MANY reasons we (LMAX) purchased Zing.

Mike.

Richard Warburton

unread,
Jun 14, 2013, 5:07:25 AM6/14/13
to Michael Barker, Michał Warecki, mechanica...@googlegroups.com
Hi,

Yes and I think it is a mistake on the part of Oracle.  One of the
MANY reasons we (LMAX) purchased Zing.

This was discussed quite a bit on hotspot-gc-dev at the time.  It sounded like the decision had already been made, and despite a number of people with latency sensitive production systems or GC tuning expertise commenting on why this was a bad decision it went ahead.  There was a desire, stated at Javaone last year, for them to kill CMS itself and replace it with G1.  Despite there being a lot of ongoing debate as to whether G1 will ever fulfil its stated goals.

regards,

  Dr. Richard Warburton

Martin Thompson

unread,
Jun 14, 2013, 5:24:48 AM6/14/13
to mechanica...@googlegroups.com, Michael Barker, Michał Warecki
On Friday, June 14, 2013 10:07:25 AM UTC+1, Richard Warburton wrote:
Hi,

Yes and I think it is a mistake on the part of Oracle.  One of the
MANY reasons we (LMAX) purchased Zing.

This was discussed quite a bit on hotspot-gc-dev at the time.  It sounded like the decision had already been made, and despite a number of people with latency sensitive production systems or GC tuning expertise commenting on why this was a bad decision it went ahead.  There was a desire, stated at Javaone last year, for them to kill CMS itself and replace it with G1.  Despite there being a lot of ongoing debate as to whether G1 will ever fulfil its stated goals.

Given the mechanics of how G1 operates I do not believe it will ever be suitable for low-latency applications.  For example, maintaining the Remembered Sets makes minor GC more expensive and the stop-the-world regional compactions are always going to cost 10s-100s of milliseconds in the best case.

Low-latency can mean more than financial trading applications.  I'm seeing customers who cannot tolerate pauses that users can see in reactive interfaces (>150ms) and clustered environments where pauses make it appear that nodes are dead and have to be removed from a cluster.

Martin... 

Richard Warburton

unread,
Jun 14, 2013, 5:40:13 AM6/14/13
to Martin Thompson, mechanica...@googlegroups.com, Michael Barker, Michał Warecki
Hi,

Low-latency can mean more than financial trading applications.  I'm seeing customers who cannot tolerate pauses that users can see in reactive interfaces (>150ms) and clustered environments where pauses make it appear that nodes are dead and have to be removed from a cluster.

Yes - I've seen this before as well.  This can cause quite bad failover scenarios sometimes, especially if it happens when clusters are under heavy load or if there's a chain reaction to nodes being declared dead.  For example if the cluster has some concept of a master which is declared dead due to GC pause and it generates some kind of election protocol.  Now you can argue that there are architectural alternatives to scenarios like this, but well tuned GC is probably easier to achieve than rewriting your entire clustering technology.

Given the mechanics of how G1 operates I do not believe it will ever be suitable for low-latency applications.  For example, maintaining the Remembered Sets makes minor GC more expensive and the stop-the-world regional compactions are always going to cost 10s-100s of milliseconds in the best case.

You can tune the size of regions in G1.  If you make them smaller then your remembered sets will be smaller as well.  Of course, smaller regions have other knock-on effects on terms of G1 performance.

Martin Thompson

unread,
Jun 14, 2013, 5:51:05 AM6/14/13
to mechanica...@googlegroups.com, Martin Thompson, Michael Barker, Michał Warecki

On Friday, June 14, 2013 10:40:13 AM UTC+1, Richard Warburton wrote:

Given the mechanics of how G1 operates I do not believe it will ever be suitable for low-latency applications.  For example, maintaining the Remembered Sets makes minor GC more expensive and the stop-the-world regional compactions are always going to cost 10s-100s of milliseconds in the best case.

You can tune the size of regions in G1.  If you make them smaller then your remembered sets will be smaller as well.  Of course, smaller regions have other knock-on effects on terms of G1 performance.

In the spirit of this thread the point is that G1 is not really a concurrent collector (best way to be pauseless).  It is not concurrent for the minor collections or the evacuation of regions, both of which are extremely impactful in a low-latency environment.  Also we have not covered how bad it can be for large objects (>50% region size).  And by tuning down region size more objects fall into the large category.  By tuning we are just putting lipstick and pearls on the pig but there is no disguising it is still a pig :-)

Michał Warecki

unread,
Jun 14, 2013, 5:53:03 AM6/14/13
to Richard Warburton, Martin Thompson, mechanica...@googlegroups.com, Michael Barker
You can tune the size of regions in G1.  If you make them smaller then your remembered sets will be smaller as well.  Of course, smaller regions have other knock-on effects on terms of G1 performance.

Hmm, smaller? If you make them smaller then you have to increase number of regions. Increasing number of regions will increase probability of creating inter-generational pointers (maintained in RS). In the worst case there is probability (n-1 / n) * 100% of creating inter-generational pointer, where n is the number of regions.
I may be wrong (and probably I am) :-).

Gil Tene

unread,
Jun 14, 2013, 1:14:51 PM6/14/13
to
You are only wrong in that the worst case is 100%, with no (n-1)/n thing in front.

But your intuition is right. For oldgen, cross-region behavior is quite simple:

- There is no magic fairy that somehow makes (or keeps) it so surviving oldgen objects happen to be in the same region as the things that point to them.

- The smaller the region size is as a % of the heap, the more regions will tend to be pointing into any one region. It's a simple game of "roll the dice".

- Larger heaps and smaller regions both have the same effect: both will tend to increase the number of regions with references into any one region.

This is why capped-time incremental compaction work that is based on coarse which-region-points-to-which remembered set data tends to grow to the square of the heap size (assuming you are not willing to increase your increment pause time to match heap size growth). You can combat/cap that with per-object (instead of per-region) remembered sets, in which case the size of the remembered set remains the same for a given amount of object reference fields in the heap, but that gets expensive in other ways (and AFAIK is not what region-based remembered sets do at scale).

At this point, I would usually expect the common arguments of "but most of the time the object layout is friendly to X, and the stats are better than these degenerate cases would make you think" to come out.

Sadly, these only apply when what you do "most of the time" is run the SPECjbb benchmark. As a general notion, whenever people start to model oldgen behavior in some ways and assume that things outside the model are "only degenerate cases" terms, things tend to go very very bad for the poor people who live in the real world.

For example (to my knowledge) there is no documented systematic observation for inter-region relationship behaviors and tendencies for oldgen. There is certainly no magic bullet like the weak generational hypothesis thing for olden. There is (by definition) no locality starting point there, and the "most things die young" intuition is completely gone when it comes to the bulk of the heap (the young gen milks it for all it's got before the oldgen ever sees it). Once objects get past the generational filter, common movement patterns through oldgen includes things like "oldest things die first" (think LRU-like cache), "things live in chucks that die together but live for a long time" (think big computation units and sessions), and "object lifetime is semi-evenly spread across all live matter" (think of random evacuation in caches and of random mutation in stable-sized data sets). All of these make "degenerate" happen a lot.

Kirk Pepperdine

unread,
Jun 14, 2013, 4:03:33 PM6/14/13
to Michał Warecki, Michael Barker, mechanica...@googlegroups.com

On 2013-06-14, at 10:18 AM, Michał Warecki <michal....@gmail.com> wrote:

> I hope 10 ms is the worst case taking into account application which is not GC friendly.
>
> PS.
> As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK 9.

It wouldn't have made it to 8 'cept I complained very loudly..

-- Kirk

Kirk Pepperdine

unread,
Jun 14, 2013, 4:04:52 PM6/14/13
to Michael Barker, Michał Warecki, mechanica...@googlegroups.com
This echoed my message... but they've drunk the G1GC purple koolaid.. but still not to late to change things if anyone is interested in making more noise.

-- Kirk

Kirk Pepperdine

unread,
Jun 14, 2013, 4:06:35 PM6/14/13
to Richard Warburton, Michael Barker, Michał Warecki, mechanica...@googlegroups.com
I think I've figured out how to get G1 to perform about as good as CMS but it's still not as good as iCMS. But now there are some upcoming commits to CMS that are going to move the bar at least for the initial-mark. so....

-- Kirk

Michał Warecki

unread,
Jun 14, 2013, 6:09:28 PM6/14/13
to Kirk Pepperdine, mechanica...@googlegroups.com, Michael Barker, Richard Warburton

Yes, I've tested parallel initial mark patch and it seems to work. Very nice and very trivial improvement.

Scott Carey

unread,
Jun 15, 2013, 1:09:14 AM6/15/13
to mechanica...@googlegroups.com, Michael Barker


On Friday, June 14, 2013 1:18:30 AM UTC-7, Michał Warecki wrote:
I hope 10 ms is the worst case taking into account application which is not GC friendly.

PS.
As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK 9.



Well that's horrible.  G1GC has less than half the throughput on my app and WORSE median, and 99th percentile latency.   It does avoid the rare full, very slow GC that CMS can end up with.
 
2013/6/14 Michael Barker <mik...@gmail.com>
I will be interested to see what they come up with.  However, I would
like to see them be a bit more ambitious.  For a low latency system a
10ms is quite a long pause, we're around 5ms consistently with iCMS
and hopefully MUCH less once we get Azul into production.

Mike.

On 14 June 2013 19:45, Michał Warecki <michal....@gmail.com> wrote:
> Hi!
>
> Interesting, guys from Redhat are working on a new "Pauseless" (there still
> will be pause times but short) GC for OpenJDK:
> http://rkennke.wordpress.com/2013/06/10/shenandoah-a-pauseless-gc-for-openjdk/
>
> Looking forward for the concept and source :-)
>
> Cheers,
> Michał
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an

Kirk Pepperdine

unread,
Jun 15, 2013, 11:22:32 AM6/15/13
to Scott Carey, mechanica...@googlegroups.com, Michael Barker


PS.
As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK 9.



Well that's horrible.  G1GC has less than half the throughput on my app and WORSE median, and 99th percentile latency.   It does avoid the rare full, very slow GC that CMS can end up with.

Why?

-- Kirk

Scott Carey

unread,
Jun 15, 2013, 11:52:08 AM6/15/13
to Kirk Pepperdine, Michael Barker, mechanica...@googlegroups.com

Weak references.  

Mani Sarkar

unread,
Jun 15, 2013, 1:08:17 PM6/15/13
to mechanica...@googlegroups.com, Michael Barker, Michał Warecki
I was suggesting lobbying for the return of iCMS - but since my voice won't be heard I'll stay quiet!

Cheers
mani

Kirk Pepperdine

unread,
Jun 15, 2013, 3:05:24 PM6/15/13
to Mani Sarkar, mechanica...@googlegroups.com, Michael Barker, Michał Warecki
Hi Mani,

On it's on.. no.. my voice wasn't loud enough on it's own either and I tried to get others to chime in but no one really showed up... I intend to bring this up as an issue during our annual breakfast with Oracle execs just before JavaONE. and I'd say the more voices we have the more pull we will have.

Imagine if there were people in Oracle that were interested in implementing a pauseless collector and imagine if for some reason they were prevented from working on it. If that were the case how much assistance do you think Redhat might get. I say this because hotspot is to interwound into the memory management and the JIT that it's unlikely that Redhat will have much luck without tearing into that technology also. And it just might be that they will need the cooperation of some people that might have scuttled the plans for this other group in Oracle to build another collector.

Just a thought.
Kirk

Scott Carey

unread,
Jun 15, 2013, 3:56:51 PM6/15/13
to mechanica...@googlegroups.com, Kirk Pepperdine, Michael Barker
More details:

Imagine you have 400GB of data on disk (SSD) representing data that is needed to service requests.  Some of this access is rather random but some of the data is very 'hot' and frequently accessed.  The hot data is not specifically of any particular kind that you can easily partition off, its simply that the access patterns are not random, but skewed strongly.

Reading this data into an on heap LRU is a self-defeating proposition, the data set is significanly larger than RAM and although this improves average application latency and throughput somewhat due to avoiding work, garbage collection becomes very heavy-weight since an LRU by definition puts a lower bound on object lifetimes.  With the throughput collector or CMS, this causes lots of thrashing in the young generation and survivor spaces.  A larger heap makes GC times longer, not shorter, and eats into the OS's cache for the data on disk.

A soft reference cache is better than an LRU, but suffers from many of the same problems on the GC side.

In this application, a weak reference cache for such data is a massive performance win for the throughput collector and (i)CMS.  It incurs no extra GC overhead and young generation collections are easy to keep below target latency goals -- the throughput : GC-time tradeoff curve as a function of young generation size is smooth.

The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

(i)CMS has a more tunable young collector -- by tuning the size of the eden spaces and how many bounces before tenure, less is tenured.  (i)CMS lives much longer without a full GC than the throughput collector, but a full GC is over 10 seconds long, which is entirely unacceptable.

G1GC just doesn't seem to like this workload at all.

For this application, the ideal would be the ParNew young collector (or an upgraded throughput young collector that has functional tuning of tenuring thresholds), backed by a tenured heap that can avoid full GCs.  This tenured heap could be G1GC or a new concurrent collector.   With what is available now, a ParNew grafted in front of a G1GC for tenured space would probably be great for this application.  However I don't think G1's design works with a separate young generation in front of it.

I have little confidence that any voices outside of the ivory tower will be listened to.

Granted, I haven't tried to use and tune G1GC for this in over a year and a half, and perhaps it is better now.


On Saturday, June 15, 2013 8:52:08 AM UTC-7, Scott Carey wrote:

Weak references.  

Gil Tene

unread,
Jun 15, 2013, 5:05:13 PM6/15/13
to mechanica...@googlegroups.com, Kirk Pepperdine, Michael Barker
Silly question: Have you tried Zing? This is one of the prototypical workload cases C4 shines on, in both NewGen and Oldgen behavior.

And another question on the data:

The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally >nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

Does this really mean you see a 2 second GC breaking your 100msec bar only once in 5+ hours?

-- Gil.

Peter Lawrey

unread,
Jun 15, 2013, 6:23:42 PM6/15/13
to Scott Carey, mechanica...@googlegroups.com
The way I would deal with this is to keep the bulk of your data in memory mapped files.  This allows the OS to drop data in LRU style asynchronously and load data on demand transparently.  Most significantly it means most of your data is off heap. It is also very fast to remap on a restart. (A few 100 GBs per second)  I have worked on systems with 1 - 2 GB heap (mostly Eden) and 200 - 800 GB off heap in memory mapped files and by recycling objects there is little or no GC collections.  In this situation, the only tuning I did was to reduce the heap size to ensure I had the fastest Compressed Oops translation.

Peter.


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Kirk Pepperdine

unread,
Jun 15, 2013, 6:40:35 PM6/15/13
to Scott Carey, mechanica...@googlegroups.com, Michael Barker
I'm not sure that I completely understand what you're saying here. Comments below.

More details:

Imagine you have 400GB of data on disk (SSD) representing data that is needed to service requests.  Some of this access is rather random but some of the data is very 'hot' and frequently accessed.  The hot data is not specifically of any particular kind that you can easily partition off, its simply that the access patterns are not random, but skewed strongly.

There are reasonable techniques for managing this. Getting things off-heap using one technique or another helps.


Reading this data into an on heap LRU is a self-defeating proposition, the data set is significanly larger than RAM and although this improves average application latency and throughput somewhat due to avoiding work, garbage collection becomes very heavy-weight since an LRU by definition puts a lower bound on object lifetimes.  With the throughput collector or CMS, this causes lots of thrashing in the young generation and survivor spaces.  A larger heap makes GC times longer, not shorter, and eats into the OS's cache for the data on disk.

See Peter's comments for getting things off-heap.


A soft reference cache is better than an LRU, but suffers from many of the same problems on the GC side.

In this application, a weak reference cache for such data is a massive performance win for the throughput collector and (i)CMS.  It incurs no extra GC overhead and young generation collections are easy to keep below target latency goals -- the throughput : GC-time tradeoff curve as a function of young generation size is smooth.

iCMS only works in tenured space. If your cache only uses weak references these objects will be caught by the parnew. The overall handling costs for weak (soft, phantom, final) reference is some what more expensive than for a normal object.


The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

(i)CMS has a more tunable young collector -- by tuning the size of the eden spaces and how many bounces before tenure, less is tenured.  (i)CMS lives much longer without a full GC than the throughput collector, but a full GC is over 10 seconds long, which is entirely unacceptable.

I think there is more tuning you can do to either push out or eliminate the CMF. I've worked on (low latency) apps that ran for 2 weeks without a CMF.


G1GC just doesn't seem to like this workload at all.

Not with default configurations it won't.


For this application, the ideal would be the ParNew young collector (or an upgraded throughput young collector that has functional tuning of tenuring thresholds), backed by a tenured heap that can avoid full GCs.  This tenured heap could be G1GC or a new concurrent collector.   With what is available now, a ParNew grafted in front of a G1GC for tenured space would probably be great for this application.  However I don't think G1's design works with a separate young generation in front of it.

G1 is what I'd call a hybrid regional generational collector which completely different supporting data structures.


I have little confidence that any voices outside of the ivory tower will be listened to.

You'd be surprised. Two years ago we managed to influence Oracle to revamp a number of thing w.r.t. FX.

There are two beefs against iCMS. Test cycles and complexity in the implementation (which leads back into testing cycles).


Granted, I haven't tried to use and tune G1GC for this in over a year and a half, and perhaps it is better now.

They have made significant gains with G1. The biggest has been in backing off of some questionable decision like deciding to call for a mixed collection in every 4th young gen collection. The other is the smallest size eden can become given a max heap size. Both have bitten me when i've attempted to tune G1. The mixed to young ratio is now 8 but I'd be included to turn that off completely. The resize of eden was only needed for one very special case and I'll be surprised if that case ever comes up again. That said, recent use saw CMS yielding better results than with the G1. I'd attribute part of it to I'm much better at tuning CMS than I am at turning the G1.

-- Kirk

Scott Carey

unread,
Jun 15, 2013, 6:43:01 PM6/15/13
to mechanica...@googlegroups.com, Kirk Pepperdine, Michael Barker


On Saturday, June 15, 2013 2:05:13 PM UTC-7, Gil Tene wrote:
Silly question: Have you tried Zing? This is one of the prototypical workload cases C4 shines on, in both NewGen and Oldgen behavior.

Nope.  Last I heard, it had trouble with WeakReferences too (but that was... 5 years ago?).
 

And another question on the data:

The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally >nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

Does this really mean you see a 2 second GC breaking your 100msec bar only once in 5+ hours?


Every 20 to 40 minutes in the real world, every 2 minutes in a load test.

A full GC happens about once every 50,000 requests  -- this varies by a factor of two depending on tuning and the code deployed that week.  The percentile numbers are per request, at 80% capacity or so.  Outliers go down somewhat at real-world loads, because fewer requests are in flight when GC stops the world.
 

Scott Carey

unread,
Jun 15, 2013, 6:52:47 PM6/15/13
to mechanica...@googlegroups.com, Scott Carey


On Saturday, June 15, 2013 3:23:42 PM UTC-7, Peter Lawrey wrote:
The way I would deal with this is to keep the bulk of your data in memory mapped files.  This allows the OS to drop data in LRU style asynchronously and load data on demand transparently.  Most significantly it means most of your data is off heap. It is also very fast to remap on a restart. (A few 100 GBs per second)  I have worked on systems with 1 - 2 GB heap (mostly Eden) and 200 - 800 GB off heap in memory mapped files and by recycling objects there is little or no GC collections.  In this situation, the only tuning I did was to reduce the heap size to ensure I had the fastest Compressed Oops translation.

Peter.

What the WeakReferences hold is decoded, deserialized data.   This is where the big performance boost for using weak references comes from, by avoiding both the read from the OS or disk and the deserialization costs.  Heavily used objects will likely already be in processor caches.

Peter Lawrey

unread,
Jun 15, 2013, 6:59:49 PM6/15/13
to Scott Carey, mechanica...@googlegroups.com
I use the memory mapped files in such a way that the data can be used without deserializing it.  There are interfaces with getters and setters so the calling code is normal Java but in reality they are just pointing to places in the file.  You can have a look at Javolution Structs to get an idea of what I mean, but I roll my own ;)  Notionally all the data is in memory at once, all the time, so there is no need for WeakReferences AFAICS.  Looking up data this way avoids needing deserialization, creating objects, systems calls or worrying about when they might go away.


Mani Sarkar

unread,
Jun 15, 2013, 7:11:26 PM6/15/13
to Kirk Pepperdine, mechanica...@googlegroups.com, Michael Barker, Michał Warecki
Kirk, 

now you know I'm also vocal about it and supporting the same cause, I also oppose the removal of PermGen, although not fully sure if I know why I'm opposing it, there's mixed messages about it. From what you say its all political rather than being developer-friendly reasons.

Please do mention at your meet that you have a whole lot of people behind you (including Mani from London) that stand for revival of iCMS! Its not smart to remove something that works and gives great results and replace it with something that isn't credible yet - its like fixing something that isn't broken!

Cheers,
mani
--
Twitter: @theNeomatrix369          Blog: http://neomatrix369.wordpress.com
JUG activity: LJC Advocate (@adoptopenjdk & @adoptajsr programs)
Devoxx UK 2013 was a grand success: http://www.devoxx.com/display/UK13/Home

Don't chase success, rather aim for "Excellence", and success will come chasing after you!

Scott Carey

unread,
Jun 15, 2013, 8:06:38 PM6/15/13
to mechanica...@googlegroups.com, Scott Carey, Michael Barker


On Saturday, June 15, 2013 3:40:35 PM UTC-7, Kirk Pepperdine wrote:
I'm not sure that I completely understand what you're saying here. Comments below.

More details:

Imagine you have 400GB of data on disk (SSD) representing data that is needed to service requests.  Some of this access is rather random but some of the data is very 'hot' and frequently accessed.  The hot data is not specifically of any particular kind that you can easily partition off, its simply that the access patterns are not random, but skewed strongly.

There are reasonable techniques for managing this. Getting things off-heap using one technique or another helps.


Already done, this is the performance boost above and beyond that -- reading data, unless it is very plain data, is not cheap, compared to accessing a weak reference to already decoded data.   Additionally, the process of decoding creates a little garbage in the process, so locating objects in the cache reduces the garbage created.   The extra memory taken up by the weak references and cache structures is less than the garbage saved not allocating identical copies of objects.  Less garbage leads to less frequent minor gcs and therefore longer life of weakly referenced objects, and therefore more garbage savings.  Overall, this change had a large, profound, compound effect in making our application use less CPU on the application side, and less time in GC by reducing the work required there.

In this application, a weak reference cache for such data is a massive performance win for the throughput collector and (i)CMS.  It incurs no extra GC overhead and young generation collections are easy to keep below target latency goals -- the throughput : GC-time tradeoff curve as a function of young generation size is smooth.

iCMS only works in tenured space. If your cache only uses weak references these objects will be caught by the parnew. The overall handling costs for weak (soft, phantom, final) reference is some what more expensive than for a normal object.

Let me state things more clearly, I believe I have not explained well:
There are two things going on.  The first is long application pause times, caused by full GCs and to a lesser extend minor GCs.  The second is overall application performance, accelerated by heavy use of caches with weakly referenced values.  These are related, but depending on which collector we are talking about they may be more or less intertwined.

The reason why this works so well is because the objects in the cache are never tenured.  The caches do not have poor GC side-effects for CMS or the throughput collector.  It is _cheaper_ to throw these out and regenerate as a consequence of minor GCs than to tenure them and collect them later.  The young-gc options available outside of G1 allow for weakly referenced objects to live long enough to provide big caching benefits.  The lifetime of weakly reference objects for these collectors is controllable by proxy of the size of the young generation (and by optimizing our code to reduce garbage creation).  

I think G1GC removes weakly referenced objects too fast, negating the benefit of the caching.  Something else is going pathologically wrong as well, because gc induced pauses were longer with G1GC than the throughput collector.

I don't know of an alternate strategy to get the same caching benefit without giving the GC a lot more work to do.

The cost of an LRU or other 'long life' cache is that the objects in the cache, even those rarely accessed, are highly likely to pass through to the tenured heap.  This adds GC overhead, and so the benefit of caching the average item must outweigh this cost.
The cost of a weak reference cache is very, very small -- a slight increase in heap use and young collection cost.  The benefit is also less, since the lifetime of a cache entry is shorter.  However, if the cost of generating a cache entry is small enough, and there is enough skew in the access patterns to provide good hit rate in a short period of time.

I guess it is a little of a goldilocks situation-- the cost of populating cache entires has to be not so high that expelling it every young GC is a sane thing to do, yet high enough that the amortized savings of avoiding populating it are a win.


The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

(i)CMS has a more tunable young collector -- by tuning the size of the eden spaces and how many bounces before tenure, less is tenured.  (i)CMS lives much longer without a full GC than the throughput collector, but a full GC is over 10 seconds long, which is entirely unacceptable.

I think there is more tuning you can do to either push out or eliminate the CMF. I've worked on (low latency) apps that ran for 2 weeks without a CMF.

Probably, I have a go at it about once a year.   But if if it is going to be removed, I am less inclined to try.   At this point, I know we'll have to re-architect some bits of the system that are causing fragmentation and churn in the tenured space (which have nothing to do with the weak reference caches).     If the full-gc in CMS was closer to the performance of the throughput collectors full GC, it would be worth trying again.   Just one 10 second+ full GC is game over.

G1GC just doesn't seem to like this workload at all.

Not with default configurations it won't.

What would you suggest?  I tried quite a few a while back with no luck.  That was about 18 months ago, just after it stopped core-dumping on this workload (yes, I filed bugs)  :)

For this application, the ideal would be the ParNew young collector (or an upgraded throughput young collector that has functional tuning of tenuring thresholds), backed by a tenured heap that can avoid full GCs.  This tenured heap could be G1GC or a new concurrent collector.   With what is available now, a ParNew grafted in front of a G1GC for tenured space would probably be great for this application.  However I don't think G1's design works with a separate young generation in front of it.

G1 is what I'd call a hybrid regional generational collector which completely different supporting data structures.

I have little confidence that any voices outside of the ivory tower will be listened to.

You'd be surprised. Two years ago we managed to influence Oracle to revamp a number of thing w.r.t. FX.

There are two beefs against iCMS. Test cycles and complexity in the implementation (which leads back into testing cycles).

Granted, I haven't tried to use and tune G1GC for this in over a year and a half, and perhaps it is better now.

They have made significant gains with G1. The biggest has been in backing off of some questionable decision like deciding to call for a mixed collection in every 4th young gen collection. The other is the smallest size eden can become given a max heap size. Both have bitten me when i've attempted to tune G1. The mixed to young ratio is now 8 but I'd be included to turn that off completely. The resize of eden was only needed for one very special case and I'll be surprised if that case ever comes up again. That said, recent use saw CMS yielding better results than with the G1. I'd attribute part of it to I'm much better at tuning CMS than I am at turning the G1.

There certainly are a lot more resources available that describe how to tune CMS.  I've mostly touched the big ones -- getting it to start to gc in the background earlier, and a lot of tuning of the ParNew in front of it to limit what makes it to the tenured space.

With G1, I might be hitting something pathological.  If I could convince it to have a reasonably large eden space and hold on to WeakReferences longer that would help.  I have long suspected that since such a large percentage of what is in eden is objects only weakly referenced, its heuristics see it all as garbage leading to very short lifetimes for weakly referenced objects.   However, whatever is going on is more than that, it could not even hit a 100ms pause time goal.  Since the throughput collector does that for young gcs (~50ms, 1GB young gen, SandyBridge) that is what we're using for now.

I'll probably give it a spin again in a few moths time during another profile/tune pass of the application.

Scott Carey

unread,
Jun 15, 2013, 8:26:30 PM6/15/13
to mechanica...@googlegroups.com, Scott Carey


On Saturday, June 15, 2013 3:59:49 PM UTC-7, Peter Lawrey wrote:
I use the memory mapped files in such a way that the data can be used without deserializing it.  There are interfaces with getters and setters so the calling code is normal Java but in reality they are just pointing to places in the file.  You can have a look at Javolution Structs to get an idea of what I mean, but I roll my own ;)  Notionally all the data is in memory at once, all the time, so there is no need for WeakReferences AFAICS.  Looking up data this way avoids needing deserialization, creating objects, systems calls or worrying about when they might go away.


Unfortunately (or fortunately?) much of my data is encoded in a compact form (e.g. variable length integers), and other bits are strings which either take up too much space or require decoding, or arrays of complex objects.  Computing where a field is in a set of bytes is expensive.  The benefit is that it takes up half the space on disk as it would with straight-forward serialization of types with fixed sizes, etc.  SSDs are expensive, getting the most out of their space (and the OS cache) is critical.   Another issue is objects where schemas can evolve over time -- so that data and code changes do not need to be in sync -- this adds even more complexity to locating the bytes corresponding to a field.  It ends up being cheaper to read a set of fields at once than probe them one at a time in this case, so the whole thing is stuffed into an object for later re-use.

I do have some use cases that would work with the sort of access you describe, it might come in handy later.

 

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Gil Tene

unread,
Jun 15, 2013, 10:03:54 PM6/15/13
to mechanica...@googlegroups.com, Kirk Pepperdine, Michael Barker
On Saturday, June 15, 2013 6:43:01 PM UTC-4, Scott Carey wrote:


On Saturday, June 15, 2013 2:05:13 PM UTC-7, Gil Tene wrote:
Silly question: Have you tried Zing? This is one of the prototypical workload cases C4 shines on, in both NewGen and Oldgen behavior.

Nope.  Last I heard, it had trouble with WeakReferences too (but that was... 5 years ago?).

Well, you heard wrong....

In fact, to my knowledge C4 is the only collector available right now that actually deals well with WeakReferences at volume, and at velocity, as all other collectors available for Java today will process wekarefs in a pause that grows with the number of refs. In contrast, C4 does weak (and soft and phantom) reference processing completely concurrently while supporting full ref semantics. As a result, no matter how many weak references you keep and how fast you access them or have them die off, you won't see growing stop-the-world pauses. We've got people using Zing with exactly the WeakRef cache setup you talk about, and easily holding multi-GB stuff (even up to 10s or 100s of GB) in their caches, with tens or hundreds of millions of weak ref objects, and with lots of stuff moving through in LRU or LRU-like behaviors. Both OldGen and NewGen in C4 handle this stuff concurrently without breaking a sweat.

So you might want to experiment with it. Most people that try it find that they just stop worrying about all this other stuff we're talking about, including on-heap vs. off heap issues. On heap ends up simply being faster through sheer work elimination (natural objects, no work done for i/o or deserialization/hydration, etc.). "Tuning" also becomes a breeze in the sense that it mostly disappears; especially those "depending on the code deployed that week" issues.
 
 

And another question on the data:

The application is latency sensitive, but not terribly so -- the median request is about 3ms.  Tolerance at the 99th percentile is about 40ms, and ideally >nothing should ever take more than 100ms.   With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

Does this really mean you see a 2 second GC breaking your 100msec bar only once in 5+ hours?


Every 20 to 40 minutes in the real world, every 2 minutes in a load test.

A full GC happens about once every 50,000 requests  -- this varies by a factor of two depending on tuning and the code deployed that week.  The percentile numbers are per request, at 80% capacity or so.  Outliers go down somewhat at real-world loads, because fewer requests are in flight when GC stops the world.

The reason I asked is that the numbers don't add up up (the 99.99% thing and the rest of the statements) unless you don't count the pause time in the thing you measure %'iles on. Unfortunately, many measurement systems (including jMeter, grinder, HP LoadRunner, etc.) make exactly that "simple" mathematical omission mistake (assuming that this small "noise" has no significant effect). I call this measurement issue "coordinated omission" - the artificial/"accidental" erasure or avoidance of measuring of bad behavior during a complete pause.

The sanity checking for this is simple: using your numbers above (2 second pauses every 20-40 minutes), you basically know that at least 0.08% to 0.16% of your overall wall clock time is spent in a 2 second complete-program pause (that's without counting anything else, like the multitude of newgen pauses that occur in those 20 minutes, or your actual processing and queuing behavior under load outside of any pauses). This means that your starting point for  ~99.9% (and not the 99.99%), even if your work outside of pauses amounted to zero, and even if you had no newgen pauses at all, simply can't be better than ~1-2 seconds, and you are failing your 100msec bar in at least 1 per 1,000. Reality is probably much worse, but it won't show in the measurements unless your load generator or stats keeping keeps measuring request latencies at a stable rate even during pauses.

Another way to see the same effect would be in your "once every 50,000 requests" number. Assuming that represents 20-40 minutes of runtime, I extrapolate 20-40+ requests per second. This means that a 2 second pause would affect 40-80 requests, which (out of ~50,000 requests) would similarly account for the same percentile estimates...

I you are interested, you can watch me rant on this and related latency measurement issues at http://www.infoq.com/author/Gil-Tene under "How NOT to measure latency". You can also try to use jHiccup (which I talk about in that talk) to establish a "can't possibly be better than this" percentile baseline for your JVM as it runs. 

Gil Tene

unread,
Jun 15, 2013, 11:01:30 PM6/15/13
to mechanica...@googlegroups.com, Scott Carey, Michael Barker
Wow. That's quite a bit of work trying to keep things from tenuring, but still keep them available enough in a cache to have an impact. And you probably end up with those spikes in load every time newgen batch-clears all your nicely cached values, as cache misses will abound then and sustainable throughput will drop until the caches are warmed up again.

There is a better, simpler way, but it starts with giving up on working around and "tuning" broken GC mechanisms, and just using one that naturally works for long lived object scenarios. Once pauses are no longer something you worry about, you stop worrying about tenuring (it improves efficiency rather than reduce it at that point), and the actual CPU cost of GC becomes trivial to deal with. With Zing you get to directly control GC effort for a given live set size and allocation rate by making sure a good amount of empty memory is there in the heap to keep the collector as efficient as you want it to be. This doesn't actually take much modeling or analysis. you just look at the % of wall clock time during which GC is actually [concurrently] active, and increase the amount of empty heap until the ratio is one that you like [i.e. one where the CPU spent on GC no longer bothers you]. We call this "GC utilization %", and most people get pretty happy when the GC utilization is at 5-10% or less. Since GC utilization % (in Zing) tends to drop in half for every doubling of empty memory in the heap, it usually doesn't take much to get to a happy state. 

I've seen all sorts of variations in how people do this sort of in-heap, ready-to-work-on object caching with with Zing. Anything from a simple fixed size (but very large) cache with random or LRU-like eviction, to multi-level cache things that tend to perform extremely well (OS kernel file caches tend to use the latter technique type for a good reason). The scalable LRU-like caches I see usually use a victim cache of some sort and not a proper LRU, since a victim cache is usually just as effective but much cheaper in runtime and contention co