Yes and I think it is a mistake on the part of Oracle. One of the
MANY reasons we (LMAX) purchased Zing.
Hi,

This was discussed quite a bit on hotspot-gc-dev at the time. It sounded like the decision had already been made, and despite a number of people with latency-sensitive production systems or GC tuning expertise commenting on why this was a bad decision, it went ahead. There was a desire, stated at JavaOne last year, for them to kill CMS itself and replace it with G1, despite there being a lot of ongoing debate as to whether G1 will ever fulfil its stated goals.
Low latency can mean more than financial trading applications. I'm seeing customers who cannot tolerate pauses that users can see in reactive interfaces (>150ms), and clustered environments where pauses make it appear that nodes are dead and have to be removed from the cluster.
Given the mechanics of how G1 operates I do not believe it will ever be suitable for low-latency applications. For example, maintaining the Remembered Sets makes minor GC more expensive and the stop-the-world regional compactions are always going to cost 10s-100s of milliseconds in the best case.
You can tune the size of regions in G1. If you make them smaller, your remembered sets will be smaller as well. Of course, smaller regions have other knock-on effects in terms of G1 performance.
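For anyone who wants to experiment: region size is controlled by a single HotSpot flag (a power of two between 1m and 32m). An illustrative command line -- the heap sizes here are made up for the example:

    java -XX:+UseG1GC -XX:G1HeapRegionSize=2m -Xms8g -Xmx8g MyApp

If the flag is omitted, G1 picks a region size automatically based on the heap size.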
Yes, I've tested the parallel initial mark patch and it seems to work. A very nice and very trivial improvement.
I hope 10 ms is the worst case, taking into account applications that are not GC friendly.
PS.
As far as I know iCMS will be deprecated in OpenJDK 8 and removed in OpenJDK 9.
2013/6/14 Michael Barker <mik...@gmail.com>
I will be interested to see what they come up with. However, I would
like to see them be a bit more ambitious. For a low-latency system,
10ms is quite a long pause; we're around 5ms consistently with iCMS,
and hopefully MUCH less once we get Azul into production.
Mike.
On 14 June 2013 19:45, Michał Warecki <michal....@gmail.com> wrote:
> Hi!
>
> Interesting, guys from Red Hat are working on a new "Pauseless" GC (there
> will still be pause times, but short ones) for OpenJDK:
> http://rkennke.wordpress.com/2013/06/10/shenandoah-a-pauseless-gc-for-openjdk/
>
> Looking forward to the concept and source :-)
>
> Cheers,
> Michał
>
Well, that's horrible. G1GC has less than half the throughput on my app and WORSE median and 99th-percentile latency. It does avoid the rare, very slow full GC that CMS can end up with.
Weak references.
The application is latency sensitive, but not terribly so -- the median request is about 3ms. Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms. With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.
More details:
Imagine you have 400GB of data on disk (SSD) representing data that is needed to service requests. Some of this access is rather random, but some of the data is very 'hot' and frequently accessed. The hot data is not of any particular kind that you can easily partition off; it's simply that the access patterns are not random, but strongly skewed.
Reading this data into an on-heap LRU is a self-defeating proposition: the data set is significantly larger than RAM, and although avoiding work this way improves average application latency and throughput somewhat, garbage collection becomes very heavyweight, since an LRU by definition puts a lower bound on object lifetimes. With the throughput collector or CMS, this causes lots of thrashing in the young generation and survivor spaces. A larger heap makes GC times longer, not shorter, and eats into the OS's cache for the data on disk.
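To make the "lower bound on object lifetimes" point concrete, here is a minimal sketch of such an on-heap LRU in Java (names are illustrative):

    import java.util.LinkedHashMap;
    import java.util.Map;

    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruCache(int maxEntries) {
            super(16, 0.75f, true); // true = access order, i.e. LRU
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // Every entry is held strongly until evicted here, so cached
            // objects are guaranteed to survive young collections and be
            // tenured -- the lower bound on lifetimes described above.
            return size() > maxEntries;
        }
    }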
A soft reference cache is better than an LRU, but suffers from many of the same problems on the GC side.
In this application, a weak reference cache for such data is a massive performance win for the throughput collector and (i)CMS. It incurs no extra GC overhead and young generation collections are easy to keep below target latency goals -- the throughput : GC-time tradeoff curve as a function of young generation size is smooth.
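A minimal sketch of such a weak-reference cache (illustrative; a production version would also prune cleared entries, e.g. via a ReferenceQueue):

    import java.lang.ref.WeakReference;
    import java.util.concurrent.ConcurrentHashMap;

    class WeakCache<K, V> {
        private final ConcurrentHashMap<K, WeakReference<V>> map =
                new ConcurrentHashMap<>();

        V get(K key) {
            WeakReference<V> ref = map.get(key);
            return ref == null ? null : ref.get(); // null once GC cleared it
        }

        void put(K key, V value) {
            // The collector may reclaim the value whenever nothing else
            // holds it strongly, so the cache imposes no lower bound on
            // object lifetimes -- unlike the LRU above.
            map.put(key, new WeakReference<>(value));
        }
    }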
The application is latency sensitive, but not terribly so -- the median request is about 3ms. Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms. With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.
(i)CMS has a more tunable young collector -- by tuning the size of the eden spaces and how many bounces before tenure, less is tenured. (i)CMS lives much longer without a full GC than the throughput collector, but a full GC is over 10 seconds long, which is entirely unacceptable.
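For reference, the knobs described map roughly onto these HotSpot flags (the values here are invented for illustration, and iCMS additionally requires -XX:+CMSIncrementalMode):

    java -XX:+UseConcMarkSweepGC \
         -XX:NewSize=512m -XX:MaxNewSize=512m \
         -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=8 \
         MyApp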
G1GC just doesn't seem to like this workload at all.
For this application, the ideal would be the ParNew young collector (or an upgraded throughput young collector that has functional tuning of tenuring thresholds), backed by a tenured heap that can avoid full GCs. This tenured heap could be G1GC or a new concurrent collector. With what is available now, a ParNew grafted in front of a G1GC for tenured space would probably be great for this application. However I don't think G1's design works with a separate young generation in front of it.
I have little confidence that any voices outside of the ivory tower will be listened to.
Granted, I haven't tried to use and tune G1GC for this in over a year and a half, and perhaps it is better now.
Silly question: Have you tried Zing? This is one of the prototypical workload cases C4 shines on, in both NewGen and Oldgen behavior.
And another question on the data:

> The application is latency sensitive, but not terribly so -- the median request is about 3ms. Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms. With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.

Does this really mean you see a 2 second GC breaking your 100msec bar only once in 5+ hours?
The way I would deal with this is to keep the bulk of your data in memory-mapped files. This allows the OS to drop data in LRU style asynchronously and load data on demand transparently. Most significantly, it means most of your data is off heap. It is also very fast to remap on a restart (a few hundred GB per second). I have worked on systems with a 1-2 GB heap (mostly Eden) and 200-800 GB off heap in memory-mapped files, and by recycling objects there are few or no GC collections. In this situation, the only tuning I did was to reduce the heap size to ensure I had the fastest Compressed Oops translation.

Peter.
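A minimal sketch of the memory-mapped approach Peter describes (file name, size, and offsets are made up for illustration):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class MappedData {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw");
                 FileChannel channel = file.getChannel()) {
                // Map 1 GB of the file. The OS pages data in on demand and
                // drops cold pages under memory pressure -- none of this
                // lives on the Java heap.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 1L << 30);
                long value = buffer.getLong(0);  // read a field at a known offset
                buffer.putLong(8, value + 1);    // write with no (de)serialization
            }
        }
    }

(Each individual mapping is limited to 2 GB, so covering a few hundred GB means maintaining an array of MappedByteBuffers.)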
I'm not sure that I completely understand what you're saying here. Comments below.

> More details:
>
> Imagine you have 400GB of data on disk (SSD) representing data that is needed to service requests. Some of this access is rather random, but some of the data is very 'hot' and frequently accessed. The hot data is not of any particular kind that you can easily partition off; it's simply that the access patterns are not random, but strongly skewed.

There are reasonable techniques for managing this. Getting things off-heap using one technique or another helps.

> In this application, a weak reference cache for such data is a massive performance win for the throughput collector and (i)CMS. It incurs no extra GC overhead and young generation collections are easy to keep below target latency goals -- the throughput : GC-time tradeoff curve as a function of young generation size is smooth.

iCMS only works in tenured space. If your cache only uses weak references, these objects will be caught by ParNew. The overall handling cost for a weak (soft, phantom, final) reference is somewhat more expensive than for a normal object.

> The application is latency sensitive, but not terribly so -- the median request is about 3ms. Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms. With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.
>
> (i)CMS has a more tunable young collector -- by tuning the size of the eden spaces and how many bounces before tenure, less is tenured. (i)CMS lives much longer without a full GC than the throughput collector, but a full GC is over 10 seconds long, which is entirely unacceptable.

I think there is more tuning you can do to either push out or eliminate the CMF (concurrent mode failure). I've worked on (low-latency) apps that ran for 2 weeks without a CMF.

> G1GC just doesn't seem to like this workload at all.

Not with default configurations it won't.

> For this application, the ideal would be the ParNew young collector (or an upgraded throughput young collector that has functional tuning of tenuring thresholds), backed by a tenured heap that can avoid full GCs. This tenured heap could be G1GC or a new concurrent collector. With what is available now, a ParNew grafted in front of a G1GC for tenured space would probably be great for this application. However I don't think G1's design works with a separate young generation in front of it.

G1 is what I'd call a hybrid regional generational collector, with completely different supporting data structures.

> I have little confidence that any voices outside of the ivory tower will be listened to.

You'd be surprised. Two years ago we managed to influence Oracle to revamp a number of things w.r.t. FX. There are two beefs against iCMS: test cycles, and complexity in the implementation (which leads back into testing cycles).

> Granted, I haven't tried to use and tune G1GC for this in over a year and a half, and perhaps it is better now.

They have made significant gains with G1. The biggest has been backing off some questionable decisions, like calling for a mixed collection on every 4th young gen collection. The other is the smallest size eden can become given a max heap size. Both have bitten me when I've attempted to tune G1. The mixed-to-young ratio is now 8, but I'd be inclined to turn that off completely. The eden resize was only needed for one very special case, and I'll be surprised if that case ever comes up again. That said, recent use saw CMS yielding better results than G1. I'd attribute part of that to the fact that I'm much better at tuning CMS than I am at tuning G1.
I use the memory-mapped files in such a way that the data can be used without deserializing it. There are interfaces with getters and setters, so the calling code is normal Java, but in reality they are just pointing to places in the file. You can have a look at Javolution Structs to get an idea of what I mean, but I roll my own ;) Notionally all the data is in memory at once, all the time, so there is no need for WeakReferences AFAICS. Looking up data this way avoids deserialization, creating objects, system calls, or worrying about when they might go away.
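A minimal sketch of that flyweight style (field names and offsets invented for illustration; Javolution's actual Struct API differs):

    import java.nio.MappedByteBuffer;

    // Looks like a normal object to calling code, but each getter/setter
    // just reads or writes a fixed offset inside the mapped file -- no
    // deserialization and no per-record heap allocation.
    class TradeFlyweight {
        static final int PRICE_OFFSET = 0;     // 8 bytes (double)
        static final int QUANTITY_OFFSET = 8;  // 4 bytes (int)
        static final int RECORD_SIZE = 12;

        private final MappedByteBuffer buffer;
        private int base; // byte offset of the record currently pointed at

        TradeFlyweight(MappedByteBuffer buffer) { this.buffer = buffer; }

        TradeFlyweight moveTo(int index) {
            this.base = index * RECORD_SIZE;
            return this;
        }

        double getPrice()       { return buffer.getDouble(base + PRICE_OFFSET); }
        void setPrice(double p) { buffer.putDouble(base + PRICE_OFFSET, p); }
        int getQuantity()       { return buffer.getInt(base + QUANTITY_OFFSET); }
        void setQuantity(int q) { buffer.putInt(base + QUANTITY_OFFSET, q); }
    }

One flyweight instance can be recycled across millions of records, which is how this style ends up generating little or no garbage.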
On Saturday, June 15, 2013 2:05:13 PM UTC-7, Gil Tene wrote:

> Silly question: Have you tried Zing? This is one of the prototypical workload cases C4 shines on, in both NewGen and OldGen behavior.
Nope. Last I heard, it had trouble with WeakReferences too (but that was... 5 years ago?).
> And another question on the data:
>
>> The application is latency sensitive, but not terribly so -- the median request is about 3ms. Tolerance at the 99th percentile is about 40ms, and ideally nothing should ever take more than 100ms. With the throughput collector the first two goals are met, but the last one is not -- a full GC of about 2 seconds leads to high latency somewhere past the 99.99th percentile.
>
> Does this really mean you see a 2 second GC breaking your 100msec bar only once in 5+ hours?
Every 20 to 40 minutes in the real world, every 2 minutes in a load test.
A full GC happens about once every 50,000 requests -- this varies by a factor of two depending on tuning and the code deployed that week. The percentile numbers are per request, at 80% capacity or so. Outliers go down somewhat at real-world loads, because fewer requests are in flight when GC stops the world.
Using weak refs to make sure that things which are not otherwise strongly held actually die is very reliable across all available JVMs and collectors I know of, and will produce the exact same behavior.

Kirk is correct in that there are subtle concurrent variances in weakref handling, but these exist across ALL collectors. They have to do with how WeakRefs are handled when the objects they point to are made intermittently "strongly alive" during an actual GC cycle (this is not a real problem for caches). Those subtle differences only show up when concurrency races arise in the program's own treatment of the objects, e.g. when the program keeps flipping the liveness strength of a weakly referenced object back and forth while GC is ongoing, by doing get() calls on the weak refs and storing the results in object fields which are later overwritten. In such a situation, where the liveness itself changes rapidly enough to flip during a GC cycle, weak refs may or may not be cleared during that cycle, depending on how your dice roll in the concurrency race.

Behavior for this race differs between ParallelGC, CMS, G1, J9's concurrent and balanced collectors, and Zing. The complete stop-the-world-for-everything collectors (ParallelGC) keep things "simple": nothing changes during a GC cycle, ever, so no change in liveness strength during collection can occur. All the other collectors listed use at least some form of a partially concurrent marking pass, exposing them to the program's concurrent liveness changes. If the liveness strength of your weakref'd object changes back and forth during the GC cycle itself (e.g. when your program actively strengthens a weak ref by doing a get() and temporarily storing the result somewhere in a strong ref, removing it later), the weakref has some likelihood of being considered by the collector to point to a strongly live object, and of being kept alive in that specific GC cycle. The likelihood of this happening varies by programming pattern, by collector algorithm, and by heap size and load. It's basically a concurrency race and a roll of the dice on timing. G1, for example, has a much higher likelihood of exposing this than CMS.

Luckily, you don't see this sort of concurrency race (of rapidly-kept-alive objects still wanting to die) in any of the caching use patterns, since caches [almost by definition] want those objects to survive. You only see this sort of race in systems that use weak refs for registering interest in event delivery but want the interested objects to still be able to die without having to de-register their interest, even while they are receiving rapid event notifications through temporary strong references. For such applications, rapid enough event delivery can effectively make the dice never fall the right way to clear the refs to the receiving object.
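The event-delivery pattern Gil describes looks roughly like this (a hypothetical sketch; the names are invented):

    import java.lang.ref.WeakReference;

    class EventSource {
        interface Listener { void onEvent(Object event); }

        private final WeakReference<Listener> listenerRef;
        private Listener current; // strong ref, held only while dispatching

        EventSource(Listener listener) {
            this.listenerRef = new WeakReference<>(listener);
        }

        void deliver(Object event) {
            // get() temporarily strengthens the referent. If this overlaps
            // a concurrent marking pass, the collector may see the object
            // as strongly live and skip clearing the weak ref that cycle --
            // so under rapid enough delivery it may never be cleared.
            current = listenerRef.get();
            if (current != null) {
                current.onEvent(event);
            }
            current = null; // weaken again until the next event
        }
    }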
On Sunday, June 16, 2013 5:38:54 AM UTC-4, Kirk Pepperdine wrote: