How does object graph layout impact GC performance?


Chris Vest

Dec 22, 2014, 1:35:55 PM
to mechanica...@googlegroups.com
This is perhaps not something one would normally think about, but I guess it might be worth considering if you have single data structures taking up multiple hundreds of MBs of memory or more, and also want low GC pause times regardless of which GC is configured.

For instance, if I have an array with millions of objects - the whole thing effectively eternal with respect to the lifetime of the application - what effect on GC pause times might I reasonably expect if I turned it into a linked list by giving each object a next-pointer?
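For concreteness, the two layouts I mean look roughly like this (a minimal sketch; the Node class and field names are just illustrative):

```java
public class Layouts {
    static class Node {
        long payload; // stand-in for the object's real data
        Node next;    // intrusive link, used only in layout B
    }

    public static void main(String[] args) {
        int n = 1_000_000;

        // Layout A: one large array of references. All the reference
        // slots live contiguously inside a single array object.
        Node[] asArray = new Node[n];
        for (int i = 0; i < n; i++) {
            asArray[i] = new Node();
        }

        // Layout B: the same objects threaded into an intrusive linked
        // list; each reference slot lives inside a different Node.
        Node head = null;
        for (int i = 0; i < n; i++) {
            Node node = new Node();
            node.next = head;
            head = node;
        }
    }
}
```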

Cheers,
Chris

Jean-Philippe BEMPEL

Dec 23, 2014, 9:42:46 AM
to mechanica...@googlegroups.com
Are you considering minor GC pause time or Full GC pause time?

Thanks

Chris Vest

Dec 23, 2014, 9:53:19 AM
to mechanica...@googlegroups.com
I’m mostly considering the full GC pauses. As I hinted at, I have a case where these objects would have a lifetime that is pretty much as long as that of the application process. They would tenure quickly in the early life of the application, and then not impact the newgen anymore. Well, they can have short-lived references to objects that (hopefully) live and die in the newgen.

Cheers,
Chris


Jean-Philippe BEMPEL

Dec 23, 2014, 9:59:53 AM
to mechanica...@googlegroups.com
All live objects in tenured space impact minor GC because of the Card Scanning/Marking.




Vitaly Davidovich

Dec 23, 2014, 10:08:23 AM
to mechanica...@googlegroups.com
For old gen GC, I think you simply need to keep the object count down to reduce marking time. Once the eternal objects are compacted in the tenured space they're not moved around anymore, so their size doesn't play a big role in the GC cost itself, but it will of course leave less room available in old gen, causing more frequent full GCs if the promotion rate is sufficiently high. The object graph clearly plays a role here: if your objects are dense with pointers, there are more heap objects to traverse.

In the case of an array of references vs. intrusively linked objects, I don't see much difference, except that card marking would perhaps require more cards to be dirtied in the case of the list objects (I'm assuming storing into the array would dirty the one card where the array oop lives, but I'd need to double-check that).

Old gen objects having references to younger gen will involve card marking, but that would be no different whether they're stored in the array or intrusively linked.

Olivier Bourgain

Dec 23, 2014, 10:49:05 AM
to mechanica...@googlegroups.com
> All live objects in tenured space impact minor GC because of the Card Scanning/Marking.

I may be wrong, but I think that cards are marked only when a reference in old gen is written and refers to an object in young gen.
Scanning the cards will always take some time, but if no cards are marked, there is no need to scan the related objects for old-to-young refs.

So I think that having a huge number of objects in old gen with no references to young gen is not costly.
It may be costly at application startup, when these immortal objects are promoted to old gen.

Vitaly Davidovich

Dec 23, 2014, 10:57:39 AM
to mechanica...@googlegroups.com
So I think that's correct in theory, but AFAIK the HotSpot collectors unconditionally do card marking, without checking whether the reference is in old gen (for performance reasons: there's already some overhead in simply calculating the card index and marking it, and an additional check for which gen the reference is from would just add to that).
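For illustration, the two barrier flavors look roughly like this (a Java-flavored sketch of logic the JIT actually emits as machine code; the names and card encoding here are illustrative, and -XX:+UseCondCardMark is the HotSpot flag that enables the conditional form):

```java
class CardMarkSketch {
    static final int CARD_SHIFT = 9;   // 512-byte cards, as in HotSpot
    static final byte DIRTY = 0;
    static byte[] cardTable;           // one byte per 512 bytes of heap

    // Unconditional barrier: always dirty the card covering the updated
    // field, without checking which generation the object lives in.
    static void postWriteBarrier(long fieldAddress) {
        cardTable[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }

    // Conditional variant (-XX:+UseCondCardMark): read before writing,
    // avoiding stores to already-dirty cards (and the false sharing they
    // can cause) at the cost of an extra load and branch.
    static void condPostWriteBarrier(long fieldAddress) {
        int idx = (int) (fieldAddress >>> CARD_SHIFT);
        if (cardTable[idx] != DIRTY) {
            cardTable[idx] = DIRTY;
        }
    }
}
```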

Jean-Philippe BEMPEL

Dec 23, 2014, 11:05:15 AM
to mechanica...@googlegroups.com
As pointed out by Vitaly, HotSpot always marks the card in the write barrier (when writing a reference).
But the dominant factor here is the scanning of the cards during minor GC: even if all cards are clean, it takes a non-negligible time proportional to the old gen consumed (for the Serial/Parallel Old collectors) or the total old gen (for the CMS collector).


Kirk Pepperdine

Dec 23, 2014, 11:42:10 AM
to mechanica...@googlegroups.com
Hi,

The costs of GC hit you from many different directions. Certainly a linear scan of the cards is one of them, and that cost is proportional to the size of tenured, but it's not the only one. Collecting tenured is an "in-place" (as opposed to an evacuating) collection, which requires one to deal with a free-list. Scanning for open space in the free-list can also be a problem. Compaction is yet another cost/issue...

Other notes: adding conditional card marking will cause the application to run scalar (as opposed to vectorized), while not adding it can be a source of false sharing. So this is a damned-if-you-do, damned-if-you-don't kind of problem.

Other than that, you're in the same pointer-chasing world that *all* GC'ed managed memory systems suffer from. So unless you accidentally get predictable strides, you're not going to see the benefits of pre-fetching. I don't think there is much you can do about this other than pack the data yourself into buffers.

Regards,
Kirk


Richard Warburton

Dec 23, 2014, 12:36:09 PM
to mechanica...@googlegroups.com
Hi,

> Other than that, you're in the same pointer-chasing world that *all* GC'ed managed memory systems suffer from. So unless you accidentally get predictable strides, you're not going to see the benefits of pre-fetching. I don't think there is much you can do about this other than pack the data yourself into buffers.

 * Use arrays of primitives or primitive specialised collections, rather than boxed types.
 * Don't create unnecessary levels of abstraction in your application that increase pointer-indirection costs.
 * Prefer taking arguments from method parameters, rather than fields. This makes your code simpler to understand and test as well IMO.

There's a load more things, but those are just a few that come to mind. There's also a strong programming thought exercise, if you're calculating anything of any consequence, in asking "can I convert this object graph into some kind of alternative data structure?" Even structures that naturally look like they're going to cause a load of pointer indirection can be refactored; for example, adjacency or incidence matrices can represent graphs. A sketch of a couple of these ideas follows.
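To make a couple of those points concrete (a small, purely illustrative sketch):

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenSketch {
    public static void main(String[] args) {
        // Boxed: one heap object and one pointer indirection per element.
        List<Integer> boxed = new ArrayList<>();
        boxed.add(42);

        // Primitive array: values are stored inline, with no per-element
        // objects for the GC to trace.
        int[] flat = {42};

        // A graph as an adjacency matrix: edge queries become array
        // indexing instead of chasing node and edge objects.
        int vertexCount = 4;
        boolean[][] adjacency = new boolean[vertexCount][vertexCount];
        adjacency[0][1] = true; // edge 0 -> 1
        System.out.println(boxed.get(0) + " " + flat[0] + " " + adjacency[0][1]);
    }
}
```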

I'm not saying that everyone should run off and do those things willy-nilly but there are often available options if you get stuck.

regards,

  Richard Warburton

Heath Borders

Dec 23, 2014, 2:31:02 PM
to mechanica...@googlegroups.com

What about moving the object into permgen by making it a field of a ClassLoader? (I assume there's only one instance, and it has a known size, so you could increase permgen space by that known size.)

-Heath Borders
Sent from my Nexus 5

Vitaly Davidovich

Dec 23, 2014, 2:47:32 PM
to mechanica...@googlegroups.com
> Other than that, you're in the same pointer-chasing world that *all* GC'ed managed memory systems suffer from. So unless you accidentally get predictable strides, you're not going to see the benefits of pre-fetching. I don't think there is much you can do about this other than pack the data yourself into buffers.

Besides what Richard mentioned, not *all* GC'd languages suffer the same way; C#, for example, has had value types for 10+ years, allowing some level of abstraction without incurring indirection costs. Hopefully that will be alleviated in Java with value types.

Kirk Pepperdine

Dec 23, 2014, 3:34:05 PM
to mechanica...@googlegroups.com
> Other than that, you're in the same pointer-chasing world that *all* GC'ed managed memory systems suffer from. So unless you accidentally get predictable strides, you're not going to see the benefits of pre-fetching. I don't think there is much you can do about this other than pack the data yourself into buffers.

i.e. prefer primitives (as Richard clarified). The other steps do imply a loss of abstraction in the code. For most applications, readability counts more than performance, so…


> Besides what Richard mentioned, not *all* GC'd languages suffer the same way; C#, for example, has had value types for 10+ years, allowing some level of abstraction without incurring indirection costs. Hopefully that will be alleviated in Java with value types.

Not the intent of Java value types, AFAIU. And C# value types aren't really what I'd like to see show up in Java, performance or not. We have a struct; it's called a class. We just need better mechanisms for packing them (hence ObjectLayout from Gil et al.).

Regards,
Kirk

Vitaly Davidovich

Dec 23, 2014, 3:44:29 PM
to mechanica...@googlegroups.com
> Not the intent of Java value types, AFAIU. And C# value types aren't really what I'd like to see show up in Java, performance or not. We have a struct; it's called a class. We just need better mechanisms for packing them (hence ObjectLayout from Gil et al.).

It's not *the* intent, but it's a huge motivator; look at http://cr.openjdk.java.net/~jrose/values/values-0.html, but here's a choice quote (with my emphasis):

> Thesis
>
> The goal of this effort is to explore how small immutable user-defined aggregate types without identity can be surfaced in the language and the JVM instruction set to support memory- and locality-efficient programming idioms without sacrificing encapsulation.


As for ObjectLayout vs. .NET structs vs. value types, we've had this discussion on this list with Gil already :). I don't want to get into that again, but my point is that irrespective of how Java's VTs will differ from .NET's, it *is* a goal to be more memory-efficient.

Chris Vest

Dec 23, 2014, 4:16:43 PM
to mechanica...@googlegroups.com
I’ve not heard of this “trick” before. I wonder if it works with the new metaspace in Java 8 as well. The size of the array isn’t hard-coded, though; it’s computed from user-supplied configuration.

I thought the permgen was also processed as part of a full GC. How else would classes be unloaded?

Cheers,
Chris

Kirk Pepperdine

Dec 23, 2014, 4:17:49 PM
to mechanica...@googlegroups.com
I’m going to cherry pick another paragraph before heading out for some Christmas cheer…

> Object identity has footprint and performance costs, which is a major reason Java, unlike many other object-oriented languages, has primitives. These costs are most burdensome for small objects, which do not have many fields to amortize the extra costs.

Smalltalk doesn’t have primitives, yet it has an internal representation for “small objects” that makes sense. It’s based on the idea that the language is our interface, and the compilers translate that to some implementation. So Smalltalk has what are known as immediates: from a language perspective these things walk like a duck and quack like a duck, but in the runtime they’re not ducks, they’re primitives. So, for some wonky historical reason, we have an interface that exposes implementation, and we continue to go down that path…

Here is another problem with the paper: it’s based on the old premise that everything should be immutable. I reject this premise as a starting point for any reasonable discussion. Without getting into it too deeply, immutability is *not* a natural state for many of our data structures, and even though I continue to see projects bear the costs of this dogmatic decision, people keep banging on that drum. Another point: with identity comes singularity, and with that, referential integrity. Break that and you’ve got the intractable O/R-mapping problem in your everyday domain. So beware, there be dragons in there… ;-)

Now for some Christmas cheer!

Regards,
Kirk

Kirk Pepperdine

Dec 23, 2014, 4:18:44 PM
to mechanica...@googlegroups.com
It’s called a static ;-) and those things have been tossed into the Java heap, so I’m not sure how it would work.

Regards,
Kirk

Vitaly Davidovich

Dec 23, 2014, 4:31:50 PM
to mechanica...@googlegroups.com

They want VTs to be immutable to avoid a class of errors whereby you modify a copy of a VT (think method arguments). C# allows passing structs by reference, so one can avoid this if careful, but Java has no plans to allow passing args by reference.

Sent from my phone

Heath Borders

Dec 23, 2014, 6:13:56 PM
to mechanica...@googlegroups.com
> I thought the permgen was also processed as part of a full GC. How else would classes be unloaded?

It is processed as part of a full GC, but I imagine there wouldn't be as much copying or traversal over the structure in permgen compared to old gen. This is strictly a guess, and I figured more experienced people on the list could confirm or deride it.

-Heath Borders
heath....@gmail.com
Twitter: heathborders
http://heath-tech.blogspot.com

Gil Tene

Dec 23, 2014, 7:40:25 PM
to mechanica...@googlegroups.com
Given your question, the actual title of the thread should be "How does object graph layout impact GC pause time?". GC performance outside of a pause mostly deals with the rate at which the collector is able to do concurrent [not during pause] work, and the amount of CPU the GC spends per allocation unit. But those are usually of much lesser concern than pause durations, because most GC setups have tons of performance headroom. As in: they are only active for a tiny percentage of the overall wall-clock time, and are therefore far from caring about efficiency outside of a pause.

The answer to your actual question ends up depending (dramatically) on the GC mechanism used.

E.g. focusing on Oldgen pause times (and ignoring newgen for now):

[Assumption: you are comparing an array of refs to a linked list going *through* the objects, not a linked list "on top of" the objects (i.e. not LinkedList<Object>).]

Marking/Tracing:

1. Concurrent and mostly-concurrent oldgen markers/tracers will incur the cost of traversing the references during their concurrent phases, but that will generally not increase the pause times (at the edges of the concurrent phases). This applies to CMS and G1 (and Shenandoah when it comes) as long as their markers remain mostly-concurrent (i.e. the collector did not fail over to a full GC). It also applies to C4, obviously. It does not apply to ParallelGC, SerialGC, or to the full-GC fallback of any mostly-concurrent collector.

2. Stop-the-world marking (e.g. ParallelGC) will incur the cost of traversing the various references *during* a pause. To first order, the cost is dominated by the number of reference fields that need to be visited, regardless of "shape" (flat array of refs vs. linked list). A second-order effect will be the number of those references that are non-null, and the number of those that are found to refer to already-marked-live objects (working on a reference to a popular object that has already been marked live is cheaper than working on a reference to a not-yet-marked object).

3. Parallelism during STW marking: another stop-the-world consideration is the parallelizability of the GC work during the pause. The refs in an array of refs *could* be chunked up and given to different GC threads to mark in parallel [I don't know whether ParallelGC or full-GC fallbacks actually do this when ParallelOldGen is on. C4 does, but it is not STW, so who cares...]. But a linked list will not be parallelizable from a tracing perspective. So *IF* your oldgen marker/tracer is STW, *and* it can use parallel threads for marking, *and* it knows how to chunk up arrays of refs to distribute their marking work across parallel threads, *then* a linked-list layout may present a much longer STW pause than a flat array of refs would.

Sweeping:

1. For oldgen collectors that sweep and manage free lists (e.g. CMS, as long as it stays in its mostly-concurrent mode), the sweep cost will not be dramatically different between a layout that has a huge array of refs and one that has a linked list going *through* the objects [the number of objects to skip over differs by +/- 1]. And the sweeper in such collectors is usually concurrent, so no STW-affecting behavior either way.

Compaction:

1. For STW compact-in-place collectors (e.g. ParallelGC, CMS when it eventually has to compact, and G1 when it falls back to a full GC), the compaction work is not going to be significantly different between the array-of-refs and the linked-list-through-the-objects cases. The number of objects to forward refs for and move remains the same (+/- 1). The amount of memory to move remains the same (+/- the array header). The number of refs that need fixing up remains the same (+/- 1). It's a big bad STW pause, but it doesn't change much with the shape variants you describe.

2. For incremental-STW compactors (e.g. G1), the variation between an array of refs and a linked list through the objects can be huge if the data structure is 100s of MB in size (as you indicate) and it was not lucky enough to have been allocated and compacted in order, and to have retained that grouping even through repeated evacuations (e.g. it was populated over time, interspersed with other work, and/or had elements replaced or shuffled early on). The reason for this big impact is that the work done to evacuate one region with pointed-to objects in it depends primarily on the number of other regions that contain references into that region. In the huge-array-of-refs case, the references pointing to the various objects are concentrated in the array (only a few regions), while in the linked-list-through-the-objects case the references to the objects are spread evenly over the 100s of MB (and 100s of regions) that the structure occupies. Each region containing many (100s of) individual member objects of the overall data structure will therefore have many (100s of) regions referring into it, leading to a much higher incremental compaction cost compared to the array-of-refs case. Incremental compaction cost can end up growing as N^2, where N is the number of objects in the overall data structure.

3. For concurrent compactors (C4, the future Shenandoah), the compaction/evacuation cost will not be significantly impacted by the choice of layout (same number of objects to move, same number of refs to eventually fix up). And since the compaction is not STW, we don't really care.

Moving to Newgen collection behavior:

For the other side (newgen pauses), several people have already commented on the card-table scanning cost incurred during a newgen pause in a pausing newgen collector. But here are some thoughts on the variants (assuming the data structure in question is "old and stable" as described):

1. Pausing, copying newgen (ParallelGC, CMS, G1, etc.):

1.1 The cleanliness (or lack thereof) of the card table will make a difference to newgen pause times, and different collectors end up cleaning the card table in different ways. E.g. a STW compactor can (and generally will) produce a clean, no-false-positives card table on every compaction. Other collectors may perform some amount of concurrent card-table cleaning [I'm not sure which do and which don't]. Without cleaning the card table periodically, the length of newgen pauses will gradually increase over time.

1.2 The means and logic by which dirty cards are processed can end up with a cost that depends dramatically on the choice of layout. E.g. when a dirty card is detected in a non-precise card-marking table (in HotSpot, a card corresponds to 512 bytes of the heap), the first step in finding the references to scan is to look for the start of the first object that falls within the card (and then scan forward from there). In the linked-list-through-the-objects case, this is never a big deal. But in the array-of-refs case, the start of the huge array will be "far far away" from a dirty card when it is encountered. Depending on whether the GC has a special-case optimization for card-scanning a large array of refs, this can end up with a much higher newgen pause effect in the array-of-refs case vs. the linked-list case.
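A rough sketch of that per-dirty-card work (illustrative pseudocode only, not HotSpot source; the helper methods are hypothetical stand-ins for VM internals):

```java
class CardScanSketch {
    static final int CARD_SIZE = 512; // HotSpot's card granularity
    long heapBase;

    void scanDirtyCard(int cardIndex) {
        long cardStart = heapBase + (long) cardIndex * CARD_SIZE;
        long cardEnd = cardStart + CARD_SIZE;
        // Step 1: find the start of the first object overlapping this
        // card. For a huge array of refs, that start can be "far far
        // away" behind the card, which is the cost described above.
        long obj = findObjectStartBefore(cardStart);
        // Step 2: walk objects forward, scanning reference fields that
        // fall inside the card for pointers into newgen.
        while (obj < cardEnd) {
            scanRefsInRange(obj, cardStart, cardEnd);
            obj += objectSize(obj);
        }
    }

    long findObjectStartBefore(long addr) { return addr; } // stub
    long objectSize(long obj) { return CARD_SIZE; }        // stub
    void scanRefsInRange(long obj, long lo, long hi) { }   // stub
}
```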

2. Concurrent Newgen collectors (e.g. C4): card scanning is not done in a STW pause, so who cares...

-- Gil. 

Chris Vest

Dec 24, 2014, 9:14:37 AM
to mechanica...@googlegroups.com
Thanks, Gil. This is the sort of guidance for the intuition I was after. I hadn’t made the connection that layout mostly impacts the marking phase, that this is the phase the concurrent collectors run concurrently with the application, and that they therefore care a lot less about layout. We use CMS by default, and have already put in effort elsewhere to reduce the frequency of big STW GCs.

It was exactly the parallelizability of the array I was worried about losing, but it’s good to hear that this isn’t really a concern for concurrent collectors.

We have considered G1 a couple of times in the past, but performance has generally been lower, so we’ve stuck to CMS. These objects are all allocated in one go, in the early life of the application, so I think chances are good that G1 will be fine with a big linked list as well. If it’s really lucky, I guess the regions could end up arranging themselves in another linked list. Whatever the case, we don’t shuffle the links around after they’ve been allocated, so it sounds like an ideal case. I could imagine G1 would be pretty unhappy with a giant skip-list, where things are moved around all the time.

Our application comes in both clustered and single instance variants, and this makes me think that it might be a good idea to consider G1 as a default for the single-instance mode, where there are no replicas to step in when the STW CMS compaction hits.

Our newgen pauses, whenever I’ve probed them in our testing, have tended to hover between 1 and 2 ms. We haven’t been modifying the references in the array, only fields of the referenced objects, so I don’t think this will make any difference to us.

Cheers,
Chris

On 24 Dec 2014, at 01:40, Gil Tene <g...@azulsystems.com> wrote:

> Given your question, the actual title of the thread should be "How does object graph layout impact GC pause time?".

Yes :)

> [Assumption: you are comparing an array of refs to a linked list going *through* the objects, not a linked list "on top of" the objects (i.e. not LinkedList<Object>).]

Exactly.



Gil Tene

Dec 24, 2014, 2:54:42 PM
to mechanica...@googlegroups.com


On Wednesday, December 24, 2014 6:14:37 AM UTC-8, Chris Vest wrote:
> Thanks, Gil. This is the sort of guidance for the intuition I was after.

Glad to help.

...
> Our newgen pauses, whenever I’ve probed them in our testing, have tended to hover between 1 and 2 ms. We haven’t been modifying the references in the array, only fields of the referenced objects, so I don’t think this will make any difference to us.

Are you sure it stays at only 1-2msec? Did you look at it for more than just a few minutes?

People often base their estimates of newgen pause time on only a few minutes of observation. The discussion that JP linked to above (https://groups.google.com/forum/#!msg/mechanical-sympathy/-nZt3N8X-GI/EzXvy6bwX1AJ) explains and demonstrates the behavior of newgen pauses over time. The part of the newgen pause that relates to card scanning grows linearly with the amount of used memory (dead or alive, doesn't matter) in oldgen. It starts off low (since oldgen is known to be mostly empty), and maxes out right before an oldgen collection (when oldgen is almost full, and there is no knowledge of what is live or dead within it). Since many applications tend/try to reduce oldgen pressure and oldgen collection frequency, it can take a long time to actually see how big your newgen collection times really get, and what typical times will be.

I built a java agent variant of the example program in that thread (which can also be run in standalone mode). As a java agent, you can add it to your program and it will force rapid newgens and relatively rapid growth in oldgen (all without keeping more than ~64MB alive at any given time), so you can see what newgen pauses will look like in the real world, without waiting for oldgen to slowly fill up on its own.
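For anyone wanting to reproduce the effect without the agent, here is a crude standalone sketch of the same idea (a sketch of mine, not Gil's actual MinorGC program): churn allocations so objects survive just long enough to be promoted and then die in oldgen, while the live set stays bounded. Run it with GC logging enabled (e.g. -verbose:gc) and watch newgen pause times climb as oldgen occupancy grows.

```java
import java.util.ArrayDeque;

public class OldgenFiller {
    public static void main(String[] args) {
        final long MAX_LIVE = 64L * 1024 * 1024; // keep <= ~64MB alive
        ArrayDeque<byte[]> window = new ArrayDeque<>();
        long liveBytes = 0;
        while (true) {
            byte[] chunk = new byte[64 * 1024];
            window.addLast(chunk);
            liveBytes += chunk.length;
            // Hold each chunk across enough newgen collections to get it
            // promoted, then drop it so it becomes garbage in oldgen.
            while (liveBytes > MAX_LIVE) {
                liveBytes -= window.removeFirst().length;
            }
        }
    }
}
```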


The best possible newgen pause times in Oracle HotSpot and OpenJDK (CMS, G1, ParallelGC) are demonstrated by this MinorGC test (it shows the GC pause time for a newgen with no live objects and nothing referring into it, with an oldgen that is pretty much empty as well, and a perfectly clean card table). These times obviously vary by machine, heap size, and GC mode selected. My measurements show about 2 msec per GB of heap size on the fast side (nice fat servers with lots of cores and memory bandwidth), and up to 10 msec per GB of oldgen (e.g. on my Haswell-based laptops).

MinorGC also gives you a way to control the percentage of oldgen heap space that contains references (and will therefore have cards that are dirtied at some point). This defaults to 0, but when the [-r refsFraction] option is used you can see the strong effect these dirty cards have on some collectors, and also the effects of concurrent pre-cleaning. E.g. ParallelGC appears to do no card pre-cleaning ahead of a pause, and its newgen pause times are therefore strongly susceptible to the percentage of refs in oldgen (newgen pauses growing by ~10x with just 5% of refs in oldgen). In contrast, CMS seems to be relatively insensitive to having refs (that don't refer to newgen) in the oldgen heap, which indicates that it likely pre-cleans the card table ahead of the newgen pause.


Rüdiger Möller

Dec 26, 2014, 7:48:32 PM
to mechanica...@googlegroups.com
I have done some experiments in this area:



Especially non-compacting collectors suffer a lot from pointer chasing; compaction usually performs an implicit locality optimization. Card marking behaves somewhat unexpectedly: it seems the VM increases the size of a card depending on overall heap size. At the time of the experiment I could only test on a 16GB machine; larger heap sizes would be interesting.