Concurrent heap compaction


Michał Warecki

Feb 8, 2013, 5:11:17 AM
to mechanica...@googlegroups.com
Hi guys,

I see that some experts in the GC area read this group. I am a rookie, so I have two questions about concurrent heap compaction.
A lot of applications suffer from long stop-the-world (STW) heap compaction pauses in OpenJDK and Oracle HotSpot.

1. What is the problem with implementing a concurrent version of compaction in HotSpot?
There is some work already done on this (The Compressor: Concurrent, Incremental, and Parallel Compaction by Haim Kermany and Erez Petrank).

2. Are there any issues with object order during concurrent compaction? Is there any way to apply heuristics, or is it better to preserve the original order? I have read a few articles about the impact of object reordering on Level 2 cache misses and TLB misses, but the results are confusing and inconsistent.

Regards,
Michał

Gil Tene

Feb 10, 2013, 12:59:36 AM
to
1. What is the problem with implementing a concurrent version of compaction in HotSpot?
There is some work already done on this (The Compressor: Concurrent, Incremental, and Parallel Compaction by Haim Kermany and Erez Petrank).

Well, Zing is a HotSpot-based JVM with a concurrent compacting collector...

To quote Michael Wolf from a recent conversation: "It's the difference between Computer Science and Engineering..."

Concurrent compacting collectors (in both copying and other object-moving forms) have been around in academic work for over 30 years. The variations of Baker's single-generation copying collector included both incremental and concurrent forms. Variations on collectors that can concurrently move objects without stopping the world abound.

AFAIK, *ALL* forms of concurrent moving collectors apply some type of read barrier. Something that appears to be "hard" for most commercial runtimes to do for some reason. Also AFAIK, Azul's Pauseless and C4 collectors are the only concurrent moving collectors to ever actually ship in commercial JVMs - Vega and Zing, which are both HotSpot-based JVMs. Both Pauseless and C4 employ and strongly depend on a new [as in "previously unknown"] form of read barrier - the Loaded Value Barrier [LVB]. So it *may* be that the self-healing quality of the LVB bridges the practicality gap. Or it may be just plain coincidence, and we may see other forms of read-barrier based moving collectors appear in the future.
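To make the read barrier idea concrete, here is a minimal C-style sketch of the general shape of a "self-healing" load barrier. This is not Azul's actual LVB implementation; the helper names (ref_is_stale, forwarded_address) and the reference encoding are hypothetical, purely for illustration.

/* Hedged sketch of a self-healing read barrier for a concurrent moving
 * collector.  All names here are hypothetical, not any shipping JVM's. */
#include <stdatomic.h>
#include <stdint.h>

typedef uintptr_t oop;                   /* an object reference */

extern int ref_is_stale(oop ref);        /* e.g. checks a phase/color bit in the ref */
extern oop forwarded_address(oop ref);   /* looks up the relocated copy */

/* Every reference load passes through the barrier.  If the loaded value
 * points at a relocated object, the barrier fixes up the value AND writes
 * the corrected reference back into the field ("self-healing"), so the
 * same field never trips the slow path twice. */
static inline oop load_ref(_Atomic oop *field) {
    oop ref = atomic_load_explicit(field, memory_order_acquire);
    if (ref_is_stale(ref)) {
        oop old = ref;
        oop healed = forwarded_address(old);
        /* CAS so we don't clobber a concurrent store by another mutator. */
        atomic_compare_exchange_strong(field, &old, healed);
        return healed;
    }
    return ref;
}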

One of the simplest examples of the gap between academic work and engineering, as it relates to concurrent moving collectors, can be found in the metrics of memory manipulation. Pauseless was the first collector algorithm to apply what we call the "quick release" technique (manipulating virtual-to-physical mappings and keeping forwarding data out-of-page to support efficient, single-pass full heap or space compaction without requiring double the physical memory), which was later used by The Compressor [a year later, although they seem to have been unaware of the Pauseless publication at the time]. The same technique is employed in the C4 collector (which can be thought of as a generational variant of Pauseless, supporting dramatically higher sustainable allocation rates). The sustainable rate at which the C4 collector implementation in Zing (running on regular x86 hardware and Linux OS platforms) currently performs the required mapping manipulation operations is literally 4-6 orders of magnitude faster (as in 10,000x to 1,000,000x faster) than the speed found in alternative/academic implementations [you can see some details of this gap in section 5.2 of the C4 paper]. That's the sort of gap that makes the difference between theory and practice; as in the difference between a 20 usec phase flip in an otherwise concurrent system and a 20 second stop-the-world pause - pretty important when you are trying to maintain concurrent mutator operation.
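For readers unfamiliar with the virtual memory side of this, here is a rough, hedged sketch (plain Linux C using memfd_create/mmap - not the actual Pauseless, C4, or Compressor code, and the region size is made up) of the basic remapping move that lets the physical pages behind an evacuated region be reused immediately, without a second physical copy. A real collector would also protect the old virtual range so stale references trap into the read barrier; the sketch only shows the physical-page recycling.

/* Hedged sketch of the virtual-memory move behind "quick release" style
 * compaction: hand the physical pages of an evacuated from-region to the
 * to-region by remapping them, instead of keeping a second physical copy. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#define REGION_BYTES (2UL * 1024 * 1024)   /* one 2 MB heap region */

int main(void) {
    int fd = memfd_create("heap-region", 0);     /* backing physical pages */
    if (fd < 0 || ftruncate(fd, REGION_BYTES) != 0)
        return 1;

    /* from-space view: objects live here before compaction */
    char *from = mmap(NULL, REGION_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* ... evacuate live objects elsewhere; forwarding data is kept
       off-page, so nothing in these physical pages is needed anymore ... */

    /* "quick release": a real collector would protect this virtual range so
       stale references trap; here we simply drop the old view. */
    munmap(from, REGION_BYTES);

    /* Reuse the same physical pages under a new virtual address as fresh
       to-space, without ever needing 2x the physical memory. */
    char *to = mmap(NULL, REGION_BYTES, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    printf("recycled region now mapped at %p\n", (void *)to);

    munmap(to, REGION_BYTES);
    close(fd);
    return 0;
}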

All this doesn't mean that relatively recent compacting collector work like The Compressor cannot be reduced to practice in a shipping, HotSpot-based JVM. It's just an observation that there is a lot of work involved, and a few orders of magnitude of performance gap may need to be covered for it to be practical. The dynamic cost of triggering read barriers during collection [one of the many things that LVB's design addresses] is another practicality gap. There are at least 2-3 other things involved in the actual engineering of a moving collector into a well-performing JVM, and my suspicion is that these sorts of practical implementation gaps are what has kept them from appearing more widely in full-blown commercial runtimes in the past 20 years.



Gil Tene

Feb 9, 2013, 1:33:26 AM
to mechanica...@googlegroups.com
2. Are there any issues with object order during concurrent compaction? Is there any way to apply heuristics, or is it better to preserve the original order? I have read a few articles about the impact of object reordering on Level 2 cache misses and TLB misses, but the results are confusing and inconsistent.

IMO, much of the actual "beneficial effect" of original-order and access-order layout-during-compaction stuff out there is dominated by a single Java class: String. For one, it's a fairly hot class in many applications. And since most implementations represent a String as two separate heap objects (a String and a char array) that live and die together and are accessed together in a clear pattern, co-locating the two object parts of a String linearly in memory provides a clear benefit (over splitting them apart) - in potential cache line sharing between the objects, if nothing else. Co-locating "in allocation order" happens to do the right thing here. There are similar benefits for some other well-known classes (StringBuffer, ByteBuffer, etc.), but with what is probably a much lower dynamic impact in most programs.
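As an illustration only, a copying/compacting pass could special-case String roughly as below, copying the header and its char array back-to-back so they share adjacent cache lines. The structures and helpers (j_string, char_array, tospace_alloc, record_forwarding) are hypothetical, not HotSpot's.

/* Hedged sketch: co-locating a String header with its backing char array
 * during evacuation.  All types and helpers are invented for illustration. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct char_array { void *klass; int32_t length; uint16_t data[]; } char_array;
typedef struct { void *klass; char_array *value; int32_t hash; } j_string;

extern void *tospace_alloc(size_t bytes);             /* bump-pointer allocation */
extern void  record_forwarding(void *from, void *to); /* off-page forwarding data */

/* Copy a String and its char[] back-to-back in to-space, regardless of
 * where the two objects ended up in from-space. */
static j_string *evacuate_string(j_string *s) {
    size_t arr_bytes = sizeof(char_array) + s->value->length * sizeof(uint16_t);
    j_string   *ns = tospace_alloc(sizeof(j_string) + arr_bytes);
    char_array *na = (char_array *)((char *)ns + sizeof(j_string));

    memcpy(ns, s, sizeof(j_string));
    memcpy(na, s->value, arr_bytes);
    ns->value = na;                    /* re-link the header to the co-located array */

    record_forwarding(s, ns);
    record_forwarding(s->value, na);
    return ns;
}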

But (again IMO) once you take care of those easy "known to want to go together" types (which are typically easy to special-case without needing to track or maintain either allocation or access patterns), it's not clear at all that maintaining original order has any significant benefit for cache locality (or page locality). There are probably many better ways to decide co-location - e.g. compact hot-accessed objects into common "hotly accessed" pages to dynamically improve page locality and the resulting TLB hit rates. But then again, you might find yourself asking if you actually want to stripe this stuff around instead, because overly concentrated hot pages can end up seriously bottlenecking specific memory channels and cross-socket links.
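A toy sketch of that kind of hot-page heuristic follows. Everything here is invented for illustration (the profiled access count, the threshold, and the allocators are all assumptions); it is not how any particular JVM decides object placement.

/* Hedged sketch: during evacuation, route objects whose profiled access
 * count exceeds a threshold into dedicated "hot" pages, so a few pages
 * (and TLB entries) cover most accesses.  All names are hypothetical. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t access_count; uint32_t size; } obj_header;

extern void *alloc_in_hot_pages(size_t bytes);
extern void *alloc_in_cold_pages(size_t bytes);

#define HOT_THRESHOLD 1024u   /* arbitrary cutoff for this sketch */

static void *choose_destination(const obj_header *h) {
    /* Hot objects are packed together; everything else goes to cold pages. */
    return h->access_count >= HOT_THRESHOLD
         ? alloc_in_hot_pages(h->size)
         : alloc_in_cold_pages(h->size);
}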

The hardest thing about this area is drawing conclusions from the relatively poor benchmarks in the field (poor in terms of representing "real world" stuff).



Michał Warecki

Feb 9, 2013, 4:43:27 AM
to mechanica...@googlegroups.com
Wow, many thanks for a very comprehensive answer. A great topic for prototyping and further practical research.

Michał