1. What is the problem with implementing a concurrent version of compaction in HotSpot?
There has been some work done on this (see "The Compressor: Concurrent, Incremental, and Parallel Compaction" by Haim Kermany and Erez Petrank).
Well, Zing is a HotSpot-based JVM with a concurrent compacting collector...
To quote Michael Wolf from a recent conversation: "It's the difference between Computer Science and Engineering..."
Concurrent compacting collectors (in both copying and other object-moving forms) have been around in academic work for over 30 years. Variations of Baker's single-generation copying collector included both incremental and concurrent forms, and variations on collectors that can concurrently move objects without stopping the world abound.
AFAIK, *ALL* forms of concurrent moving collectors apply some type of read barrier - something that appears to be "hard" for most commercial runtimes to do, for some reason. Also AFAIK, Azul's Pauseless and C4 collectors are the only concurrent moving collectors to ever actually ship in commercial JVMs - Vega and Zing, which are both HotSpot-based JVMs. Both Pauseless and C4 employ and strongly depend on a new [as in "previously unknown"] form of read barrier - the Loaded Value Barrier [LVB]. So it *may* be that the self-healing quality of the LVB bridges the practicality gap. Or it may be just plain coincidence, and we may see other forms of read-barrier-based moving collectors appear in the future.
One of the simplest examples of the gap between academic work and engineering as it relates to concurrent moving collectors can be found in the metrics of memory manipulation.
Pauseless was the first collector algorithm to apply what we call the "quick release" technique (manipulating virtual-to-physical mappings and keeping forwarding data out-of-page, to support efficient, single-pass full-heap or space compaction without requiring double the physical memory). The same technique was later used by The Compressor [published a year later, although the authors seem to have been unaware of the Pauseless publication at the time], and is employed in the C4 collector (which can be thought of as a generational variant of Pauseless, supporting dramatically higher sustainable allocation rates). The rate at which the C4 collector implementation in Zing (running on regular x86 hardware and Linux OS platforms) can sustainably perform the required mapping manipulation operations is literally 4-6 orders of magnitude (as in 10,000x to 1,000,000x) faster than the speed found in alternative/academic implementations [you can see some details of this gap in section 5.2 of the C4 paper]. That's the sort of gap that makes the difference between theory and practice; as in between a 20 usec phase flip in an otherwise concurrent system and a 20 second stop-the-world pause - pretty important when you are trying to maintain concurrent mutator operation.
All this doesn't mean that relatively recent compacting collector work like The Compressor cannot be reduced to practice in a shipping, HotSpot-based JVM. It's just an observation that there is a lot of work involved, and a few orders of magnitude of performance gap may need to be covered for it to be practical. The dynamic cost of triggering read barriers during collection [one of the many things that LVB's design addresses] is another practicality gap. There are at least 2-3 other things involved in the actual engineering of a moving collector into a well-performing JVM, and my suspicion is that these sorts of practical implementation gaps are what's kept them from appearing more widely in full-blown commercial runtimes over the past 20 years.