Apologies, Matthew, if you know this all already. This is really more
for Paulo's benefit, but your wish list provided a convenient framework.
On 9/13/2019 9:28 AM, Matthew Flatt wrote:
> At Thu, 12 Sep 2019 19:20:08 -0400, Philip McGrath wrote:
> > "Flonum unboxing: accept mismatch, for now" [...] I could imagine
> > that this might not be what you're looking for, since it seems like
> > it would be much more about creating garbage than collecting it per
> > se
>
> Yes, flonum unboxing entirely a compiler project.
>
>
> Some things I'd like to have from the GC, but these are large projects:
>
> * GC implemented/generated in Scheme.
>
> A common problem with adding new things to the GC (including
> scalable object locking, ordered finalization, and vfasl saving) is
> repeating/copying the code that traverses objects. Each copy is a
> little different, and it's difficult to abstract over the variation
> and get good performance, especially in C. Implementing the GC
> itself in Scheme may be too tricky to be worthwhile, but generating
> a C implementation in Scheme might be the way to go. The pointer and
> object layouts in "cmacro.ss" is already a lot of the needed
> information.
Working in C (or similar languages), it is best not to abstract very
much wrt pointer identification. Each different object type should have
a custom pointer finding function generated by the compiler directly
from the object's layout. You then somehow associate the function (or
functions, see below) with the object - using BiBOP page metadata, or an
additional pointer in the object header, etc. Stack frames (assuming the
language has a notion of a contiguous stack) can be handled in very much
the same manner as heap objects.
For a non-moving collector, the pointer finding function can simply
queue all the pointers for later examination and then mark the object as
scanned. In the case of a union, you would queue the "potential"
pointer and figure out later whether it really points to a heap object
(mistakes due to conservative pointer finding are fairly rare if true
heap allocations can be positively identified).
For a moving collector, it's a bit more complex - the pointer finding
function needs to be able to enumerate pointers one-by-one, and you also
need a corresponding pointer setting function.
But these functions tend to be rather simple to generate - the majority
of objects are not very complex, and in any case it is all done
mechanically from the object's layout.
For generational (or parallel, below) collection, you'll need to
implement moving anyway.
> * Support for non-moving objects that are collected when unreferenced
> but not moved while referenced.
Best handled with one or more separate logical heaps. Relatively easy
in a BiBOP system if you base page residence on object size rather than
on type. Quite a bit more complicated if you demand heaps be contiguous.
> Racket CS implements non-moving objects by locking objects and using
> a finalizer to unlock. Although I made changes to Chez Scheme so
> that object locking is more scalable, it would be nicer and likely
> perform better to have an allocation arena that is GCed but not
> compacted/moved.
Ouch! "Pinning" objects in a moving system is a PITA ... it doesn't
matter whether you are compacting or (semi-heap) copying. It would be
better if "pinning" an object relocated it to the non-moving heap.
> * Parallel GC
Generally involves a complete rethink of the heap structure. Parallel
collection is one thing that actually IS made easier by contiguous
(semi) heaps. In a BiBOP system a "heap" is a logical construct of
(maybe) discontiguous pages, and the extra chasing around makes
collecting in parallel more complicated.
Note that the mutator(s) typically are stopped in a parallel collection
(but see below).
> * Incremental GC
Depends on what you mean by "incremental".
One possible "incremental" system would use the same N (semi) heaps as
the parallel collector, but rotate the pair of heaps used for current /
FROM and TO spaces. The mutator is stopped just long enough to collect
the filled FROM space, emptying it completely so it can become a
TO-space target for the next collection. [I believe the GC handbook
calls this "train" allocation.]
Another would leverage the VMM system, collecting page by page as
FROM-space pages are touched. This can be done moving or non-moving,
but it is complicated by the need to positively identify object
boundaries ... because the identity of the object being written to has
been lost when the VMM-assisted collector gets control ... and by
objects that span multiple VMM pages.
Collecting object by object while the mutator continues requires much
more complexity and generally requires more space - regardless of
whether GC actually is interleaved with or done partly in parallel with
the mutator. It requires, for each object, one of:
A) an extra metadata pointer in the header [kiss of death for small
objects: boxes, pairs, flonums, etc.], or
B) a minimum allocation size at least equal to that of a cons pair.
Either case requires access barriers on both read and write that can
recognize that an object has been forwarded and then perform the
operation on the TO-space copy instead of the original in FROM-space.
The barrier may be required to copy other objects because FROM-space
originals generally can't be modified after having been copied [at least
not without introducing a lot of extra complexity].
If you put a reasonable limit on the size of thread (Racket place)
private heaps, and limit the number of generations in the private heaps
to just 1 or 2, then you can get good performance with a simpler
Stop-the-Thread collection - done asynchronously whenever the thread
pleases - and leave more complex techniques like incremental collection
to the shared heap, which is likely to be large (maybe last generation
sized) and could be collected in parallel with the mutator(s), often at
a slower rate.
One thing to remember about GC is that useful real-world systems tend to
be hybrids that make use of multiple allocation/collection techniques for
different situations. Trying to use the same technique everywhere might
make for a good research paper, but it is unlikely to make for a highly
performant runtime system. No one technique is optimal everywhere (if
it even is optimal anywhere), so some amount of tailoring to the
specifics generally will be better.
George