So, while I'm heaping much grumpiness on threads (though I suppose, as I've been out of touch for a bit maybe you've all solved the problem. That'd be nice) I've also been thinking about DOD, as there's a fair amount of overlap in the things that cause problems for threads and the ones that cause problems for generational garbage collectors.
For a generational DOD to work we have to have a way to note which generation a structure's in, as well as trap writes so we know when you're referencing an old generation from a new, and vice versa. (Threads have to trap mutations of this stuff, as this is in large part the bits that need synchronizing on mutation) A short list of mutating activities that are of issue are:
*) Setting the PMC data pointer *) Resetting the PMC data pointer *) Setting the PMC cache pointer *) Resetting the PMC cache pointer *) Putting a PMC or buffer pointer into a buffer *) Altering a buffer's metadata (size or location)
We don't care about most buffer data, just those buffers whose data are pointers to PMCs or other buffers. (Which argues that strings and buffers that contain raw data should be allocated from different pools than buffers that contain pointers to DOD-able data)
I think, then, that we'll want to intercept all these activities, which means that we need to have API facilities for them. Before we do that, though, I want to make sure there's nothing I missed from that list. -- Dan
--------------------------------------"it's like this"------------------- Dan Sugalski even samurai d...@sidhe.org have teddy bears and even teddy bears get drunk
On Thursday, January 29, 2004, at 08:41 , Dan Sugalski wrote: > So, while I'm heaping much grumpiness on threads (though I suppose, as > I've been out of touch for a bit maybe you've all solved the problem. > That'd be nice) I've also been thinking about DOD, as there's a fair > amount of overlap in the things that cause problems for threads and the > ones that cause problems for generational garbage collectors.
Can you elaborate a bit on your concept of generational collection as they apply to parrot? To my ear, generational collection implies all of this (which is a lot, and a lot of it grossly inapplicable to parrot as the runtime now stands):
GENERATIONAL GARBAGE COLLECTORS
Rationale: Most objects are short-lived, and those which are not are less likely to become garbage. Marking or copying the older is statistically wasted effort. A garbage collector which usually traces only new objects will thus tend to perform better than one which traces all of them.
Implementation: Most generational collectors are copying collectors. They are generally mutations upon a hemispace collector. The variation is that collection always proceeds from a small, fixed-sized space to a larger, growable space, rather than swapping back and forth between two symmetric spaces. These asymmetric spaces are the so-called generations, and the small one containing the younger objects is dubbed the nursery. Moving an object from the nursery is called tenuring. Thus, the "generation" of an object is not so much a property of an object as it is a property of the pool from which the object is presently allocated.
Some generational collectors have intermediate generations, but most do not. (Parrot might benefit from a "younger-than-the-nursery" generation for impatient destructors, but then that property would need to be ascertainable at construction time.)
The nursery is typically explicitly un-fragmented. All live objects in it are tenured from it when it fills. Thus, allocation can be as simple as an atomic pointer increment and filling in an object header. The tenured generation might be a heap.
To avoid tracing from the roots, the generational collector requires a cooperative mutator to maintain a table of references from tenured objects to objects within the nursery. This is often forced upon the mutator through the use of VM write barriers. The write barrier function marks the VM page as dirty and then removes write protection for that page. When invoked, the garbage collector will examine all dirtied pages, updating the table of references.
A truly cooperative mutator will update a table of dirty pages every time it changes a pointer variable. Usually, the pages are smaller than VM pages, and thus called "cards" instead. This operation can be as simple as card_table[((int) &ptr) >> n] = 1; ptr = obj;.[1] An expensive proposition, at least doubling the memory bandwidth overhead of pointer manipulation. But, since cards are smaller than pages, it can reduce the amount of memory which has to be scanned at collection time.
Despite being incremental by nature, a generational collector is not a concurrent collector. That is, it still "stops the world," it just doesn't do it for quite a long as a traditional hemispace or mark-sweep collector might.
Being a copying collector in the traditional mold has its drawbacks. It requires accurate reference tracing, in particular. A traditional copying collector cannot work with a conservative collector. For instance, it cannot operate on the OS stack (unless the program is written very cooperatively), since the collector cannot ascertain for certain whether a value is a pointer (which should be traced and possibly updated) or data (which should be left alone).
But parrot has anchored objects and other sorts of things which are incompatible with this. In particular, the fast-path nursery allocator (the very fast allocation being a major benefit of generational collectors) can't work: If an object were anchored in the nursery, it must stay there after GC rather than being tenured; this leaves the nursery fragmented and complicates the very hot allocation code path. Then there's the whole "PMCs don't move" guarantee. And there's also the conservative tracing of system areas.
At 10:51 PM -0500 1/30/04, Gordon Henriksen wrote:
>On Thursday, January 29, 2004, at 08:41 , Dan Sugalski wrote:
>>So, while I'm heaping much grumpiness on threads (though I suppose, >>as I've been out of touch for a bit maybe you've all solved the >>problem. That'd be nice) I've also been thinking about DOD, as >>there's a fair amount of overlap in the things that cause problems >>for threads and the ones that cause problems for generational >>garbage collectors.
>Can you elaborate a bit on your concept of generational collection >as they apply to parrot? To my ear, generational collection implies >all of this (which is a lot, and a lot of it grossly inapplicable to >parrot as the runtime now stands):
[Description snippage]
From my reading of the literature the generational collectors don't have to be copying collectors -- the only thing that's ultimately required is an identification of which generation an object is living in. Granted, with an absolute minimum support in the data (just a generation count) it makes collection pretty slow, but that's all you *need*.
Having all the objects of a generation in a single pool does make life significantly easier, to be sure, but you don't need to have the objects *themselves* in a single pool -- just have a quick way to identify all the objects in a generation. We could do that with a separate generation pool if we wanted, at the cost of a pointer per PMC plus overhead for the pool segmentation, much the way that we have the freelist for PMCs.
Might ultimately be untenable, of course. On the other hand, the code infrastructure and APIs needed for a generational collector are the same as the ones for incremental and concurrent collectors, so we may well just jump to an incremental collector instead if it turns out to be more worthwhile or less hassle. -- Dan
--------------------------------------"it's like this"------------------- Dan Sugalski even samurai d...@sidhe.org have teddy bears and even teddy bears get drunk
Dan Sugalski <d...@sidhe.org> wrote: > Having all the objects of a generation in a single pool does make > life significantly easier, to be sure, but you don't need to have the > objects *themselves* in a single pool -- just have a quick way to > identify all the objects in a generation. We could do that with a > separate generation pool if we wanted, at the cost of a pointer per > PMC plus overhead for the pool segmentation, much the way that we > have the freelist for PMCs.
The freelist is located inside of unused PMCs, it doesn't take up extra memory. An additional generation pointer needs additional memory. For changing the generations a double linked list could be necessary.
> Might ultimately be untenable, of course. On the other hand, the code > infrastructure and APIs needed for a generational collector are the > same as the ones for incremental and concurrent collectors, so we may > well just jump to an incremental collector instead if it turns out to > be more worthwhile or less hassle.
Incremental collectors are generally slower, but can provide RT capabilities. The stopping-the-world time is bounded. Generational GCs have a signifant overhead too, but depending on workload, this can be outweighed by not scanning old generation pools each time.
We can achieve a similar effect without much overhead, by remembering a previous-live-objects-count per arena and skip "almost all live arenas" each second time or so.
But anyway, we don't need[1] different schemes for PMCs: The copying collection of variable sized Buffer memory is the problem for multi-threading.
We can configure with --gc=libc or --gc=malloc but we have to be sure, that either the system malloc is fast enough or that the supplied malloc.c is portable and thread-safe. The latter isn't the case for the LEA implementation.
[1] experiments, patches, benchmarks are always welcome of course
>> Having all the objects of a generation in a single pool does make >> life significantly easier, to be sure, but you don't need to have the >> objects *themselves* in a single pool -- just have a quick way to >> identify all the objects in a generation. We could do that with a >> separate generation pool if we wanted, at the cost of a pointer per >> PMC plus overhead for the pool segmentation, much the way that we >> have the freelist for PMCs.
>The freelist is located inside of unused PMCs, it doesn't take up extra >memory. An additional generation pointer needs additional memory. For >changing the generations a double linked list could be necessary.
The freelist used to be an array that held free PMC pointers. I presume this changed?
> > Might ultimately be untenable, of course. On the other hand, the code >> infrastructure and APIs needed for a generational collector are the >> same as the ones for incremental and concurrent collectors, so we may >> well just jump to an incremental collector instead if it turns out to >> be more worthwhile or less hassle.
>Incremental collectors are generally slower, but can provide RT >capabilities. The stopping-the-world time is bounded. Generational GCs >have a signifant overhead too, but depending on workload, this can be >outweighed by not scanning old generation pools each time.
The not scanning part is where generational GC systems get their wins. Properly done (depending on workload) a generational system will have less gc overhead because it does less work.
>We can achieve a similar effect without much overhead, by remembering a >previous-live-objects-count per arena and skip "almost all live arenas" >each second time or so.
Well... maybe. I'm not sure we can, since those arenas need to be looked at for pointers to other arenas. If you work around that then you *have* a generational collector.
>But anyway, we don't need[1] different schemes for PMCs: The copying >collection of variable sized Buffer memory is the problem for >multi-threading.
No, it isn't. For several reasons, not the least of it is the fact that we're going to have to shift to a non-copying collector in the threaded case. The bigger problem is tracing the live set in the face of multiple simultaneous mutators, which the incremental gc systems handle much more pleasantly and with less overall impact on the system. -- Dan
--------------------------------------"it's like this"------------------- Dan Sugalski even samurai d...@sidhe.org have teddy bears and even teddy bears get drunk
A generation list either uses extra PMC memory or it is an array of pointers in the objects arena. An array of pointers would need allocation and resizing if generation of an object changes.
>>But anyway, we don't need[1] different schemes for PMCs: The copying >>collection of variable sized Buffer memory is the problem for >>multi-threading. > No, it isn't. For several reasons, not the least of it is the fact > that we're going to have to shift to a non-copying collector in the > threaded case.
Ok. But then lets start here. This *is* the current issue.
> ... The bigger problem is tracing the live set in the face > of multiple simultaneous mutators, which the incremental gc systems > handle much more pleasantly and with less overall impact on the > system.
I doubt that it has less overall impact. Anyway: an incremental GC isn't the only solution. It could be ok, if all PMCs are shared. But this isn't true. All temporary PMCs are *not* shared logically. Considering all temps of all threads as shared (and managing all PMCs of all threads in one PMC pool is for sure suboptimal). Freelist handling could have high contention with this case.
And incremental GC must always run faster then allocating new PMCs. This can't be achieved, if all threads use the one and only pool for its PMCs.
Incremental GC has a lot of overhead by tracking pointer updates. When we have the temps automatically shared, we add locking and this cost to each PMC. This scheme can't win (or will need a ~16 CPU machine to achieve something).
So it could work like this: - temps aren't shared - they live in their interpreters private pools - shared PMCs live in exactly one pool (of the "main" interpreter) - any thread can DOD its own PMCs - shared PMCs are marked live by that process, but nether freed, because that interpreter doesn't own the pool. And they aren't append to the next_for_GC pointer.
- shared PMC DOD needs: - suspend all interpreters, that might have a shared PMC from this pool - the main interpreter marks the root set - all other interpreters run the mark phase, one by one, and append live shared PMCs to the next_for_GC pointer (anchored at the main interpreter) - the main interpreter continues/finishes marking phase 2 - all interpreters can now continue to free_objects (if their statistics state, that they are low on free objects) - and in parallel the main interpreter can free shared objects.
If someone implements a truely paralled incremental GC later, and *if* that's faster, we can switch to it. Above scheme works (almost:) now.