Hello again,
This patch provides a mostly non-moving parallel garbage collector, enabled with the feature :mark-region-gc. The collector retains bump allocation and the current card-marking write barrier, and doesn't require mutator code to update the allocation bitmap. The collector is slightly slower (5-10% wall time) with a single thread, but tends to outperforms the current serial copying GC with multiple threads. My paper for ELS <https://applied-langua.ge/~hayley/swcl-gc.pdf#page=5> has some benchmarks (before compaction was implemented; compaction either doesn't affect performance or produces a slight speedup in less rigorous tests). But I don't think it's quite done; I'd appreciate some review, in particular on these listed points.
Of feature regressions:
- Immobile space still isn't supported. Doing a full GC was tricky when I last tried to support immobile space, as the mark-region collector can keep generation numbers intact, and the immobile space collector cannot (as I don't use fullcgc to do a full GC). But I got inexplicable crashes in TLSF after kludging around that. We should still use TLSF for the immobile space to avoid fragmentation, regardless of what's done in dynamic space.
- traceroot does not work, as it gleans the array of roots that
gencgc finds, and I don't maintain that array.
- Some tests still fail. Tests which check rehashing of hash tables expect that the keys will move, and the collector doesn't move them. I think ADDRESS-INSENSITIVE-EQL-HASH is that kind of test, but it might also be useful (see my last point). (HASH-TABLE GC-SMASHED-CELL-LIST) fails, so I guess I don't get how the hash table free-list works. gc.impure.lisp crashes and futex-wait.test.sh returns an invalid exit status.
- SB-EXT:*GC-REAL-TIME* tracks GC real time, and TIME prints the
mutator/GC breakdown of both real time and run time now. I'm
amused that single-threaded mutator code can use more than 100%
CPU now, but it might come as a surprise - though a correct
surprise. (Contrariwise, a parallel program bottlenecked by GC
reports less CPU used than the programmer hoped for.)
In the internals:
- Somehow FORMAT stopped working in cold-init, which Stas didn't
think would work in the first place
<https://irclog.tymoon.eu/libera/%23sbcl?around=1667040784#1667040784>.
- The C runtime can no longer assume that it can walk the heap as if it was contiguous; instead there is a little iterator protocol in walk-heap.h which works for both contiguous and non-contiguous heaps. The non-contiguous implementation is very naive, but I haven't found it being a performance problem. Though maybe I don't test the right things - it took me a while to find that S-L-A-D was broken once, because I hardly use it. The C runtime also cannot use (anything to the effect of) page_table[find_page_index(object)].gen to determine the generation of small objects, as generations for small objects are stored in line metadata instead. (But using the page table is fine for pseudo-static objects and large objects.)
- The tracing logic in fullcgc.c has been made generic in trace-object.inc, which is customised using the C preprocessor.
- Genesis emits single-object pages correctly when #+mark-region-gc.
- Core files now contain the allocation bitmap, which is stored after the page table. I don't ever compress the allocation bitmap, as it's 1/128 the size of the heap. Not sure if it's okay to just dump the bitmap after the page table either.
- The allocator prefers to reuse partly-used pages for small objects, in order to keep free pages for large objects, which is implemented using another array allow_free_pages. It works, but it doesn't seem right to separate it from the other allocator state in alloc_start_pages.
- I had to rearrange some headers, and I'm not sure if I did it
in a reasonable way. I also started to #ifndef
LISP_FEATURE_MARK_REGION_GC out dead code in gencgc, but I didn't
finish. Should/could any common code be extracted into other
files, rather than having to #ifndef out a large amount of
gencgc.c?
- I've implemented being able to have roots use interior pointers (as mr_preserve_ambiguous calls find_object which allows interior pointers), but honestly I can't think of a purpose for it.
- As a side effect of lines (mostly) having generations rather than pages, the generation breakdown of gc_gen_report_to_file produces wrong and useless data. I'm not sure what to do with it.
With regards to performance:
- Incremental compaction works, but has barely been tuned.
#+mark-region-gc sbcl.core is about 2x the size of a
#-mark-region-gc sbcl.core. Tuning the amount to copy with a time
limit seems useful, moreso if combined with concurrent tracing to
make a pause time target; but even with STW the default
configuration is rather arbitrary.
- Page and heap usage is counted in lines, and the GC has no way
of measuring fragmentation in lines (e.g. if each line in a page
was used by only one cons cell somehow). This is done to allow
sweeping to update usage by subtracting lines it freed; perhaps we
could maintain a table of words used in each page/generation
combination, to be able to keep precise measurements. For a
parallel GC we might want to have a table for each thread to avoid
atomic updates, then merge the tables after tracing. Most
collections just need to track usage of one generation, but full
collections need to track usage of all generations. I haven't
thought this through entirely.
- Currently the number of GC threads is always 4 by default (three helpers and the thread which entered collect_garbage), unless a different number is provided using the GC_THREADS environment variable. I'm not sure what the default should be.
- I added a "panic" trigger which causes collections of older
generations, when more than 90% of the heap is used but the other
collection triggers don't fire. In my experience this helps to
avoid running out of heap, but will likely cause performance
regressions if one manages to keep SBCL running with that much
heap used. The limit on auto_gc_trigger is also lowered a bit, to
compensate for fragmentation causing heap exhaustion to happen
earlier than anticipated.
Many thanks,
Hayley
(Though mmaping in the bitmap might help with startup time too, as is
done for the heap.)
> Can the allocation bitmap be derived, eg by a GC run right after
> starting?
It seems possible in theory. The bitmap is only needed to deal with
conservative roots, which come from threads, and there are no Lisp
threads yet. But doing a full GC at startup doesn't seem like a good
idea for startup time.
>
>> - Currently the number of GC threads is always 4 by default
>> (three helpers and the thread which entered collect_garbage),
>> unless a different number is provided using the GC_THREADS
>> environment variable. I'm not sure what the default should be.
>
> I guess that should be configurable from *within* the running image,
> so that it can depend on some configuration, like
>
> - number of cores allocated to this process,
> - size of request worker pool,
>
> etc.
>
I haven't implemented any way to resize the thread pool, but that does
sound like the right thing to do.
- traceroot does not work, as it gleans the array of roots that gencgc finds, and I don't maintain that array.
Is its only failing that it won't find a path if the root of the path would be a thread stack? Or does it simply not work at all?
I think the former. The following appears to work, involving a symbol-value:
(defvar *x* (list 1 2 3))
(sb-ext:search-roots (sb-ext:make-weak-pointer *x*) :gc t)
-> (SB-IMPL::SYMBOL-HASHSET) #x100031D3D3[2] -> (CONS)
#x1006BFDD27[1] -> ((SIMPLE-VECTOR 307)) #x1006C25E5F[78] ->
(SYMBOL) #x100B1EC5DF[2] -> #x100931C7C7
Somehow I manage to make a loop in the compacting remset, by the
way - or maybe I don't, I'm not sure. I have a program which will
either cause it under a few minutes, or not after a day. Nor do I
see why it should be possible to make a loop, as the remset uses
the same data structure and synchronisation as tracing.
I did consider it. The main issues were that I wasn't sure how to rip
out all the code assuming gencgc stuff, nor how to make genesis dump a
heap that MMTk could manage. MMTk lacks conservative stack scanning and
weak references, which I could have implemented, and the
mutator/allocator state needlessly has state for every collector
plan/allocator, which I could have also modified (to only have the Immix
bump allocator, say). I arguably had to make some tricky changes to make
my collector work, however; but I think I got off easier there.
(Fixed the issue with the loop in the compacting remset. I got
materialising the allocation bitmap to crash once, but my stress testing
program kept running through the night, so we'll see if it happens again.)
> MMTk lacks conservative stack scanning and
weak references
> MMTk lacks conservative stack scanning andIt has it now. See mod.rs which exposes a "valid object" bit for conservative stack
weak references
and similarly add_weak_candidate
It seems like it might be worthwhile revisiting mmtk at some point
Definitely.
- I had to rearrange some headers, and I'm not sure if I did it in a reasonable way. I also started to #ifndef LISP_FEATURE_MARK_REGION_GC out dead code in gencgc, but I didn't finish. Should/could any common code be extracted into other files, rather than having to #ifndef out a large amount of gencgc.c?
- I had to rearrange some headers, and I'm not sure if I did it in a reasonable way. I also started to #ifndef LISP_FEATURE_MARK_REGION_GC out dead code in gencgc, but I didn't finish. Should/could any common code be extracted into other files, rather than having to #ifndef out a large amount of gencgc.c?
In your desired end-state would you think that #+gencgc and #+mark-region-gc would be mutually exclusive? I would.
Some of this is as simple as more #defines. e.g. Adding a C macro for set_allocation_bit_mark that expands to nothing or something would remove a few #ifdefs.
Btw, any user of ELF cores (of which I know none other than Google) won't be able to use mark-region until we come up with a way to save ELF cores that don't depend critically on #+immobile-space. It should be possible to write the text section using code pages from anywhere. I'm not sure if dedicated pages of code were even a thing when I first did ELF cores.
I do want to support immobile space, but somehow I couldn't get
it to work (even for a rather lax definition of "work"). As in my
first email there are many deficiencies that I want to correct in
the future; but I'd prefer not to grow the pile of code for review
any further, nor do I enjoy keeping up with changes made upstream,
moreso when I've had to move code around already (though
#-mark-region-gc passes the test suite, I'm still uneasy about
it).
- The tracing logic in fullcgc.c has been made generic in trace-object.inc, which is customised using the C preprocessor.
That's what I do already, and there's no issues with it at the moment.
But it doesn't quite work for immobile space ...
In addition to your known test failures, a few failures in 'gc.impure' occur but - I think - only under parallel-exec. If I had to guess, it does not stop and restart the GC threads around fork(). So we can probably consider this an error in the test runner and it might be useful to have it fixed, as it will cut your testing time down by N cores.
::: Running :PIN-ALL-CODE-WITH-GC-ENABLED
::: UNEXPECTED-FAILURE :PIN-ALL-CODE-WITH-GC-ENABLED due to SIMPLE-ERROR:
"68751066331 is not a valid argument to SB-KERNEL:MAKE-LISP-OBJ"
::: Running :GC-LOGFILE
fatal error encountered in SBCL pid 3648282 tid 3648282:
Invalid page type 0x12 (p2084)
Part II
* The post_process function in coreparse needed to receive 'spaces' as an argument otherwise we can't build with #+immobile-space.
* page_extensible_p seemed to have no bearing on mark-region but was moved to a common header
* Obsolete comments.
"Note that the majority of GC is single-threaded"
Part III
* The general version of the tracing algorithm is not type-safe.
Consider that in fullcgc.c with immobile-space enabled, the C preprocessor emits:
wrap_mark(layout, &((uint32_t*)(where))[1], SOURCE_NORMAL);
but the second parameter to the ACTION is supposed to be lispobj*. This surely means you can't deference through it, which I presume will constitute a bug if applied to anything that needs the second arg. It's not a bug in mark-region because there are no half-sized pointers without immobile-space.
* Can pinned_p just return a constant 0 instead of pretending that the pinned_objects hash-table ever has >0 items? I never saw anything inserted in the regression suite.
* page_single_obj_p seems to always return 0 in my testing. Is that right?
This discovery came about when attempting to check if 'add_new_area' could be removed, as I tried to cause copy_potential_large_object() fall into the case where it would call add_new_area, but it never did. (And of course add_new_area can be removed, since areas are never read, which I realized after the fact)
* Is the comment above RESET_ALLOC_START_PAGES still right in its entirety, particularly where it says "other than it serves its purpose for picking up where it left off" ?
Does mark-region.c use the start page for that purpose?
* Similarly adjust_obj_ptes is never called. So you won't account for pages freed if a vector shrinks from 1MB to 1KB ? I removed adjust_obj_ptes but maybe you'll put it back if this was the wrong thing to do.
* Does 'walk_generation' remain correct as-is if generation is tracked at a line level? Does it only work if the supplied generation is -1 (therefore selecting all generations)?
* We should probably remove the hook for the statistical profiler in :ALLOCATION mode. It expects to sample a stack once every GENCGC_PAGE_BYTES, but it doesn't really do that with mark region since you don't know how many lines you made available in a new region. (I don't use this feature, so I don't really care. SB-APROF is better)
I suspect we could rig the allocator to count how many lines get
allocated, and ensure that the slow path gets taken after
allocating GENCGC_PAGE_BYTES. But I agree that the hook could just
be removed, if no one uses sprof with :ALLOCATION mode.
I wasn't aware that SBCL could shrink objects in-place. (I saw logic for it in e.g. copy_potential_large_object, but haven't seen anything which shrinks a vector.) No, that's not accounted for.
There are essentially two uses for object size shrinking:1. %SHRINK-VECTOR which is used by sequence functions on arrays where we'll overestimate the result size and then maybe clip the end2. Bignum operations. Normalizing a bignum where all the high words became 0 or all became -1 can remove insignificant sign-extending words
Attached are my changes that apply cleanly at the latest commit. It builds with several combinations of options but I didn't try plain gencgc on anything but x86-64 just yet.I plan to make it where mark-region and gencgc are mutually exclusive choices.It'll look nicer when the base of this change has all the stuff which I moved out of gencgc already moved out in a precursor change.I did NOT apply your second patch ("simplify find_object") in this.Let me know what you think. The tests passed for me except the 4 that you probably know of. And all tests passed without the mark-region-gc feature, and I think with and without immobile-space.
Looks good to me. Is duplicating the functions between pmrgc.c
and gencgc.c going to pose any problems for anyone wanting to
change one of the duplicated functions? pmrgc.c is relatively
small, to be fair.
Looks good to me. Is duplicating the functions between pmrgc.c and gencgc.c going to pose any problems for anyone wanting to change one of the duplicated functions? pmrgc.c is relatively small, to be fair.
Question: why does scanning the binding stack use 'mr_preserve_ambiguous' instead of 'mr_preserve_range'? (TLS does use preserve_range)The binding stack on #+sb-thread architectures looks like a sequence of fixnum and exact pointer. (The "fixnum" is actually the TLS index of the symbol as a raw value, but alignment makes it have a 0 low bit). And for #-sb-thread it looks like a sequence of exact pointer to symbol and exact pointer to value. If you were seeing garbage values on the binding stack, that suggests a bug somewhere.
* immobile symbols - these are the only "complicated" piece. We gain some codegen advantages by placing them sub-2GB. Could you have a part of dynamic space that is mapped discontiguously in your GC? It mostly just means that address-to-page-index and vice-versa become piecewise linear functions.
* immobile layouts - either use the same idea of immobile symbols, or restore to working order the space formerly known as metaspace. (I liked it, though I realize there is a dissenting opinion). The advantage to having 4-byte addressable layouts is very evident: a 3% to 5% reduction in memory use due to all instances being 1 word shorter on average.
Something to do with immobile space makes the mutator about 40% faster in my parallel fuzz tester application FWIW. I'd like to say it's compact headers, but the cache miss rate is only a bit lower (from 17% without immobile space 15% with), so I don't have any strong evidence.
On 15/7/23 11:29, Douglas Katzman wrote:
I just pushed a branch which contains that patch plus the changes necessary for x86-64 macOS.Performance seemed OK on the mac even though it has to make a ton of calls to pthread_getspecific.My guess is that it may be more typical to try to bundle all your thread-local variables into one struct, call pthread_getspecific in some top level location in the GC worker, and then pass the struct around a lot more. But maybe it's fine.
Neat. The thread-locals are nice specifically for not having to pass the state through functions like e.g. scavenge_root_gens_worker -> scavenge_root_object -> trace_object -> mark, but if it doesn't hurt performance then there's no worries. Though the struct-of-state would give us back lvalues I think - on Darwin we might have
#define TLS(name) pthread_getspecific(gc_key)->name
and
#define TLS(name) name
on platforms with _Thread_local, say. Then e.g. TLS(dirty_generation_source) = gen; should work okay. In my experience on Linux (don't have a Mac handy) all of tracing gets inlined together; pity Clang doesn't seem to be able to CSE pthread_getspecific.
I don't think I understand _Thread_local support in Darwin
though;
<https://stackoverflow.com/questions/33358417/how-to-get-support-for-thread-local-on-mac-osx-clang>
says that Xcode 8 and later support _Thread_local annotations.
What versions of macOS and Xcode should SBCL support? But I gather
it's still a problem due to you saying "because each is either the
first [W^X] or second [_Thread_local-less] kind."
I don't think I understand _Thread_local support in Darwin
What's wrong with having #+mark-region-gc imply #+gencgc (as it was in my patch)? I don't have an opinion on the matter, just wondering; nor am I good at naming things.I'm trying to clean up the headers for commiting your stuff to master and I'm wondering about some small changes that nobody else will really care about.* To make the feature mark-region-gc mutually exclusive with gencgc I'm having to add a lot of "if define LISP_FEATURE_MARK_REGION_GC || defined LISP_FEATURE_GENCGC". Do you think it makes sense to define a single feature that subsumes those both and means "not cheneygc"? I'm not sure what the name of the feature should be.
From the archives I found "As Christophe says, cheney is nice because it fits in ones head, keeps GC abstractions more honest" -- the mark-region collector is quite different and discourages too tight coupling to gencgc-isms, perhaps in different ways to cheneygc too (like walking the heap by stepping over contiguous objects). I suspect the same of your concurrent non-moving collector.* Do you think I should rip out all traces of cheney? Every time I threaten to, xof tells me: "I liked cheneygc because it was an algorithm I could fit in my head". He said it again just last week. The point is that he wants at least _some_ other algorithm, but I would guess that mark-region does not satisfy his "fit in head" requirement.
I collect the pseudo-static generation just fine in a full GC, but wouldn't know when to automatically collect it. The usual allocation trigger wouldn't work as we don't allocate into pseudo-static (but I gather pseudo-static objects can become unreachable over time due to mutations).* I want to rename pseudo-static-generation to static-generation. I see nothing "pseudo" about it. It's immortal and doesn't move.Some people view it as a bug that everything is immortal. I suppose you could rectify that in your GCIn my concurrent GC I'm actually just treating it as an extension of static space so it has the same limitation, rightly or wrongly.
I do not use FILLER_WIDETAG. Indeed the sweep only touches the line bytemap and allocation bitmap, and never touches heap data.* Do you still use FILLER_WIDETAG ? Can you just mark the lines as unallocated or restore them to whatever their initial state is?
I have incremental compaction (in the sense that it is in STW but copies only part of the heap at a time), and have pinning; I'm not too sure what's special about STARTING_THREADS, so the thread, function and arguments should still be pinned?* The code in gencgc which pins starting-threads (thread instance, start function, and vector of args to start-function) might be something we can remove in mark-region, unless you think you really will have incremental moving in which case pinning remains relevant.
Rereading traceroot.c I don't think I quite understand how it works. examine_threads scans the TLS, binding stacks, control stacks and pin lists, so I'm not sure what else would be in the gencgc pin table?* Do you see a way to completely remove the C code supporting search-roots from the GC? I suspect the way the Lisp side should work is as follows: try-acquire-gc-lock; if successful, then stop-the-world; synchronize page table usage; call the path finder; restart the world. Also I think the way it should build the inverted heap is not via a linear walk of objects, but actually do a graph trace and record reverse pointers as it does so.
I check IRC, Matrix and Discord regularly enough; open to suggestions for other applications though.Do you have a chat application that you use? I kind of dropped off the sbcl IRC.
> * Do you think I should rip out all traces of cheney? Every time I threaten to, xof tells me: "I liked cheneygc because it was an algorithm I
> could fit in my head". He said it again just last week. The point is that he wants at least _some_ other algorithm, but I would guess that
> mark-region does not satisfy his "fit in head" requirement.
>
> From the archives I found "As Christophe says, cheney is nice because it fits in ones head, keeps GC abstractions more honest" -- the
> mark-region collector is quite different and discourages too tight coupling to gencgc-isms, perhaps in different ways to cheneygc too (like
> walking the heap by stepping over contiguous objects). I suspect the same of your concurrent non-moving collector.
I mean, we shouldn't over-optimize for my nostalgia for GCs that fit in
my head. But I think it would/should be an interesting project for
someone™ to construct the simplest possible GC for SBCL; what does it
look like? (Would it fit in my head?) What are the immovable
constraints?
None of this is urgent, and if the remnants of Cheney are getting in the
way, then by all means remove them. I agree that having N=2 garbage
collectors of any kind helps with the "keeps the abstractions honest"
part.
What's wrong with having #+mark-region-gc imply #+gencgc (as it was in my patch)? I don't have an opinion on the matter, just wondering; nor am I good at naming things.
I have incremental compaction (in the sense that it is in STW but copies only part of the heap at a time), and have pinning; I'm not too sure what's special about STARTING_THREADS, so the thread, function and arguments should still be pinned?
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f6beb938740 (LWP 2126589) "sbcl" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x560b477ee120 <join_semaphore>) at ./nptl/futex-internal.c:57
2 Thread 0x7f6be8e146c0 (LWP 2126636) "Parallel GC" 0x00007f6beba0a345 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f6be8e13e20, rem=rem@entry=0x0)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
3 Thread 0x7f6be97a46c0 (LWP 2126637) "Parallel GC" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x560b477ee100 <start_semaphore>) at ./nptl/futex-internal.c:57
4 Thread 0x7f6bea1346c0 (LWP 2126638) "Parallel GC" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x560b477ee100 <start_semaphore>) at ./nptl/futex-internal.c:57
5 Thread 0x7f6be75276c0 (LWP 2126640) "finalizer" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f6be7730218) at ./nptl/futex-internal.c:57