Racket mostly uses mark-and-sweep for the old generation. It sometimes
compacts, but not usually. (I tried disabling compaction in the GC, and
it doesn't seem to affect the benchmark.)
Although Racket's incremental GC mode can help some, as you noted, it
doesn't help out-of-the-box for the benchmark.
> 2) Can you make it better?
Only a little, in the short term.
> Long version:
Beware that Racket strings are Unicode strings stored as UCS-4. So, I
think the benchmark makes Racket allocate 4 times as much as the OCaml
and Haskell variants.
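To illustrate the size difference (a sketch, not from the benchmark itself): an ASCII payload stored as a Racket string takes about four bytes per character, while the same payload as a byte string takes one.

```racket
#lang racket
;; Sketch: the same 1K ASCII payload as a string vs. a byte string.
;; Racket strings store characters as 4-byte (UCS-4) code points, so
;; the string version allocates roughly 4x the character data.
(define as-string (make-string 1024 #\x))                ; ~4 KB of character data
(define as-bytes  (make-bytes 1024 (char->integer #\x))) ; ~1 KB of byte data
(string-length as-string) ; 1024 characters, 4 bytes each
(bytes-length as-bytes)   ; 1024 bytes
```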
From a GC perspective, that maybe shouldn't matter too much, because
the data is mostly atomic and not traversed. Still, it reduces
locality. Making matters worse, the GC in Racket v6.5 is more eager than
intended to return pages to the OS (and the size of those pages is
proportional to the size of the message); making the GC less eager to
return pages cuts the peak GC time from 225ms to 160ms on my OS X
machine. The effect is likely OS-specific.
Changing the benchmark to use a byte string reduces the peak GC time to
140ms. The 20ms saved reflects some general locality effects, but maybe
also some remaining weakness in the GC's handling of newly unused
pages.
A side issue is that inserting consecutive numbers into an immutable
hash table is a bad case for Racket's implementation in v6.5. (I've
been meaning to look at this for a while.) It ends up creating
intermediate nodes that are larger than average, which again reduces
locality. After adjusting the implementation to avoid trouble with
sequential hash codes, the peak GC time ends up around 100ms on my
machine.
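The pattern in question is sequential integer keys going into an immutable hash. A minimal sketch of that shape (the byte-string payload here is a placeholder, not the benchmark's actual message):

```racket
#lang racket
;; Sketch: consecutive integer keys inserted into an immutable hash,
;; the case that produces larger-than-average intermediate HAMT nodes
;; in v6.5. The payload is a stand-in for the benchmark's message.
(define h
  (for/fold ([h (hash)]) ([i (in-range 1000000)])
    (hash-set h i (make-bytes 100))))
(hash-count h) ; 1000000
```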
Meanwhile, incremental GC doesn't help because it doesn't adapt well to
ramping up from a standstill. The current, first cut at incremental mode
is mostly useful in a steady state (typical for a video game). If I
increase the message count by a factor of 10, incremental GC mode
starts helping a little:
% env PLT_INCREMENTAL_GC=y racket -W "debug@GC" main.rkt |& grep MAJ
GC: 0:MAJ @ 51,218K(+37,101K)...; free 3,087K(-3,087K) 17ms @ 331
GC: 0:MAJ @ 107,847K(+29,624K)...; free 9,304K(-19,560K) 42ms @ 483
GC: 0:MAJ @ 209,809K(+38,814K)...; free 6,715K(-22,635K) 56ms @ 781
GC: 0:MAJ @ 416,454K(+44,697K)...; free 9,512K(-6,920K) 86ms @ 1477
GC: 0:MAJ @ 824,407K(+76,520K)...; free 182,074K(-182,074K) 113ms @ 2888
GC: 0:MAJ @ 1,294,379K(+98,068K)...; free 417,187K(-417,187K) 114ms @ 5066
GC: 0:MAJ @ 1,763,263K(+120,704K)...; free 647,816K(-647,816K) 64ms @ 8067
GC: 0:MAJ @ 2,028,264K(+117,847K)...; free 885,256K(-901,640K) 43ms @ 10918
GC: 0:MAJ @ 2,035,484K(+127,011K)...; free 912,992K(-912,992K) 47ms @ 13779
GC: 0:MAJ @ 2,026,836K(+135,659K)...; free 890,614K(-890,614K) 41ms @ 16503
....
A program can request incremental mode and encourage more incremental
work through the `collect-garbage` function. Adding
  (when (zero? (modulo i 100))
    (collect-garbage 'incremental)
    (collect-garbage 'minor))
to the `for/fold` loop brings peak GC time down to around 30ms (also
using the GC and HAMT improvements), but the program takes twice as
long to run. Using 50 instead of 100 in the `modulo` brings the peak
time to 20ms, but then the program takes three times as long to run.
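For concreteness, here is roughly where those calls go (a sketch; `make-message` is a hypothetical stand-in for the benchmark's actual message construction):

```racket
#lang racket
;; Sketch: request incremental mode and encourage extra incremental
;; work every 100 iterations of the benchmark's for/fold loop.
;; `make-message` is a hypothetical stand-in for the real payload.
(define (make-message i) (make-bytes 100))
(for/fold ([h (hash)]) ([i (in-range 1000000)])
  (when (zero? (modulo i 100))
    (collect-garbage 'incremental)
    (collect-garbage 'minor))
  (hash-set h i (make-message i)))
```

Trading throughput for latency this way is why the program runs slower overall: each `(collect-garbage 'minor)` forces collection work that would otherwise be deferred.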