Racket's worst-case GC latencies

Gabriel Scherer

unread,

May 14, 2016, 10:09:12 PM5/14/16

to racke...@googlegroups.com, Max New

Hi racket-devel,

Short version:
Racket has relatively poor worst-case GC pause times on a specific
benchmark:
https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt

1) Is this expected? What is the GC algorithm for Racket's old generation?
2) Can you make it better?

Long version:

## Context

James Fisher has a blog post on a case where GHC's runtime system
imposed unpleasant latencies/pauses on their Haskell program:

https://blog.pusher.com/latency-working-set-ghc-gc-pick-two/

The blog post proposes a very simple, synthetic benchmark that exhibit
the issue -- namely, if the old generation of the GC uses a copying
strategy, then those copies can incur long pauses when many large
objects are live in the old general.

I ported this synthetic benchmark to OCaml, and could check that the
OCaml GC suffers from no such issue, as its old generation uses
a mark&sweep strategy that does not copy old memory. The Haskell
benchmark has worst-case latencies around 50ms, which James Fisher
finds excessive. The OCaml benchmark has worst-case latencies around
3ms.

Max New did a port of the benchmark to Racket, which I later modified;
the results I see on my machine are relatively bad: the worst-case
pause time is between 120ms and 220ms on my tests.

I think that the results are mostly unrelated to the specific edge
case that this benchmark was designed to exercize (copies of large
objects in the old generation): if I change the inserted strings to be
of size 1 instead of 1024, I also observe fairly high latencies --
such as 120ms. So I'm mostly observing high latencies by inserting and
removing things from an immutable hash in a loop.

## Reproducing

The benchmark fills an (immutable) associative structure with strings
of length 1024 (the idea is to have relatively high memory usage per
pointer, to see large copy times), keeping at most 200,000 strings in
the working set. In total, it inserts 1,000,000 strings (and thus
removes 800,000, one after each insertion after the first 200,000). We
measure latencies rather than throughput, so the performance details
of the associative map structure do not matter.

My benchmark code in Haskell, OCaml and Racket can be found here:
https://gitlab.com/gasche/gc-latency-experiment.git
https://gitlab.com/gasche/gc-latency-experiment/tree/master
the Makefile contains my scripts to compile, run and analyze each
language's version.

To run the Racket benchmark with instrumentation

PLTSTDERR=debug@GC racket main.rkt 2> racket-gc-log

To extract the pause times from the resulting log file (in the format
produced by Racket 6.5), I do:

cat racket-gc-log | grep -v total | cut -d' ' -f7 | sort -n

Piping `| uniq --count` after that produces a poor-man histogram of
latencies. I get the following result on my machine:

1 0ms
2 1ms
1 2ms
1 3ms
2 4ms
1 5ms
1 6ms
3 8ms
12 9ms
1 11ms
2 12ms
38 13ms
126 14ms
43 15ms
13 16ms
19 17ms
4 18ms
1 19ms
1 21ms
1 48ms
1 68ms
1 70ms
1 133ms
1 165ms
1 220ms
1 227ms
1 228ms

## Non-incremental vs. incremental GC

We experimented with PLT_INCREMENTAL_GC=1; on my machine, this does
not decrease the worst-case pause time; on Asumu Takikawa's beefier
machine, I think the pause times decreased a bit -- but still well
above 50ms. Because the incremental GC consumes sensibly more memory,
I am unable to test with both PLT_INCREMENTAL_GC and
PLTSTDERR=debug@gc enabled -- my system runs out of memory.

If I reduce the benchmark sizes in half (half the working size, half
the number of iteration), I can run the incremental GC with debugging
enabled. On this half instance, I observe the following results:

for the *non-incremental* GC:

2 1ms
1 2ms
2 3ms
2 4ms
1 5ms
1 6ms
9 8ms
2 9ms
1 10ms
38 13ms
43 14ms
13 15ms
8 16ms
5 17ms
6 18ms
1 44ms
1 66ms
1 75ms
2 126ms
1 136ms
1 142ms

for the *incremental* GC

2 1ms
1 2ms
2 3ms
3 4ms
1 5ms
38 6ms
155 7ms
136 8ms
78 9ms
56 10ms
28 11ms
16 12ms
2 14ms
1 15ms
1 16ms
2 20ms
1 32ms
1 41ms
1 61ms
1 101ms
1 148ms

As you can see, the incremental GC helps, as the distribution of
pauses moves to shorter pause times: it does more, shorter,
pauses. However, the worst-case pauses do not improve -- in fact they
are even a bit worse.

Sam Tobin-Hochstadt

unread,

May 14, 2016, 10:17:51 PM5/14/16

to Gabriel Scherer, racke...@googlegroups.com, Max New

You might be interested in my gcstats package, which will do some of these statistics for you, and may allow you to run larger heaps with data gathering.

Sam

--
You received this message because you are subscribed to the Google Groups "Racket Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-dev+...@googlegroups.com.
To post to this group, send email to racke...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/racket-dev/CAPFanBE6%2BMVxG44nHnw_HCsaNwZ-HSp4L7%2BVT-v-QJ%3Div-EK1g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Matthias Felleisen

unread,

May 15, 2016, 9:16:16 PM5/15/16

to Gabriel Scherer, racke...@googlegroups.com, Max New

I didn’t quite understand your long version. Are you saying that

** on your specific machine **

Haskell’s “worst” is 50ms, OCaml is 3ms, and Racket’s is 120ms?

FWIW, I can’t imagine Haskell code using set! for this benchmark,
so I rewrote it in the obvious way (think monad). On my little laptop,
the “worst” display went down by 25%.

#lang racket/base

(define window-size 200000)
(define msg-count 1000000)

(define get-time current-inexact-milliseconds)
(define (message n)
(make-string 1024 (integer->char (modulo n 256))))

(define (push-msg chan id-high worst)
(define before (get-time))
(define id-low (id-high . - . window-size))
(define added (hash-set chan id-high (message id-high)))
(define result (if (id-low . < . 0) added (hash-remove added id-low)))
(define after (get-time))
(define duration (after . - . before))
(values result (max worst duration)))

(define-values (_ worst)
(for/fold
([chan (make-immutable-hash)]
[worst 0.0])
([i (in-range msg-count)])
(push-msg chan i worst)))
(displayln worst)

Matthew Flatt

unread,

May 15, 2016, 10:05:16 PM5/15/16

to Gabriel Scherer, racke...@googlegroups.com, Max New

At Sat, 14 May 2016 22:08:32 -0400, Gabriel Scherer wrote:
> Short version:
> Racket has relatively poor worst-case GC pause times on a specific
> benchmark:
> https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt
>
> 1) Is this expected? What is the GC algorithm for Racket's old generation?

Racket mostly uses mark-and-sweep for the old generation. It sometimes
compacts, but not usually. (I tried disabling compaction in the GC, and
it doesn't seem to affect the benchmark.)

Although Racket's incremental GC mode can help some, as you noted, it
doesn't help out-of-the-box for the benchmark.

> 2) Can you make it better?

Only a little, in the short term.

> Long version:

Beware that Racket strings are Unicode strings stored as UCS-4. So, I
think the benchmark makes Racket allocate 4 times as much as the OCaml
and Haskell variants.

From a GC perspective, that maybe shouldn't matter too much, because
the data is mostly atomic and not traversed. Still, it reduces
locality. Making matter worse, the GC in Racket v6.5 is more eager than
intended to return pages to the OS (and the size of those pages is
proportional to the size of the message); making the GC less eager to
return pages cuts the peak GC time from 225ms to 160ms on my OS X
machine. The effect is likely OS-specific.

Changing the benchmark to use a byte string reduces the peak GC time to
140ms. The 20ms saved reflects some general locality effects, but maybe
also some remaining weakness in the GC's handling of newly unused
pages.

A side issue is that inserting consecutive numbers into an immutable
hash table is a bad case for Racket's implementation in v6.5. (I've
been meaning to look at this for a while.) It ends up creating
intermediate nodes that are larger than average, which again reduces
locality. After adjusting the implementation to avoid trouble with
sequential hash codes, the peak GC time ends up around 100ms on my
machine.

Meanwhile, incremental GC doesn't help because it doesn't adapt well to
ramping up from stand-still. The current, first cut at incremental mode
is mostly useful in a steady state (typical for a video game). If I
increase the message count by a factor of 10, incremental GC mode
starts helping a little:

% env PLT_INCREMENTAL_GC=y racket -W "debug@GC" main.rkt | & grep MAJ
GC: 0:MAJ @ 51,218K(+37,101K)...; free 3,087K(-3,087K) 17ms @ 331
GC: 0:MAJ @ 107,847K(+29,624K)...; free 9,304K(-19,560K) 42ms @ 483
GC: 0:MAJ @ 209,809K(+38,814K)...; free 6,715K(-22,635K) 56ms @ 781
GC: 0:MAJ @ 416,454K(+44,697K)...; free 9,512K(-6,920K) 86ms @ 1477
GC: 0:MAJ @ 824,407K(+76,520K)...; free 182,074K(-182,074K) 113ms @ 2888
GC: 0:MAJ @ 1,294,379K(+98,068K)...; free 417,187K(-417,187K) 114ms @ 5066
GC: 0:MAJ @ 1,763,263K(+120,704K)...; free 647,816K(-647,816K) 64ms @ 8067
GC: 0:MAJ @ 2,028,264K(+117,847K)...; free 885,256K(-901,640K) 43ms @ 10918
GC: 0:MAJ @ 2,035,484K(+127,011K)...; free 912,992K(-912,992K) 47ms @ 13779
GC: 0:MAJ @ 2,026,836K(+135,659K)...; free 890,614K(-890,614K) 41ms @ 16503
....

A program can request incremental mode and encourage more incremental
work through the `collect-garbage` function. Adding

(when (zero? (modulo i 100))
(collect-garbage 'incremental)
(collect-garbage 'minor))

to the `for/fold` loop brings peak GC time down to around 30ms (also
using the GC and HAMT improvements), but the program takes twice as
long to run. Using 50 instead of 100 in the `modulo` brings the peak
time to 20ms, but then the program takes three times as long to run.

Gabriel Scherer

unread,

May 16, 2016, 11:22:23 AM5/16/16

to Matthew Flatt, racke...@googlegroups.com, Max New

Thanks for the detailed replies. I updated the Racket code and
Makefile to uses gcstats by default instead of debug@GC. In my tests,
this does not change worst-case latencies (I did not expect it
toreason it would), but the generated summary is nice and easier to
parse for non-experts than debug@GC output. The old instrumentation is
still available with run-racket-instrumented,
analyze-racket-instrumented. On my machine I get a realiable 145ms
worst-case latency on the updated benchmark -- with reduce memory
loads thanks to the bytestring suggestion.

I experimented with the case where the message (byte)strings are of
length 1 instead. In this case, I'm able to get a good compromise,
working with Matthew's suggestion that the rampup is difficult for the
incremental GC: if I perform explicit collections only during the
rampup period (when (and (i . < . window-size) (zero? ...))), I get
short pause times (30ms) with little costs in terms of throughput --
especially if I enlarge the total msg-count, which I think is fine.
However, I was unable to reproduce these rampup-only improvements with
1024.
(Another thing I found out is that the worst-case latency is very
sensitive to the choice of frequency of minor GCs during the rampup. I
got nice results with modulo 100, but worse results with modulo 50 or
even modulo 70.)

Matthew: I would like to eventually write a blog post to report on my
findings for OCaml and Racket, and of course if would be nice if
Racket looked as good as reasonably possible¹. If you were willing to
make publicly available a patchset with your GC tuning on top of
Racket's current master branch (or whatever; or apply them to master
directly, of course), I would be glad to report the numbers using
them.

Hoping to improve on the ramp-up issue, I tried to build the
window-sized chan as an associative list, and build the immutable
hashtable with (make-immutable-hash window), perform a major GC after
that, and only then start the steady-state loop that inserts and
removes from the hashtable. This did not improve thing; thinking back,
it's natural it makes no different to the GC whether the ramp-up
allocations are stored in an alist rather than in an immutable table.

¹: The criticism in GHC on that benchmark is not that a given
benchmarking code has too high latencies, but that these latencies are
impossible to optimize away, even by expert tuning. In that spirit,
I'm fine with tweaking the Racket benchmark to do essentially the same
computation, even if it has to do non-trivial things like calling the
GC explicitly. What we are modelling is whether it is possible to
reduce latencies, with enough efforts, in a given language. Of course,
large reductions of throughput may not be acceptable, so I would
probably report (as Matthew did above) on worst-case latency "with
reasonable throughput" first, then potential improvements that
sacrifice throughput.

On Sun, May 15, 2016 at 10:05 PM, Matthew Flatt <mfl...@cs.utah.edu> wrote:
> At Sat, 14 May 2016 22:08:32 -0400, Gabriel Scherer wrote:

>> Short version:
>> Racket has relatively poor worst-case GC pause times on a specific
>> benchmark:
>> https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt
>>
>> 1) Is this expected? What is the GC algorithm for Racket's old generation?
>

> Racket mostly uses mark-and-sweep for the old generation. It sometimes
> compacts, but not usually. (I tried disabling compaction in the GC, and
> it doesn't seem to affect the benchmark.)
>
> Although Racket's incremental GC mode can help some, as you noted, it
> doesn't help out-of-the-box for the benchmark.
>

>> 2) Can you make it better?
>

Matthew Flatt

unread,

May 16, 2016, 11:34:33 AM5/16/16

to Gabriel Scherer, racke...@googlegroups.com, Max New

At Mon, 16 May 2016 11:21:43 -0400, Gabriel Scherer wrote:
> If you were willing to make publicly available a patchset with your
> GC tuning on top of Racket's current master branch (or whatever; or

> apply them to master directly, of course), [...]

Certainly --- those changes are already in the master branch (pushed
before my original reply).

Gabriel Scherer

unread,

May 16, 2016, 5:32:33 PM5/16/16

to Matthew Flatt, racke...@googlegroups.com, Max New

On my machine, working from master moves the:
- non-incremental worst pause move from 150ms to 120ms
- incremental-gc worst pause from 163ms to 120ms

I experimented with explicit GC calls during the ramp-up period again,
and on master it seems to be interesting. I have to choose a
high-enough frequency (1 over 100 is not enough, 1 over 30 or 1 over
50 give net improvements), and to call the gc not only during the
rampup, but also after it. Calling the GC between (window-size / 2) and
(window-size * 2) seems to work well -- and the results are very sensitive
to these factors, for example stopping only after (window-size * 1.5) degrades
the efficiency a lot (but more than 2 doesn't help much).

With this rampup strategy, I get the pause time down to 40ms with the
incremental GC, with a modest throughput cost: if I amortize by
iterating 2 million times instead of 1 million times, explicit GC
during rampup moves total runtime from 11s to 13s.
(These explicit calls only help when the incremental GC is used
(obviously ?). In fact they actively hurt when the non-incremental GC
is used.)

I added the ramp-up code, disabled by default, to the Racket
benchmark. I'm satisfied with the current results and I think they'll
make for a fine post (I think explaining the ramp-up problem and how
to observe/detect it is more important and useful than the details of
the current hack to mitigate it).

On Mon, May 16, 2016 at 11:34 AM, Matthew Flatt <mfl...@cs.utah.edu> wrote:
> At Mon, 16 May 2016 11:21:43 -0400, Gabriel Scherer wrote:

>> If you were willing to make publicly available a patchset with your
>> GC tuning on top of Racket's current master branch (or whatever; or

Gabriel Scherer

unread,

May 30, 2016, 2:07:41 PM5/30/16

to Matthew Flatt, racke...@googlegroups.com, Max New

My blog post is out:
http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/

Thanks to everyone that participated, the discussion allowed me to noticeably improve the Racket benchmark, its analysis tools, and provide for an interesting discussion. Thanks in particular to Max for providing the initial implementation that I could play with.

Reply all

Reply to author

Forward