
[Haskell-cafe] Real-time garbage collection for Haskell


Luke Palmer

Feb 28, 2010, 12:21:16 AM
to haskell-cafe
I have seen some proposals around here for SoC projects and other
things to try to improve the latency of GHC's garbage collector. I'm
currently developing a game in Haskell, and even 100ms pauses are
unacceptable for a real-time game. I'm calling out to people who have
seen or made such proposals, because I would be willing to contribute
funding and/or mentor a project that would contribute to this goal.
Also any ideas for reducing this latency in other ways would be very
appreciated.

Thanks,
Luke
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Pavel Perikov

Feb 28, 2010, 4:07:07 AM
to haskell-cafe
Did you really see 100ms pauses?! I never did extensive research on this, but my numbers are rather in the microsecond range (below 1ms). What causes such long garbage collections? Lots of allocated and long-lived objects?

Pavel.

Neil Davies

Feb 28, 2010, 4:22:05 AM
to Pavel Perikov, Luke Palmer, haskell-cafe
My experience agrees with Pavel.

I've never observed pauses of that size. I have an application that runs in
'rate equivalent real-time' (i.e. there may be some jitter in the
exact time of events but it does not accumulate). It does have some
visibility of the likely time of future events and uses that to perform
some speculative garbage collection. GC is pretty short and I've not
seen an effect > 1ms in those runs (all the usual caveats apply - my
programs are not your programs, etc.).


Neil

Andrew Coppin

Feb 28, 2010, 5:30:07 AM
to haskel...@haskell.org
Luke Palmer wrote:
> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector.

I'm guessing making the GC concurrent (i.e., so you can perform GC
without having to stop all Haskell threads) would probably help in the
multithreaded case...

(I'm also guessing this is slightly nontrivial to implement. :-) )

Heinrich Apfelmus

Feb 28, 2010, 11:04:15 AM
to haskel...@haskell.org
Luke Palmer wrote:
> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector. I'm
> currently developing a game in Haskell, and even 100ms pauses are
> unacceptable for a real-time game. I'm calling out to people who have
> seen or made such proposals, because I would be willing to contribute
> funding and/or mentor a project that would contribute to this goal.
>
> Also any ideas for reducing this latency in other ways would be very
> appreciated.

Overly long garbage collection might also be a sign of space leaks.

But there are many other things that can go wrong in a real time system
and might explain your delays. For example, you might need to avoid
amortized-time data structures like Data.Sequence. Or for physics
simulations, you'd need to fix the time step ∆t, as described in

http://gafferongames.com/game-physics/fix-your-timestep/

or numerical integration will deteriorate rather quickly.
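
The fixed-timestep pattern from the linked article translates directly to Haskell. A minimal sketch (all names here are illustrative, not from any particular game library):

```haskell
-- Minimal sketch of the fixed-timestep pattern: run physics steps of
-- exactly dt, carrying leftover frame time in an accumulator so the
-- simulation rate is independent of the rendering rate.
fixedSteps :: Double           -- ^ fixed timestep dt
           -> Double           -- ^ accumulated unsimulated time
           -> (s -> s)         -- ^ one physics step of size dt
           -> s
           -> (s, Double)      -- ^ new state and remaining accumulator
fixedSteps dt acc step st
  | acc >= dt = fixedSteps dt (acc - dt) step (step st)
  | otherwise = (st, acc)

main :: IO ()
main = do
  -- A 16ms frame with dt = 5ms: three physics steps, ~1ms carried over.
  let (n, rest) = fixedSteps 0.005 0.016 (+1) (0 :: Int)
  print n                          -- prints 3
  print (rest >= 0 && rest < 0.005)
```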


Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Derek Elkins

Feb 28, 2010, 1:35:49 PM
to Heinrich Apfelmus, haskel...@haskell.org
On Sun, Feb 28, 2010 at 10:03 AM, Heinrich Apfelmus
<apfe...@quantentunnel.de> wrote:
> Luke Palmer wrote:
>> I have seen some proposals around here for SoC projects and other
>> things to try to improve the latency of GHC's garbage collector.  I'm
>> currently developing a game in Haskell, and even 100ms pauses are
>> unacceptable for a real-time game.  I'm calling out to people who have
>> seen or made such proposals, because I would be willing to contribute
>> funding and/or mentor a project that would contribute to this goal.
>>
>> Also any ideas for reducing this latency in other ways would be very
>> appreciated.
>
> Overly long garbage collection might also be a sign of space leaks.
>
> But there are many other things that can go wrong in a real time system
> and might explain your delays. For example, you might need to avoid
> amortized time data structures like  Data.Sequence . Or for physics
> simulations, you'd need to fix the time step ∆t, as described in
>
>   http://gafferongames.com/game-physics/fix-your-timestep/
>
> or numerical integration will deteriorate rather quickly.

Incidentally, what's described there is a simplified version of the
frequency locked loops described in Massalin's thesis on Synthesis OS,
and it is used there for about the same purpose in a soft real-time
context.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4871

Luke Palmer

Feb 28, 2010, 7:05:01 PM
to Pavel Perikov, haskell-cafe
On Sun, Feb 28, 2010 at 2:06 AM, Pavel Perikov <per...@gmail.com> wrote:
> Did you really seen 100ms pauses?! I never did extensive research on this but my numbers are rather in microseconds range (below 1ms). What causes such a long garbage collection? Lots of allocated and long-living objects?

This is all hypothetical right now. I heard some horror stories in
which people had to move the main game loop into C++ and only do
the AI logic in Haskell because of pauses. I would rather not do
that, especially because this project is *about* proving Haskell as a
viable game development platform. So I am trying to be prepared if I
do see something like that, so that it doesn't put the show on hold
for a few months.

Presumably large, long-lived objects would cause the generation 0
collections to take a long time. I am not sure if we will have any
such objects, but we can't rule it out...

Thanks for the positive reassurances, at least. I'd like to hear from
people with the opposite experience, if there are any.

Peter Verswyvelen

Mar 1, 2010, 8:41:47 AM
to Luke Palmer, haskell-cafe
Sounds like we need to come up with some benchmarking programs so we
can measure the GC latency and soft-realtimeness...

PS: Regarding Haskell and games: the University of Utrecht teaches
Haskell in their brand new "game technology" course :-)

Sönke Hahn

Mar 1, 2010, 9:16:42 AM
to haskel...@haskell.org
On Monday 01 March 2010 01:04:37 am Luke Palmer wrote:
> On Sun, Feb 28, 2010 at 2:06 AM, Pavel Perikov <per...@gmail.com> wrote:
> > Did you really seen 100ms pauses?! I never did extensive research on this
> > but my numbers are rather in microseconds range (below 1ms). What causes
> > such a long garbage collection? Lots of allocated and long-living
> > objects?
>
> This is all hypothetical right now. I heard some horror stories in
> which people had to switch to the main game loop in C++ and only do
> the AI logic in Haskell because of pauses. I would rather not do
> that, especially because this project is *about* proving Haskell as a
> viable game development platform. So I am trying to be prepared if I
> do see something like that, so that it doesn't put the show on hold
> for a few months.
>
> Presumably large, long-living objects would cause the generation 0
> collections to take a long time. I am not sure if we will have any
> said objects, but we can't rule it out...
>
> Thanks for the positive reassurances, at least. I'd like to hear from
> people with the opposite experience, if there are any.

Yes there are. I am working on a game with Haskell using OpenGL rendering.
I've done some frame time measurements lately and encountered single frames
needing more than 100ms to be rendered. I am currently trying to gather
information on what is going on in these 100ms and why. From what I
understand, the GC is running very often and just some (very few) of its runs
are very slow.

BTW: switching to parallel GC (either with -N1 or -N2 (on a dual core
machine)) made the behavior MUCH worse, for some reason.

Sönke

Thomas Schilling

Mar 1, 2010, 9:53:48 AM
to Luke Palmer, haskell-cafe
On 28 February 2010 05:20, Luke Palmer <lrpa...@gmail.com> wrote:
> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector. I'm
> currently developing a game in Haskell, and even 100ms pauses are
> unacceptable for a real-time game. I'm calling out to people who have
> seen or made such proposals, because I would be willing to contribute
> funding and/or mentor a project that would contribute to this goal.
> Also any ideas for reducing this latency in other ways would be very
> appreciated.

There is a SoC project suggestion to implement Immix's ideas [1] in
GHC's GC. The two already use similar overall designs: both split the
heap into regions which may employ different collection strategies.
However, Immix does not address real-time issues.

The main difficulty with real-time GC is that, while first-generation
collection is usually very fast, eventually you just have to collect
the old generation and you have to do it all at once. Sun's new
Garbage-First ("G1") [2] collector therefore tracks pointers between
regions, as opposed to just pointers from older to newer generations.
This allows collecting regions independently (and in parallel). G1
is still stop-the-world, although the marking phase is concurrent.
Tracking pointers between all regions can result in quite substantial
space overheads, however, so G1 uses some heuristics to discover
"popular objects" and treats them specially. In a personal
conversation Simon Marlow expressed to me that he intends to go
further into this direction, but I don't know how high-priority it is.
In general I don't think true real-time is the goal in any case, but
rather a general effort to keep GC-pauses short.

Truly concurrent garbage collection is a whole different beast.
Concurrent marking can be implemented efficiently with a write
barrier. I don't know of any fully concurrent GC scheme that gets by
without a read barrier and significant space overhead, however. There
are certainly no plans from the GC HQ to implement a fully concurrent
GC.

[1]: http://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf
[2]: http://research.sun.com/jtech/pubs/04-g1-paper-ismm.pdf

--
Push the envelope. Watch it bend.

Job Vranish

Mar 1, 2010, 11:27:34 AM
to Thomas Schilling, haskell-cafe
My current area of work is on realtime embedded software programming for
avionics systems. We do most of our coding in Ada, but I've been dreaming of
using Haskell instead.

However, the garbage collector is actually one of the larger obstacles to
making this happen.

All of our avionics software needs to be certified by various regulatory
agencies, and there are varying levels of certification depending on
criticality. For the higher certification levels we would need to be able to
be sure (or at least very, very confident) that the GC will collect everything
within a fixed amount of time, and that it won't take more than some fixed
amount of time per major frame to do it.

A delay of several milliseconds that could occur effectively at random is
completely unacceptable.

I would be very interested in alternative GC algorithms/approaches that
would have more deterministic/realtime behavior. I would even be
willing to help out if there is other interest in this area :)


As a side note, I ran across an article on a way to use 100% reference
counting in a pure language by using weak references and being careful how
you preserve the weak/strong references during graph reduction:
http://comjnl.oxfordjournals.org/cgi/content/abstract/33/5/466
I don't want to pay $25 for the article though so I don't know how viable it
is. It would probably have lower performance than the current generational
GC but in this case I'd be willing to trade performance for determinism.
Has anyone heard of this algorithm before?

- Job

Sebastian Sylvan

Mar 1, 2010, 12:16:57 PM
to Luke Palmer, haskell-cafe
On Sun, Feb 28, 2010 at 5:20 AM, Luke Palmer <lrpa...@gmail.com> wrote:

> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector. I'm
> currently developing a game in Haskell, and even 100ms pauses are
> unacceptable for a real-time game. I'm calling out to people who have
> seen or made such proposals, because I would be willing to contribute
> funding and/or mentor a project that would contribute to this goal.
> Also any ideas for reducing this latency in other ways would be very
> appreciated.
>

Since we're talking games here (my profession), I'd point out that it would
be cool to be able to supply "hints" to the GC about when might be a good
time to do a GC (without unconditionally forcing it); games in particular
have some pretty specific properties that may be exploitable.

Presumably a non-trivial portion of the objects copied from the nursery/G0
are actually short-lived objects that just happened to have their short
life-span overlap with the collection. So really, copying them to the next
generation is a "mistake" in some sense. For games, though, we have a very
good point that occurs regularly where we know that all/most short-lived
objects will no longer be referenced - at the start of a fresh frame.
So if we could do as many as possible of our G0 collections at that point
we'd avoid "accidental copying" of objects that are actually short-lived
into the older generation (which should reduce pressure on that older
generation, as well as speed up G0 collection). Ideally we'd have some way
of telling the GC to try to avoid running during the actual frame itself,
too, by for example tuning the heap region sizes automatically.
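
In current GHC the closest approximation to such a hint is forcing a collection at the frame boundary with System.Mem.performGC. A hedged sketch; the loop and names are made up for illustration:

```haskell
import Control.Monad (foldM)
import System.Mem (performGC)

-- Hypothetical frame loop: collect at the frame boundary, when most
-- per-frame garbage is already dead. Note that performGC forces a full
-- collection, so it is a blunter tool than the hint proposed above.
runFrames :: Int -> w -> (w -> IO w) -> IO w
runFrames n w0 frame =
  foldM (\w _ -> performGC >> frame w) w0 [1 .. n]

main :: IO ()
main = do
  w <- runFrames 3 (0 :: Int) (return . (+ 1))
  print w   -- prints 3
```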


--
Sebastian Sylvan

Henning Thielemann

Mar 1, 2010, 1:13:10 PM
to Luke Palmer, haskell-cafe

On Sat, 27 Feb 2010, Luke Palmer wrote:

> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector. I'm
> currently developing a game in Haskell, and even 100ms pauses are
> unacceptable for a real-time game. I'm calling out to people who have
> seen or made such proposals, because I would be willing to contribute
> funding and/or mentor a project that would contribute to this goal.
> Also any ideas for reducing this latency in other ways would be very
> appreciated.

In my experiments with real-time audio signal processing I could always
find a culprit for buffer-underflows other than the garbage collector.
Sometimes it was a space leak (e.g. by adding a finalizer to the wrong
object), sometimes incorrect handling of time differences, and when
working with LLVM it was frequent recompilation of LLVM code.

Thomas Schilling

Mar 1, 2010, 2:37:54 PM
to Job Vranish, haskell-cafe
On 1 March 2010 16:27, Job Vranish <job.v...@gmail.com> wrote:
> My current area of work is on realtime embedded software programming for
> avionics systems. We do most of our coding in Ada but I've been dreaming of
> using haskell instaed.

Do you really think this is realistic? Garbage collector aside,
Haskell's execution model is very difficult to predict, which I would
suspect is crucial for even soft real-time systems. The whole concept
of lazy evaluation seems to run counter to the idea of real-time
systems. Lazy evaluation essentially says "do as little as possible
*now*" at the expense of having to do it all later. For a real-time
system you want almost the opposite; you want to make sure that you
complete all the required work in the current time slice.

A possible workaround would be to sprinkle lots of 'rnf's around your
code to make sure you don't build up a thunk or two that will delay
you later. And if you do this, aren't you essentially programming in
a strict functional language (like SML or O'Caml)? By careful
profiling and auditing you can probably rule out most of the
potential bad cases, so it can be acceptable for a soft real-time
system (Galois did something like this, I believe). But for avionics
systems you probably want more assurances than that, don't you?
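
The "sprinkle rnf's" approach can be made less ad hoc with the deepseq package. A sketch, with GameState as a stand-in type of my own invention:

```haskell
import Control.DeepSeq (NFData (..), deepseq)

-- Stand-in per-frame state: a strict score field and a lazy list of
-- positions that could otherwise accumulate thunks across frames.
data GameState = GameState
  { score     :: !Int
  , positions :: [(Double, Double)]
  }

instance NFData GameState where
  rnf (GameState s ps) = s `seq` rnf ps

tick :: GameState -> GameState
tick g = g { score = score g + 1 }

-- Force the whole state at the end of each frame so no deferred work
-- survives into later frames (the "strict language" discipline above).
advance :: GameState -> GameState
advance g = let g' = tick g in g' `deepseq` g'

main :: IO ()
main = print (score (advance (GameState 0 [(1, 2)])))  -- prints 1
```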

John Van Enk

Mar 1, 2010, 3:17:10 PM
to Thomas Schilling, haskell-cafe
> The whole concept of lazy evaluation seems to run counter to the idea of
> real-time systems.

Hi Thomas,

Lazy evaluation is okay since it has deterministic characteristics. We can
predict what will happen quite accurately (heck, we can model it in simpler
cases). It might take a while to get people comfortable with the concept,
but it wouldn't be a show stopper (actually, some people would benefit
greatly from lazy evaluation and referential transparency).

The GC throws a wrench in a system which would otherwise make it past
certification with enough effort. If we can write a GC that can be modeled,
we'd have an excellent case for using Haskell in aerospace.

/jve

Job Vranish

Mar 1, 2010, 4:07:18 PM
to Thomas Schilling, haskell-cafe
On Mon, Mar 1, 2010 at 2:37 PM, Thomas Schilling <nomi...@googlemail.com> wrote:

> On 1 March 2010 16:27, Job Vranish <job.v...@gmail.com> wrote:
> > My current area of work is on realtime embedded software programming for
> > avionics systems. We do most of our coding in Ada but I've been dreaming
> of
> > using haskell instaed.
>

> A possible workaround would be to sprinkle lots of 'rnf's around your
> code to make sure you don't build up a thunk or two that will delay
> you later. And if you do this, aren't you essentially programming in
> a strict functional language (like SML or O'Caml)? By careful
> profiling you and auditing you can probably rule out most of the
> potential bad cases, so it can be acceptable for a soft real-time
> system (Galois did something like this, I believe). But for avionics
> systems you probably want to more assurances than that, don't you?
>

Yes and no.
It's true that lazy evaluation makes reasoning about timings a bit more
difficult (and might not be usable in very time-critical scenarios), but it
still has well-defined, deterministic behavior.

It's the referential transparency that saves us here. If you run a lazy
function with the same objects (in the same evaluation state) it should
_theoretically_ take the same amount of time to run. All of our toplevel
inputs will be strict, and if we keep our frame-to-frame state strict, our
variances in runtimes, given the same inputs, should be quite low modulo the
GC.

Even our current code can take significantly different amounts of time to
compute things depending on what you're doing. Some waypoints take longer to
lookup from the database than others. Predicting the time to arrival can
take significantly longer/shorter depending on seemingly trivial parameters,
etc...

It matters less that code always takes the same amount of time to run
(though it needs to always be less than the frame time) and more so that it
always takes the same amount of time to run given the same initial
conditions.

- Job

Neil Davies

Mar 1, 2010, 4:34:32 PM
to Job Vranish, haskell-cafe
I don't know that hanging your hat on the deterministic coat hook is
the right thing to do.

The way that I've always looked at this is more probabilistic - you
want the result to arrive within a certain time frame for a certain
operation with a high probability. There is always the probability
that the h/w will fail anyway - you could even reason that the
software taking too long is just a transient fault that clears -
random (non-correlated - preferably a bernoulli choice) failures are
OK, non-deterministic ones aren't.

This probabilistic, low probability of being at the tail of timing,
approach would give a lot more flexibility in any form of (say
incremental) GC - you may not be able to bound the incremental steps
absolutely but a strong probabilistic bound might well be more
achievable.

Neil

Thomas Schilling

Mar 1, 2010, 5:02:04 PM
to Neil Davies, haskell-cafe
On 1 March 2010 21:34, Neil Davies <semanticp...@googlemail.com> wrote:
> I don't know that hanging your hat on the deterministic coat hook is the
> right thing to do.
> The way that I've always looked at this is more probabilistic - you want the
> result to arrive within a certain time frame for a certain operation with a
> high probability.

That's where I would think the difference between hard and soft
real-time lies, as I understand it. Of course, "real" hard real-time
is practically impossible on modern hardware due to caches,
branch prediction, out-of-order execution and the like.

> There is always the probability that the h/w will fail
> anyway - you could even reason that the software taking too long is just a
> transient fault that clears - random (non-correlated - preferably a
> bernoulli choice) failures are OK, non-deterministic ones aren't.
> This probabilistic, low probability of being at the tail of timing, approach
> would give a lot more flexibility in any form of (say incremental) GC - you
> may not be able to bound the incremental steps absolutely but a strong
> probabilistic bound might well be more achievable.

The Garbage-First paper showed some promising but not spectacular
successes in keeping the soft real-time goal. I don't know the
correct numbers off the top of my head, but I think they missed the
deadline in > 5% of the cases. I would assume that it should be < 1%
if you want to treat it as a random failure. I understand that in a
fault-tolerant system you have to handle worse things than that, but
you make assumptions about the likelihood of each kind of error
(otherwise you may run into QoS issues).

As Job and John have pointed out, though, laziness per se doesn't seem
to be an issue, which is good to hear. Space leaks might, but there
is no clear evidence that they are particularly harder to avoid than
in strict languages. As Röjemo and Runciman put it: "By using heap
profiles on a lazy language we find problems with lazy languages.
Using it on a strict language we would find problems with strict
languages too." [1] (very recommended paper, btw.)

[1]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219


--
Push the envelope. Watch it bend.

Jeremy Shaw

Mar 1, 2010, 6:18:50 PM
to Luke Palmer, haskell-cafe
I am just going to jump on the RT dogpile and mention that I too have wanted
RT friendly GC in Haskell. I have attempted on more than one occasion to
implement real-time audio applications in Haskell. But I tend to get a lot
of buffer underruns, most likely due to GC.

For audio apps, there is a callback that happens every few milliseconds, as
often as every 4ms. The callback needs to finish as quickly as possible to
avoid buffer underruns. So, in theory, I believe I would want garbage collection
to only happen in the time periods between when the callback is running.
This wouldn't affect the total amount of time that garbage collection ran --
but it would help minimize the amount of time spent in the callback, which
should reduce buffer underruns.

My feeling right now is that the 'best' solution might be something similar
to Synthesis OS. I would create a DSL for the realtime DSP code. Using
harpy, this DSL would be compiled to assembly with execution time guarantees
(as much as can be predicted on modern hardware). For a 'modular synth' this
might actually be better than C code, because the effects of certain choices
could be 'compiled in' instead of having to check the setting via a compare.
(that is what Synthesis OS does).

- jeremy

Simon Cranshaw

Mar 1, 2010, 9:33:46 PM
to haskell-cafe
On Sun, Feb 28, 2010 at 6:06 PM, Pavel Perikov <per...@gmail.com> wrote:

> Did you really seen 100ms pauses?! I never did extensive research on this
> but my numbers are rather in microseconds range (below 1ms). What causes
> such a long garbage collection? Lots of allocated and long-living objects?
>
>

I am using an automated options trading system written in Haskell. I'm more
on the business side than the technical side of the issues so I'm not clear
on all the details. I can confirm that without tweaking the RTS settings we
were seeing over 100ms GC pauses. I've mainly been trying to minimise our
overall response time and we were able to improve this by increasing the
allocation area with -A. I think this brought GC well under 100ms. We are
still working on analysis of this.

I can also confirm, as others seem to have found, that under 6.12 the
parallel GC seemed to make things much worse. I am always turning it off
with -qg. If there is a project to improve performance of the GC I could be
interested to contribute.
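
For reference, the flags mentioned combine on the command line roughly like this (the program name and sizes here are placeholders, not our actual settings):

```shell
# Enlarge the allocation area to reduce minor-GC frequency, disable the
# parallel GC, and dump GC statistics to stderr for later analysis.
./trading-system +RTS -A64m -qg -sstderr -RTS
```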

Simon Cranshaw

John Van Enk

Mar 2, 2010, 12:06:48 AM
to Simon Cranshaw, haskell-cafe
Simon,

Would a more predictable GC or a faster GC be better in your case? (Of
course, both would be nice.)

/jve

Simon Cranshaw

Mar 2, 2010, 12:22:42 AM
to John Van Enk, haskell-cafe
John,

I suspect speed is more important than timing. In practice I don't mind a
pause for a gc when nothing is happening in the market. It's when the
market is moving fast that we particularly want to keep our response time
low. Busy times may continue for long periods though and I'm not sure if
being able to defer gc (if that's what you're suggesting) would be possible
for that long. We are still studying how the gc times are interacting with
our response times and so hopefully I will have a better answer to this
later on.

Simon

Simon Marlow

Mar 2, 2010, 8:51:06 AM
to Luke Palmer, haskell-cafe
On 01/03/2010 00:04, Luke Palmer wrote:
> On Sun, Feb 28, 2010 at 2:06 AM, Pavel Perikov <per...@gmail.com> wrote:
>> Did you really seen 100ms pauses?! I never did extensive research on this but my numbers are rather in microseconds range (below 1ms). What causes such a long garbage collection? Lots of allocated and long-living objects?
>
> This is all hypothetical right now. I heard some horror stories in
> which people had to switch to the main game loop in C++ and only do
> the AI logic in Haskell because of pauses. I would rather not do
> that, especially because this project is *about* proving Haskell as a
> viable game development platform. So I am trying to be prepared if I
> do see something like that, so that it doesn't put the show on hold
> for a few months.
>
> Presumably large, long-living objects would cause the generation 0
> collections to take a long time. I am not sure if we will have any
> said objects, but we can't rule it out...
>
> Thanks for the positive reassurances, at least. I'd like to hear from
> people with the opposite experience, if there are any.

By default GHC uses a generational GC with a 512KB nursery size, which
means that most GC pauses will be very short but just occasionally you
have to do a major GC and there's no upper bound on the pause time for
that collection, because the whole live heap is traversed. The pause
time for a major collection is proportional to the amount of live data,
so the people who are experiencing very short pauses probably have
little live data and/or have allocation patterns that mean the old
generation is collected very infrequently.
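
To see this distinction in a concrete program, the RTS can report each collection as it happens; a hedged example invocation (the program name is a placeholder):

```shell
# -S prints one line per collection (bytes allocated, bytes copied,
# pause time, generation), which makes the occasional long major-GC
# pause easy to spot among the many short minor ones.
./game +RTS -S -RTS
```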

Monolithic major GC is a big stinking scalability problem, and the only
solutions are to do concurrent GC, incremental GC, or region-based GC.

Both concurrent GC and incremental GC tend to add overheads to the
mutator, because they need a read barrier. There was an incremental GC
for GHC once [1], taking advantage of the built-in read barrier that we
have whereby most closures are "entered", and it more-or-less worked but
was quite complicated and had some code-size overhead. Nowadays part of
this built-in read barrier has been eroded by pointer tagging, which
makes things a bit more tricky.

Region-based GC is a generalisation of generational GC, where you divide
the heap into regions and track all pointers between regions, so a given
collection can collect any subset of the regions. This basic idea is
quite popular amongst the Java people these days, Thomas mentioned the
G1 collector which does this and uses a concurrent mark phase to
identify which regions to collect. Regardless of how you figure out
which regions to collect, the basic idea of breaking up the old
generation like this is a good way to reduce pause times.

At the moment I'm focussed on something different: making individual
processors able to collect their own heaps independently, so removing
the stop-the-world requirement from minor GCs. This should help
parallel throughput a lot, but won't fix the major GC pause times. I am
slightly worried that the GC can't bear the extra complexity of adding
both processor-independent GC *and* region-based GC or some other
pause-time-reducing technique. But we'll see. I'll happily
supervise/mentor anyone who wants to tackle this.

Cheers,
Simon

[1]
http://delivery.acm.org/10.1145/360000/351265/p257-cheadle.pdf?key1=351265&key2=8540457621&coll=GUIDE&dl=GUIDE&CFID=80115628&CFTOKEN=59704548

Simon Marlow

Mar 2, 2010, 8:55:38 AM
to Sönke Hahn, haskel...@haskell.org

Probably due to cache effects - shipping the data off to a different
core in the GC can far outweigh any benefits you would have got by
traversing the heap in parallel. For a single-threaded program,
parallel GC tends to only help when the heap gets large (over 50MB at a
guess). For parallel programs, parallel GC helps by keeping the data in
the right cache.

I'm finding that for parallel programs turning off load-balancing with
+RTS -qb often helps, because load-balancing is bad for locality.
Again, if the heap gets large, then collecting it in parallel is
probably more important than keeping it in the cache.

Cheers,
Simon

Simon Marlow

Mar 2, 2010, 9:11:17 AM
to Thomas Schilling, haskell-cafe

A good summary of concurrent GC techniques used successfully in industry
was posted to the GC mailing list recently by Erez Petrank:

http://permalink.gmane.org/gmane.comp.programming.garbage-collection.general/388

Several of those we could do in GHC, for example mostly-concurrent
collection uses only a write barrier, and piggybacks on the generational
write-barrier we already have, but it does require two stop-the-world
phases per GC. I think you'd want to do it in conjunction with
Immix-style region sweeping in the old generation, so implementing Immix
would be a good first step.

Cheers,
Simon

Simon Marlow

Mar 2, 2010, 9:18:15 AM
to Sebastian Sylvan, haskell-cafe
On 01/03/2010 17:16, Sebastian Sylvan wrote:
>
>
> On Sun, Feb 28, 2010 at 5:20 AM, Luke Palmer <lrpa...@gmail.com> wrote:
>
> I have seen some proposals around here for SoC projects and other
> things to try to improve the latency of GHC's garbage collector. I'm
> currently developing a game in Haskell, and even 100ms pauses are
> unacceptable for a real-time game. I'm calling out to people who have
> seen or made such proposals, because I would be willing to contribute
> funding and/or mentor a project that would contribute to this goal.
> Also any ideas for reducing this latency in other ways would be very
> appreciated.
>
>
> Since we're talking games here (my profession), I'd point out that it
> would be cool to be able to supply "hints" to the GC about when might be
> a good time to do a GC (without unconditionally forcing it), games in
> particular have some pretty specific properties that may be exploitable.
>
> Presumably a non-trivial portion of the objects copied from the
> nursery/G0 are actually short-lived objects that just happened to have
> their short life-span overlap with the collection. So really, copying
> them to the next generation is a "mistake" in some sense.

There's a technique we use, that mitigates the effect of this to some
extent, called "aging". The idea is that an object is only promoted if
it survives at least N GCs in the young generation. Typically N=2 is a
good value, so an object that is allocated just before a minor GC will
be copied, but probably not promoted unless it survives all the way to
the next GC. You can also use fractional values of N and in fact you
should measure N in terms of bytes allocated rather than GCs. In GHC
6.12.1 you can tune the amount of aging with +RTS -T<n>, but I removed
the option in 6.14 in order to make the GC implementation simpler; we
now have a fixed -T2 aging policy. In practice, other values were very
rarely any better.

> For games,
> though, we have a very good point that occurs regularly where we know
> that all/most short-lived objects will no longer be referenced - at the
> start of a fresh frame.

System.Mem.performGC is your friend, but if you're unlucky it might do a
major GC and then you'll get more pause than you bargained for.

Cheers,
Simon

Simon Marlow

Mar 2, 2010, 9:26:50 AM
to Malcolm Wallace, haskell-cafe
On 02/03/2010 14:11, Malcolm Wallace wrote:
>> Both concurrent GC and incremental GC tend to add overheads to the
>> mutator, because they need a read barrier. There was an incremental GC
>> for GHC once [1], taking advantage of the built-in read barrier that
>> we have whereby most closures are "entered"
>
> Was there a specific reason why that GC implementation chose to use a
> read barrier rather than a write barrier? I would have thought that in
> general, a write barrier is cheaper to implement. Doesn't GHC update
> fewer thunks than it enters?

If the GC wants to move objects, then a read barrier is needed in order
to figure out whether you're looking at an object that has moved or not.
If you don't move objects, then you can get away with only a write
barrier - the write barrier is needed so that you can remember which
objects or fields might need to be re-scanned because they were mutated.

The choice made in the Non-stop Haskell paper was to take advantage of
the existing read barrier so that we could move objects.

Some schemes copy objects rather than move them, and then you can get
away without a read barrier, but then your write barrier has to keep the
two copies in sync. Tradeoffs are everywhere, GC is a tricky business!

Cheers,
Simon

Luke Palmer

Mar 2, 2010, 3:37:44 PM
to Simon Marlow, haskell-cafe
On Tue, Mar 2, 2010 at 7:17 AM, Simon Marlow <marl...@gmail.com> wrote:
>> For games,
>> though, we have a very good point that occurs regularly where we know
>> that all/most short-lived objects will no longer be referenced - at the
>> start of a fresh frame.
>
> System.Mem.performGC is your friend, but if you're unlucky it might do a
> major GC and then you'll get more pause than you bargained for.

Some fine-grained control might be nice here. Eg. I could do a major
GC as a player is opening a menu, on a loading screen, when the game
is paused, or some other key points, and it might still be annoying,
but at least it wouldn't interfere with gameplay. There is of course
the question of what happens if one of these key points doesn't happen
when we need to do an allocation, but... oh well. Perhaps that could
be mitigated by saying "I would rather you allocate than major GC
right now". Are any of these options impossible, or would they be
unreasonably difficult to implement (I don't suspect so)?

Luke

Simon Marlow

Mar 2, 2010, 4:39:21 PM
to Luke Palmer, haskell-cafe
On 02/03/10 20:37, Luke Palmer wrote:
> On Tue, Mar 2, 2010 at 7:17 AM, Simon Marlow<marl...@gmail.com> wrote:
>>> For games,
>>> though, we have a very good point that occurs regularly where we know
>>> that all/most short-lived objects will no longer be referenced - at the
>>> start of a fresh frame.
>>
>> System.Mem.performGC is your friend, but if you're unlucky it might do a
>> major GC and then you'll get more pause than you bargained for.
>
> Some fine-grained control might be nice here. Eg. I could do a major
> GC as a player is opening a menu, on a loading screen, when the game
> is paused, or some other key points, and it might still be annoying,
> but at least it wouldn't interfere with gameplay. There is of course
> the question of what happens if one of these key points doesn't happen
> when we need to do an allocation, but... oh well. Perhaps that could
> be mitigated by saying "I would rather you allocate than major GC
> right now". Are any of these options impossible, or would they be
> unreasonably difficult to implement (I don't suspect so)?

Actually that's one thing we can do relatively easily, i.e. defer major
GC for a while. Due to the way GHC has a two-layer memory manager, the
heap is a list of discontiguous blocks, so we can always allocate some
more memory.

So it would be pretty easy to provide something like

disableMajorGC, enableMajorGC :: IO ()

Of course leaving it disabled too long could be bad, but that's your
responsibility.

Oh, I just checked and System.Mem.performGC actually performs a major
GC, here's its implementation:

foreign import ccall {-safe-} "performMajorGC" performGC :: IO ()

to perform a minor GC (or possibly a major GC if one is due), you want this:

foreign import ccall {-safe-} "performGC" performMinorGC :: IO ()
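A minimal self-contained sketch of using these two RTS entry points from a frame loop (the `frameBoundary` name and frame structure are hypothetical; both C functions are exported by the RTS, so the imports link in any GHC-compiled program):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

-- The RTS entry points discussed above: "performGC" runs a minor GC
-- (or a major one if one is due); "performMajorGC" always runs a
-- major GC.
foreign import ccall safe "performGC"      performMinorGC :: IO ()
foreign import ccall safe "performMajorGC" performFullGC  :: IO ()

-- Hypothetical per-frame hook: request a cheap minor collection at the
-- frame boundary, where most per-frame allocations are already dead.
frameBoundary :: IO ()
frameBoundary = performMinorGC
```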

Cheers,
Simon

Jason Dusek

Mar 2, 2010, 7:01:05 PM
to Neil Davies, haskell-cafe
2010/02/28 Neil Davies <semanticp...@googlemail.com>:
> I've never observed ones that size. I have an application that runs in 'rate
> equivalent real-time' (i.e. there may be some jitter in the exact time of
> events but it does not accumulate). It does have some visibility of likely
> time of future events and uses that to perform some speculative garbage
> collection.

Do you have information on how it behaves without speculative
GC?

--
Jason Dusek

Neil Davies

Mar 3, 2010, 3:43:59 AM
to Simon Marlow, haskell-cafe

Is there a similar set of runes to be able to see how much mutation
has occurred, how much was live last GC, etc?

Neil

Neil Davies

Mar 3, 2010, 3:51:09 AM
to Jason Dusek, Neil Davies, haskell-cafe
Sorry, no.

We wanted a basic bound on the jitter - the application is not one
that creates much (if any) long-lived heap.

Having just seen Simon's email on the fact that performGC forces a
major GC - I think that there is some new mileage here in making the
speculative GCs minor ones.

More control needs some more instrumentation of how much mutation is
occurring, and ways of estimating how much of it is short- versus
long-lived. I know that past history is not necessarily a good
indicator of future actions, but visibility of the counters being kept
would help.

Neil

Simon Marlow

Mar 3, 2010, 5:52:20 AM
to Neil Davies, haskell-cafe

No, but there could be - the information is collected by the RTS, it
just isn't made available via a public API right now.

Sönke Hahn

Mar 3, 2010, 11:51:59 AM
to haskel...@haskell.org
On Monday 01 March 2010 03:16:25 pm Sönke Hahn wrote:
>
> Yes there are. I am working on a game with Haskell using OpenGL rendering.
> I've done some frame time measurements lately and encountered single frames
> needing more than 100ms to be rendered. I am currently trying to gather
> information on what is going on in these 100ms and why. From what I
> understand, the GC is running very often and just some (very few) of its
> runs are very slow.

FYI: These high frame times were caused by a space leak. With the help of the
excellent hp2any-manager I found the leak and where to put the needed
strictness annotations.

Sönke

Henning Thielemann

Mar 3, 2010, 6:02:47 PM
to Jeremy Shaw, haskell-cafe
Jeremy Shaw wrote:

> My feeling right now is that the 'best' solution might be something
> similar to synthesis OS. I would create a DSL for the realtime DSP
> code. Using harpy, this DSL would be compiled to assembly with
> execution time guarantees (as much as can be predicted on modern
> hardware). For a 'modular synth' this might actually be better than C
> code, because the effects of certain choices could be 'compiled in'
> instead of having to check the setting via a compare. (that is what
> Synthesis OS does).
I'm currently doing this with LLVM - which is more portable and better
optimizing than a plain X86 assembler:
http://code.haskell.org/synthesizer/llvm/

Curt Sampson

Mar 4, 2010, 2:43:49 AM
to Thomas Schilling, Job Vranish, Neil Davies, haskell-cafe
On 2010-03-01 19:37 +0000 (Mon), Thomas Schilling wrote:

> A possible workaround would be to sprinkle lots of 'rnf's around your
> code....

As I learned rather to my chagrin on a large project, you generally
don't want to do that. I spent a couple of days writing instances
of NFData and loading up my application with rnfs and then watched
performance fall into a sinkhole.

I believe that the problem is that rnf traverses the entirety of a large
data structure even if it's already strict and doesn't need traversal.
My guess is that doing this frequently on data structures (such as Maps)
of less than tiny size was blowing out my cache.

I switched strategies to forcing a deep(ish) evaluation of only
newly constructed data instead. For example, after inserting a newly
constructed object into a Map, I would look it up and force evaluation
only of the result of that lookup. That solved my space leak problem and
made things chug along quite nicely.
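The insert-then-force-the-lookup pattern described here might be sketched as follows (hypothetical helper name; note that `seq` forces only to weak head normal form, so for nested values you would force the relevant fields of the looked-up result rather than deep-forcing the whole map):

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

-- Insert a freshly constructed value, then look it up and force only
-- that value, instead of rnf-ing the entire map.
insertForced :: Ord k => k -> v -> Map k v -> Map k v
insertForced k v m =
  let m' = Map.insert k v m
  in case Map.lookup k m' of
       Just v' -> v' `seq` m'
       Nothing -> m'
```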

Understanding the general techniques for this sort of thing and seeing
where you're likely to need to apply them isn't all that difficult, once
you understand the problem. (It's probably much easier if you don't have
to work it out all for yourself, as I did. Someone needs to write the
"how to manage laziness in Haskell" guide.) The difficult part of it is
that you've really got to stay on top of it, because if you don't, the
space leaks come back and you have to go find them again. It feels a
little like dealing with buffers and their lengths in C.

On 2010-03-01 16:06 -0500 (Mon), Job Vranish wrote:

> All of our toplevel inputs will be strict, and if we keep our
> frame-to-frame state strict, our variances in runtimes, given the same
> inputs, should be quite low modulo the GC.

This is exactly the approach I need to take for the trading system. I
basically have various (concurrent) loops that process input, update
state, and possibly generate output. The system runs for about six
hours, processing five million or so input messages with other loops
running anywhere from hundreds of thousands to millions of times. The
trick is to make sure that I never, ever start a new loop with an
unevaluated thunk referring to data needed only by the previous loop,
because otherwise I just grow and grow and grow....

Some tool to help with this would be wonderful. There's something for
y'all to think about.
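The loop invariant described above - never start an iteration holding an unevaluated thunk onto the previous iteration's data - is essentially a strict left fold; a minimal sketch (hypothetical names):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Strict accumulator loop: the bang pattern forces the state on every
-- iteration, so no thunk chain referring back to old inputs survives
-- into the next time around the loop.
processAll :: (s -> a -> s) -> s -> [a] -> s
processAll step = go
  where
    go !s []       = s
    go !s (x : xs) = go (step s x) xs
```

This forces the state only to WHNF; for a record-shaped state you would combine it with strict fields so the whole state is evaluated by construction.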

On 2010-03-01 22:01 +0000 (Mon), Thomas Schilling wrote:

> As Job and John have pointed out, though, laziness per se doesn't seem
> to be an issue, which is good to hear. Space leaks might, but there is
> no clear evidence that they are particularly harder to avoid than in
> strict languages.

As I mentioned above, overall I find them so. Any individual space
leak you're looking at is easy to fix, but the constant vigilance is
difficult.

cjs
--
Curt Sampson <c...@cynic.net> +81 90 7737 2974
http://www.starling-software.com
The power of accurate observation is commonly called cynicism
by those who have not got it. --George Bernard Shaw

Curt Sampson

Mar 4, 2010, 4:14:37 AM
to Peter Verswyvelen, Simon Cranshaw, John Van Enk, Jeremy Shaw, haskell-cafe
On 2010-03-02 11:33 +0900 (Tue), Simon Cranshaw wrote:

> I can confirm that without tweaking the RTS settings we were seeing
> over 100ms GC pauses.

Actually, we can't quite confirm that yet. We're seeing large amounts
of time go by in our main trading loop, but I'm still building the
profiling tools to see what exactly is going on there. However, GC is
high on our list of suspects, since twiddling the GC parameters can
improve things drastically.

On 2010-03-02 00:06 -0500 (Tue), John Van Enk wrote:

> Would a more predictable GC or a faster GC be better in your case? (Of
> course, both would be nice.)

Well, as on 2010-03-01 17:18 -0600 (Mon), Jeremy Shaw wrote:

> For audio apps, there is a callback that happens every few milliseconds. As
> often as 4ms. The callback needs to finish as quickly as possible to avoid
> buffer underruns.

I think we're in about the same position. Ideally we'd never have to
stop for GC, but that's obviously not practical; what will hurt pretty
badly, and we should be able to prevent, is us gathering up a bunch
of market data, making a huge pause for a big GC, and then generating
orders based on that now oldish market data. We'd be far better off
doing the GC first, and then looking at the state of the market and
doing our thing, because though the orders will still not get out as
quickly as they would without the GC, at least they'll be using more
recent data.

I tried invoking System.Mem.performGC at the start of every loop, but
that didn't help. Now that I know it was invoking a major GC, I can see
why. :-) But really, before I go much further with this:

On 2010-03-01 14:41 +0100 (Mon), Peter Verswyvelen wrote:

> Sounds like we need to come up with some benchmarking programs so we
> can measure the GC latency and soft-realtimeness...

Exactly. Right now I'm working from logs made by my own logging and
profiling system. These are timestamped, and they're "good enough"
to get a sense of what's going on, but incomplete. I also have the
information from the new event logging system, which is excellent in
terms of knowing exactly when things are starting and stopping, but
doesn't include information about my program, nor does it include any
sort of GC stats. Then we have the GC statistics we get with -S, which
don't have timestamps.

My plan is to bring all of this together. The first step was to fix
GHC.Exts.traceEvent so that we can use that to report information about
what the application is doing. In 6.12.1 it segfaults, but we have a fix
(see http://hackage.haskell.org/trac/ghc/ticket/3874) and it looks as if
it will go into 6.12.2, even. The next step is to start recording the
information generated by -S in the eventlog as well, so that not only
do we know when a GC started or stopped in relation to our application
code, but we know what generation it was, how big the heap was at the
time, how much was collected, and so on and so forth. Someone mentioned
that there were various other stats that were collected but not printed
by -S; we should probably throw those in too.
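As a present-day aside (the thread targets the 6.12-era GHC.Exts.traceEvent; in later GHCs the same mechanism is exposed through Debug.Trace), a user event marker for correlating application phases with GC events might look like:

```haskell
import Debug.Trace (traceEventIO)

-- Emit a named marker into the RTS eventlog (visible when the program
-- is run with +RTS -l), so application phases can be lined up against
-- the GC start/stop events. Without -l it is a cheap no-op.
markPhase :: String -> IO ()
markPhase name = traceEventIO ("phase: " ++ name)
```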

With all of that information it should be much easier to figure
out where and when GC behaviour is causing us pain in low-latency
applications.

However: now that Simon's spent a bunch of time experimenting with the
runtime's GC settings and found a set that's mitigated much of our
problem, other things are pushing their way up my priority list. Between
that and an upcoming holiday, I'm probably not going to get back to this
for a few weeks. But I'd be happy to discuss my ideas with anybody else
who's interested in similar things, even if just to know what would be
useful to others.

What do you guys think about setting up a separate mailing list for
this? I have to admit, I don't follow haskell-cafe much due to the high
volume of the list. (Thus my late presence in this thread.) I would be
willing to keep much closer track of a low-volume list that dealt with
only GC stuff.

I'd even be open to providing hosting for the list, using my little baby
mailing list manager written in Haskell (mhailist). It's primitive, but
it does handle subscribing, unsubscribing and forwarding of messages.

Peter Verswyvelen

Mar 4, 2010, 4:22:45 AM
to Curt Sampson, Simon Marlow, haskell-cafe
A fully concurrent GC running on multiple threads/cores might be
great, but I guess this is difficult to implement and introduces a lot
of overhead.

For simple video games, it might work to always do a full GC per
frame, but don't allow it to take more than T milliseconds. In a sense
the GC function should be able to be paused and resumed, but it could
run on the same thread as the game loop, so no synchronization
overhead would arise (in a sense many video games are already little
cooperative multitasking systems). The game loop should then decide
how to balance the time given to the GC and the memory being
collected. This would cause longer frame times and hence sometimes a
decrease in frame rate, but never long stalls.
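With what exists today, the balancing decision can only be approximated at the application level, since System.Mem.performGC is all-or-nothing. A hedged sketch (the 16ms-style budget and loop shape are hypothetical): collect only when the frame finished with slack left, deciding *whether* to collect this frame rather than for how long.

```haskell
import GHC.Clock (getMonotonicTime)
import System.Mem (performGC)

-- Run one frame, then trigger a collection only if the frame left
-- slack inside its time budget (budget in seconds, e.g. 0.016).
gcIfSlack :: Double -> IO () -> IO ()
gcIfSlack budget frame = do
  t0 <- getMonotonicTime
  frame
  t1 <- getMonotonicTime
  if t1 - t0 < budget then performGC else return ()
```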

Note that the issue also popped up for me in C many years ago. Using
Watcom C/C++ in the nineties, I found out that a call to the free
function could take a very long time. Also for games that could run
many hours or days (e.g. a multi player game server) the C heap could
get very fragmented, causing memory allocations and deallocations to
take ever more time, sometimes even fail. To make my games run
smoothly I had to implement my own memory manager. To make them run
for a very long time, I had to implement a memory defragmenter. So in
the end, this means a garbage collector :-) Of course this problem is
well known, I just wanted to re-state that for games the typical C
heap is not well suited either.

Don Stewart

Mar 4, 2010, 2:11:05 PM
to Curt Sampson, Neil Davies, haskell-cafe
cjs:

> On 2010-03-01 19:37 +0000 (Mon), Thomas Schilling wrote:
>
> > A possible workaround would be to sprinkle lots of 'rnf's around your
> > code....
>
> As I learned rather to my chagrin on a large project, you generally
> don't want to do that. I spent a couple of days writing instance
> of NFData and loading up my application with rnfs and then watched
> performance fall into a sinkhole.
>
> I believe that the problem is that rnf traverses the entirety of a large
> data structure even if it's already strict and doesn't need traversal.
> My guess is that doing this frequently on data structures (such as Maps)
> of less than tiny size was blowing out my cache.

And rnf will do the traversal whether it is needed or not.
Imo, it is better to ensure the structures you want are necessarily
strict by definition, so that only the minimum additional evaluation is
necessary.

'rnf' really is a hammer, but not everything is a nail.
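"Strict by definition" might look like this (hypothetical type, not from the thread): bang-annotated fields are forced at construction, so a finished value can never hide a thunk and no rnf traversal is ever needed for it.

```haskell
-- Strict fields: constructing a Tick forces both components, so the
-- structure is fully evaluated by construction.
data Tick = Tick
  { tickPrice :: !Double
  , tickSize  :: !Int
  } deriving (Eq, Show)
```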

-- Don

Jason Dagit

Mar 4, 2010, 2:22:39 PM
to Don Stewart, Neil Davies, haskell-cafe
On Thu, Mar 4, 2010 at 11:10 AM, Don Stewart <do...@galois.com> wrote:

> cjs:
> > On 2010-03-01 19:37 +0000 (Mon), Thomas Schilling wrote:
> >
> > > A possible workaround would be to sprinkle lots of 'rnf's around your
> > > code....
> >
> > As I learned rather to my chagrin on a large project, you generally
> > don't want to do that. I spent a couple of days writing instances
> > of NFData and loading up my application with rnfs and then watched
> > performance fall into a sinkhole.
> >
> > I believe that the problem is that rnf traverses the entirety of a large
> > data structure even if it's already strict and doesn't need traversal.
> > My guess is that doing this frequently on data structures (such as Maps)
> > of less than tiny size was blowing out my cache.
>
> And rnf will do the traversal whether it is needed or not.
> Imo, it is better to ensure the structures you want are necessarily
> strict by definition, so that only the minimum additional evaluation is
> necessary.
>

Isn't the downside of strict structures the implicit nature of the
'strictification'? You lose the fine-grained control of when particular
values should be strict.

Jason

wren ng thornton

Mar 4, 2010, 11:56:00 PM
to haskell-cafe
Simon Marlow wrote:
> So it would be pretty easy to provide something like
>
> disableMajorGC, enableMajorGC :: IO ()
>
> Of course leaving it disabled too long could be bad, but that's your
> responsibility.

It seems like it'd be preferable to have an interface like:

withMajorGCDisabled :: IO() -> IO()

or (for some definition of K):

withMajorGCDisabled :: (K -> IO()) -> IO()

in order to ensure that it always gets turned back on eventually. Of
course, the latter can be created from the former pair. It's just that
the former reminds me a bit much of explicit memory management and how
difficult it is to balance the free()s...
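Given the (still hypothetical) primitives from earlier in the thread, the bracketed version is a one-liner with bracket_, which guarantees major GC is re-enabled even if the action throws. Placeholder bodies below, since the primitives do not exist in GHC's API:

```haskell
import Control.Exception (bracket_)

-- Placeholder bodies: disableMajorGC/enableMajorGC were only proposed
-- in this thread and are not part of GHC's actual API.
disableMajorGC, enableMajorGC :: IO ()
disableMajorGC = return ()
enableMajorGC  = return ()

-- The bracketed interface: major GC is re-enabled on exit, even if
-- the enclosed action throws an exception.
withMajorGCDisabled :: IO a -> IO a
withMajorGCDisabled = bracket_ disableMajorGC enableMajorGC
```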

--
Live well,
~wren

Henning Thielemann

Mar 5, 2010, 4:51:30 AM
to Curt Sampson, haskell-cafe
Curt Sampson wrote:

> Understanding the general techniques for this sort of thing and seeing
> where you're likely to need to apply them isn't all that difficult, once
> you understand the problem. (It's probably much easier if you don't have
> to work it out all for yourself, as I did. Someone needs to write the
>> "how to manage laziness in Haskell" guide.)

My attempt in this direction:
http://www.haskell.org/haskellwiki/Laziness_is_not_always_good
http://www.haskell.org/haskellwiki/Maintaining_laziness

Simon Marlow

Mar 5, 2010, 7:38:57 AM
to Curt Sampson, haskell-cafe
On 04/03/2010 09:14, Curt Sampson wrote:

> However: now that Simon's spent a bunch of time experimenting with the
> runtime's GC settings and found a set that's mitigated much of our
> problem, other things are pushing their way up my priority list. Between
> that and an upcoming holiday, I'm probably not going to get back to this
> for a few weeks. But I'd be happy to discuss my ideas with anybody else
> who's interested in similar things, even if just to know what would be
> useful to others.

What settings are you using, out of interest?

> What do you guys think about setting up a separate mailing list for
> this? I have to admit, I don't follow haskell-cafe much due to the high
> volume of the list. (Thus my late presence in this thread.) I would be
> willing to keep much closer track of a low-volume list that dealt with
> only GC stuff.

Please feel free to discuss on glasgow-haskell-users@ which is lower
volume. I don't particularly like setting up new mailing lists unless
there's likely to be an ongoing need - we have quite a few defunct lists
on haskell.org now.

> I'd even be open to providing hosting for the list, using my little baby
> mailing list manager written in Haskell (mhailist). It's primitive, but
> it does handle subscribing, unsubscribing and forwarding of messages.

Sure, that would be fine too.

Cheers,
Simon

Simon Marlow

Mar 5, 2010, 7:42:35 AM
to wren ng thornton, haskell-cafe
On 05/03/2010 05:03, wren ng thornton wrote:
> Simon Marlow wrote:
>> So it would be pretty easy to provide something like
>>
>> disableMajorGC, enableMajorGC :: IO ()
>>
>> Of course leaving it disabled too long could be bad, but that's your
>> responsibility.
>
> It seems like it'd be preferable to have an interface like:
>
> withMajorGCDisabled :: IO() -> IO()
>
> or (for some definition of K):
>
> withMajorGCDisabled :: (K -> IO()) -> IO()

Sure, my intention was that you'd build this with the primitives.

> in order to ensure that it always gets turned back on eventually. Of
> course, the latter can be created from the former pair. It's just that
> the former reminds me a bit much of explicit memory management and how
> difficult it is to balance the free()s...

quite!

Cheers,
Simon

Simon Cranshaw

Mar 6, 2010, 1:57:10 AM
to Simon Marlow, haskell-cafe
For settings we are using -N7 -A8m -qg.

I don't know if they are really the optimal values but I haven't found a
significant improvement on these yet. I tried -qb but that was slow. I
tried larger values of A but that didn't seem to make a big difference. Also
-N6 didn't make much difference. Specifying H values didn't seem to make
much difference. I have to admit I don't fully understand the implications
of the values and was just experimenting to see what worked best. Please
let me know if there's anything you suggest I should try. That said, we
have already got it to a speed where it's working, so I'm OK with what
we have now.

Simon Marlow

Mar 6, 2010, 7:43:00 AM
to Simon Cranshaw, haskell-cafe
On 06/03/10 06:56, Simon Cranshaw wrote:
> For settings we are using -N7 -A8m -qg.

I'm surprised if turning off parallel GC improves things, unless you
really aren't using all the cores (ThreadScope will tell you that).

Do these flags give you an improvement in throughput, or just pause times?

> I don't know if they are really the optimal values but I haven't found a
> significant improvement on these yet. I tried -qb but that was slow.

Interesting, I often find that -qb improves things.

> I
> tried larger values of A but that didn't seem to make a big difference.

-A8m is close to the size of your L2 caches, right? That will certainly
be better than the default of -A512k.

> Also -N6 didn't make much difference. Specifying H values didn't seem
> to make much difference.

-H is certainly a mixed bag when it comes to parallel programs.

> I have to admit I don't fully understand the
> implications of the values and was just experimenting to see what worked
> best.

So the heap size is trading off locality (cache hits) against GC time.
The larger the heap, the fewer GCs you do, but the worse the locality.
Usually I find keeping the nursery size (-A) close to the L2 cache size
works best, although sometimes making it really big can be even better.

-qg disables parallel GC completely. This is usually terrible for
locality, because every GC will move all the recently allocated data
from each CPU's L2 cache into the cache of the CPU doing GC, where it
will have to be fetched out again after GC.

-qb disables load-balancing in the parallel GC, which improves locality
at the expense of parallelism, usually I find it is an improvement in
parallel programs.

Curt Sampson

Mar 9, 2010, 12:04:35 AM
to Simon Marlow, haskell-cafe
On 2010-03-06 12:42 +0000 (Sat), Simon Marlow wrote:

> Usually I find keeping the nursery size (-A) close to the L2 cache size
> works best, although sometimes making it really big can be even better.

Interesting to know. I got the impression that I was being encouraged to
keep -A closer to the L1 cache size, myself.

> -qg disables parallel GC completely. This is usually terrible for
> locality, because every GC will move all the recently allocated data
> from each CPU's L2 cache into the cache of the CPU doing GC, where it
> will have to be fetched out again after GC.

I've since explained to Cranshaw (we are getting to have *way* too many
'Simon's around here) about the issues with our different machines; some
of this depends on the host on which we're doing the testing.

* Our Core i7 hosts share 8 MB of L3 cache amongst four cores with
two threads each. Thus, no locality penalties here.

* Our Xeon E5420 host has two 4-core CPUs, and each pair of cores
shares a 6 MB L2 cache. Thus there's a pretty good chance that
something you need is in someone else's cache. I don't know if
there's any difference between moving stuff between two caches on
the same CPU and two caches on different CPUs.

* Our Xeon E5520 host has two 4-core CPUs, each core of which has
two threads. Each CPU (4 cores) shares an 8 MB L3 cache. Thus,
presumably, less locality penalty than the E5420 but more than an
i7. As a side note, I also see slightly less memory bandwidth on
this system (for both caches and main memory) than I do on an i7.

This gets complex pretty fast. And don't ask me about Intel's new style
of having L1 and L3 or L2 and L3 caches rather than L1 and L2 caches.

> -qb disables load-balancing in the parallel GC, which improves locality
> at the expense of parallelism, usually I find it is an improvement in
> parallel programs.

I'd think so too. Figuring out what went on here is going to have to
wait until I get more detailed GC information in the eventlog.

Followups to glasgow-ha...@haskell.org.

cjs

Curt Sampson

Mar 9, 2010, 12:36:36 AM
to Henning Thielemann, haskell-cafe
On 2010-03-05 10:50 +0100 (Fri), Henning Thielemann wrote:

> Curt Sampson schrieb:
>> Understanding the general techniques for this sort of thing and seeing
>> where you're likely to need to apply them isn't all that difficult, once
>> you understand the problem. (It's probably much easier if you don't have
>> to work it out all for yourself, as I did. Someone needs to write the
>> "how to manage laziness in Haskell" guide.)
>
> My attempt in this direction:
> http://www.haskell.org/haskellwiki/Laziness_is_not_always_good
> http://www.haskell.org/haskellwiki/Maintaining_laziness

Unfortunately, neither of these addresses the problem of the day-to-day
programmer: "what are the typical ways I introduce space leaks, and how
do I stop doing that?"

cjs
