Hardcoded/subjective parameters set unused memory half life to ~1mn
under steady conditions.
Thanks very much for doing this. Before I look at the code,
I'd like to talk some about the actual algorithm you are planning
to use, as per http://golang.org/doc/contribute.html#Design.
Can you summarize the algorithm in text (as opposed to code)?
Thanks.
Russ
Thanks,
Sebastien
1. The algo is guided by the instant heap_idle/heap_inuse values. I
think they can be quite misleading. What if the process routinely uses
1GB of memory, but at the very moment when the goroutine awakes it sees
heap_inuse=100MB (temporal load decrease + recent GC)? It will over-free
memory, something that we want to avoid at least in the first version. I
think the algo must be guided by max heap inuse during the period (or
min heap idle during period), it is more meaningful. For example, if the
process did not used more than 1GB during the whole period and we have
something on top of that, then we can free some of that excess.
2. I am not actually sure how important it is... but if it is simple to
implement and we agree that it can't hurt, it may be worth trying.
Even if there is no idle sys memory, there can be a significant amount
of non-yet-collected free memory (we just don't do GC since it become
unused). A straightforward solution is to trigger GC once per 10
seconds. However, I guess we don't want to do it. So the proposed
solution is as follows.
The decision is guided by 2 condition:
A = was there a normal GC during that 10 seconds?
B = does our release procedure triggered the last GC?
Then the algo is:
if A -> do nothing // the process seems to have everything under control
if B -> do nothing // we've already tried to do GC, and the process had
not done enough allocations since that to trigger GC
otherwise -> do GC
Humm... I am not because we don't actually know when the additional GC
worth doing (when the process is done with the memory).
http://codereview.appspot.com/5451057/diff/1002/src/pkg/runtime/mheap.c#newcode344
src/pkg/runtime/mheap.c:344: g->waitreason = "MHeap_Release pause";
It is user-visible, please provide an end user understandable
description.
http://codereview.appspot.com/5451057/diff/1002/src/pkg/runtime/mheap.c#newcode353
src/pkg/runtime/mheap.c:353: if (mstats.heap_inuse > prev_inuse) {
I think it is a way too restrictive (taking into account that the check
is executed once per 10 seconds).
http://codereview.appspot.com/5451057/diff/1002/src/pkg/runtime/mheap.c#newcode397
src/pkg/runtime/mheap.c:397: NextRound:
Perhaps a user would like to check how it works for him. What about
providing a one-line summary if GOGCTRACE>0? The summary may include the
values, how much we've scavenged or why we've scavenged nothing.
> 1. The algo is guided by the instant heap_idle/heap_inuse values. I
> think they can be quite misleading. What if the process routinely uses
> 1GB of memory, but at the very moment when the goroutine awakes it sees
> heap_inuse=100MB (temporal load decrease + recent GC)? It will over-free
> memory, something that we want to avoid at least in the first version. I
> think the algo must be guided by max heap inuse during the period (or
> min heap idle during period), it is more meaningful. For example, if the
> process did not used more than 1GB during the whole period and we have
> something on top of that, then we can free some of that excess.
Yes. If unlucky, more memory than what will later be needed can be
gave-back to the OS.
On the next round however, mheap_sys will have increased because of
the newly sys-allocated blocks, we'll then know that we've freed to
much memory and won't make the mistake twice (in a row). That scheme
also covers situations where mheap_sys just keep growing, because the
overall memory need is constantly increasing, without having to pay
attention to, or rely on, mheap_idle local variations (e.g init
stages).
A high water mark on mheap_inuse would indeed bear much more precise
information. We however don't have all the room we'd like for the
exact amount of memory to be released definition (constrained by
HeapAllocChunk and mainly by what's available in &h->large), so I'm
not sure adding too much sharpness here will ultimately improve the
end result.
A kind of saturating counter could also mitigate mheap_sys' condition
correctness.
> 2. I am not actually sure how important it is... but if it is simple to
> implement and we agree that it can't hurt, it may be worth trying.
> Even if there is no idle sys memory, there can be a significant amount
> of non-yet-collected free memory (we just don't do GC since it become
> unused). A straightforward solution is to trigger GC once per 10
> seconds. However, I guess we don't want to do it. So the proposed
> solution is as follows.
> The decision is guided by 2 condition:
> A = was there a normal GC during that 10 seconds?
> B = does our release procedure triggered the last GC?
> Then the algo is:
> if A -> do nothing // the process seems to have everything under control
> if B -> do nothing // we've already tried to do GC, and the process had
> not done enough allocations since that to trigger GC
> otherwise -> do GC
> Humm... I am not because we don't actually know when the additional GC
> worth doing (when the process is done with the memory).
I really think we should remain as loosely coupled as possible from
the GC and/or other components.
My initial purpose was just to make sure that unused memory will
eventually tend to ~zero, i.e toward something that is not significant
in comparison to what the process actually needs. In that regard,
waiting few more iterations for non-yet-collected memory to become
sys-releasable is not an issue to me. But that's just my intended
use-case.
Sebastien
Please take another look.
> src/pkg/runtime/mheap.c:328: runtime·MHeap_Release()
> I think we need a better/longer name. "Release" looks like a part of
> normal operation. Scavenger?
Done. I have no real preference here, I tried to emphasize the
particular point of giving back memory to the OS but the dynamic
aspect was clearly missing.
> src/pkg/runtime/mheap.c:349: if (runtime·MSpanList_IsEmpty(&h->large))
> Perhaps straight check (heap_idle < HeapAllocChunk) or some other
> constant. If there is a negligible amount of memory that we can
> potentially free, don't bother trying.
Taken into consideration in the prune loop (can't drop span <
HeapAllocChunk), so I would rather postpone this kind of fast path
optim and avoid condition redundancy whenever possible, for now. I
actually remove that whole statement. One less potential trigger to
setup imho.
> src/pkg/runtime/mheap.c:349: if (runtime·MSpanList_IsEmpty(&h->large)) {
> Drop excessive curly brackets here and below.
Fixed.
> src/pkg/runtime/mheap.c:357: target = cutRatio * (mstats.heap_idle >>
> PageShift); cut = 0;
> one statement per line please
Fixed.
> src/pkg/runtime/mheap.c:357: target = cutRatio * (mstats.heap_idle >>
> PageShift); cut = 0;
> I think target can be 0, then don't bother doing anything else.
If target is zero, we shouldn't find much candidates for deletion
anyway (a bit the same to me as the above fast path).
> src/pkg/runtime/mheap.c:368: for (s = &h->large; s->next != &h->large; s
> = s->next) {
> can't we just do something along the lines of
> s = h->large->prev
> ?
You're right. I definitely missed h->large's cyclicity. Thanks.
> src/pkg/runtime/mheap.c:398: prev_idle = mstats.heap_idle;
> identation
Fixed.
> src/pkg/runtime/mheap.c:344: g->waitreason = "MHeap_Release pause";
> It is user-visible, please provide an end user understandable
> description.
I did try;), it's now "scavenger asleep". No much better idea however.
> src/pkg/runtime/mheap.c:353: if (mstats.heap_inuse > prev_inuse) {
> I think it is a way too restrictive (taking into account that the check
> is executed once per 10 seconds).
Stupid mistake, I'm sorry. The test was meant to be on mheap_sys as a
(cheap) low water mark proxy for mheap_idle. The point was to try to
mitigate transient allocations (hence the comment).
I purposely refrained myself from injecting special purpose / alien
variables into the mstats structure (among others) as to keep the
initial footprint as small as possible. That's however clearly an
option, but I'm not sure about the real impact in the end.
> src/pkg/runtime/mheap.c:397: NextRound:
> Perhaps a user would like to check how it works for him. What about
> providing a one-line summary if GOGCTRACE>0? The summary may include the
> values, how much we've scavenged or why we've scavenged nothing.
Done (more than one line however).
Sebastien
http://codereview.appspot.com/5451057/diff/8/src/pkg/runtime/mheap.c#newcode420
src/pkg/runtime/mheap.c:420: k += 1;
k++
I like k += 1 myself , but k++ seems to be a convention in the codebase
I still would prefer mheap_sys_max, it's just a few additional lines of
code I think it will provide much better protection against freeing what
is actually actively used by a program (I think the current algo can
easily free used memory 2 times in a row).
I see that it works poorly for computation intensive application - the
scavenger just does not have a chance to run. But I think it's a part of
a bigger problem with non-preemptiveness, and it makes a little sense to
solve it here.
I've tested it on test/bench/binary-tree.go with -n=22. And I see a lot
(dozens) of consecutive record like:
scvg22: unused: 379 of 710 MB
scvg22: target: 18 MB+
scvg22: no candidate span
The program in a sort of settled state, but the scavenger fails to
release any of that 380MB during minutes. I think we need to do
something with that.
Also I see that that boosting logic is quite unstable:
scvg28: unused: 375 of 710 MB
scvg28: target: 75 MB+
scvg28: no candidate span
scvg29: unused: 367 of 710 MB
scvg29: target: 18 MB+
scvg29: no candidate span
Basically nothing is changed in the application, but target is 4x lower.
Not sure whether we want to do something with it, perhaps it is OK as
is.
And the last concern is about huge idle spans. Consider that all idle
memory is coalesced together into a huge idle span and then the
scavenger frees it. I think it makes sense to split spans if we are
going to exceed the target significantly (like it is done in
MHeap_AllocLocked).
> My initial purpose was just to make sure that unused memory will
> eventually tend to ~zero, i.e toward something that is not significant
> in comparison to what the process actually needs. In that regard,
> waiting few more iterations for non-yet-collected memory to become
> sys-releasable is not an issue to me. But that's just my intended
> use-case.
Makes sense.
I hope we can make this a little simpler, along these lines:
1. Add 'lastinuse int64' to each MSpan.
2. At the beginning of a garbage collection (not the end),
walk the span list and set m->lastinuse = now
for all the spans that are in use (not free),
where now is runtime.nanotime() from the beginning
of the garbage collection.
3. Add lastgc int64 somewhere, maybe in MHeap.
Set it to now at the beginning of the garbage collection.
Once a minute or so:
1. Run a garbage collection if there hasn't been one
in the last two minutes.
2. Walk the span list. Any span that was lastinuse
more than five minutes ago can be handed back to
the OS, but left in the list, by calling SysUnused.
Mark the span 'idle'. The next allocation from the span
or merging of the span with an adjacent span should
clear the idle flag.
Calling SysFree seems like asking for trouble.
SysUnused is just as effective and much easier.
The idea here is to wait long enough that even if we
did have to page the memory back in, it wouldn't be
a significant time cost compared to how long we held
it while idle. I get nervous when I see decisions being
made on the basis of heap size.
Does this seem reasonable, Sebastien and Dmitriy?
We will also have to update the deadlock check not to
think of the scavenger goroutine as something that
is contributing to the program execution.
Russ
Slightly more intrusive but I'm definitely fine with the idea.
What about tagging spans with an 'unusedsince' attribute (instead of
'lastinuse'), walk mheap's freelists at the end of garbage collections
and stamp spans iif (s->state == MSpanFree && s->usedsince == 0)? That
attribute would then be reseted when spans' state change from
MSpanFree to whatever (few occurrencies iirc).
It preserves the same logic but possibly provides with a more general
span aging information. It also does not require any gc/scavanger
dynamic coupling. Whilst still being an option, the need for forcing
garbage collections (under some cicumstances) from the scavenger
vanishes, as well as few possible sampling considerations.
> Calling SysFree seems like asking for trouble.
> SysUnused is just as effective and much easier.
If it can't work with SysFree I guess we have an issue somewhere, but
SysUnused has some nice advantages.
> The idea here is to wait long enough that even if we
> did have to page the memory back in, it wouldn't be
> a significant time cost compared to how long we held
> it while idle.
In that regard 5 mn should indeed be a large enough buffer. It suits
my own use-cases anyway.
> I get nervous when I see decisions being made on
> the basis of heap size.
Possibly the best argument;)
Thanks,
Sebastien
Sebastien
I believe that only resident memory counts for OOM.
Memory that is unused is the first to go (and then
doesn't count as resident) when the OOM handler kicks in.
> How will it be implemented on Windows?
Let's get non-Windows working first and then we can
reevaluate. If we have to move to forcible eviction
that's fine but I'd like to get the unused stuff working
first.
Russ
I'm missing free time right now but I hope to be able to get that done
in the next few days.
I'll favor the 'unusedsince' scheme, partly because of the concerned
span list being eventually smaller, but also because tagging on a
steady/initial state looks to me like being potentially less prone to
transition corner-case later embarrassments (e.g trimmed spans).
Yes, hinting for unused memory is advantageously gratified by the OOM
(/proc/pid/oom_score, highest score killed first).
Sebastien
Hello Russ, Dmitry,
I've uploaded a new duration based implementation.
Most of the nasty parts come from accounting and mixed (partly
released) spans. I've added a separated entry in mstat such that
heap_idle common understanding is not altered.
Moving to a SysUsed scheme is doable, operations around `npreleased'
can serve as markers for such a modification. Spans might however have
to be forced to be homegenous when coalescing since injecting an
additional free page list in the span structure would probably not
really be a good deal.
You (more especially) want to double-check scheduler's deadlock
detection. I can prove, at best, that it works in some circumstances
but that's a bit weak for a generalization.
Cheers,
Sebastien