Hello lvd (cc: golan...@googlegroups.com),
I'd like you to review this change to
https://go.googlecode.com/hg
Description:
gc: tweak and enable escape analysis
-s now means *disable* escape analysis.
Fix escape leaks for struct/slice/map literals.
Add ... tracking.
Rewrite new(T) and slice literal into stack allocation when safe.
Add annotations to reflect.
Reflect is too chummy with the compiler,
so changes like these affect it more than they should.
Please review this at http://codereview.appspot.com/4954043/
Affected files:
M src/cmd/gc/esc.c
M src/cmd/gc/gen.c
M src/cmd/gc/go.h
M src/cmd/gc/lex.c
M src/cmd/gc/sinit.c
M src/cmd/gc/typecheck.c
M src/cmd/gc/walk.c
M src/pkg/reflect/value.go
M test/escape2.go
$ gotest
rm -f _test/fmt.a
6g -o _gotest_.6 doc.go format.go print.go scan.go
rm -f _test/fmt.a
gopack grc _test/fmt.a _gotest_.6
mallocs per Sprintf(""): 0
mallocs per Sprintf("xxx"): 1
mallocs per Sprintf("%x"): 1
mallocs per Sprintf("%x %x"): 1
PASS
$
Tested on exp/go/eval too, I trust?
http://codereview.appspot.com/4954043/diff/7001/src/cmd/gc/esc.c
File src/cmd/gc/esc.c (right):
http://codereview.appspot.com/4954043/diff/7001/src/cmd/gc/esc.c#newcode266
src/cmd/gc/esc.c:266: } else {
I think these three cases can be simplified into one: the initialisation of
esc, curfn and loopdepth can always happen, and for slices and maps you
can set dst=thesink before the loop.
http://codereview.appspot.com/4954043/diff/7001/src/cmd/gc/go.h
File src/cmd/gc/go.h (left):
http://codereview.appspot.com/4954043/diff/7001/src/cmd/gc/go.h#oldcode294
src/cmd/gc/go.h:294: int escfloodgen; // increased for every flood to detect loops
What is it with this desire to make global variables have as many uses
as possible? I argued against this before.
> Tested on exp/go/eval too, I trust?
And if so, LGTM (I always forget to say this). LGTM. Let's have escape
analysis before the weekend :-)
lvd@dhcp-172-28-208-194:[~/Project/go/src/pkg/exp/eval]$ make test
gotest
rm -f _test/exp/eval.a
6g -o _gotest_.6 abort.go bridge.go compiler.go expr.go expr1.go
func.go scope.go stmt.go type.go typec.go value.go world.go
eval_test.go expr_test.go stmt_test.go
rm -f _test/exp/eval.a
gopack grc _test/exp/eval.a _gotest_.6
--- FAIL: eval.TestExpr (0.04 seconds)
exprTests[105]: Run ^u; = *eval.uintV(248) want *eval.uintV(4294967294)
exprTests[108]: Run 1+u; = *eval.uintV(248) want *eval.uintV(2)
exprTests[200]: Run u<<2; = *eval.uintV(248) want *eval.uintV(4)
exprTests[204]: Run u<<2.0; = *eval.uintV(248) want *eval.uintV(4)
exprTests[206]: Run u<<u; = *eval.uintV(248) want *eval.uintV(2)
exprTests[208]: Run u<<u; = *eval.uintV(248) want *eval.uintV(2)
unexpected fault address 0x80263498
throw: fault
[signal 0xb code=0x1 addr=0x80263498 pc=0xf84016704c]
So no.
http://codereview.appspot.com/4954043/diff/7001/test/escape2.go#newcode746
test/escape2.go:746: m := &Bar{ii: x} // ERROR "&struct literal escapes to heap"
When I run it, it says it doesn't, and I think that's how it should be.
It's a general mechanism for doing a
custom walk that doesn't revisit nodes.
Why make it available to only one module?
The global was already in go.h, precisely
because I had used it (or almost used it)
in the past too.
I found the exp/eval bug - I was missing else blocks.
To fix that I added the obvious line but also defined
what the structure of the tree is, so there is a reference.
http://codereview.appspot.com/4954043/diff2/7001:21001/src/cmd/gc/go.h
That turned up another bug: in esccall, fn && fn->type
was being used as the condition for detecting a function
for which we have a body. That was also catching variables
of func type, so I restricted the condition a bit (and added tests).
Also, for debugging, I found it very useful to print the
things that *don't* escape, since it is easier to look for
a spurious 'does not escape' print than to look for a
missing 'does escape' print. So I added that. Those prints
and the escape prints are printed via warnl so that the
compiler can sort them by line number. For large
packages it was very confusing to get them in essentially
random order. True debugging prints (-mm) are still prints.
You can see the effect of the extra prints in escape2.go.
Thus, changes in:
http://codereview.appspot.com/4954043/diff2/7001:21001/src/cmd/gc/esc.c
* add nelse
* restrict esccall condition
* change print to warnl, add 'does not escape' prints
I tried your suggestion about merging the three
literal cases, but they are all slightly different
and it seemed less clear what was going on
in each case.
Russ
http://codereview.appspot.com/4954043/diff/21001/src/cmd/gc/esc.c
File src/cmd/gc/esc.c (right):
http://codereview.appspot.com/4954043/diff/21001/src/cmd/gc/esc.c#newcode89
src/cmd/gc/esc.c:89: for(l=noesc; l; l=l->next)
Instead of a separate list, why not just for(l=xtop...) for(ll=l->n->dcl...)?
http://codereview.appspot.com/4954043/diff/21001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):
http://codereview.appspot.com/4954043/diff/21001/src/cmd/gc/go.h#newcode220
src/cmd/gc/go.h:220: NodeList* ninit;
Can we put this in the order the side effects should happen, i.e. the
order walk should walk?
--
Gustavo Niemeyer
http://niemeyer.net
Just as an empirical data point, the same benchmark with mgo that got
a speed boost from Dmitry's work on concurrency is losing a few
seconds with this change in place, going from ~16.1s to ~18.5s. The
benchmark consists of querying 70k documents out of a MongoDB database
and unmarshalling them, and these times are seen with GOMAXPROCS=4.
With GOMAXPROCS=1, the same benchmark takes ~46.5s without escape
analysis and ~52.6s with it enabled.
"Enabled" is this CL without changes, and "disabled" is this CL with
debug['s']++ in gc/lex.c.
Unsurprisingly, the cost seems to come from stack handling. Without
this CL, the old/newstack pair is called ~157k times, while with this
CL the pair is called ~25M times. It may be worth tweaking the stack
size numbers a bit.
Confirming: doubling StackMin and StackBig makes the old/newstack pair
be called ~11k times and puts the timing back in the ~16-second
range.
Playing games with stack size to help one program
just hurts others. It's not a rational strategy.
Russ
Clearly, but I hope you're speaking about yourself, since you are the
one playing games with the stack size; all I did was report the
outcome of your game with a real-world example submitted by a user of
mgo a while ago. Unlike the made-up examples, this one has suffered a
two-orders-of-magnitude penalty in the number of calls to
old/newstack, which is entirely natural given that you are increasing
the stack pressure across the board.
So yes, I guess what you say makes sense: playing games with the
stack size may hurt other programs. Now, I wouldn't go as far as
accusing you of being irrational, since that's unnecessarily harsh. I
would instead suggest investigating a bit more whether tweaking the
stack size would avoid the penalty caused by the additional pressure,
without significantly impacting other programs, before jumping to
conclusions.
It's not that simple. Your explanation makes it sound
like there is a nicer linear correlation: more things on the stack,
more stack splits, slower run time. But there are not 100x
more things on the stack, so there are not 100x more stack
splits.
What's happening here is that at certain stack depths,
any call from that depth is more expensive than a normal call,
because it starts a new stack, and the corresponding return
is also more expensive, because it has to jump back to
the original stack and free the old one. Compiler changes
that affect the size of stack frames change which function
calls in a particular program are affected by this: some
formerly expensive ones become cheap, and some
formerly cheap ones become expensive. If your program
was spending most of its time calling the expensive ones
before, it can get measurably faster. If it was spending most
of its time calling the cheap ones before, it can get measurably
slower.
Whether a program gets faster or slower is not a simple
function of 'stack pressure' except in the limiting cases
where you make the stacks so big that no call ever
needs to make a new one or so small that every call does.
In the normal cases it is just a question of whether the
stack depths that have split costs are also stack depths
that make a lot of function calls. You can change that
equality by making the stacks a bit larger but also by
making the stacks a bit smaller. We also had reports of
programs getting slower (no one ever reports when their
programs get faster) for Luuk's CL that shortened stack
frames by eliminating dead spaces. If you graphed the
number of calls made by a program as a function of
stack depth, you'd find it to be very spiky. There are hot spots.
If the hot spots line up with stack splits, then you're
in trouble. If not, you're fine.
These slowdowns are an unfortunate, unpredictable side
effect of the stack splitting, and changing the numbers to
make one program faster makes other programs slower.
There's no obvious win here, which is why I said that
tweaking stack sizes is not a rational strategy for fixing
the problem. (It is a great strategy for pinning a slowdown
on stack splitting, though.)
The right long term solution is algorithmic, so that after
a few hits at a particular boundary we can shorten or
lengthen the stack segment above it in order to make
that boundary not expensive anymore. However, being
able to do that implies being able to move stack frames
from one segment to another, which requires being able
to tell, at any function call, which words in a stack frame
are pointers and which are not, so that pointers to things
in the stack frame can be adjusted. It's a non-trivial
amount of work.
Russ
The noesc list contains expressions like &x and new(T),
not just declarations.
> http://codereview.appspot.com/4954043/diff/21001/src/cmd/gc/go.h#newcode220
> src/cmd/gc/go.h:220: NodeList* ninit;
> can we put this in the order the side effects should happen, i.e. the
> order walk should walk?
Except for ninit first, I don't know if there is one.
Walk doesn't just blindly traverse that list:
it does different things based on the kind of node.
Russ
gc: tweak and enable escape analysis
-s now means *disable* escape analysis.
Fix escape leaks for struct/slice/map literals.
Add ... tracking.
Rewrite new(T) and slice literal into stack allocation when safe.
Add annotations to reflect.
Reflect is too chummy with the compiler,
so changes like these affect it more than they should.
R=lvd, dave, gustavo
CC=golang-dev
http://codereview.appspot.com/4954043
> It's not that simple. Your explanation makes it sound
> like there is a nicer linear correlation: more things on the stack,
(...)
> was spending most of its time calling the expensive ones
> before, it can get measurably faster. If it was spending most
> of its time calling the cheap ones before, it can get measurably
> slower.
(...)
Understood the explanation, and agreed. As a detail, the overall
problem description makes it sound like it's merely a matter of
positioning, and seems to ignore the fact that escape analysis is
indeed increasing the frame size in some cases, which means a
shallower call depth triggers the segmentation logic. This is of
course more relevant in cases such as marshalling and unmarshalling,
which depend on recursing up and down more heavily. Is that a valid
statement?
> making the stacks a bit smaller. We also had reports of
> programs getting slower (no one ever reports when their
> programs get faster) for Luuk's CL that shortened stack
I did report a speedup recently, coincidentally with the same example.
> The right long term solution is algorithmic, so that after
> a few hits at a particular boundary we can shorten or
> lengthen the stack segment above it in order to make
> that boundary not expensive anymore. However, being
> able to do that implies being able to move stack frames
> from one segment to another, which requires being able
> to tell, at any function call, which words in a stack frame
> are pointers and which are not, so that pointers to things
> in the stack frame can be adjusted. It's a non-trivial
> amount of work.
Sounds tricky indeed. If the above point is valid, might it be
easier/more relevant to algorithmically determine the likelihood that
certain functions will recurse past the segment size, and add some
extra padding in those cases?
Yes, but only with enough caveats. For a heavily recursive
function, something like go/parser, it's zipping up and down
the stack across many segments like there's no tomorrow.
If it can only fit 0.9n frames into a segment where before it
could fit n frames, then it would see just a 1.1x slowdown
caused by stack splits.
To see the 100x that you are seeing, the most likely story is
that before you never did any stack splits, and now the
processing of leaves in your recursive function is typically
on another segment, so each leaf call is doing a stack split.
Before you might have needed 0.95 frames and now you
need 1.1 frames. Even so I am a little surprised that the
smaller load on the garbage collector does not make up
for the extra frame sizes.
If nothing else this is a good thing for profiling to record.
I created issue 2197.
Russ