Recursive Functions, Split Stacks and Tail Call Optimization


Ugorji

Apr 17, 2012, 11:18:11 AM4/17/12
to golan...@googlegroups.com

I debated where to post this, but figure golang-devs is a better audience.


While benchmarking my msgpack package against BSON, Gob and JSON for some larger inputs, I started to notice that seemingly innocuous changes in code increased or decreased benchmark time dramatically (20%-100%).

I might add an innocuous line, and the benchmark time increases drastically, sometimes doubling.

E.g. switch from 
  rvi := rv.Interface()
  bs = rvi.([]byte)
TO
  bs = rv.Interface().([]byte)

Or I might add some code within a switch, and the benchmark time doubles. And then I move the exact code into its own function and the benchmark time comes back down.

Or I might sometimes add a do-nothing line at a point in a function, e.g. 
xxx := time.Parse
_ = xxx
and the benchmark time reduces.

Or I might do a type assertion first as a fast-path before falling back to reflection, and the benchmark time doubles. Then I remove that fast-path (which caught 80% of the cases), and the purely reflective code runs significantly faster.
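For concreteness, the fast-path pattern in question looks roughly like this. (This is an illustrative sketch, not code from the msgpack package; the helper name asBytes and the raw type are made up for the example.)

```go
package main

import (
	"fmt"
	"reflect"
)

// raw is a hypothetical named byte-slice type, to exercise the slow path.
type raw []byte

// asBytes shows the fast-path pattern: try a direct type assertion first,
// and fall back to reflection only when the assertion misses.
func asBytes(v interface{}) []byte {
	// Fast path: direct type assertion, no reflection.
	if bs, ok := v.([]byte); ok {
		return bs
	}
	// Slow path: reflection catches named types such as `type raw []byte`.
	rv := reflect.ValueOf(v)
	if rv.Kind() == reflect.Slice && rv.Type().Elem().Kind() == reflect.Uint8 {
		return rv.Bytes()
	}
	return nil
}

func main() {
	fmt.Println(string(asBytes([]byte("abc")))) // taken on the fast path
	fmt.Println(string(asBytes(raw("xyz"))))    // taken on the reflective path
}
```

The surprise described above is that adding or removing the fast-path branch changes the caller's frame size, which in turn can move a stack-split boundary.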

Gustavo gave me a hint that it had to do with split stacks. I did further research on it, looking at runtime/stack.h and Ian's paper on implementing it for GCC, and understood it further. 

With the current gc, it seems:
- Stacks are always split when full (not at a predefined point)
- Tail calls are not optimized
(I might be wrong or simplifying it. Please correct me if so.)

This presents an issue for recursive functions.

I would have expected that recursion incurs a linear memory penalty with split stacks. However, in practice, the penalty is very non-deterministic. This shows that the implementation of split stacks has a direct impact on the performance of function calls, notably recursive functions.

I do not understand the space well enough to know how this can be fixed, but I think a fix is needed.

For example, maybe some of these could help:
- Don't split the stack at arbitrary points, but only at predefined points
- Support tail call optimization (for recursive functions that naturally take advantage of it)
- Use some real-time runtime heuristics to appropriately size the stacks

minux

Apr 17, 2012, 11:30:03 AM4/17/12
to Ugorji, golan...@googlegroups.com
On Tue, Apr 17, 2012 at 11:18 PM, Ugorji <ugo...@gmail.com> wrote:

> While benchmarking my msgpack package against BSON, Gob and JSON for some larger inputs, I started to notice that seemingly innocuous changes in code increased or decreased benchmark time dramatically (20%-100%).
> [...]
> Gustavo gave me a hint that it had to do with split stacks. I did further research on it, looking at runtime/stack.h and Ian's paper on implementing it for GCC, and understood it further.
A quick way to find out whether it is caused by stack splits is this: compile your code with 6g -S and look at the lines containing "TEXT", like this:
0000 (test.go:6) TEXT    main+0(SB),$176-0
The first number after the "$" is the frame size of the function; the bigger it is, the sooner a stack split will happen if this function recurses.

Ugorji

Apr 17, 2012, 12:15:07 PM4/17/12
to golan...@googlegroups.com, Ugorji
Thanks minux. I tried what you said, but don't know how to interpret it. I've added some more context for what I'm seeing.

I just ran the benchmark, and time for Encode was 178979 ns/op, and the biggest ones from 6g -S were:
0048 (helper.go:69) TEXT    getStructFieldInfos+0(SB),$440-24
0668 (encode.go:97) TEXT    (*Encoder).encodeValue+0(SB),$440-32
2373 (encode.go:304) TEXT    (*Encoder).writeb+0(SB),$168-48

After inlining a short 7-line function, my benchmark time increased to 237058 ns/op. The biggest ones from 6g -S were:
0048 (helper.go:69) TEXT    getStructFieldInfos+0(SB),$440-24
0668 (encode.go:97) TEXT    (*Encoder).encodeValue+0(SB),$456-32
2406 (encode.go:311) TEXT    (*Encoder).writeb+0(SB),$168-48

This is the function I copied into encodeValue(...)

func (e *Encoder) encBool(b bool) {
	e.t = e.t[0:1]
	if b {
		e.t[0] = 0xc3
	} else {
		e.t[0] = 0xc2
	}
	e.write()
}

This is what the replacement looks like (snippet of encodeValue).
case reflect.Bool:
	//e.encBool(rv.Bool())
	e.t = e.t[0:1]
	if rv.Bool() {
		e.t[0] = 0xc3
	} else {
		e.t[0] = 0xc2
	}
	e.write()
case reflect.Float32:

BEFORE INLINING:
Benchmark__Msgpack__Encode   10000    178979 ns/op
AFTER INLINING:
Benchmark__Msgpack__Encode   10000    237058 ns/op

I basically inlined a function, and the benchmark time went up by roughly a third.

Russ Cox

Apr 18, 2012, 9:46:28 AM4/18/12
to Ugorji, golan...@googlegroups.com
On Tue, Apr 17, 2012 at 11:18, Ugorji <ugo...@gmail.com> wrote:
> While benchmarking my msgpack package against BSON, Gob and JSON for some
> larger inputs, I started to notice that seemingly innocuous changes in code
> increased or decreased benchmark time dramatically (20%-100%).

This is part of the problem with benchmarks.
The context of your code - how much stack is left when calling into
the library - matters as far as whether a stack split occurs.
The benchmark is only testing one possible context.
When you change the stack frame, to make it either bigger
or smaller, that also affects whether a stack split occurs,
and where, which ends up changing the performance.
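Russ's point about calling context can be sketched as follows. The names work and callAtDepth are made up for the illustration: work stands in for a library call, and callAtDepth burns a varying number of frames first, mimicking callers that enter the library with different amounts of stack segment remaining. On the 2012 segmented-stack runtime, the depth determined where (and whether) a split was hit; modern Go uses contiguous stacks, and the extra recursion itself costs time, so treat this purely as a sketch of "context varies", not a clean measurement.

```go
package main

import (
	"fmt"
	"time"
)

// work stands in for a library call whose frame may sit near a
// stack-segment boundary.
func work(n int) int {
	var buf [128]byte // a frame large enough to matter
	buf[0] = byte(n)
	return int(buf[0]) + 1
}

// callAtDepth burns `depth` extra frames before invoking work, so the
// same function is entered with different amounts of stack already used.
func callAtDepth(depth, n int) int {
	if depth == 0 {
		return work(n)
	}
	return callAtDepth(depth-1, n)
}

func main() {
	for _, depth := range []int{0, 200} {
		start := time.Now()
		s := 0
		for i := 0; i < 100000; i++ {
			s += callAtDepth(depth, i&0x7f)
		}
		fmt.Printf("depth=%d sum=%d elapsed=%v\n", depth, s, time.Since(start))
	}
}
```

A benchmark fixes one such depth, which is why it only tests one possible context.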

I wrote a longer explanation here:
https://groups.google.com/d/msg/golang-dev/vxRuA-gbwdw/xiaDzM586VMJ

Russ

Christoph Hack

Apr 18, 2012, 10:03:46 AM4/18/12
to golan...@googlegroups.com
On Tuesday, April 17, 2012 5:18:11 PM UTC+2, Ugorji wrote:
> With the current gc, it seems:
> - Tail Calls are not optimized
> (I might be wrong or simplifying it. Please correct me if so.)

In previous discussions, most people were against automatic tail call optimization, because it makes it hard to debug programs later. But I am wondering why there is still no explicit "become" statement. It has been suggested several times and the feedback was always rather positive. Was it just postponed because of the Go1 release?

-christoph

Ugorji

Apr 18, 2012, 10:04:36 AM4/18/12
to golan...@googlegroups.com, Ugorji, r...@golang.org
Thanks Russ.

I read your explanation in the prior thread and it makes sense. There's no predictable short-term solution (aside from making the initial stacks extremely large, which is not feasible for "cheap" goroutines).

In my initial posting, I mentioned some ideas which might help alleviate this.
- Don't split the stack at arbitrary points, but only at predefined points
- Support tail call optimization (for recursive functions that naturally take advantage of it)
- Use some real-time runtime heuristics to appropriately size the stacks
Do you have any thoughts on these? You alluded to using runtime algorithms to possibly adjust stack sizes (i.e. #3 above). Is there an issue filed tracking this that I can watch? Also, I know at least #2 above (tail call optimization) has been discussed, but the main objection raised was that you lose debugging information. Is this option on the table at all? And what about #1 (figuring out "smart" points to split the stack)?

Thanks.

Russ Cox

Apr 18, 2012, 11:13:36 AM4/18/12
to Ugorji, golan...@googlegroups.com
>> - Don't split the stack at arbitrary points, but only at predefined points
>> - Support tail call optimization (for recursive functions that naturally
>> take advantage of it)
>> - Use some real-time runtime heuristics to appropriately size the stacks
>
> Do you have any thoughts on these?

I think there could be some work done on choosing how big
a stack to allocate, which would perhaps relieve some of the
inner stack split checks.

There was some work at Berkeley a few years ago [1] that looked
at doing analysis of the program's call graph to remove some
stack split checks and to choose stack sizes, and I've been
meaning to look into whether we can do something smart like
that in Go's split stacks. But ultimately if you end up triggering
a stack split in a tight loop you're not going to run as fast as
if you don't.

Tail call optimization is not mandated by the spec, is not
possible in functions with defer, and makes debugging harder.
I'm glad that we don't have that in the 6g compiler.

I've thought about runtime heuristics for stack sizes, but we
don't have enough information at stack split time to move
the stack, so by the time you realize 'it would be nice if the
current frame were bigger' it's too late.

Russ

[1] http://capriccio.cs.berkeley.edu/pubs/capriccio-sosp-2003.pdf

Ugorji

Apr 18, 2012, 11:35:07 AM4/18/12
to golan...@googlegroups.com, Ugorji, r...@golang.org
Thanks Russ.

The only reason I brought up tail-call optimization is that it could help alleviate the penalty of stack splits in recursive functions. It's just one way to do it. I agree that it's a better plan to tackle the other approaches (i.e. static/runtime analysis to remove split checks and choose stack sizes).

Thanks much.

BTW, if I squint hard enough, I think I see something about gccgo having TCO.

roger peppe

Apr 18, 2012, 12:29:38 PM4/18/12
to Ugorji, golan...@googlegroups.com, r...@golang.org
On 18 April 2012 16:35, Ugorji <ugo...@gmail.com> wrote:
> Thanks Russ.
>
> The only reason I brought up tail-call optimization is because it could help
> alleviate the penalty of stack splits in recursive functions.

isn't that only the case if you're actually making a tail call?
most of the recursive calls in marshalling/unmarshalling code
that i've seen are not in tail position. simple recursive
calls to the same function can often be easily optimised by
replacing them with a goto or a loop.
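roger's manual transformation (replacing a call in tail position with a loop) might look like this. The node type and both functions are made up for illustration; they are not from the msgpack code.

```go
package main

import "fmt"

// node is a minimal singly linked list, just to give the recursion
// something to walk.
type node struct {
	val  int
	next *node
}

// sumRec has its recursive call in tail position, so it is the kind of
// function that TCO (if Go had it) could run in constant stack.
func sumRec(n *node, acc int) int {
	if n == nil {
		return acc
	}
	return sumRec(n.next, acc+n.val) // tail call
}

// sumLoop is the same function with the tail call rewritten by hand as
// a loop: same result, constant stack, no splits regardless of depth.
func sumLoop(n *node, acc int) int {
	for n != nil {
		acc += n.val
		n = n.next
	}
	return acc
}

func main() {
	list := &node{1, &node{2, &node{3, nil}}}
	fmt.Println(sumRec(list, 0), sumLoop(list, 0)) // both print 6
}
```

As roger notes, this rewrite only applies when the call really is in tail position; the mutually recursive encode/decode calls typical of marshalling code usually are not.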
