While benchmarking my msgpack package against BSON, Gob and JSON for some larger inputs, I started to notice that seemingly innocuous changes in code increased or decreased benchmark time dramatically (20%-100%). I might add an innocuous line, and the benchmark time increases drastically, sometimes doubling. E.g. switching from

    rvi := rv.Interface()
    bs = rvi.(byte)

to

    bs = rv.Interface().(byte)

Or I might add some code within a switch, and the benchmark time doubles. And then I move the exact code into its own function and the benchmark time comes back down. Or I might sometimes add a do-nothing line at a point in a function, e.g.

    xxx := time.Parse
    _ = xxx

and the benchmark time reduces. Or I might do a type assertion first as a fast path before falling back to reflection, and the benchmark time doubles. And when I remove that fast path (which caught 80% of the cases), the purely reflective code works significantly faster.

Gustavo gave me a hint that it had to do with split stacks. I did further research on it, looking at runtime/stack.h and Ian's paper on implementing it for GCC, and understood it further.
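The "move the code into its own function" effect described above can be sketched roughly like this (all names are hypothetical, for illustration only): a large local in the hot function enlarges its frame for every call, while moving the cold path into its own function keeps the hot frame small, so the stack-split check is less likely to trip.

```go
package main

import "fmt"

// decodeInline keeps a large scratch buffer in the hot function's
// frame, inflating it on every call even though the slow path is rare.
func decodeInline(b byte) int {
	if b == 0xff { // rare slow path
		var scratch [4096]byte // enlarges this frame unconditionally
		scratch[0] = b
		return int(scratch[0])
	}
	return int(b) // hot path
}

// decodeSplit moves the cold path into its own function, so the big
// frame is only allocated when the slow path is actually taken.
func decodeSplit(b byte) int {
	if b == 0xff {
		return slowPath(b)
	}
	return int(b)
}

//go:noinline
func slowPath(b byte) int {
	var scratch [4096]byte
	scratch[0] = b
	return int(scratch[0])
}

func main() {
	fmt.Println(decodeInline(7), decodeSplit(7)) // 7 7
}
```

Both variants compute the same result; with split stacks, only their frame sizes (and hence where a split can occur) differ.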
This is part of the problem with benchmarks.
The context of your code - how much stack is left when calling into
the library - matters as far as whether a stack split occurs.
The benchmark is only testing one possible context.
When you change the stack frame, to make it either bigger
or smaller, that also affects whether a stack split occurs,
and where, which ends up changing the performance.
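One way to see that the caller's context matters is to run the same workload at different stack depths. This sketch (hypothetical names; it burns stack with a recursive pad before invoking the workload) shows how the identical function can be entered with very different amounts of stack remaining:

```go
package main

import "fmt"

// workload stands in for a library call whose cost may change
// depending on whether entering it triggers a stack split.
func workload() int {
	var buf [256]byte
	s := 0
	for _, b := range buf {
		s += int(b)
	}
	return s
}

// atDepth recurses `depth` times, each frame holding a local pad,
// before invoking f — so f runs with less of the current stack
// remaining the deeper we go.
func atDepth(depth int, f func() int) int {
	if depth == 0 {
		return f()
	}
	var pad [512]byte
	_ = pad
	return atDepth(depth-1, f)
}

func main() {
	// Same code, two different stack contexts; with segmented stacks
	// these can have measurably different cost.
	fmt.Println(atDepth(0, workload), atDepth(64, workload))
}
```

A benchmark normally exercises only one such context, which is why results shift when surrounding frames change.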
I wrote a longer explanation here:
With the current gc, it seems:
- tail calls are not optimized (I might be wrong or simplifying it; please correct me if so)
- stacks are not split at arbitrary points, only at predefined points (the checks in function prologues)
Perhaps gc could:
- support tail call optimization (for recursive functions that naturally take advantage of it)
- use some run-time heuristics to appropriately size the stacks
I think there could be some work done on choosing how big
a stack to allocate, which would perhaps relieve some of the
inner stack split checks.
There was some work at Berkeley a few years ago that looked
at doing analysis of the program's call graph to remove some
stack split checks and to choose stack sizes, and I've been
meaning to look into whether we can do something smart like
that in Go's split stacks. But ultimately if you end up triggering
a stack split in a tight loop you're not going to run as fast as
if you don't.
Tail call optimization is not mandated by the spec, is not
possible in functions with defer, and makes debugging harder.
I'm glad that we don't have that in the 6g compiler.
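The defer interaction can be made concrete (a minimal illustration, not from the thread): each recursive frame below has a pending deferred call that must run when that particular frame returns, so the frames cannot be collapsed into one as tail call optimization would require.

```go
package main

import "fmt"

// countdown recurses to zero; each frame defers work that runs only
// when that frame returns, so every frame must be kept alive — which
// is why a call that uses defer cannot be tail-call optimized.
func countdown(n int, order *[]int) {
	defer func() { *order = append(*order, n) }()
	if n > 0 {
		countdown(n-1, order)
	}
}

func main() {
	var order []int
	countdown(3, &order)
	// The innermost frame returns first, so its defer fires first.
	fmt.Println(order) // [0 1 2 3]
}
```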
I've thought about runtime heuristics for stack sizes, but we
don't have enough information at stack split time to move
the stack, so by the time you realize 'it would be nice if the
current frame were bigger' it's too late.
Isn't that only the case if you're actually making a tail call?
Most of the recursive calls in marshalling/unmarshalling code
that I've seen are not in tail position. Simple recursive
calls to the same function can often be easily optimised by
replacing them with a goto or a loop.
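The loop rewrite mentioned above can be sketched like this (a hypothetical linked-list example): the self-recursive call becomes a plain loop, removing the per-call frames — and any stack-split checks they would incur — entirely.

```go
package main

import "fmt"

type node struct {
	val  int
	next *node
}

// sumRecursive calls itself once per element; without tail call
// optimization, each element costs a stack frame (and possibly
// a stack split on long lists).
func sumRecursive(n *node, acc int) int {
	if n == nil {
		return acc
	}
	return sumRecursive(n.next, acc+n.val)
}

// sumLoop is the same computation with the recursion replaced by a
// loop — the manual version of the optimisation discussed above.
func sumLoop(n *node) int {
	acc := 0
	for ; n != nil; n = n.next {
		acc += n.val
	}
	return acc
}

func main() {
	list := &node{1, &node{2, &node{3, nil}}}
	fmt.Println(sumRecursive(list, 0), sumLoop(list)) // 6 6
}
```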