Can you measure the effect on a more realistic program? E.g. an
important computation kernel made to trigger a stack split in the
loop, but with inlining enabled. It would be useful to see the
worst-case effect with inlining and real work.
It's not obvious that this is a win-win solution. It may be better
than what we have now, or it may waste memory, or it may have another
set of corner cases (e.g. a short-running goroutine that triggers a
stack split once).
I have 2 other solutions to consider.
The first is as simple as:
func runtime.RunWithStack(stackSize int, f func()) // executes f with at least stackSize bytes of stack segment
It is ugly. But it is similar to what we have for slice capacity, map
capacity, bufio buffer size, etc.
The second is to optimize what we have now. It's definitely not
particularly optimized yet.
I have a prototype of morestackfast16/lessstackfast16:
https://codereview.appspot.com/9187048/diff/2001/src/pkg/runtime/asm_amd64.s
It's just the current machinery re-implemented in assembly, without
unnecessary cases (reflect) and without the g0 stack switch.
You've reported:
segmented stacks:
no split: 1.25925147s
with split: 5.372118558s <- triggers hot split problem
contiguous stacks:
no split: 1.261624848s
with split: 1.262939769s
With this change it is:
no split: 1.252429208s
with split: 2.483917597s
Not as good as your numbers, but much better than what we have now.
That's why I am interested in a more realistic benchmark.
I have 2 other ideas on how to optimize it further:
1. The main slowdown comes from memory accesses and data dependencies.
We can use SP to find the current stack segment -- ((SP & ~0xfff) +
4024); this takes 0 memory loads instead of 2 loads for g->stackbase.
2. We may keep unused segments in a goroutine-local linked list (using
it as a cache); such segments can be reclaimed when the goroutine
blocks or during GC. This saves some work, and we do not need to
access m (m->stackcache) at all during "fast" switching. So "fast"
switching is enabled when we already have an appropriate next stack
segment in the goroutine.
I have a prototype of these ideas:
https://codereview.appspot.com/9442043/diff/5001/src/pkg/runtime/asm_amd64.s
but I was unable to make it significantly faster yet. There seem to be
some microarchitecture penalties for such code (I don't know what
exactly: either messing with SP, or messing with JMP, or just too many
loads and stores).
The results for this change are:
no split: 1.252429208s
with split: 2.28179379s