Changing segmented stacks to contiguous stacks

Keith Randall

May 15, 2013, 4:27:45 PM
to golang-dev
I've been thinking about how to change the stack implementation for goroutines.  I've been poking around with an implementation that keeps stacks contiguous and copies them in order to grow them.  I've written up a design doc and I'm interested in what people think about the idea, both whether it is a good idea at all and any changes/improvements you might suggest.
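
As a rough, hypothetical illustration of the grow-by-copy idea (a toy model,
not the runtime code and not taken from the design doc): a contiguous region
is reallocated at a larger size and the old contents are copied over; the
real design additionally has to fix up any pointers that point into the old
stack.

// Toy model of grow-by-copy; all names here are made up.
package main

import "fmt"

type stack struct {
	mem []byte
	sp  int // bytes currently in use
}

func (s *stack) push(frame int) {
	for s.sp+frame > len(s.mem) {
		bigger := make([]byte, 2*len(s.mem)) // allocate a larger stack
		copy(bigger, s.mem[:s.sp])           // copy the old contents over
		s.mem = bigger                       // the old stack becomes garbage
	}
	s.sp += frame
}

func main() {
	s := &stack{mem: make([]byte, 1024)}
	for i := 0; i < 100; i++ {
		s.push(64) // 100 frames of 64 bytes each
	}
	fmt.Println(len(s.mem), s.sp) // 8192 6400: the stack was copied three times
}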

Brad Fitzpatrick

May 15, 2013, 4:31:14 PM
to Keith Randall, golang-dev
Doc needs to be made public.



On Wed, May 15, 2013 at 1:27 PM, Keith Randall <k...@google.com> wrote:
I've been thinking about how to change the stack implementation for goroutines.  I've been poking around with an implementation that keeps stacks contiguous and copies them in order to grow them.  I've written up a design doc and I'm interested in what people think about the idea, both whether it is a good idea at all and any changes/improvements you might suggest.


Keith Randall

May 15, 2013, 4:42:05 PM
to Brad Fitzpatrick, golang-dev

Brad Fitzpatrick

May 15, 2013, 4:44:52 PM
to Keith Randall, golang-dev
The latter URL works.  The first one says:

"You are signed in as brad...@golang.org, but you don't have permission to access this published document. You may need to sign in as a different user."

I shared successfully from Drive as my golang.org account twice the other day.  I tested in an incognito window.

Aram Hăvărneanu

May 15, 2013, 6:10:24 PM
to golan...@googlegroups.com
Copying the whole stack is costly, the cost is variable and
unpredictable, and it's hard to recover the space later. I think
it's better if we have smaller stack segments, perhaps 512 bytes or
so, and when we have to grow the stack we create a larger stack
segment and copy the latest stack segment into it, doing the kind
of fixups mentioned in your document. That means that a tight loop
that causes a stack split will take the hit only once; after that
it will return into the larger stack segment and won't cross the line
anymore.

This method:
    o) wastes less space than the current implementation.

    o) is faster than both the current implementation (by avoiding
    allocating stack segments in loops) and your proposal (fewer
    things need copying at a time).

    o) compared to your proposal, space is reclaimed automatically
    when a segment is returned.

    o) when you take a performance hit, that hit is constant and
    known.

Keith Randall

May 15, 2013, 6:18:56 PM
to Aram Hăvărneanu, golang-dev
On Wed, May 15, 2013 at 3:10 PM, Aram Hăvărneanu <ar...@mgk.ro> wrote:
> Copying the whole stack is costly, the cost is variable and
> unpredictable,

 I don't think it is costly or variable, at least no more so than standard containers (e.g. slice append).  As long as you're growing exponentially, it's at most a constant cost per byte of stack.
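
(Worked arithmetic, not from the doc: with doubling, growing a stack from 1K
to 1M copies 1K + 2K + ... + 512K = 1023K bytes in total across all the
growth steps, which is less than the final 1M, so the copying cost is bounded
by a constant per byte of final stack.)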

> and it's hard to recover the space later. I think

That I agree with.
 
> it's better if we have smaller stack segments, perhaps 512 bytes or
> so, and when we have to grow the stack we create a larger stack
> segment and copy the latest stack segment into it, doing the kind
> of fixups mentioned in your document. That means that a tight loop
> that causes a stack split will take the hit only once; after that
> it will return into the larger stack segment and won't cross the line
> anymore.


This confuses me - it sounds exactly like what I am proposing.  When does a split happen in this scheme?
 
> This method:
>     o) wastes less space than the current implementation.
>
>     o) is faster than both the current implementation (by avoiding
>     allocating stack segments in loops) and your proposal (fewer
>     things need copying at a time).
>
>     o) compared to your proposal, space is reclaimed automatically
>     when a segment is returned.
>
>     o) when you take a performance hit, that hit is constant and
>     known.

roger peppe

May 16, 2013, 6:21:37 AM
to Keith Randall, Aram Hăvărneanu, golang-dev
As an adjunct to this change, it would be great if the
compiler could do static analysis of stack sizes so
that goroutines were created with an appropriate
size. This probably affects goroutines with small
stacks more than large ones, but could be advantageous
in both cases.

It would be great if a simple multiplexer didn't need the
usual 4K stack.

go func() {
	for x := range in {
		out <- msg{id, x}
	}
}()

I think Russ has mentioned doing something like this,
but I don't know if it's a current plan.

Keith Randall

May 16, 2013, 11:39:55 AM
to roger peppe, Aram Hăvărneanu, golang-dev
Yes, that would be nice.  I don't think anyone is signed up to do it at the moment.

BTW, I'm thinking about making the starting stack smaller as part of this change, maybe 1K.

Russ Cox

May 16, 2013, 12:24:35 PM
to Keith Randall, roger peppe, Aram Hăvărneanu, golang-dev
I think being able to start stacks tiny will make precise analysis unnecessary.

Dmitry Vyukov

May 16, 2013, 1:59:44 PM
to Keith Randall, golang-dev
Can you measure the effect on a more realistic program? E.g. an
important computation kernel made to trigger a stack split in the loop,
but with inlining enabled. It would be useful to see the worst-case
effect, but with inlining and real work.

It's not obvious that this is a win-win solution. It may be better
than what we have now, or it may waste memory or have another set of
corner cases (e.g. a short-running goroutine that triggers a stack split once).

I have 2 other solutions to consider.
The first is as simple as:
func runtime.RunWithStack(stackSize int, f func()) // executes f with
at least stackSize bytes stack segment
It is ugly. But it is similar to what we have for slice capacity, map
capacity, bufio buffer size, etc.
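
Hypothetical usage, only to show the shape of the proposed API (RunWithStack
does not exist today; deepRecursion and input are made-up names):

go func() {
	// reserve one big segment up front instead of hitting
	// repeated stack splits inside the hot recursion
	runtime.RunWithStack(64<<10, func() {
		deepRecursion(input)
	})
}()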

The second is to optimize what we have now. It's definitely not
particularly optimized yet.
I have a prototype of morestackfast16/lessstackfast16:
https://codereview.appspot.com/9187048/diff/2001/src/pkg/runtime/asm_amd64.s
It's just current machinery re-implemented in assembly, w/o
unnecessary cases (reflect) and w/o g0 stack switch.

You've reported:
segmented stacks:
no split: 1.25925147s
with split: 5.372118558s <- triggers hot split problem

contiguous stacks:
no split: 1.261624848s
with split: 1.262939769s

With this change it is:
no split: 1.252429208s
with split: 2.483917597s

Not as good as your numbers, but much better than what we have now.
That's why I am interested in a more realistic benchmark.

I have 2 other ideas on how to optimize it further:
1. The main slowdown comes from memory accesses and data dependencies.
We can use SP to find the current stack segment -- ((SP & ~0xfff) + 4024);
this needs 0 memory loads instead of 2 loads for g->stackbase.
2. We may keep unused segments in the goroutine's linked list (use it as
a cache); such segments can be reclaimed when the goroutine blocks or
during GC. This saves some work, and we do not need to access m
(m->stackcache) at all during "fast" switching. So "fast" switching
kicks in when we already have an appropriate next stack segment in the
goroutine.
I have a prototype of these ideas:
https://codereview.appspot.com/9442043/diff/5001/src/pkg/runtime/asm_amd64.s
but was unable to make it significantly faster yet. There seem to be
some microarchitecture penalties for such code (I don't know what
exactly: either messing with SP, or messing with JMP, or just too many
loads and stores).
The results for this change are:
no split: 1.252429208s
with split: 2.28179379s

Russ Cox

May 16, 2013, 2:08:46 PM
to Dmitry Vyukov, Keith Randall, golang-dev
golang.org/issue/3787 might be worth reinvestigating with the copying stacks. The belief was always that it was a bad stack split moving around.

Russ

minux

May 16, 2013, 2:15:50 PM
to Dmitry Vyukov, Keith Randall, golang-dev
On Fri, May 17, 2013 at 1:59 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> I have 2 other solutions to consider.
> The first is as simple as:
> func runtime.RunWithStack(stackSize int, f func())  // executes f with
> at least stackSize bytes stack segment
> It is ugly. But it is similar to what we have for slice capacity, map
> capacity, bufio buffer size, etc.
i don't like this interface as it feels too low-level, and because it's
difficult for people to get a good estimate of a function's stack usage
(but for channel buffers, bufio buffer size and slice capacity, one can
usually estimate from the input data).

besides, introducing interfaces like this sounds like a bad excuse for
not optimizing the root cause.

Dmitry Vyukov

May 16, 2013, 2:18:00 PM
to minux, Keith Randall, golang-dev
On Thu, May 16, 2013 at 10:15 PM, minux <minu...@gmail.com> wrote:
>> I have 2 other solutions to consider.
>> The first is as simple as:
>> func runtime.RunWithStack(stackSize int, f func()) // executes f with
>> at least stackSize bytes stack segment
>> It is ugly. But it is similar to what we have for slice capacity, map
>> capacity, bufio buffer size, etc.
>
> i don't like this interface as it feels too low-level, and because it's
> difficult for people to get a good estimate of a function's stack usage
> (but for channel buffers, bufio buffer size and slice capacity, one can
> usually estimate from the input data)

Agree.

Russ Cox

May 16, 2013, 2:19:53 PM
to Dmitry Vyukov, Keith Randall, golang-dev
On Thu, May 16, 2013 at 1:59 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> The first is as simple as:
> func runtime.RunWithStack(stackSize int, f func())  // executes f with
> at least stackSize bytes stack segment
> It is ugly.

Not just to Dmitriy: please remember that performance is almost never a good enough reason to make things ugly. There's been too much of this going on recently. Find a better way.

Russ