runtime: split stack overflow

Josh Bleecher Snyder

Sep 1, 2016, 8:53:14 PM
to golan...@googlegroups.com
I'm trying to figure out why
https://go-review.googlesource.com/c/26668/ is failing on the
darwin-amd64-10_10 builder. It's slow going because I'm debugging
entirely via trybots, and I was hoping someone here might have some
hints.

Here's a sample failure from
https://storage.googleapis.com/go-build-log/1ae6125b/darwin-amd64-10_10_156b5ee0.log:

runtime: newstack sp=0xb051be00 stack=[0xc42013d800, 0xc42013dfe0]
morebuf={pc:0x41752ac sp:0xb051be10 lr:0x0}
sched={pc:0x4044b50 sp:0xb051be08 lr:0x0 ctxt:0x0}
runtime: gp=0xc420066340, gp->status=0x2
runtime: split stack overflow: 0xb051be00 < 0xc42013d800
fatal error: runtime: split stack overflow

The triggering code is from misc/cgo/test/helpers.go:
https://github.com/golang/go/blob/master/misc/cgo/test/helpers.go. The
testPairs var in that file is newly statically initialized as of this
CL. The failure only happens on the one builder. Manual inspection of
the compiler output yields nothing--all looks correct, at least as far
as my human eyes can see.

Does anyone have any ideas or suggestions for how to debug from here?

Thanks,
Josh

Austin Clements

Sep 1, 2016, 9:24:29 PM
to Josh Bleecher Snyder, golan...@googlegroups.com
A few thoughts:

The split stack overflow happens when you're already in sigpanic. This suggests that the "split stack overflow" panic is actually masking some other failure, perhaps a failure that's really not supposed to happen, so this code path hasn't been exercised. You could print the signal information in sighandler just before the "c.set_rip(uint64(funcPC(sigpanic)))", then, if feasible, remove the _SigPanic flag from that signal in signal_darwin.go to get a panic directly from the signal (rather than trying to raise it to user code).

The SP it's complaining about is clearly bogus, but it's not *entirely* bogus. The traceback managed to unwind the runtime stack from around that SP, which means there's definitely memory there that contains something that looks like a stack (whether or not it's supposed to). OTOH, the morebuf.pc is smack in the middle of the runtime.newstack frame according to the traceback, which is... not good. My guess is that this SP is in a g0 stack allocated by the system, which is why it didn't come from the Go heap. You could print the bounds of the g0 stack to confirm this.
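
As a rough illustration of the comparison behind the error message, the sketch below checks the reported SP against the goroutine stack bounds from the log. The `stackBounds` type and its `contains` method are hypothetical names made up for this example, not the runtime's own code; the addresses are copied from the failure log, and treating the range as half-open is a simplification.

```go
package main

import "fmt"

// stackBounds is a hypothetical stand-in for a goroutine's stack range,
// treated here as the half-open interval [lo, hi).
type stackBounds struct {
	lo, hi uint64
}

// contains reports whether sp lies inside [lo, hi).
func (s stackBounds) contains(sp uint64) bool {
	return sp >= s.lo && sp < s.hi
}

func main() {
	// Values copied from the failure log.
	g := stackBounds{lo: 0xc42013d800, hi: 0xc42013dfe0}
	sp := uint64(0xb051be00)

	fmt.Printf("sp=%#x inside goroutine stack: %v\n", sp, g.contains(sp))
	if sp < g.lo {
		// Same comparison as the "split stack overflow" line in the log.
		fmt.Printf("sp is below the stack: %#x < %#x\n", sp, g.lo)
	}
}
```

The SP is far below the goroutine's stack rather than just past its edge, which fits the idea that it belongs to some other stack entirely.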


--
You received this message because you are subscribed to the Google Groups "golang-dev" group.

Brad Fitzpatrick

Sep 1, 2016, 10:15:17 PM
to Josh Bleecher Snyder, golan...@googlegroups.com
FWIW, I put in a ticket to increase our Mac VM capacity from 2 to 22 yesterday, so we'll have more gomote/ssh capacity soonish.

I'd love to provide a gomote ssh proxy command for interactive debugging, at least for the platforms supporting ssh (most).


Ian Lance Taylor

Sep 2, 2016, 12:37:29 AM
to Josh Bleecher Snyder, golan...@googlegroups.com
On Thu, Sep 1, 2016 at 5:52 PM, Josh Bleecher Snyder
<josh...@gmail.com> wrote:
Looks like the program is getting a signal and getting confused about
the stack. Try forcing -test.v on to see which test is running. Try
forcing the environment variable GOTRACEBACK=system since it looks
like some runtime frames are missing in the stack trace.

Ian

Josh Bleecher Snyder

Sep 11, 2016, 11:15:11 AM
to golan...@googlegroups.com, David Crawshaw, Michael Hudson-Doyle, Dmitry Vyukov, Ian Lance Taylor
Thanks, Austin and Ian. That was very helpful. I finally understand a
little more what's wrong, but I'd love another round of insight from
folks who grok cgo and the linker (and maybe tsan).

The crash is a fault from within __tsan_read. The problem appears to
be that the memory span mmap'd by __tsan_map_shadow is wrong. That
memory span is calculated to start at the lowest of
runtime.{noptrdata,data,noptrbss,bss} and end at the highest of
runtime.{enoptrdata,edata,enoptrbss,ebss}; see
https://github.com/golang/go/blob/master/src/runtime/race.go#L254.
However, there are cgo symbols that get dereferenced that lie outside
that range.

I can reproduce this symbol out of range problem (but not the crash)
on my darwin/amd64 laptop using tip:

$ cd misc/cgo/test
$ go test -o bad.test -c -race
$ go tool nm -sort=address bad.test | grep "runtime\.e\?\(bss\|data\|noptrbss\|noptrdata\)"
4301520 D runtime.noptrdata
430ffe0 D runtime.enoptrdata
430ffe0 D runtime.data
4318e70 D runtime.edata
431b0c0 B runtime.bss
4335eb0 B runtime.ebss
4ce8cc0 B runtime.noptrbss
4cef0a0 B runtime.enoptrbss
$ go tool nm -sort=address bad.test | tail -n 5
4cef2a4 D _is_windows
4cef2a8 D _base_symbol
4cef2ac D _hola
4cef2b0 D _SansTypeface
4cef2b4 D _issue8811Initialized

Note that _hola is outside of the runtime symbol bracketed range.
_hola is defined at
https://github.com/golang/go/blob/master/misc/cgo/test/issue1635.go#L20
and read from at
https://github.com/golang/go/blob/master/misc/cgo/test/issue1635.go#L32.
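
The range check described above can be sketched as follows. This is only an illustration using the addresses from the nm output; `span` and `dataSpan` are hypothetical helpers, not the code in race.go, and the sketch ignores any rounding of the span.

```go
package main

import "fmt"

// span is a half-open address range [start, end).
type span struct{ start, end uint64 }

// dataSpan returns the range from the lowest start symbol to the
// highest end symbol (a hypothetical helper for this sketch).
func dataSpan(starts, ends []uint64) span {
	s := span{start: ^uint64(0), end: 0}
	for _, a := range starts {
		if a < s.start {
			s.start = a
		}
	}
	for _, a := range ends {
		if a > s.end {
			s.end = a
		}
	}
	return s
}

func main() {
	// Addresses copied from the nm output above.
	// Order: noptrdata, data, bss, noptrbss.
	starts := []uint64{0x4301520, 0x430ffe0, 0x431b0c0, 0x4ce8cc0}
	// Order: enoptrdata, edata, ebss, enoptrbss.
	ends := []uint64{0x430ffe0, 0x4318e70, 0x4335eb0, 0x4cef0a0}

	s := dataSpan(starts, ends)
	hola := uint64(0x4cef2ac) // _hola
	inside := hola >= s.start && hola < s.end
	fmt.Printf("span=[%#x, %#x) _hola=%#x inside=%v\n", s.start, s.end, hola, inside)
}
```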

It's unclear to me how this ever worked reliably, which is a sign that
my understanding is probably wrong.

It could (does?) work unreliably. When calculating the tsan memory
span, the size gets rounded up at
https://github.com/golang/go/blob/master/src/runtime/race.go#L278. And
tsan reads themselves get masked down; see
https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#32-bit.
I'm still wading through the tsan sources to find the exact
calculation to confirm, but it appears that the difference between my
laptop and the broken builder might be a bit of (un)lucky alignment
and rounding.

Does this diagnosis look plausible?

Thanks,
Josh

Ian Lance Taylor

Sep 11, 2016, 1:48:26 PM
to Josh Bleecher Snyder, golan...@googlegroups.com, David Crawshaw, Michael Hudson-Doyle, Dmitry Vyukov
It's possible that this is due to a behavior change in the Darwin
linker. It is also possible that this has only ever worked by
coincidence on Darwin.

When doing a cgo link, the Go linker will create a C object file
containing all the Go code. It will pass it to the C linker along
with all the C objects generated from files created by cgo. On
GNU/Linux I see the Go BSS symbols that contain pointer values, then
runtime.ebss, then all the C BSS symbols including the race detector
symbols, then runtime.noptrbss, then all the Go BSS symbols that do
not contain pointer values. So this works fine, as the Go race
detector allocates space for everything between runtime.data and
runtime.enoptrbss, which happens to include all the C symbols. In the
object file passed to the C linker, the Go BSS symbols with pointer
values will be in a section named .bss and the Go BSS symbols without
pointer values will be in a section named .noptrbss. In general the C
BSS symbols will be in a section named .bss. A typical ELF linker
will group all the .bss sections together, and will put the .noptrbss
section after them.

Apparently that is not happening with your link. The Darwin ld man
page hints that it places all sections from each input file in order,
rather than grouping them by name as an ELF linker does. That would
be consistent with what you are seeing.
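
A toy model may make the two layout strategies easier to compare. It is a deliberately simplified sketch, not a description of how either linker is implemented: the `section` type, the file names `go.o` and `cgo.o`, and both layout functions are invented for illustration.

```go
package main

import "fmt"

// section is a hypothetical (input file, section name) pair.
type section struct{ file, name string }

// groupByName lays out sections grouped by section name, in the given
// name order, as the ELF behavior is described above.
func groupByName(in []section, order []string) []section {
	var out []section
	for _, name := range order {
		for _, s := range in {
			if s.name == name {
				out = append(out, s)
			}
		}
	}
	return out
}

// inputOrder keeps every section in the order its input file was given,
// which is the behavior the Darwin ld man page is said to hint at.
func inputOrder(in []section) []section {
	return append([]section(nil), in...)
}

func main() {
	in := []section{
		{"go.o", ".bss"},      // Go BSS symbols with pointers
		{"go.o", ".noptrbss"}, // Go BSS symbols without pointers
		{"cgo.o", ".bss"},     // C BSS symbols
	}
	fmt.Println("grouped by name:", groupByName(in, []string{".bss", ".noptrbss"}))
	fmt.Println("input order:    ", inputOrder(in))
}
```

In the grouped layout the C `.bss` lands before `.noptrbss`, so it sits inside the range that ends at runtime.enoptrbss. In the input-order layout it lands after `.noptrbss`, outside that range.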

That is also what I see when I use gomote to build the misc/cgo/test
test with -race. For my test binary, everything works fine.
runtime.enoptrbss is at 0x4d05160. The final symbol in the program,
_issue8811Initialized, is at 0x4d05374. Those are on the same page in
memory, the one from 0x4d05000 to 0x4d06000, so the mmap of the Go
symbols includes the C symbols.

Actually, though, the same is true of the values you show above.
runtime.enoptrbss is at 0x4cef0a0, on a page that extends to
0x4cf0000, which is past the _hola symbol at 0x4cef2ac, so I don't
know why your program is failing. Oh, I see, that program is not
failing. The one that is failing is using some patch that must be
shifting the addresses so that the C symbols are on a different page.
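
The page arithmetic above can be checked with a small sketch. It assumes 4 KiB pages and that the mapping extends to the next page boundary after runtime.enoptrbss, as the reasoning above does; `roundUpToPage` and `covered` are hypothetical helpers. The first two cases use addresses quoted in this thread; the third uses made-up addresses purely to show how a small shift could push a C symbol onto an unmapped page.

```go
package main

import "fmt"

// pageSize is an assumed 4 KiB page, matching the page boundaries quoted above.
const pageSize = 0x1000

// roundUpToPage rounds addr up to the next page boundary.
func roundUpToPage(addr uint64) uint64 {
	return (addr + pageSize - 1) &^ (pageSize - 1)
}

// covered reports whether sym lies below the page-rounded end address.
func covered(end, sym uint64) bool {
	return sym < roundUpToPage(end)
}

func main() {
	cases := []struct {
		name      string
		enoptrbss uint64
		lastSym   uint64
	}{
		{"gomote build", 0x4d05160, 0x4d05374},
		{"laptop build", 0x4cef0a0, 0x4cef2b4},
		// Made-up addresses, not taken from any log.
		{"hypothetical shift", 0x4ceffd0, 0x4cf01e4},
	}
	for _, c := range cases {
		fmt.Printf("%-18s page end=%#x last symbol=%#x covered=%v\n",
			c.name, roundUpToPage(c.enoptrbss), c.lastSym, covered(c.enoptrbss, c.lastSym))
	}
}
```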

Presumably the ideal fix would be for the race detector to somehow
create a shadow map for the C symbols. But I don't know how to do
that. I don't know how the program can discover the end of the BSS
section. The Darwin linker does not seem to create an _end symbol as
ELF linkers do.

Ian

Josh Bleecher Snyder

Sep 11, 2016, 3:34:51 PM
to Ian Lance Taylor, golan...@googlegroups.com, David Crawshaw, Michael Hudson-Doyle, Dmitry Vyukov
Got it. Thanks, Ian. With that hint, I now have a standalone
reproduction. I've filed issue 17065 for further discussion.

https://github.com/golang/go/issues/17065

-josh