Cause of SIGBUS panic in gc?

Justin Israel

Nov 28, 2018, 10:18:22 PM
to golang-nuts
I've got a service that I have been testing quite a lot over the last few days. Only after I handed it off to a colleague for some testing was he able to produce a SIGBUS panic that I had not seen before:

go 1.11.2 linux/amd64

The service does set up its own SIGINT/SIGTERM handling via the typical signal.Notify approach. The nature of the program is that it listens on nats.io message queues and receives requests to run tasks as sub-processes. My tests have been running between 40 and 200 of these instances over the course of a few days. But this panic occurred on a completely different machine than those I had been testing on...
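
For context, the signal handling is just the standard pattern, roughly like this (a simplified sketch, not the actual service code; the names are illustrative):

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

    // ... set up nats.io subscriptions and start accepting task requests ...

    // Block until asked to shut down, then clean up.
    sig := <-sigCh
    log.Printf("received %v, shutting down", sig)
    // ... drain subscriptions and wait for running sub-processes ...
}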

goroutine 1121 [runnable (scan)]:
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x2 addr=0xfa2adc pc=0x451637]

runtime stack:
runtime.throw(0xcf7fe3, 0x2a)
        /vol/apps/go/1.11.2/src/runtime/panic.go:608 +0x72
runtime.sigpanic()
        /vol/apps/go/1.11.2/src/runtime/signal_unix.go:374 +0x2f2
runtime.gentraceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc0004baa80, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, ...)
        /vol/apps/go/1.11.2/src/runtime/traceback.go:190 +0x377
runtime.traceback1(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc0004baa80, 0x0)
        /vol/apps/go/1.11.2/src/runtime/traceback.go:728 +0xf3
runtime.traceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc0004baa80)
        /vol/apps/go/1.11.2/src/runtime/traceback.go:682 +0x52
runtime.tracebackothers(0xc00012e780)
        /vol/apps/go/1.11.2/src/runtime/traceback.go:947 +0x187
runtime.dopanic_m(0xc00012e780, 0x42dcc2, 0x7f83f6ffc808, 0x1)
        /vol/apps/go/1.11.2/src/runtime/panic.go:805 +0x2aa
runtime.fatalthrow.func1()
        /vol/apps/go/1.11.2/src/runtime/panic.go:663 +0x5f
runtime.fatalthrow()
        /vol/apps/go/1.11.2/src/runtime/panic.go:660 +0x57
runtime.throw(0xcf7fe3, 0x2a)
        /vol/apps/go/1.11.2/src/runtime/panic.go:608 +0x72
runtime.sigpanic()
        /vol/apps/go/1.11.2/src/runtime/signal_unix.go:374 +0x2f2
runtime.gentraceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc0004baa80, 0x0, 0x0, 0x7fffffff, 0x7f83f6ffcd00, 0x0, 0x0, ...)
        /vol/apps/go/1.11.2/src/runtime/traceback.go:190 +0x377
runtime.scanstack(0xc0004baa80, 0xc000031270)
        /vol/apps/go/1.11.2/src/runtime/mgcmark.go:786 +0x15a
runtime.scang(0xc0004baa80, 0xc000031270)
        /vol/apps/go/1.11.2/src/runtime/proc.go:947 +0x218
runtime.markroot.func1()
        /vol/apps/go/1.11.2/src/runtime/mgcmark.go:264 +0x6d
runtime.markroot(0xc000031270, 0xc000000047)
        /vol/apps/go/1.11.2/src/runtime/mgcmark.go:245 +0x309
runtime.gcDrain(0xc000031270, 0x6)
        /vol/apps/go/1.11.2/src/runtime/mgcmark.go:882 +0x117
runtime.gcBgMarkWorker.func2()
        /vol/apps/go/1.11.2/src/runtime/mgc.go:1858 +0x13f
runtime.systemstack(0x7f83f7ffeb90)
        /vol/apps/go/1.11.2/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /vol/apps/go/1.11.2/src/runtime/proc.go:1229 

Any insight would be much appreciated.

Justin

Ian Lance Taylor

Nov 28, 2018, 11:32:49 PM
to Justin Israel, golang-nuts
Is the problem repeatable?

It looks like it crashed while tracing back the stack during garbage
collection, but I don't know why since the panic was evidently able to
trace back the stack just fine.

Ian

Justin Israel

Nov 29, 2018, 12:21:20 AM
to Ian Lance Taylor, golang-nuts
Thanks for the reply. Unfortunately it was rare and never happened in my own testing of thousands of runs of this service. The colleague that saw this crash on one of his workstations was not able to repro it after attempting another run of the workflow. I wasn't really sure how to debug this particular crash since it was in the gc and I have seen a "panic during panic" before. Thought it might jump out at someone.



Justin Israel

Nov 29, 2018, 12:22:56 AM
to Ian Lance Taylor, golang-nuts
Oops. I meant that I *haven't* seen a "panic during panic" before :-) 




Justin Israel

Apr 30, 2019, 12:59:34 AM
to golang-nuts
This is a follow-up to the issue of seeing a SIGBUS in my application. While I still don't have a way to reproduce the problem, I have received reports from my users of another similar SIGBUS:

unexpected fault address 0x7fdf50
fatal error: fault
[signal 0xb code=0x2 addr=0x7fdf50 pc=0x7fdf50]

runtime.throw(0xad7840, 0x5)
        /s/go/1.12.1/src/runtime/panic.go:617 +0x72 fp=0xc000f75aa8 sp=0xc000f75a78 pc=0x444a5e 
runtime.sigpanic()
        /s/go/1.12.1/src/runtime/sigpanic_unix.go:387 +0x47e fp=0xc000f75ad8 sp=0xc000f75aa8 pc=0x444a5e

The faulting address corresponds to an anonymous inline function closure that was passed to a nats.io client topic subscription. If I am reading this correctly, it seems the address of the anonymous function is suddenly invalid?

i.e.

go func() {
    ...
    someChan := make(chan bool, 1)
    natsConn.Subscribe(topic, func(_ string, typ Type) {
        ...
        someChan <- true
    })
    ...
}()

Could I be triggering a bug based on this anonymous function closure in the goroutine? I can try defining things outside the goroutine, including the function. But honestly without this being a reliable crash I would be fumbling in the dark. 

Justin 

vaastav anand

Apr 30, 2019, 1:43:23 AM
to golang-nuts
I'd be very surprised if the anonymous goroutine is the reason behind a SIGBUS violation.
So, if I remember SIGBUS correctly, it means that you are issuing a read/write to a memory address which is not really addressable, or it is misaligned. I think the chances of the address being misaligned are very low... so it really has to be a non-existent address.
It can happen if you try to access memory outside the region mmapped into your application.
If your application has any kind of mmap or shared memory access, I would start there.
In any case your best bet is to somehow reproduce the bug consistently and then figure out which memory access is causing the fault.
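
Just to illustrate the mmap case: about the simplest way I know to provoke a SIGBUS on Linux is to map a file and then touch a page whose backing store has gone away. This is purely a toy example and has nothing to do with your service's actual code:

package main

import (
    "os"
    "syscall"
)

func main() {
    // Errors ignored for brevity; this is only a demonstration.
    f, _ := os.Create("/tmp/sigbus-demo")
    f.Truncate(4096)

    // Map one page of the file into memory.
    data, _ := syscall.Mmap(int(f.Fd()), 0, 4096,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)

    // Shrink the file so the mapped page no longer has backing store.
    f.Truncate(0)

    // Touching the now-unbacked page raises SIGBUS.
    data[0] = 1
}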

Justin Israel

Apr 30, 2019, 1:57:40 AM
to vaastav anand, golang-nuts
On Tue, Apr 30, 2019 at 5:43 PM vaastav anand <vaastav...@gmail.com> wrote:
I'd be very surprised if the anonymous goroutine is the reason behind a SIGBUS violation.
So, if I remember SIGBUS correctly, it means that you are issuing a read/write to a memory address which is not really addressable, or it is misaligned. I think the chances of the address being misaligned are very low... so it really has to be a non-existent address.
It can happen if you try to access memory outside the region mmapped into your application.
If your application has any kind of mmap or shared memory access, I would start there.
In any case your best bet is to somehow reproduce the bug consistently and then figure out which memory access is causing the fault.

My application isn't doing anything with mmap or shared memory, and my direct and indirect dependencies don't seem to be doing anything like that either. It's limited to pretty much the nats.io client, the gnatsd embedded server, and a Thrift RPC.

It seems so random that I doubt I could get a reproducible crash. So I can really only try testing this on Go 1.11 instead to see whether any of the GC work in 1.12 causes this.


vaastav anand

Apr 30, 2019, 2:09:09 AM
to golang-nuts
Ok, so in the 2nd piece of code you posted, is some request being pushed onto some OS queue? If so, is it possible that you may be maxing the queue out and then pushing something else into it, and that could cause a SIGBUS as well? This seems super far-fetched, but it is hard to debug without really knowing what the application is actually doing.

Justin Israel

Apr 30, 2019, 2:25:52 AM
to vaastav anand, golang-nuts
On Tue, Apr 30, 2019 at 6:09 PM vaastav anand <vaastav...@gmail.com> wrote:
Ok, so in the 2nd piece of code you posted, is some request being pushed onto some OS queue? If so, is it possible that you may be maxing the queue out and then pushing something else into it, and that could cause a SIGBUS as well? This seems super far-fetched, but it is hard to debug without really knowing what the application is actually doing.

I want to say that I really appreciate you taking the time to try and give me some possible ideas, even though this is a really vague problem. I had only hoped someone had encountered something similar. 

So that line in the SIGBUS crash is just trying to add a subscription to a message topic callback in the nats client connection:
https://godoc.org/github.com/nats-io/go-nats#Conn.Subscribe 
It's pretty high-level logic in my application.

One thing that stood out to me was that in the crash, the goroutine id number was 3538668. I had to double-check to confirm that the Go runtime just uses a plain increasing counter with no other significance. I guess it does indicate that the application turned over > 3 million goroutines by that point. I'm wondering if this is caused by something in the gnatsd embedded server (https://github.com/nats-io/gnatsd/tree/master/server), since most of the goroutines do come from that, with all the client handling going on. If we are talking about something that is managing very large queues, that would be the one doing so in this application.
 

vaastav anand

Apr 30, 2019, 2:33:11 AM
to golang-nuts
I have encountered a SIGBUS with Go before, but I was hacking inside the runtime and using shared memory with mmap.

Goroutines are assigned IDs incrementally, and each goroutine has at the bare minimum about 2.1KB of stack space in go1.11, down from 2.7KB in go1.10, if I recall correctly. So by that point you could easily have burnt through at least 7.5GB of memory. I am not sure what happens if you somehow exceed the amount of memory available. Seems like that is a test you could write: see whether launching more goroutines than can fit in memory actually causes a SIGBUS.
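
Something along these lines would be a starting point (an untested sketch; the -n count is just a knob you would push past whatever fits in memory on the box):

package main

import (
    "flag"
    "fmt"
    "runtime"
)

func main() {
    n := flag.Int("n", 4000000, "number of goroutines to park")
    flag.Parse()

    // Never closed, so every goroutine stays parked and keeps its stack alive.
    block := make(chan struct{})

    for i := 0; i < *n; i++ {
        go func() { <-block }()
    }

    fmt.Println("goroutines parked:", runtime.NumGoroutine())

    // Park main as well and watch how the runtime fails once memory runs out.
    <-block
}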

Justin Israel

Apr 30, 2019, 2:54:54 AM
to vaastav anand, golang-nuts
On Tue, Apr 30, 2019 at 6:33 PM vaastav anand <vaastav...@gmail.com> wrote:
I have encountered a SIGBUS with go before but I was hacking inside the runtime and using shared mem with mmap.

goroutines are assigned IDs incrementally and each goroutine at bare minimum has 2.1KB stack space in go1.11 down from 2.7KB in go1.10 if I recall correctly. So, at the very least at that point you could have easily burnt through at least 7.5GB of memory. I am not sure what could happen if you somehow exceed the amount of memory available. Seems like that is a test you could write and see if launching more goroutines than that could fit in the size of memory could actually cause a SIGBUS.

The stack trace only listed 282 goroutines, which seems about right considering the number of clients that are connected. It's about 3 goroutines per client connection, plus the other stuff in the server. I think it just indicates that I have turned over a lot of client connections over time.
 

vaastav anand

Apr 30, 2019, 3:28:29 AM
to golang-nuts
The stack trace only lists goroutines that are not dead, not system goroutines, and not the goroutine that is calling the traceback function (src/runtime/traceback.go).
Additionally, I don't think Go reclaims any memory from dead goroutines. The allgs slice in src/runtime/proc.go in the Go source code holds all the goroutines that have been created during the lifetime of the program, and it is all heap allocated. I don't know if the garbage collector reclaims any of these dead goroutines, but I don't think it does, because nothing ever seems to be removed from allgs.

vaastav anand

Apr 30, 2019, 5:15:26 AM
to golang-nuts
I was wrong about the GC not getting memory back from the goroutines. I think it does get that through the gcResetMarkState function...
So I don't think the # of goroutines is the issue... I'm sorry if I misled you.