Debugging "fatal: morestack on gsignal"

varun...@gmail.com

Sep 1, 2021, 5:14:52 AM
to golang-nuts
Hello,

Recently we have been seeing multiple occurrences of "fatal: morestack on gsignal" error in our system tests. The issue happens with golang 1.13.7 and golang 1.16.5 runtimes. The process restarts after this error and no stack trace is dumped. Our process uses CGO calls, I/O ops etc.

So far, the issue seems to happen only on CentOS7 with Linux kernel version 3.10.0-1127.19.1.el7.x86_64. The issue is not seen on 3.10.0-693.5.2.el7.x86_64.

At this point, we have no clue as to why this is happening, and the lack of a stack trace makes it difficult to debug. Any pointers on debugging the issue would be of great help.

Thanks,
Varun

varun...@gmail.com

Sep 9, 2021, 7:14:42 AM
to golang-nuts
Minor update:

a. The crash with "morestack on gsignal" happens to be independent of the kernel versions mentioned earlier
b. We implemented a signal handler to catch SIGSEGV from CGO. That is not helping either
c. From system audit logs, we can see the following message:

./audit/audit.log.2:22944:type=ANOM_ABEND msg=audit(1630376735.155:5486532): auid=4294967295 uid=996 gid=994 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=26422 comm="indexer" reason="memory violation" sig=5

Sig=5 is SIGTRAP. At this point, we suspect there is a SIGSEGV in the CGO layer and that it is being returned as SIGTRAP by the runtime.

I am still clueless as to why a process can crash with "morestack on gsignal". Any pointers to debug this further would be of great help.

Thanks,
Varun
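
The audit record quoted in the message above can be picked apart programmatically when many such records need to be triaged. The following is a minimal sketch, not something from the thread: the helper name `parseAuditFields` is hypothetical, the regular expression is a simplification of the audit log format, and the signal comparison assumes a Linux build, where signal 5 is SIGTRAP.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"syscall"
)

// fieldRE matches key=value pairs; a value is either a double-quoted
// string (which may contain spaces) or a run of non-space characters.
var fieldRE = regexp.MustCompile(`(\w+)=("[^"]*"|\S+)`)

// parseAuditFields (hypothetical helper) returns the key=value fields of
// one audit record, with surrounding double quotes removed from values.
func parseAuditFields(record string) map[string]string {
	fields := make(map[string]string)
	for _, m := range fieldRE.FindAllStringSubmatch(record, -1) {
		fields[m[1]] = strings.Trim(m[2], `"`)
	}
	return fields
}

func main() {
	// The record reported in the thread, without the grep file prefix.
	record := `type=ANOM_ABEND msg=audit(1630376735.155:5486532): auid=4294967295 uid=996 gid=994 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=26422 comm="indexer" reason="memory violation" sig=5`

	fields := parseAuditFields(record)
	num, err := strconv.Atoi(fields["sig"])
	if err != nil {
		fmt.Println("no usable sig field:", err)
		return
	}
	sig := syscall.Signal(num)
	fmt.Printf("pid=%s comm=%s reason=%q sig=%d isSIGTRAP=%t\n",
		fields["pid"], fields["comm"], fields["reason"], num, sig == syscall.SIGTRAP)
}
```

The sketch only decodes what the kernel logged; it says nothing about which code raised the signal.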

Kurtis Rader

Sep 9, 2021, 10:37:00 AM
to varun...@gmail.com, golang-nuts
Googling "go morestack on gsignal" turns up quite a few reports. Some involved kernel bugs (e.g., https://github.com/golang/go/issues/19652), but most seem to involve SIGSEGV errors in non-kernel code: sometimes the Go runtime or stdlib (e.g., https://github.com/golang/go/issues/35235) and sometimes user code. Given that you're using CGO, a likely explanation is a bug in your C code. However, you say you're using Go 1.13.7, which is getting long in the tooth. Are you sure you're seeing the same failure with Go 1.16.5? If yes, then I would bet the bug is in your C code. If no, then it could be a Go bug that has likely already been fixed.

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

varun...@gmail.com

Sep 9, 2021, 11:28:38 AM
to golang-nuts
@Kurtis, Thanks for the reply.

Yes. We upgraded to 1.16.5 due to this issue and had to revert back to 1.13.7 (over earlier working version) as we are frequently seeing these errors in 1.16.5.

I looked into all the related issues on golang forums. We are not on NetBSD (We are on CentOS7.8)

We also suspect a bug in the CGO code (specifically some memory violation), but finding it is like finding a needle in a haystack. The SIGSEGV CGO signal handler did not catch anything at the time of this error. The error message by itself does not mention anything debuggable. There is no stack trace or core dump. We tried working with GOTRACEBACK, but that did not generate a core dump either.

Thanks,
Varun
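
On the missing core dump mentioned above: GOTRACEBACK=crash only asks the Go runtime to abort in a way that can produce a core; the operating system still decides whether one is written, based on the process's RLIMIT_CORE and settings such as the kernel's core_pattern. The sketch below, assuming a Linux build, checks and raises the soft core limit from inside the process and sets the traceback level with `runtime/debug.SetTraceback`. The helper names `coreLimit` and `raiseCoreLimit` are hypothetical, and the thread does not establish whether this yields a core for this particular failure.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"syscall"
)

// coreLimit (hypothetical helper) returns the soft and hard RLIMIT_CORE
// values for the current process. A soft limit of 0 disables core files.
func coreLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_CORE, &rl); err != nil {
		return 0, 0, err
	}
	return rl.Cur, rl.Max, nil
}

// raiseCoreLimit (hypothetical helper) raises the soft limit to the hard
// limit, which an unprivileged process is always allowed to do.
func raiseCoreLimit() error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_CORE, &rl); err != nil {
		return err
	}
	rl.Cur = rl.Max
	return syscall.Setrlimit(syscall.RLIMIT_CORE, &rl)
}

func main() {
	// Same effect as starting the process with GOTRACEBACK=crash.
	debug.SetTraceback("crash")

	if err := raiseCoreLimit(); err != nil {
		fmt.Println("could not raise core limit:", err)
	}
	soft, hard, err := coreLimit()
	if err != nil {
		fmt.Println("could not read core limit:", err)
		return
	}
	if soft == 0 {
		fmt.Println("core dumps are disabled for this process (hard limit is 0)")
		return
	}
	fmt.Printf("core limit: soft=%d hard=%d\n", soft, hard)
}
```

If the hard limit itself is 0, it has to be raised by whatever launches the process, for example the service manager's configuration.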

Brian Candler

Sep 9, 2021, 1:17:30 PM
to golang-nuts
Just a random thought, but have you tried 1.16 with GODEBUG=asyncpreemptoff=1? Asynchronous preemption was introduced in 1.14, I believe.

Varun V

Sep 9, 2021, 1:29:58 PM
to Brian Candler, golang-nuts
@Brian, We did not try that as the issue is happening with go 1.13.7 as well

Ian Lance Taylor

Sep 9, 2021, 3:14:41 PM
to Varun V, Brian Candler, golang-nuts
On Thu, Sep 9, 2021 at 10:29 AM Varun V <varun...@gmail.com> wrote:
>
> @Brian, We did not try that as the issue is happening with go 1.13.7 as well

The "fatal: morestack on gsignal" error is more or less impossible.
Some things that can cause it to happen are:

1) C code calls sigaltstack, but the alternate signal stack is too small.
2) C code trashes the TLS slot that stores the value of G during a cgo call.
3) C code fails to correctly preserve registers across a cgo call.

None of these are at all likely.

Ian
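
Cause 1 in the list above concerns the alternate signal stack. The Go runtime handles signals on a fixed-size per-thread stack (the "gsignal" stack), which cannot grow; that is why a request for more stack there is fatal. As an illustration only, the sketch below queries the alternate signal stack currently registered for the calling thread using the raw `sigaltstack` system call. It assumes 64-bit Linux (linux/amd64 or linux/arm64) for the `stackT` layout, and the names `stackT`, `ssDisable`, and `currentAltStack` are hypothetical. Calling it before and after initializing a C library that installs its own signal handling would show whether that library replaced the stack on that thread.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

// stackT mirrors the kernel's stack_t on 64-bit Linux (an assumption of
// this sketch; other platforms lay the structure out differently).
type stackT struct {
	sp    uintptr // base address of the alternate stack
	flags int32   // SS_ONSTACK / SS_DISABLE bits
	_     int32   // padding to align size
	size  uintptr // size of the alternate stack in bytes
}

// ssDisable is SS_DISABLE on Linux: no alternate stack is registered.
const ssDisable = 2

// currentAltStack (hypothetical helper) returns the alternate signal
// stack registered for the OS thread the caller is running on.
func currentAltStack() (stackT, error) {
	// Alternate stacks are per thread, so stay on one thread for the call.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	var old stackT
	// sigaltstack(NULL, &old) reads the current setting without changing it.
	_, _, errno := syscall.RawSyscall(syscall.SYS_SIGALTSTACK,
		0, uintptr(unsafe.Pointer(&old)), 0)
	if errno != 0 {
		return stackT{}, errno
	}
	return old, nil
}

func main() {
	st, err := currentAltStack()
	if err != nil {
		fmt.Println("sigaltstack failed:", err)
		return
	}
	if st.flags&ssDisable != 0 {
		fmt.Println("no alternate signal stack on this thread")
		return
	}
	fmt.Printf("alternate signal stack: base=%#x size=%d bytes\n", st.sp, st.size)
}
```

The check covers only the thread it runs on, so a replacement made on a different thread would need the same query issued there.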

Robert Engels

Sep 13, 2021, 8:34:52 AM
to Ian Lance Taylor, Varun V, Brian Candler, golang-nuts
If the C code overruns a stack-allocated variable, couldn’t it easily corrupt the saved registers?

Ian Lance Taylor

Sep 13, 2021, 2:16:52 PM
to Robert Engels, Varun V, Brian Candler, golang-nuts
On Mon, Sep 13, 2021 at 5:34 AM Robert Engels <ren...@ix.netcom.com> wrote:
>
> If the C code overruns a stack-allocated variable, couldn’t it easily corrupt the saved registers?

Yes, that is another possible cause. Thanks.

Ian

varun...@gmail.com

Sep 14, 2021, 1:59:24 PM
to golang-nuts
Thanks @Ian for the comments. I think we have a breakthrough. One of our C libraries uses breakpad (https://github.com/google/breakpad) to generate a mini core dump when there is a crash. With breakpad's crash handling mechanisms disabled, the issue no longer seems to reproduce. We are still not clear about a few things:

a. In a normal test run, we do not notice any crashes, and we do not expect breakpad to do anything until a crash happens. So what is leading breakpad to misbehave?
b. How is breakpad leading to "morestack on gsignal"?

We are investigating these two issues but we now have a hook to navigate the stack and try to understand what is happening. Thanks for all the comments.

Regards,
Varun
