CGo, GC, and reproducible heap corruption

349 views
Skip to first unread message

ad...@google.com

unread,
Jul 16, 2015, 2:16:40 PM7/16/15
to golan...@googlegroups.com
Hi Gophers,

My team is developing a Go tool using the open-source libgit2 library and Go 1.4.2. Our (quite large) binary occasionally segfaults during a Go->C->Go callback sequence. From our GDB deep dives we observe the GC clearing valid Go memory. It seems related to GC running during the Go callback, which should in general work fine from reading the Go runtime tests.

We're well aware that the C code in this library could be at fault, so I reduced the source while retaining what I believe to be the same underlying issue. I've produced a small Go program (~50 SLOC) which calls just a handful of auditable C functions, yet panics with "runtime: garbage collector found invalid heap pointer *(0x4c20801f5e0+0x18)=0x1 s=nil". Using a GDB watchpoint on that address I see it is only ever written by the Go runtime itself. We can't find what about this small program is incorrect, having pored over threads and GitHub issues relating to the correct use of CGo. Our hope is that if we can figure out why this program crashes, we can figure out why our larger binary crashes.

We're using Go 1.4.2 on Linux x86_64, most specifically 3.13.0-55-generic #94-Ubuntu SMP Thu Jun 18 00:27:10 UTC 2015. I've also reproduced it on 3.13.0-49-generic #83-Ubuntu SMP Fri Apr 10 20:11:33 UTC 2015.

I've attached a tarball containing the repro source, along with a small script to build libgit2, build the binary and trigger the crash. Unfortunately libgit2 is only distributed through source and is built using cmake, a non-standard build tool. So you will need cmake to run the script/build libgit2.

Thank you for your time, and please let me know if there's any more information we can provide to help figure this out.

Cheers!
Mike


repro.tar.gz

David Crawshaw

unread,
Jul 16, 2015, 2:46:11 PM7/16/15
to ad...@google.com, golang-nuts
Without running your program, I see that crefspecs is allocated and
passed to C on L54. But it is not pinned on the Go side. That is, it's
still in the function scope, but it is not used after L54, so it could
be collected. If you declare crefspecs a global variable, do you still
see the problem?
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

ad...@google.com

unread,
Jul 16, 2015, 3:01:31 PM7/16/15
to golan...@googlegroups.com, ad...@google.com
On Thursday, July 16, 2015 at 2:46:11 PM UTC-4, David Crawshaw wrote:
Without running your program, I see that crefspecs is allocated and
passed to C on L54. But it is not pinned on the Go side. That is, it's
still in the function scope, but it is not used after L54, so it could
be collected. If you declare crefspecs a global variable, do you still
see the problem?

I do still see the problem if I declare "var crefspecs C.git_strarray" as a global variable.

(I had assumed, perhaps erroneously, that "defer C.free(unsafe.Pointer(crefspecs.strings))" would have pinned the overall crefspecs object - we ran into the need to pin objects during the reducing process. I don't think anything else there is relying on that class of assumption)

David Crawshaw

unread,
Jul 16, 2015, 3:10:05 PM7/16/15
to ad...@google.com, golang-nuts
On Thu, Jul 16, 2015 at 3:01 PM, <ad...@google.com> wrote:
> On Thursday, July 16, 2015 at 2:46:11 PM UTC-4, David Crawshaw wrote:
>>
>> Without running your program, I see that crefspecs is allocated and
>> passed to C on L54. But it is not pinned on the Go side. That is, it's
>> still in the function scope, but it is not used after L54, so it could
>> be collected. If you declare crefspecs a global variable, do you still
>> see the problem?
>
>
> I do still see the problem if I declare "var crefspecs C.git_strarray" as a
> global variable.
>
> (I had assumed, perhaps erroneously, that "defer
> C.free(unsafe.Pointer(crefspecs.strings))" would have pinned the overall
> crefspecs object - we ran into the need to pin objects during the reducing
> process. I don't think anything else there is relying on that class of
> assumption)

The expression crefspecs.strings is evaluated immediately, not when
the deferred function runs, which means crefspecs does not need to
live until the end of the function.

But that was the only thing that stood out reading your code. I'm not
having much luck running your code on OS X, I'll try again later
tonight.

David Crawshaw

unread,
Jul 19, 2015, 11:50:42 AM7/19/15
to Michael Edgar, golang-nuts
I can get your code to compile, but I cannot replicate your error.
(Instead I see what looks like an unrelated SIGABRT after the Go
callback returns.)

However, if I add -gcflags "-m -l -N" to your go install command, with
Go 1.4.2 I see:

./main.go:41: moved to heap: ccallbacks
./main.go:42: &ccallbacks escapes to heap
./main.go:29: main &repoPtr does not escape
./main.go:36: main &remotePtr does not escape
./main.go:43: main &ccallbacks does not escape
./main.go:44: main &ccallbacks does not escape
./main.go:45: main &ccallbacks does not escape
./main.go:55: main ... argument does not escape
./main.go:56: main &crefspecs does not escape
./main.go:57: main ... argument does not escape

That's a bug, the kind of bug that could cause the error you saw. In
fact, I believe it is https://golang.org/issue/10303, which has been
fixed in Go 1.5. Running with tip I see:

./main.go:27: moved to heap: repoPtr
./main.go:29: &repoPtr escapes to heap
./main.go:33: moved to heap: remotePtr
./main.go:36: &remotePtr escapes to heap
./main.go:41: moved to heap: ccallbacks
./main.go:42: &ccallbacks escapes to heap
./main.go:43: &ccallbacks escapes to heap
./main.go:44: &ccallbacks escapes to heap
./main.go:45: &ccallbacks escapes to heap
./main.go:50: moved to heap: crefspecs
./main.go:56: &crefspecs escapes to heap

...which looks promising. Could you try out the Go 1.5 beta and see if
that resolves your issue?

d.

ad...@google.com

unread,
Jul 21, 2015, 9:51:26 AM7/21/15
to golan...@googlegroups.com, ad...@google.com


On Sunday, July 19, 2015 at 11:50:42 AM UTC-4, David Crawshaw wrote:
...which looks promising. Could you try out the Go 1.5 beta and see if
that resolves your issue?


Rebuilding with Go 1.5 beta2 seems to have resolved the segfaults in our main binary in our tests and in our users' environments.

Given the generality of the linked issue and the others filed similar to it, it seems like cgo was broken for supported, expected use-cases for at least part of the 1.4 release cycle. Will the Go team be producing a postmortem which details the extent of the impact, root cause, and actions taken to prevent breakage in the future?

David Crawshaw

unread,
Jul 21, 2015, 10:43:34 AM7/21/15
to Michael Edgar, golang-nuts
On Tue, Jul 21, 2015 at 9:51 AM, <ad...@google.com> wrote:
>
>
> On Sunday, July 19, 2015 at 11:50:42 AM UTC-4, David Crawshaw wrote:
>>
>> ...which looks promising. Could you try out the Go 1.5 beta and see if
>> that resolves your issue?
>>
>
> Rebuilding with Go 1.5 beta2 seems to have resolved the segfaults in our
> main binary in our tests and in our users' environments.
>
> Given the generality of the linked issue and the others filed similar to it,
> it seems like cgo was broken for supported, expected use-cases for at least
> part of the 1.4 release cycle. Will the Go team be producing a postmortem
> which details the extent of the impact, root cause, and actions taken to
> prevent breakage in the future?

A postmortem is typically used at Google for a production outage or
paging event. The ones I have been involved in have been helpful in
diagnosing organizational process issues, in particular interactions
between teams. But I'm unsure how to apply it here.

In particular, while I can safely say what you encountered here will
be classified as a bug, and golang.org/cl/10814 included a test to
avoid regressions, it is part of a larger effort to specify what is
allowed and what is not with Go pointers crossing the cgo boundary
into C. If you read through #10303 and follow the referenced
golang-dev discussions you'll see this is an ongoing unresolved issue.
(See also #8310, #8839, etc.) It is not unresolved for lack of effort
or discussion, it's a hard technical problem that I don't think I can
constructively add to with an SRE-inspired postmortem.
Reply all
Reply to author
Forward
0 new messages