Strange behavior on Go tip

629 views
Skip to first unread message

ca...@stripe.com

unread,
Aug 5, 2015, 4:33:24 PM8/5/15
to golang-dev
At work we're running tests for a moderate sized Go codebase (wc says ~60k including non stdlib dependencies) against Go tip in preparation for the Go 1.5 release, and we've noticed some strange behavior of the "smells like it could be a bug in Go" variety. I'm not sure if this is the right venue to surface this, or if I should do more legwork on my end to produce a more minimal test case, but here goes:

The first oddity appears to be a fault from within the runtime hashmap implementation (all backtraces lightly edited for privacy):

unexpected fault address 0x0
fatal error: fault
[signal 0x7 code=0x80 addr=0x0 pc=0x408e5b]

goroutine 106 [running]:
runtime.throw(0x8bce98, 0x5)
/home/carl/git/go/src/runtime/panic.go:527 +0x90 fp=0xc82003c9c8 sp=0xc82003c9b0
runtime.sigpanic()
/home/carl/git/go/src/runtime/sigpanic_unix.go:21 +0x10c fp=0xc82003ca18 sp=0xc82003c9c8
runtime.mapassign1(0x7c4e20, 0xc820418480, 0xc82003cb20, 0xc82003cb60)
/home/carl/git/go/src/runtime/hashmap.go:446 +0x2eb fp=0xc82003cae0 sp=0xc82003ca18
FuncA(0xc8203bdd50, 0xc8203dc000, 0xb, 0x0, 0x0)
/path/to/package/under/test/funca.go:110 +0x1f9 fp=0xc82003cb80 sp=0xc82003cae0
GoroutineOfClosureInTestFunc(0xc82013d8c0, 0xc8200b45a0, 0xc82009eaa0, 0x31)
/path/to/package/under/test/my_test.go:82 +0xdf7 fp=0xc82003cf90 sp=0xc82003cb80
runtime.goexit()
/home/carl/git/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc82003cf98 sp=0xc82003cf90
created by TestFunc
/path/to/package/under/test/my_test.go:103 +0x174

Many other goroutines have been omitted for brevity. The relevant line in FuncA is a simple vanilla map assignment, and at the point of the panic the map in question has been exclusively created and owned by a single goroutine. This one occurred three times: once in our testing environment (Travis CI), once in a local Ubuntu 14.04.3 LTS VM, and once on a remote Ubuntu 12.04.5 LTS VM. The latter two were occurred using the same binary.

The second oddity is an unexplained panic in makeslice:

panic: runtime error: makeslice: len out of range
 
goroutine 82 [running]:
FuncA(0xc8204fdd70, 0x7fe150388c18, 0xc82000a090, 0x0, 0x0, 0x0, 0x0, 0x0)
        /path/to/package/under/test/subpackage/filea.go:18 +0x92
FuncB(0xc820369960, 0x7fe150388c18, 0xc82000a090, 0x0, 0x0)
        /path/to/package/under/test/fileb.go:92 +0xa3
FuncC(0xc820369960, 0xc82011e020, 0x1d, 0x1d, 0x0, 0x0)
        /path/to/package/under/test/fileb.go:125 +0x14e
FuncD(0xc820369960, 0xc82011e020, 0x1d, 0x1d, 0x0, 0x0)
        <autogenerated>:29 +0x7a
GoroutineServerRecvLoop(0xc8200a29b0, 0x7fe151bc7138, 0xc8201b80c0, 0x7fe151bc7178, 0xc8201ba000)
        /path/to/package/under/test/channel.go:195 +0x2fe
created by /path/to/package/under/test.(*Channel).connLoop
        /path/to/package/under/test/channel.go:110 +0x40b

This normally indicates a programming error, but FuncA looks like this:

type Thing struct {
Length int
}

func (t *Thing) FuncA(reader io.Reader) ([]byte, error) {
data := make([]byte, t.Length)
_, err := io.ReadFull(reader, data)
return data, err
}

and I happen to know that Thing.Length is always a compile-time constant in the range 1–110 that is set once at struct creation time and never mutated. This one occurred once on a Ubuntu 12.04.5 LTS VM.

The third oddity is a runtime sanity check / error return that resulted in an ordinary test failure.

channel_test.go:89: len(data)=2 != field.length=859534820080

The code generating this error is a method on the same Thing struct from the previous oddity:

func (t *Thing) Write(writer io.Writer, data []byte) error {
if len(data) != t.Length {
return fmt.Errorf("len(data)=%d != thing.length=%d", len(data), t.Length)
}
_, err := writer.Write(data)
return err
}

As above, Length is always a compile-time small integer constant, and the value 859534820080 (or any value like it) does not appear in our codebase. This one occurred once on an Ubuntu 12.04.5 LTS VM.


All of the above results occurred against Go commit 3cfc34a555aad78cff2023eb00358e4d9a8f7940, bootstrapped using the official 1.4.2 tarball, on 64-bit Linux. I have a binary produced by "go test -race -c" that has produced each of the above results over thousands to tens-of-thousands of runs. The error seems to be triggered ~exclusively as part of a moderately goroutine intensive stress test (50 goroutines). No non-standard library package in any dependency of this package includes package unsafe, and there is no use of cgo. I haven't performed similar tests against Go 1.4 (beyond our standard CI) or any other Go versions, so I can't be sure that this behavior is new since the last release / beta.

I guess first of all: has anyone else seen odd behavior like this in Go tip or in any of the betas? Our code under test doesn't strike me as being particularly interesting, so I'd be surprised if we were the only ones to stumble upon this behavior. Secondly, does this smell like a Go bug? I haven't conclusively ruled out the possibility of a race in our application logic or testing code, but I've given it a pretty honest looking over.

Assuming this behavior does indicate a bug in Go, I'm happy to help out in whatever way I can, but unfortunately the code is proprietary and I likely won't be able to distribute it very widely.

Carl

Carl Jackson

unread,
Aug 5, 2015, 8:29:14 PM8/5/15
to rhys.h...@gmail.com, golang-dev
Thanks, Rhys!

I've bumped the number of goroutines in the problematic stress test from 50 to 500, and now have a 64-bit linux test binary that pretty reliably (~90% of runs) triggers one of the three weird behaviors. I've been playing around with GODEBUG flags per your suggestion, and have found the following:

- gcstoptheworld=1 reliably eliminates the errors
- gccheckmark=1 reliably fails with an error like the following:

runtime:greyobject: checkmarks finds unexpected unmarked object obj=0xc820129bd0
runtime: found obj at *(0xc820371d50+0x8)
base=0xc820371d50 k=0x64101b8 s.start*_PageSize=0xc820370000 s.limit=0xc820371ff0 s.sizeclass=8 s.elemsize=112
 *(base+0) = 0x7f36649ca9e8
 *(base+8) = 0xc820129bd0 <==
 *(base+16) = 0xc82005d3b0
 *(base+24) = 0x7f366407d000
 *(base+32) = 0xabbb80
 *(base+40) = 0xc82005d3e0
 *(base+48) = 0xc820129be0
 *(base+56) = 0x2
 *(base+64) = 0x8
 *(base+72) = 0xabc558
 *(base+80) = 0x0
 *(base+88) = 0x0
 *(base+96) = 0xc820321a70
 *(base+104) = 0x0
obj=0xc820129bd0 k=0x6410094 s.start*_PageSize=0xc820128000 s.limit=0xc82012a000 s.sizeclass=2 s.elemsize=16
 *(obj+0) = 0x2
 *(obj+8) = 0x0
fatal error: checkmark found unmarked object

Based on my understanding of gccheckmark from reading the documentation in package runtime, this appears to be a bug in the Go GC. I'll continue to poke on this end, but as always any pointers (ha ha) towards identifying the root issue here would definitely be appreciated!

On Wed, Aug 5, 2015 at 3:05 PM, <rhys.h...@gmail.com> wrote:
The value 859534820080 is 0xC8204482F0 in hex - looks like an address. If t.Length is the first word in the struct (as you indicate), it could be that the memory for t has been reclaimed, leaving the t pointer dangling. I ran into heap corruption in golang.org/issue/11643 - my understanding is that when memory is reclaimed, the first word of the object is overwritten by a pointer to the next free piece of memory and the second word (if applicable) is overwritten with 0xdeaddeaddeaddead to indicate that the remainder of the object needs to be zeroed before use.

Things that helped us debug were 1) running with GODEBUG=gccheckmark=1 and 2) looking through the code at *everything* that touched the problematic memory location (or held pointers to it). In our application, a pointer to the object was passed on an unbuffered channel via a select statement.

HTH,
Rhys

rhys.h...@gmail.com

unread,
Aug 5, 2015, 8:37:59 PM8/5/15
to golang-dev, ca...@stripe.com
The value 859534820080 is 0xC8204482F0 in hex - looks like an address. If t.Length is the first word in the struct (as you indicate), it could be that the memory for t has been reclaimed, leaving the t pointer dangling. I ran into heap corruption in golang.org/issue/11643 - my understanding is that when memory is reclaimed, the first word of the object is overwritten by a pointer to the next free piece of memory and the second word (if applicable) is overwritten with 0xdeaddeaddeaddead to indicate that the remainder of the object needs to be zeroed before use.

Things that helped us debug were 1) running with GODEBUG=gccheckmark=1 and 2) looking through the code at *everything* that touched the problematic memory location (or held pointers to it). In our application, a pointer to the object was passed on an unbuffered channel via a select statement.

HTH,
Rhys

On Wednesday, August 5, 2015 at 1:33:24 PM UTC-7, ca...@stripe.com wrote:

Russ Cox

unread,
Aug 5, 2015, 8:52:03 PM8/5/15
to Carl Jackson, rhys.h...@gmail.com, golang-dev
Hi Carl,

On Wed, Aug 5, 2015 at 8:29 PM, 'Carl Jackson' via golang-dev <golan...@googlegroups.com> wrote:
I've bumped the number of goroutines in the problematic stress test from 50 to 500, and now have a 64-bit linux test binary that pretty reliably (~90% of runs) triggers one of the three weird behaviors. I've been playing around with GODEBUG flags per your suggestion, and have found the following:

- gcstoptheworld=1 reliably eliminates the errors
- gccheckmark=1 reliably fails with an error like the following:

If you replace the gcstoptheworld=1 with either gcshrinkstackoff=1 or gcstackbarrieroff=1, does that also reliably eliminate the errors? Stopping the world effectively implies both of those, and more. Knowing whether each of those helps would narrows down the search.

runtime:greyobject: checkmarks finds unexpected unmarked object obj=0xc820129bd0
runtime: found obj at *(0xc820371d50+0x8)
base=0xc820371d50 k=0x64101b8 s.start*_PageSize=0xc820370000 s.limit=0xc820371ff0 s.sizeclass=8 s.elemsize=112
 *(base+0) = 0x7f36649ca9e8
 *(base+8) = 0xc820129bd0 <==
 *(base+16) = 0xc82005d3b0
 *(base+24) = 0x7f366407d000
 *(base+32) = 0xabbb80
 *(base+40) = 0xc82005d3e0
 *(base+48) = 0xc820129be0
 *(base+56) = 0x2
 *(base+64) = 0x8
 *(base+72) = 0xabc558
 *(base+80) = 0x0
 *(base+88) = 0x0
 *(base+96) = 0xc820321a70
 *(base+104) = 0x0
obj=0xc820129bd0 k=0x6410094 s.start*_PageSize=0xc820128000 s.limit=0xc82012a000 s.sizeclass=2 s.elemsize=16
 *(obj+0) = 0x2
 *(obj+8) = 0x0
fatal error: checkmark found unmarked object

Odd question, I know, but does that object look familiar? We know contains pointers and it's at least 104 bytes long (there's a non-zero word at base+96), maybe 112 (depending on whether that final 0 is a field or padding). It contains pointers, so the object size is a multiple of 8, so those are the only options. If you 'go get rsc.io/sizeof', you can run 'sizeof -p your/package' to get the sizes of all types. Look for ones that are 104 or 112 bytes long. You can run 'sizeof -p your/package -f typename' to see the memory layout of the type.

For example, I just ran:

$ sizeof -p runtime | grep 112
mspan 112
parforthread 112
maptype 112
$

$ sizeof -f -p runtime parforthread
parforthread 112
parforthread.pos 0
parforthread.nsteal 8
parforthread.nstealcnt 16
parforthread.nprocyield 24
parforthread.nosyield 32
parforthread.nsleep 40
parforthread.pad 48
$

Parforthread doesn't match up since basically all the fields are integers, not pointers. (You have to look at the Go source code to find the fields types.) The other runtime types don't match up either. So probably it's one of the struct types in your program.

If you find a candidate, then the interesting question is what writes to that second word (which points at a type not larger than 16 bytes), and how and when, and why the garbage collector might have missed that write in your program.

Thanks very much.
Russ

Carl Jackson

unread,
Aug 5, 2015, 9:36:07 PM8/5/15
to Russ Cox, rhys.h...@gmail.com, golang-dev
The type in question is one from the package under test, and is 104 bytes long. Expanding types into standard ones, it looks like this:

type Thing struct {
a interface{}
b map[int][]byte // <==
c interface{}
d map[int]interface{}
e []byte
f []byte
g map[int][]byte // <==
}

I've marked the two fields of type map[int][]byte—those are the two fields that gccheckmark indicate (normally b, but sometimes g). Perhaps not so curiously, map[int][]byte is also the type being assigned to as part of "oddity one" (the runtime.mapassign1 one from my original post).

Field b is a map literal defined in a var statement at the top level of the package, and is never mutated or reassigned. Field g is instantiated using "make(map[int][]byte)" (with no capacity hint). Both fields b and g are single-assignment and provided at struct instantiation time. In case it matters, field b is actually of type "map[IntAlias][]byte", with "type IntAlias int".

None of gcshrinkstackoff=1, gcstackbarrieroff=1, or the combination of the two have any effect.

Russ Cox

unread,
Aug 6, 2015, 12:03:11 AM8/6/15
to Carl Jackson, rhys.h...@gmail.com, golang-dev
Hi Carl,

What about the 'a interface{}' field?
How is it assigned or obtained?
What concrete type should it contain?
Is it a pointer type or an integer type?

The dump says

 *(base+0) = 0x7f36649ca9e8
 *(base+8) = 0xc820129bd0 <==

An interface is two words, so actually base+8 is the second word of the interface, not the 'b map' field. Also, are fields a and d interface{} or interfaces with methods? They have different representations (http://research.swtch.com/interfaces if you are interested). From the addresses it looks to me like they are both interfaces with methods.

That second word is the actual typed data being stored in the interface (if the data is itself a pointer) or else a pointer to the data. It looks like it points to an object of at most 16 bytes that starts with a 2.

Thanks.
Russ

Carl Jackson

unread,
Aug 6, 2015, 2:00:18 AM8/6/15
to Russ Cox, Rhys Hiltner, golang-dev
Ah, oops. I was aware of how interfaces were represented, but am apparently very bad at counting :)

To make sure I'm communicating as many of the relevant facts as I can without handing out a full code dump, I've made a gist with the output of one gccheckmark=1 of each variety (as much as anything so you can double-check my math!), plus a full type catalog, plus the manner in which the struct in question is constructed. None of the relevant fields of that struct are set outside the initialization code as presented (and I realize now that the fact that initialization involves a lot of struct copying may be a relevant factor here):



Carl Jackson

unread,
Aug 6, 2015, 2:41:42 AM8/6/15
to Russ Cox, Rhys Hiltner, golang-dev
Okay, I found a really short, really reliable reproducer. Works on both Darwin and Linux:

Brad Fitzpatrick

unread,
Aug 6, 2015, 3:08:24 AM8/6/15
to Carl Jackson, Rhys Hiltner, Russ Cox, golang-dev

I can't reproduce on Darwin at least. How long does it take? Any special config required?

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Carl Jackson

unread,
Aug 6, 2015, 3:27:44 AM8/6/15
to Brad Fitzpatrick, Rhys Hiltner, Russ Cox, golang-dev
I get the following within about 10 seconds:

~$ uname -a
Darwin devi.local 14.4.0 Darwin Kernel Version 14.4.0: Thu May 28 11:35:04 PDT 2015; root:xnu-2782.30.5~1/RELEASE_X86_64 x86_64
~$ ~/git/go/bin/go version
go version devel +fced03a Thu Aug 6 02:59:16 2015 +0000 darwin/amd64
~$ GODEBUG=gccheckmark=1 ~/git/go/bin/go run -race heavy.go
runtime:greyobject: checkmarks finds unexpected unmarked object obj=0xc82072cde0
runtime: found obj at *(0xc82072cdf0+0x8)
base=0xc82072cdf0 k=0x6410396 s.start*_PageSize=0xc82072c000 s.limit=0xc82072e000 s.sizeclass=2 s.elemsize=16
 *(base+0) = 0xc82072cdd0
 *(base+8) = 0xc82072cde0 <==
obj=0xc82072cde0 k=0x6410396 s.start*_PageSize=0xc82072c000 s.limit=0xc82072e000 s.sizeclass=2 s.elemsize=16
 *(obj+0) = 0x0
 *(obj+8) = 0x0
fatal error: checkmark found unmarked object
(etc etc.)

-----

carl@go15:~$ uname -a
Linux go15 3.16.0-45-generic #60~14.04.1-Ubuntu SMP Fri Jul 24 21:16:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
carl@go15:~$ go version
go version devel +fced03a Thu Aug 6 02:59:16 2015 +0000 linux/amd64
carl@go15:~$ GODEBUG=gccheckmark=1 go run -race heavy.go
runtime:greyobject: checkmarks finds unexpected unmarked object obj=0xc8204170d0
runtime: found obj at *(0xc8204170e0+0x8)
base=0xc8204170e0 k=0x641020b s.start*_PageSize=0xc820416000 s.limit=0xc820418000 s.sizeclass=2 s.elemsize=16
 *(base+0) = 0xc8204170c0
 *(base+8) = 0xc8204170d0 <==
obj=0xc8204170d0 k=0x641020b s.start*_PageSize=0xc820416000 s.limit=0xc820418000 s.sizeclass=2 s.elemsize=16
 *(obj+0) = 0x0
 *(obj+8) = 0x0
fatal error: checkmark found unmarked object

Carl

Caleb Spare

unread,
Aug 6, 2015, 3:28:13 AM8/6/15
to Brad Fitzpatrick, Carl Jackson, Rhys Hiltner, Russ Cox, golang-dev
I see the error ("runtime:greyobject: checkmarks finds unexpected
unmarked object") using tip running the code using the given
invocation after 1-2 seconds.

go version devel +fced03a Thu Aug 6 02:59:16 2015 +0000 darwin/amd64
Macbook Pro 10.10.2

-Caleb

On Thu, Aug 6, 2015 at 12:08 AM, Brad Fitzpatrick <brad...@golang.org> wrote:

rhys.h...@gmail.com

unread,
Aug 6, 2015, 4:47:00 AM8/6/15
to golang-dev, brad...@golang.org, ca...@stripe.com, rhys.h...@gmail.com, r...@golang.org
I'm able to reproduce the crash while running with the race detector. I'm not (yet) able to reproduce the crash without the race detector.

Carl, have you been running with the race detector this whole time? Have you ever seen this crash without the race detector?




$ go version

go version devel +fced03a Thu Aug 6 02:59:16 2015 +0000 darwin/amd64


$ time GODEBUG=gccheckmark=1 go run -race heavy.go 

runtime:greyobject: checkmarks finds unexpected unmarked object obj=0xc82058c550

runtime: found obj at *(0xc82058c560+0x8)

base=0xc82058c560 k=0x64102c6 s.start*_PageSize=0xc82058c000 s.limit=0xc82058e000 s.sizeclass=2 s.elemsize=16

 *(base+0) = 0xc82058c540

 *(base+8) = 0xc82058c550 <==

obj=0xc82058c550 k=0x64102c6 s.start*_PageSize=0xc82058c000 s.limit=0xc82058e000 s.sizeclass=2 s.elemsize=16

 *(obj+0) = 0x0

 *(obj+8) = 0x0

fatal error: checkmark found unmarked object

[snip]

exit status 2


real 0m2.712s

user 0m3.360s

sys 0m0.629s


$ time GODEBUG=gccheckmark=1 go run heavy.go 

^Cexit status 2


real 1m25.349s

user 3m25.085s

sys 0m4.475s

Carl Jackson

unread,
Aug 6, 2015, 5:19:40 AM8/6/15
to Rhys Hiltner, golang-dev, Brad Fitzpatrick, Russ Cox
I think I mentioned this in the original report, but every binary and test I've run has had the race detector enabled. I haven't been able to reproduce this bug without it, although I've only made cursory attempts to do so

Fredrik Ehnbom

unread,
Aug 6, 2015, 7:30:19 AM8/6/15
to golang-dev, rhys.h...@gmail.com, brad...@golang.org, r...@golang.org, ca...@stripe.com
I can also reproduce with "go version devel +fced03a Thu Aug 6 02:59:16 2015 +0000 darwin/amd64", but wanted to point out that it appears to disappear with GOMAXPROCS=1 if that's of any help.

/f

Damian Gryski

unread,
Aug 6, 2015, 9:40:27 AM8/6/15
to golang-dev, rhys.h...@gmail.com, brad...@golang.org, r...@golang.org, ca...@stripe.com
I tried bisecting it twice, once with a test timeout of 10s and once with a test timeout of 60s.  They both pointed me to different commits, and the previous commit (the "last good" commit) still failed when I ran it again.

Probably this test will need to be run several times with a longer timeout to decide if a particular revision is "good" or "bad".

Damian

Damian Gryski

unread,
Aug 6, 2015, 9:51:27 AM8/6/15
to golang-dev, rhys.h...@gmail.com, brad...@golang.org, r...@golang.org, ca...@stripe.com


On Thursday, August 6, 2015 at 3:40:27 PM UTC+2, Damian Gryski wrote:
I tried bisecting it twice, once with a test timeout of 10s and once with a test timeout of 60s.  They both pointed me to different commits, and the previous commit (the "last good" commit) still failed when I ran it again.

I was using go1.5beta1 ( b6ead9f171742cd5b519a22ecc690354b0d1ce27 ) as a known-good revision.  Running the test program five times with a one minute timeout produced no errors.  With the current tip I also get failures in 3-4 seconds.

Damian 

Damian Gryski

unread,
Aug 6, 2015, 10:44:46 AM8/6/15
to golang-dev, rhys.h...@gmail.com, brad...@golang.org, r...@golang.org, ca...@stripe.com
Having now run bisect again, with retrying 5 times with 60s timeout each time to get a failure, I'm pointed at https://github.com/golang/go/commit/64a32ffeee0003a6ead15a890016fea3187fabb1 as the first bad commit.  However, even on this commit it only fails ~1/5 times.  It's possible the true bug was introduced earlier, but this one made it a bit more visible.

Damian
 

Damian Gryski

unread,
Aug 6, 2015, 11:07:13 AM8/6/15
to golang-dev, rhys.h...@gmail.com, brad...@golang.org, r...@golang.org, ca...@stripe.com
Yup, still getting failures on 8f34b25318e712a18c3847bb1cc3e8d87076c211 , although failure rate is  ~1/20.
 
Damian
 

Russ Cox

unread,
Aug 7, 2015, 12:19:09 PM8/7/15
to Damian Gryski, golang-dev, Rhys Hiltner, Brad Fitzpatrick, Carl Jackson
Thanks for the test case. I've filed golang.org/issue/12068.

Reply all
Reply to author
Forward
0 new messages