linux/amd64 and linux/386 builds timing out in runtime tests

476 views
Skip to first unread message

Dave Cheney

unread,
Aug 9, 2017, 10:00:02 PM8/9/17
to golang-dev
Hello,

I submitted a revert [1] after the build dashboard indicated that
f0b362 [2] have broken the build for linux/amd64 and linux/386.

However this has not addressed the breakage. There may be something
sick with those builders.

Thanks

Dave

1. https://go-review.googlesource.com/c/54490
2. https://github.com/golang/go/commit/f0b36269041eff3b8bbdd18e2ff41b06557235d1

Pavel Zholkover

unread,
Aug 10, 2017, 10:28:31 AM8/10/17
to golang-dev
I'm not sure it's related but could the cause be the same as https://github.com/golang/go/issues/21383?
https://maintner.golang.org/apipb.MaintnerService/ service being down?

Austin Clements

unread,
Aug 10, 2017, 11:11:11 AM8/10/17
to golang-dev, Brad Fitzpatrick, Sarah Adams, Herbie Ong, Jessica Frazelle
I can reproduce on a linux-amd64 gomote by running all.bash (just running the runtime test isn't enough, and running locally is fine).

It's getting stuck building testprogcgo in the runtime test. In fact "cd src/runtime/testdata/testprogcgo; go build" takes 1 minute and 40 seconds. Most of that is spent running the cgo tool, which is in turn running gcc. Each gcc invocation is taking a really, really long time. However, the whole build consumes only 2 seconds total of CPU time. From polling process states, it looks like gcc is spending the vast majority of its time in "disk sleep".

So this seems like an IO issue with the builders. Did something change with the builder configuration recently that could cause excessively slow IO?

(Yay gomote ssh!)

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sarah Adams

unread,
Aug 10, 2017, 2:00:42 PM8/10/17
to Austin Clements, Go Language Open Source Build System, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle
+go-builders@

Sarah Adams

unread,
Aug 10, 2017, 2:16:06 PM8/10/17
to Austin Clements, Go Language Open Source Build System, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle
$ kubectl get pods
maintner-deployment-705276126-8x8c1        0/1       CrashLoopBackOff   263        12d

and also, corresponding maintner CL deployed 12 days ago: https://go-review.googlesource.com/#/c/build/+/51590/

digging...

Sarah Adams

unread,
Aug 10, 2017, 2:20:28 PM8/10/17
to Austin Clements, Go Language Open Source Build System, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle

Sarah Adams

unread,
Aug 10, 2017, 5:59:27 PM8/10/17
to Austin Clements, Go Language Open Source Build System, Ross Light, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle
+light
I'm heading out to the town hall. Will look more tomorrow if this is still an issue.

No relevant changes in build config that I can see.
This is the last builder CL that landed before builds started breaking: https://go-review.googlesource.com/#/c/build/+/54051/

Though I can't see how that could have caused this timeout thing.

Seems like probably a GCP issue.

Ross Light

unread,
Aug 10, 2017, 6:08:00 PM8/10/17
to Sarah Adams, Austin Clements, Go Language Open Source Build System, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle
Ack.  Any possibility to roll back to an older maintner release?

+go-builders@

To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Ross Light

unread,
Aug 10, 2017, 6:15:39 PM8/10/17
to Sarah Adams, Austin Clements, Go Language Open Source Build System, golang-dev, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle
I don't really have enough context here to investigate.  Without too much knowledge of the innards, it seems like rolling back would be best if possible.  If not, then we should try to isolate what changed in maintner.

Sarah Adams

unread,
Aug 10, 2017, 7:53:47 PM8/10/17
to Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
While I think there is a bug in maintner, I think it is probably separate from this issue (sorry for my conflation earlier).

I think this is related to kube/gce. We should investigate if there are any open kube issues that look relevant RN.

On phone now but I can look a little later.

Austin Clements

unread,
Aug 11, 2017, 10:08:48 AM8/11/17
to Sarah Adams, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
I didn't see any follow-up, but this appears to be resolved now.

Sarah, was there a specific fix, or did this just clear up on its own?


+go-builders@

To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.

Sarah Adams

unread,
Aug 11, 2017, 12:19:44 PM8/11/17
to Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
I did nothing. So it must of cleared up on its own.
We should probably figure out if it was a G{CE,KE} issue though. I'll look into this today.

The maintner issue has also "fixed itself" (for now). It is a data-based issue, so it will keep happening (crash cycle) periodically until the fix goes in.
I wrote this CL to fix it yesterday (https://go-review.googlesource.com/#/c/54751/), but didn't have time to verify that that was actually the issue before leaving the office, so didn't merge it.

But it definitely is the fix we want. Anyway, got a ping from Brad on twitter last night (lol) to merge it.
I'll get that in and deployed today.

Sarah

Sarah Adams

unread,
Aug 11, 2017, 4:16:42 PM8/11/17
to Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
maintner fix is live.

Brad Fitzpatrick

unread,
Aug 11, 2017, 4:43:51 PM8/11/17
to Austin Clements, Sarah Adams, Go Language Open Source Build System, Herbie Ong, Jessica Frazelle, Ross Light, golang-dev
The CL is correct because the Gerrit query API call can use data that's ahead of maintner for brief windows. The mirror into maintner tries to be real-time but won't always, especially if Gerrit emails are behind. 

And once it crashed, it'll miss the pubsubhelper wakeup from Gerrit emails later and fall back to backup Gerrit poller giving even larger window to crash. That is, get unlucky timing wise once and then bug will keep showing. 

Fix is correct to assume CL referenced by Gerrit is necessarily in maintner. 

There are other open bugs to improve the real time sync and backup poller latency though. 

(Brief relaxing phone break while newborn sleeps in left hand :))

maintner fix is live.

+go-builders@

To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Austin Clements

unread,
Aug 11, 2017, 4:52:06 PM8/11/17
to Sarah Adams, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
On Fri, Aug 11, 2017 at 12:19 PM, Sarah Adams <sha...@google.com> wrote:
I did nothing. So it must of cleared up on its own.
We should probably figure out if it was a G{CE,KE} issue though. I'll look into this today.

Sarah Adams

unread,
Aug 11, 2017, 5:10:31 PM8/11/17
to Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
yea these look like the same failures.

looks like this is the failing tests for the first:
runtime/crash_cgo_test.go#L333
TestCgoPprofPIE

this for the second two:
runtime/crash_test.go#L413
TestNetpollDeadlock

both tests shell out to another program.

Sarah Adams

unread,
Aug 11, 2017, 5:32:59 PM8/11/17
to Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
Sorry I'm not more help, this is a little over my head. 

+light +jess - any chance you've got a spare mo' to help debug this?

Jessica Frazelle

unread,
Aug 11, 2017, 5:35:30 PM8/11/17
to Sarah Adams, Jessica Frazelle, golang-dev, Austin Clements, Ross Light, Herbie Ong, Brad Fitzpatrick, Go Language Open Source Build System
Hmm catching up and will let you know if I have ideas...

Ross Light

unread,
Aug 11, 2017, 5:52:41 PM8/11/17
to Jessica Frazelle, Sarah Adams, Jessica Frazelle, golang-dev, Austin Clements, Herbie Ong, Brad Fitzpatrick, Go Language Open Source Build System
Do we have a contact on GCE side that we can ask questions?  If it's a question of I/O performance becoming unreliable, then perhaps something changed on their end?

To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Ian Lance Taylor

unread,
Aug 11, 2017, 6:16:36 PM8/11/17
to Sarah Adams, Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
On Fri, Aug 11, 2017 at 2:32 PM, Sarah Adams <sha...@google.com> wrote:
> Sorry I'm not more help, this is a little over my head.
>
> +light +jess - any chance you've got a spare mo' to help debug this?

I took a look at a sample trybot failure at

https://storage.googleapis.com/go-build-log/ab3bfefd/linux-amd64-race_200d4737.log

Many of the runtime tests work by building a test program and running
it. In this case the test TestCgoPprofThreadNoTraceback is waiting
for the `go build` command to complete. A lock is used to build only
one test program at a time, and many tests are sleeping on that lock.
A few other tests are completed, and are waiting sending on a channel
to report their completion.

The stack trace of the goroutine waiting for `go build` to complete is

goroutine 22714 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0x531, 0xc420e6d7d8, 0x1000004, 0x0, 0x0,
0x0, 0x0, 0xc42001a640)
/tmp/workdir/go/src/syscall/asm_linux_amd64.s:44 +0x5
fp=0xc420e6d780 sp=0xc420e6d778 pc=0x4bf955
os.(*Process).blockUntilWaitable(0xc420ed4000, 0x0, 0x0, 0x5)
/tmp/workdir/go/src/os/wait_waitid.go:31 +0xcd fp=0xc420e6d898
sp=0xc420e6d780 pc=0x4efcdd
os.(*Process).wait(0xc420ed4000, 0xc4211f8120, 0xc420ed4000, 0xc420012240)
/tmp/workdir/go/src/os/exec_unix.go:22 +0x72 fp=0xc420e6d938
sp=0xc420e6d898 pc=0x4e8072
os.(*Process).Wait(0xc420ed4000, 0xc42033d9d0, 0xc4211f80b0, 0x4394db)
/tmp/workdir/go/src/os/exec.go:115 +0x39 fp=0xc420e6d980
sp=0xc420e6d938 pc=0x4e6f39
os/exec.(*Cmd).Wait(0xc4211f8000, 0x0, 0x0)
/tmp/workdir/go/src/os/exec/exec.go:446 +0xf6 fp=0xc420e6da30
sp=0xc420e6d980 pc=0x5ec066
os/exec.(*Cmd).Run(0xc4211f8000, 0x68, 0x16)
/tmp/workdir/go/src/os/exec/exec.go:289 +0x87 fp=0xc420e6da78
sp=0xc420e6da30 pc=0x5eae57
os/exec.(*Cmd).CombinedOutput(0xc4211f8000, 0xc4211f8000, 0x9,
0x83cd8d, 0xb, 0xc42001a520)
/tmp/workdir/go/src/os/exec/exec.go:506 +0x2be fp=0xc420e6daf0
sp=0xc420e6da78 pc=0x5ecd8e
runtime_test.buildTestProg(0xc4201444b0, 0x83cd8d, 0xb, 0xc420e46020,
0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
/tmp/workdir/go/src/runtime/crash_test.go:142 +0x526
fp=0xc420e6dd00 sp=0xc420e6daf0 pc=0x72c026
runtime_test.testCgoPprof(0xc4201444b0, 0x0, 0x0, 0x843a9c, 0x19)
/tmp/workdir/go/src/runtime/crash_cgo_test.go:281 +0xf8
fp=0xc420e6df40 sp=0xc420e6dd00 pc=0x728518
runtime_test.TestCgoPprofThreadNoTraceback(0xc4201444b0)
/tmp/workdir/go/src/runtime/crash_cgo_test.go:341 +0x60
fp=0xc420e6df78 sp=0xc420e6df40 pc=0x729770
testing.tRunner(0xc4201444b0, 0x84f6a8)
/tmp/workdir/go/src/testing/testing.go:749 +0x16d fp=0xc420e6dfd0
sp=0xc420e6df78 pc=0x5612fd
runtime.goexit()
/tmp/workdir/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc420e6dfd8
sp=0xc420e6dfd0 pc=0x48dcd1
created by testing.(*T).Run
/tmp/workdir/go/src/testing/testing.go:792 +0x569



Unfortunately this doesn't tell me much. It appears that the `go
build` is taking too long, but I don't know why.

Ian




> On Fri, Aug 11, 2017 at 2:10 PM, Sarah Adams <sha...@google.com> wrote:
>>
>> yea these look like the same failures.
>>
>> looks like this is the failing tests for the first:
>> runtime/crash_cgo_test.go#L333
>> TestCgoPprofPIE
>>
>> this for the second two:
>> runtime/crash_test.go#L413
>> TestNetpollDeadlock
>>
>> both tests shell out to another program.
>>
>> On Fri, Aug 11, 2017 at 1:50 PM, Austin Clements <aus...@google.com>
>> wrote:
>>>
>>> On Fri, Aug 11, 2017 at 12:19 PM, Sarah Adams <sha...@google.com> wrote:
>>>>
>>>> I did nothing. So it must of cleared up on its own.
>>>> We should probably figure out if it was a G{CE,KE} issue though. I'll
>>>> look into this today.
>>>
>>>
>>> I think this may be back, based on various trybot failures that just
>>> happened:
>>>
>>>
>>> https://go-review.googlesource.com/c/45511#message-1138454e5080610c8e6d59a8a9b0a9522e66f034
>>>
>>> https://go-review.googlesource.com/c/55050#message-4926134c551d383b5692c15b6ace101babbeab7a
>>>
>>> https://go-review.googlesource.com/c/55032#message-30bc91fbc63fb2432407041270ceaf21c2e662b6
>>
>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Go Language Open Source Build System" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to go-builders...@google.com.
> To post to this group, send email to go-bu...@google.com.
> To view this discussion on the web visit
> https://groups.google.com/a/google.com/d/msgid/go-builders/CAOFTY0b1yy54YbOYpfvepJyTjggTWvGUbttyGg_pFhYOFDHPwQ%40mail.gmail.com.

Ian Lance Taylor

unread,
Aug 11, 2017, 6:18:28 PM8/11/17
to Sarah Adams, Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
By the way, I've tried to recreate the problem using gomote, but no luck.

Ian

Ralph Corderoy

unread,
Aug 12, 2017, 5:39:27 AM8/12/17
to Sarah Adams, Ian Lance Taylor, Austin Clements, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
Hi,

Ian wrote:
> In this case the test TestCgoPprofThreadNoTraceback is waiting for the
> `go build` command to complete.
...
> The stack trace of the goroutine waiting for `go build` to complete is
>
> goroutine 22714 [syscall]:
> syscall.Syscall6(0xf7, 0x1, 0x531, 0xc420e6d7d8, 0x1000004, 0x0, 0x0,
> 0x0, 0x0, 0xc42001a640)

Sarah, if you want to poke around this more another time, that Syscall6
broken down is

0xf7 = waitid(2)
0x1 = P_PID
0x531 = PID 1329
0xc420e6d7d8 = &siginfo_t
0x1000004 = WNOWAIT | WEXITED

So the third parameter is the PID you're interested in examining.

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Austin Clements

unread,
Aug 13, 2017, 11:35:55 AM8/13/17
to Ian Lance Taylor, Sarah Adams, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
Assuming this is the same problem that was happening a few days ago, each individual invocation of GCC by the cgo tool (itself invoked by go build) is taking several seconds, and 99% of that time is spent in I/O wait.

Sarah Adams

unread,
Aug 14, 2017, 3:22:40 PM8/14/17
to Austin Clements, Ian Lance Taylor, Go Language Open Source Build System, Ross Light, Brad Fitzpatrick, Herbie Ong, Jessica Frazelle, golang-dev
It seems to me that this is happening under load.
Though it could of course just be a random correlation.

These errors occurred when we had 100+ trybots running at once.
Similarly, these errors on the master builds also occurred under load.

Not sure yet how the builders are configured on the gcp instances. Wondering if they are all running on the same box / trying to share some resource.

Will keep digging.
Reply all
Reply to author
Forward
0 new messages