Seek help about a quirky CI failure and the environment of amd64-longtest

157 views
Skip to first unread message

Andy Pan

unread,
Apr 29, 2021, 10:06:19 PM4/29/21
to golang-dev
Hi all,
I submitted a CL https://go-review.googlesource.com/c/go/+/310149 to Go repo recently, which improves the process of draining P's local runnable queue in GC mark worker, it passes all tests on all golang build dashboard except for amd64-longtest, details: https://build.golang.org/log/6af9fb147fa3101154db10e7ce055e8267cd4c93https://storage.googleapis.com/go-build-log/1a3f4eea/linux-amd64-longtest_43bbc9fb.log.

It looked as if the runqdrain() will mess up some G's statuses, but I failed to figure out the root cause cuz runqdrain() is call only by the own P, some of my attempts can be seen here: https://go-review.googlesource.com/c/go/+/313009, so I hope anyone who specializes in Go runtime would help me out here or provide some valuable opinions, thanks in advance.

BTW, I also want to know how to set up an exact environment like amd64-longtest, I have never succeeded in reproducing this problem on my machines, all tests always pass, which makes it hard for me to debug this weird bug, I don't want to bother the reviewers to shoot a TryBot to verify every new experimental solution for me, thus I would appreciate it if someone helps me out of setting up amd64-longtest, thanks~

Keith Randall

unread,
Apr 30, 2021, 10:58:49 AM4/30/21
to Andy Pan, golang-dev
I think this should just be "GO_TEST_SHORT=0 all.bash".
You might also need to set GO_TEST_TIMEOUT_SCALE to >1 if your machine is slow.

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/72deab9e-a160-418f-85b2-4e9d4a3d6127n%40googlegroups.com.

Andy Pan

unread,
Apr 30, 2021, 11:06:25 AM4/30/21
to golang-dev
I've tried $GoDir/bin/go test -short=false runtime before which will achieve the same effect as your approach from my point of view, and it always succeeded as I mentioned before.

Michael Pratt

unread,
Apr 30, 2021, 11:08:50 AM4/30/21
to Andy Pan, golang-dev
Hi Andy,

I'm able to reproduce this issue locally at fbb600b28349a41742d35f1d2417c5843c6ba6e4 with: `while env GOMAXPROCS=16 ../bin/go test -count=4 -timeout=60s -run=TestChanSendBarrier$ runtime >/tmp/out.txt 2>&1; do echo again; done`

It takes a few minutes, but this lands me some timeouts as well as other throws. I'm also taking a look to see why it is failing.

--

Andy Pan

unread,
Apr 30, 2021, 11:20:34 AM4/30/21
to golang-dev
Hi Keith and Michael, thanks for your time here.

Michael, there are three errors caused by this CL: runtime timeout, newproc1 with bad g status (not Gdead) and goready with bad g status (expected _Gwaiting but got _Gdead).

Andy Pan

unread,
Apr 30, 2021, 11:28:52 AM4/30/21
to golang-dev
I've been trying to track down the root cause that messes up the status of some goroutines in p.gFree and work.assistQueue.q but got no luck.

Michael Pratt

unread,
Apr 30, 2021, 1:10:47 PM4/30/21
to Andy Pan, golang-dev
The immediate cause is clearly two threads using the same gp.schedlink (used by gQueue/gList). In the Gdead case, gfget is returning a g from the local p.gFree gList. There is no corresponding gfput, so the most likely cause is a corrupt gp.schedlink on an earlier entry in the list.

Here's a potential scenario where I think this can go wrong:

1. P1: call runqdrain, read head / tail.
2. P1: Push G1 to drainQ, setting g1.schedlink = 0, drainQ.head = g1.
3. P2: Call runqsteal, read head / tail, steal G1, update head.
4. P2: Execute G1, which exits.
5. P2: In goexit0, call p.gFree.push(g1), which sets g1.schedlink = p.gFree.head.
6. P1: Push G2 to drainQ, setting g2.schedlink = 0, drainQ.head.schedlink (== g1.schedlink) = g2.
7. P1: Attempt to CAS update head, which fails due to P2's steal.
8. P1: Retry, this time only draining G2.

runqdrain thinks it was successful, but step 6 is the critical error. In step 6, runqdrain has linked g2 onto P2's p.gFree list, at which point now a live goroutine can be erroneously returned from gfget as a "dead" G.

I'd have to think about it a bit more, but I think the correct summary would be to say that gp.schedlink may only be used by a caller that fully owns the G. runqdrain does not fully own the Gs it puts in the queue until it successfully CAS updates head, thus it is unsafe. Note that runqgrab doesn't have this problem because it places Gs into an array (actually another runq) rather than using a gQueue / gList.

Reply all
Reply to author
Forward
0 new messages