code review 6545051: pkg/syscall, pkg/os: Fix a fork-exec/wait race in Plan 9. (issue 6545051)

36 views
Skip to first unread message

se...@mail.nanosouffle.net

unread,
Sep 21, 2012, 6:27:11 AM9/21/12
to r...@golang.org, rmin...@gmail.com, n...@plan9.bell-labs.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com
Reviewers: rsc, rminnich, npe1,

Message:
Hello r...@golang.org, rmin...@gmail.com, n...@plan9.bell-labs.com (cc:
golan...@googlegroups.com),

I'd like you to review this change to
https://go.googlecode.com/hg/


Description:
pkg/syscall, pkg/os: Fix a fork-exec/wait race in Plan 9.

On Plan 9, only the parent of a given process can enter its wait
queue. When a Go program tries to fork-exec a child process
and subsequently waits for it to finish, the goroutines doing
these two tasks do not necessarily tie themselves to the same
(or any single) OS thread. In the case that the fork and the wait
system calls happen on different OS threads (say, due to a
goroutine being rescheduled somewhere along the way), the
wait() will either return an error or end up waiting for a
completely different child than was intended.

This change forces the fork and wait syscalls to happen in the
same goroutine and ties that goroutine to its OS thread until
the child exits. The PID of the child is recorded upon fork and
exit, and de-queued once the child's wait message has been read.
The Wait API, then, is translated into a synthetic implementation
that simply waits for the requested PID to show up in the queue
and then reads the associated stats.

Please review this at http://codereview.appspot.com/6545051/

Affected files:
M src/pkg/os/exec_plan9.go
M src/pkg/syscall/exec_plan9.go
M src/pkg/syscall/syscall_plan9.go


andrey mirtchovski

unread,
Sep 21, 2012, 12:12:11 PM9/21/12
to se...@mail.nanosouffle.net, r...@golang.org, rmin...@gmail.com, n...@plan9.bell-labs.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com
nice! tests now actually pass, wow!

go test -short std broke at database/sql. it appears to be a race
because go test -short in the directory itself passes. here's what i
see:

andrey 305474 0:01 0:26 797892K Pread go
andrey 305475 0:00 0:00 797892K Tsemacqu go
andrey 305476 0:00 0:02 797892K Semacqui go
andrey 305486 0:00 0:00 797892K Semacqui go
andrey 305487 0:00 0:00 797892K Semacqui go
andrey 305794 0:00 0:00 794000K Semacqui sql.test
andrey 305795 0:00 0:00 794000K Broken sql.test
% acid 305795
/proc/305795/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: lstk()
<stdin>:2: (error) no stack frame: can't translate address 0xf06bb5fe
acid:
echo kill > /proc/305795/ctl
% acid 305794
/proc/305794/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: lstk()
runtime.plan9_semacquire()+0x7 /usr/andrey/go/src/pkg/runtime/sys_plan9_386.s:50
runtime.semasleep(ns=0xffffffff)+0x50
/usr/andrey/go/src/pkg/runtime/thread_plan9.c:289
runtime.notesleep(n=0x1061e084)+0xb4
/usr/andrey/go/src/pkg/runtime/lock_sema.c:160
nextgandunlock()+0x19b /usr/andrey/go/src/pkg/runtime/proc.c:635
gp=0x1e2b74
schedule(gp=0x1061a080)+0xb4 /usr/andrey/go/src/pkg/runtime/proc.c:920
runtime.mcall(fn=0x1061a080)+0x34 /usr/andrey/go/src/pkg/runtime/asm_386.s:175
0x1061a080 ?file?:0
acid:

% cd pkg/database/sql && go test -short
PASS
ok database/sql 0.375s
%

here are the passing tests before that:

% time go test std -short '-timeout=120s'
ok cmd/api 0.277s
? cmd/cgo [no test files]
ok cmd/fix 2.337s
ok cmd/go 0.297s
? cmd/godoc [no test files]
ok cmd/gofmt 0.330s
? cmd/vet [no test files]
? cmd/yacc [no test files]
ok archive/tar 0.153s
ok archive/zip 0.690s
ok bufio 0.218s
ok bytes 0.415s
ok compress/bzip2 0.433s
ok compress/flate 1.425s
ok compress/gzip 0.165s
ok compress/lzw 0.550s
ok compress/zlib 2.670s
ok container/heap 0.143s
ok container/list 0.139s
ok container/ring 0.146s
? crypto [no test files]
ok crypto/aes 0.178s
ok crypto/cipher 0.209s
ok crypto/des 0.219s
ok crypto/dsa 0.300s
ok crypto/ecdsa 0.199s
ok crypto/elliptic 0.178s
ok crypto/hmac 0.157s
ok crypto/md5 0.159s
ok crypto/rand 0.364s
ok crypto/rc4 0.146s
ok crypto/rsa 0.513s
ok crypto/sha1 0.151s
ok crypto/sha256 0.149s
ok crypto/sha512 0.153s
ok crypto/subtle 0.156s
ok crypto/tls 0.623s
ok crypto/x509 1.799s
? crypto/x509/pkix [no test files]

rmin...@gmail.com

unread,
Sep 21, 2012, 3:46:26 PM9/21/12
to se...@mail.nanosouffle.net, r...@golang.org, n...@plan9.bell-labs.com, mirtc...@gmail.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com
This gets us over a major hurdle.
LGTM


http://codereview.appspot.com/6545051/

andrey mirtchovski

unread,
Sep 21, 2012, 9:01:33 PM9/21/12
to se...@mail.nanosouffle.net, r...@golang.org, rmin...@gmail.com, n...@plan9.bell-labs.com, mirtc...@gmail.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com
what I encountered after a day of running go test with this CL in:

- go test go/ast always fails with the last process in Pwrite state
- occasionally go test will fail in a random place with the last proc
in Broken state. the whole runtime appears to be blocked because after
the broken process is killed the runtime will fire the (long expired)
timeout set up by the go binary itself
- if a binary is just being started a keyboard interrupt will
occasionally cause:

split stack overflow: 0x1096ff3c < 0x10970000
throw: runtime: split stack overflow

- i saw one "throw: negative mcpu in scheduler" while running go test

attached is a log of go test -short -timeout 30s where the hung
processes (in broken state) are manually killed after a couple of
minutes.
gotest.out

Akshat Kumar

unread,
Sep 22, 2012, 2:25:41 PM9/22/12
to andrey mirtchovski, r...@golang.org, rmin...@gmail.com, n...@plan9.bell-labs.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com
Yes, there are definitely more issues in the go runtime
for Plan 9 to be resolved. :-)

But from what I can see, these seem to be orthogonal
to what this patch is doing; rather, they're coming up
because you're now able to get further into the process.

r...@golang.org

unread,
Sep 24, 2012, 11:46:11 AM9/24/12
to se...@mail.nanosouffle.net, rmin...@gmail.com, n...@plan9.bell-labs.com, mirtc...@gmail.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go
File src/pkg/syscall/exec_plan9.go (right):

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode519
src/pkg/syscall/exec_plan9.go:519: // Plan 9, only the parent thread can
ever enter the
s/ever enter the wait queue/wait/
There's no queue.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode539
src/pkg/syscall/exec_plan9.go:539: defer runtime.UnlockOSThread()
This will happen automatically when the goroutine dies. Delete.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode542
src/pkg/syscall/exec_plan9.go:542: forkc <- rval
You should probably wait to do this until you have entered the pid in
forks.
That means you'll need a forkc <- rval in the next if statement too.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode544
src/pkg/syscall/exec_plan9.go:544: // No one should be waiting
// If fork fails there is nothing to wait for.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode546
src/pkg/syscall/exec_plan9.go:546: if err != nil || pid == -1 {
err := rval.err first

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode552
src/pkg/syscall/exec_plan9.go:552: forks[pid] = true
You need a lock here.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode552
src/pkg/syscall/exec_plan9.go:552: forks[pid] = true
You can avoid having two maps by having

waitmsg map[int]*Waitmsg

and use nil to mean it's running and not done yet.

Also, why is pid int? Didn't we just establish that was too short?

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/exec_plan9.go#newcode553
src/pkg/syscall/exec_plan9.go:553: defer delete(forks, pid)
You need a lock here (in the code that runs at defer). Probably better
to just do the lock+delete at the bottom. The defer isn't saving
anything.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go
File src/pkg/syscall/syscall_plan9.go (right):

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go#newcode46
src/pkg/syscall/syscall_plan9.go:46: type Processes struct {
Please don't export this.

type processes struct

Also you can put the sync.Mutex and the sync.Cond here.
In fact you don't even need the type or the extra variables.

var procs struct {
once sync.Once
sync.RWMutex
cond sync.Cond
forks map[int]bool
waits map[int]Waitmsg
}

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go#newcode59
src/pkg/syscall/syscall_plan9.go:59: procs Processes
In general please don't put unexported vars in the same block as
exported vars. It complicates the godoc output.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go#newcode213
src/pkg/syscall/syscall_plan9.go:213: func makeProcs() {
Please move all this to the other file, so that all the code
manipulating procs is in one place. The procs struct too.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go#newcode239
src/pkg/syscall/syscall_plan9.go:239: return NewError("no forked process
with the given pid")
NewError("process not found") is fine.

http://codereview.appspot.com/6545051/diff/4001/src/pkg/syscall/syscall_plan9.go#newcode248
src/pkg/syscall/syscall_plan9.go:248: delete(waits, w.Pid)
Deleting from the map while holding a reader lock is a no-no. I would
just use a Mutex instead of an RWMutex. Everyone needs to write things.

http://codereview.appspot.com/6545051/
Reply all
Reply to author
Forward
0 new messages