Measuring resource usage of child processes


Brian Candler

Jun 24, 2020, 4:50:41 AM
to golang-nuts
Here's a program with a couple of problems.

It runs three concurrent child processes, and measures the resource usage for each of them separately.  I'm using a dummy child, /bin/sh -c "yes >/dev/null", which I let run for a few seconds before forcibly terminating it.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// child runs the dummy command for n seconds, then reports its resource usage.
func child(n int, done chan int) {
	defer func() { done <- 0 }()
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(n)*time.Second)
	defer cancel()

	// CommandContext kills the shell (and only the shell) when ctx expires.
	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "yes >/dev/null")
	err := cmd.Run()
	if err != nil {
		fmt.Printf("%d Run(): %v\n", n, err)
	}
	if cmd.ProcessState == nil {
		fmt.Printf("%d nil ProcessState\n", n)
		return
	}
	// On Unix, SysUsage returns the *syscall.Rusage filled in by wait4.
	if rusage, ok := cmd.ProcessState.SysUsage().(*syscall.Rusage); ok {
		fmt.Printf("rusage %d: Utime=%v, Stime=%v, Maxrss=%v\n", n, rusage.Utime, rusage.Stime, rusage.Maxrss)
	} else {
		fmt.Printf("%d no rusage\n", n)
	}
}

func main() {
	done := make(chan int)
	go child(4, done)
	go child(1, done)
	go child(2, done)
	// Wait for all three children to report.
	<-done
	<-done
	<-done
	fmt.Println("Bye!")
}

Problem 1: when the context timeout expires, the shell is killed, but its descendant process ("yes") isn't.  This leaves three orphaned "yes" processes running, burning all CPU on your machine, which have to be manually found and killed.  (Aside: that's why I didn't want to post it on play.golang.org, although I expect it has strong protections against this sort of thing)

When a context timeout occurs, it's ambiguous in the documentation whether Process.Kill sends a SIGTERM or a SIGKILL (since "kill" is both the name of the syscall and the name of a signal).  Looking at the implementation, it appears to send SIGKILL, which means that there's no opportunity for the process to kill its descendants.

I'm not sure what the right solution is here, but I think it's something about sending a signal to a process group (-pid) rather than a single process, which could be done if the child runs in its own process group (setpgid? setsid?)

Problem 2: the Utime/Stime CPU usage printed is very low.  I believe it's showing me the resource usage for the parent shell, but not the child "yes" process.  I'd like to have the resource usage for the subprocess *and* its descendants.

As far as I can see, the usage comes from wait4() here: https://github.com/golang/go/blob/master/src/os/exec_unix.go#L43.  The manpage for wait4 says:

       If  rusage  is  not NULL, the struct rusage to which it points will be filled with accounting information about the child.
       See getrusage(2) for details.

However it doesn't say if it uses RUSAGE_CHILDREN or RUSAGE_SELF, which getrusage() lets you specify.  A bit of Googling turns up that some systems have a wait6 which returns both forms of usage.

Although Go lets me call Getrusage() directly, this isn't much use if there are multiple concurrent children.  And as far as I can see, Go doesn't let me fork() my own child explicitly so I could measure its descendants separately.

Right now I'm thinking I'll have to invoke a wrapper binary, e.g.

exec.CommandContext(ctx, "measure_resource", "real_program", "arg1", "arg2")

where "measure_resource" calls Getrusage(RUSAGE_CHILDREN) and writes it to stderr just before terminating, and the parent extracts this from stderr.  It could also apply its own session with setsid, and/or implement a softer timeout than the hard SIGKILL that exec.CommandContext() generates.
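A rough sketch of what such a wrapper might look like, assuming a Unix system. The name measure_resource comes from the paragraph above; the measure helper and the output format are invented for this example. RUSAGE_CHILDREN only covers descendants whose own parents waited for them, so orphaned grandchildren are still not counted, and a wrapper that is itself hit by SIGKILL gets no chance to report anything.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// measure runs the named program to completion, passing stdio through, and
// then returns the cumulative usage of this process's reaped children.
// Because the wrapper starts only one child, that total describes the child
// plus any descendants that were waited for along the way.
// (Hypothetical helper; not part of any existing tool.)
func measure(name string, args ...string) (syscall.Rusage, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	runErr := cmd.Run()

	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_CHILDREN, &ru); err != nil {
		return ru, err
	}
	return ru, runErr
}

func main() {
	if len(os.Args) < 2 {
		// Nothing to measure.
		fmt.Fprintln(os.Stderr, "usage: measure_resource program [args...]")
		return
	}
	ru, err := measure(os.Args[1], os.Args[2:]...)
	// Assumed output format: one line on stderr for the parent to parse.
	fmt.Fprintf(os.Stderr, "rusage: Utime=%v Stime=%v Maxrss=%v\n", ru.Utime, ru.Stime, ru.Maxrss)
	if err != nil {
		os.Exit(1)
	}
}
```

The parent would then need to separate this line from the real program's own stderr output, for example by giving it a distinctive prefix.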

Can anyone think of a cleaner solution to this?

Many thanks,

Brian.

Brian Candler

Jun 24, 2020, 5:53:05 AM
to golang-nuts
I have a kind-of workaround.  Firstly, I see that Go can start the child in a new session by setting SysProcAttr.Setsid.  However, on timeout I still need to kill the process group (-pid) instead of the process, which I can do by implementing the context deadline manually:

        cmd := exec.Command("/bin/sh", "-c", "exec yes >/dev/null")
        if cmd.SysProcAttr == nil {
                cmd.SysProcAttr = &syscall.SysProcAttr{}
        }
        cmd.SysProcAttr.Setsid = true

        go func() {
                select {
                case <-ctx.Done():
                        // Race by doing it here and not in os/exec; maybe some other
                        // process gets the pid in the mean time
                        if cmd.Process.Pid > 1 && (cmd.ProcessState == nil || !cmd.ProcessState.Exited()) {
                                err := syscall.Kill(-cmd.Process.Pid, os.Kill.(syscall.Signal))
                                if err != nil && err != syscall.ESRCH {
                                        fmt.Printf("Ooops, signal failed: %v\n", err)
                                }
                        }
                }
        }()

        err := cmd.Run()
        // rest of code the same

This appears to solve both problems: I get sensible resource accounting, *and* the shell and its children are killed.

1 Run(): signal: killed
rusage 1: Utime={0 271899}, Stime={0 727730}, Maxrss=1892
2 Run(): signal: killed
rusage 2: Utime={0 459969}, Stime={1 539899}, Maxrss=2004
4 Run(): signal: killed
rusage 4: Utime={0 979834}, Stime={3 19489}, Maxrss=1904
Bye!

To make this work without the race, I would like to suggest that the Process.Kill() function checks the value of SysProcAttr.Setsid, and if it's true, sends the kill signal to the process group rather than just the process.  This way, the regular exec.CommandContext would clean up properly.

(Note: I would *not* change Process.Signal(); any signal sent that way, including the Kill signal, should just go to the single process).

This is nominally a change in behaviour, although I can't see that the current behaviour is actually very useful - that is, killing the parent, but letting its children re-attach themselves to pid 1 as (in effect) unsupervised daemons.  In addition, it would only be a change in behaviour for those people who have applied Setsid=true.

However if necessary it could be made backwards-compatible by adding a new flag to SysProcAttr, e.g. "KillProcessGroup" which would make Process.Kill send to -pid instead of pid.

Thoughts?

Ian Lance Taylor

Jun 24, 2020, 11:06:37 PM
to Brian Candler, golang-nuts
My concern is that once you get into this area, you are dealing with
issues that are Unix-specific. The os package tries to be roughly
OS-independent. Perhaps it was a mistake to add os.Process.Kill and
os.Process.Signal at all, but we can't remove them now. But for
adding new mechanisms, I think it's time to use golang.org/x/sys/unix,
in this case unix.Kill with a negative number to send the signal to
the process group.

Ian

Brian Candler

Jun 25, 2020, 1:41:37 AM
to golang-nuts
My problem is actually around exec.CommandContext. I mentioned os.Process.Kill because that's the interface that exec.CommandContext uses:

> The provided context is used to kill the process (by calling os.Process.Kill) if the context becomes done before the command completes on its own.

It would be fine to change only exec.CommandContext to do the right thing: *if* the process was started with Setsid, *and* we're running on Unix, then call unix.Kill with negative pid - otherwise fall back to os.Process.Kill.  Would that be reasonable?

Alternatively, could you provide a hook for a user-defined killing function?

Ian Lance Taylor

Jun 25, 2020, 6:46:15 PM
to Brian Candler, golang-nuts
Ah, I see. I'm not sure whether this is a good idea or not.
exec.CommandContext is essentially a helper function. It doesn't do
anything you can't do yourself. Basically it starts a goroutine that
does

select {
case <-c.ctx.Done():
	c.Process.Kill()
case <-c.waitDone:
}

where c.waitDone is a channel that is closed when cmd.Wait returns.
You could do that yourself with a bit of work. So the question is
whether the chance of increased complexity and potential confusion in
CommandContext is worth changing the way that it works. There are
other problems with it too--not everyone wants their program killed
outright. But we've avoided adding a hook, as you suggest, because
it's relatively easy for someone to do whatever they want anyhow.

Ian

Brian Candler

Jun 26, 2020, 1:43:55 AM
to golang-nuts
That goroutine is launched at the end of Cmd.Start, rather than in exec.CommandContext.  That makes sense: you don't want to start the time bomb until the process has been assigned a pid.

Cmd.Run just does Cmd.Start followed by Cmd.Wait, so I can redo that logic:

        cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
        err := cmd.Start()
        if err == nil {
                waitDone := make(chan struct{})
                go func() {
                        select {
                        case <-ctx.Done():
                                unix.Kill(-cmd.Process.Pid, os.Kill.(syscall.Signal))
                        case <-waitDone:
                        }
                }()
                err = cmd.Wait()
                close(waitDone)
        }

That seems to do the trick - thank you.