Scott Yockel <
syo...@g.harvard.edu> writes:
> Slurm-Dev,
>
> Is there anything in the works to add the capacity of TotalCPU to also
> track the child process user and system time? I see that currently
> TotalCPU is defined: "provides a measure of the task’s parent process
> and does not include CPU time of child processes.”
In my experience, that description might not be accurate. It seems also
child processes are included, as long as the job doesn't time out. Here
is an email I wrote about it last year:
From: Bjørn-Helge Mevik <
b.h....@usit.uio.no>
Subject: [slurm-dev] UserCPU etc. for subprocesses not registered when a job times out.
To: slurm-dev <
slur...@schedmd.com>
Date: Fri, 12 Sep 2014 06:25:16 -0700
Reply-To: slurm-dev <
slur...@schedmd.com>
We would like to use the UserCPU, SystemCPU and TotalCPU values from
sacct to assess the efficiency of jobs. When a job exits normally,
these values are reported for the batch script step and includes the
time spent by sub-processes.
However, if the job times out, these values only includ the CPU time
spent by the batch script process itself, not its sub-processes. Se
below for an illustration.
Is this intended behaviour? If so, is there any other way to gather CPU
times from jobs, even when they don't exit normally?
Illustration:
407 (1) $ cat shell-loop.sh
#!/bin/bash
echo Starting loop
## Loop that only uses shell builtins:
while true; do echo -n ; done
echo This is the end...
408 (1) $ cat
timeout-in-subprocess.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-subprocess-%j.out
## Execute shell-loop.sh in subprocess:
./shell-loop.sh
409 (1) $ cat
timeout-in-shell.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-shell-%j.out
## Run shell-loop.sh in this shell:
source shell-loop.sh
410 (1) $ sbatch
timeout-in-subprocess.sm
Submitted batch job 40
411 (1) $ sbatch
timeout-in-shell.sm
Submitted batch job 41
[... after a couple of minutes ...]
412 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 40,41
JobID State Elapsed UserCPU SystemCPU TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
40 TIMEOUT 00:02:27 00:00.001 00:00.001 00:00.003
40.batch CANCELLED 00:02:27 00:00.001 00:00.001 00:00.003
41 TIMEOUT 00:02:24 02:23.415 00:00:00 02:23.416
41.batch CANCELLED 00:02:24 02:23.415 00:00:00 02:23.416
i.e., time spent in subprocesses is not reported.
Doing the same thing, but now with loops that terminate so the jobs
don't time out, we get:
416 (1) $ sbatch
work-in-subprocess.sm
Submitted batch job 42
417 (1) $ sbatch
work-in-shell.sm
Submitted batch job 43
[... after a couple of minutes ...]
418 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 42,43
JobID State Elapsed UserCPU SystemCPU TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
42 COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187
42.batch COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187
43 COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
43.batch COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
i.e., time spent in subprocesses is reported.
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
--
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo