[slurm-dev] Job steps not listed

Prashanth Tamraparni

Nov 13, 2008, 12:14:03 PM
to slur...@lists.llnl.gov
Hello,
On SLURM-1.2.25, and also on the latest 1.2 release, SLURM-1.2.35, on
Linux (RHEL5.1 based), I see a small problem: a job allocated with the
"-J" option doesn't show any step information while the job is
running (after the job is done/cancelled, I can see accounting for all
steps).

(I verified that the same code snippet is also in 1.3.10 version)

[root@bling16 ~]# srun -A --no-shell -n1 -J tmp
SLURM_JOBID=87
[root@bling16 ~]#
[root@bling16 ~]# export SLURM_JOBID=87
[root@bling16 ~]#
[root@bling16 ~]# srun /bin/hostname
bling15
[root@bling16 ~]# srun /bin/hostname
bling15
[root@bling16 ~]#

[root@bling16 ~]# sacct -j 87
JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------
87 tmp tp 1 RUNNING 0


However, if I don't use the option '-J', sacct reports both job/step
accounting:

[root@bling16 ~]# srun -A --no-shell -n1
SLURM_JOBID=86
[root@bling16 ~]# export SLURM_JOBID=86
[root@bling16 ~]#
[root@bling16 ~]# srun /bin/hostname
bling14
[root@bling16 ~]# srun /bin/hostname
bling14
[root@bling16 ~]#

[root@bling16 ~]# sacct -j 86
JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------
86 allocation tp 1 RUNNING 0
86.0 hostname tp 1 COMPLETED 0
86.1 hostname tp 1 COMPLETED 0

I looked at the code, and common_job_start_slurmctld() in
src/plugins/jobacct/common/common_slurmctld.c does NOT set the
track_steps variable if a job name is provided.
(This is reflected in the jobacct.log file by the value "0" next to "tmp":)
87 tp 20081113151543 1226589343 - - JOB_START 1 16 0 0 tmp 0 -65553 1
bling15 (null)

if ((tmp = strlen(job_ptr->name))) {
    jname = xmalloc(++tmp);
    for (i=0; i<tmp; i++) {
        if (isspace(job_ptr->name[i]))
            jname[i]='_';
        else
            jname[i]=job_ptr->name[i];
    }
} else {
    jname = xstrdup("allocation");
    track_steps = 1;
}
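
For readers without the SLURM tree at hand, the branch above can be reproduced in isolation. The following is a minimal sketch, not SLURM code: sanitize_job_name() is a hypothetical helper name, and plain malloc() stands in for SLURM's xmalloc()/xstrdup(). It shows that track_steps is only set on the empty-name path, which matches the "0" seen in jobacct.log when "-J tmp" is given.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-alone helper (not part of SLURM). It mirrors the
 * quoted branch: whitespace in a supplied name becomes '_', and
 * *track_steps is set to 1 only when no name was supplied. */
char *sanitize_job_name(const char *name, int *track_steps)
{
    size_t len = name ? strlen(name) : 0;
    char *jname;

    *track_steps = 0;
    if (len > 0) {
        /* Named job: copy the name, replacing whitespace. */
        jname = malloc(len + 1);
        if (!jname)
            return NULL;
        for (size_t i = 0; i < len; i++)
            jname[i] = isspace((unsigned char)name[i]) ? '_' : name[i];
        jname[len] = '\0';
    } else {
        /* Unnamed job: default name, and the only place the flag is set. */
        jname = malloc(sizeof("allocation"));
        if (!jname)
            return NULL;
        memcpy(jname, "allocation", sizeof("allocation"));
        *track_steps = 1;
    }
    return jname;
}
```

Calling the helper with "tmp" leaves the flag at 0, while calling it with an empty string yields "allocation" and sets the flag to 1. The caller is responsible for freeing the returned string.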


Is there any specific reason why this has been done? Or is this a bug?

--Prashanth

Danny Auble

Nov 13, 2008, 1:02:26 PM
to slur...@lists.llnl.gov
Hi Prashanth,

I just verified this problem with 1.2. It does appear to be a bug. I have also verified the bug does not seem to exist in 1.3. All the accounting_storage plugins work as you would expect. The problem in 1.3 was fixed when we moved the code from sacct to the plugin. Here is a patch for 1.2 to make this work like 1.3.

Danny

Index: src/sacct/options.c
===================================================================
--- src/sacct/options.c (revision 15484)
+++ src/sacct/options.c (working copy)
@@ -1102,6 +1102,8 @@
job->sacct.ave_rss /= list_count(job->steps);
job->sacct.ave_vsize /= list_count(job->steps);
job->sacct.ave_pages /= list_count(job->steps);
+ if(list_count(job->steps) > 1)
+ job->track_steps = 1;
}

/* JOB_START */
@@ -1802,7 +1804,10 @@
}
print_fields(JOB, job);
}
-
+
+ if(list_count(job->steps) > 1)
+ job->track_steps = 1;
+
if (do_jobsteps && (job->track_steps || !job->show_full)) {
itr_step = list_iterator_create(job->steps);
while((step = list_next(itr_step))) {


--

Prashanth Tamraparni

Nov 14, 2008, 1:45:30 AM
to slur...@lists.llnl.gov
Hello Danny,
The fix worked, but only partially. Step information is shown only
after I launch more than one job step.

[root@bling16 ~]# srun -AI --no-shell -Jtest
SLURM_JOBID=130
[root@bling16 ~]# export SLURM_JOBID=130


[root@bling16 ~]#
[root@bling16 ~]#

[root@bling16 ~]# sacct -j 130


JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------

130 test tp 1 PENDING 0


[root@bling16 ~]#
[root@bling16 ~]# srun /bin/hostname
bling14

[root@bling16 ~]# sacct -j 130


JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------

130 test tp 1 RUNNING 0

[root@bling16 ~]# srun /bin/hostname
bling14

[root@bling16 ~]# sacct -j 130


JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------

130 test tp 1 RUNNING 0
130.0 hostname tp 1 COMPLETED 0
130.1 hostname tp 1 COMPLETED 0

I looked at the code changes and modified the if statements to
"if (list_count(job->steps) > 0)".
With this change, sacct now gives the correct output.
After the first job step was run:
[root@bling16 ~]# /tmp/sacct-nov14 -j 130


JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------

130 test tp 1 RUNNING 0
130.0 hostname tp 1 COMPLETED 0

After a second job step is run:
[root@bling16 ~]# /tmp/sacct-nov14 -j 130


JobID Jobname Partition Ncpus Status ExitCode
---------- ------------------ ---------- ------- -------------------- --------

130 test tp 1 RUNNING 0
130.0 hostname tp 1 COMPLETED 0
130.1 hostname tp 1 COMPLETED 0
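
The effect of the threshold can be checked outside sacct. The following is a rough sketch under stated assumptions: struct job_view and steps_listed() are hypothetical names, step_count stands in for list_count(job->steps), and only the display condition quoted in the patch is modelled; the rest of sacct's logic is ignored.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the sacct job record. The field
 * names follow the patch, but this is not SLURM's real structure. */
struct job_view {
    int  step_count;   /* stands in for list_count(job->steps) */
    bool track_steps;
    bool show_full;
};

/* Apply the patched rule: force track_steps once the job has more than
 * min_steps steps, then evaluate the display condition from the patch. */
bool steps_listed(struct job_view job, bool do_jobsteps, int min_steps)
{
    if (job.step_count > min_steps)
        job.track_steps = true;
    return do_jobsteps && (job.track_steps || !job.show_full);
}
```

With min_steps set to 1 (the original patch), a job with a single step and track_steps unset is not listed when show_full is set; with min_steps set to 0, the same job is listed. Note that under this model the lower threshold lists steps for any job that has at least one step.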

Does that sound ok?

--Prashanth

Danny Auble

Nov 14, 2008, 11:26:29 AM
to slur...@lists.llnl.gov

The issue with this patch is that you will get a step listed for a job run directly from srun, even though you don't want that to happen. I will look closer at this to see what else can be done, but I don't think this is the way we would want to operate.

Danny


--
