makeflow does not recognise rule completion

Roger Mason

May 13, 2022, 11:51:49 AM
to cctoo...@googlegroups.com
Hello,

This is really a continuation of a previous thread from a month ago.
Briefly, I create a makeflow file (below) and a script that sends it to
a remote machine running slurm. The makeflow file is executed on the
remote machine using ssh. I expect makeflow to run rule 3 using slurm
and, when the job completes, rule 4 to create a tarball of the working
directory. The submit script should then retrieve the tarball and store
it.

What actually happens is that the slurm job runs but rule 4 is never
executed, so makeflow must not be recognising completion of rule 3. I
can get it to work if rule 3 is set as 'LOCAL' but then slurm does not
respect the options set using -B "blah blah" and does not write a
slurm-*.out file with stdout from the execution of the job.

Maybe I'm doing something wrong; if so, I would appreciate it being
pointed out.

Thanks,
Roger

The makeflow file:

URL=""
SC="/home/rmason/Scratch/elkjobs"
WD="1x1x1_220513_125610"
CP=/bin/cp
MV=/bin/mv
PWD=/bin/pwd
SD=/usr/home/rmason/Research/Projects/Quartz/Elk/alpha-Quartz/Create
CD=cd $(SC)/$(WD)
CDS=cd $(SC)

$(SC)/$(WD)/runspecies.txt: $(SC)/$(WD)/runspecies.sh
	LOCAL $(CD) && ./runspecies.sh > runspecies.txt

$(SC)/$(WD)/dirs: $(SC)/$(WD)/runspecies.txt
	LOCAL $(CD) && find . -depth 1 -type d > dirs

$(SC)/$(WD)/slurm-$(WD).out: $(SC)/$(WD)/job.sh $(SC)/$(WD)/runspecies.txt
	$(CD) && ./job.sh && touch $(SC)/$(WD)/slurm-$(WD).out

$(SC)/$(WD).tgz: $(SC)/$(WD)/slurm-$(WD).out
	LOCAL $(CDS) && tar czf $(WD).tgz $(WD)

The submit script:
#!/usr/local/bin/zsh -f

echo "dummy" > 1x1x1_220513_125610/machine
tar czf 1x1x1_220513_125610.tar.gz 1x1x1_220513_125610
scp 1x1x1_220513_125610.tar.gz dummy.local:/home/rmason/Scratch/elkjobs
mv 1x1x1_220513_125610.tar.gz $HOME/Research/Projects/Quartz/Elk/alpha-Quartz/Input/Archives/
scp runslurm_1x1x1_220513_125610.mfl dummy.local:/home/rmason/Scratch/elkjobs
ssh dummy.local 'cd /home/rmason/Scratch/elkjobs/ && tar xzf 1x1x1_220513_125610.tar.gz'
scp runslurm_1x1x1_220513_125610.mfl dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610
scp /usr/home/rmason/Research/Projects/Quartz/Elk/alpha-Quartz/Create/submit_1x1x1_220513_125610.sh dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610
ssh dummy.local 'cd /home/rmason/Scratch/elkjobs && $HOME/.local/cctools/bin/makeflow -T slurm -B " --mem-per-cpu 250 -t 00:00:30 --job-name=1x1x1_220513_125610 --output=slurm-%j.out --mail-user=rma...@mun.ca --mail-type=ALL" runslurm_1x1x1_220513_125610.mfl'

scp dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610.tgz $HOME/Research/Projects/Quartz/Elk/alpha-Quartz/Output/Archives/
rm -rf 1x1x1_220513_125610

A transcript of submission:
./submit_1x1x1_220513_125610.sh
1x1x1_220513_125610.tar.gz 100% 5137 645.6KB/s 00:00
runslurm_1x1x1_220513_125610.mfl 100% 614 150.8KB/s 00:00
runslurm_1x1x1_220513_125610.mfl 100% 614 147.4KB/s 00:00
submit_1x1x1_220513_125610.sh 100% 1148 314.5KB/s 00:00
parsing runslurm_1x1x1_220513_125610.mfl...
local resources: 1.000 cores, 0 MB memory, 741774 MB disk
max running remote jobs: 100
max running local jobs: 1
checking runslurm_1x1x1_220513_125610.mfl for consistency...
runslurm_1x1x1_220513_125610.mfl has 4 rules.
creating new log file runslurm_1x1x1_220513_125610.mfl.makeflowlog...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./runspecies.sh > runspecies.txt
submitted job 39873
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out
submitted job 1292
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && find . -depth 1 -type d > dirs
submitted job 40124
2022/05/13 12:59:21.92 makeflow[39791] error: job 1292 does not appear to be running anymore.
job 1292 completed
cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out crashed with signal 1 (Hangup)
2022/05/13 12:59:21.92 makeflow[39791] error: rule 2 failed, cannot move outputs
2022/05/13 12:59:21.92 makeflow[39791] error: hook Fail Dir:node_fail returned 1
deleted /home/rmason/Scratch/elkjobs/1x1x1_220513_125610/slurm-1x1x1_220513_125610.out

Ben Tovar

May 13, 2022, 2:39:51 PM
to cctoo...@googlegroups.com
Roger,

It does seem that the slurm job is running, and completing with an error:


2022/05/13 12:59:21.92 makeflow[39791] error: job 1292 does not appear to be running anymore.
job 1292 completed
cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out crashed with signal 1 (Hangup)

I wonder if there is something inside the script job.sh that is generating that hangup signal...


Ben

--
You received this message because you are subscribed to the Google Groups "Cooperative Computing Tools" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cctools-nd+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cctools-nd/y65y1z5ctwz.fsf%40mun.ca.

Roger Mason

May 14, 2022, 6:08:16 AM
to cctoo...@googlegroups.com
Hello Ben,

Ben Tovar <bto...@nd.edu> writes:

> Roger,
>
> It does seem that the slurm job is running, and completing with an error:
>
> 2022/05/13 12:59:21.92 makeflow[39791] error: job 1292 does not appear to be running anymore.
> job 1292 completed
> cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out
> crashed with signal 1 (Hangup)
>
> I wonder if there is something inside the script job.sh that is generating that hangup signal...
>

This is job.sh:

DIR=$(sed -n "${SLURM_ARRAY_TASK_ID}p" job_list);

cd $DIR;

srun --mpi=pmi2 elk.sh

exit 0

I tailed slurm.status while the job was running:

tail -f slurm.status.1294
start 1652521925
alive 1652521955
alive 1652521985
alive 1652522015
alive 1652522046
alive 1652522076
alive 1652522106
alive 1652522136
alive 1652522166
alive 1652522196

sacct suggests that the job ran:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1293         220514_07+     macpro                     4  COMPLETED      0:0
1293.batch        batch                                4  COMPLETED      0:0
1293.0           elk.sh                                1  COMPLETED      0:0
1294         220514_07+     macpro                     4  COMPLETED      0:0
1294.batch        batch                                4  COMPLETED      0:0
1294.0           elk.sh                                1  COMPLETED      0:0
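
To check a single job rather than read the whole table, the state can be pulled out of sacct's pipe-delimited output. This is only a sketch: the helper name job_state is made up for illustration, and it assumes sacct is run with --parsable2 and --noheader and with the JobID,State,ExitCode format.

```shell
#!/bin/sh
# job_state (hypothetical helper): read pipe-delimited sacct output on stdin
# and print the State field of the line whose JobID matches $1 exactly.
# Step lines such as "1294.batch" are skipped because the match is exact.
job_state() {
    awk -F'|' -v id="$1" '$1 == id { print $2; exit }'
}

# Intended use on the cluster (not executed here):
#   sacct -j 1294 --parsable2 --noheader --format=JobID,State,ExitCode | job_state 1294
```

An empty result would mean sacct returned no allocation line for that job ID.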

The final rule of the workflow did not run.

Thanks for your help,
Roger

Ben Tovar

May 16, 2022, 9:23:05 AM
to cctoo...@googlegroups.com
Roger,

I believe you do not need the srun call inside job.sh. The problem is that srun is meant to be an interactive command, which would explain the hangup signal: you would normally only get a hangup signal from a command attached to a terminal.
If you need to pass the "--mpi=pmi2" option, I think you can try passing it through the batch options in makeflow, as:

makeflow -Tslurm --batch-options "--mpi=pmi2" ...rest of your options...
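
For illustration, job.sh without the srun call might look roughly like the sketch below. It is untested against this workflow and rests on assumptions not confirmed in the thread: that elk.sh is executable inside each job directory (ELK_CMD is a made-up override), that job_list holds one directory per line (JOB_LIST is a made-up override), and that defaulting to line 1 is acceptable when SLURM_ARRAY_TASK_ID is unset.

```shell
#!/bin/sh
# Sketch of job.sh with elk.sh started directly instead of through srun.
# pick_dir, run_task, JOB_LIST and ELK_CMD are illustrative names only.

# Print line number $1 of the list file $2.
pick_dir() {
    sed -n "${1}p" "$2"
}

# Look up the directory for this task, enter it, and run elk.sh there.
run_task() {
    list=${JOB_LIST:-job_list}
    # Assumption: fall back to the first entry when not run as an array task.
    dir=$(pick_dir "${SLURM_ARRAY_TASK_ID:-1}" "$list")
    if [ -z "$dir" ]; then
        echo "no directory found in $list" >&2
        return 1
    fi
    cd "$dir" || return 1
    # Assumption: elk.sh lives in the job directory and is executable.
    "${ELK_CMD:-./elk.sh}"
}
```

The script would end with a call to run_task, so that its exit status becomes the exit status of the job that makeflow sees.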



Ben


