Hello,
This is really a continuation of a previous thread from a month ago.
Briefly, I create a makeflow file (below) and a script to send it to a
remote machine that runs slurm for execution. The makeflow file is
executed on the remote machine using ssh. I expect makeflow to run rule
3 using slurm and, when the job completes, rule 4 should create a tarball
of the working directory. The submit script should then recover the
tarball and store it.
What actually happens is that the slurm job runs but rule 4 is never
executed, so makeflow must not be recognising completion of rule 3. I
can get it to work if rule 3 is set as 'LOCAL' but then slurm does not
respect the options set using -B "blah blah" and does not write a
slurm-*.out file with stdout from the execution of the job.
Maybe I'm doing something wrong; if so, I would appreciate having it
pointed out.
Thanks,
Roger
The makeflow file:
URL=""
SC="/home/rmason/Scratch/elkjobs"
WD="1x1x1_220513_125610"
CP=/bin/cp
MV=/bin/mv
PWD=/bin/pwd
SD=/usr/home/rmason/Research/Projects/Quartz/Elk/alpha-Quartz/Create
CD=cd $(SC)/$(WD)
CDS=cd $(SC)
$(SC)/$(WD)/runspecies.txt: $(SC)/$(WD)/runspecies.sh
	LOCAL $(CD) && ./runspecies.sh > runspecies.txt
$(SC)/$(WD)/dirs: $(SC)/$(WD)/runspecies.txt
	LOCAL $(CD) && find . -depth 1 -type d > dirs
$(SC)/$(WD)/slurm-$(WD).out: $(SC)/$(WD)/job.sh $(SC)/$(WD)/runspecies.txt
	$(CD) && ./job.sh && touch $(SC)/$(WD)/slurm-$(WD).out
$(SC)/$(WD).tgz: $(SC)/$(WD)/slurm-$(WD).out
	LOCAL $(CDS) && tar czf $(WD).tgz $(WD)
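To see how far the workflow got after a run, it may help to check on the remote machine which of the four rule targets actually exist. The following is only a sketch: the function name report_targets is hypothetical, and the default values of SC and WD are copied from the makeflow file above.

```shell
#!/bin/sh
# Hypothetical helper: report which of the four rule targets exist.
# SC and WD default to the values used in the makeflow file above.
SC="${SC:-/home/rmason/Scratch/elkjobs}"
WD="${WD:-1x1x1_220513_125610}"

report_targets() {
    # $1 = scratch directory, $2 = working directory name
    for f in "$1/$2/runspecies.txt" "$1/$2/dirs" "$1/$2/slurm-$2.out" "$1/$2.tgz"; do
        if [ -e "$f" ]; then
            echo "present: $f"
        else
            echo "missing: $f"
        fi
    done
}

report_targets "$SC" "$WD"
```

If slurm-$(WD).out is reported missing while the first two targets are present, the failure is in the slurm rule rather than in the tar rule.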
The submit script:
#!/usr/local/bin/zsh -f
echo "dummy" > 1x1x1_220513_125610/machine
tar czf 1x1x1_220513_125610.tar.gz 1x1x1_220513_125610
scp 1x1x1_220513_125610.tar.gz dummy.local:/home/rmason/Scratch/elkjobs
mv 1x1x1_220513_125610.tar.gz $HOME/Research/Projects/Quartz/Elk/alpha-Quartz/Input/Archives/
scp runslurm_1x1x1_220513_125610.mfl dummy.local:/home/rmason/Scratch/elkjobs
ssh dummy.local 'cd /home/rmason/Scratch/elkjobs/ && tar xzf 1x1x1_220513_125610.tar.gz'
scp runslurm_1x1x1_220513_125610.mfl dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610
scp /usr/home/rmason/Research/Projects/Quartz/Elk/alpha-Quartz/Create/submit_1x1x1_220513_125610.sh dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610
ssh dummy.local 'cd /home/rmason/Scratch/elkjobs && $HOME/.local/cctools/bin/makeflow -T slurm -B " --mem-per-cpu 250 -t 00:00:30 --job-name=1x1x1_220513_125610 --output=slurm-%j.out --mail-user=rma...@mun.ca --mail-type=ALL" runslurm_1x1x1_220513_125610.mfl'
scp dummy.local:/home/rmason/Scratch/elkjobs/1x1x1_220513_125610.tgz $HOME/Research/Projects/Quartz/Elk/alpha-Quartz/Output/Archives/
rm -rf 1x1x1_220513_125610
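Since the job name is repeated on almost every line of the submit script, the remote makeflow command could be built from variables instead. This is only a sketch: JOB, HOST, SCRATCH and the function remote_makeflow_cmd are hypothetical names, the values are copied from the script above, and the --mail-user and --mail-type options are left out. It prints the ssh command rather than running it.

```shell
#!/bin/sh
# Sketch only: JOB, HOST and SCRATCH are hypothetical variable names;
# their values are copied from the submit script above.
JOB="1x1x1_220513_125610"
HOST="dummy.local"
SCRATCH="/home/rmason/Scratch/elkjobs"

# Print the command string that would be passed to ssh.
# The --mail-user and --mail-type options are omitted here.
remote_makeflow_cmd() {
    # $1 = scratch directory, $2 = job name
    printf '%s' "cd $1 && \$HOME/.local/cctools/bin/makeflow -T slurm"
    printf '%s' " -B \" --mem-per-cpu 250 -t 00:00:30 --job-name=$2 --output=slurm-%j.out\""
    printf '%s\n' " runslurm_$2.mfl"
}

# Show the command rather than running it.
echo "ssh $HOST '$(remote_makeflow_cmd "$SCRATCH" "$JOB")'"
```

Printing the command first makes it easy to compare against the hand-written line before letting the script execute it.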
A transcript of submission:
./submit_1x1x1_220513_125610.sh
1x1x1_220513_125610.tar.gz 100% 5137 645.6KB/s 00:00
runslurm_1x1x1_220513_125610.mfl 100% 614 150.8KB/s 00:00
runslurm_1x1x1_220513_125610.mfl 100% 614 147.4KB/s 00:00
submit_1x1x1_220513_125610.sh 100% 1148 314.5KB/s 00:00
parsing runslurm_1x1x1_220513_125610.mfl...
local resources: 1.000 cores, 0 MB memory, 741774 MB disk
max running remote jobs: 100
max running local jobs: 1
checking runslurm_1x1x1_220513_125610.mfl for consistency...
runslurm_1x1x1_220513_125610.mfl has 4 rules.
creating new log file runslurm_1x1x1_220513_125610.mfl.makeflowlog...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./runspecies.sh > runspecies.txt
submitted job 39873
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out
submitted job 1292
submitting job: cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && find . -depth 1 -type d > dirs
submitted job 40124
2022/05/13 12:59:21.92 makeflow[39791] error: job 1292 does not appear to be running anymore.
job 1292 completed
cd "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610" && ./job.sh && touch "/home/rmason/Scratch/elkjobs"/"1x1x1_220513_125610"/slurm-"1x1x1_220513_125610".out crashed with signal 1 (Hangup)
2022/05/13 12:59:21.92 makeflow[39791] error: rule 2 failed, cannot move outputs
2022/05/13 12:59:21.92 makeflow[39791] error: hook Fail Dir:node_fail returned 1
deleted /home/rmason/Scratch/elkjobs/1x1x1_220513_125610/slurm-1x1x1_220513_125610.out