job system exit value and exit signal value discrepancy in makeflow

Willem Marais

Aug 26, 2022, 11:19:14 AM

to Cooperative Computing Tools

Hi

I am experiencing the following issue with makeflow. I submit Slurm jobs via makeflow on a Linux system. The scripts that are executed exit with status 0, and for each script the maximum resident memory is considerably less than the limit set via makeflow and ulimit.

The problem is that makeflow sometimes reports an exit signal of 1 even though the exit status of the script executed within the Slurm job is 0. The jobs affected by this discrepancy appear to be random.

Are there makeflow flags that I can use to further diagnose what the problem is? Here is the makeflow version that I am using:

makeflow version 7.4.5 FINAL (released 2022-04-10 12:10:38 +0000)
Built by conda@b76c04cd38d1 on 2022-04-10 12:10:38 +0000
System: Linux b76c04cd38d1 5.13.0-1021-azure #24~20.04.1-Ubuntu SMP Tue Mar 29 15:34:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Configuration: --debug --prefix /home/willemm/miniconda3 --with-base-dir /home/willemm/miniconda3 --with-python3-path /home/willemm/miniconda3 --with-perl-path no --with-readline-path no --with-fuse-path no --without-system-parrot --without-system-prune --without-system-umbrella --without-system-weaver

Here is an example INFO.json file and the /usr/bin/time output of a script that was executed by makeflow/Slurm.

INFO.json output:

{
"exit_signal":1,
"command":"ulimit -d 8388608 -m 8388608 -v 8388608; /bin/bash /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/wrapper_execute.bash /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False >> /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/log/02_experiment_random_column_I50/task_id_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6]_mk406.log 2>&1",
"inputs":
[

{
"size":1862,
"dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_74_key_02_experiment_random_column_I50_InferInitTask-P[0].p"
}
],
"outputs":
[

{
"size":1073741824,
"dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p"
}
],
"environment":
{
"MEMORY":"8192",
"BATCH_OPTIONS":"--job-name=406_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6] --time=6:00:00",
"CORES":"1",
"CORES":"1"
},
"category":"category_1",
"resources":
{
"memory":
[
8192,
"MB"
],
"cores":
[
1,
"cores"
]
}
}

/usr/bin/time output:

Command being timed: "python /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False"
User time (seconds): 2151.67
System time (seconds): 22.78
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 36:52.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 261624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 218
Minor (reclaiming a frame) page faults: 2959833
Voluntary context switches: 778
Involuntary context switches: 5766
Swaps: 0
File system inputs: 526352
File system outputs: 45400
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
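
As a quick sanity check on the numbers above, the following sketch compares the peak resident set size reported by /usr/bin/time with the ulimit value on the command line. It assumes that the bash ulimit -d/-m/-v arguments and the "kbytes" figures from /usr/bin/time are both in units of 1024 bytes; the variable names are illustrative only.

```python
# Values copied from the INFO.json command line and the /usr/bin/time output.
ULIMIT_KIB = 8388608       # argument of ulimit -d/-m/-v (assumed to be KiB)
MAX_RSS_KIB = 261624       # "Maximum resident set size (kbytes)"
MAKEFLOW_MEMORY_MB = 8192  # "MEMORY" in the makeflow environment


def kib_to_mib(kib):
    """Convert a size in KiB to MiB."""
    return kib / 1024


limit_mib = kib_to_mib(ULIMIT_KIB)
peak_mib = kib_to_mib(MAX_RSS_KIB)
fraction = MAX_RSS_KIB / ULIMIT_KIB

print(f"limit: {limit_mib:.0f} MiB, peak RSS: {peak_mib:.1f} MiB "
      f"({fraction:.1%} of the limit)")
```

Under these assumptions the ulimit value equals the 8192 MB requested through makeflow, and the job peaked at a small fraction of it, so a memory limit is an unlikely cause of the signal.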

Willem

Ben Tovar

Aug 26, 2022, 11:50:08 AM

to cctoo...@googlegroups.com

Willem,

Could you post the contents of:

/home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/wrapper_execute.bash

If I understand correctly, the "Exit status: 0" corresponds only to the python script, correct? The signal (SIGHUP) is being received by wrapper_execute.bash. The fact that it is a SIGHUP is odd, as Slurm jobs should be executed without a controlling terminal.
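
To illustrate the difference between a process that exits with status 0 and one that is terminated by a signal, here is a minimal Python sketch. It is not how makeflow records exit_signal; it only relies on the documented subprocess convention that a negative return code -N means the child was killed by signal N. The helper names classify and run_and_classify are made up for this example.

```python
import signal
import subprocess
import sys


def classify(returncode):
    """Map a subprocess return code to ("exit", code) or ("signal", name)."""
    if returncode < 0:
        # subprocess reports death by signal N as return code -N.
        return ("signal", signal.Signals(-returncode).name)
    return ("exit", returncode)


def run_and_classify(argv):
    """Run argv to completion and classify how it ended."""
    return classify(subprocess.run(argv).returncode)


# A child that exits normally with status 0.
EXITS_CLEANLY = [sys.executable, "-c", "raise SystemExit(0)"]

# A child that restores the default SIGHUP action and then signals itself.
KILLED_BY_HUP = [
    sys.executable,
    "-c",
    "import os, signal; "
    "signal.signal(signal.SIGHUP, signal.SIG_DFL); "
    "os.kill(os.getpid(), signal.SIGHUP)",
]

print(run_and_classify(EXITS_CLEANLY))
print(run_and_classify(KILLED_BY_HUP))
```

The same distinction applies to a wrapper script: an inner command can finish with exit status 0 while the enclosing shell is later terminated by a signal, and whoever waits on the shell only sees the signal.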

One thing to consider is that perhaps the Slurm command line is getting confused by the [] in the filenames (e.g. [stage_nm=STAGE1C_CV_VARY_TP]-P[6]), as brackets have special meaning to the shell.
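
As a rough illustration of why unquoted brackets can matter, the sketch below uses Python's fnmatch module, which follows shell-style glob rules, to show that the task filename read as a pattern does not match itself but does match a different name. It also shows shlex.quote producing a form that a POSIX shell treats literally. Whether this is what actually happens in the Slurm submission is an assumption, and other_name is a made-up filename.

```python
import fnmatch
import shlex

# Filename taken from the INFO.json above.
name = ("task_406_key_02_experiment_random_column_I50_"
        "SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p")

# Hypothetical file in which each bracket group is replaced by one
# character from that group, which is what a glob pattern would match.
other_name = ("task_406_key_02_experiment_random_column_I50_"
              "SeqQuadApproxTaskS-P6.p")

# Read as a glob pattern, the name does not match its own literal spelling...
matches_itself = fnmatch.fnmatchcase(name, name)
# ...but it does match the other file.
matches_other = fnmatch.fnmatchcase(other_name, name)

# Quoting makes the shell take the brackets literally.
quoted = shlex.quote(name)

print(matches_itself, matches_other)
print(quoted)
```

In a real shell an unquoted pattern that matches nothing is normally passed through unchanged, so the problem would only show up if a matching file happened to exist in the working directory.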

Ben

--
You received this message because you are subscribed to the Google Groups "Cooperative Computing Tools" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cctools-nd+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cctools-nd/0a804704-8df1-471e-8fb3-99f5ea650445n%40googlegroups.com.

Willem Marais

Aug 26, 2022, 12:02:40 PM

to cctoo...@googlegroups.com

Ben,

> If I understand correctly, the "Exit status: 0" corresponds only to the python script, correct?

Correct.

> One thing to consider is that perhaps the slurm command line is getting confused by the [] of filenames (e.g. [stage_nm=STAGE1C_CV_VARY_TP]-P[6]), as the brackets have special meaning to the shell.

The thing is that when I rerun the failed jobs, they execute successfully and do not fail again. I can try to replace the [] characters with something else.
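
One possible way to replace the [] characters when generating task names is sketched below. The function name sanitize_task_name and the choice of underscore as the replacement are assumptions for illustration, not part of the existing workflow.

```python
import re


def sanitize_task_name(name):
    """Replace shell glob metacharacters ([, ], *, ?) with underscores."""
    return re.sub(r"[\[\]*?]", "_", name)


original = "SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p"
print(sanitize_task_name(original))
```

Because distinct names can map to the same sanitized name, a scheme like this would need a uniqueness check if task names differ only in those characters.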

Attached is the wrapper execute script.

Willem

wrapper_execute.bash

Ben Tovar

Aug 26, 2022, 12:12:41 PM

to cctoo...@googlegroups.com

Got the script, thanks!

Ben Tovar

Aug 26, 2022, 12:21:33 PM

to cctoo...@googlegroups.com

Willem,

One thing that immediately jumps out at me is that you are using flock. flock does not work well across machines on a shared file system. Unless all the jobs are executing on the same compute node, flock may not always prevent concurrent write access. This produces race conditions that match your experience with random errors.

Ben

Willem Marais

Aug 26, 2022, 12:30:09 PM

to cctoo...@googlegroups.com

Ben,

The flock is used on a local file system / disk and not over any distributed / shared file system. The part of the code that uses flock untars a miniconda environment on a local scratch disk, and the flock is used to prevent multiple jobs untarring the same miniconda environment over each other.
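
For readers following along, the pattern described here can be sketched in Python roughly as follows. This is not the contents of wrapper_execute.bash, which is not shown in the thread; the function name extract_once and the lock and marker filenames are hypothetical, and the sketch assumes the target directory is on a node-local disk.

```python
import fcntl
import os


def extract_once(target_dir, extract):
    """Run extract(target_dir) at most once, guarded by an flock lock.

    Returns True if this call did the extraction, False if an earlier
    call had already completed it.
    """
    os.makedirs(target_dir, exist_ok=True)
    lock_path = os.path.join(target_dir, ".extract.lock")  # hypothetical name
    done_path = os.path.join(target_dir, ".extract.done")  # hypothetical name
    with open(lock_path, "w") as lock_file:
        # Block until no other process on this node holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(done_path):
                return False
            extract(target_dir)
            # Leave a marker so later jobs skip the extraction.
            with open(done_path, "w"):
                pass
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The marker file is checked only while the lock is held, so a job that waited on the lock sees the completed extraction instead of untarring over it.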

Willem
