Hi
I am experiencing the following issue with makeflow. I submit slurm jobs via makeflow on a Linux system. Each script that is executed exits with status 0, and its maximum resident memory is considerably lower than the limits set via makeflow and ulimit.
The problem is that makeflow sometimes reports an exit signal of 1 even though the script executed within the slurm job exits with status 0. The jobs affected by this discrepancy appear to be random.
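To make the distinction concrete, here is a minimal Python sketch of how an exit status differs from a terminating signal on Linux. It assumes that "exit_signal" in INFO.json holds a POSIX signal number, which I have not confirmed from the makeflow source; if that assumption holds, a value of 1 would correspond to SIGHUP. The helper name describe is only for illustration.

```python
import signal
import subprocess
import sys


def describe(returncode: int) -> dict:
    """Split a subprocess return code into an exit code and a signal number.

    subprocess reports a child killed by signal N as returncode == -N.
    """
    if returncode < 0:
        signum = -returncode
        return {
            "exit_code": None,
            "exit_signal": signum,
            "signal_name": signal.Signals(signum).name,
        }
    return {"exit_code": returncode, "exit_signal": 0, "signal_name": None}


# A child that exits normally with status 0.
clean = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])

# A child that restores the default SIGHUP action and then sends
# SIGHUP (signal 1) to itself, so it is terminated by the signal.
hup_code = (
    "import os, signal; "
    "signal.signal(signal.SIGHUP, signal.SIG_DFL); "
    "os.kill(os.getpid(), signal.SIGHUP)"
)
hup = subprocess.run([sys.executable, "-c", hup_code])

clean_info = describe(clean.returncode)
hup_info = describe(hup.returncode)
print(clean_info)
print(hup_info)
```

Under that assumption, a script that finishes with exit status 0 and a job that is reported with exit signal 1 would be two different events, for example the script completing and the surrounding shell or batch step later receiving a hangup.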
Are there makeflow flags that I can use to diagnose the problem further? Here is the makeflow version that I am using:
makeflow version 7.4.5 FINAL (released 2022-04-10 12:10:38 +0000)
Built by conda@b76c04cd38d1 on 2022-04-10 12:10:38 +0000
System: Linux b76c04cd38d1 5.13.0-1021-azure #24~20.04.1-Ubuntu SMP Tue Mar 29 15:34:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Configuration: --debug --prefix /home/willemm/miniconda3 --with-base-dir /home/willemm/miniconda3 --with-python3-path /home/willemm/miniconda3 --with-perl-path no --with-readline-path no --with-fuse-path no --without-system-parrot --without-system-prune --without-system-umbrella --without-system-weaver
Here is an example INFO.json file and the /usr/bin/time output of a script that was executed by makeflow/slurm.
INFO.json output:
{
"exit_signal":1,
"command":"ulimit -d 8388608 -m 8388608 -v 8388608; /bin/bash /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/wrapper_execute.bash /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False >> /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/log/02_experiment_random_column_I50/task_id_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6]_mk406.log 2>&1",
"inputs":
[
{
"size":1862,
"dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_74_key_02_experiment_random_column_I50_InferInitTask-P[0].p"
}
],
"outputs":
[
{
"size":1073741824,
"dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p"
}
],
"environment":
{
"MEMORY":"8192",
"BATCH_OPTIONS":"--job-name=406_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6] --time=6:00:00",
"CORES":"1",
"CORES":"1"
},
"category":"category_1",
"resources":
{
"memory":
[
8192,
"MB"
],
"cores":
[
1,
"cores"
]
}
}
/usr/bin/time output:
Command being timed: "python /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False"
User time (seconds): 2151.67
System time (seconds): 22.78
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 36:52.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 261624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 218
Minor (reclaiming a frame) page faults: 2959833
Voluntary context switches: 778
Involuntary context switches: 5766
Swaps: 0
File system inputs: 526352
File system outputs: 45400
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
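As a sanity check on the memory claim, the figures above can be compared directly. The sketch below only uses numbers from this post: the ulimit value from the command (bash ulimit -d/-m/-v take units of 1024 bytes), the MEMORY value from INFO.json, and the maximum resident set size reported by /usr/bin/time. The variable names are my own.

```python
# Values copied from the INFO.json command and the /usr/bin/time output.
ULIMIT_KB = 8388608          # ulimit -d/-m/-v argument, in units of 1024 bytes
MAKEFLOW_MEMORY_MB = 8192    # "MEMORY" / resources.memory in INFO.json
MAX_RSS_KB = 261624          # "Maximum resident set size (kbytes)"

limit_mb = ULIMIT_KB / 1024
rss_mb = MAX_RSS_KB / 1024
fraction = MAX_RSS_KB / ULIMIT_KB

print(f"ulimit: {limit_mb:.0f} MB, makeflow MEMORY: {MAKEFLOW_MEMORY_MB} MB")
print(f"max RSS: {rss_mb:.1f} MB ({fraction:.1%} of the limit)")
```

The ulimit and the makeflow memory setting agree, and the peak resident memory is only a few percent of that limit, so a memory limit does not look like the cause of this particular failure.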
Willem