job system exit value and exit signal value discrepancy in makeflow


Willem Marais

Aug 26, 2022, 11:19:14 AM
to Cooperative Computing Tools
Hi 

I am experiencing the following issue with makeflow. I submit Slurm jobs via makeflow on a Linux system. The scripts that are executed exit with status 0, and for each script the maximum resident memory is considerably less than the limits set via makeflow and ulimit.

The problem is that makeflow sometimes reports an exit signal of 1 even though the exit status of the script executed within the Slurm job is 0. The jobs affected by this discrepancy appear to be random.

Are there makeflow flags that I can use to further diagnose what the problem is? Here is the makeflow version that I am using:
makeflow version 7.4.5 FINAL (released 2022-04-10 12:10:38 +0000)
    Built by conda@b76c04cd38d1 on 2022-04-10 12:10:38 +0000
    System: Linux b76c04cd38d1 5.13.0-1021-azure #24~20.04.1-Ubuntu SMP Tue Mar 29 15:34:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    Configuration: --debug --prefix /home/willemm/miniconda3 --with-base-dir /home/willemm/miniconda3 --with-python3-path /home/willemm/miniconda3 --with-perl-path no --with-readline-path no --with-fuse-path no --without-system-parrot --without-system-prune --without-system-umbrella --without-system-weaver


Here is an example INFO.json file and the /usr/bin/time output for a script that was executed by makeflow/Slurm.

INFO.json output:
{
  "exit_signal":1,
  "command":"ulimit -d 8388608 -m 8388608 -v 8388608; /bin/bash /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/wrapper_execute.bash /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False >> /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/log/02_experiment_random_column_I50/task_id_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6]_mk406.log 2>&1",
  "inputs":
    [

      {
        "size":1862,
        "dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_74_key_02_experiment_random_column_I50_InferInitTask-P[0].p"
      }
    ],
  "outputs":
    [

      {
        "size":1073741824,
        "dag_name":"/home/willemm/data/gcloptimus/projects/01_cross_validation_methods/02_experiment_random_column_I50/task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p"
      }
    ],
  "environment":
    {
      "MEMORY":"8192",
      "BATCH_OPTIONS":"--job-name=406_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6] --time=6:00:00",
      "CORES":"1",
      "CORES":"1"
    },
  "category":"category_1",
  "resources":
    {
      "memory":
        [
          8192,
          "MB"
        ],
      "cores":
        [
          1,
          "cores"
        ]
    }
}


/usr/bin/time output:
        Command being timed: "python /scratch/willemm/packages/orchid/02_experiment_random_column_I50/py-cc-futures/python/py_cc_futures/bin/makeflow_execute.py --tqdm-disable --log-level=INFO /home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/data_gatekeeper/02_experiment_random_column_I50_data_gatekeeper.json task_graph/args_kwargs/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False task_graph/features/task_406_key_02_experiment_random_column_I50_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p False"
        User time (seconds): 2151.67
        System time (seconds): 22.78
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 36:52.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 261624
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 218
        Minor (reclaiming a frame) page faults: 2959833
        Voluntary context switches: 778
        Involuntary context switches: 5766
        Swaps: 0
        File system inputs: 526352
        File system outputs: 45400
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
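
To put the memory numbers above side by side, here is a small arithmetic sketch. It assumes that the ulimit values are in 1024-byte units (as bash counts them for -d, -m and -v), that the "kbytes" reported by /usr/bin/time are also 1024-byte units, and that the 8192 "MB" in INFO.json can be treated as MiB for comparison. The variable names are only illustrative.

```python
# Values copied from the post; the unit interpretations are assumptions.
max_rss_kib = 261624       # "Maximum resident set size (kbytes)" from /usr/bin/time
ulimit_kib = 8388608       # value passed to ulimit -d/-m/-v in the command
makeflow_memory_mb = 8192  # "memory" resource in INFO.json

max_rss_mib = max_rss_kib / 1024
ulimit_mib = ulimit_kib / 1024
fraction_used = max_rss_kib / ulimit_kib

print(f"peak RSS: {max_rss_mib:.1f} MiB")
print(f"ulimit:   {ulimit_mib:.0f} MiB")
print(f"ulimit matches makeflow memory: {ulimit_mib == makeflow_memory_mb}")
print(f"used:     {fraction_used:.1%} of the limit")
```

Under those assumptions the peak resident set is roughly 255 MiB against an 8192 MiB limit, about 3% of it, which supports the point that the memory limit is not being hit.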

Willem

Ben Tovar

Aug 26, 2022, 11:50:08 AM
to cctoo...@googlegroups.com
Willem,

Could you post the contents of:

/home/willemm/clustercomputing/makeflow/02_experiment_random_column_I50/wrapper_execute.bash

If I understand correctly, the "Exit status: 0" corresponds only to the python script, correct? The signal (SIGHUP, signal number 1) is being received by wrapper_execute.bash. The fact that it is a SIGHUP is odd, as Slurm jobs should be executed without a controlling terminal.
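
To illustrate how an exit status of 0 and an exit signal of 1 can both be true for different processes, here is a minimal Python sketch. It is not makeflow's code: it only assumes that "exit_signal" in INFO.json holds a signal number, and it uses two stand-in child processes, one that exits normally and one that is terminated by SIGHUP. The `describe` helper is hypothetical.

```python
import signal
import subprocess
import sys

def describe(returncode):
    """Translate a subprocess return code into (kind, value, name)."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as the negative value -N
        sig = signal.Signals(-returncode)
        return ("signal", sig.value, sig.name)
    return ("exit", returncode, None)

# Stand-in for the python script: exits normally with status 0.
inner = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])

# Stand-in for the wrapper: restores the default SIGHUP action, then
# sends SIGHUP to itself, so it is terminated by the signal.
hup_code = (
    "import os, signal;"
    "signal.signal(signal.SIGHUP, signal.SIG_DFL);"
    "os.kill(os.getpid(), signal.SIGHUP)"
)
wrapper = subprocess.run([sys.executable, "-c", hup_code])

print(describe(inner.returncode))
print(describe(wrapper.returncode))
```

The first process reports a normal exit with status 0, while the second reports termination by signal 1 (SIGHUP). A wrapper that receives the signal after its child has already finished cleanly would show the same combination.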

One thing to consider is that perhaps the Slurm command line is getting confused by the [] in the filenames (e.g. stage_nm=STAGE1C_CV_VARY_TP]-P[6]), as brackets have special meaning to the shell.
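
As a rough illustration of why brackets are special, here is a Python sketch that uses `fnmatch` to mimic shell glob matching and `shlex.quote` to produce a safely quoted argument. The shortened filename is made up for the example, and whether Slurm or makeflow actually expands these names is not established in this thread.

```python
import fnmatch
import shlex

# Shortened, hypothetical filename in the style of the task files.
name = "task_406_SeqQuadApproxTask[stage_nm=STAGE1C_CV_VARY_TP]-P[6].p"

# Read as a glob pattern, "[6]" is a character class matching the single
# character "6", so the pattern does not match its own literal text.
matches_itself = fnmatch.fnmatchcase(name, name)

# The same pattern does match a different name where "[6]" is just "6"
# and the first bracket expression is replaced by one of its characters.
other = "task_406_SeqQuadApproxTasks-P6.p"
matches_other = fnmatch.fnmatchcase(other, name)

# Quoting makes the shell treat the brackets literally.
quoted = shlex.quote(name)

print(matches_itself, matches_other)
print(quoted)
```

In bash with default options, an unquoted pattern that matches no existing file is passed through unchanged, which may be why the unquoted names work most of the time. Quoting the arguments, or avoiding brackets in filenames, removes the dependence on which files happen to exist.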


Ben


--
You received this message because you are subscribed to the Google Groups "Cooperative Computing Tools" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cctools-nd+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cctools-nd/0a804704-8df1-471e-8fb3-99f5ea650445n%40googlegroups.com.

Willem Marais

Aug 26, 2022, 12:02:40 PM
to cctoo...@googlegroups.com
Ben,

> If I understand correctly, the "Exit status: 0" corresponds only to the python script, correct?
Correct.

> One thing to consider is that perhaps the slurm command line is getting confused by the [] of filenames (e.g.stage_nm=STAGE1C_CV_VARY_TP]-P[6]), as the brackets have special meaning to the shell.
The thing is that when I rerun the failed jobs, they execute successfully and do not fail again. I can try replacing the [] characters with something else.

Attached is the wrapper execute script.

Willem


wrapper_execute.bash

Ben Tovar

Aug 26, 2022, 12:12:41 PM
to cctoo...@googlegroups.com

Ben Tovar

Aug 26, 2022, 12:21:33 PM
to cctoo...@googlegroups.com
Willem,

One thing that immediately jumps out at me is that you are using flock. flock does not work well across machines on a shared file system. Unless all the jobs are executing on the same compute node, flock may not always prevent concurrent write access. This produces race conditions that would match your experience of random errors.

Ben


Willem Marais

Aug 26, 2022, 12:30:09 PM
to cctoo...@googlegroups.com
Ben,

The flock is used on a local file system / disk and not over any distributed / shared file system. The part of the code that uses flock untars a miniconda environment on a local scratch disk, and the flock is used to prevent multiple jobs from untarring the same miniconda environment over each other.
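
For readers who have not seen the attached script, the pattern described above can be sketched in Python roughly as follows. This is not the content of wrapper_execute.bash: the function name, the `.extracted` marker file, and the paths are hypothetical, and it assumes the lock file lives on a local disk where `flock` semantics are reliable.

```python
import fcntl
import os
import tarfile

def extract_once(tarball, dest_dir, lock_path):
    """Extract tarball into dest_dir at most once.

    Returns True if this call did the extraction, False if an earlier
    call (possibly in another process) had already completed it.
    """
    marker = os.path.join(dest_dir, ".extracted")  # hypothetical completion marker
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker):
                return False
            os.makedirs(dest_dir, exist_ok=True)
            with tarfile.open(tarball) as tar:
                tar.extractall(dest_dir)
            # Written only after extraction finished, so a later job
            # never sees a half-extracted environment as complete.
            with open(marker, "w"):
                pass
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The marker file is a design choice of this sketch: the lock serializes the jobs, and the marker lets every job after the first skip the extraction instead of untarring over an environment that is already in use.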

Willem
