Random jobs listed as failed even though they produce no errors - makeflow (version 7.3.5 FINAL) - PBS Pro


Willem Marais

Nov 12, 2021, 6:23:41 PM
to Cooperative Computing Tools
Hi

My colleague at NCAR and I use makeflow (version 7.3.5 FINAL) with PBS Pro as the job scheduler. We have started to see an odd anomaly where makeflow marks a job as failed, at random, even though
1. the bash script exit status is 0,
2. and the expected output file is created by the job. 

In the makeflow.failed.*/INFO.json file we have something like this:
{
  "exit_signal":1,
  "command":"/bin/bash /glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/wrapper_execute.bash  /glade/u/home/wmarais/Projects/GitHub/gcloptimus/python/gcloptimus/bin/makeflow_execute.py --tqdm-disable --log-level=DEBUG /glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3.zodb 571615861467056640 648278539409302784 /glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_648278539409302784.txt /glade/scratch/wmarais/scratch/slurm exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3 >> /glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/log/group_id_571615861467056640_group_node_id_648278539409302784.log 2>&1",
  "inputs":
    [

      {
        "size":8,
        "dag_name":"/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_900655430731727360.txt"
      },

      {
        "size":8,
        "dag_name":"/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_696473906547114752.txt"
      }
    ],
  "outputs":
    [

      {
        "size":1073741824,
        "dag_name":"/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_648278539409302784.txt"
      }
    ],
  "environment":
    {
      "MEMORY":"2048",
      "BATCH_OPTIONS":"-m n -M obli...@void.com -l walltime=6:00:00 -q regular -A NEOL0007 -N exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3_571615861467056640_648278539409302784 -l select=1:ncpus=1:mem=2048mb",
      "CORES":"1",
      "CORES":"1"
    },
  "category":"default",
  "resources":
    {
      "memory":
        [
          2048,
          "MB"
        ],
      "cores":
        [
          1,
          "cores"
        ]
    }
}

What is strange about the INFO.json file is that it indicates that the exit signal is 1, even though our bash script ends with an `exit 0`. We know from the log output that our bash script reaches the `exit 0` instruction. Furthermore, the output

/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_648278539409302784.txt

is in 

makeflow.failed.*/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_648278539409302784.txt

and the content of

mkfl_indc_571615861467056640_648278539409302784.txt

is correct. 

Could you give us guidance on how to diagnose this problem?

Willem
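
To get an overview of the failures described above, one option is to collect the `exit_signal` field from every `makeflow.failed.*/INFO.json`. This is a minimal Python sketch, assuming the failed-job directories sit under the workflow's working directory and that each INFO.json has the layout shown above. The function name `summarize_failed_jobs` is hypothetical and not part of makeflow.

```python
import glob
import json
import os


def summarize_failed_jobs(workdir="."):
    """Return one (directory, exit_signal, output_paths) tuple per failed job.

    Assumes each failed job has a makeflow.failed.*/INFO.json file with the
    "exit_signal" and "outputs" fields seen in the example above.
    """
    rows = []
    pattern = os.path.join(workdir, "makeflow.failed.*", "INFO.json")
    for info_path in sorted(glob.glob(pattern)):
        with open(info_path) as fh:
            info = json.load(fh)
        # Each output entry carries the file's path under "dag_name".
        outputs = [entry.get("dag_name") for entry in info.get("outputs", [])]
        rows.append((os.path.dirname(info_path), info.get("exit_signal"), outputs))
    return rows


if __name__ == "__main__":
    for job_dir, exit_signal, outputs in summarize_failed_jobs():
        print(job_dir, "exit_signal =", exit_signal, "outputs =", outputs)
```

If every failed job reports the same `exit_signal`, that points to a common cause rather than to individual job errors.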

Willem Marais

Nov 12, 2021, 7:10:40 PM
to cctoo...@googlegroups.com
Hi

When I use `--disable-heartbeat`, the problem described above disappears. Based on the INFO.json, it seems that makeflow thinks that the file

/glade/u/home/wmarais/clustercomputing/makeflow/demonstration/exe_graph_mpd_5_20190427T000000_20190427T235900_300s_test_v3/1_cheyenne.ucar.edu/indicator/mkfl_indc_571615861467056640_648278539409302784.txt

has a size of 1GB, even though the file size is 8 bytes. This file is created with the standard Python file routines, e.g.

with open(file_path_str, 'w') as file_obj:
    file_obj.write('Success')

Could it be that makeflow incorrectly determines the size of the output file? The file is written to the UCAR glade file space, which is based on the IBM GPFS shared file system.

Willem 


Ben Tovar

Nov 15, 2021, 8:42:59 AM
to cctoo...@googlegroups.com
Willem,

When a file does not declare its expected size, we use a default of 1GB. This only affects garbage collection modes, so I don't think it is related to the incorrect exit status of the jobs.

If the signal that terminates the process is 1, then as you discovered, it could be that the heartbeat check failed. In your makeflow console output do you see by any chance a message that looks like: "job NNNN does not appear to be running anymore."?

Ben
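
A captured copy of the makeflow console output can be searched for that message with a few lines of Python. This is only a sketch: the regular expression assumes the message wording quoted above, and the function name `find_lost_jobs` and the log file argument are hypothetical.

```python
import re
import sys

# Wording taken from the message quoted above; the job ID format is assumed
# to be a single whitespace-free token.
LOST_JOB = re.compile(r"job (\S+) does not appear to be running anymore")


def find_lost_jobs(lines):
    """Return the job IDs that the heartbeat check reported as lost."""
    job_ids = []
    for line in lines:
        match = LOST_JOB.search(line)
        if match:
            job_ids.append(match.group(1))
    return job_ids


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_lost_jobs.py <captured makeflow output>
    with open(sys.argv[1]) as fh:
        for job_id in find_lost_jobs(fh):
            print(job_id)
```

Comparing the reported job IDs against the jobs in the makeflow.failed.* directories shows whether every failure coincides with a heartbeat message.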


Willem Marais

Nov 15, 2021, 10:35:57 AM
to cctoo...@googlegroups.com
Hi Ben

I did not record the standard output of the makeflow command. I resubmitted my job while recording the standard output (> with 2>&1)  and I will be able to answer your question by tomorrow. 

Willem

Willem Marais

Nov 15, 2021, 10:57:28 AM
to cctoo...@googlegroups.com
Hi Ben

Yes, for every job that is marked as failed, there is a message in the makeflow console output that says "job NNNN does not appear to be running anymore.".

Willem

Ben Tovar

Nov 15, 2021, 11:32:02 AM
to cctoo...@googlegroups.com
Willem,

We think the problem is caused by a delay in the shared file system. When the job finishes, the heartbeat information is not yet synced to the file system, and makeflow incorrectly thinks that the job was lost. For now we think it should be safe for you to run with the --disable-heartbeat command-line option. (The heartbeat only comes into play when the batch system itself terminates the job, e.g., because of a timeout.)

Since your jobs are finishing correctly anyway, we think we can improve the heartbeat check to take this into account, but we need to do some tests. In any case, something you may need to do is run the workflows with --wait-for-files-upto=N (where N is a number of seconds, e.g. 60) to give the file system a chance to sync.

It may be worth running a test with --wait-for-files-upto=60 and without the --disable-heartbeat option. It won't make a difference, though, if the problem is that the heartbeat file is created but not updated on time, rather than not created at all.

Ben
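
The idea of waiting for a file on a lagging shared file system can be sketched in Python as follows. This only illustrates the general technique and is not makeflow's actual implementation of --wait-for-files-upto; the function name `wait_for_file` and its parameters are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout_s=60.0, poll_s=1.0):
    """Poll until `path` exists or `timeout_s` seconds have elapsed.

    Returns True if the file became visible in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Give the shared file system time to publish the file.
        time.sleep(poll_s)
```

A caller would treat a False result as a missing output only after the full timeout has passed, which tolerates short synchronization delays between nodes.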


Willem Marais

Nov 15, 2021, 12:09:23 PM
to cctoo...@googlegroups.com
Hi Ben

Thank you for the feedback. We'll do some tests during the week with --wait-for-files-upto=60.

Willem
