SLURM job is recognized as a failure even though it exits with 0.


Bong-Hyun Kim

Jul 5, 2016, 1:35:53 PM
to Nextflow
Hi,

I have another issue with SLURM. Job submission now works well, but my trim_galore jobs are reported as failures even when they exit normally (exit code 0).
Interestingly, this is not a problem when I run the same code with the local executor.

The SLURM job seems to have exited normally, given the exit code in the work directory:

############################################################################
#Exit code:
############################################################################
[kimb8@cn0059 3ec1a5c255448650cc468024b83543]$ cat .exitcode
0[kimb8@cn0059 3ec1a5c255448650cc468024b83543]$ head .command.env


############################################################################
#Command line:
############################################################################

[kimb8@cn0059 3ec1a5c255448650cc468024b83543]$ cat .command.sh
#!/bin/bash -ue
trim_galore polii_1d_ra.fastq



############################################################################
#Standard error:
############################################################################

[kimb8@cn0059 3ec1a5c255448650cc468024b83543]$ tail .command.err
74      435     0.6     1       276 159
75      714     0.6     1       66 648
76      45545   0.6     1       850 44695


RUN STATISTICS FOR INPUT FILE: polii_1d_ra.fastq
=============================================
41118759 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 20 bp:  109107 (0.3%)



############################################################################
#Nextflow output:
############################################################################

N E X T F L O W  ~  version 0.20.2-SNAPSHOT
Launching NGI-ChIPseq/main.nf
====================================
 ChIP-seq: v1.1
====================================
Reads        : fastq/*.fastq
Genome       : hg19
BWA Index    : /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa
MACS Config  : macssetup.config
Extend Reads : 100 bp
Current home : /home/kimb8
Current user : kimb8
Current path : /home/kimb8/projects/ccbr482_new/analysis/polii
R libraries  : /home/kimb8/R/nxtflow_libs/
Script dir   : /gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/NGI-ChIPseq
Working dir  : /gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/work
Output dir   : ./results
====================================
===================
d: DataflowQueue(queue=[DataflowVariable(value=[polii_1d_ra, nextflow.util.ArrayBag@54e22bdd]), DataflowVariable(value=[input_1d_co, nextflow.util.ArrayBag@3bd418e4]), DataflowVariable(value=[mycn_2d_co, nextflow.util.ArrayBag@544820b7]), DataflowVariable(value=[h3k4me3_1d_co, nextflow.util.ArrayBag@6b98a075]), DataflowVariable(value=[input_2d_co, nextflow.util.ArrayBag@2e61d218]), DataflowVariable(value=[mycn_2d_ra, nextflow.util.ArrayBag@3569fc08]), DataflowVariable(value=[h3k4me3_1d_ra, nextflow.util.ArrayBag@20b12f8a]), DataflowVariable(value=[polii_1d_co, nextflow.util.ArrayBag@e84a8e1]), DataflowVariable(value=[input_2d_ra, nextflow.util.ArrayBag@2e554a3b])])
===================
[input_1d_ra, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/input_1d_ra.fastq]]
[polii_1d_ra, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/polii_1d_ra.fastq]]
[input_1d_co, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/input_1d_co.fastq]]
[mycn_2d_co, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/mycn_2d_co.fastq]]
[h3k4me3_1d_co, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/h3k4me3_1d_co.fastq]]
[input_2d_co, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/input_2d_co.fastq]]
[mycn_2d_ra, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/mycn_2d_ra.fastq]]
[h3k4me3_1d_ra, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/h3k4me3_1d_ra.fastq]]
[polii_1d_co, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/polii_1d_co.fastq]]
[input_2d_ra, [/gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/fastq/input_2d_ra.fastq]]
[warm up] executor > slurm
[93/3ec1a5] Submitted process > trim_galore (polii_1d_ra)
[f5/1a6d72] Submitted process > trim_galore (input_1d_co)
[f9/d77850] Submitted process > trim_galore (h3k4me3_1d_ra)
[81/af2691] Submitted process > trim_galore (h3k4me3_1d_co)
[28/f3d440] Submitted process > trim_galore (input_1d_ra)
[95/e14d2f] Submitted process > fastqc (polii_1d_co)
[7e/84241f] Submitted process > fastqc (h3k4me3_1d_co)
[3a/6e7c9e] Submitted process > fastqc (input_2d_co)
[50/342db3] Submitted process > fastqc (mycn_2d_co)
[7f/38282c] Submitted process > trim_galore (mycn_2d_ra)
[6d/7c7953] Submitted process > fastqc (input_1d_ra)
[58/f537b7] Submitted process > trim_galore (polii_1d_co)
[71/b9a9b6] Submitted process > fastqc (polii_1d_ra)
[35/49a8ad] Submitted process > fastqc (h3k4me3_1d_ra)
[ad/e35988] Submitted process > trim_galore (input_2d_co)
[5f/da730c] Submitted process > fastqc (input_1d_co)
[ab/472207] Submitted process > trim_galore (mycn_2d_co)
[90/59cc5d] Submitted process > fastqc (mycn_2d_ra)
[a9/da0e79] Submitted process > trim_galore (input_2d_ra)
[f1/f6d736] Submitted process > fastqc (input_2d_ra)
Error executing process > 'trim_galore (polii_1d_ra)'

Caused by:
  Process 'trim_galore (polii_1d_ra)' terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  trim_galore polii_1d_ra.fastq

Command exit status:
  -

Command output:
  1.8

Command error:
  /usr/local/lmod/lmod/lmod/init/bash: line 96: PS1: unbound variable
  No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
 
  Path to Cutadapt set as: 'cutadapt' (default)
  Cutadapt seems to be working fine (tested command 'cutadapt --version')
 
 
  AUTO-DETECTING ADAPTER TYPE
  ===========================
  Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> polii_1d_ra.fastq <<)
 
  Found perfect matches for the following adapter sequences:
  Adapter type  Count   Sequence        Sequences analysed      Percentage
  Illumina      3794    AGATCGGAAGAGC   1000000 0.38
  Nextera       3       CTGTCTCTTATA    1000000 0.00
  smallRNA      0       ATGGAATTCTCG    1000000 0.00
  Using Illumina adapter for trimming (count: 3794). Second best hit was Nextera (count: 3)
 
  Writing report to 'polii_1d_ra.fastq_trimming_report.txt'
 
  SUMMARISING RUN PARAMETERS
  ==========================
  Input filename: polii_1d_ra.fastq
  Trimming mode: single-end
  Trim Galore version: 0.4.0
  Cutadapt version: 1.8
  Quality Phred score cutoff: 20
  Quality encoding type selected: ASCII+33
  Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
  Maximum trimming error rate: 0.1 (default)
  Minimum required adapter overlap (stringency): 1 bp
  Minimum required sequence length before a sequence gets removed: 20 bp
 
  Writing final adapter and quality trimmed output to polii_1d_ra_trimmed.fq
 
 
    >>> Now performing quality (cutoff 20) and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file polii_1d_ra.fastq <<<
  10000000 sequences processed
  20000000 sequences processed

Work dir:
  /gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/work/93/3ec1a5c255448650cc468024b83543

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option '-resume'

Execution cancelled -- Finishing pending tasks before exit


Paolo Di Tommaso

Jul 6, 2016, 6:10:59 AM
to nextflow
Hi, 

It seems that the job completed successfully, but if you look at the error report you will see:

Command exit status:
  -

That means that Nextflow was unable to read the exit status file within a certain amount of time. This usually happens due to latencies in the shared file system. What shared file system are you using?

You can check it using the command: 

stat -f -c %T /gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/work
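
To make the timeout mechanism concrete, here is a minimal sketch (not Nextflow's actual implementation, just an illustration of the idea): poll for a task's `.exitcode` file and give up after a fixed budget. On a high-latency shared filesystem the file can appear late, which is when Nextflow reports the exit status as `-` even though the job itself succeeded.

```shell
#!/bin/bash
# Simulate delayed filesystem visibility: the exit-code file shows up
# only after ~1 s, as it might on a slow shared mount.
workdir=$(mktemp -d)
( sleep 1; printf '0' > "$workdir/.exitcode" ) &

status='-'
for _ in 1 2 3 4 5 6; do            # ~3 s budget, polled in 0.5 s steps
    if [ -s "$workdir/.exitcode" ]; then
        status=$(cat "$workdir/.exitcode")
        break
    fi
    sleep 0.5
done
echo "exit status: $status"         # '-' means the file never appeared in time
```

If the polling budget (Nextflow's `exitReadTimeout`) is shorter than the filesystem latency, the loop exits with `-` and the task is flagged as failed despite a clean exit.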


Cheers,
Paolo

Bong-Hyun Kim

Jul 6, 2016, 6:36:52 AM
to nextflow
Hi Paolo,

It looks like GPFS.

[kimb8@biowulf ~]$     stat -f -c %T /gpfs/gsfs4/users/CCBR/user/kimb8/ccbr482_new/analysis/polii/work
gpfs

Thanks.

Bong-Hyun



Paolo Di Tommaso

Jul 6, 2016, 7:29:33 AM
to nextflow
Hi, 

I've observed this problem mostly with NFS. However, you may want to try increasing the exit timeout to around 7 minutes (the default is 4.5 minutes) by adding the following line to your `nextflow.config` file:

executor.$slurm.exitReadTimeout = '7 mins'
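
The same setting can also be written in the scoped block syntax that `nextflow.config` accepts (a sketch of the equivalent form):

```groovy
executor {
    $slurm {
        exitReadTimeout = '7 min'
    }
}
```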


If it continues to report the same issue, please include the `.nextflow.log` file in your comment.


Hope it helps. 



Cheers,
Paolo



Bong-Hyun Kim

Jul 7, 2016, 2:17:14 AM
to Nextflow
Hi Paolo,

I uploaded the log file. I tested with an extremely long wait time (1 h), but the problem still persists.

Bong-Hyun
.nextflow.log

Paolo Di Tommaso

Jul 7, 2016, 7:38:43 AM
to nextflow
Something strange is happening here; it looks like the job is in an inconsistent state. I need more debugging information. Please do the following:

- Update your nextflow snapshot using the following command: 

CAPSULE_RESET=1 NXF_VER=0.20.2-SNAPSHOT nextflow info

- Remove the line `executor.$slurm.exitReadTimeout` in the config file 

- Launch nextflow adding the trace flag as shown below: 

NXF_VER=0.20.2-SNAPSHOT nextflow -trace nextflow.executor run <your command line options..>

- When it stops send me the `.nextflow.log` file 


Thanks,
Paolo



Bong-Hyun Kim

Jul 7, 2016, 11:56:15 AM
to Nextflow
Hi Paolo,

I updated my Nextflow (a local install, instead of the 0.20.2-SNAPSHOT module installed by Wolfgang, our sysadmin) and ran the pipeline. I got the same error; the log file is attached.

Thanks!!

Bong-Hyun
.nextflow.log

Bong-Hyun Kim

Jul 7, 2016, 12:10:55 PM
to Nextflow
I just noticed that the previous 1 h exit-read-timeout test was done with the centrally installed module, which might be buggy. I am now retesting with my locally installed one and will post the results soon.

Thanks.

Bong-Hyun


Paolo Di Tommaso

Jul 7, 2016, 12:21:15 PM
to nextflow
There is no trace information in the log you sent.

Make sure to add the -trace flag, e.g.:

    NXF_VER=0.20.2-SNAPSHOT nextflow -trace nextflow.executor run ... 

Bong-Hyun Kim

Jul 7, 2016, 3:10:58 PM
to Nextflow
Here it is. It now seems to work with a large wait time (1 h).
The remaining failure is due to a job submission failure caused by the cluster's temporary limits.
.nextflow.log.gz

Paolo Di Tommaso

Jul 7, 2016, 3:31:46 PM
to nextflow
That is a different problem (on your side), but I found the issue that was causing Nextflow to fail to recognise the job completion.

The problem is that the job ID returned by your sbatch wrapper contains a trailing newline character, which prevented Nextflow from correctly matching the status for that task.
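
A hypothetical illustration of that failure mode (not Nextflow's actual code): a job ID captured from a site-local sbatch wrapper carries a trailing newline, so an exact string comparison against the ID reported by the queue status query never matches.

```shell
#!/bin/bash
raw_id=$'12345\n'     # ID as emitted by the wrapper (note the trailing newline)
queue_id='12345'      # ID as reported by the status query

# The untouched ID fails an exact comparison because of the newline.
[ "$raw_id" = "$queue_id" ] && echo "raw: match" || echo "raw: no match"

# Trimming trailing CR/LF characters -- the fix, in spirit -- restores the match.
trimmed_id="${raw_id//[$'\r\n']/}"
[ "$trimmed_id" = "$queue_id" ] && echo "trimmed: match" || echo "trimmed: no match"
```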

I've just fixed it and uploaded a new snapshot. Now it should be fine. Update your Nextflow version with:

CAPSULE_RESET=1 NXF_VER=0.20.2-SNAPSHOT nextflow info

Then run using the command 

NXF_VER=0.20.2-SNAPSHOT nextflow run ... 


Please confirm that it is ok. 


Cheers,
Paolo
