Potential communication error with PBS


Jessica K

Dec 17, 2015, 5:01:40 PM
to Nextflow
Hi,

I have written an RNA-Seq pipeline, and when I test it with small files (100,000 lines) it works fine. However, when I test it with real dataset files (millions of lines) it crashes every time for obscure reasons, which I suspect are problems with Nextflow communicating with PBS. It always fails at the first process, which runs Trimmomatic. This step would normally take about an hour per file, but it always throws the error below after 5-10 minutes of running normally. I say it runs normally because I can see the output files being created before the error is thrown.

I am attaching a shortened version of the script (which throws the same error) and providing the command used to run it and the error it emits below. I am running this on a PBS/Torque cluster with Nextflow version 0.15.6. I will go through everything I've tried to debug this and then explain why I think it's a communication error.

Things I've tried:
- Changing the number of jobs allowed to run at one time, and also getting rid of '-qs' altogether. I also tried specifying queueSize in the config file instead. Nothing worked.
- Following the tip given in the error message and looking through .command.run in the process work directory. I also tried "qsub .command.run" and this worked fine, so it's not a problem with the script generated by Nextflow.
- Shortening the script to run only FastQC, because I know it runs very quickly. This worked as expected (4 jobs at a time), which leads me to believe that the length of the Trimmomatic jobs has something to do with it quitting. I know it's not the Trimmomatic command itself, because it works for very small datasets, it at least starts working on the large datasets before quitting unexpectedly, and running .command.run by hand works.
- Our cluster allows for unlimited time to be used, but just in case, I tried adding the time directive to the processes and allowed 3 hours for each (time '3h'). This did not work, so it doesn't seem to be limited in this way.
- The last thing I tried was updating Nextflow to version 0.16.4 to see if this problem had already been taken care of. I still get the error.

Oddities I noticed:
- The trimmomatic output files that are created have no hour:minute timestamp
- After about 5-10 minutes, Nextflow tries to run a 5th job (when I limited queueSize to 4) even though the other 4 jobs are not close to finishing, and this is when the error is thrown.
- When nextflow throws this error and quits, it doesn't kill all the jobs, only some of them. It will kill 4, and everything else will still be listed as running, even though they aren't actually doing anything.

I believe there is some sort of communication error happening after the 4 jobs start and before they finish. I'm guessing that at a certain point Nextflow checks the status of these jobs, doesn't get the correct status back, assumes they have finished, starts a fifth job, and then crashes. Perhaps the script that only runs FastQC works as expected because its jobs finish so quickly. Then again, maybe this is wrong, because I've also tried running without limiting the queue size and it gave the same error. Do you know why I'm getting this error?

Thanks,
Jessica


Nextflow command
nextflow run trim_fastqc.nf -qs 4


Error thrown:

N E X T F L O W  ~  version 0.15.6

Launching trimming_only.nf

[warm up] executor > pbs

[b1/8b1a5a] Submitted process > trimmomatic (1)

[e5/fe2ec1] Submitted process > trimmomatic (3)

[7f/36617a] Submitted process > trimmomatic (2)

[3e/19b72a] Submitted process > trimmomatic (4)

[47/7a8775] Submitted process > trimmomatic (5)

Error executing process > 'trimmomatic (1)'


Caused by:

  Process 'trimmomatic (1)' terminated for an unknown reason -- Likely it has been terminated by the external system


Command executed:


  echo "starting"    

  java -jar /home/apps/trimmomatic/trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 4 -phred33 -trimlog     /home/groups/test_nextflow/results/logs/101NS1_R1_001.trimmomatic.log 101NS1_R1_001.fastq  /home/groups/test_nextflow/results/trimmomatic/101NS1_R1_001.qualtrim.fastq     ILLUMINACLIP:/home/apps/trimmomatic/trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:15:10 LEADING:30 TRAILING:30 MINLEN:30


Command exit status:

  -


Command output:

  starting


Command error:

  TrimmomaticSE: Started with arguments: -threads 4 -phred33 -trimlog /home/groups/test_nextflow/results/logs/101NS1_R1_001.trimmomatic.log 101NS1_R1_001.fastq /home/groups/test_nextflow/results/trimmomatic/101NS1_R1_001.qualtrim.fastq ILLUMINACLIP:/home/apps/trimmomatic/trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:15:10 LEADING:30 TRAILING:30 MINLEN:30

  Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'

  Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

  ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences


Work dir:

  /home/groups/test_nextflow/src/scripts/work/b1/8b1a5a6f4b2f8cb1ab93e12ecf90f3


Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named: '.command.sh'


WARN: Killing pending processes (4)

 



trim_fastqc.nf

Paolo Di Tommaso

Dec 18, 2015, 4:52:32 AM
to nextflow
Hi Jessica, 

Thanks a lot for this detailed report. Could you please also send me the `.nextflow.log` file that you can find in the folder where you launched the pipeline execution?

Could you also check whether a file named `.exitcode` exists in the working directory of a failed task? (Make sure to check a task for which you did *not* execute the `qsub .command.run` command.)


Cheers,
Paolo








Jessica K

Dec 18, 2015, 11:13:41 AM
to Nextflow
Hi Paolo,

I reran it just to be sure and have attached the nextflow.log. I didn't see a .exitcode file in my working directory; however, I did find them in the subdirectories of the work/ directory.

Below are the processes that were submitted, each followed by the contents of its .exitcode file:

[0b/a7ca9f] Submitted process > trimmomatic (3) 

0

[9b/870e35] Submitted process > trimmomatic (2) 

0

[0c/f9f591] Submitted process > trimmomatic (1) 

0

[35/4be79f] Submitted process > trimmomatic (4) 

0


These were also submitted before the above finished, but had no .exitcode files:

[9e/e659c4] Submitted process > trimmomatic (5)

[cf/f4b428] Submitted process > trimmomatic (6)

[4a/44c49b] Submitted process > trimmomatic (7)

[9c/91a3ac] Submitted process > trimmomatic (8)



Thanks,

Jessica

nextflow.log

Paolo Di Tommaso

Dec 18, 2015, 12:15:09 PM
to nextflow
Hi Jessica, 

Yes, the `.exitcode` file is created in each task working directory when the PBS job completes. This confirms that the jobs are correctly executed by the cluster, but for some reason Nextflow is unable to read the exit file.

This may be caused by the network file system caching mechanism. What happens is that the file is created but it is not immediately visible to the application. 

You may try to increase the time that Nextflow waits for that file before returning an error condition. You can do that by adding the exitReadTimeout setting to your `nextflow.config` file, as shown below:

executor {
    exitReadTimeout = '9 mins'
}


Hope this helps. 

Cheers,
Paolo

Jessica K

Dec 18, 2015, 3:26:19 PM
to Nextflow
Hi Paolo,

I tried this and set it to 15 minutes, but it still crashed. I also noticed that this time the exitcode was '143'. Does this mean that the cluster system is killing the jobs?

Thanks,
Jessica

Paolo Di Tommaso

Dec 18, 2015, 3:31:36 PM
to nextflow

I guess so. But are the same jobs crashing as before?

Jessica K

Dec 18, 2015, 5:00:42 PM
to Nextflow
Yes, the same jobs. I went back and looked at the exit codes for all my failed attempts and half of them were 0's and the other half were 143's.  

I also tried running this script on a different queue on our cluster and got the same error I've been getting. However, when I ran it on this queue with exitReadTimeout set to '15m', it gave me a completely different error that also doesn't make sense. It says there are missing output files, and it's because it's looking for something located in home/ rather than /home/, but as you can see in my command, there is no mention of home/ without the leading forward slash.

Error:

Caused by:

  Missing output file(s): 'home/groups/nf_test_results/trimmomatic/trimmed_fastq/K_L_I_TTAGGCAT_L00M_R1_001.qualtrim.fastq'

 expected by process: trimmomatic (3)


Command executed:

  if [ ! -e /home/groups/nf_test_results/trimmomatic/logs/ ]; then   

      mkdir -p "/home/groups/nf_test_results/trimmomatic/logs/"

  fi

  if [ ! -e /home/groups/nf_test_results/trimmomatic/trimmed_fastq/ ]; then

      mkdir -p "/home/groups/nf_test_results/trimmomatic/trimmed_fastq/"

  fi

  java -jar /home/apps/trimmomatic/trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 4 -phred33 -trimlog /home/groups/nf_test_results/trimmomatic/logs/K_L_I_TTAGGCAT_L00M_R1_001.trimmomatic.log K_L_I_TTAGGCAT_L00M_R1_001.fastq /home/groups/nf_test_results/trimmomatic/trimmed_fastq/K_L_I_TTAGGCAT_L00M_R1_001.qualtrim.fastq ILLUMINACLIP:/home/apps/trimmomatic/trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:15:10 LEADING:28 TRAILING:28 MINLEN:30


Command exit status:

  0


Command output:

  (empty)


Command error:

  TrimmomaticSE: Started with arguments: -threads 4 -phred33 -trimlog /home/groups/nf_test_results/trimmomatic/logs/K_L_I_TTAGGCAT_L00M_R1_001.trimmomatic.log K_L_I_TTAGGCAT_L00M_R1_001.fastq /home/groups/nf_test_results/trimmomatic/trimmed_fastq/K_L_I_TTAGGCAT_L00M_R1_001.qualtrim.fastq ILLUMINACLIP:/home/apps/trimmomatic/trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:15:10 LEADING:28 TRAILING:28 MINLEN:30

  Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'

  Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

  ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

  Input Reads: 22839637 Surviving: 22837640 (99.99%) Dropped: 1997 (0.01%)

  TrimmomaticSE: Completed successfully


Work dir:

  /home/groups/nextflow_test/work/35/77516a2e1cba6458f4d681fc9df637


Tip: you can replicate the issue by changing to the process work dir and entering the command: 'bash .command.run'

Paolo Di Tommaso

Dec 19, 2015, 10:57:49 AM
to nextflow
Hi Jessica, 

OK, this time Nextflow was able to read the exit file (the error message reports "exit status 0"), but it can't access the expected output file because that file was created outside the task working directory.

This is a common misunderstanding with Nextflow. Processes must not create output files outside the task working directory assigned to them by Nextflow. In practice this means you must not use absolute paths in your task to save your results.

Your script should be something like this: 
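A minimal sketch of the idea (only an illustration, not your actual script: the channel name `reads_ch` and the output channel name are assumptions, while the Trimmomatic options are copied from your command above):

process trimmomatic {

    input:
    file reads from reads_ch

    output:
    // declare the output with a relative path so the file is created in the task work dir
    file "${reads.baseName}.qualtrim.fastq" into trimmed_ch

    """
    java -jar /home/apps/trimmomatic/trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 4 -phred33 \\
        -trimlog ${reads.baseName}.trimmomatic.log \\
        ${reads} ${reads.baseName}.qualtrim.fastq \\
        ILLUMINACLIP:/home/apps/trimmomatic/trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:15:10 LEADING:30 TRAILING:30 MINLEN:30
    """
}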



If you need to copy the output files into a specific folder, use the publishDir directive.
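For instance, adding a line like this at the top of the process (just a sketch; the target path is taken from your error message above):

publishDir '/home/groups/nf_test_results/trimmomatic/trimmed_fastq', mode: 'copy'

Nextflow will then copy the declared output files into that folder when the task completes, while the task itself keeps writing into its own work directory.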


Finally, I've made a patch to try to mitigate your file system caching problem. To use it, define the following variable in your environment: NXF_VER=0.16.6-SNAPSHOT

The next time you launch Nextflow it will download that patched version.

To recap, I would suggest:
1) modify your script so that output files are created in the task's current work directory
2) update Nextflow to the snapshot I've just uploaded and try to execute your script again.


If it still doesn't work, please include the `.nextflow.log` file in your email.


Cheers,
Paolo




Jessica K

Dec 21, 2015, 12:55:46 PM
to Nextflow
Hi Paolo,

Unfortunately these fixes aren't working either. I did change my script to use the publishDir directive instead of specifying absolute paths. I also added that variable to my local environment, but perhaps it needs to be added to the global environment, because it didn't download anything? I am attaching a couple of log files. The first is from running Nextflow with the new environment variable only. The second is from running it with the new environment variable and with exitReadTimeout in the config file. For some reason, that one time was the only time I got an error other than "Process 'trimmomatic (1)' terminated for an unknown reason -- Likely it has been terminated by the external system"; including exitReadTimeout has not changed the error for me since.

Could there be something about our cluster system that isn't compatible with nextflow? How does nextflow check the status of jobs on PBS/Torque? 

Thanks,
Jessica
nextflow_varOnly.log
nextflow_var-and-exitReadTimeout.log

Paolo Di Tommaso

Dec 21, 2015, 3:23:00 PM
to nextflow
Hi Jessica, 

Nextflow both checks the job status using `qstat` and waits for the `.exitcode` file that is created in each task working directory.

I'm almost sure that the problem is due to the metadata caching in your file system and the snapshot I've uploaded should fix it. 

However, checking the logs it looks like you didn't manage to download it.

To be sure to use the version I've uploaded, launch Nextflow by prefixing the command with the NXF_VER variable as shown below:

NXF_VER=0.16.6-SNAPSHOT nextflow run <your script here> .. 


This will download the patched version (on startup it will print "version 0.16.6-SNAPSHOT").


Let me know. 


Cheers,
Paolo


Jessica K

Dec 21, 2015, 4:21:19 PM
to Nextflow
Hi Paolo,

I tried that and it did download this time, but I noticed a few errors during downloading (see download-snapshot-results.txt attachment).  I let nextflow run, but after several minutes I still got the same error as before and I've attached the log file from that as well.

Thanks,
Jessica
nextflow_VARdownloaded.log
download-snapshot-results.txt

Paolo Di Tommaso

Dec 21, 2015, 4:38:47 PM
to nextflow
It was downloaded fine (those messages are just warnings).

Could you please run this command and send me the output: 

stat -f -c %T  /home/groups/hpcbio/projects/tapping/Dec2015/scripts/nextflow_test/work



Cheers,
Paolo

Jessica K

Dec 21, 2015, 4:42:37 PM
to Nextflow
The output is:

gpfs

Paolo Di Tommaso

Dec 21, 2015, 4:46:54 PM
to nextflow
I see. Please also execute these two commands and send me the output exactly as it is printed in the terminal:

printf 'echo Hello; sleep 5' | qsub -q <your queue>

qstat 


Cheers,

p

Jessica K

Dec 21, 2015, 4:52:28 PM
to Nextflow

[krkptrc2@galaxy nextflow_test]$ printf 'echo Hello; sleep 5' | qsub -q default

1584224.biocluster.igb.illinois.edu

[krkptrc2@galaxy nextflow_test]$ qstat

Job ID                    Name             User            Time Use S Queue

------------------------- ---------------- --------------- -------- - -----

1584184.biocluster         STDIN            krkptrc2        00:00:00 R biotech        

1584224.biocluster         STDIN            krkptrc2               0 R default 

Paolo Di Tommaso

Dec 21, 2015, 5:04:28 PM
to nextflow
OK. The problem is not the file system (you are using GPFS, which is not affected by that problem in the way NFS is).

The problem is that on your system the job ID returned by `qsub` contains the full host name (.igb.illinois.edu), while in the IDs reported by `qstat` that suffix is missing. Thus Nextflow is not able to match the job status information, so it assumes that the jobs have been terminated and kills the execution.
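To illustrate the mismatch (just a sketch of the comparison problem, not Nextflow's actual code):

// ID printed by qsub vs. ID shown in the qstat listing (taken from your output above)
def submittedId = '1584224.biocluster.igb.illinois.edu'
def reportedId  = '1584224.biocluster'

// an exact comparison fails, so the job appears to have vanished
assert submittedId != reportedId

// one possible workaround: compare only the numeric part of the job ID
assert submittedId.tokenize('.').first() == reportedId.tokenize('.').first()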


I can try to make a workaround for that tomorrow. In the meantime I think you should be able to run your script (the one with the `publishDir` directive) using the snapshot you have downloaded and setting an absurdly large exitReadTimeout (10 hours or more).
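For instance, in your `nextflow.config` (the exact value is arbitrary; the point is that it should comfortably exceed your longest job):

executor {
    exitReadTimeout = '10 h'
}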



Thanks for your patience!


Cheers,
Paolo

Paolo Di Tommaso

Dec 21, 2015, 5:11:14 PM
to nextflow
Sorry, one more thing. Could you please execute these commands one more time: 

printf 'echo Hello; sleep 5' | qsub -q <your queue>

qstat -l 
 

(now qstat has the `-l` option). 


Thanks!
Paolo


Jessica K

Dec 21, 2015, 6:41:23 PM
to Nextflow

Hi Paolo,


That makes a lot more sense. Thanks for taking the time to solve this. Here are the results of what you asked for:


[krkptrc2@galaxy ~]$ printf 'echo Hello; sleep 5' | qsub -q default

1584288.biocluster.igb.illinois.edu

[krkptrc2@galaxy ~]$ qstat -l

Job ID                    Name             User            Time Use S Queue

------------------------- ---------------- --------------- -------- - -----

1584288.biocluster.igb.illino STDIN            krkptrc2               0 R default



Thank you,
Jessica

Paolo Di Tommaso

Dec 22, 2015, 10:06:25 AM
to nextflow
Hi Jessica, 


I've just uploaded a new snapshot that should solve your problem. You can download it with the following command:

CAPSULE_RESET=1 NXF_VER=0.16.6-SNAPSHOT nextflow info


It will take a few seconds to pull the required dependencies. Make sure that at the end it prints the following version number:

Version: 0.16.6-SNAPSHOT build 3428
Modified: 22-12-2015 14:55 UTC (15:55 CEST)


Then you can try to run your pipeline again using this command:

NXF_VER=0.16.6-SNAPSHOT nextflow run <your script>  ..etc.


Let me know if this solves the issue.


Cheers,
Paolo

Jessica K

Dec 22, 2015, 12:47:57 PM
to Nextflow
Hi Paolo,

It seems to be running normally now. I am about to test it on a much larger pipeline, but so far no problems. Thank you for fixing this so quickly!

Jessica 

Paolo Di Tommaso

Dec 22, 2015, 1:06:19 PM
to nextflow

Thanks to you for such valuable feedback!

Cheers, Paolo
