My pipelines keep failing on our new SLURM HPC. I suspect it's not actually Nextflow's fault, but maybe someone here has some clues as to what might be going wrong.
When I run the pipeline, I get errors that look like this:
ERROR ~ Error executing process > 'fastqc_trim (Sample1)'
Caused by:
Process `fastqc_trim (Sample1)` terminated with an error exit status (127)
Command executed:
fastqc -o . "Sample1_R1.trim.fastq.gz"
fastqc -o . "Sample1_R2.trim.fastq.gz"
Command exit status:
127
Command output:
(empty)
Command wrapper:
USER:none SLURM_JOB_ID:335130 SLURM_JOB_NAME:nf-fastqc_trim_(Sample1) HOSTNAME:cn-0009 PWD:/gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275 NTHREADS:none
/cm/local/apps/slurm/var/spool/job335130/slurm_script: line 77: module: command not found
Work dir:
/gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (53)
>>> Nextflow completed, stdout log file: logs/nextflow.1541792629.stdout.log
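For what it's worth, exit status 127 is just the shell's "command not found" status, which matches the message in the wrapper output. For example (with a made-up command name):

$ some_nonexistent_command
bash: some_nonexistent_command: command not found
$ echo $?
127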
So the 'module' command is not available when SLURM runs the script. The .command.run file for the failed task looks like this:
$ cat -n /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275/.command.run
1 #!/bin/bash
2 #SBATCH -D /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275
3 #SBATCH -J nf-fastqc_trim_(Sample1)
4 #SBATCH -o /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275/.command.log
5 #SBATCH --no-requeue
6 #SBATCH -p cpu_short
7 #SBATCH --ntasks-per-node=1 --export=NONE --export=NTHREADS
...
...
...
64 # user `beforeScript`
65 printf "USER:${USER:-none} SLURM_JOB_ID:${SLURM_JOB_ID:-none} SLURM_JOB_NAME:${SLURM_JOB_NAME:-none} HOSTNAME:${HOSTNAME:-none} PWD:$PWD NTHREADS:${NTHREADS:-none}
66 "; TIMESTART=$(date +%s)
67 nxf_module_load(){
68 local mod=$1
69 local ver=${2:-}
70 local new_module="$mod"; [[ $ver ]] && new_module+="/$ver"
71
72 if [[ ! $(module list 2>&1 | grep -o "$new_module") ]]; then
73 old_module=$(module list 2>&1 | grep -Eow "$mod\/[^\( \n]+" || true)
74 if [[ $ver && $old_module ]]; then
75 module switch $old_module $new_module
76 else
77 module load $new_module
78 fi
79 fi
80 }
81
82 nxf_module_load singularity 2.5.2
83
So line 77 here is where the error is being reported; that is where our HPC's Singularity module is supposed to be loaded.
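For context, that nxf_module_load call is what Nextflow generates when a module is requested through the process configuration. A minimal sketch of that kind of setup (the executor and queue are taken from the #SBATCH headers above; treat the rest as illustrative rather than my exact config) would be:

// nextflow.config (sketch, not the exact production config)
process {
    executor = 'slurm'            // submit tasks as SLURM jobs
    queue = 'cpu_short'           // matches '#SBATCH -p cpu_short' above
    module = 'singularity/2.5.2'  // corresponds to the 'nxf_module_load singularity 2.5.2' call above
}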
Is it possible that there is some race condition where SLURM runs the .command.run script before the system's 'module' command has been set up? When I load modules manually, or run 'bash .command.run' directly, there is no issue. I even logged into the node in question (cn-0009) and checked there as well, and everything is fine. The pipeline had been working fine for weeks; now this error has suddenly started appearing, seemingly at random.
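The next thing I can think of is comparing the environment my login shell gets with the environment a batch job actually gets. A minimal probe job along these lines (the partition name is copied from the generated script; the file names are just placeholders), submitted once with sbatch and run once with plain bash, should show whether the 'module' function is even defined in the batch environment:

#!/bin/bash
#SBATCH -p cpu_short
#SBATCH -o module_probe.%j.out
# Is the 'module' command/function defined in this shell at all?
type module || echo "module is not defined in this shell"
# Show any variables that the environment-modules / Lmod init would normally set
env | grep -E '^(MODULEPATH|MODULESHOME|LMOD|BASH_ENV)' || echo "no module-related variables set"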
For reference, I am using Nextflow version 0.32.0, build 4897.