Error: 'module: command not found'


Steve

Nov 9, 2018, 4:25:53 PM
to Nextflow
My pipelines keep failing on our new SLURM HPC. I suspect it's not actually Nextflow's fault, but maybe someone here has some clues about what might be going wrong.

When I run the pipeline, I get errors that look like this:


ERROR ~ Error executing process > 'fastqc_trim (Sample1)'

Caused by:
  Process `fastqc_trim (Sample1)` terminated with an error exit status (127)

Command executed:

  fastqc -o . "Sample1_R1.trim.fastq.gz"
  fastqc -o . "Sample1_R2.trim.fastq.gz"

Command exit status:
  127

Command output:
  (empty)

Command wrapper:
  USER:none SLURM_JOB_ID:335130 SLURM_JOB_NAME:nf-fastqc_trim_(Sample1) HOSTNAME:cn-0009 PWD:/gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275 NTHREADS:none
  /cm/local/apps/slurm/var/spool/job335130/slurm_script: line 77: module: command not found

Work dir:
  /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (53)
>>> Nextflow completed, stdout log file: logs/nextflow.1541792629.stdout.log


Obviously, it says that the 'module' program is not available when SLURM tries to run the script. The .command.run file looks like this:


$ cat -n /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275/.command.run
     1  #!/bin/bash
     2  #SBATCH -D /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275
     3  #SBATCH -J nf-fastqc_trim_(Sample1)
     4  #SBATCH -o /gpfs/data/molecpathlab/production/NGS580/180720_NB501073_0060_AHFMGJBGX7_test/work/25/c7424f434945630f68bce259f95275/.command.log
     5  #SBATCH --no-requeue
     6  #SBATCH -p cpu_short
     7  #SBATCH --ntasks-per-node=1 --export=NONE --export=NTHREADS
   ...
    64  # user `beforeScript`
    65  printf "USER:${USER:-none} SLURM_JOB_ID:${SLURM_JOB_ID:-none} SLURM_JOB_NAME:${SLURM_JOB_NAME:-none} HOSTNAME:${HOSTNAME:-none} PWD:$PWD NTHREADS:${NTHREADS:-none}
    66  "; TIMESTART=$(date +%s)
    67  nxf_module_load(){
    68    local mod=$1
    69    local ver=${2:-}
    70    local new_module="$mod"; [[ $ver ]] && new_module+="/$ver"
    71
    72    if [[ ! $(module list 2>&1 | grep -o "$new_module") ]]; then
    73      old_module=$(module list 2>&1 | grep -Eow "$mod\/[^\( \n]+" || true)
    74      if [[ $ver && $old_module ]]; then
    75        module switch $old_module $new_module
    76      else
    77        module load $new_module
    78      fi
    79    fi
    80  }
    81
    82  nxf_module_load singularity 2.5.2
    83


So, line 77 here is where the error is being reported. I am trying to load our HPC's Singularity module here.

Is it possible that there might be some race condition where SLURM is trying to run the .command.run script before the system's 'module' program has loaded? When I try to load modules manually, or do 'bash .command.run', there is no issue. I even logged into the node in question and checked there as well, and everything was fine. The pipeline had been working fine for weeks, but now this error has suddenly started appearing, seemingly at random.
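For reference, one way to check whether `module` is actually defined in a fresh batch shell might be something like this (just a sketch; the partition and node are simply the ones from the failing job above):

  # sketch: submit a throwaway job to the same partition/node and see whether
  # `module` is defined there (normally it is a shell function set up by the
  # Environment Modules init script)
  sbatch -p cpu_short -w cn-0009 --export=NONE --wrap '
  type module
  ls -l /etc/profile.d/modules.sh
  '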

I am using Nextflow version 0.32.0 build 4897


Paolo Di Tommaso

Nov 12, 2018, 4:47:53 AM
to next...@googlegroups.com
Since you are using Slurm, to replicate the exact behaviour you should use `sbatch .command.run` instead of `bash .command.run`. 

I suspect that module is not accessible from the job environment. A common way I have seen to handle this is to load the singularity module *before* the pipeline execution. 
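For example, something along these lines (just a sketch; the script name is a placeholder):

  # load the module in the shell that launches Nextflow, so the environment
  # it sets up is already in place before any jobs are submitted
  module load singularity/2.5.2
  nextflow run main.nf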


p


Steve

Nov 12, 2018, 12:09:14 PM
to Nextflow
Thanks for the suggestions. I will try these and see if I can't isolate the error or keep it from appearing again.

Steve

Nov 12, 2018, 1:44:56 PM
to Nextflow
As expected, I am unable to reproduce the behavior by manually running 'sbatch .command.run'; it still happens on a seemingly random basis when running through Nextflow.

Also, I tried removing `process.module = "singularity/2.5.2"` from my nextflow.config and instead just running 'module load singularity/2.5.2' before running Nextflow. However, this does not work, because I am also using `process.clusterOptions = '--export=NONE'`, among other options, since I do not want users' stray environment changes to pollute and propagate into the pipeline environment. I end up with


Command error:
  env: singularity: No such file or directory

when my pipeline tries to execute.
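
The same thing can be seen outside of Nextflow; a quick sketch of the effect of `--export=NONE`:

  # sketch: whatever `module load` adds to PATH in the submitting shell
  # is dropped when the job is submitted with a clean environment
  module load singularity/2.5.2
  command -v singularity                                # found here
  sbatch --export=NONE --wrap 'command -v singularity'  # not found inside the job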




Steve

Nov 28, 2018, 10:13:25 AM
to Nextflow
Still getting this error; however, a new one came up yesterday where 'ps' was not found:

$ cat .command.log
/cm/local/apps/environment-modules/4.0.0//init/bash: line 15: MODULES_USE_COMPAT_VERSION: unbound variable
WARNING: LD_LIBRARY_PATH_modshare exists ( /cm/shared/apps/slurm/17.11.7/lib64/slurm:1:/gpfs/share/apps/gcc/8.1.0/lib/:1:/cm/shared/apps/slurm/17.11.7/lib64:1:/gpfs/share/apps/gcc/8.1.0/lib64/:1 ), but LD_LIBRARY_PATH doesn't. Environment is corrupted.
/gpfs/data/molecpathlab/production/Demultiplexing/181116_NB501073_0080_AHCWWJAFXY/work/b9/682283280acba832c435239358585b/.command.stub: line 45: ps: command not found


I spoke more with our system admins about the issues, and they suggested that I might need to source `/etc/profile.d/modules.sh` at the start of my sbatch script, since the SLURM jobs appear to be running before the environment has finished loading, specifically before the 'module' command is available. However, I am not sure that this would fix the missing 'ps' command error.
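
For what it's worth, I think what they are suggesting would look roughly like this at the start of the job script, e.g. as a `beforeScript` (just a sketch; the init script path is the site-specific one they mentioned):

  # sketch: re-initialise Environment Modules inside the job itself,
  # rather than relying on anything inherited from the submitting shell
  if [ -f /etc/profile.d/modules.sh ]; then
      source /etc/profile.d/modules.sh
  fi
  module load singularity/2.5.2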