raxml-ng-mpi job not checkpointing

Nick

Sep 28, 2022, 11:43:40 PM

to raxml

I'm not sure what's going on with my raxml-ng-mpi analysis. There are no errors, the job appears to be running on the HPC, and htop on the nodes looks normal. However, the log file shows that it has been stuck at the same step for the last 24 hours, and it hasn't resumed bootstrapping from the checkpoint where it left off.

Here is my slurm script:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --partition=general
#SBATCH --time=48:00:00
#SBATCH --job-name raxml-110-75
#SBATCH --output=raxml-110_75_o%j
#SBATCH --error=raxml-110_75_e%j
#SBATCH --mail-type=BEGIN,FAIL,END

#load conda environment
module load miniconda3/4.10.3-yikv
eval "$(conda shell.bash hook)"
conda activate phyluce-env

#variable to represent working directory
src=$SLURM_SUBMIT_DIR

#get a list of the assigned nodes
scontrol show hostname > ./node_list_${SLURM_JOB_ID}

#run raxml mpi
mpiexec -np 64 -machinefile ./node_list_${SLURM_JOB_ID} raxml-ng-mpi \
--all \
--msa $src/myzo110_75per_raxml/myzo110_75per_raxml.phylip \
--model GTR+G \
--bs-trees autoMRE{100}

Here is the output from the job, which hasn't moved for the last 24 hours:

RAxML-NG v. 1.0.1 released on 19.09.2020 by The Exelixis Lab.

Developed by: Alexey M. Kozlov and Alexandros Stamatakis.

Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.

Latest version: https://github.com/amkozlov/raxml-ng

Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 32 cores, 92 GB RAM

RAxML-NG was called at 27-Sep-2022 10:13:53 as follows:

raxml-ng-mpi --all --msa /carc/scratch/projects/andersen2016005/myzo/myzo110_75per_raxml/myzo110_75per_raxml.phylip --model GTR+G --bs-trees autoMRE{100}

Analysis options:

run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)

start tree(s): random (10) + parsimony (10)

bootstrap replicates: max: 100 + bootstopping (autoMRE, cutoff: 0.030000)

random seed: 1664295234

tip-inner: OFF

pattern compression: ON

per-rate scalers: OFF

site repeats: ON

branch lengths: proportional (ML estimate, algorithm: NR-FAST)

SIMD kernels: AVX2

parallelization: coarse-grained (auto), MPI (64 ranks)

WARNING: The model you specified on the command line (GTR+G) will be ignored

since the binary MSA file already contains a model definition.

If you want to change the model, please re-run RAxML-NG

with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: /carc/scratch/projects/andersen2016005/myzo/myzo110_75per_raxml/myzo110_75per_raxml.phylip.raxml.rba

[00:05:50] Alignment comprises 113 taxa, 1 partitions and 3285067 patterns

Partition 0: noname

Model: GTR+FO+G4m

Alignment sites / patterns: 10094589 / 3285067

Gaps: 58.77 %

Invariant sites: 77.93 %

Parallelization scheme autoconfig: 1 worker(s) x 64 thread(s)

Parallel reduction/worker buffer size: 1 KB / 0 KB
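
One quick way to tell a hung run from a slow one is to check whether RAxML-NG's log and checkpoint files are still being written. The sketch below assumes the default output prefix (the alignment path, as suggested by the `.raxml.rba` file named in the log above) and the usual `.raxml.log` / `.raxml.ckp` file names. The `report_file` helper and the `PREFIX` override are hypothetical, and `stat -c` assumes GNU coreutils, as found on most Linux clusters.

```shell
#!/bin/bash
# Assumption: default prefix, i.e. output files sit next to the alignment.
prefix="${PREFIX:-myzo110_75per_raxml/myzo110_75per_raxml.phylip}"

# Hypothetical helper: print a file's age in minutes, or note that it is missing.
report_file() {
    local f="$1"
    if [ -e "$f" ]; then
        local mtime now
        mtime=$(stat -c %Y "$f")   # last modification time, seconds since epoch (GNU stat)
        now=$(date +%s)
        echo "$f: last written $(( (now - mtime) / 60 )) min ago"
    else
        echo "$f: not found"
    fi
}

report_file "${prefix}.raxml.log"
report_file "${prefix}.raxml.ckp"
```

If the log's age keeps growing while the job is still listed as running, that is at least consistent with the processes being blocked rather than making progress. A missing `.raxml.ckp` file would suggest there is no checkpoint to resume from.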

Alexey Kozlov

Sep 29, 2022, 5:41:20 PM

to ra...@googlegroups.com

Hi Nick,

Please upgrade to the latest version (1.1.0) if possible and re-run with "--log debug".

Also, are you trying to resume a terminated run from a checkpoint? I don't see that in the output you posted...

Best,
Alexey
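
For reference, the re-run suggested above might look like the following. This is only a sketch: it reuses the paths and options from the original Slurm script and adds `--log debug`. The `DRY_RUN` switch, the `cmd` array, and the fallback values for the Slurm variables are illustrative additions, not part of RAxML-NG or the original script. By default it only prints the command.

```shell
#!/bin/bash
# Assumption: same directory layout as the original job script.
src="${SLURM_SUBMIT_DIR:-$PWD}"
msa="$src/myzo110_75per_raxml/myzo110_75per_raxml.phylip"
nodefile="./node_list_${SLURM_JOB_ID:-manual}"

# Build the command as an array so every argument stays a single word.
cmd=(mpiexec -np 64 -machinefile "$nodefile"
     raxml-ng-mpi
     --all
     --msa "$msa"
     --model GTR+G
     --bs-trees 'autoMRE{100}'
     --log debug)

# DRY_RUN is a hypothetical switch: by default only print the command.
if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%q ' "${cmd[@]}"; echo
else
    "${cmd[@]}"
fi
```

Setting `DRY_RUN=0` inside the job script would actually launch the run; printing the command first makes it easy to confirm that `--log debug` is passed to `raxml-ng-mpi` rather than to `mpiexec`.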

