error generating binary file in exabayes?


nepenthesbaphomet

Feb 6, 2026, 8:53:19 AM
to ExaBayes
I have successfully used ExaBayes on a separate dataset. I had started ExaBayes on this most recent dataset, but it stopped with no error log information (5e6 generations requested, but it stalled at about 2.6 million; the ASDSF was ca. 7.8%). I stupidly deleted the original directory after several failed attempts to restart (e.g., -r myrun and -n myrun_restart), so I don't have those results. Ever since then I've had the same problem and can no longer run ExaBayes. Originally it took a couple of minutes to parse the phylip file, but now it just "stalls."

There is no indication of an issue in the Slurm output of the reattempted runs. I am just trying to perform a preliminary concatenated analysis: 500 taxa, 2400 short loci, and a 320501 bp matrix.

From what I can see in the Google help group, this hasn't been reported before, but I may not be searching with the right terms. Here is what I tried:

Because I suspected an issue with MPI, I tried separate runs with 2400 (with -R 4), 1800 (-R 3), and 900 (-R 3) threads. These all "stalled", so I ended up killing the jobs. I know I'm not hitting the CPU cap, because there are ca. 5700 CPUs available and the cluster is not very active right now.

I tried recompiling exabayes, and the issue persists.

With the recompiled ExaBayes I tried using 9 tasks; the job gets past parsing the phylip file but again dies with no error information.

...
Trying to parse it...

The (binary) alignment file you provided, consists of the following
partitions:

number: 0
name: unnamedPartition
#patterns: 260495
type: DNA

Parameter names will reference the above partition numbers.
================================================================

Parameters to be integrated (initial values derived from prior):
0 topo{0}
sub-id: 0
prior: Uniform(0.00,0.00)
init value: parsimony
1 brlens{0}
sub-id: 0
prior: Exponential(10.00)
init value: 0.10
2 statefreq{0}
sub-id: 0
prior: Dirichlet(1.00,1.00,1.00,1.00)
init value: 0.25,0.25,0.25,0.25
3 revmat{0}
sub-id: 0
prior: Dirichlet(1.00,1.00,1.00,1.00,1.00,1.00)
init value: 0.17,0.17,0.17,0.17,0.17,0.17
4 ratehet{0}
sub-id: 0
prior: Uniform(0.00,200.00)
init value: 100.00
================================================================

Will employ the following proposal mixture (frequency,id,type,affected variables):
15.38% 0 stNNI( topo{0};brlens{0} )
15.38% 1 eSPR( topo{0};brlens{0};stopProb=0.50 )
15.38% 2 parsSpr( topo{0};brlens{0};radius=12,warp=0.10 )
5.13% 3 likeSPR( topo{0};brlens{0};radius=9,warp=1.00 )
17.95% 4 blMult( brlens{0} )
2.56% 5 TL-Mult( brlens{0} )
17.95% 6 blDistGamma( brlens{0} )
1.28% 7 freqSlider( statefreq{0};brlens{0} )
1.28% 8 freqDirich( statefreq{0};brlens{0} )
2.56% 9 revMatSlider( revmat{0};brlens{0} )
2.56% 10 revMatDirich( revmat{0};brlens{0} )
2.56% 11 rateHetMulti( ratehet{0} )

Will execute 3 runs in parallel.

initialized diagnostics file test_results/ExaBayes_diagnostics.test_results
initialized file test_results/ExaBayes_topologies.run-0.test_results
initialized file test_results/ExaBayes_parameters.run-0.test_results
initialized file test_results/ExaBayes_topologies.run-1.test_results
initialized file test_results/ExaBayes_parameters.run-1.test_results
initialized file test_results/ExaBayes_topologies.run-2.test_results
initialized file test_results/ExaBayes_parameters.run-2.test_results


The ExaBayes diagnostics file contains:
$ cat test_results/ExaBayes_diagnostics.test_results
GEN asdsf stNNI( topo{0};brlens{0} )$run0 eSPR( topo{0};brlens{0};stopProb=0.5 )$run0 parsSpr( topo{0};brlens{0};radius=12,warp=0.1 )$run0 likeSPR( topo{0};brlens{0};radius=9,warp=1 )$run0 blMult( brlens{0} )$run0 TL-Mult( brlens{0} )$run0 blDistGamma( brlens{0} )$run0 freqSlider( statefreq{0};brlens{0} )$run0 freqDirich( statefreq{0};brlens{0} )$run0 revMatSlider( revmat{0};brlens{0} )$run0 revMatDirich( revmat{0};brlens{0} )$run0 rateHetMulti( ratehet{0} )$run0 stNNI( topo{0};brlens{0} )$run1 eSPR( topo{0};brlens{0};stopProb=0.5 )$run1 parsSpr( topo{0};brlens{0};radius=12,warp=0.1 )$run1 likeSPR( topo{0};brlens{0};radius=9,warp=1 )$run1 blMult( brlens{0} )$run1 TL-Mult( brlens{0} )$run1 blDistGamma( brlens{0} )$run1 freqSlider( statefreq{0};brlens{0} )$run1 freqDirich( statefreq{0};brlens{0} )$run1revMatSlider( revmat{0};brlens{0} )$run1 revMatDirich( revmat{0};brlens{0} )$run1 rateHetMulti( ratehet{0} )$run1 stNNI( topo{0};brlens{0} )$run2 eSPR( topo{0};brlens{0};stopProb=0.5 )$run2 parsSpr( topo{0};brlens{0};radius=12,warp=0.1 )$run2 likeSPR( topo{0};brlens{0};radius=9,warp=1 )$run2 blMult( brlens{0} )$run2 TL-Mult( brlens{0} )$run2 blDistGamma( brlens{0} )$run2 freqSlider( statefreq{0};brlens{0} )$run2 freqDirich( statefreq{0};brlens{0} )$run2 revMatSlider( revmat{0};brlens{0} )$run2 revMatDirich( revmat{0};brlens{0} )$run2 rateHetMulti( ratehet{0} )$run2


My Slurm script is pretty standard and is designed to run a concatenated and a partitioned analysis at the same time; here is the latest one I ran:

#!/bin/bash
#SBATCH --time 48:00:00 # in hours
#SBATCH --partition public-cpu
#SBATCH --ntasks 9
#SBATCH --mem 10000 # in MB
#SBATCH --job-name [REDACTED]_exabayes
#SBATCH --array 1
#1-4 # run N jobs

cd /home/users/c/cardenac/phylo/[REDACTED]/exabayes

JOB_LIST=${PWD}/job.list
ALIGN=$(cat ${JOB_LIST} | awk -v var=${SLURM_ARRAY_TASK_ID} 'NR==var {print $0}' | cut -d "," -f 1)
PARTI=$(cat ${JOB_LIST} | awk -v var=${SLURM_ARRAY_TASK_ID} 'NR==var {print $0}' | cut -d "," -f 2)
PREFIX=$(echo ${PARTI} | cut -d "." -f 1)

#mkdir ${PREFIX}_results
mkdir test_results
SEED=6669420

# call environment
module load  GCC/7.3.0-2.30  OpenMPI/3.1.1

# run exabayes
mpirun -np ${SLURM_NTASKS} /home/users/c/cardenac/exabayes-1.5.1/exabayes \
-f data/${ALIGN} \
-m DNA \
-n test_results \
-s ${SEED} \
-c config.nex \
-w test_results \
-R 3


I checked the resource utilization (via the seff Slurm command) for a 900-thread run, which sat for about an hour trying to parse the phylip file:

Job ID: 3420051
Array Job ID: 3420051_1
Cluster: bamboo
User/Group: cardenac/hpc_users
State: CANCELLED (exit code 0)
Nodes: 8
Cores per node: 112
CPU Utilized: 4-13:54:53
CPU Efficiency: 13.69% of 33-10:45:00 core-walltime
Job Wall-clock time: 00:53:31
Memory Utilized: 13.57 GB
Memory Efficiency: 5.43% of 250.00 GB (31.25 GB/node)


I would be happy to share the files with you, but I'd prefer to send them over a private e-mail rather than a public forum.

Anyway, here is the relevant software information:

software:
GCC/7.3.0-2.30
OpenMPI/3.1.1
exabayes v 1.5.1

I'm not sure if it's relevant, but I can also provide the hardware details. I believe the hardware should be fine, given that I've successfully run ExaBayes on it before.

Thanks for your time!

Alexandros Stamatakis

Feb 9, 2026, 5:30:58 AM
to exab...@googlegroups.com
This all sounds pretty weird and unexpected.

Has anything changed in the cluster configuration?

Has the code been re-compiled since the last time it executed successfully?

Could it be that you are over-subscribing the physical cores, i.e., starting more threads than there are physical cores? My guess is that this is likely to be the case.

Can you parse the input file with the sequential ExaBayes version?

The sequential version should also generate a binary file that you could subsequently use to initiate a parallel run.
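Something along these lines might work (just a sketch re-using the paths and flags from your script, and assuming yggdrasil sits next to the exabayes binary; please check the name of the binary alignment file that actually appears in the working directory, since I am writing it from memory):

# step 1: run the sequential version once so that it parses the phylip file
# and writes the binary alignment; you can probably cancel it once the
# binary file shows up in the output directory
/home/users/c/cardenac/exabayes-1.5.1/yggdrasil \
-f data/${ALIGN} -m DNA -s ${SEED} -c config.nex \
-n parse_test -w parse_results

# step 2: point the parallel version at the binary file instead of the
# phylip file (the file name below is my guess; use whatever step 1 wrote)
mpirun -np ${SLURM_NTASKS} /home/users/c/cardenac/exabayes-1.5.1/exabayes \
-f parse_results/ExaBayes_binaryAlignment.parse_test \
-s ${SEED} -c config.nex -n myrun -w run_results -R 3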

Those are the first ideas that come to my mind.

Alexis

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)


nepenthesbaphomet

Feb 10, 2026, 11:20:29 PM
to ExaBayes
As far as I know there was an update to the Slurm scheduler, but as I indicated, ExaBayes was running fine after this update. It does look like there is an issue with interactive jobs, but I've only run ExaBayes by submitting scripts, so I'm not sure that it's applicable: "Since the latest slurm update, some interactive jobs (using srun or salloc) are killed prematurely. For the user it appears as if the job reached its timelimit, but in the admin logs it is indicated that the job was killed due to inactivity timeout. We opened a case at schedMD." I can reach out to the system administrators about this if you think it might help diagnose the issue I've seen.

The ExaBayes code was first compiled in December 2025 and ran successfully. I used this same compiled version when ExaBayes randomly crashed in February. After several failed attempts to get the software working again, I recompiled and had the same issue.

I also thought that I might be over-requesting cores, but as I noted, I adjusted the number of threads quite a bit to be sure. The successful December 2025 run used 800 threads just fine. The original run that stopped without an error message in February 2026 only had 1200 threads. I increased the number of threads to see if that would help, while also reducing it to something closer to what had worked previously. However, I believe even the 2100-thread job is well within the 5700 total CPUs available across the cluster. Either way, even with the 9-thread example, the job parses the phylip file fine and then dies unexpectedly (as I described in my first message).

I ran yggdrasil and the job parsed the phylip file fine, but it ran into an OOM event when starting the MCMC sampling. However, there is a warning in the log file:

Problem with pinning! Probably the number of processes and threads started on this machine exceeds the number of available cores. Thread pinning is disabled. In the worst case, ExaBayes will run substantially slower (use a tool like htop to monitor whether all cores are loaded).

My Slurm script contradicts this, as I requested 15 threads for this test. Here is the script:


#!/bin/bash
#SBATCH --time 48:00:00 # in hours
#SBATCH --partition public-cpu
#SBATCH --ntasks 15
#SBATCH --mem 64000 # in MB
#SBATCH --job-name yggdrasil
#SBATCH --array 1
#1-4 # run N jobs

# fight variables
cd /home/users/c/cardenac/phylo/[REDACTED]/exabayes

JOB_LIST=${PWD}/job.list
ALIGN=$(cat ${JOB_LIST} | awk -v var=${SLURM_ARRAY_TASK_ID} 'NR==var {print $0}' | cut -d "," -f 1)
PARTI=$(cat ${JOB_LIST} | awk -v var=${SLURM_ARRAY_TASK_ID} 'NR==var {print $0}' | cut -d "," -f 2)
PREFIX=$(echo ${PARTI} | cut -d "." -f 1)
#mkdir ${PREFIX}_results
mkdir test_yggdrasil_results

SEED=6669420

# call environment
module load  GCC/7.3.0-2.30  OpenMPI/3.1.1

# run exabayes
#mpirun -np ${SLURM_NTASKS} /home/users/c/cardenac/exabayes-1.5.1/exabayes \
/home/users/c/cardenac/exabayes-1.5.1/yggdrasil \
-f data/${ALIGN} \
-m DNA \
-n test_results \
-s ${SEED} \
-c config.nex \
-w test_yggdrasil_results \
-R 3 \
-T 15
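(One thing I am not sure about: --ntasks 15 asks Slurm for 15 tasks, which it may spread over several nodes, whereas yggdrasil is a single multi-threaded process that runs on whichever node the batch script lands on. If that node ended up with fewer than 15 of the allocated cores, that might explain the pinning warning. Purely as a guess, a variant I could try would be to request the cores for a single task instead, e.g.

#SBATCH --ntasks 1
#SBATCH --cpus-per-task 15

and keep -T 15 in the yggdrasil call, but I have not tested this yet.)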



I would like to confirm how to use the generated binary file in ExaBayes: do I just try to "restart" the run using exabayes? If this is the case, why couldn't I use the binary file created by the successful 9-thread MPI run?

I really appreciate your assistance,
Cody

Alexandros Stamatakis

Feb 16, 2026, 2:30:40 AM
to exab...@googlegroups.com
Dear Cody,

I see. The OOM message you get is probably because your alignment is very large and there is insufficient memory to allocate the conditional likelihood vectors we need to start computing likelihoods, which happens AFTER the alignment file has been parsed.
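As a very rough back-of-the-envelope calculation (my own assumptions: double precision, four discrete rate categories, one conditional vector per inner node): 260,495 patterns x 4 states x 4 rate categories x 8 bytes is about 33 MB per inner node; a 500-taxon tree has roughly 500 inner nodes, so that is already on the order of 16 GB per chain, and with three runs on the same node you would exceed a 64 GB memory request even before any other allocations.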

You should be able to use the binary file created by yggdrasil to
restart. Regarding the yggdrasil run, how many physical cores does the
machine have on which you started it?

The main problem we have with ExaBayes is that the PhD student who implemented it left for industry over 10 years ago now, and unfortunately we can only provide very limited support.

Alexis

nepenthesbaphomet

Feb 16, 2026, 6:17:19 AM
to ExaBayes
Alexis,

I understand that the OOM was due to a lack of memory; I only provided enough resources for it to run and generate the binary file, as you suggested.
I believe most nodes have 128 cores. However, I've not used yggdrasil for the full analysis because I'm not sure that 128 cores would be sufficient within the available wall time (4 days).

What should I do about that last "pinning" warning message? It shouldn't have anything to do with an OOM error. I provided 15 threads in the Slurm request and asked for 15 threads in yggdrasil. Could this point to a scheduling or hardware issue on the HPC?

I understand that the developer left for industry; for that reason I thought it might be good to document and try to troubleshoot these kinds of errors. I'll contact the HPC admins to see whether they have a sense of what might be causing this error on the hardware side.

Thanks, 
Cody