pargenes-raxml-ng mpi scheduler failed


quattrinia

Nov 29, 2018, 5:52:13 PM
to raxml
I have recently run the program ParGenes. I was able to get through the modeltest step on 13 cores in hardly any time (~300 gene trees with 200+ taxa in each), but running raxml-ng on each gene tree failed after the second ML search with the error below. Any advice?
Thanks!
[0:40:45] end of parsing modeltest results
Calling mpi-scheduler: mpiexec -n 13 /data/mcfadden/aquattrini/PROGRAMS/ParGenes/MPIScheduler/build/mpi-scheduler --split-scheduler /data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/../raxml-ng/bin/raxml-ng-mpi.so pargenes-uce-75p-run1/parse_run/parse_command.txt pargenes-uce-75p-run1/parse_run 0
Logs will be redirected to pargenes-uce-75p-run1/parse_run/logs.txt
Average number of taxa: 204
Max number of taxa: 246
Average number of sites: 243
Max number of sites: 728
Recommended number of cores: 1388
[0:40:46] end of the second parsing step
Calling mpi-scheduler: mpiexec -n 13 /data/mcfadden/aquattrini/PROGRAMS/ParGenes/MPIScheduler/build/mpi-scheduler --split-scheduler /data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/../raxml-ng/bin/raxml-ng-mpi.so pargenes-uce-75p-run1/mlsearch_run/mlsearch_command.txt pargenes-uce-75p-run1/mlsearch_run 0
Logs will be redirected to pargenes-uce-75p-run1/mlsearch_run/logs.txt
[Error] mpi-scheduler execution failed with error code 1
[Error] Will now exit...
[Error] <type 'exceptions.RuntimeError'> mpi-scheduler execution failed with error code 1

quattrinia

Nov 29, 2018, 6:10:15 PM
to raxml
Also, here is the command that I used to start the program:
pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-phylip -o pargenes-uce-75p-run1 -c 13 -d nt -m -b 100

Benoît Morel

Nov 30, 2018, 3:39:15 AM
to raxml
Thanks a lot for your report. At first glance, your command looks fine to me.
Could you please send us the file pargenes-uce-75p-run1/mlsearch_run/logs.txt?

Benoit

Benoît Morel

Nov 30, 2018, 6:08:37 AM
to raxml
Even better, you could send me the output of the report script I just added to the repository.

- Please update your git repository (run "git pull" from the pargenes repository) to get the script.
- Then type: python pargenes/report.py pargenes-uce-75p-run1 report.txt
- and send me the file report.txt

The script extracts all the information I need (hopefully!) to understand what went wrong.

Benoit

quattrinia

Nov 30, 2018, 12:30:11 PM
to raxml
Thanks Benoit,

I actually deleted that directory and started over. I thought it was a memory issue, so I wanted to start from scratch without the -b option.

This time, the program failed after just a few modeltest searches. It may still be a memory issue? I am currently running two other programs on 42 cores. But it appears that I have enough memory:
KiB Mem : 52833952+total, 13425785+free, 10607100 used, 38347459+buff/cache
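[Editor's note: a hedged sketch of how one could check the same numbers programmatically before launching a run. The helper and the sample values are mine, not part of ParGenes; it parses the standard Linux /proc/meminfo "key: value kB" format rather than top's summary line.]

```python
# Sketch (not part of ParGenes): parse /proc/meminfo-style
# "key: value kB" lines into a dict, to check available memory.
def parse_meminfo(text):
    """Return a dict mapping field name -> integer value in kB."""
    mem = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            mem[key.strip()] = int(parts[0])
    return mem

# Illustrative sample; on a real box, read open("/proc/meminfo").read()
sample = """MemTotal:       528339520 kB
MemFree:        134257850 kB
MemAvailable:   510000000 kB"""

info = parse_meminfo(sample)
print(info["MemFree"])  # → 134257850
```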

Attached is the report. Thanks!
andrea
report.txt

Benoît Morel

Dec 2, 2018, 3:49:31 PM
to raxml
Thanks for the report.
Can you send me these files:
modeltest_uce-865_fasta_out.txt
modeltest_uce-865_fasta_err.txt
modeltest_uce-619_fasta_err.txt
modeltest_uce-619_fasta_out.txt

(they are in the directory
/data/mcfadden/aquattrini/DATA/UCESeqs/FinalSet/pargenes-uce-75p-run1/modeltest_run/running_jobs)

It looks like pargenes stopped while running these two jobs.

In parallel, can you try to run the same command, but with another output directory and with the additional argument "--scheduler onecore"?

Thank you
Benoit

quattrinia

Dec 3, 2018, 1:45:44 PM
to raxml
Hi Benoit,

Attached are the files from run 1. I then ran the command again with the --scheduler onecore option, and it failed very quickly. Looking at the report, there seems to be a problem with my .phylip alignments, so I reran it using fasta alignments, and it is currently running. I am not sure whether alignment formatting has been the problem all along, since my first run computed modeltests on all of my .phylip alignments.

If this current run works, do you recommend I use the --scheduler onecore option moving forward?

Also, is it possible to use the raxml-ng --all option?

Best,
andrea
modeltest_uce-865_fasta_out.txt
modeltest_uce-619_fasta_out.txt
pargenes_logs_run2.txt
report_run2.txt

quattrinia

Dec 3, 2018, 10:23:41 PM
to raxml
Attached is the report from the third run with the following command: python /data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta -o pargenes-uce-75p-run3 -c 25 -d nt -m --scheduler onecore  

Looks like it failed again. 
thanks,
andrea


report_run3.txt

Benoît Morel

Dec 4, 2018, 6:07:02 AM
to ra...@googlegroups.com
Hi Andrea,

I think things start making sense:

- first, I think that you are right and that your phylip files cannot be parsed. I would continue with the fasta files. Can you please send me one of the failing phylip files (for instance uce-1092.phylip) so that I can check whether we have something wrong in our parser?
- the second issue concerns modeltest. For some reason, at least one modeltest run crashes. In run1, this interrupts the whole run. In run3, ParGenes survives because you use the (safer) onecore mode. Then it fails because my code does not seem to handle this specific case.
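[Editor's note: to help rule the phylip parsing in or out, here is a minimal pre-check of a phylip header against the file's contents. This is a sketch of my own, not ParGenes' parser, and it assumes sequential, one-line-per-taxon phylip; interleaved or relaxed formats would need more work.]

```python
def check_phylip(text):
    """Sanity-check a sequential, one-line-per-taxon phylip file:
    the 'ntaxa nsites' header must match the records that follow.
    Returns a list of problems; an empty list means it looks OK."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return ["empty file"]
    try:
        ntaxa, nsites = map(int, lines[0].split()[:2])
    except ValueError:
        return ["unparsable header: %r" % lines[0]]
    problems = []
    records = lines[1:]
    if len(records) != ntaxa:
        problems.append("header says %d taxa, found %d records"
                        % (ntaxa, len(records)))
    for rec in records:
        parts = rec.split(None, 1)
        if len(parts) < 2:
            problems.append("record without a sequence: %r" % rec[:30])
            continue
        seq = parts[1].replace(" ", "")
        if len(seq) != nsites:
            problems.append("%s: %d sites, expected %d"
                            % (parts[0], len(seq), nsites))
    return problems

good = "2 5\ntaxA ACGTA\ntaxB ACG-A\n"
bad = "3 5\ntaxA ACGTA\ntaxB ACGT\n"
print(check_phylip(good))  # → []
```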

I think I can fix my ParGenes bug today, and then you should be able to run the whole analysis using the onecore mode. (I will let you know when it's fixed.)

About the modeltest job that fails, this might be more complicated, and I might need help from the modeltest maintainer. Can you send me the following fasta files?
- mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta/uce-566.fasta
- mafft-nexus-internal-trimmed-gblocks-clean-75p-phylip/uce-865.fasta
- mafft-nexus-internal-trimmed-gblocks-clean-75p-phylip/uce-619.fasta


In general: continue with the onecore option. In your case, it should not be slower because all your alignments are short.

About the raxml --all option: you can use it, but then you should not use the -b, -s and -p options from ParGenes, which do exactly the same thing (running raxml from several starting trees, selecting the best one, computing bootstrap trees and support values).
The advantage of using the ParGenes options is that ParGenes will run all of these operations in parallel; raxml --all runs them sequentially.
I would rather use the ParGenes -b, -s and -p options, but both methods will work.
(The ParGenes equivalent of raxml --all is -s 20 -b 100.)


Cheers,
Benoit


--
You received this message because you are subscribed to a topic in the Google Groups "raxml" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raxml/3q21UKriGoI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raxml+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benoît Morel

Dec 4, 2018, 6:29:00 AM
to raxml
I think I fixed the issue. Please run "git pull" on your pargenes repository, and then try to restart run3. If it does not work, try restarting it from scratch (but it should not be necessary). 

quattrinia

Dec 4, 2018, 5:36:24 PM
to raxml
hi benoit,

Thanks very much. The pipeline worked. I am just re-running it with the -s 20 and -b 100 options.

Attached are my .fasta files and a phylip example. You can clearly see that something is wrong on my end with the formatting of the phylip file. I must have used a different script to generate the phylip files for my first attempted run.

BTW, thanks for the help and I appreciate the quick responses!
andrea
uce-865.phylip
uce-619.fasta
uce-865.fasta
uce-566.fasta

quattrinia

Dec 4, 2018, 6:31:52 PM
to raxml
Attached are my reports from run 3 and run 4. It appears to still be having issues:
run3: /data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta -o pargenes-uce-75p-run3 -c 25 -d nt -m --scheduler onecore --continue
run4: /data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta -o pargenes-uce-75p-run4 -c 25 -d nt -m -b 100 -s 20 --scheduler onecore
report_run3.txt
report_run4.txt

Benoît Morel

Dec 5, 2018, 5:46:29 AM
to ra...@googlegroups.com
Hi Andrea,

thanks for your nice feedback :-)

About the modeltest bug, I opened an issue here: https://github.com/ddarriba/modeltest/issues/23
I hope Diego will find some time to fix it.

About run3: could you please `git pull` your pargenes repository again, and restart it from scratch? (without bootstraps etc.)

About run4: I am not sure what the issue is:
- Could you please send me the fasta files uce-505.fasta, uce-1233.fasta, and uce-78.fasta? It looks like one of them is the culprit, and I need to reproduce the issue on my machine.
- It might be a known raxml-ng bug. If you want, you can try running the same command line (from scratch, do not use checkpointing) with the following additional parameter: -R "--blopt nr_safe".


Benoit

Benoît Morel

Dec 5, 2018, 6:09:22 AM
to ra...@googlegroups.com
Actually, since there seem to be several issues in our tools, you could also send me the whole set of alignments at once (if you don't mind), so that I can try to make a whole run pass on my machine (with fewer bootstraps).

quattrinia

Dec 5, 2018, 1:06:18 PM
to raxml
Hi,

I will send via email.

thanks,
andrea

Benoît Morel

Dec 6, 2018, 8:54:55 AM
to raxml
Thank you. I started the analysis on two different machines with the exact same command as yours, and it has not failed so far.

Meanwhile, could you try what I suggested for run4 (running it from scratch with -R "--blopt nr_safe")?
Be sure to git pull your pargenes repository again, because I added some more information in the report script.

Even if it does not work, I would like to see if the issue happens at the same moment.

Benoit

quattrinia

Dec 6, 2018, 4:10:47 PM
to raxml
Here is the report from the failed run4, with the command:

/data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta -o pargenes-uce-75p-run4 -c 25 -d nt -m -b 100 -s 20 --scheduler onecore -R --blopt nr_safe


Thanks for the output from the test that you did!

Just some more information in case it is causing an issue: I am running two other raxml jobs, one with raxml-ng-mpi and the other with raxml-pthreads rapid bootstrapping. Also, I have raxml-ng installed in the pargenes directory and raxml-ng-mpi in another directory.

cheers,
andrea
report_run4safe.txt

Benoît Morel

Dec 7, 2018, 4:12:11 AM
to ra...@googlegroups.com
Thank you. Since you get the same problem, you don't need the "safe" option if you run pargenes again.

Running other instances of raxml or having other installations should not be an issue. The only risk would be not to have enough memory, but I don't think that it's the explanation.

The annoying thing is that I cannot reproduce this on any of our machines and I don't find any information in the logs. Could you also send me all the raxml logs? From the ParGenes run4 directory:
mkdir raxml_logs
cp mlsearch_run/results/*/*.log raxml_logs/
cp mlsearch_run/bootstraps/*/*.log raxml_logs/
And then send me a tar or zip of raxml_logs.
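[Editor's note: the same collection step, sketched in Python so that patterns matching nothing are simply skipped — relevant since Andrea later reports that the results log files do not exist. The helper name and collision-avoiding prefix are my own, not part of ParGenes.]

```python
import glob, os, shutil, tarfile

def collect_logs(run_dir, out_dir="raxml_logs"):
    """Copy every raxml .log file from the ParGenes mlsearch results and
    bootstraps folders into out_dir (patterns matching nothing are simply
    skipped), then bundle out_dir into out_dir.tar.gz."""
    os.makedirs(out_dir, exist_ok=True)
    patterns = [
        os.path.join(run_dir, "mlsearch_run", "results", "*", "*.log"),
        os.path.join(run_dir, "mlsearch_run", "bootstraps", "*", "*.log"),
    ]
    copied = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            # prefix with the per-gene folder name to avoid collisions
            name = (os.path.basename(os.path.dirname(path))
                    + "_" + os.path.basename(path))
            dest = os.path.join(out_dir, name)
            shutil.copy(path, dest)
            copied.append(dest)
    with tarfile.open(out_dir + ".tar.gz", "w:gz") as tar:
        tar.add(out_dir)
    return copied
```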

Here are some things you can try (separately or all together) that might give us more information:
- run the same command but on the other dataset you have (trans), to see if you get the same issue
- run the same command with a different number of cores
- run the same command when nothing else is running (I don't believe that this one would change anything)
- run pargenes without the -b and -s options

Benoît Morel

Dec 7, 2018, 4:23:02 AM
to ra...@googlegroups.com
I also just realized that in both failing runs, the only common fasta file being processed at the failure is uce-505. Maybe you could try creating a directory with this file only and running pargenes on it, to see whether it also fails.
If it fails, it might make the issue much easier to reproduce.

quattrinia

Dec 9, 2018, 4:25:48 PM
to raxml
Hi Benoit,

I ran pargenes successfully on one file (uce-505) with this command, and the program finished:
/data/mcfadden/aquattrini/PROGRAMS/ParGenes/pargenes/pargenes.py -a 505 -o pargenes505 -c 25 -d nt -m -b 100 -s 20 --scheduler onecore


But, as you suggested, I ran the same command on other datasets, with a different number of cores, etc., and all of those runs failed. Attached are the reports. I am still waiting for other raxml runs to finish before I try running the program with nothing else running at the same time.

Also, the raxml_log files from the run4 bootstrapping folder are attached. Note that the log files from "cp mlsearch_run/results/*/*.log raxml_logs/" do not exist: from what I can tell, the uce-*_fasta results folders only contain multiple_runs directories, and all subsequent directories in those are empty.
report_trans.txt
report_42cores.txt
report_8cores.txt
raxml_logs.tar.gz

Benoît Morel

Dec 10, 2018, 4:03:08 AM
to ra...@googlegroups.com
Hi Andrea

Thanks a lot for all these runs. I don't think running ParGenes again when no other raxml instance is running will help, after all.

One more idea: can you (temporarily) remove the file uce-566.fasta from the alignment directory, and run pargenes without --scheduler onecore?
This way we could rule out the very first bug you had, continue with the "normal" implementation, and see if we still get this strange issue.

Can you also restart one of the failing onecore runs from the checkpoint? I would like to know whether it quickly fails again.

Also do you have any other machine on which you could try the same experiment?

Interestingly, all 3 of the last runs you sent me failed between 7800s and 7850s. It might be a coincidence, because the other runs finish after different times. Is there any chance that some jobs are killed after ~2h or something like that? (I don't think this is what's happening, but I prefer asking, just in case.)


I am really sorry that you need to do all these runs. It's very unfortunate that I cannot reproduce this on our machines. I will try again with the second dataset.

Thanks a lot
Benoit

Benoît Morel

Dec 10, 2018, 7:14:01 AM
to raxml
I wrote: "I don't think running ParGenes again when no other raxml instance is running will help, after all."

But I just changed my mind. I think your machine has 64 logical CPUs but only 32 physical CPUs (this is what the modeltest logs say).
This means that the raxml/pargenes/modeltest tools will run efficiently on up to 32 cores in total. If you run these tools on more cores (and I think you wrote that you are already running raxml instances on 42 cores), you are likely to experience slowdowns.
Theoretically, this should not cause crashes, but since we can't reproduce the issue on other machines, I think you should try running pargenes when your machine is less busy.
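[Editor's note: the logical/physical distinction above can be checked directly. This is a sketch of my own that parses /proc/cpuinfo text (on Linux; `lscpu` reports the same numbers); the sample input is illustrative, not the poster's actual machine.]

```python
def cpu_counts(cpuinfo_text):
    """Count logical CPUs ('processor' entries) and physical cores
    (unique (physical id, core id) pairs) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    phys = None
    for line in cpuinfo_text.splitlines():
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            phys = val
        elif key == "core id":
            cores.add((phys, val))
    # if the file lacks topology fields, fall back to the logical count
    return logical, len(cores) or logical

# Illustrative sample: 4 hyperthreads on 2 physical cores.
sample = "\n".join(
    "processor\t: %d\nphysical id\t: 0\ncore id\t: %d" % (i, i % 2)
    for i in range(4)
)
print(cpu_counts(sample))  # → (4, 2)
```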

Benoit

quattrinia

Dec 10, 2018, 10:23:38 PM
to raxml
Hi Benoit,

So, it is interesting that it keeps failing at a certain time. I ran the program without the uce-566 file, using the command below, and yes, it failed at ~7800 sec. Then I continued it, and it failed again at the same timepoint. I restarted it again...
pargenes.py -a mafft-nexus-internal-trimmed-gblocks-clean-75p-fasta -o pargenes-uce-75p-run6-no566 -c 25 -d nt -m -b 100 -s 20 --continue

I asked our IT administrator if he had any clue; he said this (not sure if any of it is helpful):

"there’s no kernel-level cputime limit set by default:
tcb@purves [~]
(505) % limit cputime
cputime         unlimited

Maybe you’re running into an MPI limit? How exactly do you run your program? I know that “mpirun”/“mpiexec”/etc. uses the MPIEXEC_TIMEOUT variable to control how long programs can run, but it looks like the default is not to have any limit."
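[Editor's note: both of the limits the admin mentions can be dumped from Python. A small sketch of my own: the `resource` module exposes the same cputime rlimit the shell's `limit` shows, and MPIEXEC_TIMEOUT is just an environment variable.]

```python
import os
import resource

def run_limits():
    """Report the kernel CPU-time limit and any mpiexec timeout in force."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    def fmt(v):
        return "unlimited" if v == resource.RLIM_INFINITY else v
    return {
        "cputime_soft": fmt(soft),
        "cputime_hard": fmt(hard),
        "MPIEXEC_TIMEOUT": os.environ.get("MPIEXEC_TIMEOUT", "not set"),
    }

print(run_limits())
```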

Benoît Morel

Dec 11, 2018, 4:41:02 AM
to ra...@googlegroups.com
Hi Andrea,

It's interesting indeed; we should definitely investigate this. I am calling mpiexec -n <some program with some arguments>, without setting any environment variables or adding any weird MPI stuff.

But we can try something:
I added a ParGenes option to make the python script restart the MPI program n times after a failure (you need to pull your repository). For instance, if you add --retry 5, it will try up to 5 more times to run mpiexec <...>.
This way we should be able to answer two questions:
- will the whole run survive longer than 7800s with restarts? (this would tell me whether the python program or the MPI program is "guilty")
- if yes, will ParGenes manage to process all the MSAs with this bypass? If so, you will be able to process your datasets, but that's not a long-term solution for my tool.
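[Editor's note: the --retry behaviour described above can be pictured as a simple wrapper. This is a sketch of my own, not ParGenes' actual code; the function name is mine, and a real implementation would also resume from checkpoints rather than rerun from scratch.]

```python
import subprocess

def run_with_retries(cmd, retries):
    """Run cmd; on a non-zero exit code, re-run it up to `retries` more
    times. Returns the final exit code (0 on success)."""
    code = subprocess.call(cmd)
    for attempt in range(retries):
        if code == 0:
            return 0
        print("attempt %d failed with code %d, retrying" % (attempt + 1, code))
        code = subprocess.call(cmd)
    return code

# e.g. run_with_retries(["mpiexec", "-n", "13", "mpi-scheduler", "..."], 5)
```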

Also, could you try sending me the whole output directory for this last run? It might contain more logs than with the onecore option.
Here is the command that will create an archive without the big files I don't need: tar -czvf pargenes_run.tar.gz pargenes_run_dir/ --exclude='*.fasta'  --exclude='*.phy' --exclude='*.rba'
The archive might be too big to fit in a mail.

Benoit

quattrinia

Dec 11, 2018, 2:42:07 PM
to raxml
Hi Benoit,

Attached is the output that you requested. I am running the program now with the --retry option. Also, one of my other raxml-pthreads runs has finished...

cheers,
andrea
pargenes-uce-75p-run6-no566.tar.gz

quattrinia

Dec 12, 2018, 1:14:14 PM
to raxml
Hi Benoit,

With my new run and the --retry option, things are still working. It looks like it has only failed once, as there are only 2 log files under the ./mlsearch_run directory. The program has completed bootstrapping on ~1/2 of the alignments. I will update you if/when it finishes.

"Runner is still alive after 72125s"


Cheers,
Andrea

quattrinia

Dec 14, 2018, 1:44:45 PM
to raxml
Hi there,

Attached is the report for the 7th run using the --retry option. Looks like it worked! It took just a few days.

Andrea
report_run7_retry.txt

quattrinia

Dec 18, 2018, 12:13:43 PM
to raxml
Just one more update: with nothing else running, pargenes still fails...

Benoît Morel

Dec 18, 2018, 4:34:39 PM
to ra...@googlegroups.com
Dear Andrea,

I realized that there is something wrong in my retry option: in your report, the first run fails with error code 1, but then the first retry succeeds (error code 0 is a success) and my code still treats it as a failure. I just fixed this. Could you try it again after a git pull?

I thought ParGenes would work correctly with all the CPUs free... That's very annoying. How many cores did you use for this run?
Does it work with the retries?

Benoit

