Educated Bootstrap Guesser

Alexandros Stamatakis

Oct 20, 2024, 1:52:52 AM
to ra...@googlegroups.com
Dear Users,

Do you want to rapidly predict bootstrap values via machine learning?
You can now use our Educated Bootstrap Guesser:

https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msae215/7825466

This will also be integrated into RAxML-NG next year.

Alexis

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Pfeiffer, Wayne

Nov 1, 2024, 6:44:22 PM
to ra...@googlegroups.com, Pfeiffer, Wayne
Hi Alexis,

After receiving this announcement about EBG I was eager to try it out.

I installed the package via conda and typed

ebg -h

which returns the expected output.

So far, however, I have been unable to get it to run on actual data. I found that I needed absolute paths to the input files to avoid some errors, but even with those paths I still get errors.

Here is my command line:

ebg -msa /expanse/projects/ngbt/opt/benchmarks/EBG-0.12.0_expanse/218/218.fasta -tree /expanse/projects/ngbt/opt/benchmarks/EBG-0.12.0_expanse/218/218.bestTree -model /expanse/projects/ngbt/opt/benchmarks/EBG-0.12.0_expanse/218/218.bestModel -redo

and here are the error messages:

Traceback (most recent call last):
  File "/home/cipres/miniconda3/envs/ebgenv/bin/ebg", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cipres/miniconda3/envs/ebgenv/lib/python3.12/site-packages/EBG/__main__.py", line 29, in main
    predictor = Predictor(args.msa, args.tree, args.model, args.o, args.t, args.raxmlng, args.redo)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cipres/miniconda3/envs/ebgenv/lib/python3.12/site-packages/EBG/Prediction/predictor.py", line 64, in __init__
    self.feature_extractor = FeatureExtractor(msa_filepath, tree_filepath, model_filepath, o, raxml_ng_path, redo)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cipres/miniconda3/envs/ebgenv/lib/python3.12/site-packages/EBG/Features/feature_extractor.py", line 36, in __init__
    self.feature_computer = FeatureComputer(msa_file_path, tree_file_path, model_file_path, output_prefix, raxml_ng_path, redo)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cipres/miniconda3/envs/ebgenv/lib/python3.12/site-packages/EBG/Features/feature_computer.py", line 80, in __init__
    tmp_folder_path = os.path.abspath(os.path.join(os.curdir, output_prefix))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 90, in join
  File "<frozen genericpath>", line 164, in _check_arg_types
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

* Is there someone on your team who might suggest what the problem is and how to solve it?

Thanks for whatever help you can provide.

Wayne

Oleksiy Kozlov

Nov 1, 2024, 6:54:51 PM
to ra...@googlegroups.com
Hi Wayne,

please try adding "-o <output_folder>"; there seems to be a missing default value for this argument.

Best,
Oleksiy
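
For readers hitting the same TypeError: the traceback above is consistent with an optional argument that defaults to None and is then passed to os.path.join. A minimal sketch of that failure mode and one possible guard follows. The parser, the function names, and the fallback folder name "ebg_output" are hypothetical and are not taken from EBG's source.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the option names mirror those used in this thread,
    # but this is not EBG's actual argument handling.
    parser = argparse.ArgumentParser(prog="ebg-sketch")
    parser.add_argument("-msa", required=True)
    parser.add_argument("-o", default=None)  # no usable default (assumption)
    return parser


def output_dir(prefix):
    # os.path.join raises TypeError when prefix is None.
    return os.path.abspath(os.path.join(os.curdir, prefix))


def output_dir_with_fallback(prefix, fallback="ebg_output"):
    # One possible guard: substitute a fixed folder name when -o is omitted.
    return output_dir(prefix if prefix is not None else fallback)
```

Until such a guard exists in the tool itself, passing -o explicitly, as suggested above, sidesteps the problem.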

Pfeiffer, Wayne

Nov 2, 2024, 5:27:10 AM
to ra...@googlegroups.com, Pfeiffer, Wayne
Hi Oleksiy,

Thanks for the prompt reply.

Adding the -o option allowed me to successfully analyze two small DNA data sets in 49 and 209 s :)

However, the analysis of a larger DNA data set with 45 taxa and 168,565 patterns did not finish within my specified time limit of 6 hours. Here is the final output in stderr:

FeatureComputer - INFO - Finished computing 180 from 200 parsimony bootstraps ... 
FeatureComputer - INFO - Finished computing 200 Parsimony Bootstraps
FeatureComputer - INFO - Finished computing Parsimony Bootstrap features!
FeatureExtractor - INFO - Elpased time: 2306.95 seconds
FeatureComputer - INFO - Finished computing tree split features!
FeatureExtractor - INFO - Elpased time: 0.12 seconds

So no output was generated for over 5 hours.

* Do you think this analysis would finish if run longer, or is this data set just too big for EBG?

I also tried to analyze two amino acid data sets, but both attempts failed. EBG thought that the input was for partitioned DNA data sets, even though the original RAxML-NG analyses were unpartitioned and the model files specified WAG+G4m or LG+G4m. Here is the start of the stdout file from my run with the WAG+G4m model:

ERROR: Failed to read partition file:
ERROR model initialization |(Seq140:0.314768| (LIBPLL-5001): DNA model not found: (Seq140:0.314768

I presumed that AA data sets were allowed, since the paper by Wiegert et al. says:

“We used 1496 MSAs (93% DNA and 7% Amino Acid (AA)) for training and evaluating EBG.”

* Please let me know if you would like me to send you or one of your colleagues any of my input files by a separate email thread for further investigation.

Thanks again,
Wayne

Pfeiffer, Wayne

Nov 2, 2024, 7:22:57 PM
to ra...@googlegroups.com, Pfeiffer, Wayne
Hi Oleksiy,

I resubmitted a job for the large DNA data set with a longer time limit. It had nearly finished when it ran out of memory after 13.2 hours.

I have resubmitted two more jobs requesting more memory to see whether one of them finishes successfully.

Also, EBG does not seem to accept a partition file as input.

* Does that mean that EBG cannot handle partitioned data sets?

Best regards,
Wayne

Pfeiffer, Wayne

Nov 4, 2024, 3:41:02 AM
to ra...@googlegroups.com, Pfeiffer, Wayne
Hi Oleksiy,

My EBG analysis of the DNA data set with 45 taxa and 168,565 patterns finished successfully in 12.6 hours after I increased the memory to 32 GB. The analysis was fine using only 2 GB of memory until the very end, when a final processing step became very memory intensive.

* This explosion in memory usage would be good to investigate along with why the code does not work for amino acid data sets.

Best regards,
Wayne

Oleksiy Kozlov

Nov 4, 2024, 8:46:32 AM
to ra...@googlegroups.com, Pfeiffer, Wayne
Hi Wayne,

thanks for the extensive testing!

I was not involved in this project, so I will have to discuss your questions with colleagues, and this could take a while.

We will also discuss whether/how EBG could be integrated into raxml-ng, and hopefully we can address some limitations of the current implementation in the process.

My best guess so far:

- partitioned alignments are not supported

- for AA data, could it be that you provided a newick file instead of .raxml.bestModel in the "-model" option?


Best,
Oleksiy
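
One way to check this guess before rerunning: the error message quoted earlier begins with "(Seq140:0.314768", which resembles the start of a Newick tree string. The sketch below is a rough heuristic with hypothetical function names. It flags a file passed as -model when its content looks like Newick; it does not validate RAxML-NG model files.

```python
def looks_like_newick(text: str) -> bool:
    # Heuristic only: a Newick tree string normally starts with "("
    # and ends with ";".
    stripped = text.strip()
    return stripped.startswith("(") and stripped.endswith(";")


def check_model_file(path: str) -> None:
    # Raise early if the file given as the model looks like a tree instead.
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    if looks_like_newick(content):
        raise ValueError(f"{path} looks like a Newick tree, not a model file")
```

A model string such as WAG+G4m does not match the heuristic, so check_model_file would accept a file containing it without complaint.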

Pfeiffer, Wayne

Nov 4, 2024, 10:13:16 AM
to Oleksiy Kozlov, ra...@googlegroups.com, Pfeiffer, Wayne