RAxML-NG: do trees from partitioned vs. non-partitioned data under the same model differ?

Miguel Arenas

May 9, 2023, 1:30:11 PM
to raxml

Hi there,
I noticed that phylogenetic trees reconstructed from a protein dataset under (1) a single partition that includes all the sites of the dataset or (2) one partition per site (whatever the -brlen setting), both based on the same exchangeability matrix, differ in both topology and branch lengths. It seems that using a partition for every site instead of one partition for all the sites (both under the same exchangeability matrix) dramatically reduces the accuracy of the reconstructed phylogenetic tree. Is that correct? Is there any way to reconstruct the same (or a very similar) phylogenetic tree with and without partitions under the same exchangeability matrix?
Thank you!
Best,
Miguel

Alexandros Stamatakis

May 10, 2023, 5:16:03 AM
to ra...@googlegroups.com
Dear Miguel,

> I noted that phylogenetic trees reconstructed from a protein dataset
> under (1) a single partition that includes all the sites of the dataset
> or under (2) a partition per site (whatever is -brlen) based on the same
> exchangeability matrix, differ in terms of both topology and branch
> lengths.

That might happen if you change the model.

> It seems that using a partition for every site

Are you really using a separate partition for every single site?

That might induce substantial over-parametrization.

> instead of a
> partition for all the sites (both under the same exchangeability matrix)
> dramatically reduces the accuracy of the reconstructed phylogenetic
> tree, is it correct?

You might be over-parametrizing. To provide a more detailed answer, I
would need to know the alignment size (#taxa and #sites), and you should
send me your partition file.

> Is there any way to reconstruct the same (or very
> similar) phylogenetic tree with/without partitions under the same
> exchangeability matrix?

I think that there is no clear answer to this. The general rule is: if
you change the model under which you infer the tree, the tree you get
might change as well. You may want to consider running PartitionFinder or
analogous tools to find an appropriate partition scheme.

Alexis

> Thank you!
> Best,
> Miguel
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/f864a982-c4bf-4774-9a63-edae4f236abbn%40googlegroups.com.

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Grimm

May 10, 2023, 1:20:33 PM
to raxml
A little add-on to Alexis's answer from an empiricist's viewpoint.

If the signal in the matrix is trivial, the topology, or any topological aspect that is trivial, will always be (nearly) the same irrespective of the model/partitioning scheme used, as is the case for e.g. tip-poor phylogenomic datasets. In such a case, it won't matter whether one infers the tree using e.g. HKY + Gamma across the entire data or different GTR + Gamma partitions. Only very simple models such as the 1-parameter Jukes-Cantor will show some (minor) differences.

If there is a (substantial) model/partition-induced topological change, it usually relates to non-trivial signal: the optimisation makes a different topological choice because parts of the data reflect the phylogeny, i.e. the sequence of speciation/divergence events, differently (very different evolutionary rates between the lineages and/or gene regions; different evolutionary constraints/selection pressure) or because they reflect histories that differ in some aspects (in the case of hybridisation/introgression/HGT).

Personally, I hence always run (and recommend running) an unpartitioned analysis and a fully partitioned one (if feasible); the latter will typically hit the wall Alexis mentioned, overparametrisation. Fully partitioned means assigning each gene in the matrix its own partition, or going beyond that (distinguishing exons and introns, and codon positions). For in-between schemes, I prefer logical partitions, i.e. by general function (being a geneticist by training, not a phylogeneticist), e.g. all 1st+2nd codon positions vs. 3rd codon positions vs. introns across all protein-coding genes. But most reviewers ask for algorithmically defined partitions (e.g. using PartitionFinder). Such tools pack the maximum defined number of partitions (e.g. the individual genes) into blocks that are supposed to evolve approximately according to the same model, which is a good option to reduce the number of partitions when the matrix includes hundreds of genes.

This way (inferring an unpartitioned, possibly biased tree; a fully partitioned, possibly overparametrized tree; and a tree under some scheme in between) one can directly see the potential topological effect of the model/partitioning scheme (each partitioning will typically lead to different models), and one has captured the total "partitioning-model space".

Cheers, Guido

Miguel Arenas

May 19, 2023, 7:11:14 AM
to raxml
Dear Alexis and Guido,

Many thanks for your comments!

Let me explain a bit why we are exploring this. We are developing new substitution models of protein evolution that produce site-specific exchangeability matrices, because it is common to find that the exchangeability matrix varies among protein sites. Next, we wanted to compare our models with the traditional empirical models (e.g., JTT) in terms of likelihood and distance to a "true" phylogenetic tree. However, we noticed that for the same exchangeability matrix the trees differ if one applies a site-specific partition. For example, consider the following 3 runs:
Case A: ./raxml-ng --search --msa msa.fas --model JTT --outgroup 1IU4_A --opt-model off --threads 2
Case B: ./raxml-ng --search --msa msa.fas --model partitionsSite.txt --outgroup 1IU4_A --threads 2
where partitionsSite.txt has the same exchangeability matrix for all the sites,
PROTGTR{jones.dat}, p1 = 1-331
Case C: ./raxml-ng --search --msa msa.fas --model partitionsSite.txt --outgroup 1IU4_A --threads 2
where partitionsSite.txt has the same exchangeability matrix for every site,
PROTGTR{jones.dat}, p1 = 1-1
PROTGTR{jones.dat}, p2 = 2-2
...
PROTGTR{jones.dat}, p331 = 331-331
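A per-site partition file like the one above is tedious to write by hand. A minimal Python sketch that generates it, assuming the line format shown above; the function name and the default model string are illustrative and not part of RAxML-NG:

```python
def per_site_partitions(n_sites: int, model: str = "PROTGTR{jones.dat}") -> str:
    """Return partition-file text with one partition per alignment site.

    The line format (MODEL, pN = N-N) is copied from the file shown in
    this thread; `model` is whatever model string each partition should use.
    """
    if n_sites < 1:
        raise ValueError("n_sites must be a positive integer")
    lines = [f"{model}, p{i} = {i}-{i}" for i in range(1, n_sites + 1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a short example; for the 331-site alignment discussed here,
    # write per_site_partitions(331) to partitionsSite.txt instead.
    print(per_site_partitions(3), end="")
```

Calling per_site_partitions(331) gives one line per site for the alignment discussed here, and the result can be written to partitionsSite.txt.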
Running this, I find exactly the same tree from cases A and B, but a different tree from case C. Moreover, we found that trees from C are more distant from the true tree than trees from A/B.
I found this for different datasets, with 10 or 300 sequences and variable sequence identity of 0.8 and lower. So I do not think this effect is caused by the input data.
I explored different options in -brlen and found similar results.
Perhaps the program cannot deal with site-specific partitions?

Thanks!
Miguel

Oleksiy Kozlov

May 19, 2023, 11:21:55 AM
to ra...@googlegroups.com
Hi Miguel,

Are you using linked branch lengths?

https://github.com/amkozlov/raxml-ng/wiki/Input-data#branch-length-linkage

Otherwise, there is one additional free parameter per site to estimate (brlen scaler), which could
lead to the discrepancy (and potentially overfitting).

Best,
Oleksiy

Miguel Arenas

May 19, 2023, 1:06:25 PM
to raxml
Hi Oleksiy,

I explored the three options of -brlen (linked, scaled and unlinked) and found similar results.

This effect is very simple to test. The attachment contains some input files to explore it:
Case A: ./raxml-ng --search --msa seq1.fas --model JTT --outgroup 1IU4_A --threads 2
Case B: ./raxml-ng --search --msa seq1.fas --model OnePartition.txt --outgroup 1IU4_A --threads 2
Case C: ./raxml-ng --search --msa seq1.fas --model partitionsSite.txt --outgroup 1IU4_A --threads 2
Case D: ./raxml-ng --search --msa seq1.fas --model partitionsSite.txt —brlen linked --outgroup 1IU4_A --threads 2
Case E: ./raxml-ng --search --msa seq1.fas --model partitionsSite.txt —brlen unlinked --outgroup 1IU4_A --threads 2
Note that "OnePartition.txt" applies the JTT model in only one partition that includes all the sites, while "partitionsSite.txt" applies the JTT model in a separate partition for every site.

Cases A and B produced exactly the same outputs, which is good.
Cases C, D and E produced similar outputs, but different from those of cases A and B.
Somehow, under the same model, including partitions affects the outputs.

Thank you!
Best,
Miguel
seq1.fas
OnePartition.txt
jones.dat
partitionsSite.txt

Oleksiy Kozlov

May 19, 2023, 2:57:22 PM
to ra...@googlegroups.com
Hi Miguel,

I just tested and cases C and D yield two identical tree topologies with identical likelihoods.

If you indeed observe something different, please post your log and output files.

Best,
Oleksiy

Miguel Arenas

May 20, 2023, 1:28:19 AM
to raxml
Hi all,
Yes, I indicated that. The problem is that cases A and B (no partition or 1 partition that includes all the sites) produce results different from cases C, D and E (site-specific partitions) considering that all the cases are based on the same substitution model.
In the attachment I include the input and output files for all the cases.
Thank you!
Best,
Miguel
CaseD.zip
CaseE.zip
CaseA.zip
CaseB.zip
CaseC.zip

Oleksiy Kozlov

May 20, 2023, 6:25:33 AM
to ra...@googlegroups.com
Sorry, I meant to say "cases B and D", so 1 partition and partitioned with linked branch length.

Please check your command lines; I guess you forgot "--brlen linked" in Case D. That is why you get
the following warning in the log file, and it's there for a reason ;)

WARNING: Number of free parameters (K=345) is larger than alignment size (n=331).
This might lead to overfitting and compromise tree inference results!
Please consider revising your partitioning scheme, conducting formal model selection
and/or using linked/scaled branch lengths across partitions.
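
When several runs are being compared, it can help to check each log for this warning automatically. A minimal Python sketch, assuming the warning is worded exactly as quoted above (other RAxML-NG versions may phrase it differently); the function name is hypothetical and the log file paths are supplied by the user:

```python
import re
import sys

# Pattern built from the warning text quoted in this thread (assumption:
# the wording is the same in the log files being checked).
WARNING_RE = re.compile(
    r"Number of free parameters \(K=(\d+)\) is larger than alignment size \(n=(\d+)\)"
)


def find_overfitting_warning(log_text: str):
    """Return (K, n) if the overfitting warning occurs in log_text, else None."""
    match = WARNING_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Usage: python check_log.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path) as handle:
            result = find_overfitting_warning(handle.read())
        if result is None:
            print(f"{path}: no overfitting warning found")
        else:
            k, n = result
            print(f"{path}: K={k} free parameters > n={n} alignment sites")
```

A run that was meant to use linked branch lengths but still triggers this warning is a hint that the branch-length option was not picked up.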

Miguel Arenas

May 21, 2023, 4:51:17 AM
to raxml
Hi,
Thanks a lot, with your help I found the problem: I had specified the option "brlen linked" in the input, but it was not read by the program, I think because of the "--" before "brlen linked". After writing "-" instead, it worked, and I got the same results for the same model with and without partitions.
Many thanks!
Best,
Miguel
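
The Case D and Case E commands as posted earlier in this thread show a single long dash character in front of "brlen" rather than ASCII hyphens. That may be what kept the option from being read, although this is only a guess from the posted text and was not confirmed in the thread. A minimal Python sketch for spotting such characters in a command before running it (the function name is hypothetical):

```python
import unicodedata


def find_non_ascii_dashes(command: str):
    """Return (position, Unicode name) for every dash-like character in
    `command` that is not the plain ASCII hyphen-minus."""
    hits = []
    for pos, ch in enumerate(command):
        # Unicode category "Pd" is dash punctuation (en dash, em dash, ...).
        # The ASCII "-" is also in "Pd", so it is excluded explicitly.
        if unicodedata.category(ch) == "Pd" and ch != "-":
            hits.append((pos, unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    # "\u2014" is the em dash; it is written as an escape so it stays visible.
    cmd = "./raxml-ng --search --msa seq1.fas --model partitionsSite.txt \u2014brlen linked"
    for pos, name in find_non_ascii_dashes(cmd):
        print(f"suspicious character at position {pos}: {name}")
```

Such characters are often introduced when a command is copied through a word processor or mail client that auto-replaces double hyphens.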