MULTIx_MK substitution matrix = Mk? or Mkv? or Mk-pars?

marybel soto gomez

Jan 22, 2020, 12:21:54 AM

to raxml

Hi there,

After searching the Google group, I don't think this question has been asked yet: Is the MULTIx_MK substitution matrix implemented in RAxML-NG the same as the uncorrected Mk model of Lewis (2001; Systematic Biology 50:913-925), or is it one of the corrected versions of the model, such as Mkv or Mk-pars?

For context, I used MULTI3_MK to run tree searches on a morphological matrix that contains up to three character states (0,1,2).

I would really appreciate any feedback on this -- I'd like to make sure I ran my data properly and that I understand the model I used. Thank you!

Marybel

Grimm

Jan 22, 2020, 12:18:32 PM

to raxml

Hi Marybel,

standard multistate models with a maximum of eight states are already implemented (MK and GTR)

Here's a list: https://isu-molphyl.github.io/EEOB563/computer_labs/lab4/models.html

Cheers, Guido

Alexey Kozlov

Jan 27, 2020, 12:27:45 PM

to ra...@googlegroups.com

Hi Marybel,

MULTIx_MK is the standard uncorrected version (Mk). If you want to correct for the missing invariable sites (ascertainment bias),
please use one of the ASC_x options as described in the documentation that Guido kindly linked above :)
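
As a rough illustration of what such a correction does, here is a minimal Python sketch (not RAxML-NG code) for the simplest possible case: two taxa joined by a single branch. It assumes the textbook Mk transition probabilities with equal state frequencies, and a Lewis-style correction that conditions on the character being variable. All function names are made up for this example.

```python
import math


def mk_transition_prob(k, t, same_state):
    """Mk transition probability for k states over a branch of length t
    (expected changes per character). same_state=True gives P(i -> i)."""
    decay = math.exp(-k * t / (k - 1))
    if same_state:
        return 1.0 / k + (k - 1) / k * decay
    return (1.0 - decay) / k


def pattern_prob(k, t, a, b):
    """Uncorrected (Mk) probability of seeing states a and b at the two tips.
    The root is placed at the first tip; its state has frequency 1/k."""
    return (1.0 / k) * mk_transition_prob(k, t, a == b)


def asc_corrected_prob(k, t, a, b):
    """Lewis-style corrected probability: the uncorrected value divided by
    the probability that a character is variable at all."""
    if a == b:
        raise ValueError("constant patterns are unobservable under the correction")
    p_constant = sum(pattern_prob(k, t, s, s) for s in range(k))
    return pattern_prob(k, t, a, b) / (1.0 - p_constant)


if __name__ == "__main__":
    k, t = 3, 0.1  # three states (0, 1, 2), hypothetical branch length
    print("uncorrected:", pattern_prob(k, t, 0, 1))
    print("corrected:  ", asc_corrected_prob(k, t, 0, 1))
```

The uncorrected probabilities sum to one over all patterns, constant ones included. The corrected probabilities sum to one over the variable patterns only, which is the point of the conditioning.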

Best,
Alexey


marybel soto gomez

Jan 28, 2020, 2:52:23 PM

to ra...@googlegroups.com

Hi Alexey and Guido,

Thank you very much. I now see the distinction between uncorrected vs. corrected models clearly in the documentation.


--

Marybel Soto Gomez

Ph.D. Student, Graham Lab

Botany Department

University of British Columbia

Grimm

Jan 29, 2020, 3:54:10 AM

to raxml

Morning Marybel,

just a little add-on and guide. Since you have a matrix with ternary characters and are sitting at the botany department, I guess the matrix codes for morphological traits?

What holds for Bayesian analysis (e.g. Wright and Hillis 2014; see also this preprint by Klopfstein et al. regarding total-evidence (TE) dating) probably also holds for ML. For most non-molecular binary or multistate matrices, the implicit character filtering in the morphological partition (when we assemble it, we usually leave out everything we consider irrelevant, although we shouldn't), which we try to counter with Mkv or Mk-pars models, matters little. Mk and Mkv (= ASC_MK in RAxML) will typically produce very similar trees; however, there may be some differences in the bootstrap support values. The effect will be less visible in the posterior probabilities (PP) because Bayes eliminates internal data conflict more effectively (which is a bad thing, but good when the objective is just to get a tree). However, since this differs from data set to data set, and any model we choose will be wrong to an undetermined degree for a non-molecular data set, one should always just run both, ascertainment-bias uncorrected and corrected, and then compare the outcomes. If they are pretty much the same, all's fine. If not, one should dig deeper into the signal.

I always run all four options for non-binary multistate data, MK (typically the primary basis), ASC_MK, GTR and ASC_GTR, to check for data/inference issues. When there are differences, one should first check the optimised substitution matrix. MK is per se more "natural" and less biased because all mutations (0<>1, 0<>2, 1<>2) have the same probability to start with. GTR will be affected by the frequency of binary vs. ternary characters (in your case): if there are a lot of binary and few ternary characters, the optimised model tends to penalise the <>2 mutations. Again, this effect differs from data set to data set, because in addition to frequency it also depends on how well sorted phylogenetically the binary and ternary characters are. For instance, if all binary characters are well sorted and the ternary ones are poorly sorted, then p(<>2) << p(0<>1).

One simple option to rid the data of any binary vs. ternary bias is to recode the ternaries as binaries: unordered 0 = 100, 1 = 010, 2 = 001; ordered 0 = 00, 1 = 01, 2 = 11. (PS: In case reviewers complain that this will lead to overweighting: it doesn't really matter because we operate in a probabilistic framework, but you can use RAxML's weighting option.)
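
The recoding scheme above can be sketched in a few lines of Python. This is only an illustration: the function name, the taxon names and the handling of missing data (a `?` or `-` is expanded to the same symbol in every new column) are assumptions, and polymorphic codings are not handled.

```python
# Recoding tables taken from the scheme described above.
UNORDERED = {"0": "100", "1": "010", "2": "001"}
ORDERED = {"0": "00", "1": "01", "2": "11"}


def recode_ternary(sequence, ordered=False, missing="?-"):
    """Recode a string of ternary states (0/1/2) as binary columns."""
    table = ORDERED if ordered else UNORDERED
    width = 2 if ordered else 3
    recoded = []
    for state in sequence:
        if state in missing:
            # Assumption: missing data stays missing in every new column.
            recoded.append(state * width)
        elif state in table:
            recoded.append(table[state])
        else:
            raise ValueError(f"unexpected character state: {state!r}")
    return "".join(recoded)


if __name__ == "__main__":
    # Hypothetical ternary partition, one row per taxon.
    matrix = {"taxonA": "0120", "taxonB": "1?02"}
    for taxon, row in matrix.items():
        print(taxon, recode_ternary(row), recode_ternary(row, ordered=True))
```

Apply this to the ternary characters only. A character that never shows state 2 would yield an all-zero third column under the unordered coding, i.e. a new invariable site.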

In case there are substantial differences between the models, making a call can be difficult. Ideally, one just traces the character evolution across the various possible topologies (MK vs. GTR, uncorrected vs. ASC) to see what makes most sense. This is of course much easier for total evidence matrices, where the molecular data predefine the tree's topology.

Mk-pars is not implemented in RAxML, but you can sort of simulate it by eliminating all parsimony-uninformative sites from your multistate partition and running ASC on the reduced matrix.
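
A minimal Python sketch of that filtering step, using the usual definition of a parsimony-informative site (at least two states, each present in at least two taxa). The function names and the toy matrix are hypothetical, and missing data (`?`, `-`) are simply ignored when counting states.

```python
from collections import Counter


def is_parsimony_informative(column, missing="?-"):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(state for state in column if state not in missing)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def drop_uninformative(matrix):
    """Return the reduced matrix and the indices of the columns kept.
    matrix maps taxon name -> string of states, all of equal length."""
    taxa = list(matrix)
    columns = list(zip(*(matrix[taxon] for taxon in taxa)))
    keep = [i for i, col in enumerate(columns) if is_parsimony_informative(col)]
    reduced = {taxon: "".join(matrix[taxon][i] for i in keep) for taxon in taxa}
    return reduced, keep


if __name__ == "__main__":
    # Toy matrix: column 0 is constant, column 3 has only singleton changes.
    matrix = {"A": "0010", "B": "0011", "C": "0101", "D": "0102"}
    reduced, kept = drop_uninformative(matrix)
    print(kept)
    print(reduced)
```

The reduced matrix contains no invariable sites and no autapomorphies, so it can be passed on to an ascertainment-corrected run.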

-----

A further note in case your objective is to place a fossil (or fossils) in a modern-day phylogenetic framework (otherwise ignore the following).

Inferring total evidence (TE) trees is the standard but not a good choice, in particular not for botanists. The vast majority of morphological matrices don't provide any tree-like signal (two early examples of the several we covered at the Genealogical World of Phylogenetic Networks: spermatophytes (see also this (paywalled) paper by Coiro et al. 2018 / free preprint on bioRxiv) and dinosaurs). Each character in a morpho-partition has a different quality (mutation probability, phylogenetic information content, selection pressure), which is the reason Bayesian inference/PP is a worse choice than ML/bootstrapping, and parsimony the worst choice, for real-world data. Few characters will be compatible with the true tree, others may reflect aspects of it, and most will be in conflict with it to various degrees. Under parsimony (or with distance-based trees) branching artefacts are thus very common; ML and Bayes are more robust. When the parsimony and Mk trees disagree, in nearly all cases the parsimony tree is just wrong.

Bayesian analysis will, however, easily tilt towards one alternative when in fact the data would allow for several. Whenever a study finds differences ("conflict") between the TE-Bayes MRC (majority-rule consensus) and the TE-ML tree, it is because TE-Bayes picks one alternative and TE-ML another, but with conspicuously low support. The reason for the latter is either lack of signal (this also leads to PP << 1.0) or split support, which will not be visible in the PP. Hence, ML-BS values will give you a better idea about the conflicts, since they subsample the morphological partition, but you need to look at the split frequencies in the BS sample (e.g. using support consensus networks), not only at the tree itself. A classic warning sign is branches with BS << 100 and PP ~ 1.0, which typically indicates internal signal conflict (this also applies to concatenated molecular data, e.g. combined nuclear and plastid data; see e.g. this example (long resolved Fagales phylogeny)).
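
To make "look at the split frequencies in the BS sample" concrete, here is a small Python sketch. It assumes the bipartitions of every bootstrap tree have already been extracted (with whatever tree library you use) and are given as sets of taxon names. The function names and the toy data are hypothetical. This is not a replacement for a proper support consensus network, only the counting step behind it.

```python
from collections import Counter


def canonical_split(side, taxa):
    """Store a bipartition as the side that lacks the first taxon (sorted),
    so that A|B and B|A are counted as the same split."""
    side = frozenset(side)
    return frozenset(taxa) - side if min(taxa) in side else side


def split_frequencies(bootstrap_splits, taxa):
    """Fraction of bootstrap trees in which each split occurs."""
    counts = Counter()
    for tree_splits in bootstrap_splits:
        # The set guards against counting a split twice within one tree.
        counts.update({canonical_split(s, taxa) for s in tree_splits})
    n_trees = len(bootstrap_splits)
    return {split: count / n_trees for split, count in counts.items()}


def compatible(split_a, split_b, taxa):
    """Two splits fit on one tree iff one of the four intersections of
    their sides is empty."""
    all_taxa = frozenset(taxa)
    a, b = frozenset(split_a), frozenset(split_b)
    return any(not part for part in (a & b, a - b, b - a, all_taxa - (a | b)))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    # Toy "bootstrap sample": non-trivial splits of three replicate trees.
    bootstrap_splits = [
        [{"A", "B"}, {"A", "B", "C"}],
        [{"A", "C"}, {"A", "B", "C"}],
        [{"A", "B"}, {"D", "E"}],
    ]
    freqs = split_frequencies(bootstrap_splits, taxa)
    for split, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
        print(sorted(split), round(freq, 2))
```

Pairs of incompatible splits that both occur at non-trivial frequency are the competing alternatives ("split support") that a single tree with low BS values hides.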

We published a couple of papers in which we placed plant fossils (based on more or less scorable traits), taking into account the peculiar nature of morphological data partitions. The trick is to do a bit of exploratory data analysis (EDA) and to provide alternative approaches to the standard TE tree.

Probabilistic trait mapping works well to assess the position of a single fossil within a molecular backbone topology, even when one has only a few characters at hand; see our recent paper on Winteraceae: Grímsson et al. (2018), J. Biogeogr. 45: 567–581. The R script/documentation has been published via rpubs.

For a "full broadside" analysis (EDA, applying RAxML's evolutionary placement algorithm) see this paper: Bomfleur et al. (2015), BMC Evol. Biol. 15: 126.

Note that most TE papers and theoretical papers are written by zoologists, who in many cases have morphological partitions that include many more characters than we botanists can score and, typically, much less complex signals (I guess it's because animals court and can run away; plants don't). So anything one can read about the performance of TE etc. applies to those data sets but cannot be generalised. Simulations are nice and can test basic principles (e.g. why probabilistic methods outperform parsimony also for binary/multistate matrices), but they have little relevance for real-world data either: we cannot possibly simulate a morphological partition that comes close to the data we have in our hands. Most importantly, all models assume that evolution is a neutral process, which fits genotypes but not phenotypes.

Cheers, Guido
