RAxML and missing data

marvela

Aug 18, 2011, 11:49:41 PM
to raxml
I've been using the RAxML Black Box tool on CIPRES to analyze a matrix
of aligned DNA sequences. I've noticed that, in a few positions on my
tree (particularly in a region with low variation and relatively low
support values) I have clades that seem to be forming based on
similarity of sequence length rather than similarity of bases. For
example, I have sequences A, B, C, D, and E. Sequences A, B, and C
are identical to each other in terms of base identity, but C is 50
base pairs shorter than A and B (because 50 bases are missing at the
5' end). Sequences D and E differ from each other by one SNP, and
both differ from A, B, and C by two additional SNPs, but D and E are
the same length as C (again because 50 bases are missing from each at
the 5' end). So...instead of returning two clades based on base
identity (ABC and DE), RAxML is returning two clades based on sequence
length (AB and CDE). Why is this happening? Is this a common problem
in maximum likelihood analyses?

Peter Unmack

Aug 19, 2011, 2:24:33 AM
to ra...@googlegroups.com
Trim the N's from your dataset and rerun it and that will answer your
question! Personally I don't like Ns in my datasets.

Unless you have a good reason to, I would not use blackbox on a dataset
that I wished to publish, as a reviewer may harass you about what settings
you used and why. The regular version on CIPRES gives you more options to
set, and it is also quite simple to use. For quick and dirty stuff, though,
I'm sure blackbox is fine.

Cheers
Peter

marvela

Aug 21, 2011, 6:06:06 PM
to raxml
Peter, thanks for your reply. If I trim my alignment to the shortest
sequences, I lose a lot of information. I don't want to do this. I
need to be able to run the analysis with a little bit of missing data
for some sequences at the beginning and end of the alignment. Are
there parameters I could set in the full RAxML version that would help
me do this? Thanks again!

Alexis

Aug 22, 2011, 6:01:35 AM
to raxml
Hi Marvela,

There are no settings that will allow you to make RAxML behave
differently in grouping your taxa. For undetermined data we don't know
if there is an A, C, G, or T at the leaf sequence.

Hence, all probabilities for observing A, C, G, and T are set to 1.0.
As a result, over those 50 bp, A and B appear to be more distant from
C, D, and E than they would if sequence data were available for that
region. What ML is doing is actually correct, since it groups together
those taxa for which there is more signal and less uncertainty about
the missing part.

For your taxa A and B to be grouped with taxon C you would actually
have to assume that you know the missing data sequence.
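
As a rough illustration of the point above, here is a minimal Python sketch. It assumes a Jukes-Cantor model and just two leaves joined by one branch. The function names are made up for this example and are not RAxML code. It shows that a leaf with an undetermined character gives the same site likelihood whatever the branch length is, so that site carries no signal about how close the two leaves are.

```python
import math

STATES = "ACGT"

def tip_vector(base):
    """Per-state values at a leaf: 1.0 for every state compatible with the observation."""
    if base in ("N", "-", "?"):      # undetermined: A, C, G, or T are all possible
        return [1.0, 1.0, 1.0, 1.0]
    return [1.0 if s == base else 0.0 for s in STATES]

def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def pair_site_likelihood(x, y, t):
    """Likelihood of one site for two leaves x and y separated by branch length t."""
    p = jc69(t)
    lx, ly = tip_vector(x), tip_vector(y)
    return sum(
        0.25 * lx[i] * sum(p[i][j] * ly[j] for j in range(4))
        for i in range(4)
    )

# Compare an informative site (A vs A) with a site where one leaf is missing (A vs N).
for t in (0.01, 0.1, 1.0):
    print(t, pair_site_likelihood("A", "A", t), pair_site_likelihood("A", "N", t))
```

In this toy setup the A/N column stays at 0.25 for every branch length, while the A/A column favours a short branch.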

Alexis

Heiko

Aug 22, 2011, 9:54:36 AM
to raxml
Hi Alexis,

> Hence, all probabilities for observing A, C, G, and T are set to 1.0

Why do you actually set all those probabilities to 1.0? That yields a
total probability of 4.0...
I guess you correct for that anyway, but it sounds a bit
counter-intuitive.

Slightly off-topic but related to this one, how does RAxML handle
other ambiguous characters like W (=A+T), S (=G+C), B (=not A=C+G+T)
etc. in this context?
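
For reference, the usual likelihood treatment of partially ambiguous characters can be sketched as follows. This assumes the standard IUPAC nucleotide codes and the common convention of giving every compatible state a value of 1.0. The table and function name are illustrative only, and RAxML's internals may differ in detail.

```python
# Standard IUPAC nucleotide codes mapped to the states they are compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "ACGT", "?": "ACGT",
}

def tip_vector(code):
    """Return leaf values in A, C, G, T order: 1.0 if the state is compatible, else 0.0."""
    allowed = IUPAC[code.upper()]
    return [1.0 if s in allowed else 0.0 for s in "ACGT"]

for code in ("W", "S", "B", "-"):
    print(code, tip_vector(code))
```

Under this convention W and S each allow two states, B allows three, and a gap or N allows all four.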

Best,
Heiko

Alexis

Aug 22, 2011, 12:40:58 PM
to raxml
Hi Heiko,

See Joe Felsenstein's book (page 255) "Likelihood Methods: Handling
Ambiguity and Error" for an answer to your question ;-)

Alexis

marvela

Aug 22, 2011, 5:28:07 PM
to raxml
Hi Alexis,

That answers my question perfectly. Thanks!

Marvela

Heiko

Aug 23, 2011, 8:26:51 AM
to raxml
> > > Hence, all probabilities for observing A, C, G, and T are set to 1.0
>
> > why do you actually set all those probabilities to 1.0? That yields a
> > total probability of 4.0...
>
> See Joe Felsenstein's book (page 255) "Likelihood Methods: Handling
> Ambiguity and Error" for an answer to your question ;-)

Yes, that answers it. I got confused because you called them
probabilities, while they are actually likelihoods here, as Joe points
out. So this only shifts the height of the likelihood surface, but
does not change the position or the relative order of the maxima.
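
A small numerical check of this, as a sketch only: it assumes a Jukes-Cantor model, two hypothetical sequences X and Y, and a made-up `missing_weight` parameter that sets the value used for undetermined characters. Using 1.0 instead of 0.25 should move the log-likelihood by the same constant at every branch length.

```python
import math

def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def log_likelihood(seq_x, seq_y, t, missing_weight):
    """Two-leaf log-likelihood; 'N' gets missing_weight for each of the four states."""
    p = jc69(t)

    def tip(base):
        if base == "N":
            return [missing_weight] * 4
        return [1.0 if s == base else 0.0 for s in "ACGT"]

    total = 0.0
    for x, y in zip(seq_x, seq_y):
        lx, ly = tip(x), tip(y)
        site = sum(
            0.25 * lx[i] * sum(p[i][j] * ly[j] for j in range(4))
            for i in range(4)
        )
        total += math.log(site)
    return total

# Hypothetical toy data: Y is missing its first two positions.
X = "ACGTACGTAC"
Y = "NNGTACGAAC"

for t in (0.01, 0.1, 0.5):
    offset = log_likelihood(X, Y, t, 1.0) - log_likelihood(X, Y, t, 0.25)
    print(t, offset)  # same offset at every t: one log(4) per missing character
```

Because the offset does not depend on the branch length, the branch length that maximizes the likelihood is the same under both conventions.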

Thanks & Cheers,
Heiko

Alexandros Stamatakis

Mar 22, 2017, 4:25:06 PM
to ra...@googlegroups.com
Dear Silvia,

The first site is causing the problem. In likelihood implementations,
- is modeled as an undetermined state, meaning it could be A or C or G
or T. The - in the last sequence is therefore compatible with G, and
the site is counted as invariable.

If you upgrade to the latest RAxML version, it should actually tell you
which sites in the alignment are invariable.
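
To pre-screen an alignment yourself, something like the following sketch may help. It assumes the rule described above: a site counts as invariable if at least one nucleotide is compatible with every character in the column. The function name and toy alignment are made up for illustration, and the exact check RAxML performs may differ.

```python
# Standard IUPAC nucleotide codes mapped to the states they are compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "ACGT", "?": "ACGT",
}

def potentially_invariable_sites(seqs):
    """Return 1-based column indices where a single state fits every sequence."""
    flagged = []
    for col in range(len(seqs[0])):
        shared = set("ACGT")
        for seq in seqs:
            shared &= set(IUPAC[seq[col].upper()])
        if shared:                   # some state is compatible with all rows
            flagged.append(col + 1)
    return flagged

# Hypothetical toy alignment (four taxa, four sites).
alignment = [
    "GARR",
    "GCAA",
    "GAG-",
    "-ANA",
]
print(potentially_invariable_sites(alignment))  # -> [1, 4]
```

Column 1 (G, G, G, -) and column 4 (R, A, -, A) are flagged because G and A, respectively, fit every row. Columns 2 and 3 contain incompatible characters and count as variable.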

Cheers,

Alexis

On 21.03.2017 18:27, Silvia Justi wrote:
> Hi Alexis,
> So I have data like this
> G-------
> G-------
> G-------
> G-------
> G-------
> G-------
> G--C--RT
> --------
> And since it is SNP data, I used asc to reconstruct the tree. However, I
> get an error saying that I have invariable sites. Would missing data
> like that shown in the alignment above (nt 1) be considered
> invariant? I would guess not, but that is the closest my data gets to it.
> I am not sure what the problem is.
>
> Can you please help me out?
> Best
> Silvia

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Alexandros Stamatakis

May 19, 2017, 9:32:19 AM
to ra...@googlegroups.com


On 19.05.2017 11:10, Miguel Camacho wrote:
> My case. 24% missing data over 16 Kbp and 114 taxa:
>
> RAxML: bad supports, weird topology with unexpected placement of many taxa.
> FastTree: great supports, perfect placement of taxa and 1/100 of time to
> converge.
>
> Conclusion: there is an ongoing issue with how RAxML treats missing data.
>
> Not much idea... but I guess it has something to do with missing data
> being evaluated when it should not.

This is definitely not the case. Maybe you should do a significance test
on the likelihoods you got for the FastTree and RAxML trees. Also,
FastTree supports are calculated in a completely different way than in
RAxML, so obtaining low support values from RAxML may actually mean that
there is a lack of support in the data and that the SH-like supports
FastTree yields are inflated.
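
One simple form such a comparison could take is a Kishino-Hasegawa-style paired test on per-site log-likelihoods, sketched below with a normal approximation. This is a simplification: it assumes both trees were evaluated on the same alignment under the same model, and the KH test is strictly meant for trees chosen in advance, so an SH or AU test is usually preferred when one tree is the ML tree. The function name and input values are hypothetical.

```python
import math
import statistics

def kh_style_test(site_lnl_a, site_lnl_b):
    """Paired test on per-site log-likelihoods of two trees.

    Returns (delta, standard_error, z, two_sided_p), where delta is the
    total log-likelihood of tree A minus that of tree B.
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both trees need one value per alignment site")
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(diffs)
    delta = sum(diffs)
    # Standard error of a sum of n per-site differences.
    se = math.sqrt(n) * statistics.stdev(diffs)
    if se == 0.0:
        raise ValueError("per-site differences are constant; test undefined")
    z = delta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return delta, se, z, p

# Hypothetical per-site log-likelihoods for two trees (not real results).
tree_a = [-1.0, -2.0, -1.5, -3.0]
tree_b = [-1.2, -2.1, -1.9, -3.0]
print(kh_style_test(tree_a, tree_b))
```

A small p-value would suggest that one tree fits the data significantly better than the other. A large one means the likelihood difference is within the noise.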

Also, please do not post unsubstantiated opinions here, in particular as
long as you apparently do not understand that the support value
calculations in the two programs are based on entirely different
approaches and mathematical concepts.

Alexis