Missing Data Revisited


Jared Grummer

Feb 6, 2017, 3:58:12 PM
to raxml
Hello,

I have read the various threads here on the impact of missing data on RAxML (and other phylogenetic) analyses, as well as some interesting papers (like this one). In general, as long as the missing data are not systematic, it seems that missing data should not affect topology too much (though it can affect model parameter estimates). However, I have a dataset of >100 loci, and when I examine the gene trees, the individuals/alleles that are on the longest branches are often the ones with ~50% missing data in that alignment. I'm trying to figure out why this is. Maybe long branch attraction?

I'm mostly just hoping to know that if an allele has ~50% missing sequence data for a given locus (typically on the ends of the alignments due to lower sequencing depth), that that's not going to affect the position of that allele in the gene tree.

Many thanks!
Jared

Alexandros Stamatakis

Feb 6, 2017, 4:05:32 PM
to ra...@googlegroups.com
Dear Jared,

> I have read the various threads here on the impact of missing data on
> RAxML (and other phylogenetic) analyses, as well as some interesting
> papers (like this one
> <https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/mbe/30/1/10.1093/molbev/mss208/2/mss208.pdf>).

> In general, so as long as the missing data are not systematic, it seems
> that missing data should not affect topology too much (though it can
> affect model parameter estimates). However, I have a dataset of >100
> loci, and when I examine the gene trees, the individuals/alleles that
> are on the longest branches are often the ones with ~50% missing data in
> that alignment. I'm trying to figure why this is, maybe long branch
> attraction?

No, it's a property of the models I am afraid, but see here for a kind
of fix:

https://academic.oup.com/bioinformatics/article/32/9/1331/1744346/Prediction-of-missing-sequences-and-branch-lengths

We have observed a similar behavior in some simulated datasets recently
that also dealt with very closely related sequences.

If you have pop gen type of data, then also correcting for ascertainment
bias might help:

https://academic.oup.com/sysbio/article/64/6/1032/1669226/Short-Tree-Long-Tree-Right-Tree-Wrong-Tree-New

> I'm mostly just hoping to know that if an allele has ~50% missing
> sequence data for a given locus (typically on the ends of the alignments
> due to lower sequencing depth), that that's not going to affect the
> position of that allele in the gene tree.

I believe the best way to go about it is via simulation:

1. Simulate data on the empirical tree and model you have
2. Superimpose the missing gap pattern that reflects your empirical gap
pattern onto the simulated data
3. Infer the tree and see if you get it right.
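
Step 2 above can be sketched as follows. This is only a minimal illustration, assuming that both alignments are held as plain dicts mapping taxon name to aligned sequence, that the simulated and empirical alignments have identical taxa and lengths, and that `?`, `-` and `N` all count as missing. The function name `superimpose_missing` is hypothetical and not part of RAxML or any simulator.

```python
# Characters treated as missing data (an assumption; adjust to your data).
MISSING = set("?-Nn")


def superimpose_missing(simulated, empirical, fill="?"):
    """Copy the missing-data pattern of `empirical` onto `simulated`.

    Both arguments map taxon name -> aligned sequence string.
    Wherever the empirical sequence is missing, the simulated
    character is replaced by `fill`.
    """
    masked = {}
    for taxon, sim_seq in simulated.items():
        emp_seq = empirical[taxon]
        if len(emp_seq) != len(sim_seq):
            raise ValueError(f"length mismatch for {taxon}")
        masked[taxon] = "".join(
            fill if e in MISSING else s for s, e in zip(sim_seq, emp_seq)
        )
    return masked


# Toy example: allele a1 is missing data at both ends of the locus.
empirical = {"a1": "??ACGT??", "a2": "ACGTACGT"}
simulated = {"a1": "TTTTCCCC", "a2": "GGGGAAAA"}
masked = superimpose_missing(simulated, empirical)
print(masked["a1"])  # ??TTCC??
```

The masked alignment would then go through the same tree inference as the empirical data, so that the inferred tree can be compared with the tree used for simulation.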

I am sure reviewers would love to see such an experiment in a paper (at
least I would :-) ).

Alexis


--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Jared Grummer

Feb 6, 2017, 4:28:57 PM
to raxml
Hi Alexis,

Thanks for your fast response. It's interesting because I didn't have this issue until phasing each locus. I guess the solution in the case of your first paper is to use that ForeSeqs program on each locus.

My data are within a species group, so pretty pop. gen.-y. I am aware of that paper; my advisor is the lead author! My data are longer sequences, more Sanger-like (~500-600bp), vs. short RAD loci. But you still think that I should explore the ASC model? And running an overly complex model such as GTR vs. a simpler model (HKY maybe?) shouldn't exacerbate this problem, correct?

I'm inferring a concatenated tree right now to see if those results are in line with other data. But, I have no good estimate of an "empirical tree", so I think it would be hard to do the simulation you suggest, though I get your thought process. That would certainly be valuable!

Thanks again for your help!
Jared

Jared Grummer

Feb 6, 2017, 6:02:42 PM
to raxml
I'd also like to add that it doesn't seem to be purely the amount of missing data. That is, some alleles have about the same amount of missing data, but they are not on the long branch with the other alleles of the individual with missing data. Instead, these other alleles are at the base of the phylogeny with alleles that have ~complete data. The problem seems complicated.

Grimm

Feb 7, 2017, 7:51:44 AM
to raxml
Hej,

just a thought for testing the missing data effect without being able to simulate (in case you're really worried about the placement of the accessions with a lot of missing data):
Why not just run a series of analyses on the same data, excluding all accessions with e.g. more than 10%, more than 25%, and more than 50% missing loci?
If there is little or no missing data effect, all the trees should be mutually compatible.
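
A minimal sketch of building such reduced data sets, under some loud assumptions: the alignment is a plain dict of taxon name to sequence, missingness is measured as the fraction of `?`, `-` or `N` characters per sequence (rather than missing loci per accession), and the helper names are hypothetical.

```python
# Characters treated as missing data (an assumption; adjust to your data).
MISSING = set("?-Nn")


def missing_fraction(seq):
    """Fraction of characters in `seq` that are missing (1.0 for an empty sequence)."""
    return sum(c in MISSING for c in seq) / len(seq) if seq else 1.0


def filter_alignment(alignment, max_missing):
    """Keep only the sequences whose missing fraction is at most `max_missing`."""
    return {
        taxon: seq
        for taxon, seq in alignment.items()
        if missing_fraction(seq) <= max_missing
    }


# Toy alignment with 0%, 20%, 50% and 70% missing characters.
alignment = {
    "ind1_a": "ACGTACGTAC",
    "ind1_b": "??GTACGTAC",
    "ind2_a": "?????CGTAC",
    "ind2_b": "???????TAC",
}

# One reduced alignment per cutoff; each would be analysed separately.
subsets = {cut: filter_alignment(alignment, cut) for cut in (0.10, 0.25, 0.50)}
for cut, sub in subsets.items():
    print(cut, sorted(sub))
```

Each reduced alignment is then run through the same tree search, and the resulting trees are compared on their shared taxa.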
To compare the topologies you can
a) either prune all trees to the same taxon set and run AU tests, compute RF distances, or highlight shared and non-shared branches (Phangorn for R has just implemented functions to highlight non-shared branches/edges in trees and networks)
b) or overlap the trees using a super network approach (as implemented in SplitsTree, www.splitstree.org); the resulting network should be free of boxes
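
Option a) can be illustrated with a small, dependency-free sketch of pruning two trees to their shared taxa and counting the bipartitions found in only one of them (the unnormalised Robinson-Foulds distance). The representation is an assumption made for brevity: trees are nested tuples of leaf names instead of Newick files, and the function names are hypothetical. For real data an established package (e.g. Phangorn) is the safer choice.

```python
def prune(tree, keep):
    """Return `tree` restricted to the leaves in `keep` (None if nothing is left)."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    # Collapse nodes left with a single child after pruning.
    return kids[0] if len(kids) == 1 else tuple(kids)


def leaves(tree):
    """Set of leaf names below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a
    fixed reference leaf, so equal splits compare equal.
    """
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one tree, on the shared taxa."""
    shared = leaves(tree_a) & leaves(tree_b)
    return len(splits(prune(tree_a, shared)) ^ splits(prune(tree_b, shared)))


full = ((("A", "B"), "C"), ("D", ("E", "X")))   # includes gappy accession X
reduced = (("A", "B"), ("C", ("D", "E")))       # X excluded, same backbone
other = (("A", "C"), ("B", ("D", "E")))         # X excluded, B and C swapped

print(rf_distance(full, reduced))  # 0
print(rf_distance(full, other))    # 2
```

A distance of 0 on the shared taxa means that removing the gappy accessions did not change the backbone; anything larger points to branches worth inspecting.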

Another option would be to place the missing data accessions using the EPA implemented in RAxML, with a backbone tree based on a taxon set that is well covered across loci. If there is a missing data effect on the overall tree's topology, EPA may find a different affinity for a gappy accession than the comprehensive tree analysis.

Cheers, Guido

PS Regarding the exclusion: in my experience with pretty odd data, half-sequenced loci don't mess with placement, unless the sequenced half is from a (very) conserved part of the locus and there is no other data to compensate for this "short-branch culling" (I know of plant oligogene data sets including partial 18S rDNAs or rbcL with little backup data that completely misplace the corresponding taxa; in that case the EPA will fail as much as the tree)

 

Jared Grummer

Feb 7, 2017, 11:40:02 AM
to raxml
Hi Grimm,

Thanks for your suggestions. I did a simpler test where I added in data for the sequences that had many ?s, and it drastically altered their placement in the phylogeny. These loci are not super variable (most are from the UCE probes dataset), so I guess that lacking data at a few variable sites throws the analysis off. And just to be clear, this isn't a case of individuals missing completely from loci to make a "hole-y" matrix, but of missing data within a locus.

I'll try ForeSeqs and see if that helps at all. The other problem is that these are individual nuclear loci at the population level, so I don't expect loci to have the same branching pattern due to gene tree heterogeneity.

Thanks,
Jared

Alexandros Stamatakis

Feb 7, 2017, 2:03:55 PM
to ra...@googlegroups.com
This all sounds complicated. Could you maybe send us some sort of
graphical representation of the missing data pattern such that we can
get a better feeling for what it looks like?
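
One quick way to produce such a picture without any plotting library is a text map of the alignment. The sketch below is only an illustration: it assumes the alignment is a dict of taxon name to sequence, treats `?`, `-` and `N` as missing, and the function name `missing_map` and the window size are arbitrary choices.

```python
# Characters treated as missing data (an assumption; adjust to your data).
MISSING = set("?-Nn")


def missing_map(alignment, window=10):
    """Return one text row per sequence.

    Each cell summarises `window` alignment columns:
    '#' = more than half of the window is missing, '.' = otherwise.
    """
    width = max(len(name) for name in alignment)
    rows = []
    for name, seq in alignment.items():
        cells = []
        for start in range(0, len(seq), window):
            chunk = seq[start:start + window]
            frac = sum(c in MISSING for c in chunk) / len(chunk)
            cells.append("#" if frac > 0.5 else ".")
        rows.append(f"{name:<{width}}  {''.join(cells)}")
    return "\n".join(rows)


# Toy locus: the second allele lacks data at both ends.
alignment = {
    "ind1_a": "ACGT" * 10,
    "ind2_b": "?" * 12 + "ACGT" * 4 + "?" * 12,
}
print(missing_map(alignment, window=4))
```

Missing data concentrated at the alignment ends shows up as blocks of `#` at the start and end of a row.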

ForeSeqs will essentially just fix the branch lengths, but if the tree
on which you correct branch lengths is wrong, it can't be fixed by
ForeSeqs ...

Alexis

Alexandros Stamatakis

Feb 7, 2017, 2:08:25 PM
to ra...@googlegroups.com
Hi Jared,

> Thanks for your fast response. It's interesting because I didn't have
> this issue until phasing each locus. I guess the solution in the case of
> your first paper is to use that ForeSeqs program on each locus.

Yes, maybe, hopefully.

> My data are within a species group, so pretty pop. gen.-y. I am aware of
> that paper, my advisor is the lead author!

Oops, I knew your name somehow sounded familiar.

> My data are longer sequences,
> more Sanger-like (~500-600bp), vs. short RAD loci. But you still think
> that I should explore the ASC model?

It depends, if you know how many constant sites you are excluding then
maybe, yes.

> And running an overly complex model
> such as GTR vs. a simpler model (HKY maybe?) shouldn't exacerbate this
> problem, correct?

Probably not, but you should test for this (you know how reviewers are).

It may be important to test whether a model of rate heterogeneity is
required; some pop gen datasets perform equally well with
homogeneous models. That's what we saw in the paper with Adam.

> I'm inferring a concatenated tree right now to see if those results are
> in line with other data. But, I have no good estimate of an "empirical
> tree", so I think it would be hard to do the simulation you suggest,
> though I get your thought process. That would certainly be valuable!

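You could still assess how the missing data pattern changes the relative
truth, i.e., treat your best tree estimate as the reference and check
how much the inferred tree changes once the pattern is applied.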

>
> Thanks again for your help!

:-)

Alexis