Hi all,

I've been looking all over the literature, but I haven't been able to find a complete and lucid discussion of this: is it theoretically valid to use concatenated SNP data in model-based analyses like divergence dating and demographic reconstruction (e.g. BSP)?

Imagine a next-gen, RAD-tag/RRL dataset with about 150 individuals and 3k loci with one SNP/locus, from which only the SNPs are concatenated. And one good fossil for calibration.

In terms of constructing a phylogeny, SNPs in distance and parsimony methods seem perfectly valid -- but they don't get you very far in reconstructing biogeographic history. How about SNPs in ML or Bayesian construction methods? E.g., Emerson et al. (2010) concatenated several thousand SNPs and made ML and Bayesian trees. They used the GTR (no I, no G) model, which seems appropriate for a dataset consisting entirely of variable positions. In contrast, Wagner et al. (2013) retained all sequence data -- including the invariant sites that make up the vast majority of the sequences -- and concatenated them into supermatrices in excess of 30k-5mil bases; their reasoning being, ostensibly, that the models of molecular evolution were developed for "full" sequence data (not concatenated SNPs), so "full" sequence data should be used. From those they built ML trees using the GTR+G model.

Are these valid uses of SNP data? And specifically, does use of a model such as GTR in the case of Emerson et al. "account" for the fact that concatenated SNPs aren't the same as actual sequences? Extending from that, if it's valid to use concatenated SNPs for Bayesian tree construction, is it also valid to use them for divergence dating and demographic reconstructions, since those are similarly based on models of molecular evolution?

Please overlook the necessity of concatenation here, unless you think it's absolutely relevant. The alternative is a multigene analysis using thousands of loci with VERY little information each -- at the very least, that is not really computationally feasible right now.

Finally, I am aware that the program SNAPP is meant to build Bayesian phylogenies from SNP/AFLP data -- but my goal is to use a SNP dataset for more complex historical biogeographic analyses.
Sorry for the length, but immense thanks to those who slogged through,
Angela

--
Emerson, K. J., C. R. Merz, J. M. Catchen, P. A. Hohenlohe, W. A. Cresko, W. E. Bradshaw, and C. M. Holzapfel. 2010. Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences of the United States of America 107:16196-16200.
Wagner, C. E., I. Keller, S. Wittwer, O. M. Selz, S. Mwaiko, L. Greuter, A. Sivasundar, and O. Seehausen. 2013. Genome-wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Molecular Ecology 22:787-798.
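For readers who want the "one SNP per locus, concatenated" matrix from the post spelled out, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the toy loci, and the choice to take the first variable column of each locus are hypothetical, and this is not the pipeline used by Emerson et al. or Wagner et al.

```python
from typing import Dict, List

def first_variable_column(locus: Dict[str, str]) -> int:
    """Return the index of the first column with more than one called base, or -1."""
    length = len(next(iter(locus.values())))
    for col in range(length):
        # Ignore missing data (N) and gaps when deciding whether a column varies.
        bases = {seq[col] for seq in locus.values()} - {"N", "-"}
        if len(bases) > 1:
            return col
    return -1

def concatenate_snps(loci: List[Dict[str, str]]) -> Dict[str, str]:
    """Keep one SNP per locus and join them into one string per individual."""
    samples = sorted(loci[0])
    matrix = {name: [] for name in samples}
    for locus in loci:
        col = first_variable_column(locus)
        if col < 0:
            continue  # an invariant locus contributes nothing to a SNP-only matrix
        for name in samples:
            matrix[name].append(locus[name][col])
    return {name: "".join(chars) for name, chars in matrix.items()}

# Toy data (hypothetical): three individuals, three short aligned loci.
loci = [
    {"ind1": "ACGTA", "ind2": "ACGTA", "ind3": "ACCTA"},
    {"ind1": "TTGCA", "ind2": "TTGCA", "ind3": "TTGCA"},  # invariant locus
    {"ind1": "GGATC", "ind2": "GAATC", "ind3": "GAATC"},
]
snp_matrix = concatenate_snps(loci)
for name, seq in snp_matrix.items():
    print(name, seq)
```

Note that the invariant locus and every invariant column vanish from the result, which is exactly the information whose absence the question is about.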
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
Chris,
Thanks for your thoughts. Your idea of concatenating groups of loci to make new matrices is interesting. (Actually, one of my committee members had a similar thought, that loci showing the same topology [which would be VERY basic with only 1-2 SNPs per fragment] could be concatenated and thought of as genetic blocks which, for one reason or another [physical linkage, statistical linkage, chance, etc.], are giving you the same story of phylogenetic history.) Are you making a case for retaining the invariant sites in RAD data, and not just boiling down to the SNPs?
I feel comfortable enough using SNPs just for tree-building -- if for no other reason than it seems to be passing in the literature right now. But it's the biogeographic tests that are giving me pause.
Do you think we could use concatenated SNPs in the same way as Sanger sequence data for: divergence dating? Bayesian skyline plots? Mutation rate estimation and coalescent simulations for phylogeographic hypothesis testing, e.g. timing of divergence or topology/directionality? Really, there's nothing stopping me from throwing the concatenated SNPs into any of those analyses in the same way I use sequences . . . well, nothing except the nagging feeling that it's an invalid use of that type of data.
Angela
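One way to make that nagging feeling concrete: per-site divergence depends on how many invariant sites sit in the denominator, so a SNP-only matrix inflates it, and anything scaled by a per-site rate (branch lengths, dates, population sizes) inherits the inflation unless the model corrects for it. The sketch below is a toy illustration only, using a plain p-distance on made-up sequences; the names are hypothetical, and it is not how BEAST or any ML program computes a likelihood.

```python
from typing import Dict, List

def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def variable_columns(alignment: Dict[str, str]) -> List[int]:
    """Indices of columns where at least two sequences differ."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

# Toy "full" alignment (hypothetical): 10 sites, 2 of them variable.
full = {
    "ind1": "ACGTACGTAC",
    "ind2": "ACGTACGTAT",
    "ind3": "ACCTACGTAT",
}

# SNP-only version: the same data with every invariant column removed.
cols = variable_columns(full)
snps_only = {name: "".join(seq[i] for i in cols) for name, seq in full.items()}

d_full = p_distance(full["ind1"], full["ind3"])
d_snps = p_distance(snps_only["ind1"], snps_only["ind3"])
print(f"full alignment: {d_full:.2f} differences per site")
print(f"SNPs only:      {d_snps:.2f} differences per site")
```

The same two differences give 0.20 per site with the invariant sites and 1.00 per site without them. A calibration can rescale the tree overall, but the distortion is not uniform across branches, which is why the likelihood is usually conditioned on only variable sites being observed when invariant sites are dropped.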
"My feeling is that, assuming that your alleles are monophyletic within species and all incongruence is due to incongruent topologies among genes, then the greatest affect of using consensus sequences rather than alleles would be on population sizes and branch lengths. Recall that population sizes are estimated based on nucleotide diversity distributed on coalescent patterns, so using ambiguity codes would restrict your ability to estimate this parameter. The coalescent patterns are connected to the divergence times though both gene tree branch lengths and population sizes. So, unless your tree has some very short internal branches, and all you are interested in is the topology, then you might be safe. If alleles are not monophyletic within species, then I would be less comfortable using unphased alleles."