Re: concatenated SNPs for historical biogeography?

Christopher Blair

May 24, 2013, 4:24:08 PM

to beast-users

Hi Angela,

Your questions are indeed valid and I am happy to give my two cents. If one is interested in building trees using standard methods (i.e. haplotypes), I think that there are a variety of ways to proceed. Because RAD read lengths are still relatively short (~200-250 bp), I think that some sort of concatenation will be required. You must also decide how you wish to incorporate information from heterozygous alleles (using a consensus sequence, for example). One approach is to concatenate all RAD loci for each individual to create a supermatrix and then run standard concatenated phylogenetic analyses. A second approach, which I have not seen much discussion of, would be to create sets of concatenated loci with enough variation to obtain some phylogenetic resolution. You could then define each of these sets as a 'locus' to perform coalescent-based species tree inference. I personally do not see a difference between concatenating RAD loci for phylogenetics versus concatenating loci obtained from Sanger sequencing. What kind of biogeographic analyses are you interested in? Many analyses simply require a tree or set of trees along with locality info for your terminals. I'd have a look at the Rubin et al. (2012) PLoS ONE paper if you have not. Julian Catchen also just published a paper in Mol. Ecol. discussing new methods implemented in Stacks.
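For readers who want to see the two concatenation approaches concretely, here is a minimal Python sketch. It is only an illustration under stated assumptions: each locus is already aligned and stored as a dict of sample name to sequence, samples missing from a locus are padded with "N", and the function names (concatenate_loci, concatenate_in_blocks) and toy data are hypothetical. Grouping loci into fixed-size blocks is a simplification; choosing sets "with enough variation" would need an extra criterion.

```python
from typing import Dict, List


def concatenate_loci(loci: List[Dict[str, str]], missing: str = "N") -> Dict[str, str]:
    """Join per-locus alignments into one supermatrix, padding absent samples."""
    samples = sorted({name for locus in loci for name in locus})
    parts: Dict[str, List[str]] = {name: [] for name in samples}
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        width = lengths.pop()
        for name in samples:
            # A sample with no read at this locus is filled with missing data.
            parts[name].append(locus.get(name, missing * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


def concatenate_in_blocks(loci: List[Dict[str, str]], block_size: int,
                          missing: str = "N") -> List[Dict[str, str]]:
    """Second approach: build several smaller concatenated 'loci' of block_size each."""
    return [concatenate_loci(loci[i:i + block_size], missing)
            for i in range(0, len(loci), block_size)]


# Hypothetical toy data: two short RAD loci; "ind3" has no read at the second locus.
locus_a = {"ind1": "ACGT", "ind2": "ACGA", "ind3": "ACGT"}
locus_b = {"ind1": "TTG", "ind2": "TCG"}
matrix = concatenate_loci([locus_a, locus_b])
for name, seq in matrix.items():
    print(name, seq)
```

Each dict returned by concatenate_in_blocks could then be written out as its own alignment and treated as one 'locus' in a species tree analysis.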

Chris

On Wed, May 22, 2013 at 7:09 PM, Angela <adho...@gmail.com> wrote:

Hi all,

I've been looking all over the literature, but I haven't been able to find a complete and lucid discussion of this: is it theoretically valid to use concatenated SNP data in model-based analyses like divergence dating and demographic reconstruction (e.g. BSP)?

Imagine a next-gen, RAD-tag/RRL dataset with about 150 individuals and 3k loci with one SNP/locus, from which only the SNPs are concatenated. And one good fossil for calibration.

In terms of constructing a phylogeny, SNPs in distance and parsimony methods seem perfectly valid -- but, they don't get you very far in reconstructing biogeographic history. How about SNPs in ML or Bayesian construction methods? e.g., Emerson et al. (2010) concatenated several thousand SNPs and made ML and Bayesian trees. They used the GTR (no I, no G) model, which seems appropriate for a dataset consisting entirely of variable positions. In contrast, Wagner et al. (2013) retained all sequence data -- including invariant sites comprising the vast majority of the sequences -- and concatenated them into supermatrices in excess of 30k-5mil bases; their reasoning being, ostensibly, that the models of molecular evolution were developed for "full" sequence data (not concatenated SNPs) and therefore "full" sequence data should be used. From those they built ML trees using the GTR+G model.

Are these valid uses of SNP data? And specifically, does use of a model such as GTR in the case of Emerson et al. "account" for the fact that concatenated SNPs aren't the same as actual sequences? Extending from that, if it's valid to use concatenated SNPs for Bayesian tree construction, is it valid also to use them for divergence dating and demographic reconstructions, since those are similarly based on models of molecular evolution?

Please overlook the necessity of concatenation here, unless you think it's absolutely relevant. The alternative is a multigene analysis using thousands of loci with VERY little information each -- at the very least, that is not really computationally feasible right now.

Finally, I am aware that program SNAPP is meant to build Bayesian phylogenies from SNP/AFLP data -- but my goal is to use a SNP dataset for more complex historical biogeographic analyses.

Sorry for the length, but immense thanks to those who slogged through,

Angela

Emerson, K. J., C. R. Merz, J. M. Catchen, P. A. Hohenlohe, W. A. Cresko, W. E. Bradshaw, and C. M. Holzapfel. 2010. Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences of the United States of America 107:16196-16200.

Wagner, C. E., I. Keller, S. Wittwer, O. M. Selz, S. Mwaiko, L. Greuter, A. Sivasundar, and O. Seehausen. 2013. Genome-wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Molecular Ecology 22:787-798.


--
Christopher Blair, Ph.D.
Postdoctoral Associate
Department of Biology
Duke University, Box 90338
BioSci 130 Science Drive
Durham, NC 27708
ph: 919-613-8727
Christopher.Blair@duke.edu

Eduardo Castro Nallar

May 24, 2013, 4:32:45 PM

to beast...@googlegroups.com

Hi Angela,

That's a really interesting question. I've been thinking about it lately, but I have no good answer yet, other than that branch length estimates are probably going to be biased, and presumably so will any dating derived from them.

In bacterial comparative genomics this is an issue as well, and I have linked below a paper in which they use BEAST to date nodes using a SNP dataset.

Looking forward to other comments,

Eduardo

http://www.nature.com/ng/journal/v44/n9/full/ng.2369.html

Brian Muchmore

May 28, 2013, 1:05:04 AM

to beast...@googlegroups.com

Great questions, and I wish I could provide good answers for you, but instead I will add to the questions:

I have read before, and it is mentioned here, that if you don't take into account invariant sites then branch length estimates will be biased, but I don't understand why. Can anybody provide a good paper or reason?
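One way to see the bias is with a pairwise distance rather than a full likelihood analysis. The sketch below uses the standard Jukes-Cantor (JC69) correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ between two sequences. The numbers are hypothetical, and this is not what BEAST computes internally; it only illustrates that the same count of differences implies a very different per-site distance depending on whether invariant sites are counted in the denominator.

```python
import math


def jc69_distance(n_diff: int, n_sites: int) -> float:
    """Jukes-Cantor distance (expected substitutions per site) between two sequences."""
    p = n_diff / n_sites  # observed proportion of differing sites
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined when p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of individuals that differ at 10 positions.
n_diff = 10
d_full = jc69_distance(n_diff, 20000)  # invariant sites kept: p = 0.0005
d_snps = jc69_distance(n_diff, 50)     # SNP-only matrix:      p = 0.2

print(f"full sequence: {d_full:.6f} substitutions/site")
print(f"SNPs only:     {d_snps:.6f} substitutions/site")
print(f"inflation:     {d_snps / d_full:.0f}x")
```

With the invariant sites removed, every branch looks hundreds of times longer per site, so any rate or date calibrated per site would be distorted unless the model is told that constant sites were discarded.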

Angela, how do you intend to create a multiple sequence alignment with SNP data? I am leaning toward doing something like the following. Assuming this is a section of variable sequence in my population, actg(C/T)actg, this is how my MSA would look:

person 1: actgCTactg

person 2: actgCCactg

person 3: actgTTactg

But, I don't know if that is valid, and more importantly, I don't know how to argue that this is a valid way of dealing with SNP data.

Angela

May 28, 2013, 4:39:40 PM

to beast...@googlegroups.com

Chris,

Thanks for your thoughts. Your idea of concatenating groups of loci to make new matrices is interesting. (Actually, one of my committee members had a similar thought, that loci showing the same topology [which would be VERY basic with only 1-2 SNPs per fragment] could be concatenated and thought of as genetic blocks which, for one reason or another [physical linkage, statistical linkage, chance, etc.], are giving you the same story of phylogenetic history.) Are you making a case for retaining the invariant sites in RAD data, and not just boiling down to the SNPs?
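A rough sketch of the "same topology" grouping idea, under strong simplifying assumptions: one biallelic SNP per locus, one allele call per sample (no heterozygotes), and no missing data. Under those assumptions each SNP implies a single split of the samples into two groups, and loci implying the same split can be collected into one block. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List

Split = FrozenSet[FrozenSet[str]]


def snp_split(genotypes: Dict[str, str]) -> Split:
    """Return the two-way split of samples implied by one biallelic SNP."""
    groups = defaultdict(set)
    for sample, allele in genotypes.items():
        groups[allele].add(sample)
    if len(groups) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    return frozenset(frozenset(members) for members in groups.values())


def group_loci_by_split(snps: Dict[str, Dict[str, str]]) -> Dict[Split, List[str]]:
    """Collect locus names that support the same split of the samples."""
    blocks = defaultdict(list)
    for locus, genotypes in snps.items():
        blocks[snp_split(genotypes)].append(locus)
    return dict(blocks)


# Hypothetical toy data: locus1 and locus2 both separate {s1, s2} from {s3, s4}.
snps = {
    "locus1": {"s1": "C", "s2": "C", "s3": "T", "s4": "T"},
    "locus2": {"s1": "G", "s2": "G", "s3": "A", "s4": "A"},
    "locus3": {"s1": "C", "s2": "T", "s3": "T", "s4": "T"},
}
blocks = group_loci_by_split(snps)
for split, loci in blocks.items():
    print(sorted(sorted(side) for side in split), loci)
```

Real data with heterozygous calls and missing samples would need a looser notion of "same story" than exact equality of splits.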

I feel comfortable enough using SNPs just for tree-building -- if for no other reason than it seems to be passing in the literature right now. But it's the biogeographic tests that are giving me pause.

Do you think we could use concatenated SNPs in the same way as Sanger sequence data for: divergence dating? Bayesian skyline plots? Mutation rate estimation and coalescent simulations for phylogeographic hypothesis testing, e.g. timing of divergence or topology/directionality? Really, there's nothing stopping me from throwing the concatenated SNPs into any of those analyses in the same way I use sequences . . . well, nothing except the nagging feeling that it's an invalid use of that type of data.

Angela

May 28, 2013, 4:43:51 PM

to beast...@googlegroups.com

Fantastic, I hadn't seen SNPs used like this yet. Gives me more to chew on. Thanks for linking it, Eduardo!

Angela

Brian Muchmore

May 28, 2013, 9:28:42 PM

to beast...@googlegroups.com

After thinking about it some, I think I have answered my own questions. But feedback is welcome.

Variable sites - in this case SNPs - are the meat of any phylogenetic analysis. However, variable sites need context, which is why people include invariant sites. Absent other influences (e.g. selection, recombination), the mutation rate is just a probability, so it is also important for BEAST to know the probability that a base won't change. I wonder if the BEAST creators could add a context box in BEAUti. What I mean is that a person could use a SNP data set and then state the length of the sequence the data is derived from, e.g. I upload a 50 bp concatenated SNP data set and then tell BEAUti this is coming out of a 20,000 bp region. Would that work?
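Purely as an illustration of what such "context" could look like outside BEAUti, the sketch below pads a SNP-only alignment with constant columns until it reaches a stated region length. Everything here is an assumption: the function name is hypothetical, the base composition of the missing invariant sites is guessed (equal frequencies by default), and this is not an existing BEAUti feature. Whether padding like this is an adequate correction is exactly the open question in this thread.

```python
from typing import Dict, Optional


def pad_with_invariant_sites(snps: Dict[str, str], region_length: int,
                             base_freqs: Optional[Dict[str, float]] = None) -> Dict[str, str]:
    """Append constant columns so the alignment matches the stated region length."""
    n_snps = len(next(iter(snps.values())))
    if any(len(seq) != n_snps for seq in snps.values()):
        raise ValueError("all SNP sequences must have the same length")
    n_invariant = region_length - n_snps
    if n_invariant < 0:
        raise ValueError("region_length is shorter than the SNP alignment")
    # Assumed composition of the invariant sites; equal frequencies unless supplied.
    freqs = base_freqs or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    bases = sorted(freqs)
    counts = {b: int(n_invariant * freqs[b]) for b in bases}
    # Rounding leftovers go to the first base so the total length is exact.
    counts[bases[0]] += n_invariant - sum(counts.values())
    padding = "".join(b * counts[b] for b in bases)
    # The same constant columns are appended to every sample.
    return {name: seq + padding for name, seq in snps.items()}


# Hypothetical toy data: 3 SNPs said to come from an 11 bp region.
toy = {"ind1": "ACT", "ind2": "GCT", "ind3": "ATT"}
padded = pad_with_invariant_sites(toy, region_length=11)
for name, seq in padded.items():
    print(name, seq)
```

For the example in the post, the call would be pad_with_invariant_sites(alignment, region_length=20000) on a 50-column SNP matrix, giving 19,950 constant columns.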

Secondly, there seem to be three options for dealing with heterozygous data from diploid organisms: you can create a consensus sequence for each individual, two separate sequences for each individual, or one sequence where every base is represented twice. If one decides to create a consensus sequence, then you can use IUPAC ambiguity codes (http://droog.gs.washington.edu/parc/images/iupac.html) to model SNP positions. For example, the hypothetical sequence actg(A/C)actg can become actgMactg (consensus), actgAactg + actgCactg (two sequences), or aaccttggACaaccttgg (every base doubled). Here are some further thoughts on this from another thread:

"My feeling is that, assuming that your alleles are monophyletic within species and all incongruence is due to incongruent topologies among genes, then the greatest effect of using consensus sequences rather than alleles would be on population sizes and branch lengths. Recall that population sizes are estimated based on nucleotide diversity distributed on coalescent patterns, so using ambiguity codes would restrict your ability to estimate this parameter. The coalescent patterns are connected to the divergence times through both gene tree branch lengths and population sizes. So, unless your tree has some very short internal branches, and all you are interested in is the topology, then you might be safe. If alleles are not monophyletic within species, then I would be less comfortable using unphased alleles."
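The three encodings of a diploid genotype described above (IUPAC consensus, two separate sequences, every base doubled) can be sketched as follows. This is a minimal illustration: a genotype is represented as a list of (allele, allele) pairs, the function names are hypothetical, and the assignment of alleles to the two output sequences is arbitrary because the data are assumed unphased.

```python
from typing import List, Tuple

Genotype = List[Tuple[str, str]]

# IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def consensus(genotype: Genotype) -> str:
    """Option 1: one sequence, heterozygous sites written as ambiguity codes."""
    return "".join(IUPAC[frozenset(pair)] for pair in genotype)


def two_haplotypes(genotype: Genotype) -> Tuple[str, str]:
    """Option 2: two sequences per individual (phase here is arbitrary)."""
    return ("".join(a for a, _ in genotype), "".join(b for _, b in genotype))


def doubled(genotype: Genotype) -> str:
    """Option 3: one sequence in which every site is written twice."""
    return "".join(a + b for a, b in genotype)


# The example from the post, actg(A/C)actg, written in upper case.
genotype = list(zip("ACTGAACTG", "ACTGCACTG"))
print(consensus(genotype))
print(two_haplotypes(genotype))
print(doubled(genotype))
```

The consensus form keeps one row per individual but hides which allele is which, which is the limitation on population size estimates that the quoted reply points to.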
