Hi Pedro,
> I have data from a whole-genome resequencing project for which I would like
> to proceed with phylogenetic analysis with RAxML. I searched this group for
> similar questions but it seems that other posts are related to GBS and not
> on whole-genome resequencing
> (e.g. https://groups.google.com/forum/#!topic/raxml/DUHB_6MqIXE) which have
> thus motivated me to post this as a new question.
>
> The data I have is from the whole-genome and not only for SNPs but it would
> be impossible (I guess) to perform a phylogeny on the whole data
This is not impossible. For this purpose we have developed ExaML and,
if you prefer Bayesian analyses, ExaBayes; see:
http://sco.h-its.org/exelixis/software.html
> and as so
> only the polymorphic sites (SNPs) that can be confidently called on all
> individuals are being used for this task. The new versions of RAxML have an
> ascertainment bias correction model that seems properly designed for this
> kind of situation; however, I would like to clarify some doubts I have about
> the application of this new model to my data.
>
> 1) my understanding of ascertainment bias (which might not be correct... )
> is that only positions that are present in a reference panel are being used
> for genotype/SNP calling in the sample under study, meaning that there is an
> intrinsic bias away from low frequency variants. Is the ascertainment bias
> correction in RAxML also based on these assumptions? Or else,
> in which way is it different from this?
> 2) Does the ascertainment bias correction only have an effect on branch
> lengths or does it also affect the topology?
Regarding 2: according to our tests it mostly affects branch lengths,
but it may affect topologies as well.
Regarding 1: I think you are conflating data sampling with asc. bias
correction here. The asc. bias correction simply accounts for the fact
that, in one way or another, only variable sites have been sampled, while
invariable ones are not contained in the sample even though they exist.
> The bottom line for these questions is actually to know if it is in fact
> preferable to use the ascertainment bias correction on my data for running
> RAxML, since I should not expect to have any bias (or at least this bias
> should be minimum) in the frequency of the variants, and/or if not what
> would be the best approach to do phylogenetic inference on these situations.
As far as I understand, you have a couple of whole-genome sequences with
mostly invariable sites? If this is the case you don't need an asc. bias
correction, since the invariable sites are there.
Computationally, having a large number of invariable sites is not
expensive, since sites consisting of the same DNA character can be
compressed into one single site pattern and be assigned a higher weight.
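
To illustrate what this compression means, here is a minimal Python sketch.
It is not RAxML's actual code; the function name compress_site_patterns and
the toy alignment are made up for illustration. It collapses identical
alignment columns into distinct site patterns and counts how often each one
occurs (its weight).

```python
from collections import Counter


def compress_site_patterns(sequences):
    """Collapse identical alignment columns into {pattern: weight}.

    `sequences` is a list of aligned sequences of equal length.
    Hypothetical helper for illustration, not part of RAxML.
    """
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) walks the alignment column by column; each column
    # becomes a string such as "AAA" (one character per taxon).
    return Counter("".join(column) for column in zip(*sequences))


# Toy alignment: 3 taxa, 8 sites, only one variable site.
alignment = [
    "AAAACGTA",
    "AAAACGTA",
    "AAAATGTA",
]
patterns = compress_site_patterns(alignment)

for pattern, weight in patterns.items():
    print(pattern, weight)
print("sites:", sum(patterns.values()), "distinct patterns:", len(patterns))
```

In this toy case the five all-A columns end up as one pattern with weight 5,
so the likelihood only has to be evaluated on 4 patterns instead of 8 sites.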
Thus, the steps you should follow are:
1. determine how many distinct site patterns you have in your alignment
2. use the memory req. calculator on
http://sco.h-its.org/exelixis/web/software/raxml/index.html to calculate
mem. reqs for your alignment based on the # of distinct site patterns
3. Based on this result decide if you need to use ExaML/ExaBayes or if
you can get away using RAxML
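
Steps 2 and 3 above can be roughly approximated offline. The Python sketch
below assumes the rule of thumb for DNA data under the GAMMA model,
(n - 2) * m * 16 * 8 bytes for n taxa and m distinct site patterns; treat
that formula as an assumption and the online calculator as the reference.
The function names and the example numbers (50 taxa, 2 million patterns,
64 GiB of RAM) are hypothetical.

```python
def estimate_memory_bytes(num_taxa, num_patterns, states=4, rate_categories=4):
    """Rough memory estimate for the likelihood vectors.

    Assumption: (n - 2) inner nodes, each storing states * rate_categories
    double-precision (8-byte) values per distinct site pattern.
    """
    if num_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return (num_taxa - 2) * num_patterns * states * rate_categories * 8


def choose_tool(num_taxa, num_patterns, available_ram_bytes):
    """Hypothetical decision helper: does the estimate fit on one machine?"""
    needed = estimate_memory_bytes(num_taxa, num_patterns)
    return "RAxML" if needed <= available_ram_bytes else "ExaML/ExaBayes"


GIB = 1024 ** 3
needed = estimate_memory_bytes(num_taxa=50, num_patterns=2_000_000)
print(f"estimated memory: {needed / GIB:.2f} GiB")
print("suggested tool:", choose_tool(50, 2_000_000, available_ram_bytes=64 * GIB))
```

The estimate covers only the likelihood vectors, so leave some headroom when
comparing it against the RAM you actually have.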
In general I wouldn't use the asc. bias correction if I had the invariable
data; using the correction only makes sense if the invariable data has not
been sampled/sequenced.
Alexis
>
> Thanks in advance,
> Pedro
>
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson
www.exelixis-lab.org