Question about the ascertainment bias model in RAxML

Pedro

Jul 4, 2014, 5:13:15 PM

to ra...@googlegroups.com

Hi all,

I have data from a whole-genome resequencing project that I would like to analyze phylogenetically with RAxML. I searched this group for similar questions, but the existing posts seem to deal with GBS rather than whole-genome resequencing (e.g. https://groups.google.com/forum/#!topic/raxml/DUHB_6MqIXE), which is why I am posting this as a new question.

The data cover the whole genome, not only SNPs, but I guess it would be impossible to infer a phylogeny from the whole data set, so only the polymorphic sites (SNPs) that can be confidently called in all individuals are being used for this task. The new versions of RAxML have an ascertainment bias correction model that seems designed for this kind of situation; however, I would like to clarify some doubts I have about applying this new model to my data.

1) My understanding of ascertainment bias (which might not be correct...) is that only positions present in a reference panel are used for genotype/SNP calling in the sample under study, meaning that there is an intrinsic bias away from low-frequency variants. Is the ascertainment bias correction in RAxML also based on these assumptions? If not, in which way is it different?

2) Does the ascertainment bias correction only affect branch lengths, or does it also affect the topology?

The bottom line of these questions is whether it is in fact preferable to use the ascertainment bias correction when running RAxML on my data, since I do not expect any bias (or at least only a minimal one) in the frequency of the variants, and, if not, what the best approach to phylogenetic inference would be in this situation.

Thanks in advance,

Pedro

Alexandros Stamatakis

Jul 5, 2014, 2:23:24 AM

to ra...@googlegroups.com

Hi Pedro,

> I have data from a whole-genome resequencing project for which I would like
> to proceed with phylogenetic analysis with RAxML. I searched this group for
> similar questions but it seems that other posts are related to GBS and not
> to whole-genome resequencing
> (e.g. https://groups.google.com/forum/#!topic/raxml/DUHB_6MqIXE) which has
> thus motivated me to post this as a new question.
>
> The data I have is from the whole-genome and not only for SNPs but it would
> be impossible (I guess) to perform a phylogeny on the whole data

This is not impossible; for this purpose we have developed ExaML and, if
you prefer Bayesian analyses, ExaBayes, see:
http://sco.h-its.org/exelixis/software.html

> and as so
> only the polymorphic sites (SNPs) that can be confidently called on all
> individuals are being used for this task. The new versions of RAxML have an
> ascertainment bias correction model that seems properly designed for this
> kind of situation, however I would like to clarify some doubts I have about
> the application of this new model to my data.
>
> 1) my understanding of ascertainment bias (which might not be correct... )
> is that only positions that are present in a reference panel are being used
> for genotype/SNP calling in the sample under study, meaning that there is an
> intrinsic bias away from low frequency variants. Is the ascertainment bias
> correction in RAxML also based on these assumptions? or else
> in which way is it different from this?
> 2) Does the ascertainment bias correction only have effect on branch lengths or
> does it also affect the topology?

regarding 2: it mostly affects branch lengths according to our tests,
but it may affect topologies as well.

regarding 1: I think you are confounding data sampling with asc. bias
correction here. The asc. bias correction simply corrects for the fact that
in one way or another only variable sites have been sampled, while invariable
ones are not contained in the sample, despite the fact that they exist.
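
To make the idea concrete, here is a minimal sketch of the conditional-likelihood form of such a correction (the form usually attributed to Lewis, 2001). It is an illustration only, not RAxML's actual code: the function name and its inputs (`site_logliks`, `weights`, `p_invariant`) are hypothetical, and it assumes you already have the per-pattern log-likelihoods and the summed likelihood of all invariable patterns on the same tree and model.

```python
import math

def asc_corrected_loglik(site_logliks, weights, p_invariant):
    """Condition the log-likelihood on every sampled site being variable.

    site_logliks: log-likelihood of each distinct variable site pattern
    weights:      number of alignment columns showing each pattern
    p_invariant:  summed likelihood of all invariable patterns
                  (all-A, all-C, all-G, all-T) on the same tree and model
    """
    if not 0.0 <= p_invariant < 1.0:
        raise ValueError("p_invariant must be in [0, 1)")
    uncorrected = sum(w * l for w, l in zip(weights, site_logliks))
    n_sites = sum(weights)
    # Each site likelihood is divided by P(site is variable) = 1 - p_invariant,
    # which in log space subtracts log(1 - p_invariant) once per site.
    return uncorrected - n_sites * math.log(1.0 - p_invariant)
```

With `p_invariant = 0` the function returns the uncorrected value, which matches the point that no correction is needed when the invariable sites are actually in the data.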

> The bottom line for these questions is actually to know if it is in fact
> preferable to use the ascertainment bias correction on my data for running
> RAxML, since I should not expect to have any bias (or at least this bias
> should be minimum) in the frequency of the variants, and/or if not what
> would be the best approach to do phylogenetic inference on these situations.

As far as I understood you have a couple of whole-genome sequences with
mostly invariable sites? If this is the case you don't need an asc. bias
correction since the invariable sites are there.

Computationally, having a large number of invariable sites is not
expensive, since sites consisting of the same DNA character can be
compressed into one single site pattern and be assigned a higher weight.

Thus, the steps you should follow are:

1. determine how many distinct site patterns you have in your alignment
2. use the memory req. calculator on
http://sco.h-its.org/exelixis/web/software/raxml/index.html to calculate
mem. reqs for your alignment based on the # of distinct site patterns
3. Based on this result decide if you need to use ExaML/ExaBayes or if
you can get away using RAxML
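
Step 1 and the compression idea above can be sketched in a few lines. This is a rough illustration, assuming the alignment is already loaded as equal-length strings; the function name and the toy data are made up, and characters are compared literally, so the count may differ somewhat from the number of distinct alignment patterns RAxML itself reports.

```python
from collections import Counter

def site_pattern_weights(sequences):
    """Map each distinct alignment column (site pattern) to its count."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    # zip(*sequences) yields one tuple of characters per alignment column.
    return Counter("".join(column) for column in zip(*sequences))

# Toy alignment: 3 taxa, 8 columns (made-up data).
toy = ["AAAACGTA",
       "AAAACGTC",
       "AAAACGTA"]
patterns = site_pattern_weights(toy)
n_patterns = len(patterns)          # distinct patterns to evaluate
n_columns = sum(patterns.values())  # original alignment length
```

The four identical all-A columns collapse into a single pattern with weight 4, which is why a long stretch of invariable sites adds very little to the computation.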

In general I wouldn't use asc. bias corr. if I had the invariable data;
using the correction only makes sense if the invariable data has not
been sampled/sequenced.

Alexis

>
> Thanks in advance,
> Pedro
>

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Pedro

Jul 5, 2014, 8:10:08 PM

to ra...@googlegroups.com

Hi Alexis,

thank you for the prompt reply.

> This is not impossible; for this purpose we have developed ExaML and, if
> you prefer Bayesian analyses, ExaBayes, see:
> http://sco.h-its.org/exelixis/software.html

This is great, I will try ExaML for now.

> regarding 1: I think you are confounding data sampling with asc. bias
> correction here. The asc. bias correction simply corrects for the fact that
> in one way or another only variable sites have been sampled, while invariable
> ones are not contained in the sample, despite the fact that they exist.

Thank you for the clarification. I think that I get the idea now :) This was helpful.

> As far as I understood you have a couple of whole-genome sequences with
> mostly invariable sites? If this is the case you don't need an asc. bias
> correction since the invariable sites are there.
>
> Computationally, having a large number of invariable sites is not
> expensive, since sites consisting of the same DNA character can be
> compressed into one single site pattern and be assigned a higher weight.
>
> Thus, the steps you should follow are:
>
> 1. determine how many distinct site patterns you have in your alignment
> 2. use the memory req. calculator on
> http://sco.h-its.org/exelixis/web/software/raxml/index.html to calculate
> mem. reqs for your alignment based on the # of distinct site patterns
> 3. Based on this result decide if you need to use ExaML/ExaBayes or if
> you can get away using RAxML
>
> In general I wouldn't use asc. bias corr. if I had the invariable data;
> using the correction only makes sense if the invariable data has not
> been sampled/sequenced.

Thanks also for all the suggestions, which I will try to follow. Including the invariable sites, the final alignment is around 7 Mb with 17299 alignment patterns. According to the memory req. calculator this would require 171 MB; however, this number is rather abstract to me... Is it ok for a RAxML run, or would it be better to run ExaML? The machine I am working on has 2 processors, 8 cores each, with 96 GB RAM.

In case ExaML is needed, I was thinking of using the best-known ML tree from the RAxML tree search on the SNP data as the starting tree for ExaML. For bootstrapping I would follow the suggestions in the RAxML-Light manual. Is this ok?

All the best,

Pedro

Alexandros Stamatakis

Jul 6, 2014, 4:57:28 AM

to ra...@googlegroups.com

Dear Pedro,

>> As far as I understood you have a couple of whole-genome sequences with
>> mostly invariable sites? If this is the case you don't need an asc. bias
>> correction since the invariable sites are there.
>> Computationally, having a large number of invariable sites is not
>> expensive, since sites consisting of the same DNA character can be
>> compressed into one single site pattern and be assigned a higher weight.
>> Thus, the steps you should follow are:
>> 1. determine how many distinct site patterns you have in your alignment
>> 2. use the memory req. calculator on
>> http://sco.h-its.org/exelixis/web/software/raxml/index.html to calculate
>> mem. reqs for your alignment based on the # of distinct site patterns
>> 3. Based on this result decide if you need to use ExaML/ExaBayes or if
>> you can get away using RAxML
>> In general I wouldn't use asc. bias corr. if I had the invariable data;
>> using the correction only makes sense if the invariable data has not
>> been sampled/sequenced.
>
>
> Thanks also for all the suggestions which I will try to follow. Including
> the invariable sites, the final alignment is around 7 Mb with 17299
> alignment patterns. By the memory req. calculator this would require 171MB,
> however this number is very abstract for me... Is it ok for a RAxML run or
> would it be better to run ExaML? The machine I am working on has 2
> processors, 8 cores each, with 96 GB RAM.

You can easily analyze this with RAxML on that machine.
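
To put the calculator's figure into perspective, here is a rough sketch of the kind of estimate involved. It assumes the rule of thumb for DNA data under GAMMA as I recall it from the RAxML manual, roughly (n - 2) * m * 16 * 8 bytes for n taxa and m distinct patterns; treat the formula as an assumption and the online calculator as the authoritative source. The function name is hypothetical, and `n_taxa=80` is a made-up value because the number of taxa is not stated in this thread.

```python
def estimate_raxml_memory_bytes(n_taxa, n_patterns, states=4, rate_categories=4):
    """Rough size of the likelihood vectors: (n-2) * m * states * cats * 8 bytes."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    bytes_per_double = 8
    inner_nodes = n_taxa - 2  # one likelihood vector per inner node
    return inner_nodes * n_patterns * states * rate_categories * bytes_per_double

# n_taxa=80 is a made-up value; the thread does not state the number of taxa.
needed = estimate_raxml_memory_bytes(n_taxa=80, n_patterns=17299)
available = 96 * 1024**3  # 96 GB of RAM, as stated for the machine
fraction_used = needed / available
```

Whatever the exact number of taxa, a requirement in the range of a few hundred MB is a tiny fraction of 96 GB, so memory is not the limiting factor here.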

> In case ExaML is needed, I was thinking of using the best-known ML tree from
> the RAxML tree search on the SNP data as the starting tree for ExaML. For
> bootstrapping I would follow the suggestions in the RAxML-Light manual. Is this ok?

No, this is not necessary. Since ExaML uses the same tree search
algorithm as RAxML, this will not improve anything. I'd just stick to
using RAxML in your case.
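
As a starting point, a plain ML search on the full alignment (no ASC_ model, since the invariable sites are included) could be launched along these lines. This is only a sketch: the file name `genomes.phy`, the run name and the seed are made up, the binary name depends on how RAxML was compiled, and 16 threads simply matches the core count of the machine mentioned above; fewer threads may well be just as fast for this number of patterns. The helper builds the command line without running it.

```python
import shlex

def raxml_command(alignment, run_name, threads=16, model="GTRGAMMA", seed=12345):
    """Build (but do not run) a RAxML ML tree-search command line."""
    return [
        "raxmlHPC-PTHREADS",   # threaded binary; name may differ per install
        "-T", str(threads),    # number of threads
        "-s", alignment,       # input alignment
        "-n", run_name,        # suffix for the output files
        "-m", model,           # substitution model, without ASC_ prefix
        "-p", str(seed),       # random seed for the parsimony starting tree
    ]

# Hypothetical file and run names, for illustration only.
cmd = raxml_command("genomes.phy", "wgs_run")
printable = shlex.join(cmd)  # string form for pasting into a shell
```

The list form can be passed to `subprocess.run(cmd)` if you prefer to start the run from a script.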

Alexis

Pedro

Jul 7, 2014, 5:06:20 PM

to ra...@googlegroups.com

Hi Alexis,

Thank you so much for all the help and suggestions. Indeed, RAxML ran quite nicely on this data.