RAxML on exome sequence data

52 views
Skip to first unread message

Adrián Báez Ortega

unread,
Aug 10, 2016, 8:24:11 AM8/10/16
to raxml
Dear Alexis,

I haven't found any topic in this group about applying RAxML on exomes, exons, pull-down or targeted sequencing data, so I'll ask myself. I hope my questions are not too naive.

I want to use RAxML to analyse some hundreds of whole-exome samples based on a large set of point mutations that I have called from these. (The data don't contain every exon in the genome but most of them.) My first idea was to create the input sequences for RAxML by simply concatenating all the exon sequences (with mutations imposed) for every sample. Regarding this, do you think that omitting those exons that don't contain any mutations would affect the results?

Secondly, I'm considering using only the mutations themselves (i.e. using the concatenation of all the mutation as the input sequence), and applying one of the methods that RAxML offers for ascertainment bias correction. However, I'm not sure of whether this is a good idea, because we do have the information for the invariant sites (but notably, only for the exons, which are quite short – many are <200 bp). My reason for trying this is that we suspect that running RAxML on the whole exomes could take a very long time, especially for the bootstrapping. My intention is to use raxmlHPC-PTHREADS-SSE3 (or raxmlHPC-PTHREADS if this gives problems), with a GTRGAMMAI model.

If you think that it is a good idea to use only the variable sites (because of time constraints or because the concatenation of exons is not fully informative), which correction method would you use? If I am right, --asc-corr=lewis is the conditional likelihood method (which is easier but potentially much less accurate), whereas felsenstein and stamatakis correspond to the reconstituted DNA approach; for these, I don't know if I would need to count the number of invariant sites by considering only the exons or the whole reference genome. I also read your paper about asc. bias correction for ddRADseq data with varying levels of missing data, and I am also not sure of whether this problem applies to our data, because I will be using a set of mutations that show decent sequencing coverage in every sample, so I think no missing data should be expected.

Many thanks for your help! I think your work is extraordinary.

Regards,
Adrian

Alexandros Stamatakis

unread,
Aug 11, 2016, 3:22:44 AM8/11/16
to ra...@googlegroups.com
Hi Adrian,

Please check this paper here, it should answer some of your questions
below:

http://sysbio.oxfordjournals.org/content/64/6/1032

> I haven't found any topic in this group about applying RAxML on exomes,
> exons, pull-down or targeted sequencing data, so I'll ask myself. I hope
> my questions are not too naive.
>
> I want to use RAxML to analyse some hundreds of whole-exome samples
> based on a large set of point mutations that I have called from these.
> (The data don't contain every exon in the genome but most of them.) My
> first idea was to create the input sequences for RAxML by simply
> concatenating all the exon sequences (with mutations imposed) for every
> sample. Regarding this, do you think that omitting those exons that
> don't contain any mutations would affect the results?

IT will affect the branch lengths, more details in the paper I mentioned
above.

> Secondly, I'm considering using only the mutations themselves (i.e.
> using the concatenation of all the mutation as the input sequence), and
> applying one of the methods that RAxML offers for ascertainment bias
> correction. However, I'm not sure of whether this is a good idea,
> because we do have the information for the invariant sites (but notably,
> only for the exons, which are quite short – many are <200 bp). My reason
> for trying this is that we suspect that running RAxML on the whole
> exomes could take a very long time, especially for the bootstrapping. My
> intention is to use raxmlHPC-PTHREADS-SSE3(or raxmlHPC-PTHREADS if this
> gives problems), with a GTRGAMMAI model.

That's not necessarily the case, say you have 10,000 invariant sites
consisting of A, all those 10,000 sites will be compressed into a single
site with a weight of 10,000, so this should not be an issue.

Thus, if you have the invariant sites, use them!

Also, if you still run into performance problems I'd suggest you rather
use ExaML which is more efficient than RAxML on such large datasets.

> If you think that it is a good idea to use only the variable sites
> (because of time constraints or because the concatenation of exons is
> not fully informative), which correction method would you use?

See above, since you have the data, there is no need to estimate it :-)

> If I am
> right, --asc-corr=lewis is the conditional likelihood method (which is
> easier but potentially much less accurate), whereas felsenstein and
> stamatakis correspond to the reconstituted DNA approach; for these, I
> don't know if I would need to count the number of invariant sites by
> considering only the exons or the whole reference genome.

Ideally the whole reference genome to get the true number of invariant
sites.

> I also read
> your paper about asc. bias correction for ddRADseq data with varying
> levels of missing data, and I am also not sure of whether this problem
> applies to our data, because I will be using a set of mutations that
> show decent sequencing coverage in every sample, so I think no missing
> data should be expected.

Well, that's even better then if you don't have to worry about missing data.

So overall, I'd use all the data you have and probably run ExaML
depending on how large those datasets are.

Keep in mind the compression factor of invariable sites, if you have
10,000 SNPs and 1,000,000 invariant sites the likelihood calculations
witll only be done on 10,000 + 4 site patterns.

Alexis

>
> Many thanks for your help! I think your work is extraordinary.
>
> Regards,
> Adrian
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Adrián Báez Ortega

unread,
Aug 11, 2016, 4:00:05 AM8/11/16
to raxml
Hi Alexis,

Many thanks for the prompt reply!

I had already read the paper, it's similar to the exomes' case because you concatenate many small regions. However, what I still don't understand is that you argue that omitting regions with only invariant sites will affect branch length, when exome data is already omitting 98% of the genome. Can we say that this does not affect branch length because we have no idea about how much of this 98% is invariant and how much is variable sites? Can I just use a "full sequence" approach when I am missing most of the real sequence (the non-coding regions)? That sounds strange to me.

If I'm getting the idea of invariant site compression right, then the computing time depends on the number of SNPs, not of invariable sites? That is great.

So, I'll follow your recommendation and try RAxML on the concatenated exons (with a GTRGAMMAI model, if that's appropriate), and in the event this takes too much time, I will give ExaML a try!

Many thanks for your help! :)

Adrian

Alexandros Stamatakis

unread,
Aug 11, 2016, 6:24:17 AM8/11/16
to ra...@googlegroups.com
Hi Adrian,

> Many thanks for the prompt reply!

:-)

> I had already read the paper, it's similar to the exomes' case because
> you concatenate many small regions.

Yes, I saw that further down in your message.

> However, what I still don't
> understand is that you argue that omitting regions with only invariant
> sites will affect branch length, when exome data is already omitting 98%
> of the genome. Can we say that this does not affect branch length
> because we have no idea about how much of this 98% is invariant and how
> much is variable sites?

You could say that. It all depends of course on how you are going to
post-analyze the phylogenies you are going to infer. If you want to
estimate divergence times then it does matter. Of course having the
genome data could also change the phylogeny. If you do have estimates
about the variability in the whole genome, then of course you could use
the correction methods for ascertainment bias from our paper

> Can I just use a "full sequence" approach when I
> am missing most of the real sequence (the non-coding regions)? That
> sounds strange to me.

Well, there must be a good justification in the experimental design as
to why you are only looking at the exons.

>
> If I'm getting the idea of invariant site compression right, then the
> computing time depends on the number of SNPs, not of invariable sites?
> That is great.

Yes that's correct.

> So, I'll follow your recommendation and try RAxML on the concatenated
> exons (with a GTRGAMMAI model, if that's appropriate), and in the event
> this takes too much time, I will give ExaML a try!

Okay, also note that, for SNPs datasets we have frequently observed that
rate heterogeneity models are not required, i.e., a plain GTR model will
do. RAxML might issue some respective warnings (also see previous posts
on here).

Alexis

>
> > an email to raxml+un...@googlegroups.com <javascript:>
> > <mailto:raxml+un...@googlegroups.com <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
> Adjunct Professor, Dept. of Ecology and Evolutionary Biology,
> University
> of Arizona at Tucson
>
> www.exelixis-lab.org <http://www.exelixis-lab.org>

Adrián Báez Ortega

unread,
Aug 11, 2016, 6:34:00 AM8/11/16
to ra...@googlegroups.com
Hi Alexis,

Thanks! I see, then I'll concatenate all the exons and give RAxML/ExaML a try using the GTRGAMMAI model and a simple GTR model, to check the differences. I will think about asc. bias correction to account for the missing bases in the genome, but it probably won't be necessary. We have some good idea about how the phylogeny should look like (roughly).

Thanks again for your help! This is certainly the best "user support service" I've seen for a bioinformatics tool! :)

All the best,
Adrian


For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

--
You received this message because you are subscribed to a topic in the Google Groups "raxml" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raxml/bytKrvzIyZw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raxml+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Alexandros Stamatakis

unread,
Aug 11, 2016, 8:48:21 AM8/11/16
to ra...@googlegroups.com
Hio Adrian,

> Thanks! I see, then I'll concatenate all the exons and give RAxML/ExaML
> a try using the GTRGAMMAI model and a simple GTR model, to check the
> differences.

Okay :-)

> I will think about asc. bias correction to account for the
> missing bases in the genome, but it probably won't be necessary. We have
> some good idea about how the phylogeny should look like (roughly).

okay.

> Thanks again for your help! This is certainly the best "user support
> service" I've seen for a bioinformatics tool! :)

:-)

Alexis


>
> All the best,
> Adrian
>
> On 11 August 2016 at 11:24, Alexandros Stamatakis
> <alexandros...@gmail.com
> <mailto:raxml%2Bun...@googlegroups.com> <javascript:>
> > <mailto:raxml+un...@googlegroups.com
> <mailto:raxml%2Bun...@googlegroups.com> <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>
> <https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical
> Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of
> Technology
> Adjunct Professor, Dept. of Ecology and Evolutionary Biology,
> University
> of Arizona at Tucson
>
> www.exelixis-lab.org <http://www.exelixis-lab.org>
> <http://www.exelixis-lab.org>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from
> it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml%2Bunsu...@googlegroups.com>
> <mailto:raxml+un...@googlegroups.com
> <mailto:raxml%2Bunsu...@googlegroups.com>>.
> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
> Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
> of Arizona at Tucson
>
> www.exelixis-lab.org <http://www.exelixis-lab.org>
>
> --
> You received this message because you are subscribed to a topic in
> the Google Groups "raxml" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/raxml/bytKrvzIyZw/unsubscribe
> <https://groups.google.com/d/topic/raxml/bytKrvzIyZw/unsubscribe>.
> To unsubscribe from this group and all its topics, send an email to
> raxml+un...@googlegroups.com
> <mailto:raxml%2Bunsu...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
>
>
>
> --
> Adrian Baez-Ortega
> linkedin.com/in/abaezortega <http://linkedin.com/in/abaezortega>
> www.divulgatum.com <http://www.divulgatum.com>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
Reply all
Reply to author
Forward
0 new messages