Hi Adrian,
Please check this paper; it should answer some of your questions
below:
http://sysbio.oxfordjournals.org/content/64/6/1032
> I haven't found any topic in this group about applying RAxML on exomes,
> exons, pull-down or targeted sequencing data, so I'll ask myself. I hope
> my questions are not too naive.
>
> I want to use RAxML to analyse some hundreds of whole-exome samples
> based on a large set of point mutations that I have called from these.
> (The data don't contain every exon in the genome but most of them.) My
> first idea was to create the input sequences for RAxML by simply
> concatenating all the exon sequences (with mutations imposed) for every
> sample. Regarding this, do you think that omitting those exons that
> don't contain any mutations would affect the results?
It will affect the branch lengths; more details are in the paper I
mentioned above.
> Secondly, I'm considering using only the mutations themselves (i.e.
> using the concatenation of all the mutation as the input sequence), and
> applying one of the methods that RAxML offers for ascertainment bias
> correction. However, I'm not sure of whether this is a good idea,
> because we do have the information for the invariant sites (but notably,
> only for the exons, which are quite short – many are <200 bp). My reason
> for trying this is that we suspect that running RAxML on the whole
> exomes could take a very long time, especially for the bootstrapping. My
> intention is to use raxmlHPC-PTHREADS-SSE3(or raxmlHPC-PTHREADS if this
> gives problems), with a GTRGAMMAI model.
That's not necessarily the case. Say you have 10,000 invariant sites
consisting only of A; all 10,000 of those sites will be compressed into
a single site pattern with a weight of 10,000, so this should not be an
issue. Thus, if you have the invariant sites, use them!
Also, if you still run into performance problems, I'd suggest using
ExaML instead, which is more efficient than RAxML on such large datasets.
> If you think that it is a good idea to use only the variable sites
> (because of time constraints or because the concatenation of exons is
> not fully informative), which correction method would you use?
See above: since you have the data, there is no need to estimate it :-)
> If I am
> right, --asc-corr=lewis is the conditional likelihood method (which is
> easier but potentially much less accurate), whereas felsenstein and
> stamatakis correspond to the reconstituted DNA approach; for these, I
> don't know if I would need to count the number of invariant sites by
> considering only the exons or the whole reference genome.
Ideally the whole reference genome to get the true number of invariant
sites.
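If it helps, counting invariant sites from an alignment is straightforward. Here is a minimal sketch (the alignment data is purely illustrative, and this only produces the total count; note that the stamatakis correction expects per-base counts rather than a single total):

```python
# Illustrative helper (not part of RAxML): count invariant columns in an
# alignment, e.g. to obtain the count that --asc-corr=felsenstein expects.
def count_invariant_sites(seqs):
    """Return the number of alignment columns where all samples agree."""
    invariant = 0
    for column in zip(*seqs):
        # a column is invariant if it contains a single distinct base
        if len(set(column)) == 1:
            invariant += 1
    return invariant

alignment = [
    "ACGTACGT",
    "ACGAACGT",
    "ACGTACGA",
]
print(count_invariant_sites(alignment))  # 6 columns identical across samples
```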
> I also read
> your paper about asc. bias correction for ddRADseq data with varying
> levels of missing data, and I am also not sure of whether this problem
> applies to our data, because I will be using a set of mutations that
> show decent sequencing coverage in every sample, so I think no missing
> data should be expected.
Well, that's even better then if you don't have to worry about missing data.
So overall, I'd use all the data you have and probably run ExaML
depending on how large those datasets are.
Keep in mind the compression of invariant sites: if you have 10,000
SNPs and 1,000,000 invariant sites, the likelihood calculations will
only be done on 10,000 + 4 site patterns.
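To make the compression point concrete, here is a minimal sketch (with made-up data, not RAxML's actual implementation) of how identical alignment columns collapse into weighted site patterns:

```python
# Illustrative sketch of site-pattern compression (not RAxML internals):
# identical alignment columns collapse into one pattern with a weight,
# so the likelihood is computed once per unique pattern and multiplied
# by its weight.
from collections import Counter

def compress_patterns(seqs):
    """Map each unique alignment column to its multiplicity (weight)."""
    return Counter(zip(*seqs))

alignment = [
    "AAAAACGT",
    "AAAAACGT",
    "AAAAATGT",
]
for pattern, weight in compress_patterns(alignment).items():
    print("".join(pattern), weight)
# The five invariant all-A columns become one pattern with weight 5.
```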
Alexis
>
> Many thanks for your help! I think your work is extraordinary.
>
> Regards,
> Adrian
>
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson
www.exelixis-lab.org