codeml with gaps ALSO gene trees or species trees


swil...@gmail.com

Aug 8, 2018, 12:31:05 PM
to PAML discussion group
Hi all,

A couple of questions from someone new to PAML but not to phylogenetics. I'm trying to estimate dN/dS and site-specific rates for representative transcripts mined from a de novo RNA-seq assembly. Some of these either are incomplete transcripts or truly have missing triplets in some species.

I apologize for beating a dead horse, but is the consensus that codeml simply will not run on an alignment that contains any gaps? I know what the manual says, but every one of my alignments that contains gaps (even a single missing triplet in one sequence) fails completely with cleandata=0.
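One way to narrow down a failure like this is to check the alignment for problems other than the gaps themselves before blaming cleandata. The Python below is only a rough sketch under stated assumptions: `check_codon_alignment` is a hypothetical helper (not part of PAML), it assumes the universal genetic code, and it assumes the alignment has already been read into a dict of name to aligned sequence. It flags unequal sequence lengths, lengths that are not a multiple of 3, in-frame stop codons (including a terminal one), and gaps that cover only part of a codon. Partial-codon gaps are flagged because they usually mean the alignment was built at the nucleotide level rather than codon by codon; the sketch makes no claim about how codeml itself treats them.

```python
# Stop codons under the universal genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_codon_alignment(seqs):
    """Return a list of human-readable problems found in a codon alignment.

    seqs maps sequence name -> aligned nucleotide string.
    """
    problems = []

    # All aligned sequences should have the same length.
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")

    for name, seq in seqs.items():
        seq = seq.upper().replace("U", "T")
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")

        # Walk the complete codons only; a trailing partial codon is
        # already reported by the length check above.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            n_gap = sum(codon.count(c) for c in "-?")
            if codon in STOP_CODONS:
                problems.append(f"{name}: stop codon {codon} at codon {i // 3 + 1}")
            elif 0 < n_gap < 3:
                problems.append(f"{name}: partial gap '{codon}' at codon {i // 3 + 1}")
    return problems


# Toy alignment, made up for illustration only.
example = {
    "sp1": "ATGGCTAAAGGA",
    "sp2": "ATG---AAAGGA",
    "sp3": "ATGGC-AAAGGA",
}
for problem in check_codon_alignment(example):
    print(problem)
```

If this reports nothing for an alignment that still fails, the cause is more likely to be in the sequence file format, the tree file, or the control file than in the gaps.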

This may also be a can of worms, but I presume that most folks estimate dN/dS from alignments forced on the species tree. However, considering Matt Hahn's and other works about mutations optimized on incongruent trees, and ignoring for now the logistics of getting an accurate gene tree, should dN/dS most properly be estimated from the gene tree topology and not the species tree topology (assuming these differ)?

Thanks. Stu

cajawe

Aug 9, 2018, 4:20:50 PM
to PAML discussion group
This may also be a can of worms, but I presume that most folks estimate dN/dS from alignments forced on the species tree. However, considering Matt Hahn's and other works about mutations optimized on incongruent trees, and ignoring for now the logistics of getting an accurate gene tree, should dN/dS most properly be estimated from the gene tree topology and not the species tree topology (assuming these differ)?

You should use the "true" tree or, more realistically, your best guess as to what the true tree is. Maybe this is the species tree, but often, of course, it's not, say because of incomplete lineage sorting (ILS) or because of gene births/deaths, etc. My experience is that using the gene tree tends to result in more conservative parameter estimates, i.e., shorter branch lengths and lower dN/dS. I suspect this is a fairly general result for most reasonable tree estimation methods and data sets. Whether or not it's practical to use gene trees (say, because of data set size, or because you're repeating analyses over thousands of genes across the genome) is another matter.
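If you want to see how much the topology matters for a particular gene, one option is to run codeml twice on the same alignment, once per tree, and compare the estimates. The Python below is a minimal sketch of that setup, not a recommended analysis: `write_ctl_pair` and all file names are hypothetical, and the model settings (model = 0, NSsites = 0, F3x4 codon frequencies, starting kappa and omega) are placeholder assumptions that you would replace with whatever models you actually intend to fit. The control-file option names themselves are the standard codeml ones.

```python
from pathlib import Path

# Minimal codeml control file. Text after '*' is a comment in PAML control files.
# The model choices here are placeholders, not recommendations.
CTL_TEMPLATE = """\
      seqfile = {seqfile}
     treefile = {treefile}
      outfile = {outfile}
        noisy = 3
      verbose = 0
      runmode = 0   * 0: user-supplied tree
      seqtype = 1   * 1: codon sequences
    CodonFreq = 2   * 2: F3x4
        model = 0   * 0: one omega for all branches
      NSsites = 0   * 0: one omega for all sites
        icode = 0   * 0: universal genetic code
    fix_kappa = 0
        kappa = 2
    fix_omega = 0
        omega = 0.4
    cleandata = 0   * 0: keep sites with gaps/ambiguity; 1: remove them
"""


def write_ctl_pair(seqfile, gene_tree, species_tree, outdir="."):
    """Write one control file per topology and return their paths.

    Hypothetical helper: the two files differ only in treefile and outfile.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = {}
    for label, treefile in (("gene", gene_tree), ("species", species_tree)):
        ctl_path = outdir / f"codeml_{label}_tree.ctl"
        ctl_path.write_text(
            CTL_TEMPLATE.format(
                seqfile=seqfile,
                treefile=treefile,
                outfile=f"mlc_{label}_tree.txt",
            )
        )
        written[label] = ctl_path
    return written
```

Each control file can then be passed to codeml on the command line (for example `codeml codeml_gene_tree.ctl`). codeml also writes auxiliary files such as rst to the working directory, so it is safer to run the two analyses in separate directories. Comparing the branch lengths and omega from the two output files gives a direct, per-gene view of how sensitive the estimates are to the tree.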