codeml analysis for protein coding genes

433 views
Skip to first unread message

Bikash Shrestha

unread,
Aug 29, 2018, 1:59:49 PM8/29/18
to PAML discussion group
Hi,

I am trying to calculate dN/dS for protein coding genes for different species within a genus. For each gene, I have sequence alignment file and a tree file to run codeml. As, I am trying to calculate dN/dS for each species for a particular gene, I am using free-ratio model,  "model =1", but I am not sure if it's appropriate to use free-ratio model as PAML manual suggest it is very parameter rich model. Since, I am not looking for site under selection, I guess  the site models won't be appropriate. Any suggestions?

When I ran "model =1", I got dN and dS values for each branch along with dS tree, dN tree, and omega value for each species in mlc output file (see attachment). Are those values next to the species name in the dN and dS tree can be considered as dN and dS for each species for a particular sequence alignment?

Any suggestions will be greatly appreciated.

Thanks!
mlc

cajawe

unread,
Aug 30, 2018, 5:30:58 PM8/30/18
to PAML discussion group
When I ran "model =1", I got dN and dS values for each branch along with dS tree, dN tree, and omega value for each species in mlc output file (see attachment). Are those values next to the species name in the dN and dS tree can be considered as dN and dS for each species for a particular sequence alignment?

Unless you're comfortable with several strong assumptions, those values probably shouldn't "be considered as dN and dS for each species".  

They are values for the entire terminal branches, starting from the most recent nodes and ending at the tips (i.e., ending with your focal species).  If you've sampled all possible species in the genus, then the tip-branch values can be thought of as estimating dN/dS for the entire history of each species.  But if you've missed some species, or if some have gone extinct and can't be sampled at all, then some of the tip-branch values will also cover time points that pre-date the origins of the sampled species.

Even making these assumptions, you might want to be cautious in comparing values across species.  Values for younger species could be unreliable if the terminal branch lengths are too short.  And for long terminal branches, you might reasonably worry that selection has varied through time..

Bikash Shrestha

unread,
Sep 2, 2018, 12:38:27 PM9/2/18
to PAML discussion group
Thank you for you comment Cameron. My taxon sampling is not extensive at all, so I agree that I should be careful with interpreting the output as you have suggested. The omega values for short brancsh are high as 999.00, which I am not sure how to interpret. 

What analysis do you think would be more appropriate one if I am looking dN/dS for each species? 


Ziheng

unread,
Oct 11, 2018, 12:38:15 PM10/11/18
to PAML discussion group
it is probably more meaningful to talk about dN/dS for a branch on the species phylogeny than dN/dS for a species.  Even so, the estimates may have very large sampling errors if the branch is short.  I think if you have a few thousand genes, and concatenate them into one supergene, you will have many sites (codons), and then the free-ratios model will be able to estimate dN/dS for each branch with high precision, and the estimates can be interpreted as genome-wide averages.  Such estimates make some intuitive sense but may not be very useful.  
999 is the upper limit set by the program, and means infinity.  If you apply the free-ratios model to each gene, with only 200 or 500 codons, and if no synonymous substitutions have occurred on a short branch, then the program will give you an estimate of infinity.  Similarly if no nonsynonymous substitutions have occurred on a short branch, you will get 0 (0.0004, the lower bound set by the program), which is also an extreme estimate.
The situation is like flipping a coin twice and getting two heads, and the MLE of probability of heads will be 100%, which is extreme.
ziheng

Reply all
Reply to author
Forward
0 new messages