Numbers of synonymous and non-synonymous sites using free-ratio model

30 views
Skip to first unread message

Hugo Moreno

unread,
May 7, 2024, 10:01:00 AMMay 7
to PAML discussion group
Hi,

I am running CODEML free-ratio model to estimate numbers of non-synonymous and synonymous per branch. However, even when I test it for a small dataset (3 or 4 sequences) the numbers returned are sometimes unreal.
When I try to find the nucleotide changes in the aligmnent the observed numbers are different from the numbers estimated by codeml (S*dS and N*dN).

Consider the following case for example. A 4 sequence alignment:
>seq4
CAGGCTGAGTACCGAGCT
>seq1
CACACTGACTACCGAGCC
>seq2
AACACTGACTACCGACCA
>seq3
CACACAGACTACCGACCT

I thought the tree could be influencing the results so I also tried using different trees.
(seq4,seq1,(seq2,seq3));
(seq4,(seq1,(seq2,seq3)));

The numbers of synonymous and non-synonymous sites were stable, independently of the tree variant used.

 branch          t       N       S   dN/dS      dN      dS  N*dN  S*dS

   5..1      0.862    14.2     3.8 999.0000  0.3636  0.0004   5.2   0.0   (seq4)
   5..2      0.852    14.2     3.8  0.0001  0.0001  1.3512   0.0   5.1   (seq1)
   5..6      0.183    14.2     3.8 999.0000  0.0773  0.0001   1.1   0.0 (seq2+3)
   6..3      0.560    14.2     3.8  0.1496  0.0850  0.5682   1.2   2.1 (seq2)
   6..4      0.000    14.2     3.8  0.0001  0.0000  0.0000   0.0   0.0 (seq3)

In this case, the model estimates 5 synonymous mutation in seq1. However, besides the 5 sites considered non-synonymous sites of seq4, there are no differences between seq1 and seq4.

What is the model considering as synonymous site of seq1 in this case? Are these misestimations or my misinterpretations? Is this expected for this model even for small datasets?

The files I used in this test are attached.
Thank you for your time.
 
test.fas
free_ratio.ctl
test.nwk

Janet Young

unread,
May 7, 2024, 2:02:12 PMMay 7
to PAML discussion group
Hi Hugo,

A few things to think about, that might partly explain what you’re seeing

- Codeml will unroot any rooted tree you give it, so those two trees you show probably end up being treated exactly the same after the second one (the rooted one) gets unrooted.  It probably gave you a warning about that.
- In your ctl file you’ve asked codeml to estimate kappa. Your choice of CodonFreq also requires codeml to use the alignment to create a codon frequency table.  That alignment is so short and has so few sequences that I suspect those estimates will be extremely unreliable.  That might be contributing to your very unrealistic observations. For this very small test case you might be better off with CodonFreq=0 and fix_kappa=0 (kappa=2 is reasonable).  You can look in your output to see what it estimated for those things.  It's estimating omega as well - did it come up with a reasonable estimate for that?  If not you might fix it. 
- Notice that the last codon in your alignment has nucleotides positions where you observe >2 nucleotides (e.g. the very last position).  Depending on the estimates of kappa and the codon table, it’s unclear what codeml would assign as the ancestral state at each node for that very last nucleotide position. That can result in more changes being inferred than you actually think (due to recurrent changes at the same position).

I really like this older review article by Yang and Bielawski,  to help think about what codeml is  actually doing under the hood and why these parameter choices might matter - https://www.sciencedirect.com/science/article/pii/S0169534700019947 

all the best,

Janet
Reply all
Reply to author
Forward
0 new messages