Hi,
I am running CODEML free-ratio model to estimate numbers of non-synonymous and synonymous per branch. However, even when I test it for a small dataset (3 or 4 sequences) the numbers returned are sometimes unreal.
When I try to find the nucleotide changes in the aligmnent the observed numbers are different from the numbers estimated by codeml (S*dS and N*dN).
Consider the following case for example. A 4 sequence alignment:
>seq4
CAGGCTGAGTACCGAGCT
>seq1
CACACTGACTACCGAGCC
>seq2
AACACTGACTACCGACCA
>seq3
CACACAGACTACCGACCT
I thought the tree could be influencing the results so I also tried using different trees.
(seq4,seq1,(seq2,seq3));
(seq4,(seq1,(seq2,seq3)));
The numbers of synonymous and non-synonymous sites were stable, independently of the tree variant used.
branch t N S dN/dS dN dS N*dN S*dS
5..1 0.862 14.2 3.8 999.0000 0.3636 0.0004 5.2 0.0
(seq4)
5..2 0.852 14.2 3.8 0.0001 0.0001 1.3512 0.0 5.1
(seq1)
5..6 0.183 14.2 3.8 999.0000 0.0773 0.0001 1.1 0.0
(seq2+3)
6..3 0.560 14.2 3.8 0.1496 0.0850 0.5682 1.2 2.1
(seq2)
6..4 0.000 14.2 3.8 0.0001 0.0000 0.0000 0.0 0.0
(seq3)
In this case, the model estimates 5 synonymous mutation in seq1. However, besides the 5 sites considered non-synonymous sites of seq4, there are no differences between seq1 and seq4.
What is the model considering as synonymous site of seq1 in this case? Are these misestimations or my misinterpretations? Is this expected for this model even for small datasets?
The files I used in this test are attached.
Thank you for your time.