Question about the total number of sites decreasing as the number of input sequences increases


Chenyu Fan

Jul 20, 2022, 12:48:25 PM
to PAML discussion group
Hi,
Thanks for taking the time to read my question.
I have two questions about the total number of sites (N + S) in the codeml output.
1)
For example, when I run codeml on 20 sequences, the output is:

pairwise comparison (Goldman & Yang 1994)
seq seq        N       S       dN       dS     dN/dS   Paras.
  2   1  22322.5   6885.5   0.0013   0.0006   2.1128   0.0033   5.5060   2.1128 -38647.267
Here N + S is roughly the length of the whole sequence.

But when I run codeml on around 150 sequences, the output looks like this:

pairwise comparison (Goldman & Yang 1994)
seq seq        N       S       dN       dS     dN/dS   Paras.
  2   1   9001.4   3139.6   0.0010   0.0003   3.1455   0.0025  19.8616   3.1455 -16013.324
  3   1   9127.5   3013.5   0.0010   0.0003   2.9457   0.0025   8.7926   2.9457 -16016.518
Here N + S is much smaller.
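
A quick arithmetic check on the two outputs above, as a minimal sketch. The N and S values are copied from the first row of each table; the conversion to codons assumes that N + S equals three times the number of codons compared, which is an assumption worth checking against the PAML documentation. The helper name `total_sites` is made up for illustration.

```python
# N and S values copied from the first row of each codeml output above.
runs = {
    "20 sequences": (22322.5, 6885.5),
    "~150 sequences": (9001.4, 3139.6),
}

def total_sites(n_sites, s_sites):
    """Return N + S, the number of nucleotide sites compared."""
    return n_sites + s_sites

for label, (n, s) in runs.items():
    total = total_sites(n, s)
    # Assumption: N + S is 3 x the number of codons that were analysed.
    print(f"{label}: N + S = {total:.1f} nt, about {total / 3:.0f} codons")
```

The totals come to 29208 and 12141 nucleotide sites, so well under half of the alignment is being compared in the larger run.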

I suppose the total number of sites may decrease because the larger set contains more sequences with missing data. Is that a reasonable explanation?
If so, what can I do to get a more precise result? Should I delete the sequences with a high percentage of missing data?
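
One way to decide which sequences to drop is to rank them by how much missing data they carry. The following is a minimal sketch, not part of PAML: the function name `missing_codon_fraction`, the toy alignment, and the choice of `-`, `?` and `N` as "missing" characters are all assumptions for illustration.

```python
def missing_codon_fraction(seq, missing_chars="-?N"):
    """Fraction of codons in seq that contain at least one gap or ambiguity character."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    bad = sum(any(ch in missing_chars for ch in codon) for codon in codons)
    return bad / len(codons)

# Toy alignment (hypothetical data, not from this thread).
alignment = {
    "seqA": "ATGGCTAAATTT",
    "seqB": "ATG---AAATTT",
    "seqC": "ATGGCNNNN---",
}

# Sequences with the most missing data come first.
ranked = sorted(alignment,
                key=lambda name: missing_codon_fraction(alignment[name]),
                reverse=True)
for name in ranked:
    print(name, f"{missing_codon_fraction(alignment[name]):.2f}")
```

Sequences at the top of the ranking are the candidates for removal; where to set the cutoff is a judgment call that depends on the data.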
2)
Below is my codeml control file. Could the decrease in the total number of sites be related to the model I used?
      seqfile = test.phylip
      outfile = results.txt

        noisy = 0
      verbose = 0
      runmode = -2

      seqtype = 1
    CodonFreq = 2
        model = 0
      NSsites = 0

Thanks again!

Chenyu Fan



Janet Young

Jul 20, 2022, 12:51:47 PM
to PAML discussion group
hi Chenyu,

Take a look at the 'cleandata' option in the manual. From your description, I'm guessing you are using cleandata=1? You might want to try cleandata=0 (but please do think about it first).

Janet

Chenyu Fan

Jul 21, 2022, 6:39:59 PM
to PAML discussion group
Thank you Janet. Actually, I didn't set the cleandata option in my control file. I have now tried both cleandata=1 and cleandata=0, and it made little difference. The total number of sites still decreases, which confuses me even more.

Chenyu

Ziheng

Jul 22, 2022, 1:58:14 PM
to PAML discussion group
perhaps you are using the runmode = -2 option, which means pairwise comparison. with that option, the program removes columns with alignment gaps, so the number of sites is smaller than in the original file before gaps were removed.

there are some notes in the doc, about "complete deletion" vs. "pairwise comparison", but i think columns with ambiguities are removed as long as you use runmode = -2.
if you analyze all sequences on a tree, then cleandata works as intended.
best, ziheng
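
The effect described in the reply above can be illustrated with a toy example. This is a minimal sketch and not PAML's actual code: it assumes a codon column is dropped whenever any sequence in the file has a gap or ambiguity character in it, and the function name `codons_kept`, the alignment, and the set of "missing" characters are hypothetical.

```python
def codons_kept(seqs, missing_chars="-?N"):
    """Count codon columns in which no sequence has a gap or ambiguity character."""
    length = min(len(s) for s in seqs)
    kept = 0
    for i in range(0, length - length % 3, 3):
        column = [s[i:i + 3].upper() for s in seqs]
        if not any(ch in missing_chars for codon in column for ch in codon):
            kept += 1
    return kept

# Toy alignment (hypothetical data): later sequences have gaps in different codons.
seqs = [
    "ATGGCTAAATTTGGG",
    "ATGGCTAAATTTGGG",
    "ATG---AAATTTGGG",
    "ATGGCTAAA---GGG",
]

# The first two sequences never change, yet fewer codons survive as more are added.
for n in range(2, len(seqs) + 1):
    print(n, "sequences:", codons_kept(seqs[:n]), "codons kept")
```

Under that assumption, every added sequence with a gap in a new position removes another column from all comparisons, including comparisons between sequences that have no gaps themselves. That would be consistent with N + S for the pair 2 vs. 1 shrinking when more sequences are in the file.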