Question about the total number of sites decreasing as the number of input sequences increases

Chenyu Fan

Jul 20, 2022, 12:48:25 PM

to PAML discussion group

Hi,

Thanks for taking the time to read my question.

I have questions about the total number of sites (N + S) in the codeml output.

1)

For example, when I put 20 sequences in codeml,

pairwise comparison (Goldman & Yang 1994)
seq seq N S dN dS dN/dS Paras.
2 1 22322.5 6885.5 0.0013 0.0006 2.1128 0.0033 5.5060 2.1128 -38647.267

N + S is close to the length of the whole sequence.

But when I put around 150 sequences into codeml, the result looks like

pairwise comparison (Goldman & Yang 1994)
seq seq N S dN dS dN/dS Paras.
2 1 9001.4 3139.6 0.0010 0.0003 3.1455 0.0025 19.8616 3.1455 -16013.324
3 1 9127.5 3013.5 0.0010 0.0003 2.9457 0.0025 8.7926 2.9457 -16016.518

N + S decreases.

I suppose the total number of sites may decrease because more of the sequences contain missing data. Is that a reasonable explanation?

If so, what can I do to get a more precise result? Should I delete the sequences with a high percentage of missing data?
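As a quick check on the numbers quoted above, N + S can be summed for each reported pair and divided by three to give the number of codons compared. This is only a minimal sketch using the values pasted in this post; the function and variable names are illustrative, not part of codeml.

```python
# N and S values copied from the pairwise output quoted in this post.
runs = {
    "20 sequences, pair 2-1": (22322.5, 6885.5),
    "150 sequences, pair 2-1": (9001.4, 3139.6),
    "150 sequences, pair 3-1": (9127.5, 3013.5),
}


def total_sites(n_sites, s_sites):
    """Return (nucleotide sites, codons) for one pairwise comparison."""
    total = round(n_sites + s_sites, 1)  # round away floating-point noise
    return total, total / 3


for label, (n, s) in runs.items():
    total, codons = total_sites(n, s)
    print(f"{label}: N+S = {total:.1f} nt = {codons:.0f} codons")
```

Both pairs in the 150-sequence run sum to the same total, which would be consistent with the same set of columns being used for every pair rather than a different set per pair.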

2)

Below is my codeml control file. I am wondering whether the decrease in the total number of sites is related to the model I used.

seqfile = test.phylip
outfile = results.txt

noisy = 0
verbose = 0
runmode = -2

seqtype = 1
CodonFreq = 2
model = 0
NSsites = 0

Thanks again!

Chenyu Fan

Janet Young

Jul 20, 2022, 12:51:47 PM

to PAML discussion group

hi Chenyu,

Take a look at the 'cleandata' option in the manual. From your description, I'm guessing you are using cleandata=1? You might want to try cleandata=0 (but please do think about it first).

Janet

Chenyu Fan

Jul 21, 2022, 6:39:59 PM

to PAML discussion group

Thank you, Janet. Actually, I hadn't set the cleandata option in my control file. I tried both cleandata = 1 and cleandata = 0, and it made little difference. The total number of sites still decreases, which confuses me even more.

Chenyu

Ziheng

Jul 22, 2022, 1:58:14 PM

to PAML discussion group

perhaps you are using the

runmode = -2

option, which means pairwise comparison. with that option, the program removes columns with alignment gaps, so the number of sites is smaller than in the original file before the gaps were removed.

there are some notes in the doc about "complete deletion" vs. "pairwise comparison", but i think columns with ambiguities are removed as long as you use runmode = -2.
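To illustrate why removing gapped or ambiguous columns shrinks N + S as sequences are added, here is a rough sketch of complete deletion on a toy alignment. It is not codeml's actual code: the helper names and the toy sequences are made up, and it assumes a codon column is dropped whenever any sequence has a character other than A, C, G or T in it.

```python
def codon_columns(seq):
    """Split an in-frame nucleotide sequence into codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_clean(codon):
    """Assumption: a codon is kept only if all three positions are A, C, G or T."""
    return all(base in "ACGT" for base in codon.upper())


def count_complete_codons(seqs):
    """Count codon columns that are clean in every sequence (complete deletion)."""
    columns = zip(*(codon_columns(s) for s in seqs))
    return sum(all(is_clean(c) for c in column) for column in columns)


# Toy alignment, invented for illustration: 4 codons per sequence.
alignment = [
    "ATGGCTAAATGG",
    "ATGGCTAAGTGG",
    "ATG---AAATGG",  # gap in codon 2
    "ATGGCTNNNTGG",  # ambiguity in codon 3
]

for k in range(1, len(alignment) + 1):
    kept = count_complete_codons(alignment[:k])
    print(f"first {k} sequences: {kept} codons kept, about {3 * kept} sites")
```

Each added sequence can only remove columns, never restore them, so with 150 sequences a few gappy ones may be enough to discard a large share of the alignment. Running a count like this on the real alignment, with and without the gappiest sequences, would show how much is being lost.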