codeml calculate ka ks and get pvalue or confidential interval of the ka/ks value

586 views
Skip to first unread message

hua

unread,
Apr 11, 2019, 4:50:29 AM4/11/19
to PAML discussion group
Hi,

I have used codeml to calculate ka/ks of 2 mtDNA sequences and my control as followed in codeml.ctl.
I got the result showed in result.mlc file below.
Firse of all, I am not clear of that which is the true ka/ks value(red color and green color in result.mlc file) I should get by comparing these 2 sequences.
Second, a pvalue should be got for measure the significance of the ka/ks value, but it missed in the result file and I don't know whether codeml will generate a pvalue.
Finally, if codeml does not come out a pvalue ,maybe I would use bootstrap method to get the 95% confidential interval of the ka/ks value.But I don't know how to do that although I search it from many rearch papers and their authors claimed thte CI could be obtained.
Could you help me to solve these problems? I will be really appreciated for that!

Best,
Hua

#############################################################################################
result.mlc:
CODONML (in paml version 4.8, March 2014)  aln2.fa
Model: One dN/dS ratio for branches, 
Codon frequency model: F3x4
ns =   2  ls = 3789

Codon usage in sequences
----------------------------------------------------------------------
Phe TTT  77  76 | Ser TCT  32  32 | Tyr TAT  46  46 | Cys TGT   5   6
    TTC 139 140 |     TCC  99  99 |     TAC  89  89 |     TGC  17  17
Leu TTA  73  73 |     TCA  83  83 | *** TAA   0   0 | Trp TGA  93  93
    TTG  18  18 |     TCG   7   7 |     TAG   0   0 |     TGG  11  11
----------------------------------------------------------------------
Leu CTT  65  65 | Pro CCT  41  41 | His CAT  18  18 | Arg CGT   7   6
    CTC 167 167 |     CCC 119 119 |     CAC  79  80 |     CGC  26  25
    CTA 276 276 |     CCA  52  52 | Gln CAA  82  82 |     CGA  28  28
    CTG  45  45 |     CCG   7   7 |     CAG   8   8 |     CGG   2   2
----------------------------------------------------------------------
Ile ATT 124 124 | Thr ACT  52  52 | Asn AAT  32  32 | Ser AGT  14  14
    ATC 196 196 |     ACC 155 155 |     AAC 132 132 |     AGC  39  39
Met ATA 167 167 |     ACA 134 134 | Lys AAA  85  87 | *** AGA   0   0
    ATG  40  41 |     ACG  10  10 |     AAG  10  10 |     AGG   0   0
----------------------------------------------------------------------
Val GTT  31  31 | Ala GCT  43  43 | Asp GAT  15  16 | Gly GGT  24  23
    GTC  48  48 |     GCC 124 125 |     GAC  51  51 |     GGC  87  86
    GTA  70  71 |     GCA  80  79 | Glu GAA  64  63 |     GGA  67  68
    GTG  18  17 |     GCG   8   8 |     GAG  24  23 |     GGG  34  33
----------------------------------------------------------------------

Codon position x base (3x4) table for each sequence.

#1: MT_CDS         
position  1:    T:0.20823    C:0.26973    A:0.31407    G:0.20797
position  2:    T:0.41013    C:0.27606    A:0.19398    G:0.11982
position  3:    T:0.16522    C:0.41357    A:0.35735    G:0.06387
Average         T:0.26119    C:0.31979    A:0.28847    G:0.13055

#2: N1_CDS         
position  1:    T:0.20850    C:0.26946    A:0.31486    G:0.20718
position  2:    T:0.41040    C:0.27606    A:0.19451    G:0.11903
position  3:    T:0.16495    C:0.41383    A:0.35788    G:0.06334
Average         T:0.26128    C:0.31979    A:0.28908    G:0.12985

Sums of codon usage counts
------------------------------------------------------------------------------
Phe F TTT     153 | Ser S TCT      64 | Tyr Y TAT      92 | Cys C TGT      11
      TTC     279 |       TCC     198 |       TAC     178 |       TGC      34
Leu L TTA     146 |       TCA     166 | *** * TAA       0 | Trp W TGA     186
      TTG      36 |       TCG      14 |       TAG       0 |       TGG      22
------------------------------------------------------------------------------
Leu L CTT     130 | Pro P CCT      82 | His H CAT      36 | Arg R CGT      13
      CTC     334 |       CCC     238 |       CAC     159 |       CGC      51
      CTA     552 |       CCA     104 | Gln Q CAA     164 |       CGA      56
      CTG      90 |       CCG      14 |       CAG      16 |       CGG       4
------------------------------------------------------------------------------
Ile I ATT     248 | Thr T ACT     104 | Asn N AAT      64 | Ser S AGT      28
      ATC     392 |       ACC     310 |       AAC     264 |       AGC      78
Met M ATA     334 |       ACA     268 | Lys K AAA     172 | *** * AGA       0
      ATG      81 |       ACG      20 |       AAG      20 |       AGG       0
------------------------------------------------------------------------------
Val V GTT      62 | Ala A GCT      86 | Asp D GAT      31 | Gly G GGT      47
      GTC      96 |       GCC     249 |       GAC     102 |       GGC     173
      GTA     141 |       GCA     159 | Glu E GAA     127 |       GGA     135
      GTG      35 |       GCG      16 |       GAG      47 |       GGG      67
------------------------------------------------------------------------------
Codon position x base (3x4) table, overall

position  1:    T:0.20837    C:0.26960    A:0.31446    G:0.20757
position  2:    T:0.41027    C:0.27606    A:0.19425    G:0.11942
position  3:    T:0.16508    C:0.41370    A:0.35761    G:0.06361
Average         T:0.26124    C:0.31979    A:0.28877    G:0.13020


Nei & Gojobori 1986. dN/dS (dN, dS)
(Note: This matrix is not used in later ML. analysis.
Use runmode = -2 for ML pairwise comparison.)

MT_CDS              
N1_CDS               1.0261 (0.0011 0.0010)


TREE #  1:  (1, 2);   MP score: 12
check convergence..
lnL(ntime:  2  np:  4): -14637.701888      +0.000000
   3..1     3..2  
 0.000004 0.003256 22.495890 1.173809

Note: Branch length is defined as number of nucleotide substitutions per codon (not per neucleotide site).

tree length =   0.00326

(1: 0.000004, 2: 0.003256);

(MT_CDS: 0.000004, N1_CDS: 0.003256);

kappa (ts/tv) = 22.49589

omega (dN/dS) =  1.17381

dN & dS for each branch

 branch          t       N       S   dN/dS      dN      dS  N*dN  S*dS

   3..1      0.000  8332.7  3034.3  1.1738  0.0000  0.0000   0.0   0.0
   3..2      0.003  8332.7  3034.3  1.1738  0.0011  0.0010   9.4   2.9

tree length for dN:       0.0011
tree length for dS:       0.0010


Time used:  0:01


codeml.ctl:
seqfile = aln2.fa * sequence data file name
outfile = result.mlc * main result file name
treefile =aln2.fa.tree * tree structure file name

noisy = 0 * 0,1,2,3,9: how much rubbish on the screen
verbose = 0 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise
 seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
 CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
* ndata = 10
 clock = 0 * 0:no clock, 1:clock; 2:local clock; 3:TipDate
 aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
 * 7:AAClasses
 model = 0
 * models for codons:
 * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
 * models for AAs or codon-translated AAs:
 * 0:poisson, 1:proportional,2:Empirical,3:Empirical+F
* 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189)
 NSsites = 0 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
 * 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ
* 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
* 13:3normal>0
 icode = 1 * 0:universal code; 1:mammalian mt; 2-11:see below
 Mgene = 0 * 0:rates, 1:separate;
 fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
 kappa = 2 * initial or fixed kappa
 fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
 omega = .4 * initial or fixed omega, for codons or codon-based AAs
 fix_alpha = 1 * 0: estimate gamma shape parameter; 1: fix it at alpha
 alpha = 0. * initial or fixed alpha, 0:infinity (constant rate)
 Malpha = 0 * different alphas for genes
 ncatG = 3 * # of categories in dG of NSsites models
 fix_rho = 1 * 0: estimate rho; 1: fix it at rho
 rho = 0. * initial or fixed rho, 0:no correlation
 getSE = 0 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)

egomez

unread,
Apr 12, 2019, 10:06:21 AM4/12/19
to PAML discussion group
Hi Hua,

I have also been requested by reviewers to do this :)

I noticed in the PAML manual (http://web.mit.edu/6.891/www/lab/pamlDOC.pdf) it says (page 36):

"How to bootstrap? 
To generate bootstrap pseudo-samples from your original data, you should use the control variable bootstrap in the control files baseml.ctl or codeml.ctl, and specify the number of samples, as follows. bootstrap = 100 * generate bootstrap data sets This generates a file named boot.txt. The file name is hard coded in the programs so you might want to rename it. Note that the way bootstrap samples are generated depends on your model, so you use baseml to generate samples for nucleotide-based analysis and codeml for amino acid and codon-based analysis. If you the data are partitioned (using option G), the programs use stratified sampling to generate bootstrap samples. The bootstrap samples can be analyzed using phylip programs (choose the option for multiple data sets) and baseml or codeml (using the variable ndata in the control file)." 

I am hoping to figure out implementing this within an ete framework. If any one has any tips, it would be greatly appreciated! I will update here if I figure it out. But may have to do it in codeml.

Best,
Elaine
 

cajawe

unread,
Apr 12, 2019, 2:52:14 PM4/12/19
to PAML discussion group
The red number is a non-ML approximation; the green number if the ML estimate generated by codeml.

You can run with 'getSE = 1' to obtain a standard error estimate for dN/dS, which could then be converted to a confidence interval.

hua

unread,
Apr 15, 2019, 4:20:30 AM4/15/19
to PAML discussion group
Hi Elaine,

As you said, the reviewers asked me to attach the  confidence intervals, LOL.
I also perceived the bootstrap method in PAML manual and have added the "bootstrap=100 * generate bootstrap data sets" to my codeml.ctl file before I asked for help here. But it only output one file named "boot.txt". Is this a bootstrap file and I can do this 1000 times? 
Moreover, I also used  evolver to simulate a sequence(have stop codons) file based applying baseml to calculate the nucleotides' rate of my original sequence file. Is also well if I utilize it  for calculating a new Ka/Ks and then get the CI for do this 1000 times?
I am a beginner of using PAML, so all of these are just my speculation, I will be really  thankful  if anyone could help me to figure it out.

Best regards,
Hua


hua

unread,
Apr 15, 2019, 4:48:11 AM4/15/19
to PAML discussion group
Hi cajawe,

This means I should take the green number as the result of true Ka/Ks? 
Or the red one is the true value and the green one is the estimated value?
And I have run with 'getSE = 1' in the codeml.ctl and get the result file(attached result.mlc ) like:
………………

MT_CDS              
N1_CDS               1.0261 (0.0011 0.0010)


TREE #  1:  (1, 2);   MP score: 12
lnL(ntime:  2  np:  4): -14637.701888      +0.000000
  3..1     3..2  
0.000005 0.003254 22.496715 1.173789
SEs for parameters:
-1.000000 -1.000000 23.860732 0.794434

Note: Branch length is defined as number of nucleotide substitutions per codon (not per neucleotide site).

tree length =   0.00326

(1: 0.000005, 2: 0.003254);

(MT_CDS: 0.000005, N1_CDS: 0.003254);

Detailed output identifying parameters
………………

I don't understand which value is  the standard error estimate for dN/dS and I get a warning output in the screen:
"
Calculating SE's

Warning: Hessian matrix may be unreliable for zero branch lengths
"

Is this warning caused the weird output? 


Best,
Hua
result.mlc

Ziheng

unread,
Aug 18, 2019, 6:48:36 AM8/18/19
to PAML discussion group
you can report
1.17 +- 0.79
for dN/dS.

you have only two sequences and they are highly similar, so the SE is large and the test won't be significant.
ziheng

Reply all
Reply to author
Forward
0 new messages