Hi Vinciane,
Regarding the BEB output, all the theoretical information about its calculation and application to various empirical datasets is detailed in
Yang, Wong, Nielsen (2005). The BEB output that you show refers to the M2a model, I assume. I am going to go through the output file you get when running `CODEML` with the example dataset used by
Yang, Wong, Nielsen (2005), and then compare that to your output:
1. Maximum-likelihood estimates (MLEs): while you do not ask about them, I find it is important to mention that you will always find the MLEs for p0, p1, (p2), w0, w1, and w2 after the line that starts with "MLEs". Note that, for M8, you will have p0, p, q, (p1), and w after the line that starts with "Parameters in M8". If you run `CODEML` with the dataset that was used by
Yang, Wong, Nielsen (2005), which can be found
on the PAML GitHub repository, you will reproduce the results they report on their Table 4, where you shall see that w2>1. Below, you can find the results I got when I re-ran the analyses with the latest version of `CODEML` (I added p0, p1, p2, w0, w1, and w2 for M2a within parentheses, these are not printed on the output) together with a screenshot of Table 4 in their study:
```
(M2a) | (M8)
MLEs of dN/dS (w) for site classes (K=3) | Parameters in M8 (beta&w>1):
|
p:
(p0) 0.37712
(p1) 0.44169
(p2) 0.18119 | p0 = 0.79950 p = 0.16719 q = 0.14883
w: (w0) 0.05998 (w1) 1.00000 (w2) 3.62564 | (p1 = 0.20050) w = 3.47036
```
2. Bayesian estimates: the BEB calculation will give you Bayesian estimates for the model parameters under M2a and M8. In that way, you can compare the estimates obtained under a maximum-likelihood approach to those obtained under a Bayesian approach. To get more familiar with the "ternary graph" and "the grid" you see in the output file, however, you will need to read
Yang, Wong, Nielsen (2005).
In short, the Bayesian empirical Bayes (BEB) approach assigns a prior to the model parameters (i.e., the first lines that you see (i) under header "The grid (see ternary graph for p0-p1)" for w0 and w2 and (ii) a ternary graph that is not shown here but is shown as part of Fig. 1 in
Yang, Wong, Nielsen (2005) for p0 and p1) and integrates over their uncertainties.
The prior used for w0 and w2 is a uniform distribution: w0 ~ U(0,1) and w2 ~U(0,1). This distribution is discretised into 10 categories, and you can see the values used for each category in the output. For w0, from 0.05 to 0.95. For w2, from 1.5 to 10.5. Once all the calculations take place, you get the posterior estimated values for these parameters under the "Posterior on the grid" header. In this case, the posterior distribution for w0 "peaks" under the first category (i.e.,
w0=0.05) with a probability of
0.4. For w2, it is the third category:
w2=3.5 with a probability of
0.503. The ternary graph is a bit harder to visualise, and so I have tried to "draw" the graph from 0.1 to 1.0 for both p0 (horizontal) and p1 (vertical) trying to match the representation in Fig. 1 (
Yang, Wong, Nielsen (2005)) with the results output by the program -- this is the image you shall see as part of the output. Following the same procedure as with the grid, the posterior distribution seems to peak
somewhere between 0.3-0.4 for p0 and between
0.4-0.5 for p1 with a
probability of 0.098. I guess that, if you were to compute the posterior following the equations in their study, you would get the values they report in the main text: p0=0.37 and p1=0.47 with a probability of 0.0908 -- not sure whether the estimated values are output in a specific file, have not found them.
```
The grid (see ternary graph for p0-p1)
w0: 0.050 0.150 0.250 0.350 0.450 0.550 0.650 0.750 0.850 0.950
w2: 1.500 2.500 3.500 4.500 5.500 6.500 7.500 8.500 9.500 10.500
Posterior on the grid
w0: 0.400 0.339 0.182 0.063 0.013 0.002 0.000 0.000 0.000 0.000
w2: 0.000 0.203 0.503 0.208 0.056 0.018 0.007 0.003 0.002 0.001
Posterior for p0-p1 (see the ternary graph) (YWN2015, fig. 1)
sum of density on p0-p1 = 1.000000
```
For your output, then I would say that, according to the grid, w0=0.05 with probability 1 and w2=1.5 with probability 0.682. According to the graph, p0 is estimated to be between 0.8 and 0.9 and p1 between 0 and 0.1 with probability 0.549.
```
The grid (see ternary graph for p0-p1)
w0: 0.050 0.150 0.250 0.350 0.450 0.550 0.650 0.750 0.850 0.950
w2: 1.500 2.500 3.500 4.500 5.500 6.500 7.500 8.500 9.500 10.500
Posterior on the grid
w0: 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
w2: 0.682 0.197 0.068 0.028 0.013 0.006 0.003 0.002 0.001 0.001
Posterior for p0-p1 (see the ternary graph) (YWN2015, fig. 1)
0.000
0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.410
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 0.001 0.549 0.029
```
The same procedure follows for understanding the output regarding analyses under M8 (you can also read the original paper with their results). If you want to check the BEB probabilities for the classes of each model for each AA, then you should find this in the output file called `rst`. You can read the
PAML documentation for more details about this file :)
With regards to heterozygous sites, I am not entirely sure what is the best next thing to do as, per default, `CODEML` does not really deal with heterozygotes or within-species polymorphism -- Ziheng shall give you a better answer for that!
Hope that at least the explanation about the BEB output is helpful. Have a good one!
Sandra