Likelihood computation - (homogeneous) site model vs. (non-homogeneous) branch-site model with custom restrictions

64 views
Skip to first unread message

Keren Halabi

unread,
Oct 29, 2018, 5:26:30 AM10/29/18
to Bio++ Usage Help Forum
Dear Bio++ team,

Following our discussion, I am having some difficulties understanding the computation approach in non-homogeneous models under custom restrictions. 

I tried to create a non-homogeneous model and then use custom restrictions to reduce it to a simple site model, as follows:
 
nonhomogeneous = general
nonhomogeneous.number_of_models = 2
model1 = model2 = YNGP_M2(...)
site.number_of_paths = 2
site.path1 = model1[YN98.omega_1]&model2[YN98.omega_1]
site.path2 = model1[YN98.omega_2]&model2[YN98.omega_2]
   
The above restrictions constrain the omega category per site to remain the same across the phylogeny. I therefore expected that these settings would simply yield a site model equivalent to " YNGP_M2(...)".

However, the computation of likelihood given these settings and given a simple setting of a homogeneous site model yield different values.

Could you please let me know what I am missing here?

Thanks!
Keren


Keren Halabi

unread,
Oct 31, 2018, 11:21:28 AM10/31/18
to Bio++ Usage Help Forum
Hi again,

Following this post, I tried to play with the initial parameters of the model. I saw that when I provided parameters that require low decimal accuracy (for example, F0 codon frequencies, and rational or whole numbers as initial values for the omegas, kappa, theta1 and theta2), there is no discrepancy in the values of the log likelihood that each approach yields. 

However, when using non-rational numbers as initial values for theta1/theta2, or the codon frequencies, for example, different values of log likelihood are received. 

The discrepancy grows larger with the size of the provided sequence data (for the same initial values, I received a ~3.01 log likelihood units discrepancy for a dataset of 5 taxa, and 100 codon sites, and a ~59.77  likelihood units discrepancy for a dataset of 50 taxa and 1000 codon sites).

This raises my suspicion that in some stage of the likelihood computation, different settings of decimal accuracy are used in each approach. I don't know where this occurs at code level, but if you are interested I could dig deeper into it.

Thanks!
Keren

Laurent Guéguen

unread,
Nov 1, 2018, 3:18:56 AM11/1/18
to Bio++ Usage Help Forum
Hi Keren,

thanks for your observation.

What do you refer to as "non-rational" numbers? 

Frequencies are parameterized  as ratios, to ensure that  they sum to 1.
For example, for frequencies (th1, th2, th3, ....), successive parameters  are th1, th2/(1-th1), th3/(1-th1-th2), etc.

So it is very possible that this plays a role in the accuracy of the computations, while rounding values for low frequencies.

Actually, I have put down this approach in Simplex (/bpp-core/src/Bpp/Numeric/Prob/Simplex.h), with two other
encoded approaches. The standard method is "global", "local" method is very bad for optimization (because it allows only  small steps during optimization,
which is very slow), and "binary" method is fine.

You can look at it if you want, but do not feel forced to (time is short in PhD). I think the way to do it would be to compare input-output
values during encoding and decoding process between frequencies and parameters.

By the way, likelihood computation will be totally different in a while in newlik branch, using Eigen. So we do not plan changing it in the master branch now.

Cheers,
Laurent




Laurent Guéguen

unread,
Nov 1, 2018, 3:35:57 AM11/1/18
to Bio++ Usage Help Forum
Hi again Keren,

in YNGP_M2 there are 3 values of omega: <1, =1, >1.  "=1" is not parameterized because it is a fixed
value of omega.

And then there are three possible homogeneous paths:

site.path1 = model1[YN98.omega_0]&model2[YN98.omega_0]
site.path2 = model1[YN98.omega_1]&model2[YN98.omega_1]
site.path3 = model1[YN98.omega_2]&model2[YN98.omega_2]

However, the missing path should be added automatically. Do you have something like:

Site Path Completion.................... true

in your output?

Then indeed all this is similar to an homogeneous YNGP_M2. Perhaps there is a trick
at the root to handle the root distribution, I have to think about it.

Cheers,
Laurent


Keren Halabi

unread,
Nov 1, 2018, 5:25:35 AM11/1/18
to Bio++ Usage Help Forum
Thank you Laurent, for the elaborated answer!

I am not sure what you mean by trick at the root, could you please elaborate on that?

I should clarify a few things:

1) I have indeed made sure that when using the non-homogeneous approach, the 3rd path is completed (and the weight of the hypernodes corresponding to each omega path are equal to their frequencies in the site model). As you suggested, the following line appears in my output: Site Path Completion...................: Yes
2) I am not optimizing the likelihood function, only computing the likelihood given a set of parameters, but I indeed understand how usage of small codon frequencies could result in such issue. However, I am a bit concerned with the scope of discrepancy that is, possibly, resulted by it.
3) I also tested the computation by computing manually the weighted average of the log likelihood of 3 YN98 model copies, each one corresponding to one of the omegas used in the two other approached. This yields a different log likelihood value regardless of the initial parameters I used.

To give a better picture of the situation, I attach here the tester I used, with two datasets to run it with and the results.
The result of the run on the small dataset yield 3.014 log likelihood units discrepancy when using small values for codon frequencies, while the result of the run on the large dataset yield 59.77 log likelihood units discrepancy.

Since the likelihood computation pipeline is expected to change in the newlik version, I think that for now it is best for me not to dwell too much on this. However, I report it to you in the case it might shed some light on a possible issue in the newlik computation as well.

Many thanks again!
Keren


 
TestLoglComputation.zip

Laurent Guéguen

unread,
Nov 7, 2018, 8:19:44 AM11/7/18
to Bio++ Usage Help Forum
Dear Keren,

I answer in several messages.

The discrepancy between M2a and your manual mixing YN98 models is due to the fact that the models in M2a do not have the same average rate.
In Bio++, each model is normalized to ensure one substitution per site per time unit on the equilibrium distribution.
So when you build up YN98(omega=0.1), YN98(omega=1) and YN98(omega=2), all three are normalized likewise.

A mixture model is normalized likewise, but there are infinite ways to do it, depending on the relative rates of the submodels. For example, consider
M2a, mixture of models M0.1, M1, M2, with respective rates r0.1, r1, r2 and probabilities p0.1, p1, p2 (such that p0.1+p1+p2=1).
Any combination of r0.1, r1 and r2, such that r0.1*p0.1+r1*p1+r2*p2=1 will make a normalized mixed model.
If r0.1=r1=r2=1, it works. But the problem is now about modelling the biological process. If r0.1=r2, since in M2 non-synonymous substitutions are easier
than in M0.1, it means that synonymous substitution have a lower rate in M2 than in M0.1. But in the modelling of the mixture
we want to change the rates of the non-synonymous substitutions between the submodels, but not the rates of the synonymous
substitutions. So r0.1 < r1 < r2. 

In the code of codeml models, I define synfrom_ and synto_ as the code of two synonymous codons, and ensure that in the relative rates of the submodels,
the substitution rates from synfrom_ and synto_ is the same for the three models (line 141 of YNGP_M2.cpp).

Laurent



   

Laurent Guéguen

unread,
Nov 7, 2018, 9:12:15 AM11/7/18
to Bio++ Usage Help Forum

Hi again,

about the discrepancies when using specific values of codon frequencies,
you can also try with F1X4 instead of F0, to check if there is still a problem there
For example you can try with values likes:

frequencies=F1X4, 123_Full.theta=0.2, 123_Full.theta1=0.3, 123_Full.theta2=0.8

in the declaration of the models?

And if there is no problem with F1X4, you can also try with F3X4.

Actually I have no idea of the problem. Perhaps you can also use a root frequency instead of stationarity?
I do not remember the default option, but there is no sense to use stationarity when there are several
models (actually the default should be to forbid stationarity in this case).

Also, I suggest you try with an even simpler example (tree and alignment).

Cheers,
Laurent



Keren Halabi

unread,
Nov 8, 2018, 11:04:02 AM11/8/18
to Bio++ Usage Help Forum
Hi Laurent,

Thank you for the clarification regarding the normalization! 

I do have 2 follow-up questions regarding your explanation:
1) According to my understanding, the normalization should achieve the goal of having the sum of diagonal in a matrix be 1. However, when creating a simple YN98(kappa=1,omega=0.1,frequencies=F0) model, I see that the sum of the diagonal is -58.816397228637342, and it is not complemented to 1 by some reciprocal value in data member rate_. Thus, I am not sure what the normalization achieved in this case.
2)  Since I wish to simulate data under a mixture model (see here: https://groups.google.com/forum/#!category-topic/biopp-help-forum/all-questions/2SyFxOCqnWQ) - a feature that is not yet available - I am currently simulating by creating 3 different YN98 models (M0.1,M1,M2), each one corresponding to a different omega category, and then sampling the model to simulate each site under according to the omegas distribution.
In practice, when I infer the parameters with YNGP_M2 based on data simulated in the above manner, I receive relatively poor results.
If I understood you correctly, by creating 3 independent YN98 models during the simulation, I only normalize, but don't homogenize. So, When I infer the parameters with YNGP_M2 (which homogenizes in addition to normalization), It may come as no surprise that the inferred values vary from the simulated ones. 
Could this possibly explain why I see relatively poor inference of the omega parameters when inferring with a mixture model parameters based on simulated data?


As for the discrepancy issue between site models and branch-site models, I did as you suggested with a minor addition:

1) executed likelihood computation using F1X4 with 123_Full.theta=0.2, 123_Full.theta1=0.3, 123_Full.theta2=0.8
2) executed likelihood computation using F1X4 with 123_Full.theta=0.3333333333333, 123_Full.theta1=0.3333333333333, 123_Full.theta2=0.3333333333333

I executed the two computations on a simpler and smaller data, as you suggested. 

In the first case I received the same log likelihood in both computations (that is, branch-site model with two copies of the same model, and a single site model).
In the second case I received discrepancy of 0.00043952825 log likelihood units. This may seem minor, but on a larger, less simple dataset, the discrepancy grows into 0.059210998. Naturally, the more parameters require accuracy, and the larger the data, the greater the discrepancy is.

Please find attached the tester, the data and the results. 

Many thanks again!
Keren
TestLoglComputation.zip

Laurent Guéguen

unread,
Nov 14, 2018, 5:32:01 AM11/14/18
to Bio++ Usage Help Forum
Hi Keren,

1) Normalization means a substitution per site per unit of time on a sequence at the equlibrium of the model. If (p_i)_i is this
equilibrium distribution, and (d_i)_i the diagonal of the generator, normalization means setting  \sum_i p_i * d_i = 1.

2) Yes indeed, different rates in the mixture are prone to make much differences in the estimates. Actually, I have
recently set a way to retrieve a submodel from a mixture model, which submodel keeps the same rate can be used for simulations (at least it should).
The class is InMixedSubstitutionModel , and in your example a way to get them is (for example the 1st model):

model=InMixed(model=YNGP_M2(...), numMod=1)

I let you check it works.

Cheers,
Laurent



Laurent Guéguen

unread,
Nov 14, 2018, 6:03:20 AM11/14/18
to Bio++ Usage Help Forum
Hi again,

I tested your program on your small example, using command

 ../LikelihoodComutationTest param=param.bpp > out_small.txt                        

I have a different output (less verbose), in which the likelihoods are similar (see joined file)
in both modellings.

Perhaps you have changed something in your bpp files?

Cheers,
Laurent

out_small.txt

Keren Halabi

unread,
Sep 2, 2019, 11:07:46 AM9/2/19
to Bio++ Usage Help Forum
Hi Laurent,

I am sorry to open this issue again, but I encountered it once again - this time on a small example with simple settings (PFA). 

Here is the output I receive:

Error! Non homogenous M2 with paths restrcitions yields different likelihood than homogenous M2 model
NonHomogenous M2 Log Likelihood: -107.459
BrLen0.................................: 0.01
BrLen1.................................: 0.01
BrLen2.................................: 0.02
BrLen3.................................: 0.03
BrLen4.................................: 0.04
YNGP_M2.kappa_1........................: 2
YNGP_M2.theta1_1.......................: 0.5
YNGP_M2.theta2_1.......................: 0.8
YNGP_M2.omega0_1.......................: 0.1
YNGP_M2.omega2_1.......................: 2


Honogenous M2 Log Likelihood: -105.748
BrLen0.................................: 0.01
BrLen1.................................: 0.01
BrLen2.................................: 0.02
BrLen3.................................: 0.03
BrLen4.................................: 0.04
YNGP_M2.kappa..........................: 2
YNGP_M2.theta1.........................: 0.5
YNGP_M2.theta2.........................: 0.8
YNGP_M2.omega0.........................: 0.1
YNGP_M2.omega2.........................: 2

The parameters appear to have the same values in both cases, but the resulting likelihood is different. I can also add that the transition probabilities for node 0 and site 0 are also different (please see example in the attached file).
These results are reproducible in a local clone of the devel branches, compiled with gcc-7.2.0.

Could you please let me know your thoughts on this? 

Many thanks!
Keren



testNonHomogenousModels.cpp

Keren Halabi

unread,
Nov 13, 2019, 4:06:22 AM11/13/19
to Bio++ Usage Help Forum
Hi dear Laurent,

I took another look at the output file you attached and I saw that you actually received the same input as me:

the likelihood value in line 61, which is an outcome of the computation that is based on site model M2 (-1511.3934048063) is different from the one in line 135, which is an outcome of the computation that is based on branch-site model that consists of two similar copies of the same M2, when restricting the mixture category to remain consistent throughout the phylogeny with respect to each site (-1364.83415882663).

I thought that these two settings would yield the same likelihood:

model = YNGP_M2(kappa=1,omega0=0.1,omega2=2,theta1=0.5,theta2=0.5,frequencies=F1X4, 123_Full.theta=0.2, 123_Full.theta1=0.3, 123_Full.theta2=0.8)
nonhomogeneous = "no

model1 = YNGP_M2(kappa=1,omega0=0.1,omega2=2,theta1=0.5,theta2=0.5,frequencies=F1X4, 123_Full.theta=0.2, 123_Full.theta1=0.3, 123_Full.theta2=0.8)
model2 = YNGP_M2(kappa=YNGP_M2.kappa_1,omega0=YNGP_M2.omega0_1,omega1=YNGP_M2.omega1_1,omega2=YNGP_M2.omega2_1,theta1=YNGP_M2.theta1_1,theta2=YNGP_M2.theta2_1,frequencies=F1X4, 123_Full.theta=0.2, 123_Full.theta1=0.3, 123_Full.theta2=0.8)
nonhomogeneous = general
nonhomogeneous.number_of_models = 2
nonhomogeneous.stationarity = yes
site.number_of_paths = 2
site.path1 = model1[YN98.omega_1]&model2[YN98.omega_1]
site.path2 = model1[YN98.omega_2]&model2[YN98.omega_2]

Maybe my intuition about this is wrong, but if you look at the possible events of transitions between omega categories in the latter case, due to the addition site.path1 and site.path2, I should accept exactly the same events considered in the first case. Could you please let me know where I'm going wrong?

Many thanks!
Keren


Laurent Guéguen

unread,
Nov 22, 2019, 4:11:18 AM11/22/19
to Bio++ Usage Help Forum
Dear Keren,

yes indeed it should provide the same likelihood. I need to fix this (soon I do hope).

Cheers,
Laurent

Laurent Guéguen

unread,
Nov 25, 2019, 4:01:15 AM11/25/19
to Bio++ Usage Help Forum
Hi,

thanks to Keren who found the bug, the problem is solved.

Cheers,
Laurent
Reply all
Reply to author
Forward
0 new messages