Correct usage of codon models with rate heterogeneity across sites and mX(a) models estimating omega.

21 views
Skip to first unread message

Karolis Ramanauskas

unread,
Jul 19, 2021, 12:18:26 PM7/19/21
to bali-phy-users
Hi once again,

I have been running and comparing the output of various codon models with and without rate heterogeneity. Additionally, I have been attempting to test for positive selection by passing the omega parameter to mX(a)_test models. Below is a selection of model combinations I have running, alphabet="Codons", imodel="rs07" in all cases; smodel definitions are as follows:

1. gtr+fMutSel
2. gtr+fMutSel+Rates.gamma[4]
3. function[w,gtr+fMutSel[w]]+m8a_test
4. function[w,gtr+fMutSel[w]]+m8a_test+Rates.gamma[4]


I have also tried gy94 and mg94 (and their variants) instead of fMutSel. I have noticed an apparently significant difference in the estimates of the Rates.gamma:alpha parameter between the models 2 and 4 (currently 1.768 and 6.734, respectively) which does not look right. I understand that the ordering of functions determines how they are interpreted and nested by BAli-Phy. Did I specify the models incorrectly? My intention was to apply the discrete gamma distribution of rates between codon sites, but I am concerned that this may not be the case in model 4.

Thanks in advance,
Karolis

Benjamin Redelings

unread,
Jul 19, 2021, 1:14:01 PM7/19/21
to bali-ph...@googlegroups.com

Hi Karolis,

Ah, nice to see you are experimenting with interesting models!

Your models 2 and 4 below are not equivalent, so the Rates.gamma:alpha values cannot really be expected to have the same value.  To explain in more detail:

2. gtr+fMutSel+Rates.gamma[4]
This a 4-component mixture, where different components run at different rates.

In this model, Rates.gamma:alpha describes differences in conservation AND differences in the synonymous rate ....... which are assumed to be the same thing.

This is not a completely realistic model because it will use rate variation to explain more conserved codons.  That means that, for more conserved codons, the 3rd codon position will also evolve slower, which should not be the case.

3. function[w,gtr+fMutSel[w]]+m8a_test

This is also a 4-component mixture, but different components have different omega values instead of different rates.

In this model, m8a_test:gamma describes differences in *conservation*.

This is more realistic than your model #2, because the more-conserved components will change amino-acids at a slower rate, but the 3rd codon position will not evolve slower at more conserved sites.

4. function[w,gtr+fMutSel[w]]+m8a_test+Rates.gamma[4]

This is a 16-component mixture.  Mixture components have BOTH different dN/dS values AND different rates.

In this model, m8a_test:gamma describes differences in *conservation*, and Rates.gamma:alpha describes differences in the *synonymous rate*.

This model has both different omega values AND different rates.  This basically extends model 3 to allow differences in the SYNONYMOUS rate at different codons, in addition to different ratios between the synonymous and non-synonymous rates.

In this model, the Rates.gamma:alpha is NOT responsible for describing differences in conservation.  The m8a_test part of the model is responsible for describing differences in conservation.  So the Rates.gamma:alpha value is expected to be higher (indicating less variation) than in your #2 model, because it is not handling variation in conservation, but only variation in the synonymous rate.

Does that make sense?

-BenRI

P.S. You subscribed to the list with your gmail address, but since you have been sending messages from your UIC address, this might explain why you are not getting replies.  (Also messages are not getting sent until I approve them).  Maybe you should subscribe both addresses, but disable e-mail delivery for one of them?  Not completely sure...

P.P.S.  When I tested out these models, I used fMutSel0, which can be a bit faster.  It has fewer parameters to estimate, since it does not allow different codons for the same amino acid to have different fitnesses.

--
You received this message because you are subscribed to the Google Groups "bali-phy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bali-phy-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bali-phy-users/5dad3436-2eb5-4649-a575-66fcb650bb2cn%40googlegroups.com.

Karolis Ramanauskas

unread,
Jul 19, 2021, 3:31:11 PM7/19/21
to bali-phy-users
It does make sense, thank you for the detailed explanation.

I have a related question, see the attached Tracer output of relevant parameter distributions for models #3 and #4, defined earlier. Granted for #4, this is a very early phase of the chain, but I have 64 chains running. The total sample size is 22192. What is interesting is that the m8a_test:posSelection is virtually always 0. While the m8a_test:posP is 0.5904 and the m8a_test:posW is 3.1587. The m8a_test:posSelection estimate for #3 is 0.6938, m8a_test:posP=0.4462, m8a_test:posW=1.8241. Out of 22192 samples in #4, there are only 51 samples of m8a_test:posSelection=1.

Is this simply a result of a small number of samples per chain in #4? My assumption was that early in the chain m8a_test:posSelection should sample 0 or 1 with probability close to 0.5. As it stands now, #4 is a little too parameter-rich anyway, I do not think I will be able to accrue enough samples for it to ever converge in a reasonable amount of time. I just do not understand the behavior, and it bothers me.

Sorry if this is a silly question.

P.S. These are concatenated chains with no burn-in; 64 for #4 and 16 for #3.


fMutSel_rgamma04_m8atest__posSelection.pngfMutSel_m8atest__posSelection.pngfMutSel_rgamma04_m8atest__posW.pngfMutSel_m8atest__posW.pngfMutSel_rgamma04_m8atest__posP.pngfMutSel_m8atest__posP.png

Benjamin Redelings

unread,
Jul 19, 2021, 3:47:50 PM7/19/21
to bali-ph...@googlegroups.com
Hi Karolis,

So, in the m8a_test model, posP is the fraction if neutral or positively-selectrd sites. When posSelection is zero, posP is the fraction of neutral sites.  The posterior distribution of posP clearly does not include 0. So, I think the model clearly supports a substantial fraction of neutral or positively selected sites. From what you have said, these may be neutral sites.

Note that the m2a model has separate proportions for neutral and positively-selectrd sites. The m2a model was developed AFTER the m8a model. It would be possible to extend the m8 model to have separate categories for neutral and positively-selected sites....

If posSelection is 0, posW should be sampled from the prior. If you look, you will see that posW has a large mean, but has a peak at 1.

-BenRI




Karolis Ramanauskas

unread,
Jul 19, 2021, 4:58:07 PM7/19/21
to bali-phy-users
Thank you,

I fundamentally misunderstood m8/m8a models. This makes perfect sense.

Karolis

Reply all
Reply to author
Forward
0 new messages