Why strong positive selection in short BUT strong negative selection in large protein


Arthur Tugume

Jul 29, 2022, 9:18:38 AM7/29/22
to PAML discussion group
Hi ALL,
I am wondering why PAML gives different results for a protein expressed as a fusion protein - a kind of hybrid protein.

Here is the thing:
At first, I applied the models M0, M1a, M2a, M7, M8 and M8a to two coding sequences - let me name them seq #1: ABCDEF and seq #2: CD - where CD is a different short ORF inserted inside the ORF of seq #1 by frame-shifting. My hypothetical naming implies how the sequences relate in their nt and aa alignments.

Later I learnt that seq #2 (i.e., CD) is not expressed in nature as CD but as a fusion protein ABCD (I call it seq #3). In other words, although the seq #2 ORF exists, its product is not detected in cells on its own but only in the form of ABCD. So I had to redo the PAML analysis with three different coding sequences:

1. ABCDEF
2. CD (this protein has not been detected in nature)
3. ABCD

What is confusing me is that sequence CD, which has not been detected in nature, shows strong positive selection at numerous amino acid sites. I do not even want to discuss such results, because this protein has NOT been detected in nature, so one cannot discuss its biology.

However, when ABCD (which indeed occurs in nature) is analyzed including the domains of CD, the aa sites that previously showed positive selection in the short CD are no longer detected. In fact, everything is now under strong negative selection, and none of the nested models show evidence of positive selection. In contrast, when the short CD is analyzed alone, strong positive selection is detected at numerous sites, backed by the log-likelihood comparisons of nested models. I expected the positively selected aa sites in CD also to be detected in ABCD.

What could explain this? Has any of you experienced something similar?

Looking forward to your feedback!

Arthur Tugume

Sandra AC

Dec 12, 2022, 6:25:08 AM12/12/22
to PAML discussion group
Hi Arthur,

In this case, it is always best to first check whether there have been any issues with the CODEML settings, the format of the input files, and how the input files were generated:

1. Have you made sure that the alignment has been inferred correctly and that you are using a codon alignment in PHYLIP format? In other words, have you removed stop codons, made sure that you have deleted or masked unreliable alignment regions, etc.?
2. Have you removed the branch lengths from the tree file you are using? 
3. You are running site models, so you need to use an unrooted tree. Is this the case?
4. Have you checked that you are enabling the correct settings to run M0, M1a, M2a, M7, and M8? E.g., `model = 0` and `NSsites = 0 1 2 7 8` to run these site models in a batch, `clock = 0` as no clock is assumed, `fix_omega = 0` and `omega = 0.5` so that omega is estimated, `seqtype = 1` for codon data, and `cleandata = 0` so that sites with ambiguity data are not removed. You may want to use the mutation-selection model with observed codon frequencies used as estimates (`CodonFreq = 7` and `estFreq = 0`), as this model accounts for the mutational bias and selection affecting codon usage. Note that, if you are running M8a (the null model), you will need to keep `model = 0` and set `NSsites = 8`, but then change `fix_omega = 1` and `omega = 1`, as you are fixing omega at 1.
5. Have you run LRTs and compared M0 vs M1a, M1a vs M2a, and M7 vs M8 to check whether the results observed are significant? The LRT might not be significant for some of these model comparisons.
6. Are you checking the BEB results in the CODEML report for the models of positive selection (i.e., M2a and M8)? Remember that the BEB method is an improvement over NEB and accommodates the uncertainties in the MLEs. While CODEML reports results from both methods, BEB should be used instead of NEB. If any sites are listed under the BEB analysis, you should only consider sites whose posterior probability is larger than 95% (marked with "*") or than 99% (marked with "**"). Also, note that the BEB calculation does not take place under the null models (e.g., M1a or M7), so you will only see it under the models of positive selection (e.g., M2a and M8).
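For reference, the settings discussed in point 4 can be collected into a minimal `codeml.ctl` sketch (the file names are placeholders, and only the options mentioned above are shown):

```
      seqfile = alignment.phy   * codon alignment, PHYLIP format
     treefile = tree.nwk        * unrooted tree, branch lengths removed
      outfile = results.txt

      seqtype = 1               * codon data
        model = 0               * site models (one omega across branches)
      NSsites = 0 1 2 7 8       * M0, M1a, M2a, M7, M8 in one batch
        clock = 0               * no molecular clock
    fix_omega = 0               * estimate omega (for M8a: fix_omega = 1, omega = 1, NSsites = 8)
        omega = 0.5             * starting value for omega
    CodonFreq = 7               * mutation-selection model
      estFreq = 0               * use observed codon frequencies as estimates
    cleandata = 0               * keep sites with ambiguity data
```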
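For point 5, the LRT statistic and its p-value can also be checked by hand. A small sketch (the log-likelihood values are made up for illustration; df = 2 for M1a vs M2a and M7 vs M8, df = 1 for the M8a vs M8 comparison):

```python
import math

def lrt_pvalue(lnl_null, lnl_alt, df):
    """P-value of 2*(lnL_alt - lnL_null) under a chi-square distribution.

    Closed forms exist for the two cases needed here:
    df = 1 -> erfc(sqrt(x/2)); df = 2 -> exp(-x/2).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

# Made-up log-likelihoods from a hypothetical CODEML run:
p_m1a_m2a = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-996.0, df=2)  # stat = 8.0
print(f"M1a vs M2a: p = {p_m1a_m2a:.4f}")  # p ≈ 0.0183, significant at 5%
```

The same function applies to M7 vs M8 (df = 2) and, with df = 1, to M8a vs M8.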

Hope this helps to narrow down the possibility of any technical issues!

All the best,
Sandra

Arthur Tugume

Dec 13, 2022, 9:53:24 PM12/13/22
to PAML discussion group
Dear Sandra,
Thank you for responding to my questions. Interestingly, the analysis was implemented exactly as you describe, including the settings in the input files - the ctl file, tree file and sequence file. I took care to ensure that is how the analysis was done, but the problem remains. The analysis runs without problems, but the output/results are hard to interpret.

My query is this: if you analyze the short protein-coding sequence, PAML identifies amino acids under positive selection, but different amino acids are identified when a longer sequence (with the same number of sequences) containing the same ORF is used.

It is still a paradox to me, unless the length of the sequence in some way influences the analysis, such that these models also account for the context of amino acids in the protein, in which case the pressures in a large protein would differ from the conditions in a short protein.

Just to illustrate my concern further:

1. CMYFKRCSTGMTPSKY - this is the imaginary long protein
2. RCSTGMT - this protein has not been detected in nature; but analysis shows positively selected aas
3. CMYFKRCSTGMT - this is the protein that is detected in nature; it shows positive selection on aas but different ones from what is identified for protein 2.

I still need assistance

Arthur Tugume

Sandra AC

Dec 15, 2022, 7:17:55 AM12/15/22
to PAML discussion group
Hi Arthur, 

I may be wrong but, if "RCSTGMT" does not occur in nature, the idea of testing it for positive selection might not be entirely appropriate. Perhaps this sequence only has "a purpose" when it is part of seq1 and seq2 (i.e., it will affect the way seq1 and seq2 fold and also the functionality of the final translated proteins). When "RCSTGMT" is part of seq1 or seq2 (i.e., seq1 and seq2 will have a specific way of folding, and "RCSTGMT" will contribute to that folding), you can test whether any sites are positively selected using site models. If you detect any positively selected sites in seq1 and seq2, it would be a good idea to see where these sites lie in the 3D structure of seq1 and seq2 and what types of interactions they form with other AAs, which might explain the results you get. If "RCSTGMT" is not detected in cells, it might not even be a protein-coding sequence, and hence testing for positive selection on this sequence might not be correct.

Again, I may be wrong and other people might have other advice for you to follow, but this is what I can think of at the moment!

All the best,
Sandra

Janet Young

Dec 15, 2022, 6:09:55 PM12/15/22
to PAML discussion group
hi Arthur,

I noticed your original question says that the portion that's not detected in nature is the result of frameshifting. Can you clarify? Is it encoded by the same sequence that normally encodes something else in a different reading frame, or is it normally intronic?

Overlapping reading frames will totally mess up the assumptions of PAML. Perhaps dS appears to be very very low in your non-natural protein region because the "synonymous" sites are actually acting as non-synonymous sites in the other reading frame.
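Janet's point can be illustrated with a toy example (the sequence is invented, and the codon table covers only the codons used): a third-position change that is synonymous in one reading frame sits at the second position of the overlapping +1-shifted codon, where it changes the amino acid.

```python
# Minimal codon table covering only the codons in this toy example.
CODONS = {"GGT": "Gly", "GGA": "Gly", "GTC": "Val", "GAC": "Asp"}

def translate(seq, frame):
    """Translate the complete codons of seq starting at offset `frame`."""
    return [CODONS[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)]

before = "GGTC"   # frame 0 codon: GGT; frame +1 codon: GTC
after  = "GGAC"   # same sequence with a T -> A change at the frame-0 third position

print(translate(before, 0)[0], "->", translate(after, 0)[0])  # Gly -> Gly: synonymous
print(translate(before, 1)[0], "->", translate(after, 1)[0])  # Val -> Asp: non-synonymous
```

So a site that codeml would count as "synonymous" in one frame can be under protein-level constraint in the other frame, deflating dS in exactly the way described above.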

I would stay away from using any dN/dS metrics at all when there's overlapping reading frames.

Even if there weren't the overlapping reading frame question, I think I know why codeml is giving you different results when you take different subregions of an alignment, especially if they are small regions.  Here's how I think of it:  the codeml sitewise models involve categorizing sites into some number of classes, where each site in the class has the same dN/dS.  E.g. M8 might have 11 categories (depending on your ncatG setting), M0 has a single class, etc.   Codeml uses the entire input alignment to try to figure out the dN/dS of the classes. The additional sites in the longer alignment influence the dN/dS estimates of the classes, just like the sites in the sub-region do.  In turn those changes in the class dN/dS estimates influence the sitewise estimates that BEB comes out with - if I understand right the BEB estimates aren't free estimates of the dN/dS of each site, they are using the prior probabilities to get some sort of weighted average dN/dS over classes.
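The weighted-average idea can be sketched numerically (the class omegas and posterior probabilities below are invented, not CODEML output): a site's estimate is a posterior-weighted mean over the class omegas, so re-estimating the classes on a longer alignment shifts every site's value even if that site's data did not change.

```python
def posterior_mean_omega(class_omegas, site_posteriors):
    """Posterior mean dN/dS for one site: a weighted average over site classes."""
    assert abs(sum(site_posteriors) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(class_omegas, site_posteriors))

# Invented class estimates from a short alignment vs a longer one:
short_classes = [0.1, 1.0, 4.0]
long_classes  = [0.1, 0.8, 1.5]   # extra sites pulled the top class estimate down
site_post     = [0.2, 0.3, 0.5]   # posterior class probabilities for one site

print(posterior_mean_omega(short_classes, site_post))  # ≈ 2.32: looks positively selected
print(posterior_mean_omega(long_classes, site_post))   # ≈ 1.01: looks neutral
```

The same site can therefore move from "positively selected" to "neutral" purely because the surrounding alignment changed the class estimates.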

There are other tools you might use if you want to estimate the dN/dS of each site independently.  I'm no expert on the HyPhy/Datamonkey algorithms, but their FEL or FUBAR may be worth a try. http://www.datamonkey.org.   I would expect those tools might give more consistent results whether you analyze one longer alignment, or a shorter sub-alignment: they're not trying to categorize sites like codeml does. But again, if there are overlapping reading frames I would stay away from the dN/dS metric entirely.

Hope that helps,

Janet

Ziheng

Feb 19, 2023, 11:27:13 AM2/19/23
to PAML discussion group
Yang Z, Lauder IJ, Lin HJ. 1995. Molecular evolution of the hepatitis B virus genome. J Mol Evol 41:587-596.

In case anybody is interested, this old paper partitioned sites in the HBV genome with protein-coding genes in overlapping reading frames into different partitions and estimated rates for the partitions. There was some discussion about which gene might be more conserved than which.
There are also codon models with overlapping reading frames, but they are not computationally practical even for two sequences.
best, 
ziheng
