Positive selection results are very different after adding two outgroup species in branch-site models


Wenqiang

Apr 28, 2024, 3:28:13 AM
to PAML discussion group
Dear PAML authors,

Thanks for developing the great package!

I'm running branch-site tests for 4 species at genome scale, on about 8000 gene families. The branch of interest is the ancestor of three species. The rooted tree file looks like this:
((Hl_A,(Hl_B,Hl_C)) #1,HL)
The LRTs identified 300 gene families under positive selection.


I'm not sure whether 4 species are sufficient for the analysis, so I added two outgroup species from the same genus (within 150 MYA) and ran a similar analysis on ~5700 one-to-one gene families with unrooted trees. After the LRTs, only 21 families are under selection, and most of these genes do not overlap with those identified in the 4-species analysis. One example is shown in the following figure: the amino acid G in the last column is identified as positively selected in the 6-species model but not in the 4-species model. The codeml output is in the attachment, in which the bs.A directory is for the alternative model, bs.A1 is for the null model, and cmp.table.csv contains the LRT results.
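For readers following along, the LRT between the alternative (bs.A) and null (bs.A1) runs can be sketched as below. This is a minimal sketch, not the script used for cmp.table.csv: the function name and the lnL values are hypothetical, and it assumes the commonly used chi-square distribution with 1 degree of freedom as the reference distribution.

```python
import math

def branch_site_lrt(lnl_alt, lnl_null):
    """Return (2*delta_lnL, p-value) for one gene family.

    Assumes a chi-square reference distribution with 1 degree of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    # Slightly negative statistics can arise from optimisation noise; clamp to 0.
    stat = max(stat, 0.0)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p_value = branch_site_lrt(lnl_alt=-1234.56, lnl_null=-1237.90)
print(f"2*dlnL = {stat:.2f}, p = {p_value:.4f}")
```

With thousands of gene families, the per-family p-values would normally also be corrected for multiple testing before counting families under selection.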
 
[Attached figure: 无标题.png]

Is this behavior expected? I'm wondering how many species I should include in the analysis, or whether I made a mistake somewhere.
[Attachment: codeml.zip]

Sandra AC

Jun 23, 2024, 5:21:24 AM
to PAML discussion group
Hi there,

Thanks for your message! By adding two new species in your analysis, you have incorporated new sequences into your alignment and you have also changed your tree hypothesis. I re-ran your analyses with your input data and reproduced your results with both your input control file and the tailored control file for analyses under the branch-site models that contains only relevant options (see our GitHub repository for more info and our paper for more details on how the various analyses can be carried out).

I am not sure why adding these two particular sequences has had such an effect on your analysis, but you may want to compare the four- and six-species alignments and check which site patterns have changed (e.g., the sites identified as being under positive selection with six species but not with four). You may have already done so, but it is worth double-checking that your sequence file really is a codon alignment, that there are no stop codons, and that the sequences are not contaminated and have been properly filtered. Note that you are also changing your tree hypothesis: your tree topology with four taxa is rooted, while the tree with six taxa is unrooted and has a different root. Try to understand how adding these two species affects your analysis, what your biological question is, and whether the six-taxon or the four-taxon hypothesis can answer it.
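The basic alignment checks mentioned above (codon alignment, no stop codons) could be automated along these lines. This is only a sketch: the function name is hypothetical, it assumes the standard genetic code, and it expects the alignment to be already loaded as a dict of name to aligned nucleotide string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def check_codon_alignment(seqs):
    """Return a list of human-readable problems found in an aligned CDS set."""
    problems = []
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in seqs.items():
        seq = seq.upper()
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        # Walk the sequence codon by codon (any trailing partial codon is skipped).
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                # Reported wherever it occurs, including at the very end.
                problems.append(f"{name}: stop codon {codon} at codon {i // 3 + 1}")
            elif "-" in codon and codon != "---":
                problems.append(f"{name}: partial gap {codon} at codon {i // 3 + 1}")
    return problems

# Hypothetical toy alignment, for illustration only.
toy = {"Hl_A": "ATGGGATAA", "Hl_B": "ATGGG-AAA"}
for problem in check_codon_alignment(toy):
    print(problem)
```

Running the same check on the four- and six-species versions of a gene family makes it easier to see whether the added sequences introduced frame problems.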

By understanding better what these newly added sequences are, you may be able to explain the results you are observing and give a biological explanation for them :)

Hope this helps!
S.

Janet Young

Jun 24, 2024, 4:26:58 PM
to PAML discussion group
hi Wenqiang,

I totally agree with Sandra's great advice!  A few thoughts to add to that:

4 or 6 species is a low number: I think your power to detect positive selection is probably going to be weak either way.    

Our lab's experience is mostly with site models (not branch-site), but I suspect the same will apply - when we initially run PAML with a small number of orthologs and find positive selection, adding more species sometimes confirms the initial finding (with stronger statistical support) but sometimes it goes away. 

You might also be interested in this paper (also site models): https://pubmed.ncbi.nlm.nih.gov/25556235/. I like the way they tested various genes with various numbers of species. I thought Fig 2C was particularly interesting: for 3 genes where they don't know the "correct" answer (selection or not), the statistical signal of selection can increase or decrease with more species. It also matters exactly which species they choose (not just how many).

I think your instinct is good: actually look at the alignments for sites predicted to be rapidly evolving. If you look at a handful of genes found in your 4-species analysis, and 5 genes found in your 6-species analysis, do you believe one set more than the other?
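One way to eyeball a predicted site in both alignments is to pull out the codon column for each species. The snippet below is a rough sketch with a hypothetical helper and made-up sequences (Out1 and Out2 are placeholder names for the two outgroups); it assumes the site has the same 1-based codon index in both alignments, which may not hold if the alignments were built separately.

```python
def codon_column(seqs, site):
    """Return {name: codon} at 1-based codon position `site`."""
    start = (site - 1) * 3
    return {name: seq[start:start + 3].upper() for name, seq in seqs.items()}

# Hypothetical toy alignments, for illustration only.
four_species = {"Hl_A": "ATGGGA", "Hl_B": "ATGGGA", "Hl_C": "ATGGGT", "HL": "ATGGCA"}
six_species = dict(four_species, Out1="ATGGCA", Out2="ATGGCC")

for label, aln in (("4 species", four_species), ("6 species", six_species)):
    print(label, codon_column(aln, site=2))
```

Seeing the outgroup codons next to the ingroup ones shows whether the added species change the inferred direction of the substitution on the foreground branch.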

It is also good to look at the typical total branch length (dS) for your 6-species and 4-species sets. Be careful that you are not running into saturation: too much synonymous divergence will make it difficult for PAML to give reasonable results. Also, in the 4-species set, be careful that you have ENOUGH mutations. One of Ziheng's early papers about statistical power with various tree lengths is useful (this one, I think: https://pubmed.ncbi.nlm.nih.gov/15514074/).
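If you have a Newick tree with estimated branch lengths for each gene family, the total tree length can be summed with a few lines of Python. This is a sketch under assumptions: the function name and the example tree are hypothetical, and it assumes the branch lengths in the string are already in the units you care about (e.g. dS).

```python
import re

# Matches the number following each ':' in a Newick string.
BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")

def total_tree_length(newick):
    """Sum all branch lengths found in a Newick string."""
    return sum(float(x) for x in BRANCH_LENGTH.findall(newick))

# Hypothetical tree with made-up branch lengths, for illustration only.
tree = "((Hl_A:0.10,(Hl_B:0.02,Hl_C:0.03):0.05) #1:0.04,HL:0.20);"
print(f"total tree length = {total_tree_length(tree):.2f}")
```

Collecting this value across all gene families for both data sets gives a quick view of whether the 4-species trees are very short or the 6-species trees are very long.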

all the best,

Janet

