Question on branch labeling with two species (multiple strains)

Ville Hoikkala

unread,

Mar 24, 2022, 3:54:20 PM3/24/22

to PAML discussion group

Hi,

I'm not sure if my approach is correct, so I would like some guidance 😊

I have species A and B (diverged roughly 2 mya), with four strains in A and six in B (10 samples total). I determined orthogroups using Orthofinder and extracted single copy genes only. I would like to see if there are genes that are under positive selection in either species using branch-site models (the tree is an unrooted version of the species tree output by Orthofinder). I will run two batches:

Batch 1: label species A strains as #1 and species B as #0

Batch 2: label species B strains as #1 and species A as #0

Both batches would include a null model (model = 0, NSites = 0) and a BS-model (model = 2, NSites = 2) as an alternative model. LRT would be performed afterwards.

The end goals are 1) to see which genomic regions contain genes that are under selection and 2) to see if specific functional groups are overrepresented in selection.

My specific questions:

1. Is it worth doing two batches like this or would a single run already reveal selection?

2. Is an outgroup required? Can I even do this reliably with this sample size (I have roughly 8000 single copy genes)?

3. In my preliminary run I got many genes with dnds at 999. If my LRT shows significance, can I trust this gene is under pos. selection?

thank you,

Ville

Ziheng

unread,

Apr 3, 2022, 11:29:57 AM4/3/22

to PAML discussion group

Batch 1: label species A strains as #1 and species B as #0

Batch 2: label species B strains as #1 and species A as #0

The analysis you outlined looked ok. yes, you need to do this in 2 batches, labelling one clade as the foreground each time.

you didn't specify the null model for the branch-site test correctly. the two models are as follows

H0: model = 2, NSites = 2, fix_omega = 1, omega = 1

Ha: model = 2, NSites = 2, fix_omega = 0

if you have an outgroup that is not too distant, it should be useful to include it.

otherwise you need to be careful about branch labelling, as the tree should be unrooted.

if the dn/ds ratio is 999, the upper limit set by the program, the estimate is unreliable. the LRT is still fine despite the difficulty of estimate the dn/ds ratio.

i suspect that the reliability of the analysis is affected by the levels of sequence divergence. i assume that the two species are not too far away (<30% of sequence divergence, say), but what about the strains in each species? if the strains are nearly identical the sequences won't have much info, and then the test will not be significant simply due to lack of info in the data.

best wishes,

ziheng

Ville Hoikkala

unread,

Apr 4, 2022, 8:32:11 AM4/4/22

to PAML discussion group

Thank you very much for the response.

It seems the analysis fails when species A (four strains that are highly similar) is set as foreground. Codeml produces the result files, but leaves out roughly half of the results (including omega values and likelihoods). Vice versa labeling (species B as #1) produces intact result files.

One further verification: If I include an outgroup, is the outgroup labeled as background (#0) in both batches?