What is the difference when testing branches by labeling only the tips, versus labeling deeper branches?


DNAngel

Apr 4, 2019, 10:27:03 PM
to PAML discussion group
I am curious about the main difference, when running clade models, between labeling the tip branches of the species of interest within a clade (with #1) and labeling their ancestral branch with $1, which also covers the branches leading to the tips. Does codeml reconstruct ancestral sequences at the deeper nodes and calculate dN/dS from those? I would have thought that when only the tips are labeled it would not need to "reconstruct" the sequences at ancestral nodes.

For example, see the attached image (labeling.jpg).

Ziheng

Aug 18, 2019, 6:57:29 AM
to PAML discussion group
This is explained in the PAML documentation.

#1 labels a single branch, while $1 labels all branches within the clade, including the branch leading to the clade. The following two trees are the same:

(((rabbit, rat) $1, human), goat_cow, marsupial);
(((rabbit #1, rat #1) #1, human), goat_cow, marsupial);

and both are different from 
(((rabbit, rat) #1, human), goat_cow, marsupial); 
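
The labeling rule above can be sketched in a few lines of Python. This is only an illustration, not codeml's own parser: the function names are hypothetical, the tree is assumed to have no branch lengths, and the precedence rules (an explicit # label wins over an inherited $ label, and a nested $ overrides an outer one) are my reading of the PAML documentation.

```python
import re

# Tokens: punctuation, a #k / $k label, or a taxon name.
TOKEN = re.compile(r"\s*([(),;]|[#$]\d+|[^\s(),;#$]+)")
PUNCT = {"(", ")", ",", ";"}

def parse(newick):
    """Parse a labeled Newick string (no branch lengths) into nested dicts."""
    tokens = TOKEN.findall(newick)
    pos = 0

    def node():
        nonlocal pos
        n = {"name": "", "label": None, "children": []}
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            n["children"].append(node())
            while tokens[pos] == ",":
                pos += 1
                n["children"].append(node())
            pos += 1                      # consume ")"
        # Anything up to the next punctuation is this node's name and/or label.
        while pos < len(tokens) and tokens[pos] not in PUNCT:
            tok = tokens[pos]
            if tok[0] in "#$":
                n["label"] = tok
            else:
                n["name"] = tok
            pos += 1
        return n

    return node()

def expand(n, inherited=None):
    """Turn every $k clade label into #k labels on all branches of that clade."""
    label = n["label"]
    if label and label[0] == "$":
        inherited = "#" + label[1:]       # applies to this branch and all below it
        label = inherited
    elif label is None:
        label = inherited                 # unlabeled branch inside a $k clade
    kids = [expand(c, inherited) for c in n["children"]]
    return {"name": n["name"], "label": label, "children": kids}

def to_newick(n):
    s = "(" + ", ".join(to_newick(c) for c in n["children"]) + ")" if n["children"] else ""
    s += n["name"]
    return s + " " + n["label"] if n["label"] else s

def expand_clade_labels(newick):
    return to_newick(expand(parse(newick))) + ";"

print(expand_clade_labels("(((rabbit, rat) $1, human), goat_cow, marsupial);"))
# (((rabbit #1, rat #1) #1, human), goat_cow, marsupial);
```

A tree that uses only #1 on the ancestral branch passes through unchanged, which is why it differs from the two equivalent trees.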

Your intuition is o.k., but you would need to reconstruct ancestral sequences even if you only wanted to estimate dN/dS for the tip branches, because you need to know the sequences at both ends of a branch. Anyway, codeml is an ML method, so it averages over all possible ancestral reconstructions. If you want to know the details, you can read a book that covers likelihood calculation on a tree, for example chapter 4 of Yang (2014).
Yang Z. 2014. Molecular Evolution: A Statistical Approach. Oxford University Press, Oxford, England.
ziheng
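
As a rough illustration of what "averaging over all possible ancestral reconstructions" means, here is a minimal Python sketch. It uses the JC69 nucleotide model rather than the codon models in codeml, and the three-tip tree, branch lengths, and function names are all made up for the example. The site likelihood is a sum over every combination of states at the two ancestral nodes, so no single reconstruction is ever chosen, and the ancestral states enter the calculation even for the tip branches.

```python
import itertools
import math

STATES = "ACGT"

def jc69(i, j, t):
    """JC69 probability of going from state i to state j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(a, b, c):
    """Likelihood of tip states (a, b, c) on the hypothetical tree
    ((A:0.1, B:0.2):0.05, C:0.3), summed over the unknown ancestral states:
    x at the root and y at the ancestor of A and B."""
    total = 0.0
    for x, y in itertools.product(STATES, repeat=2):
        total += (0.25 * jc69(x, y, 0.05)                    # root frequency, root -> y
                  * jc69(y, a, 0.1) * jc69(y, b, 0.2)        # tip branches to A and B
                  * jc69(x, c, 0.3))                         # tip branch to C
    return total

print(site_likelihood("A", "A", "A"))
print(site_likelihood("A", "C", "G"))
```

Real programs get the same sum much more efficiently with the pruning algorithm instead of enumerating every combination.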

DNAngel

Aug 22, 2019, 7:24:33 PM
to PAML discussion group
Thank you Dr. Ziheng!

I have a follow-up question based on both diagrams for hypothetical gene A. If the second diagram, where the whole clade is labeled as the foreground, shows positive selection while the first diagram (tips only) does not, can this be explained by saying that positive selection occurred earlier in the clade, prior to the species divergence? What is the biological meaning?

I have seen papers that explain that the branch LEADING to a clade experienced positive selection (so that all species share the amino acid substitutions that changed the gene), but I don't know whether they labeled just the branch leading to the whole clade (#1) or everything highlighted ($1). I really want to understand these papers. If they say "branch leading to", can I expect it to be just a #1 on one branch? Otherwise, if it was tested with $1, how does one explain the pattern of selection (i.e. that the whole clade experienced positive selection)? That doesn't make sense to me...

Ziheng

Oct 27, 2019, 1:14:38 PM
to PAML discussion group

"branch leading to" 

That should mean labeling just one branch, the one ancestral to the clade. Yes, it should be #1.

$1 additionally labels all branches inside the clade. It is useful when you want to label many branches in the same clade.
Suppose flu jumps from swine to humans, and we imagine that different amino acid positions are under positive selection in different hosts. Then one might use $1 to label all branches in the human clade.

I think most often the positive selection models in codeml do not make biological sense. They are simple models that do averaging, in the same way that we often calculate an average for a whole sample even if we know the sample is heterogeneous. For example, the models in effect assume that the aa1 -> aa2 mutation and the aa2 -> aa1 mutation have the same fitness, so that when one is advantageous the opposite mutation is advantageous as well. This must be highly unrealistic most of the time and tends to mask the signal of positive selection.
ziheng
