config.summary of Canopy

Wandinger, Caroline Sophie

Oct 15, 2019, 12:40:45 PM

to canopy_p...@googlegroups.com, Mughal, Sadaf Shabbir, yuc...@email.unc.edu

Dear all,

Thank you for providing the Canopy package workflow for analyzing tumor heterogeneity. I am currently working with it and have a question concerning the config.summary of the posterior tree evaluation step.

Within the provided code, the following explanations for the config.summary are given:

# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space

I am wondering whether these explanations are actually correct. To me it seems that the second column gives the proportion of configuration x among all configurations that remain after applying post.config.cutoff. When this threshold is changed, the probabilities change as well, according to the number of tree configurations that pass the threshold. This would mean that the probability refers to the subtree space defined by post.config.cutoff and not to the entire tree space, wouldn't it?

However, the third column gives the tree likelihood which is estimated within the MCMC sampling approach and therefore indeed in the entire tree space. Also the header of the config.summary stating "Mean_post_lik" for the third column confuses me, as the canopy.post function seems to take the (tree with the) maximum likelihood among all trees with the same configuration, but not a mean value of all likelihoods (of those trees with the same configuration).

I also would like to know from a statistical point of view, why the tree with the highest likelihood is chosen, even if this tree has a configuration which is less probable concerning all configurations under the threshold post.config.cutoff.

Many thanks to you in advance!

Kindest regards,

Caroline

Jiang, Yuchao

Oct 15, 2019, 3:24:33 PM

to Wandinger, Caroline Sophie, canopy_p...@googlegroups.com, Mughal, Sadaf Shabbir, Jiang, Yuchao

Hi Caroline,

Thanks for your interest in Canopy and your insightful questions. See below for my responses. I hope they are helpful to you.

Good luck!

Yuchao

On Oct 15, 2019, at 12:40 PM, Wandinger, Caroline Sophie <c.wan...@dkfz-heidelberg.de> wrote:

Dear all,

Thank you for providing the Canopy package workflow for analyzing tumor heterogeneity. I am currently working with it and have a question concerning the config.summary of the posterior tree evaluation step.

Within the provided code, the following explanations for the config.summary are given:

# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space

I am wondering whether these explanations are actually correct. To me it seems that the second column gives the proportion of configuration x among all configurations that remain after applying post.config.cutoff. When this threshold is changed, the probabilities change as well, according to the number of tree configurations that pass the threshold. This would mean that the probability refers to the subtree space defined by post.config.cutoff and not to the entire tree space, wouldn't it?

Yes, you are absolutely correct. We included the threshold, post.config.cutoff, specifically for situations where one has many mutations as input. In such cases, changing a single mutation results in a new configuration (here the configuration includes not only the tree topology but also the mutational profiles along the tree branches). If you want the second column to reflect the probability over the entire sampled space, you can set the threshold to 0.
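
The effect of the cutoff described above can be sketched with a toy example. This is a plain Python illustration, not Canopy's R code: the function name `config_probabilities`, the sample labels, and the choice to keep configurations whose sampled frequency is at least the cutoff (rather than strictly greater) are all assumptions made for the sketch.

```python
from collections import Counter


def config_probabilities(config_ids, cutoff):
    """Toy model of a frequency cutoff followed by renormalization.

    config_ids: one configuration label per sampled tree (hypothetical input).
    cutoff: minimum sampled frequency a configuration needs to be kept.
    Returns {configuration: probability among the kept configurations}.
    """
    counts = Counter(config_ids)
    total = len(config_ids)
    # Keep configurations whose frequency over ALL samples passes the cutoff
    # (assumption: ">=" comparison).
    kept = {c: n for c, n in counts.items() if n / total >= cutoff}
    kept_total = sum(kept.values())
    if kept_total == 0:
        return {}
    # Renormalize over the kept configurations only, so probabilities sum to 1.
    return {c: n / kept_total for c, n in kept.items()}


# Ten sampled trees falling into three configurations.
samples = ["A"] * 6 + ["B"] * 3 + ["C"] * 1

print(config_probabilities(samples, cutoff=0.0))  # all configurations kept
print(config_probabilities(samples, cutoff=0.2))  # "C" dropped, rest rescaled
```

With the cutoff at 0, configuration A has probability 0.6 over the whole sampled space. With the cutoff at 0.2, C is dropped and A rises to 6/9, which is why the second column changes with the threshold.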

However, the third column gives the tree likelihood which is estimated within the MCMC sampling approach and therefore indeed in the entire tree space.

Also the header of the config.summary stating "Mean_post_lik" for the third column confuses me, as the canopy.post function seems to take the (tree with the) maximum likelihood among all trees with the same configuration, but not a mean value of all likelihoods (of those trees with the same configuration).

For each iteration of the MCMC, we calculate the likelihood. For trees with the same configuration, we calculate the mean of their likelihoods and output it as Mean_post_lik. Their configurations are the same, but their VAFs etc. can differ, resulting in slightly different likelihoods. Here you could also pick the tree with the highest likelihood instead of the mean. I am not sure I fully understand the latter part of your question. I don’t think my script contains a bug, but it has been a while since I wrote it.
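
The difference between the two summaries discussed here (mean versus maximum likelihood per configuration) can be illustrated with made-up numbers. This is a Python sketch under stated assumptions, not Canopy's implementation: `summarize_likelihoods` is a hypothetical helper, and the values are invented log-likelihoods chosen so that the two summaries disagree.

```python
from collections import defaultdict
from statistics import mean


def summarize_likelihoods(config_ids, log_liks):
    """Group per-sample likelihoods by configuration; report mean and max."""
    grouped = defaultdict(list)
    for config, log_lik in zip(config_ids, log_liks):
        grouped[config].append(log_lik)
    return {
        config: {"mean": mean(values), "max": max(values), "n": len(values)}
        for config, values in grouped.items()
    }


# Invented example: A is sampled more often, B contains the single best tree.
config_ids = ["A", "A", "A", "B", "B"]
log_liks = [-105.0, -104.0, -103.0, -101.0, -110.0]

summary = summarize_likelihoods(config_ids, log_liks)
best_by_mean = max(summary, key=lambda c: summary[c]["mean"])
best_by_max = max(summary, key=lambda c: summary[c]["max"])
print(summary)
print("best by mean:", best_by_mean, "| best by max:", best_by_max)
```

In this toy data the more frequent configuration A has the higher mean, while the single highest-likelihood tree belongs to B, so a column of means and a "pick the best tree" rule can point to different configurations.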

I also would like to know from a statistical point of view, why the tree with the highest likelihood is chosen, even if this tree has a configuration which is less probable concerning all configurations under the threshold post.config.cutoff.

Yes, the reason we chose the tree with the highest posterior likelihood is mostly practical. When you present the data to oncologists/clinicians, it’s hard to present them with a bag of trees; a single tree that is the most probable is easier to communicate.
