Question about the C Matrix

David Pais

May 13, 2021, 2:47:49 PM

to canopy_phylogeny

Hello Yuchao Jiang,

I hope you are well and safe.

I am using Canopy in my master's dissertation. Thank you for an amazing tool.

However, I'm not 100% sure of my interpretation of the C matrix. I would appreciate it a lot if you could give me as clear and detailed an answer as possible.

Going back to a question posted here on the group on 16/11/2016, titled "multiple overlap regions in canopy" (it is a short question, in case you need to revisit it):

Like in the picture above, I started by defining my data's CNA regions as in the second (more complete) version. However, when I tried to run Canopy, it would not run with more than one 1 in a column. Why is that? Theoretically, the more complete version is also correct, right?

This is not my main question, though it is related to it; it is more of a personal curiosity.

My main question follows the simple version, which focuses on the overlapping regions (so there is only one 1 per column, that is, per CNA event):
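To make the "one 1 per column" constraint concrete, here is a minimal Python sketch. It assumes C has one row per CNA region and one column per CNA event; the two small matrices and the helper name `columns_violating_single_region` are made up for illustration and are not taken from Canopy or from the pictures.

```python
def columns_violating_single_region(C):
    """Return the indices of columns (CNA events) that do not contain exactly one 1."""
    n_cols = len(C[0])
    return [j for j in range(n_cols) if sum(row[j] for row in C) != 1]

# Hypothetical layout: rows = CNA regions, columns = CNA events.
simple_C = [
    [1, 1, 0, 0],  # region 1 holds events 1 and 2
    [0, 0, 1, 1],  # region 2 holds events 3 and 4
]
complete_C = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],  # event 2 is assigned to a second region
    [0, 0, 1, 1],  # event 3 is assigned to a second region
]

print(columns_violating_single_region(simple_C))    # []
print(columns_violating_single_region(complete_C))  # [1, 2]
```

The second matrix is the kind of "more complete" assignment I mean: it is the one with columns that contain more than one 1.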

How would we define the CNA regions in the following example, which is part of your Supplementary Material (page 9)?

Could it be this way (also in the attachment, for better visualization)?

So, here we have a CNA region that includes these 3 CNA events (one of which is nested in another), and that results from the union of the segments where these CNAs overlap (mainly CNA E1 and E2).

Then, to calculate the copy number, we would take a weighted mean of the copy numbers, based on the percentage that each CNA event contributes to the CNA region, and apply this idea to each sample. Looking at sample 1, for instance:

WM (that is, the cell whose row is this CNA region and whose column is sample 1): the major copy number of this CNA region in sample 1 would be 40% of the CNA E1 major copy number, plus 20% of the first region of CNA E2, plus 20% of CNA E1, and finally 20% of the other region of CNA E2, divided by the number of CNAs (or copy number values used).
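A minimal Python sketch of this weighted mean, to show the arithmetic I have in mind. The copy numbers below are made up, and the helper `weighted_copy_number` is hypothetical, not a Canopy function. Note one assumption: the sketch normalises by the sum of the weights (which is 1 when the percentages add up to 100%), rather than dividing by the number of events.

```python
def weighted_copy_number(segments):
    """segments: list of (fraction_of_region, copy_number) pairs."""
    total_weight = sum(w for w, _ in segments)
    return sum(w * cn for w, cn in segments) / total_weight

# Hypothetical major copy numbers for sample 1 (illustrative values only).
sample1_major = [
    (0.40, 2),  # segment covered by CNA E1
    (0.20, 3),  # first segment of CNA E2
    (0.20, 2),  # segment covered by CNA E1
    (0.20, 3),  # other segment of CNA E2
]

wm = weighted_copy_number(sample1_major)
print(round(wm, 2))  # 2.4
```

The same function would be applied to the minor copy numbers and to every other sample to fill in the corresponding cells.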

In case this is not how you do it, I should say that it would not make sense to me to use the intersections here, as you did in the question I referred to above (with chr1 and CNA1 - CNA4).

In the Supplementary Material example, the intersection idea would look like this (also in the attachment, for better visualization):

But this does not make as much sense to me as the first option, mainly because all the SNVs (or SNAs) located in the zones in red would have a 1 in the "non-cna_region" column. That does not make sense, because such an SNV is affected by a CNA, but that information is lost when I focus only on the overlapping regions and discard the rest of each CNA, as you did in the question from the first image.

So my question is, clearly: how do you calculate the C matrix in this example (Supplementary Material, page 9)? Is it according to my first option, my second option, or some other way?

A final approach would be, of course, to set C = NULL. However, since my data contain overlapping CNAs and quite a few nested CNAs, I think it would be unfortunate to discard these events and not use all the features that Canopy provides.

I look forward to your feedback on this matter.

Thank you so much for your attention and for the time spent reading this message. I also apologize for its length.

Kind regards,

David Pais

Pic2.JPG

Pic3.JPG
