Dear Vince,
Thank you very much for your encouraging feedback! You raise a very good question: generating a clean set of input for Canopy is non-trivial. While this is not the focus of our paper, input with a high false discovery rate will only lead to garbage in, garbage out from Canopy. We are currently working on automating the pipeline for both CNAs and SNAs, as well as offering guidance on selecting informative CNAs. By informative, we mean that the SNAs or CNAs show distinct patterns across different samples (from the same patient, since we are looking at intratumor heterogeneity). For SNAs, this means that the observed VAFs differ between samples (see Figure 4B in our paper), and in this case a heatmap is a good way to visualize them. For CNAs, this means that WM and Wm differ between samples (see Supplementary Figure S13 in our paper), and IGV is a good tool for visualization.
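To make "informative" concrete for SNAs, here is a minimal sketch in Python. It is not part of Canopy or its pipeline. It assumes you have a matrix R of mutant-allele read counts and a matrix X of total read counts (rows are SNAs, columns are samples), and it keeps the SNAs whose VAF varies across samples by at least a cutoff. The function names and the `min_spread` cutoff of 0.1 are hypothetical and chosen only for illustration.

```python
def vaf_matrix(R, X):
    """Per-SNA, per-sample VAF = mutant reads / total reads (0.0 where depth is 0)."""
    return [
        [r / x if x > 0 else 0.0 for r, x in zip(r_row, x_row)]
        for r_row, x_row in zip(R, X)
    ]


def informative_snas(R, X, min_spread=0.1):
    """Return row indices of SNAs whose VAF range across samples is >= min_spread.

    min_spread is an illustrative cutoff, not a value recommended by Canopy.
    """
    vaf = vaf_matrix(R, X)
    return [i for i, row in enumerate(vaf) if max(row) - min(row) >= min_spread]


# Toy data: 3 SNAs (rows) x 3 samples (columns) from the same patient.
R = [[10, 40, 5], [20, 21, 19], [0, 30, 60]]
X = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]

selected = informative_snas(R, X)
print(selected)
```

In this toy example the second SNA has nearly identical VAFs in all three samples, so it is dropped. The rows that remain are the ones worth plotting in a heatmap.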
That said, you don’t need all three types in the mutation input. If you don’t have CNAs, you can feed Canopy SNAs only. Similarly, you can use CNAs only, or a combination of these.
WM and Wm (the major and minor copy numbers) are both for CNAs. The Y matrix specifies whether an SNA lies within a CNA. Canopy doesn’t need major and minor copy numbers for each SNA (which PhyloWGS requires). For SNAs that are CNA-free, it is possible that the copy number ratio differs from 1:1, but this might very well be due to false calls by the CNA calling software, and Canopy doesn’t aim to adjust for upstream calls.
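As a rough illustration of what the Y matrix encodes, here is a small Python sketch that marks which SNAs fall inside which CNA regions. The tuple formats and the function name are assumptions made for this example, and the exact layout Canopy expects for Y (column order, any extra columns) should be checked against the package documentation.

```python
def overlap_matrix(snas, cnas):
    """Build a 0/1 matrix: entry [i][j] is 1 if SNA i lies inside CNA region j.

    snas: list of (chrom, pos) tuples           (hypothetical input format)
    cnas: list of (chrom, start, end) tuples    (hypothetical input format, inclusive ends)
    """
    return [
        [1 if (chrom == c_chrom and start <= pos <= end) else 0
         for (c_chrom, start, end) in cnas]
        for (chrom, pos) in snas
    ]


# Toy data: three SNAs and two CNA regions.
snas = [("chr1", 150), ("chr1", 900), ("chr2", 150)]
cnas = [("chr1", 100, 200), ("chr2", 100, 500)]

Y = overlap_matrix(snas, cnas)

# SNAs that overlap no CNA region are the "CNA-free" ones.
cna_free = [i for i, row in enumerate(Y) if sum(row) == 0]
print(Y, cna_free)
```

Here the second SNA overlaps neither region, so it is treated as CNA-free.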
I hope this clarifies things. I am cc’ing my advisor, Nancy, here. Nancy, please feel free to add anything.
Cheers,
Yuchao