Re: running MARATHON and Canopy, starting from VCF and BAM files

Jiang, Yuchao

Jan 4, 2018, 9:38:07 PM

to Tödling, Jörn, canopy_phylogeny, Urrutia, Eugene, Jiang, Yuchao

Hi Joern,

Thanks for your interest in MARATHON and Canopy. The short answer to your question: instructions for generating the input for Canopy are here: https://github.com/yuchaojiang/Canopy/blob/master/instruction/SNA_CNA_input.md

For point mutations, you can generate the input from the somatic VCF file; VCF is the output format of most callers. If you are unsure how to extract the R and X matrices from it, it is worth getting familiar with the VCF format first. For allele-specific copy number, you will need to call germline heterozygous loci. These will again be in VCF format; your germline VCF will do if it also contains the tumor entries. Then read in the VCF file and follow the FALCON or FALCON-X pipeline to generate the copy number input. The R notebook below has instructions for both FALCON and FALCON-X. In both cases, stringent QC and selection are needed after generating the input.
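As a rough illustration of the point-mutation step, here is a minimal Python sketch that builds two count matrices from a somatic VCF: R (reads carrying the mutant allele) and X (total read depth), one row per mutation and one column per sample. It is not part of MARATHON or Canopy. It assumes the caller writes an AD (allelic depth) entry of the form "ref,alt" for every sample, that multi-allelic sites can simply be skipped, and that a missing AD value can be counted as zero reads. The function name read_counts_from_vcf, the toy VCF and the sample names section1 and section2 are made up for the example.

```python
import io


def read_counts_from_vcf(handle, ad_key="AD"):
    """Collect mutant-read (R) and total-depth (X) counts from a VCF stream.

    Assumption: every sample column has an AD entry written as "ref,alt".
    Multi-allelic records and records without AD are skipped.
    """
    samples, ids, R, X = [], [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information header
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _, ref, alt = fields[:5]
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:AD
        if "," in alt or ad_key not in keys:
            continue
        ad_idx = keys.index(ad_key)
        r_row, x_row = [], []
        for column in fields[9:]:
            values = column.split(":")
            try:
                ref_n, alt_n = (int(v) for v in values[ad_idx].split(","))
            except (IndexError, ValueError):
                ref_n, alt_n = 0, 0  # assumption: missing AD counts as no reads
            r_row.append(alt_n)
            x_row.append(ref_n + alt_n)
        ids.append(f"{chrom}:{pos}_{ref}>{alt}")
        R.append(r_row)
        X.append(x_row)
    return ids, samples, R, X


# Toy input with two hypothetical tumor sections and one multi-allelic site.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "section1", "section2"]
RECORDS = [
    ["chr1", "1000", ".", "A", "T", ".", "PASS", ".", "GT:AD", "0/1:30,10", "0/1:25,15"],
    ["chr2", "2000", ".", "G", "C", ".", "PASS", ".", "GT:AD", "0/1:40,5", "0/0:50,0"],
    ["chr3", "3000", ".", "C", "A,G", ".", "PASS", ".", "GT:AD", "0/1:20,5,5", "0/1:20,4,6"],
]
TOY_VCF = "\n".join(
    ["##fileformat=VCFv4.2", "\t".join(HEADER)] + ["\t".join(r) for r in RECORDS]
) + "\n"

ids, samples, R, X = read_counts_from_vcf(io.StringIO(TOY_VCF))
for mutation, r_row, x_row in zip(ids, R, X):
    print(mutation, dict(zip(samples, r_row)), dict(zip(samples, x_row)))
```

If the somatic VCF also contains a matched-normal column, that column would need to be dropped before passing the matrices on, and X here is the sum of the two AD values rather than the DP field, which some callers compute differently.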

We’ve recently updated MARATHON significantly and written an R notebook, which you can find here: https://rawgit.com/yuchaojiang/MARATHON/master/notebook/MARATHON.html

Hope this is helpful.

Yuchao

On Jan 4, 2018, at 10:32 AM, Tödling, Jörn <joern.t...@charite.de> wrote:

Dear Yuchao Jiang,

I have come across your MARATHON pipeline and think that this will be a very useful tool for studying tumor evolution in our setting. We have exome sequencing data from several sections of the same tumors and want to check them for similar subclones. The Canopy tool looks very promising and we would like to use it for our analysis.

However, I am struggling to get all the numbers that I need for running Canopy out of my files.

I have followed the MARATHON documentation as far as I could. However, I am still missing some steps.

I have the BAM files, the somatic SNV calls in VCF format, and the germline calls in VCF format. I also have separate copy-number variation calls in BED format, if they are of any use.

I am sure that packages like VariantAnnotation and other Bioconductor packages can be used to extract the numbers that I need for Canopy, but this seems a bit tricky and seems not to be explained in the MARATHON documentation.

It is clear that MARATHON is still in development, but I think it would be very useful if the documentation contained a complete description, or even a script, showing how to obtain the required numbers for Canopy when only these files are available.

So any hints or suggestions on how to obtain the required numbers from the VCF files would be very useful.

Thank you very much in advance,

Best regards,
Joern Toedling

--
Joern Toedling, PhD

Charité - Universitätsmedizin Berlin
CC17, AG Schulte
CVK, Forum 4, Raum 0.0207
Augustenburger Platz 1, D-13353 Berlin
Tel: +49 30 450 616198
