Re: Running CODEX2 Somatic WES

Jiang, Yuchao

unread,

Jul 10, 2018, 7:23:15 PM7/10/18

to Wubbenhorst, Bradley, Gene Urrutia, canopy_phylogeny

Hmmm I think this varies from data to data. I personally filter out non-exonic het loci if it is WES, although if the off-target loci have high coverage, I think you can retain them. For large CNA events, I don’t think this will make much a difference.

Yuchao

On Jul 10, 2018, at 2:00 PM, Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu> wrote:

Hey Yuchao,

For FALCON-X WES het input variants is it proper to restrict to the exon targets? Im not sure how tolerant the method is to False Positive input. I could put the cohort through a robust variant filtering or just take the raw variants that pass minimum call quality.

Quality over Quantity? Thanks.

-Brad

From: Jiang, Yuchao <yuc...@email.unc.edu>
Sent: Monday, July 9, 2018 2:35 PM
To: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Cc: Gene Urrutia <gene.u...@gmail.com>
Subject: Re: Running CODEX2 Somatic WES

The reason I combine all chromosomes is because N can only be unbiased estimated on the genome wide scale (especially for cancer with abrupt copy number changes) and its a bit confusing to go back and forth between genome wide and chr specific. I don’t think getting coverage for the entire genome all at once is too slow although I haven’t checked too many WES datasets. Can you let me know what your experiences are? And yes please feel free to fork and play around.

Yuchao

Sent from my iPhone

On Jul 9, 2018, at 2:00 PM, Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu> wrote:

Ah thank you.

What was the reasoning for removing the chr argument? I found value in being able generate coverage values for all chromosomes for all samples in parallel (and writing to file to reload). Big time saver. But if this caused issues, maybe Ill try and fork CODEX2 and play around with other parallel options.

-Brad

From: Jiang, Yuchao <yuc...@email.unc.edu>
Sent: Monday, July 9, 2018 1:47 PM
To: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Cc: Gene Urrutia <gene.u...@gmail.com>
Subject: Re: Running CODEX2 Somatic WES

Can you update from GitHub? The Chr argument is removed in the most recent version

Sent from my iPhone

On Jul 9, 2018, at 1:45 PM, Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu> wrote:

Thanks!

The documentation for getcoverage still shows it requiring a chr argument. Is this truly the case?

-Brad

From: Jiang, Yuchao <yuc...@email.unc.edu>
Sent: Monday, July 9, 2018 1:21 PM
To: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Cc: Gene Urrutia <gene.u...@gmail.com>
Subject: Re: Running CODEX2 Somatic WES

Hi Brad,

I made some updates for CODEX2 last week. As of now, for whole-exome sequencing, you should get coverage, perform qc, calculate N from the entire genome. For normalization, you can do that in a genome-wide fashion on iterate through each chromosome (the latter is faster). For segmentation, it should be ran on each chromosome separately. The most updated demo is here http://htmlpreview.github.io/?https://github.com/yuchaojiang/CODEX2/blob/master/demo/CODEX2.html

Above is for whole-exome sequencing, if you have targeted sequencing, everything is on the genome-wide scale except for segmentation, which is carried out within each gene separately.

Hope this clarifies,

Yuchao

On Jul 9, 2018, at 12:23 PM, Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu> wrote:

Greetings,

I am still making progress. I am back at the CODEX2 portion, working to break up the work as much as possible. At what point must I stop working individual chromosomes? I split up calculating coverage and running the qc, but trying to put together the individual qc results raises some flags, so perhaps Im doing things wrong.

At first I figured I could just concatenate the Y_qc matrices, but its possible that sampleA could be removed for chr2 but not chr1 and I don’t know what to do about that so that so that it runs properly (If I executed qc() on all chr data, what would have happened?).

Can I continue to run the rest of CODEX2 on individual chromosomes? It appears like I could calculate N, normalize, and make calls and segments a chromosome at a time? Or is it the opposite, can I only do the depth calculations and I need to qc and everything after using all chromosome data together? Thank you!

-Brad

From: Gene Urrutia <gene.u...@gmail.com>
Sent: Monday, July 2, 2018 9:14 AM
To: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Cc: Jiang, Yuchao <yuc...@email.unc.edu>
Subject: Re: Running CODEX2 Somatic WES

Hi Brad, Glad you are having success with MARATHON.

Falcon-X needs across sample normalization and the input is allele-specific read counts at germline heterozygous loci, called across all samples. Therefore, the vcf should have the same entries (SNPs) for all samples.

HaplotypeCaller is fine since we are calling germline heterozygous loci to begin with.

Thanks, and let me know if you have any other questions,
Gene

On Thu, Jun 28, 2018 at 1:21 PM, Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu> wrote:

Hey Yuchao,

I got codex2 working and I am moving right into MARATHON (I can send you my pipeline code when I am finished if you would like).

Would you be able to answer some usage questions about the Falcon-x portions? The example data you present is very straight forward, but I am trying to expand them to fit my current data.

Ok so IF I take the snp locations from the paired vcf files, the sites called will not by uniform across all samples when I try an put them together. Now I can make a multisample vcf for all tumors and normals but the proper thing to do in that case would be to assign no calls, ./., to missing positions for each sample. And Im not sure how any of your tools and functions would handle presumably NAs. So then the other option would be to run HaplotypeCaller jointly on all of the tumors and normal together (or some variation where I try to generate a master list of postions/snps and genotype each sample individually). This appeared to be implied in the example. I do not consider HaplotypeCaller to be a very good capturer or somatic mutations, but it would probably do the trick and at least all sites would have a genotype. And perhaps the falcon-x input does not have to be very somatic focused?

Thanks.

-Brad

From: Jiang, Yuchao <yuc...@email.unc.edu>
Sent: Tuesday, May 8, 2018 9:58 AM
To: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Cc: Nathanson MD, Katherine L <knat...@upenn.edu>; gene.urrutia (gene.u...@gmail.com) <gene.u...@gmail.com>
Subject: RE: Running CODEX2 Somatic WES

Hi Brad,

Hope all is well in Philly! For your experimental design, CODEX2 is not an ideal method because you only have 2 normal samples and the highest K will be 2. Also the total number of samples might not be sufficient either (usually >20). We have recently published a pipeline MARATHONhttps://github.com/yuchaojiang/MARATHON/, which tells user when to use which pipeline under different designs. I am cc’ing Gene here who can further assist you if you have questions about MARATHON. For your case, since CODEX2 cannot be applied, you probably can retain copy number signals by comparing tumor to normal after adjusting for sequencing depth.

For CODEX2, I will have a final release soon after the paper is accepted. Currently, for depth of coverage, it is computed chr by chr and for normalization, you can either perform it genome-wide or chr by chr.

Yuchao

From: Wubbenhorst, Bradley <bw...@pennmedicine.upenn.edu>
Sent: Friday, May 4, 2018 10:30 AM
To: Jiang, Yuchao <yuc...@email.unc.edu>
Subject: Running CODEX2 Somatic WES

Hey Yuchao! I hope you and your family have been doing well!

Im getting my hands back into CODEX2 work, trying to shape it into my somatic pipeline. I cant find any of my old Tumor-Normal Codex code so I just have a few questions on the new stuff.

With a Tumor-Normal WES situation, do I process all of the data together or in pairs? In one project I have 16 Tumors, but they are matched to one of two normal ( I have 16 tumor bams and 2 normal bams); so Im not sure if things will work.

And then I run the entire codex pipeline for each chromosome correct? Im just reading through all of the code and comments and see that the coverage needs to be calculated for each chromosome, but then I wanted to double check that Im not supposed to be like building an array of coverage objects or something. Thanks.

-Brad

Wubbenhorst, Bradley

unread,

Jul 19, 2018, 9:08:46 AM7/19/18

to Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

Hey Yuchao,

Could you explain to me a bit more about the usage of CODEX2 with negative control regions?

“empirically identify common CNV regions by a first-pass CODEX run.” Does this mean that I can always determine negative control regions in WES studies? Can I use this for tumor/normal WES too somehow or is this for germline studies?

Your example shows indices, but how can I determine them from my data? Or is this meant to be only for certain study designs?

Jiang, Yuchao

unread,

Jul 19, 2018, 9:58:51 AM7/19/18

to Wubbenhorst, Bradley, Gene Urrutia, canopy_phylogeny

Brad - I have a few deadlines this week. Will get back to you next week.

Wubbenhorst, Bradley

unread,

Jul 20, 2018, 1:37:29 PM7/20/18

to Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

Thanks Yuchao!

To ask one more question too, I just successfully did some codex2 targeted. I followed your examples near exactly, but since they were germline should my segmentation mode still have been ‘fraction’? I left it as such, but I know that normally I would switch to ‘integer’.

Wubbenhorst, Bradley

unread,

Jul 22, 2018, 12:23:36 PM7/22/18

to Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

Sorry, Im starting to ask you frivolous questions. I figured out everything with the fractional and integer. Segmentation is something I should know very well. What I do not yet grasp is how should I interpret the mBIC value in the results. My optimal K was 1 and Ive read that the AIC and BIC values should be informative, but what their values mean is not something I currently understand.

Is high or low mBIC bad?

Thanks. No Rush, I know you have deadlines.

-Brad

Jiang, Yuchao

unread,

Jul 24, 2018, 10:46:27 AM7/24/18

to Wubbenhorst, Bradley, Gene Urrutia, canopy_phylogeny

Hi Brad,

The indices for common CNVs can be obtained from either previous knowledge/database (e.g., BRAF amplification) or from a first-pass of CODEX2 run. The latter depends on a user-defined threshold on the s.d. of the normalized log ratio and can thus vary between studies and datasets.

# Empirically identify common CNV regions from the data.

z.codex2 <- log(Y_qc[chr.index,]/normObj.null$Yhat[[which.max(normObj.null$BIC)]])

plot(1:nrow(z.codex2), apply(z.codex2, 1, sd))

which(apply(z.codex2,1,sd)>=0.25)

cnv_index1 <- which(apply(z.codex2,1,sd)>=0.25)

# User-provided indices for common CNVs.

cnv_index2 <- 1580:1620

all(cnv_index1 == cnv_index2)

In the example above, cnv_index1 is inferred empirically, while cnv_index2 is provided from external info. I would recommend only keeping the consecutive exons in cnv_index1 (i.e., if it contains 23, 24, 25, 45, 57,58, remove 45).

For tumor-normal WES design, you DON’T need this. The normal samples serve as baseline for common CNV estimation.

Hope this clarifies,

Yuchao

Yuchao Jiang

unread,

Jul 24, 2018, 10:48:45 AM7/24/18

to Wubbenhorst, Bradley, Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

mBIC is part of the final output of CODEX/CODEX2, which tells the SEGMENTATION where to stop. You should not cross compare mBIC values across different chromosomes as each chromosome is analyzed separately. mBIC itself is the likelihood ratio minus a penalized term for model complexity.

AIC and BIC are used to determine the optimal number of latent factors during NORMALIZATION. It tells what is the optimal number of latent factors.

--
You received this message because you are subscribed to the Google Groups "canopy_phylogeny" group.
To unsubscribe from this group and stop receiving emails from it, send an email to canopy_phyloge...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/canopy_phylogeny/BN1PR04MB375F6C28872343764E9874A9F570%40BN1PR04MB375.namprd04.prod.outlook.com.
For more options, visit https://groups.google.com/d/optout.

Wubbenhorst, Bradley

unread,

Jul 25, 2018, 8:09:01 AM7/25/18

to Yuchao Jiang, Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

Thank you. What Im still not clear on is under what circumstances would I use AIC over BIC? And for a given segment, is the mBIC value informative about anything? Is a higher value better?

-Brad

Yuchao Jiang

unread,

Jul 25, 2018, 8:46:22 AM7/25/18

to Wubbenhorst, Bradley, Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

AIC and BIC are ONLY used for determining the number of latent factors during NORMALIZATION. Given a segment, you should look at the log likelihood ratio and post filter on this.

Wubbenhorst, Bradley

unread,

Jul 30, 2018, 10:05:45 AM7/30/18

to Yuchao Jiang, Jiang, Yuchao, Gene Urrutia, canopy_phylogeny

Hi Yuchao,

Can you provide me with a sanity check? Attached is a choiceofK plot, this was a tumore_normal study. All of the plots look like this, all optK(’s) were determined to be 1. Im not as familiar looking at plots of BIC, but it does not look correct. Thank you.

tumor_normal_1_ns_choiceofK.pdf

Yuchao Jiang

unread,

Aug 3, 2018, 2:19:46 PM8/3/18

to Wubbenhorst, Bradley, Gene Urrutia, canopy_phylogeny

Hi Brad,

In this case, I am not sure either. Do you have positive or negative controls for you to navigate? For example, you can try different K and see if BRAF amplification is detected with an expected population frequency.