Re: CODEX pipeline fails with Error: subscript contains NAs

Jiang, Yuchao

Oct 28, 2015, 9:45:10 AM

to Jagiella, Nick, CODEX: COpy number Detection by EXome sequencing

Hi Nick,

Thanks for your interest in CODEX. I replied to your post on Bioconductor, but I am copying my reply here as well.

Cheers,

Yuchao

-------------------------

Hi Nick,

Thanks for your interest! This code has been used quite extensively, so there shouldn't be any major systematic bugs. I wonder whether the error is due to an initialization inconsistency in your code. Also, have you sorted your BED file?

Can you use save.image(file='CODEX_debug.rda') after the mapp=getmapp() step and send the file to me? I will look into it. My email is yuc...@wharton.upenn.edu. If it is too large, you can share that with me via Dropbox: yj...@cornell.edu.
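On the BED sorting question, here is a minimal base-R sketch, assuming a tab-separated BED file with chromosome, start, and end in the first three columns. The helper sort_bed and the file names are placeholders for illustration and are not part of CODEX; this thread does not state which chromosome order CODEX expects, so the numeric-then-alphabetic order used here is an assumption.

```r
# Hypothetical helper (not a CODEX function): sort BED-like intervals
# by chromosome, then start, then end.
sort_bed <- function(bed) {
  stopifnot(ncol(bed) >= 3)
  # Strip an optional "chr" prefix so that chr2 sorts before chr10.
  chrom_key <- sub("^chr", "", as.character(bed[[1]]))
  chrom_num <- suppressWarnings(as.integer(chrom_key))
  # Non-numeric chromosomes (X, Y, M, ...) go after the numbered ones.
  ord <- order(is.na(chrom_num), chrom_num, chrom_key, bed[[2]], bed[[3]])
  bed[ord, , drop = FALSE]
}

# Toy intervals, deliberately out of order.
bed <- data.frame(
  chrom = c("chr9", "chr10", "chr9", "chrX", "chr2"),
  start = c(5456100L, 100L, 5450500L, 50L, 700L),
  end   = c(5456200L, 200L, 5450600L, 150L, 800L),
  stringsAsFactors = FALSE
)
sorted_bed <- sort_bed(bed)
print(sorted_bed)

# For a real file (the paths here are placeholders):
# bed <- read.table("my_targets.bed", sep = "\t", stringsAsFactors = FALSE)
# write.table(sort_bed(bed), "my_targets.sorted.bed", sep = "\t",
#             quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Within a single chromosome, which is how CODEX is run, the effect is simply that exons end up ordered by start position.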

Cheers,

Yuchao

---------------------------

On Oct 28, 2015, at 4:33 AM, Jagiella, Nick <njag...@definiens.com> wrote:

Dear Yuchao Jiang,

I would like to use the CODEX R package for whole-exome sequencing data analysis. The installation and the toy data example worked flawlessly, but applying the CODEX pipeline to my own BAM and BED files always ends with an error that I can't figure out how to solve.

I already submitted a formal request to the Bioconductor support website:

https://support.bioconductor.org/p/73839/

The error seems to occur during the quality control step:

qcObj <- qc(Y, sampname, chr, ref, mapp, gc, cov_thresh = c(20, 4000), length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20, 80))

Excluded NA exons due to extreme coverage.
Excluded 0 exons due to extreme exonic length.
Excluded 0 exons due to extreme mappability.
Excluded 0 exons due to extreme GC content.
After taking union, excluded NA out of 8 exons in QC.
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript contains NAs

I would be very grateful for any ideas about what could have caused the problem. It occurs in RStudio (R version 3.2) as well as in plain R (version 3.3), with either CODEX version 1.2 or 1.3.

Yours sincerely,

Nick Jagiella


Jiang, Yuchao

Oct 28, 2015, 10:45:02 PM

to Jagiella, Nick, codex_...@googlegroups.com

Hi Nick,

CODEX is designed for whole-exome sequencing. You need to process an entire chromosome at once, unless your data come from targeted sequencing (although your sequencing depth isn't that high for targeted sequencing). If it is targeted sequencing, you can use an adapted version of CODEX:

CODEX for targeted sequencing:

We've adapted CODEX for targeted sequencing. Refer to the attached code (you need to source segment_targeted.R for gene-based segmentation):

codex_targeted.R

segment_targeted.R

The error you saw occurs because CODEX filters out samples with < 2000 total reads per chromosome (for whole-exome sequencing). Your samples have 383, 227, and 365 reads, respectively, so CODEX treats them as capture failures and removes them. That's why you see the NA values and, in turn, the error.

Also, do you only have 3 samples? CODEX adopts a normalization procedure that estimates the GC content bias, the exon amplification and targeting efficiency, and latent biases and artifacts across all samples. Three samples aren’t enough to estimate these biases. Normally we recommend at least ~20 samples as input for CODEX.
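The two checks described above can be run before calling qc(). This is a minimal base-R sketch on a made-up read-depth matrix whose column totals match the counts quoted in the reply; the 2000-read threshold and the ~20-sample recommendation come from the reply, while the variable names and the exon-by-sample layout of Y are assumptions. The last step only illustrates how a summary over an empty set of samples becomes NA in R; it is not a trace of the CODEX internals.

```r
# Toy read-depth matrix: 8 exons (rows) x 3 samples (columns).
# Column totals are chosen to match the counts quoted above (383, 227, 365).
Y <- cbind(
  s1 = c(50, 40, 60, 45, 55, 48, 42, 43),
  s2 = c(30, 25, 35, 28, 27, 30, 26, 26),
  s3 = c(45, 50, 40, 48, 47, 45, 46, 44)
)

min_reads   <- 2000  # per-chromosome read threshold quoted in the reply
min_samples <- 20    # rough recommended minimum number of samples

# Total reads per sample on this chromosome.
total_reads <- colSums(Y)
keep <- total_reads >= min_reads

print(total_reads)
cat("Samples passing the read threshold:", sum(keep), "of", ncol(Y), "\n")

if (sum(keep) == 0) {
  message("No sample passes the threshold; downstream statistics would be NA.")
}
if (ncol(Y) < min_samples) {
  message("Only ", ncol(Y), " samples; at least ~", min_samples, " are recommended.")
}

# A per-exon summary over an empty set of samples is NA in R.
median_first_exon <- median(Y[1, keep])
print(median_first_exon)
```

If every sample fails the read threshold, as here, the fix is on the input side (whole-chromosome data, or the targeted-sequencing scripts), not in the qc() thresholds.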

Cheers,

Yuchao

From: Jagiella, Nick [mailto:njag...@definiens.com]
Sent: Wednesday, October 28, 2015 12:58 PM
To: Jiang, Yuchao <yuc...@wharton.upenn.edu>
Subject: RE: CODEX pipeline fails with Error: subscript contains NAs

Hi Yuchao,

Thank you for your fast answer! Attached you will find the CODEX_debug.rda file, which I produced following your instructions, along with the BED file in case it helps.

I didn’t sort the BED file. It is just a subset of the exons found in the following region:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr9%3A5450503-5470567

One difference I noticed between the WES example and my run was that in the example the chromosome was specified as

chr <- 22

while I needed to use

chr <- "chr9"

to make it run.
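The difference between 22 and "chr9" presumably reflects the chromosome naming used in the input files (with or without the "chr" prefix). A minimal base-R sketch of picking the matching style follows; match_chr_style is a hypothetical helper, not a CODEX function, and the example chromosome vectors are made up.

```r
# Hypothetical helper (not part of CODEX): format a chromosome label so it
# matches the naming style used in a set of reference names, e.g. the first
# column of a BED file.
match_chr_style <- function(chr, ref_names) {
  bare <- sub("^chr", "", as.character(chr))
  uses_prefix <- any(grepl("^chr", ref_names))
  if (uses_prefix) paste0("chr", bare) else bare
}

bed_chroms_prefixed <- c("chr9", "chr9", "chr22")  # names with "chr" prefix
bed_chroms_bare     <- c("9", "9", "22")           # prefix-free names

chr_a <- match_chr_style(9, bed_chroms_prefixed)
chr_b <- match_chr_style("chr22", bed_chroms_bare)
print(chr_a)
print(chr_b)
```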

If you have any idea where the issue could be, I would be very grateful!

Cheers,

Nick

lala motlhabi

Jan 23, 2016, 7:49:12 AM

to CODEX: COpy number Detection by EXome sequencing, njag...@definiens.com, yuc...@wharton.upenn.edu

Dear Yuchao Jiang,

I'm also encountering an error similar to Nick's, and I was wondering whether you were able to resolve the issue?

Kindly,

Lala

Jiang, Yuchao

Jan 24, 2016, 11:14:45 AM

to lala motlhabi, CODEX: COpy number Detection by EXome sequencing, njag...@definiens.com

Hi Lala,

Please see my attached reply to Nick below. This should be able to solve your problem. Basically, the sample QC filtered out all of your samples.

Yuchao

************************************************************************

Hi Nick,

CODEX is designed for whole-exome sequencing. You need to process an entire chromosome at once, unless your data come from targeted sequencing (although your sequencing depth isn't that high for targeted sequencing). If it is targeted sequencing, you can use an adapted version of CODEX:

CODEX for targeted sequencing:

We've adapted CODEX for targeted sequencing. Refer to the attached code (you need to source segment_targeted.R for gene-based segmentation):

codex_targeted.R

segment_targeted.R

The error you saw occurs because CODEX filters out samples with < 2000 total reads per chromosome (for whole-exome sequencing). Your samples have 383, 227, and 365 reads, respectively, so CODEX treats them as capture failures and removes them. That's why you see the NA values and, in turn, the error.

Also, do you only have 3 samples? CODEX adopts a normalization procedure that estimates the GC content bias, the exon amplification and targeting efficiency, and latent biases and artifacts across all samples. Three samples aren’t enough to estimate these biases. Normally we recommend at least ~20 samples as input for CODEX.