TCGA CNV file input

姚英豪

Jun 30, 2018, 4:25:47 AM

to ParseCNV

Dear Joseph Glessner:

Hello,

I downloaded Masked Copy Number Segment files from the TCGA website and am trying to use ParseCNV for downstream analysis.

However, I have no idea how to convert these files to PennCNV format. TCGA provides a segment mean value for each CNV; what is the relationship between the segment mean value and the CNState value?

Moreover, TCGA does not give SNP information for a given CNV, which seems to be necessary for PennCNV format.

So, could you please give me some suggestions or scripts for converting the TCGA CNV files to PennCNV format?

Thank you very much!

Kind regards.

Yinghao

Joe Glessner

Jul 2, 2018, 11:10:05 AM

to 姚英豪, ParseCNV

Hi Yinghao,

Assuming your TCGA CNV files look like this:

Sample Chromosome Start End Num_Probes Segment_Mean
TCGA-HC-8260-11A-01D-2259-01 1 61735 8176626 4058 -0.0157
TCGA-HC-8260-11A-01D-2259-01 1 8182588 8189285 9 -0.9492
TCGA-HC-8260-11A-01D-2259-01 1 8191792 17030245 4730 0.0015
TCGA-HC-8260-11A-01D-2259-01 1 17035220 17114712 33 -0.307
TCGA-HC-8260-11A-01D-2259-01 1 17177045 17260618 58 0.3641

awk '{ORS="";print "chr"$2":"$3"-"$4" numsnp="$5" length="($4-$3+1)" ";if($6<-.02){print "state2,cn=1"}else if($6>.02){print "state5,cn=3"}else{print "state3,cn=2"}print " "$1" startsnp="$2"_"$3" endsnp="$2"_"$4" conf=NA\n"}' TCGA.txt | sed '1d' > TCGA.rawcnv

You could plot a histogram of Segment_Mean to see whether some value below -.02 has a high enough density to define "state1,cn=0", or some value above .02 to define "state6,cn=4", if you are interested in differentiating those states.
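A quick text histogram is often enough to eyeball the distribution before plotting it properly. The following is a minimal sketch, assuming the six-column layout shown above with one header line; the function name `segmean_hist` and the default bin width of 0.1 are illustrative choices, not part of ParseCNV or PennCNV.

```shell
# segmean_hist FILE [BINWIDTH]
# Prints "lower_bin_edge<TAB>count" for column 6 (Segment_Mean), skipping the header.
segmean_hist() {
  awk -v w="${2:-0.1}" 'NR > 1 {
      v = $6 / w
      b = int(v); if (b > v) b--      # floor: int() truncates toward zero
      n[b]++
      if (!seen || b < lo) lo = b
      if (!seen || b > hi) hi = b
      seen = 1
  }
  END {
      # Print every bin between the smallest and largest, including empty ones
      if (seen) for (b = lo; b <= hi; b++) printf "%.2f\t%d\n", b * w, n[b] + 0
  }' "$1"
}

# Example call (file name as used elsewhere in this thread):
# segmean_hist TCGA.txt 0.05
```

Empty bins are printed with a count of 0 so that gaps between peaks stay visible.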

Then you can just provide a blank text file as the .map input to ParseCNV and the .map will be automatically generated based on the observed Start and End positions.

Regards,

Joseph

--
You received this message because you are subscribed to the Google Groups "ParseCNV" group.
To unsubscribe from this group and stop receiving emails from it, send an email to parsecnv+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

姚英豪

Jul 3, 2018, 10:14:50 AM

to ParseCNV

Hi Joseph,

Thank you for your reply.

According to the ParseCNV documentation, the CNState values state1,cn=0, state2,cn=1, state5,cn=3, and state6,cn=4 refer to homozygous deletion, hemizygous deletion, hemizygous duplication, and homozygous duplication, respectively. Your awk command only has state2,cn=1, state3,cn=2, and state5,cn=3; what about the others? In addition, the segment mean values in the TCGA files equal log2(copy-number/2); thus, state2,cn=1 (hemizygous deletion) should correspond to -1, not -0.02.
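For reference, under the log2(copy-number/2) relationship stated above, the ideal Segment_Mean for each integer copy number can be computed directly. This is a minimal sketch; `cn_to_segmean` is a hypothetical helper name, and the values assume a sample in which every cell carries the same copy number.

```shell
# cn_to_segmean CN
# Ideal Segment_Mean for an integer copy number, assuming Segment_Mean = log2(CN/2).
cn_to_segmean() {
  awk -v cn="$1" 'BEGIN {
      if (cn <= 0) print "-inf"                  # log2(0) is undefined
      else printf "%.3f\n", log(cn / 2) / log(2) # awk log() is natural log
  }'
}

for cn in 0 1 2 3 4; do
  printf 'cn=%s segment_mean=%s\n' "$cn" "$(cn_to_segmean "$cn")"
done
```

With this formula cn=1 gives -1.000, cn=3 gives 0.585, and cn=4 gives 1.000, which is the basis for expecting a hemizygous deletion near -1.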

So, please tell me if I have misunderstood something, or give a more detailed explanation.

Thank you very much!

regards

Yinghao
On Monday, July 2, 2018 at 11:10:05 PM UTC+8, Joe Glessner wrote:


Joe Glessner

Jul 3, 2018, 4:06:20 PM

to 姚英豪, ParseCNV

Hi Yinghao,

There is a distribution of Segment_Mean values if you plot them out in a histogram.

Here I plot all TCGA data with Project=LIHC.

Sorry, the cut-offs should be -0.2 and 0.2, not -.02 and .02.

state1,cn=0 and state6,cn=4 do not have evident cut-offs based on peaks since they are relatively rare.
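With the corrected cut-offs, the conversion could look like the sketch below. It assumes the six-column TCGA layout shown earlier in the thread with one header line. The function name `tcga_to_rawcnv` and its optional second and third arguments (cut-offs for cn=0 and cn=4) are hypothetical; they are left unset by default because no evident cut-offs exist for those states. Length is computed as End - Start + 1.

```shell
# tcga_to_rawcnv FILE [HOMDEL_CUTOFF] [HOMDUP_CUTOFF]
# Writes PennCNV-style rawcnv lines to stdout.
# -0.2 / 0.2 are the cut-offs from this thread; the two optional
# cut-offs are illustrative only and disabled unless supplied.
tcga_to_rawcnv() {
  awk -v del=-0.2 -v dup=0.2 -v homdel="$2" -v homdup="$3" 'NR > 1 {
      if      (homdel != "" && $6 < homdel + 0) state = "state1,cn=0"
      else if ($6 < del)                        state = "state2,cn=1"
      else if (homdup != "" && $6 > homdup + 0) state = "state6,cn=4"
      else if ($6 > dup)                        state = "state5,cn=3"
      else                                      state = "state3,cn=2"
      printf "chr%s:%d-%d numsnp=%d length=%d %s %s startsnp=%s_%d endsnp=%s_%d conf=NA\n",
             $2, $3, $4, $5, $4 - $3 + 1, state, $1, $2, $3, $2, $4
  }' "$1"
}

# Example call (file names as used elsewhere in this thread):
# tcga_to_rawcnv TCGA.txt > TCGA.rawcnv
```

Keeping the cut-offs as awk variables makes it easy to rerun the conversion after inspecting the histogram for a different TCGA project.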

Not zoomed in: [histogram image not included]

Zoomed in: [histogram image not included]

Regards,

Joseph Glessner


姚英豪

Jul 3, 2018, 10:33:47 PM

to ParseCNV

Hi Joseph,

Sorry to bother you again.

In the distribution of Segment_Mean values, there are two peaks at about -0.5 and 0.5, so I still cannot figure out why the cut-off values are set to -0.2 and 0.2.

I have TCGA Project=OV data, and I plotted them too.

Not zoomed in: [histogram image not included]

Zoomed in: [histogram image not included]

So, what cut-offs should I use, and why?

Thanks again,

Regards,

Yinghao

On Wednesday, July 4, 2018 at 4:06:20 AM UTC+8, Joe Glessner wrote:

Joe Glessner

Jul 4, 2018, 7:23:09 PM

to 姚英豪, ParseCNV

Hi Yinghao,

This is a Gaussian mixture model problem.

In other words, multiple normal curves are overlapping each other and we need to try to separate them.

Although the peaks are around -0.5 and 0.5, those are just the means of the distributions (the values of highest density), not proper threshold points.

My curve-fitting analysis shows that the curves for normal (CN=2) and deletion (CN=1) intersect at -0.2.

My curve-fitting analysis shows that the curves for normal (CN=2) and duplication (CN=3) intersect at 0.2.

Authors of this paper appear to have arrived at a similar conclusion:

http://mcr.aacrjournals.org/content/12/4/485.long

"To extract a set of high confidence CNVs, we used a threshold of 0.2 in segment mean value for amplifications and −0.2 for deletions. We derived these thresholds by examining the distribution of segment mean values from tumor and normal samples."
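The idea of a threshold at the point where two fitted curves cross can be sketched as follows. Setting two weighted normal densities equal and taking logs gives a quadratic in x, and the root that lies between the two means is the cut-off. This is only an illustration: `gauss_cross` is a hypothetical helper, the weights, means, and standard deviations in the example call are made-up values rather than the parameters fitted in this thread, and the mixture-fitting step itself is not shown.

```shell
# gauss_cross W1 MU1 SD1 W2 MU2 SD2
# Prints the x between MU1 and MU2 where w1*N(mu1,sd1) = w2*N(mu2,sd2), or NA.
gauss_cross() {
  awk -v w1="$1" -v m1="$2" -v s1="$3" -v w2="$4" -v m2="$5" -v s2="$6" 'BEGIN {
      # log(w1*N1(x)) - log(w2*N2(x)) = a*x^2 + b*x + c
      a = 1 / (2 * s2 * s2) - 1 / (2 * s1 * s1)
      b = m1 / (s1 * s1) - m2 / (s2 * s2)
      c = m2 * m2 / (2 * s2 * s2) - m1 * m1 / (2 * s1 * s1) + log((w1 * s2) / (w2 * s1))
      lo = (m1 < m2) ? m1 : m2
      hi = (m1 < m2) ? m2 : m1
      if (a == 0) {                       # equal SDs: the equation is linear
          if (b != 0) { n = 1; r[1] = -c / b }
      } else {
          d = b * b - 4 * a * c
          if (d >= 0) {
              n = 2
              r[1] = (-b + sqrt(d)) / (2 * a)
              r[2] = (-b - sqrt(d)) / (2 * a)
          }
      }
      # Report the first root that falls strictly between the two means
      for (i = 1; i <= n; i++) if (r[i] > lo && r[i] < hi) { printf "%.3f\n", r[i]; exit }
      print "NA"
  }'
}

# Made-up parameters: a narrow CN=2 component at 0 and a broad, rarer CN=1 component at -0.5
gauss_cross 0.8 0 0.05 0.1 -0.5 0.15
```

Because the normal component is narrower and far more frequent than the deletion component, the crossing point sits much closer to 0 than the midpoint between the two peaks.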

Regards,

Joseph Glessner
