Issues with replicating DFS curves generated in cBioPortal using raw data from TCGA

sooky...@gmail.com

Mar 5, 2018, 9:47:17 AM
to cBioPortal for Cancer Genomics Discussion Group
Hello,

I was using cBioPortal to plot rough disease-free survival (DFS) curves for my genes of interest in the TCGA-PRAD (provisional) study. Since it looked like there might be an effect, I then went to TCGA and downloaded the raw data to re-analyze it based on other parameters.

However, I realized that the TCGA raw files for PRAD do not contain any data related to disease-free survival. The closest value I could find was "Days to last follow up" in the tables I downloaded from TCGA, which I understand to be very different from DFS. In the "days_to_recurrence" column, every entry is "NA".

Exactly what numbers is cBioPortal plotting when I look at disease-free survival curves in the TCGA-PRAD dataset?

Thank you!!!
Jess

Kelsey Zhu

Mar 6, 2018, 11:09:50 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi Jessica,

You can find the numbers used by the disease-free survival plot for the TCGA-PRAD dataset online:

1. Enter "TCGA-PRAD" in the search box on the query page.
2. Select the "Genomic Hallmarks of Prostate Adenocarcinoma" study and click the rightmost icon, "view study summary".
3. On the study summary page, click the "Clinical Data" tab. The "Disease Free (Months)" column shows the numbers used by the disease-free survival plot (see attached screenshot).
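The thread does not say how the portal turns these numbers into a curve. A Kaplan-Meier estimate is the usual choice for survival plots, so the sketch below assumes it. It takes a follow-up time per patient (in months) and a flag saying whether recurrence/progression was observed. The function name and the example values are made up for illustration and are not TCGA data.

```python
from collections import Counter

def kaplan_meier(months, events):
    """Return [(time, survival_probability)] at each distinct event time.

    months: follow-up time per patient
    events: True if recurrence/progression was observed, False if censored
    """
    event_counts = Counter(t for t, e in zip(months, events) if e)
    exit_counts = Counter(months)  # every patient leaves the risk set at their own time
    at_risk = len(months)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who stay event-free at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Made-up values for illustration only.
example_months = [5.0, 8.0, 8.0, 12.0, 20.0]
example_events = [True, False, True, False, True]
curve = kaplan_meier(example_months, example_events)
```

Censored patients (no recurrence observed) lower the number at risk without producing a step in the curve, which is why both the months column and the status column are needed.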

Best,
kelsey
Screen Shot 2018-03-06 at 11.02.40 AM.png

JJ Gao

Mar 7, 2018, 10:07:16 AM
to sooky...@gmail.com, cBioPortal for Cancer Genomics Discussion Group
Hi Jess,

Where did you get the TCGA raw files? If you download the study file
(there is a download button on the study page), the last two columns in
the file data_bcr_clinical_data_patient.txt are DFS_STATUS and
DFS_MONTHS, which contain the data we used for DFS analysis
on the portal.
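As a rough sketch, the two columns could be pulled out of the downloaded file along these lines. Only the file name and the DFS_STATUS and DFS_MONTHS column names come from this message. The tab delimiter, the '#' metadata lines, the PATIENT_ID column, and the status labels in the sample are assumptions and should be checked against the actual file.

```python
import csv
import io

def read_dfs(handle):
    """Yield (patient_id, dfs_months, recurred) from a tab-delimited clinical file."""
    # Assumption: metadata lines, if present, start with '#'.
    rows = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        try:
            months = float(row["DFS_MONTHS"])
        except (KeyError, TypeError, ValueError):
            continue  # skip patients without a usable DFS time
        status = row.get("DFS_STATUS") or ""
        # Assumption: a recurrence/progression label contains "Recurred".
        yield row.get("PATIENT_ID", ""), months, "recurred" in status.lower()

# Made-up rows that mimic the assumed layout; not real patient data.
sample = io.StringIO(
    "#Patient Identifier\tDisease Free Status\tDisease Free (Months)\n"
    "PATIENT_ID\tDFS_STATUS\tDFS_MONTHS\n"
    "P1\tDiseaseFree\t30.5\n"
    "P2\tRecurred/Progressed\t12.0\n"
    "P3\t[Not Available]\t[Not Available]\n"
)
records = list(read_dfs(sample))
```

For the real file, replace the sample with `open("data_bcr_clinical_data_patient.txt", newline="")`.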

Thanks,
-JJ
> --
> You received this message because you are subscribed to the Google Groups
> "cBioPortal for Cancer Genomics Discussion Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cbioportal+...@googlegroups.com.
> To post to this group, send email to cbiop...@googlegroups.com.
> Visit this group at https://groups.google.com/group/cbioportal.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/cbioportal/ee3b41cd-5c44-40b9-8981-21b4ec53736b%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Sook Yuin Jessica Ho

Mar 7, 2018, 11:22:06 PM
to JJ Gao, cBioPortal for Cancer Genomics Discussion Group
Dear JJ,

Thank you for your reply!

I downloaded the data from the NCI GDC Data Portal. I thought that the GDC dataset would be the original source from which cBioPortal charts are derived. However, in the GDC dataset, the column for days to recurrence is empty. Instead, the column whose values are closest to DFS_MONTHS in the cBioPortal table is "Days to last follow up". I was thus wondering why there is a discrepancy between the two tables when they are supposedly derived from the same dataset. Is cBioPortal drawing the DFS information from an alternate source?
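Because the GDC column is in days and DFS_MONTHS is in months, comparing the two tables needs a unit conversion. The sketch below assumes an average month of 365.25 / 12 days. The divisor actually used for the portal data is not stated in this thread, so small differences are expected, and the helper names and tolerance are illustrative only.

```python
DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length, about 30.44 days

def days_to_months(days):
    """Convert a follow-up time in days to months."""
    return days / DAYS_PER_MONTH

def roughly_equal(days, months, tolerance=0.5):
    """True if a value in days matches a value in months within `tolerance` months."""
    return abs(days_to_months(days) - months) <= tolerance

# Made-up numbers: 913 days is just under 30 months.
match = roughly_equal(913, 30.0)
mismatch = roughly_equal(913, 20.0)
```

A close match with "Days to last follow up" would be expected mainly for patients with no recorded recurrence, since their disease-free time is presumably censored at the last follow-up.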

I have attached a screenshot of the table I downloaded from GDC.


GDC table

Thank you once again!

Cheers
Jess


JJ Gao

Mar 8, 2018, 8:34:06 AM
to Sook Yuin Jessica Ho, cBioPortal for Cancer Genomics Discussion Group
Hi Jess,

Our data is still based on the pre-GDC TCGA dataset that we downloaded from Broad Firehose and the TCGA DCC. That may partly explain the discrepancy. You might also want to look at the follow-up clinical data in GDC for recurrence/progression data.

Best,
-JJ


Sook Yuin Jessica Ho

Mar 14, 2018, 10:07:49 AM
to cBioPortal for Cancer Genomics Discussion Group
Dear JJ,

Apologies for missing your reply last week.

Thank you for your clarifications. I have looked at both the legacy and current GDC tables, but I still cannot find any data on recurrence or progression.

Is it possible to find out the header names of the columns cBioPortal has been using in the pre-GDC TCGA datasets to extract disease-free survival data? I fear I might be looking at the wrong columns for my analysis.

Once again, thank you for your help!!

Cheers,
Jess

JJ Gao

Mar 14, 2018, 5:49:15 PM
to Sook Yuin Jessica Ho, cBioPortal for Cancer Genomics Discussion Group
Hi Jess,

It seems that GDC doesn't include the follow-up data in the text output. You can find the data in the XML files, e.g. https://portal.gdc.cancer.gov/files/e0d99d11-f683-4159-8328-bb7cbdf457bd
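One way to pull follow-up values out of such an XML file is to match elements by local name and ignore the namespaces, as in the sketch below. The element names and the namespace in the sample are hypothetical placeholders, not the confirmed layout of the GDC clinical XML. Open the downloaded file to see which element names it actually uses.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def collect_fields(xml_text, wanted):
    """Return {field_name: [non-empty text values]} for the wanted local names."""
    found = {name: [] for name in wanted}
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        text = (element.text or "").strip()
        if name in found and text:
            found[name].append(text)
    return found

# Hypothetical document for illustration; element names and namespace are assumptions.
sample_xml = """<root xmlns:fu="http://example.org/follow_up">
  <fu:follow_up>
    <fu:days_to_last_followup>913</fu:days_to_last_followup>
    <fu:days_to_new_tumor_event></fu:days_to_new_tumor_event>
  </fu:follow_up>
  <fu:follow_up>
    <fu:days_to_last_followup>1200</fu:days_to_last_followup>
    <fu:days_to_new_tumor_event>1100</fu:days_to_new_tumor_event>
  </fu:follow_up>
</root>"""

fields = collect_fields(sample_xml, ["days_to_last_followup", "days_to_new_tumor_event"])
```

Matching on local names keeps the sketch independent of the namespace prefixes, which may differ between files.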

-JJ 


Sook Yuin Jessica Ho

Mar 15, 2018, 10:06:27 AM
to JJ Gao, cBioPortal for Cancer Genomics Discussion Group
Thank you JJ! This is very helpful! 