Struggling to get all the TCGA metadata: seeing differences between GDC and Firebrowse

Leonardo Collado Torres

Nov 10, 2016, 4:16:21 PM

to gd...@broadinstitute.org, Abhi Nellore, Jeff Leek, Ben Langmead

Hi,

For a project we are trying to get all the metadata available for 11,285 RNA-seq samples from the TCGA project. Getting this metadata has been harder than we anticipated.

Using the GDC API we can download a JSON file that has 11,285 rows. However, a collaborator pointed out to us that some information available in Firebrowse (recurrence information on pancreas) is missing from this file. We don't know why.
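For readers who want to reproduce this kind of download, here is a minimal sketch of how such a query could be assembled. It is not the code from the gist linked in this post. The endpoint URL is the current public GDC address (an assumption, since this thread predates it), and the filter and field names are a guess at a reasonable selection, not necessarily the ones used for the 11,285-row file.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: current public GDC endpoint, not necessarily the one used in 2016.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_params(program="TCGA", strategy="RNA-Seq", size=100):
    """Build query parameters asking the files endpoint for RNA-seq files."""
    # Illustrative filter: files from one program with one experimental strategy.
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.program.name", "value": [program]}},
            {"op": "in",
             "content": {"field": "experimental_strategy", "value": [strategy]}},
        ],
    }
    # Illustrative field selection: file identifiers plus case identifiers.
    fields = [
        "file_id",
        "file_name",
        "cases.case_id",
        "cases.submitter_id",
        "cases.project.project_id",
    ]
    return {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }


def fetch_hits(params):
    """Run the query and return the list of hits (needs network access)."""
    url = GDC_FILES_ENDPOINT + "?" + urlencode(params)
    with urlopen(url) as response:
        return json.load(response)["data"]["hits"]
```

Calling `fetch_hits(build_gdc_params(size=10))` would return the first ten matching records, one per file. Keeping the case UUID and the submitter barcode side by side in each record makes later joins against other sources easier.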

Now, if we use FirebrowseR https://github.com/mariodeng/FirebrowseR or even look at the raw data at http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/THCA/20160128/, some patient barcodes that we did get via the GDC API are missing from Firebrowse. Manually searching for these barcodes on the GDC website shows that 38 of these 42 barcodes have missing clinical data. For example, https://gdc-portal.nci.nih.gov/cases/7d8e6965-51ea-4864-968e-5cecfb7da48a. For the 4 that do have clinical data, it can be retrieved using the TCGAbiolinks Bioconductor package, which queries GDC behind the scenes. One example of those 4 is https://gdc-portal.nci.nih.gov/cases/76c4d298-6f17-4f11-a706-cd844a0e0914. I haven't been able to find information on Firebrowse as to why these 42 barcodes are missing. Note that they are not listed at http://gdac.broadinstitute.org/runs/stddata__latest/samples_report/Redactions.html. With FirebrowseR we do get 11,185 unique file UUIDs (close to the 11,285 from GDC), but we cannot match those to the GDC data.
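The barcode comparison described above can be sketched roughly as follows. This is an illustration, not the code from the gist: the function names are hypothetical and the example barcodes are made up. It relies on the TCGA convention that the first 12 characters of a barcode (TCGA-XX-XXXX) identify the patient, and it upper-cases everything as a defensive assumption in case the two sources differ in case.

```python
def patient_barcode(barcode):
    """Reduce any TCGA barcode to its 12-character patient-level prefix."""
    return barcode.strip().upper()[:12]


def compare_barcodes(gdc_barcodes, firebrowse_barcodes):
    """Report which patient barcodes are shared and which are source-specific."""
    gdc = {patient_barcode(b) for b in gdc_barcodes}
    firebrowse = {patient_barcode(b) for b in firebrowse_barcodes}
    return {
        "shared": sorted(gdc & firebrowse),
        "gdc_only": sorted(gdc - firebrowse),
        "firebrowse_only": sorted(firebrowse - gdc),
    }


# Made-up barcodes, for illustration only.
gdc_example = ["TCGA-AA-0001-01A-11R-A000-07", "TCGA-AA-0002-01A-11R-A000-07"]
firebrowse_example = ["tcga-aa-0001", "TCGA-AA-0003"]

report = compare_barcodes(gdc_example, firebrowse_example)
for group, barcodes in report.items():
    print(group, barcodes)
```

The `gdc_only` group corresponds to the kind of barcodes discussed above: present in the GDC download but absent from Firebrowse.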

* What else would you use to try to match the GDC and FirebrowseR data? Note that with FirebrowseR we get different tables of information for each TCGA project. For example, bcr_followup_uuid is missing in one of the projects.

* How did Firebrowse get the recurrence information that is missing in GDC? (and maybe other variables) See the biochemical_recurrence field on pancreas (PRAD) for example.

* Is it possible to get a flat metadata table via Firebrowse? I could try to do it from the output I get from FirebrowseR but maybe you have it somewhere.
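To make the last question concrete, here is a rough sketch of what a flat table built from per-project tables could look like. It is only an illustration under stated assumptions: each project's table is represented as a list of dicts, the column names and values are made up, and any column missing from a project (as with bcr_followup_uuid above) is filled with "NA".

```python
import csv
import io


def flatten_tables(tables_by_project, missing="NA"):
    """Stack per-project tables into one table over the union of all columns."""
    # Collect the union of column names, keeping first-seen order.
    columns = ["project"]
    for rows in tables_by_project.values():
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    # Re-emit every row with all columns, filling the gaps.
    flat = []
    for project, rows in tables_by_project.items():
        for row in rows:
            flat_row = {col: row.get(col, missing) for col in columns}
            flat_row["project"] = project  # overrides any column named "project"
            flat.append(flat_row)
    return columns, flat


def to_tsv(columns, rows):
    """Serialize the flat table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows: the second project lacks the bcr_followup_uuid column.
example = {
    "PRAD": [{"tcga_participant_barcode": "TCGA-AA-0001",
              "bcr_followup_uuid": "made-up-uuid"}],
    "THCA": [{"tcga_participant_barcode": "TCGA-BB-0002"}],
}

columns, rows = flatten_tables(example)
print(to_tsv(columns, rows))
```

Taking the union of columns rather than the intersection keeps project-specific variables in the table, at the cost of many "NA" cells.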

You can find our code and some log files at https://gist.github.com/lcolladotor/18c60a7e42b290cabd555aeda3e19ea7.

Thanks,
Leonardo

Leonardo Collado Torres, Ph. D., Data Scientist
Lieber Institute for Brain Development
Clinical Sciences Division
855 N Wolfe St, Suite 300
Baltimore, MD 21205
Website: http://lcolladotor.github.io/about.html
Blog: http://lcolladotor.github.io/

Leonardo Collado Torres

Nov 10, 2016, 6:31:08 PM

to gd...@broadinstitute.org, Abhi Nellore, Jeff Leek, Ben Langmead

Hi,

I also posted a question about using TCGAbiolinks to get this data: https://support.bioconductor.org/p/89315/

Best,

Leonardo

fellg...@gmail.com

Nov 10, 2016, 6:31:08 PM

to Gdac-users, gd...@broadinstitute.org, anel...@gmail.com, jtl...@gmail.com, lan...@cs.jhu.edu, leo.c...@libd.org

Sorry, I meant prostate instead of pancreas!

Michael Noble

Nov 10, 2016, 6:36:12 PM

to fellg...@gmail.com, Gdac-users, anel...@gmail.com, jtl...@gmail.com, lan...@cs.jhu.edu, leo.c...@libd.org

No problem, Leonardo, and thanks for clarifying. We are currently in a crunch period preparing for a release next week and may not be able to look at this until then (or at least for a few more business days). I mention this because questions such as the ones you've posed usually take a fair bit of forensics to answer. It seems you probably appreciate that by now, but I wanted to acknowledge your note while buying us a little more breathing room to actually poke around in the data. What is your timeline?

Thank You,

Michael S. Noble

Associate Director for Data Science

Cancer Genome Computational Analysis

Broad Institute of MIT and Harvard

Manager, Genome Data Analysis Center

The Cancer Genome Atlas

Leonardo Collado Torres

Nov 18, 2016, 3:25:58 PM

to gd...@broadinstitute.org, Gdac-users

Hi Michael,

No problem and yes, I do appreciate that these things are not easy to figure out. In any case, I prefer getting it right.

We now have metadata extracted from the GDC XML files (see the TCGAbiolinks link), the GDC API, and the CGC (Seven Bridges). So we are in no hurry.