Licensing of the TCGA Pan-Cancer data


daniel.hi...@gmail.com

Jul 13, 2016, 2:28:21 PM
to UCSC Xena and Cancer Genomics Browser
Greetings,

I'm part of a collaborative open science project called Project Cognoma (https://github.com/cognoma/cognoma). We are building a website to help biologists do machine learning on TCGA data. Our current plan is to use the TCGA Pan-Cancer data from Xena, which is available at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN).

I didn't see a license specified for this data. Due to past experience (https://doi.org/bfmk), I try to clarify all data licensing issues before proceeding with data. We would like to reproduce and publish derivatives of the data under an open license, preferably CC0. Is TCGA data on Xena free of copyright because it was produced by the United States Government? Or is the data subject to copyright and available under a specific license? Or is the data publicly available without a specified license, which defaults to all rights reserved?

Best,
Daniel

Mary Goldman

Jul 13, 2016, 3:15:09 PM
to daniel.hi...@gmail.com, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

That looks like a great project! Looking forward to seeing what comes out of it.

There is no license for our data, only our software. We ask that you cite our website (http://xena.ucsc.edu) in whatever websites, software or publications come from your work. You will have to abide by the TCGA publishing guidelines, as applicable: http://cancergenome.nih.gov/publications/publicationguidelines.

Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/




Daniel Himmelstein

Jul 15, 2016, 12:16:54 PM
to Mary Goldman, UCSC Xena and Cancer Genomics Browser
Thanks Mary for helping with our questions.

From a legal perspective, not having a license means all copyright protections apply. This would prevent us from using Xena data unless we explicitly received permission. Would you or your intellectual property department be able to give us permission to use the Xena TCGA data under an open license?

Let us know if part of the issue is a lack of explicit upstream licensing. For such a major resource as the TCGA, I think it's important to have a copyright-compliant path to open reuse.

Best,
Daniel

Mary Goldman

Jul 15, 2016, 2:33:21 PM
to Daniel Himmelstein, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

I'm not aware of any copyright on the TCGA data, but I would recommend talking to them to be sure: https://tcga-data.nci.nih.gov/tcga/tcgaContact.jsp. Please let us know what they say.

Jing Zhu

Jul 15, 2016, 5:28:38 PM
to daniel.hi...@gmail.com, UCSC Xena and Cancer Genomics Browser
Dear Daniel,

I see this page https://figshare.com/articles/TCGA_Pan-Cancer_expression_and_mutation_data_for_Project_Cognoma/3487685 regarding this email thread. I would like to point out that we periodically update our datasets; this is particularly true for TCGA data. This means that the Pan-Cancer dataset you retrieved will become out of date when we update our TCGA data. Each of our datasets has metadata in an accompanying .json file available for download, and that metadata file records the version number of each file. It is important to include that information. The gene expression and mutation datasets each have different data files, so it is also important to state exactly which data file you are incorporating into your system. Including the .json file alongside the data is perhaps the minimum.

Could you please also add our browser URL to your post? The URL is http://xena.ucsc.edu

Jing

UCSC Xena Team
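
The bookkeeping suggested above could be sketched roughly as follows. This is only an illustration, not Xena's or Cognoma's actual tooling: the function name `provenance_record` is hypothetical, and the `"version"` key is an assumption about how the metadata .json is laid out, so check a real metadata file before relying on it. The checksum is an extra precaution that the thread does not ask for.

```python
import hashlib
import json
from pathlib import Path


def provenance_record(data_path, metadata_path, version_key="version"):
    """Summarise which data file was downloaded and which version it claims.

    `version_key` is an assumed field name; adjust it to match the
    actual metadata .json that accompanies the dataset.
    """
    data_path = Path(data_path)
    metadata_path = Path(metadata_path)
    metadata = json.loads(metadata_path.read_text())
    # Hash the data file so a later re-download can be compared byte for byte.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return {
        "data_file": data_path.name,
        "metadata_file": metadata_path.name,
        "version": metadata.get(version_key),  # None if the key is absent
        "sha256": digest,
    }
```

Storing such a record next to the retained .json file makes it possible to tell later whether a fresh download differs from the copy that was analysed.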

Daniel Himmelstein

Jul 15, 2016, 8:05:11 PM
to Jing Zhu, UCSC Xena and Cancer Genomics Browser
Thanks Jing for the input.

I now retain the JSON metadata files in our repository. I updated the figshare to reference the versioning and Xena URL.

When figuring out the download link for the JSON files, I realized we were using an outdated link to retrieve Pan-Cancer datasets. See this commit for more information. Was happy to see that the unstandardized mutation effects resolved with the new URL. An unrelated issue still persists, which is duplicate rows in PANCAN_mutation (see cell 5 in this notebook).

Feel free to get involved on our GitHub, if you'd like to join the effort!

Best,
Daniel

Daniel Himmelstein

Jul 18, 2016, 6:46:36 PM
to Jing Zhu, UCSC Xena and Cancer Genomics Browser
Greetings Jing & Xena Team,

I heard back from the TCGA. The Open Access Data Tier of the TCGA is in the public domain. In other words, no restrictions are placed on reuse. You're free to license derivatives however you would like.

With this in mind, would it be possible to openly license the Xena Browser TCGA data?

Best,
Daniel

Mary Goldman

Jul 22, 2016, 2:48:24 PM
to Daniel Himmelstein, Jing Zhu, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

The person who is in charge of our licenses is out this week but will be back on Monday. I'm guessing this is no problem and am looking at this license: http://opendatacommons.org/licenses/by/summary/. I will make sure this is on his desk on Monday morning for his review.


Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/



Jing Zhu

Jul 22, 2016, 3:17:07 PM
to Daniel Himmelstein, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

Regarding the duplicated rows: they arise because some samples have multiple sequencing results, where multiple TCGA aliquots map to a single sample. We left the duplicates in the file intentionally. If the same mutation is called on two rows, it is essentially a biological replicate. Within these duplicated samples, most mutations appear as, for example, two identical rows for sample A, but occasionally there is only one row for mutation B in sample A, which means that mutation B was not detected in the biological replicate.

Jing 
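
One possible way to inspect the duplicated rows described above is to collapse identical rows while keeping a count of how often each was seen. This is a minimal sketch under stated assumptions: the helper name `collapse_replicates` and the example row layout (sample, chromosome, position, change) are hypothetical and do not reflect the actual columns of PANCAN_mutation. Whether collapsing is appropriate at all depends on the analysis, since the duplicates are intentional.

```python
from collections import Counter


def collapse_replicates(rows):
    """Collapse identical rows, appending the number of times each occurred.

    A count of 2 or more suggests the call was seen in more than one
    aliquot; a count of 1 means it was seen only once.
    """
    counts = Counter(tuple(row) for row in rows)
    # Counter keeps first-seen order, so the output order follows the input.
    return [key + (n,) for key, n in counts.items()]


# Hypothetical rows: (sample, chromosome, position, change)
example = [
    ("sample_A", "chr1", 100, "G>A"),
    ("sample_A", "chr1", 100, "G>A"),
    ("sample_A", "chr2", 200, "C>T"),
]
collapsed = collapse_replicates(example)
```

Here the first mutation ends up with a count of 2 and the second with a count of 1. A count of 1 alone does not show that a replicate exists for that sample, so it is worth checking against the aliquot information before interpreting it as a missed call.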