Re: [cognoma/cancer-data] Variable documentation for Xena Browser datasets (#14)

Jing Zhu

Aug 9, 2016, 1:29:55 AM

to cognoma/cancer-data, ucsc-cancer-ge...@googlegroups.com, cognoma/cancer-data, Mention

Forward to google group.

On Mon, Aug 8, 2016 at 6:14 PM, Roshan Shetty <notifi...@github.com> wrote:

Please correct me wherever I am wrong, as my knowledge of genomics is nil.

Thanks. Let's start with the clinical matrix dataset. Here's what I understand about variables whose names start with _GENOMIC_ID_TCGA_PANCAN, e.g. _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27.

They seem to be some sort of flag variable denoting the gene (e.g. HumanMethylation27) present in the sample. If the value is not NaN (when it isn't, it looks like the patient ID), then the gene is present in the sample.

Also, I did not understand what _RFS, _RFS_UNIT & _RFS_IND mean. It seems like _TIME_TO_EVENT means the time it took for the cell to mutate.

Jing Zhu

Aug 9, 2016, 2:17:02 PM

to UCSC Xena and Cancer Genomics Browser, reply+000f404532d14e7e0e23dc422f414a20c125944...@reply.github.com, cance...@noreply.github.com

Identifiers
_EVENT: event (in this case, the overall survival event)
_INTEGRATION: id used for integrating data on the xena browser and across cohort
_OS : overall survival time
_OS_IND : overall survival event
_OS_UNIT: overall survival time unit 
_PANCAN_CNA_PANCAN_K8: 2012 pancan paper publication data 
_PANCAN_Cluster_Cluster_PANCAN: 2012 pancan paper publication data
_PANCAN_DNAMethyl_PANCAN: 2012 pancan paper publication data
_PANCAN_RPPA_PANCAN_K8: 2012 pancan paper publication data
_PANCAN_UNC_RNAseq_PANCAN_K16: 2012 pancan paper publication data
_PANCAN_miRNA_PANCAN: 2012 pancan paper publication data
_PANCAN_mutation_PANCAN: 2012 pancan paper publication data
_PATIENT: TCGA patient id
_RFS: recurrence-free survival (Xena curated; note: I trust the overall survival data much more)
_RFS_IND: recurrence-free survival event
_RFS_UNIT: RFS time unit
_TIME_TO_EVENT: time to event (in this case, it is exactly like overall survival event)
_TIME_TO_EVENT_UNIT: time unit 
_cohort: cohort name (also used as cohort id)
_primary_disease: primary_disease
_primary_site: primary organ of origin
age_at_initial_pathologic_diagnosis
gender
sampleID: sample id (same as _INTEGRATION)
sample_type: sample type
sample_type_id

Anything starting with _GENOMIC_ID holds legacy mapping information for the original UUIDs from the TCGA DCC (which has been replaced by the GDC). Therefore I don't think any of these mappings is going to be useful anymore, at least to the vast majority of people.
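A minimal sketch of setting those legacy columns aside when reading the clinical matrix. It assumes the file is tab-delimited with one row per sample; the two-row excerpt, the barcodes, and the helper name `drop_legacy_columns` are hypothetical and only illustrate the column naming described above.

```python
import csv
import io

# Hypothetical excerpt; the real clinical matrix has many more rows and columns.
SAMPLE_TSV = (
    "sampleID\t_PATIENT\t_OS\t_OS_IND\t_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27\n"
    "TCGA-AA-0001-01\tTCGA-AA-0001\t358\t1\texample-legacy-id\n"
    "TCGA-AA-0002-01\tTCGA-AA-0002\t144\t0\t\n"
)


def drop_legacy_columns(tsv_text, prefix="_GENOMIC_ID"):
    """Return rows as dicts, skipping columns whose name starts with `prefix`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {name: value for name, value in row.items() if not name.startswith(prefix)}
        for row in reader
    ]


rows = drop_legacy_columns(SAMPLE_TSV)
print(sorted(rows[0]))  # remaining column names for the first sample
```

Filtering by prefix rather than by an explicit list keeps the sketch independent of how many _GENOMIC_ID columns a given release contains.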

Also, note that you can take a look at the dataset detail page at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_clinicalMatrix&host=https://tcga.xenahubs.net

Then click on the "all identifiers" link to see all the variables available: https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.PANCAN.sampleMap%2FPANCAN_clinicalMatrix&label=Phenotypes&allIdentifiers=true

Jing

Mary Goldman

Aug 11, 2016, 11:37:53 AM

to Jing Zhu, UCSC Xena and Cancer Genomics Browser, reply+000f404532d14e7e0e23dc422f414a20c125944...@reply.github.com, cance...@noreply.github.com

Hello,

And to follow up, '_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27' denotes the genomic sample ID in the Methylation 27K dataset. TCGA gives different IDs for each sample in each dataset: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

For _RFS, _RFS_UNIT, _RFS_IND and _TIME_TO_EVENT, please see this help page: http://xena.ucsc.edu/km-plot-help/. _RFS is 'recurrence free survival'.

Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/

Jing Zhu

Aug 16, 2016, 1:54:46 PM

to cognoma/cancer-data, ucsc-cancer-ge...@googlegroups.com, cognoma/cancer-data, Mention

On Tue, Aug 16, 2016 at 8:28 AM, Daniel Himmelstein <notifi...@github.com> wrote:

@jingchunzhu / Mary -- are the "Sample IDs" in Xena Browser:

TCGA Barcodes?

TCGA UUIDs?

Xena-specific identifiers?

Sample IDs in Xena Browser are TCGA barcodes, specifically at the sample level ("TCGA gives different IDs"). The reason to use this level is to get the best integration of the various genomic data types. If you go to a level below samples, you will have many more entities with missing dimensions, for example mutation data but no expression data. If you go to the patient level, you will have to handle primary tumor, recurrent tumor and, most often, normal samples from the same patient; essentially you will probably end up throwing out the normal sample data.
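To illustrate the sample level versus the patient level, here is a rough sketch that splits a TCGA barcode into its patient part and its sample part. It relies on the standard barcode layout (TCGA-TSS-participant-sample...), where the two digits after the third hyphen are the sample type code. The barcodes and function names are made up for the example, and only three sample type codes are listed.

```python
from collections import defaultdict

# A few standard TCGA sample type codes (not the full list).
SAMPLE_TYPE_CODES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "11": "Solid Tissue Normal",
}


def split_barcode(barcode):
    """Return (patient_id, sample_id, sample_type_code) for a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or len(parts[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    patient_id = "-".join(parts[:3])
    type_code = parts[3][:2]  # drops the vial letter, e.g. "01A" -> "01"
    return patient_id, f"{patient_id}-{type_code}", type_code


def group_by_patient(barcodes):
    """Collect the sample-level IDs (with sample type) that belong to each patient."""
    grouped = defaultdict(list)
    for barcode in barcodes:
        patient_id, sample_id, type_code = split_barcode(barcode)
        label = SAMPLE_TYPE_CODES.get(type_code, "other")
        grouped[patient_id].append((sample_id, label))
    return dict(grouped)


# Hypothetical barcodes: one patient with tumor and normal samples, one longer barcode.
EXAMPLE_BARCODES = [
    "TCGA-AA-0001-01",
    "TCGA-AA-0001-11",
    "TCGA-AA-0002-01A-11D-1234-05",
]
grouped = group_by_patient(EXAMPLE_BARCODES)
print(grouped)
```

Grouping this way shows why the patient level is awkward: one patient can map to several samples of different types, and collapsing them means choosing which to keep.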

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 denotes the genomic sample ID in the Methylation 27K dataset. TCGA gives different IDs for each sample in each dataset.

I can't tell if there is a question about _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27?

It makes sense now why fields like _PANCAN_mutation_PANCAN are encoded as missing / sample_id rather than binary (0 / 1).
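If a binary flag is what you need, one possible way to derive it from a missing / sample_id column is sketched below. The set of strings treated as missing is an assumption about how the file was parsed, and `to_flags` is a hypothetical helper name.

```python
# Assumed spellings of "missing" after parsing; adjust to match the actual file.
MISSING_MARKERS = {"", "nan", "na"}


def to_flags(values):
    """Map each value to 1 if it holds an identifier, 0 if it is missing."""
    return [
        0 if value is None or value.strip().lower() in MISSING_MARKERS else 1
        for value in values
    ]


# Hypothetical column values: one identifier and three kinds of missing entry.
flags = to_flags(["TCGA-AA-0001-01", "", None, "NaN"])
print(flags)
```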

Jing
