GDC TCGA BRCA


Richard

Jan 13, 2023, 12:31:48 PM
to UCSC Xena and Cancer Genomics Browser
I have some questions about GDC TCGA BRCA data vs TCGA BRCA:
- In the TCGA BRCA data (legacy data), the dataset "gene expression RNAseq - IlluminaHiSeq" from https://tcga.xenahubs.net has 20,531 identifiers, corresponding to about 20,000 genes. However, in the GDC TCGA BRCA data (harmonized data), the dataset "gene expression RNAseq - HTSeq - Counts" from the hub https://gdc.xenahubs.net has 60,489 identifiers. What is the difference between them? Why are there 60,489 identifiers?
- In the TCGA BRCA data (legacy data), I can get MC3 gene-level non-silent mutation data (somatic mutations: SNPs and INDELs). However, there is no gene-level non-silent mutation data in GDC TCGA BRCA. How can I get this type of data from GDC TCGA BRCA?
- Legacy data vs. harmonized data: which dataset should I use for analysis?
"The 'legacy' gene expression data refers to the original processed data (the gene expression analysis methods, genome reference and gene models used may differ between cancer types/projects). The harmonized data was produced by the GDC by reprocessing the data using a single analysis pipeline." Is that true?

Thanks.

Mary Goldman

Jan 19, 2023, 2:14:49 PM
to Richard, UCSC Xena and Cancer Genomics Browser
Hi Richard,

Apologies for the delay in my reply! Please see inline below for my answers. If you have any further questions, please email us at genome...@soe.ucsc.edu

Best,
Mary
-----
Mary Goldman (she/her), Design and Outreach Engineer 




---------- Forwarded message ---------
From: Richard <ooc...@gmail.com>
Date: Fri, Jan 13, 2023 at 9:31 AM
Subject: [ucsc-cancer-genomics-browser] GDC TCGA BRCA
To: UCSC Xena and Cancer Genomics Browser <ucsc-cancer-ge...@googlegroups.com>


I have some questions about GDC TCGA BRCA data vs TCGA BRCA:
- In the TCGA BRCA data (legacy data), the dataset "gene expression RNAseq - IlluminaHiSeq" from https://tcga.xenahubs.net has 20,531 identifiers, corresponding to about 20,000 genes. However, in the GDC TCGA BRCA data (harmonized data), the dataset "gene expression RNAseq - HTSeq - Counts" from the hub https://gdc.xenahubs.net has 60,489 identifiers. What is the difference between them? Why are there 60,489 identifiers?

This is because the GDC mapped the data to a different gene set than the legacy TCGA pipeline did. The GDC gene set has 60,489 genes/transcripts, while the legacy TCGA data was mapped to a set of 20,531 genes.
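
To see the difference for yourself, a minimal sketch along these lines can summarize the identifier column of each downloaded matrix. It assumes the harmonized dataset uses versioned Ensembl gene IDs and the legacy dataset uses gene symbols; the identifier lists below are made-up illustrative values, not rows copied from either hub, and `strip_version` and `summarize` are hypothetical helper names.

```python
import re

# Illustrative identifiers only; in practice, read the first column of each
# matrix downloaded from the hubs. Version suffixes here are placeholders.
legacy_ids = ["TP53", "BRCA1", "BRCA2"]
gdc_ids = [
    "ENSG00000141510.15",
    "ENSG00000012048.19",
    "ENSG00000139618.13",
    "ENSG00000223972.5",
]

# Ensembl human gene ID: "ENSG" + 11 digits, optionally followed by ".version".
ENSEMBL_GENE = re.compile(r"^(ENSG\d{11})(?:\.\d+)?$")


def strip_version(identifier):
    """Return the unversioned Ensembl gene ID, or None if it is not one."""
    match = ENSEMBL_GENE.match(identifier)
    return match.group(1) if match else None


def summarize(ids):
    """Count how many identifiers look like Ensembl gene IDs."""
    ensembl_like = sum(1 for i in ids if strip_version(i) is not None)
    return {
        "total": len(ids),
        "ensembl_like": ensembl_like,
        "other": len(ids) - ensembl_like,
    }


legacy_summary = summarize(legacy_ids)
gdc_summary = summarize(gdc_ids)
```

Comparing the two summaries shows whether the datasets share an identifier namespace at all. If they do not, the rows cannot be matched directly and would need a separate ID-to-symbol mapping.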

- In the TCGA BRCA data (legacy data), I can get MC3 gene-level non-silent mutation data (somatic mutations: SNPs and INDELs). However, there is no gene-level non-silent mutation data in GDC TCGA BRCA. How can I get this type of data from GDC TCGA BRCA?

Unfortunately, the GDC does not provide gene-level non-silent mutation data. You can contact the GDC with any questions or comments you might have here: https://gdc.cancer.gov/support.
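
If you have a mutation-level table (one row per somatic call), one possible workaround is to collapse it to a gene-by-sample 0/1 matrix yourself. The sketch below is only an illustration under stated assumptions: the keys `sample`, `gene`, and `effect`, the `NON_SILENT` set, and the function name are all hypothetical, and the set is not the definition used for the MC3 dataset. Adjust them to your file's columns and effect vocabulary.

```python
from collections import defaultdict

# Assumption: effect labels counted as non-silent. This set is illustrative
# and must be adapted to the annotation vocabulary of your mutation file.
NON_SILENT = {
    "missense_variant",
    "stop_gained",
    "frameshift_variant",
    "inframe_insertion",
    "inframe_deletion",
}


def gene_level_nonsilent(calls, samples):
    """Return {gene: {sample: 0 or 1}}.

    A value of 1 means the sample has at least one non-silent call in that
    gene. Genes with no non-silent call in any sample are omitted.
    """
    hit = defaultdict(set)
    for call in calls:
        if call["effect"] in NON_SILENT:
            hit[call["gene"]].add(call["sample"])
    return {
        gene: {s: int(s in hit[gene]) for s in samples}
        for gene in sorted(hit)
    }


# Toy input with hypothetical sample IDs.
calls = [
    {"sample": "S1", "gene": "TP53", "effect": "missense_variant"},
    {"sample": "S1", "gene": "TP53", "effect": "stop_gained"},
    {"sample": "S2", "gene": "TP53", "effect": "synonymous_variant"},
    {"sample": "S2", "gene": "PIK3CA", "effect": "missense_variant"},
]
matrix = gene_level_nonsilent(calls, ["S1", "S2"])
```

The `samples` list should contain every sample that was assayed for mutations, not just those appearing in the call table, so that a 0 means "assayed, no non-silent call" rather than "no data".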

- Legacy data vs. harmonized data: which dataset should I use for analysis?
"The 'legacy' gene expression data refers to the original processed data (the gene expression analysis methods, genome reference and gene models used may differ between cancer types/projects). The harmonized data was produced by the GDC by reprocessing the data using a single analysis pipeline." Is that true?

Yes. Again, you can contact the GDC with any questions or comments you might have here: https://gdc.cancer.gov/support.

Thanks.
