Mutation Data Versioning

grego...@gmail.com

Aug 10, 2016, 4:57:55 PM
to UCSC Xena and Cancer Genomics Browser
Hello,

I have noticed that different versions of PanCancer mutation data have been deposited in the browser here: https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_mutation&host=https://tcga.xenahubs.net

The current version is not a complete representation of previous versions. It lacks several columns and several samples that were present in the version I last accessed on 12 June 2015.

Is there a place where data is versioned besides in the JSON metadata (see discussion here: https://groups.google.com/forum/?fromgroups#!msg/ucsc-cancer-genomics-browser/eg6nJOFSefw/ugDDKLIOAwAJ;context-place=forum/ucsc-cancer-genomics-browser)? If not, would you be able to update the mutation data to the more complete format?

Thanks,
Greg

Jing Zhu

Aug 10, 2016, 5:32:44 PM
to UCSC Xena and Cancer Genomics Browser
Hi Greg,

On Wednesday, August 10, 2016 at 1:57:55 PM UTC-7, Gregory Way wrote:
Hello,

I have noticed that different versions of PanCancer mutation data have been deposited in the browser here: https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_mutation&host=https://tcga.xenahubs.net   
 
The current version is not a complete representation of previous versions. It lacks several columns and several samples that were present in the version I last accessed on 12 June 2015.
 
Yes, these are not from the same release. However, I am surprised by "lacking several columns". Could you say how the columns differ?

A change in sample number is not surprising because we periodically update TCGA data. This particular dataset is compiled by the Xena team at UCSC. In almost all cases, TCGA has multiple versions of mutation calls from several sequencing and analysis groups (Broad, WashU, BCM, and UCSC), as well as curated and automated calls and different sequencing platforms. We made an internal decision about which datasets to include, and the exact selection has changed over time, not drastically, but there are changes. These changes affect sample numbers.
Starting in 2016, we store our release data on AWS S3, which means that all versions of the data from 2016 onward will be on S3. We plan to continue doing so as long as there are resources to sustain it. The .json files are part of the data releases and store the version information. Our previous data releases are not on S3. Do you need the previous version that you retrieved on 12 June 2015? We can send it to you directly.
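
For anyone who wants to record which release a downloaded file came from, here is a minimal Python sketch that reads the version from a dataset's .json metadata. It assumes the metadata is a JSON object with a top-level "version" key; that key name and the function name dataset_version are assumptions for illustration, so please check them against the actual .json file.

```python
import json
from typing import Optional


def dataset_version(metadata_text: str) -> Optional[str]:
    """Return the 'version' value from dataset metadata, or None if absent.

    Assumption: the metadata is a JSON object with a top-level "version" key.
    """
    metadata = json.loads(metadata_text)
    version = metadata.get("version")
    return None if version is None else str(version)


# Placeholder metadata for illustration only; a real file would be read from disk.
example = '{"version": "YYYY-MM-DD"}'
print(dataset_version(example))
```

Storing this value next to any analysis output makes it easier to tell later whether two downloads came from the same release.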
 
If not, would you be able to update the mutation data to the more complete format?

Jing
 
 

Gregory Way

Aug 22, 2016, 12:19:46 PM
to UCSC Xena and Cancer Genomics Browser
Hi Jing,

Thank you for your prompt response! The current data version has far fewer columns than the 12 June 2015 version.

It is clear that some columns were merged, because several in the older version are redundant. However, several useful columns (e.g. HGVSc, Entrez_Gene_Id, PolyPhen) are also omitted.

These are the columns in the current version on the browser:

#sample
chr
start
end
reference
alt
gene
effect
DNA_VAF
RNA_VAF
Amino_Acid_Change

These are the columns in the 12 June 2015 version:

Hugo_Symbol
Entrez_Gene_Id
Center
NCBI_Build
Chromosome
Start_Position
End_Position
Strand
Variant_Classification
Variant_Type
Reference_Allele
Tumor_Seq_Allele1
Tumor_Seq_Allele2
dbSNP_RS
dbSNP_Val_Status
Tumor_Sample_Barcode
Matched_Norm_Sample_Barcode
Match_Norm_Seq_Allele1
Match_Norm_Seq_Allele2
Tumor_Validation_Allele1
Tumor_Validation_Allele2
Match_Norm_Validation_Allele1
Match_Norm_Validation_Allele2
Verification_Status
Validation_Status
Mutation_Status
Sequencing_Phase
Sequence_Source
Validation_Method
Score
BAM_File
Sequencer
Tumor_Sample_UUID
Matched_Norm_Sample_UUID
HGVSc
HGVSp
HGVSp_Short
Transcript_ID
Exon_Number
t_depth
t_ref_count
t_alt_count
n_depth
n_ref_count
n_alt_count
all_effects
Allele
Gene
Feature
Feature_type
Consequence
cDNA_position
CDS_position
Protein_position
Amino_acids
Codons
Existing_variation
ALLELE_NUM
DISTANCE
STRAND
SYMBOL
SYMBOL_SOURCE
HGNC_ID
BIOTYPE
CANONICAL
CCDS
ENSP
SWISSPROT
TREMBL
UNIPARC
RefSeq
SIFT
PolyPhen
EXON
INTRON
DOMAINS
GMAF
AFR_MAF
AMR_MAF
ASN_MAF
EAS_MAF
EUR_MAF
SAS_MAF
AA_MAF
EA_MAF
CLIN_SIG
SOMATIC
PUBMED
MOTIF_NAME
MOTIF_POS
HIGH_INF_POS
MOTIF_SCORE_CHANGE
IMPACT
PICK
VARIANT_CLASS
TSL
HGVS_OFFSET
PHENO
tumor_type
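
A comparison like the one above can be reproduced from the file headers. The following is a minimal Python sketch, assuming both versions are tab-separated files with a single header row; the function names read_header and compare_columns are hypothetical, and the inline headers are a small subset of the columns listed above standing in for the real files.

```python
import csv
import io


def read_header(handle):
    """Return the column names from the first row of a tab-separated file."""
    return next(csv.reader(handle, delimiter="\t"))


def compare_columns(old_cols, new_cols):
    """Report which columns were dropped, added, or kept between versions."""
    old_set, new_set = set(old_cols), set(new_cols)
    return {
        "only_in_old": [c for c in old_cols if c not in new_set],
        "only_in_new": [c for c in new_cols if c not in old_set],
        "shared": [c for c in old_cols if c in new_set],
    }


# Stand-ins for the two files; real files would be opened with open(path, newline="").
old_file = io.StringIO("Hugo_Symbol\tEntrez_Gene_Id\tChromosome\tHGVSc\tPolyPhen\n")
new_file = io.StringIO("#sample\tchr\tstart\tend\treference\talt\tgene\n")

diff = compare_columns(read_header(old_file), read_header(new_file))
print("only in old:", diff["only_in_old"])
print("only in new:", diff["only_in_new"])
```

Because the two versions use different naming schemes (for example Chromosome versus chr), an exact name match reports renamed columns as dropped, so the result still needs a manual pass.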

Thanks!
Greg

Jing Zhu

Aug 26, 2016, 2:56:29 PM
to Gregory Way, UCSC Xena and Cancer Genomics Browser
Hi Greg,

These columns are from TCGA .maf files for sure. 

We download and process these .maf files to derive the Xena data files (the ones with a far smaller number of columns) that we provide for download.

I don't think we provided .maf files for download, at least not intentionally. We do provide a raw data link on the dataset details page of our web site, on the "raw data" line, that points to where we get the raw data. However, for TCGA data, the URL should only point to a TCGA DCC directory, and you would need to click further through their web site to get to the .maf files.

Not sure what happened. If you are interested in this type of .maf file, you will need to go to GDC (https://gdc-portal.nci.nih.gov/search/s) and download from there directly. GDC replaced DCC this summer. Our future TCGA data releases will also be based on GDC data. Also, just a note on the reference genome: GDC data is mapped to hg38, not hg19.

Jing
