Missing mutations in lusc_tcga_pan_can_atlas_2018 after R download?

Sam Danziger

Mar 5, 2021, 6:28:36 PM
to cBioPortal for Cancer Genomics Discussion Group

cBioPortal Folks,

I think that some of the Mutations data is missing from lusc_tcga_pan_can_atlas_2018 when downloaded using the [R] cBioPortalData package.

The web interface (https://www.cbioportal.org/study/summary?id=lusc_tcga_pan_can_atlas_2018)  lists 484 Mutations in the 'Genomic Profile Sample Counts' box.

However, if I download lusc_tcga_pan_can_atlas_2018 with cBioPortalData, I see only 469 columns in the Mutations elements of the resulting MultiAssayExperiment:

[5] mutations_extended: RaggedExperiment with 199105 rows and 469 columns 

[6] mutations_mskcc: RaggedExperiment with 199105 rows and 469 columns

Furthermore, these RaggedExperiment objects seem to have malformed data.
library(cBioPortalData)  # attach the package so cBioDataPack() and assay() are found
cbio <- cBioPortalData::cBioPortal()
acc <- try(cBioDataPack(cancer_study_id = "lusc_tcga_pan_can_atlas_2018", ask = FALSE))

mutations <- assay(acc[['mutations_extended']])
mutations[1:4,1:4]
                TCGA-18-3406-01 TCGA-18-3407-01 TCGA-18-3408-01 TCGA-18-3410-01
ENSG00000107862 "GBF1"          NA              NA              NA 
ENSG00000108018 "SORCS1"        NA              NA              NA 
ENSG00000151532 "VTI1A"         NA              NA              NA 
ENSG00000197893 "NRAP"          NA              NA              NA             

Can you please advise?

Thank you,

-Sam


mram...@gmail.com

Mar 9, 2021, 10:02:43 PM
to cBioPortal for Cancer Genomics Discussion Group
Hi Sam, 

The data is represented as features by samples (tumor sample barcodes). I think the 484 tally refers to the total number of profiled
samples with mutations, which is not the same as the number of samples in the tarball downloaded from here:
https://cbioportal-datahub.s3.amazonaws.com/lusc_tcga_pan_can_atlas_2018.tar.gz

Looking at the individual files in the tarball, there are 469 unique participant barcodes in each of the two mutation files:

# Working on: /tmp/RtmpGiA946/138c2e876ed6_lusc_tcga_pan_can_atlas_2018/lusc_tcga_pan_can_atlas_2018/data_mutations_extended.txt
> length(unique(TCGAutils::TCGAbarcode(dat$Tumor_Sample_Barcode)))
# [1] 469
# Working on: /tmp/RtmpGiA946/138c2e876ed6_lusc_tcga_pan_can_atlas_2018/lusc_tcga_pan_can_atlas_2018/data_mutations_mskcc.txt
> length(unique(TCGAutils::TCGAbarcode(dat$Tumor_Sample_Barcode)))
# [1] 469

The phenotype data does have 487 patientId entries: 
> lusc <- cBioDataPack(cancer_study_id = "lusc_tcga_pan_can_atlas_2018", ask=FALSE)
> dim(colData(lusc))
# [1] 487  95

The 484 number might be coming from the API. When I use the network monitor on Chrome and select the mutations button on the website,
it hits an endpoint that goes to https://www.cbioportal.org/api/mutated-genes/fetch. I am not sure how this endpoint works
but I am not getting the same number of samples when I use the `cBioPortalData` function to obtain mutation data.

I do see the same number of sample barcodes when I do `samplesInSampleLists(cbio, "lusc_tcga_pan_can_atlas_2018_sequenced")[[1]]` 
in R which hits the "getSampleListUsingGet" endpoint
(https://www.cbioportal.org/api/swagger-ui.html#/Sample%20Lists/getSampleListUsingGET)
with "lusc_tcga_pan_can_atlas_2018_sequenced" input. I'd have to look into it further to make sure that these sampleIds are being used
in my mutation data queries to the API. 
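
A check along these lines could be sketched in base R as a set comparison. The two vectors below are made-up stand-ins, not real query results; in practice they might be filled from the sample list and from the column names of the downloaded assay, which is an assumption about how one would wire it up:

```r
# Hypothetical stand-ins: in practice `listed` could come from
# samplesInSampleLists(...)[[1]] and `downloaded` from
# colnames(acc[['mutations_extended']]), as used in this thread.
listed     <- c("TCGA-18-3406-01", "TCGA-18-3407-01", "TCGA-18-3408-01")
downloaded <- c("TCGA-18-3406-01", "TCGA-18-3408-01")

# Samples named in the sample list that have no column in the download.
missing_from_download <- setdiff(listed, downloaded)

# Samples present in the download that the sample list does not mention.
missing_from_list <- setdiff(downloaded, listed)
```

If both results are empty, the two sources agree on which samples were sequenced.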

To work with RaggedExperiment objects, I'd refer you to our vignette for more information
http://bioconductor.org/packages/devel/bioc/vignettes/RaggedExperiment/inst/doc/RaggedExperiment.html
The `mutations[1:4, 1:4]` output shows the first metadata column, i.e., the first column of `mcols(acc[['mutations_extended']])`, returned as a `sparseAssay`,
which gives you the 'Hugo_Symbol'.
You can do `mcols(acc[['mutations_extended']])$Variant_Classification` to see all the types per gene or use `assay`:

> mutex <- acc[['mutations_extended']]
> head(assay(mutex, "Variant_Classification"))[, 1:3]
                TCGA-18-3406-01     TCGA-18-3407-01 TCGA-18-3408-01
ENSG00000107862 "Silent"            NA              NA
ENSG00000108018 "Silent"            NA              NA
ENSG00000151532 "3'UTR"             NA              NA
ENSG00000197893 "Missense_Mutation" NA              NA
ENSG00000119965 "Missense_Mutation" NA              NA
ENSG00000148773 "Missense_Mutation" NA              NA


Best,
Marcel

Sam Danziger

Mar 10, 2021, 5:55:53 PM
to cBioPortal for Cancer Genomics Discussion Group
Marcel,

Thank you for your attention to this problem and the guidance on how to correctly interact with the RaggedExperiment.

I attempted to work around this problem by downloading the data directly from the PanCancerAtlas (https://gdc.cancer.gov/about-data/publications/pancanatlas).  If I download the .maf file (mc3.v0.2.8.PUBLIC.maf.gz) and take the subset of those samples that are listed as belonging to 'lusc_tcga_pan_can_atlas_2018', then I see that there are 469 unique patients.  Perhaps the difference between 484 and 469 results from multiple samples for the same patient?  Is there any easy way to test that?

Thank you again,
-Sam
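
One possible way to test the multiple-samples-per-patient idea, sketched in base R. The `sample_ids` values are made up for illustration, and trimming to 12 characters assumes the standard TCGA barcode layout (project, tissue source site, participant):

```r
# Hypothetical sample-level barcodes; the last two share one participant.
sample_ids <- c("TCGA-18-3406-01", "TCGA-18-3407-01",
                "TCGA-18-3408-01", "TCGA-18-3408-11")

# Trim each barcode to its participant portion (first 12 characters).
patient_ids <- substr(sample_ids, 1, 12)

# Count samples per patient and keep patients with more than one sample.
per_patient <- table(patient_ids)
multi <- names(per_patient)[per_patient > 1]

# Number of samples beyond one per patient.
extra_samples <- length(sample_ids) - length(unique(patient_ids))
```

If duplicate samples explained the gap, `extra_samples` computed on the full list of 484 sample IDs would be 15 and `multi` would name the affected patients.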