samples with mutations which is not the same as the number of samples in the tarball as downloaded here:
Looking at the individual files in the tarball, there are 469 unique participant barcodes in the two files:
# Working on: /tmp/RtmpGiA946/138c2e876ed6_lusc_tcga_pan_can_atlas_2018/lusc_tcga_pan_can_atlas_2018/data_mutations_extended.txt
> length(unique(TCGAutils::TCGAbarcode(dat$Tumor_Sample_Barcode)))
# [1] 469
# Working on: /tmp/RtmpGiA946/138c2e876ed6_lusc_tcga_pan_can_atlas_2018/lusc_tcga_pan_can_atlas_2018/data_mutations_mskcc.txt
> length(unique(TCGAutils::TCGAbarcode(dat$Tumor_Sample_Barcode)))
# [1] 469
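The counts above are per participant, so it may be worth checking the sample level as well, since one participant can contribute more than one sample barcode. A minimal base R sketch with made-up barcodes: the `barcodes` vector is hypothetical, and `substr(x, 1, 12)` is assumed to match the 12-character participant portion that `TCGAutils::TCGAbarcode()` returns by default.

```r
# Hypothetical sample barcodes; in practice this would be dat$Tumor_Sample_Barcode
barcodes <- c("TCGA-18-3406-01", "TCGA-18-3406-11",
              "TCGA-18-3407-01", "TCGA-18-3407-01")

# Sample level: count distinct full barcodes as given
n_samples <- length(unique(barcodes))

# Participant level: keep only the first 12 characters, e.g. "TCGA-18-3406"
participants <- substr(barcodes, 1, 12)
n_participants <- length(unique(participants))

c(samples = n_samples, participants = n_participants)
```

If the two numbers differ on the real files, part of the mismatch could simply be a sample-versus-patient counting difference.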
The phenotype data does have 487 patientId entries:
> lusc <- cBioDataPack(cancer_study_id = "lusc_tcga_pan_can_atlas_2018", ask=FALSE)
> dim(colData(lusc))
# [1] 487 95
The 484 number might be coming from the API. When I use the network monitor in Chrome and select the mutations button on the website,
I see the same number of sample barcodes as when I run `samplesInSampleLists(cbio, "lusc_tcga_pan_can_atlas_2018_sequenced")[[1]]`
in R, which hits the "getSampleListUsingGet" endpoint.
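One way to see exactly where the counts diverge is to compare the ID sets directly rather than only their lengths. This is a rough sketch in base R; `api_samples` and `file_samples` are hypothetical stand-ins for the sample list returned by the API and the unique `Tumor_Sample_Barcode` values from the mutation file.

```r
# Hypothetical ID vectors standing in for the two sources
api_samples  <- c("TCGA-18-3406-01", "TCGA-18-3407-01", "TCGA-18-3408-01")
file_samples <- c("TCGA-18-3406-01", "TCGA-18-3407-01")

# Samples the API lists that have no row in the mutation file
missing_from_file <- setdiff(api_samples, file_samples)

# Samples in the mutation file that the API list does not contain
extra_in_file <- setdiff(file_samples, api_samples)

missing_from_file
extra_in_file
```

Inspecting the barcodes in `missing_from_file` on the real data should show which samples account for the difference.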
which gives you the 'Hugo_Symbol'.
You can do `mcols(acc[['mutations_extended']])$Variant_Classification` to see all the types per gene or use `assay`:
> mutex <- acc[['mutations_extended']]
> head(assay(mutex, "Variant_Classification"))[, 1:3]
                TCGA-18-3406-01     TCGA-18-3407-01 TCGA-18-3408-01
ENSG00000107862 "Silent"            NA              NA
ENSG00000108018 "Silent"            NA              NA
ENSG00000151532 "3'UTR"             NA              NA
ENSG00000197893 "Missense_Mutation" NA              NA
ENSG00000119965 "Missense_Mutation" NA              NA
ENSG00000148773 "Missense_Mutation" NA              NA
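Once you have a character matrix like the one above, the classifications can be summarised with plain base R. The sketch below uses a small made-up matrix `vc` (gene and sample names are hypothetical) in place of `assay(mutex, "Variant_Classification")`.

```r
# Hypothetical genes-by-samples matrix of variant classifications
vc <- matrix(
  c("Silent",            NA,       NA,
    "Missense_Mutation", "Silent", NA,
    "Missense_Mutation", NA,       "3'UTR"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# Count each classification across the whole matrix, ignoring NA cells
counts <- table(vc, useNA = "no")

# Number of genes with a recorded classification in each sample
per_sample <- colSums(!is.na(vc))

counts
per_sample
```

Samples whose `per_sample` count is zero have no recorded classification in the matrix.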