Hello, my name is Nick Rydzewski, and I appreciate all the work you have put into the cBioPortal database. I have been using the data for my own research and came across some datasets that I have questions about. It appears that for most RNA-seq data there is an unnormalized dataset and a dataset that has been fed through the cBioPortal normalization workflow. For some of the studies, though, the data listed as unnormalized appears to have already been normalized. Could someone look into this and confirm whether the following datasets have already been normalized based on other samples in the dataset? Thanks!
Both studies below are listed as data_RNA_Seq_v2_expression_median, but the values are actually quite similar to data_RNA_Seq_v2_mRNA_median_all_samples_Zscores:
luad_oncosg_2020
stad_oncosg_2018
The 3 studies below are all listed as data_mrna_seq_fpkm and are all from CPTAC, but each appears to have a different format:
brca_cptac_2020
lusc_cptac_2021
gbm_cptac_2021 (I think this one hasn’t been adjusted, so my question is about the two above)
Same here, all below are listed as data_mrna_seq_rpkm:
mel_tsam_liang_2017 (different samples have similar values for certain genes, which makes me think a cross-cohort normalization scheme was applied)
luad_cptac_2020 (has negative values)
difg_glass_2019 (I don’t think this one is adjusted; included for reference)
nepc_wcm_2016 – RNA_seq_expression_median (this one has negative values, and I wanted to check whether that is expected)
My final question is about the TCGA PanCancer Atlas data. This may just be a batch-correction effect, but I notice that only the studies listed below have an expression value (RNA_Seq_v2_expression_median) below 0, while all the others have a minimum value of 0:
laml_tcga_pan_can_atlas_2018
coadread_tcga_pan_can_atlas_2018
esca_tcga_pan_can_atlas_2018
ov_tcga_pan_can_atlas_2018
prad_tcga_pan_can_atlas_2018
stad_tcga_pan_can_atlas_2018
ucec_tcga_pan_can_atlas_2018
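For anyone who wants to reproduce the "minimum value below 0" check, here is a minimal sketch of how it can be done. It assumes the expression file is tab-delimited with one gene per row, gene identifier columns named Hugo_Symbol and Entrez_Gene_Id, and one column per sample; the function name and the tiny demo matrix are made up for illustration and are not taken from any of the studies above.

```python
import csv
import io
import math


def min_expression(handle, id_columns=("Hugo_Symbol", "Entrez_Gene_Id")):
    """Return the smallest numeric value in a tab-delimited expression matrix.

    Assumes the first row is a header and that every column not named in
    id_columns holds per-sample expression values.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Sample columns are everything that is not a gene identifier column.
    sample_idx = [i for i, name in enumerate(header) if name not in id_columns]
    lowest = None
    for row in reader:
        for i in sample_idx:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip "NA", blanks, and short rows
            if math.isnan(value):
                continue
            if lowest is None or value < lowest:
                lowest = value
    return lowest


# Made-up demo matrix (not real study data).
demo = io.StringIO(
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\n"
    "TP53\t7157\t12.5\t-0.3\n"
    "EGFR\t1956\tNA\t4.0\n"
)
print(min_expression(demo))  # -0.3
```

With a real download, the same function can be given an open file handle instead of the in-memory demo. A negative minimum does not by itself say which transformation was applied, only that the values are not raw non-negative abundances.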
I understand if these are just the files you received directly from the original studies, but I wanted to have this looked into in case some were processed unintentionally even though they are not listed as normalized. Thanks! I really appreciate all the work!
Best,
Nick Rydzewski
___________________________________
Nicholas Rydzewski, MD, MPH
Radiation Oncology Chief Resident
Department of Human Oncology
University of Wisconsin Hospital and Clinics
--
Thanks! I appreciate it. Part of the problem I noticed is that studies like luad_oncosg_2020, stad_oncosg_2018, lusc_cptac_2021, and (I think) mel_tsam_liang_2017 were normalized by gene even though they sit under the data_RNA_Seq_v2_expression_median / data_mrna_seq_fpkm / data_mrna_seq_rpkm headings. If possible, I was hoping to access the versions of these datasets that weren’t normalized by gene (for example, I found the non-gene-normalized data for luad_oncosg_2020 and lusc_cptac_2021 through their papers and associated websites). If those aren’t available, it would be helpful to have confirmation of whether these and other datasets without the Zscores headings have or haven’t been normalized by gene. Thanks!
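In case it helps to see what I mean by "normalized by gene", here is a rough sketch of the kind of check that can be used. It assumes the same tab-delimited layout as the portal downloads (gene identifier columns named Hugo_Symbol and Entrez_Gene_Id, then one column per sample) and flags genes whose values across samples have a mean near 0 and a standard deviation near 1. The function name, the tolerance, and the demo values are all hypothetical.

```python
import csv
import io
import math
import statistics


def fraction_gene_standardized(handle,
                               id_columns=("Hugo_Symbol", "Entrez_Gene_Id"),
                               tol=0.05):
    """Fraction of genes whose per-gene mean is ~0 and standard deviation is ~1.

    A value close to 1.0 suggests the matrix was z-scored gene by gene;
    a value close to 0.0 suggests it was not. tol is an arbitrary cutoff.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_idx = [i for i, name in enumerate(header) if name not in id_columns]
    checked = 0
    flagged = 0
    for row in reader:
        values = []
        for i in sample_idx:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip "NA", blanks, and short rows
            if not math.isnan(value):
                values.append(value)
        if len(values) < 3:
            continue  # too few samples to judge this gene
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population standard deviation
        checked += 1
        if abs(mean) <= tol and abs(sd - 1.0) <= tol:
            flagged += 1
    return flagged / checked if checked else 0.0


# Made-up demo: GENE_A looks z-scored across samples, GENE_B does not.
demo = io.StringIO(
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\tS3\tS4\n"
    "GENE_A\t1\t-1\t-1\t1\t1\n"
    "GENE_B\t2\t5.0\t80.0\t300.0\t12.0\n"
)
print(fraction_gene_standardized(demo))  # 0.5
```

This only catches full z-scoring. Data that was mean-centered or median-centered per gene without scaling to unit variance would not be flagged, so a low fraction is not proof that a dataset is unadjusted.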
Nick