Dear cBioPortal community,
Good afternoon, I hope this message finds you well! Following up on a previous post about the use of public multi-omics cancer datasets for translational cancer research (https://groups.google.com/g/cbioportal/c/ajpBL6xaz0Q/m/MQoB1z3xAwAJ), I would like to ask some specific methodological questions about the TCGA PanCancer COAD dataset:
https://www.cbioportal.org/study/summary?id=coadread_tcga_pan_can_atlas_2018
We would like to use the additional omics layers that are available for all patients, specifically gene expression, mutations, CNA and proteomics (RPPA), along with any further molecular information. My main questions are the following:
1) Concerning the gene expression data, based on the file named “data_RNA_Seq_v2_expression_median.txt”:
The corresponding metadata file mentions “Batch normalized from Illumina HiSeq_RNASeqV2”, and the expression values have the following range:
range(mm,na.rm=T)
[1] -9.641787e-01 1.003037e+08
Do the negative values, together with the very high values for some entries, indicate that these are raw RSEM values that were batch-effect corrected, for example for the sequencing platform?
They do not seem to have undergone any other transformation such as log2, since there are values like 14040.2. Is that correct?
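To make the question concrete, here is a minimal sketch of the check and transformation I have in mind. The matrix `mm_toy` is a made-up stand-in for my real matrix `mm`, and clamping negatives to zero before log2(x + 1) is only my assumption about how such values might be handled, not something the documentation confirms:

```r
# Toy stand-in for the expression matrix `mm` (genes x samples).
# All values are illustrative only, chosen to mimic the observed range.
mm_toy <- matrix(c(-0.96, 0, 14040.2, 250.5, NA, 1.0e8),
                 nrow = 3,
                 dimnames = list(c("G1", "G2", "G3"), c("S1", "S2")))

# How many entries are negative? (a hint that some correction was applied)
n_negative <- sum(mm_toy < 0, na.rm = TRUE)

# Assumption: treat small negative values as zero before transforming.
# pmax() keeps the matrix dimensions and leaves NA entries as NA.
mm_clamped <- pmax(mm_toy, 0)

# log2(x + 1) transformation of the clamped values
mm_log2 <- log2(mm_clamped + 1)

print(n_negative)
print(range(mm_log2, na.rm = TRUE))
```

If the values were already log-transformed, I would expect a much narrower range than the one reported above, which is why I suspect they are on the linear scale.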
2) Regarding the RPPA data [data_rppa.txt]:
head(pp)
# A tibble: 6 x 465
Composite.Element.REF `TCGA-A6-2671-01` `TCGA-A6-2684-01` `TCGA-AA-3525-01` `TCGA-AA-3532-01`
<chr> <dbl> <dbl> <dbl> <dbl>
1 YWHAE|14-3-3_epsilon -0.892 -0.865 -0.865 -0.990
2 EIF4EBP1|4E-BP1 0.0544 0.243 0.777 0.487
3 EIF4EBP1|4E-BP1_pS65 0.875 0.708 0.0937 0.125
4 EIF4EBP1|4E-BP1_pT37T46 0.314 0.291 0.330 0.539
5 TP53BP1|53BP1 1.46 0.658 1.43 1.19
6 ACACA ACACB|ACC_pS79 -0.442 -0.304 0.276 0.324
range(pp2,na.rm = T)
[1] -3.807334 7.786275
A) Concerning the protein names/IDs in the column Composite.Element.REF: which part is the actual name of the protein? Is it the part before the | separator, with the rest being the antibody name?
For example, does the 6th row contain two UniProt IDs/gene symbols?
B) Regarding the expression values: judging from the value range and from the attached histogram plot, can these be considered normalized log2 intensity values, as described here? https://bioinformatics.mdanderson.org/public-software/tcpa/
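For reference, this is a small sketch of how I would currently parse the identifiers. It assumes that the text before the | is one or more space-separated gene symbols and the text after it is the antibody name, which is exactly the point I would like confirmed. The variable names are my own:

```r
# Three identifiers copied from the Composite.Element.REF column shown above
ids <- c("YWHAE|14-3-3_epsilon", "EIF4EBP1|4E-BP1_pS65", "ACACA ACACB|ACC_pS79")

# Split on the literal "|" (fixed = TRUE avoids regex interpretation)
parts <- strsplit(ids, "|", fixed = TRUE)

# Assumption: first field = gene symbol(s), second field = antibody name
gene_field <- vapply(parts, `[`, character(1), 1)
antibody   <- vapply(parts, `[`, character(1), 2)

# Some antibodies appear to map to several genes, separated by spaces
genes <- strsplit(gene_field, " ", fixed = TRUE)

annot <- data.frame(id = ids,
                    genes = gene_field,
                    antibody = antibody,
                    n_genes = lengths(genes))
print(annot)
```

With this parsing, the 6th row would be assigned to both ACACA and ACACB, so I am unsure how such rows should be matched to the gene-level expression data.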
3) In addition, regarding the microbiome signatures in the file [data_microbiome.txt]: the file states that these are log RNA-Seq CPM values. From a brief search I also found the following links:
http://cancermicrobiome.ucsd.edu/CancerMicrobiome_DataBrowser/
https://www.nature.com/articles/s41586-020-2095-1
and the values have the following range:
range(mm3,na.rm = T)
[1] -6.065591 25.462914
Judging also from the attached histogram plot, can these values be used directly in downstream correlation and other statistical analyses?
4) Finally, my last but equally important question concerns possible batch effects and processing differences due to different sequencing centers/research groups:
Browsing the study, I found information on the different sequencing centers (screenshot attached):
There appear to be several sequencing centers, with the majority of the samples coming from Indivumed, MSKCC and the Greater Poland Cancer Center, among others. My specific questions are the following:
A) Do the different sequencing centers apply only to gene expression, or also to the samples in all the other available omics layers, for example WES, proteomics etc.? For proteomics, most of the data were probably generated in one center. For the other layers, in your opinion, would the center affect not only continuous variables like gene expression, but also, even mildly, mutations or other data types in a merged analysis? In addition, are all the included data based on the hg19 reference genome, as usual?
B) On this premise, if my understanding above is correct: selecting, for example, only patients from the Indivumed sequencing center with all of the aforementioned omics layers available yields 127 patients. In your opinion and expertise, despite the smaller sample size of this restricted subgroup, is the selected cohort more homogeneous, and could this mitigate or minimize the possibility of a batch effect?
Thank you in advance for your time, patience and consideration on this matter, and thanks to cBioPortal overall for providing this infrastructure in support of translational cancer research :)
With Kind Regards,
Efstathios
Efstathios-Iason Vlachavas
Post-doc/Guest Scientist
German Cancer Research Center (DKFZ)
Foundation under Public Law
Im Neuenheimer Feld 280
69120 Heidelberg
Germany
phone: +49 6221 42-5123
fax: +49 6221 42-5109
Efstathios-Ia...@dkfz-heidelberg.de

Management Board: Prof. Dr. med. Michael Baumann, Ursula Weyrich
VAT-ID No.: DE143293537