Data processing of different omics layers in the TCGA PanCancer COADREAD dataset and different sequencing centers

Vlachavas, Efstathios-Iason

Oct 8, 2021, 8:18:55 AM

to cbiop...@googlegroups.com

Dear cBioPortal community,

Good afternoon, and I hope this message finds you well. Following up on a previous post about using public multi-omics cancer datasets for translational cancer research (https://groups.google.com/g/cbioportal/c/ajpBL6xaz0Q/m/MQoB1z3xAwAJ), I would like to ask some specific methodological questions about the TCGA PanCancer COADREAD dataset:

https://www.cbioportal.org/study/summary?id=coadread_tcga_pan_can_atlas_2018

As we would like to use the omics layers available for all patients, specifically gene expression, mutations, CNA and proteomics (RPPA), along with any additional molecular information, my main questions are the following:

1) Concerning the gene expression data in the file “data_RNA_Seq_v2_expression_median.txt”:

The accompanying metadata file describes the data as “Batch normalized from Illumina HiSeq_RNASeqV2”, and the expression values have the following range:

range(mm,na.rm=T)

[1] -9.641787e-01 1.003037e+08

Do the negative values, together with the very high values for some entries, indicate that these are raw RSEM values that were corrected for batch effects, for example for the sequencing platform?

They do not appear to have undergone any other transformation such as log2, since there are values like 14040.2.

2) Regarding the RPPA data [data_rppa.txt]:

head(pp)

# A tibble: 6 x 465
  Composite.Element.REF   `TCGA-A6-2671-01` `TCGA-A6-2684-01` `TCGA-AA-3525-01` `TCGA-AA-3532-01`
1 YWHAE|14-3-3_epsilon               -0.892            -0.865            -0.865            -0.990
2 EIF4EBP1|4E-BP1                    0.0544             0.243             0.777             0.487
3 EIF4EBP1|4E-BP1_pS65                0.875             0.708            0.0937             0.125
4 EIF4EBP1|4E-BP1_pT37T46             0.314             0.291             0.330             0.539
5 TP53BP1|53BP1                        1.46             0.658              1.43              1.19
6 ACACA ACACB|ACC_pS79               -0.442            -0.304             0.276             0.324

range(pp2,na.rm = T)

[1] -3.807334 7.786275

A) Concerning the protein names/IDs in the Composite.Element.REF column: is the part before the | separator the actual name of the protein, and the part after it the antibody name?

For example, does the 6th row contain two UniProt IDs/gene symbols?

B) Concerning the expression values themselves: judging from the value range and from the attached histogram, can these be considered normalized log2 intensity values, as described here? https://bioinformatics.mdanderson.org/public-software/tcpa/

3) In addition, regarding the microbiome signatures in the file [data_microbiome.txt]: the file description states that these are log RNA-Seq CPM values. I also looked at the following links:

http://cancermicrobiome.ucsd.edu/CancerMicrobiome_DataBrowser/

https://www.nature.com/articles/s41586-020-2095-1

and at the range of values:

range(mm3,na.rm = T)

[1] -6.065591 25.462914

Judging also from the attached histogram, these values can be used directly in downstream correlation and other statistical analyses, correct?

4) Finally, my last but equally important question concerns possible batch effects and processing differences due to different sequencing centers/research groups:

Browsing the study, I found information on the different sequencing centers (screenshot attached):

There appear to be several sequencing centers, with most samples coming from Indivumed, MSKCC and the Greater Poland Cancer Center, among others. My specific questions are the following:

A) Do the different sequencing centers apply only to gene expression, or also to the samples in all the other available omics layers, for example WES and proteomics? For proteomics, most of the data were probably generated at one center. For the other layers, in your opinion, would the center affect not only continuous variables such as gene expression, but also have at least a mild effect on mutations or other data types in a merged analysis? In addition, all the included data are, as usual, based on the hg19 reference genome, correct?

B) On this premise, if my understanding above is correct: selecting, for example, the patients from the Indivumed sequencing center who have all the aforementioned omics layers results in 127 patients. In your opinion and expertise, despite the smaller sample size of this restricted subgroup, would the selected cohort be more homogeneous and so mitigate or minimize the possibility of a batch effect?

Thank you in advance for your time, patience and consideration on this matter, and thank you to cBioPortal for providing this infrastructure to aid translational cancer research :)

With Kind Regards,

Efstathios

Efstathios-Iason Vlachavas

Post-doc/Guest Scientist

German Cancer Research Center (DKFZ)

RPPA.Histogram.COAD.tiff

Microbiome.Histogram.TCGA.COAD.tiff

SequencingCenter.png

ramyama...@gmail.com

Oct 22, 2021, 1:43:58 PM

to cBioPortal for Cancer Genomics Discussion Group

Hi Efstathios,

Thanks for reaching out and apologies for the late response. Please see below.

On Friday, 8 October 2021 at 08:18:55 UTC-4 Vlachavas, Efstathios-Iason wrote:

1) Concerning the gene expression data in the file “data_RNA_Seq_v2_expression_median.txt”:

The accompanying metadata file describes the data as “Batch normalized from Illumina HiSeq_RNASeqV2”, and the expression values have the following range:

range(mm,na.rm=T)

[1] -9.641787e-01 1.003037e+08

Do the negative values, together with the very high values for some entries, indicate that these are raw RSEM values that were corrected for batch effects, for example for the sequencing platform?

They do not appear to have undergone any other transformation such as log2, since there are values like 14040.2.

Your understanding is right. The expression data were batch-corrected to adjust for platform differences and are not log-transformed. For more details on the data normalization and batch correction, you can refer to the methods in the PanCan paper: https://www.sciencedirect.com/science/article/pii/S0092867418303027
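Since the values are not log-transformed and contain small negative numbers, a log-scale version has to be derived by the user. A minimal base R sketch of one common option, clamping negatives to zero before log2(x + 1); the toy matrix `expr` and the helper `log2_rsem` are hypothetical, and whether clamping suits a given analysis is an assumption, not something stated in this thread:

```r
# Toy stand-in for the batch-corrected RSEM matrix (genes x samples).
expr <- matrix(
  c(-0.96, 14040.2, 0, 255, NA, 1023),
  nrow = 3,
  dimnames = list(c("GENE1", "GENE2", "GENE3"), c("S1", "S2"))
)

# Clamp negative values to 0, then apply log2(x + 1); NA entries stay NA.
log2_rsem <- function(x) {
  log2(pmax(x, 0) + 1)
}

log_expr <- log2_rsem(expr)
```

The pseudocount of 1 keeps zero expression at zero on the log scale; other pseudocounts are equally possible.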

2) Regarding the RPPA data [data_rppa.txt]:

head(pp)

# A tibble: 6 x 465
  Composite.Element.REF   `TCGA-A6-2671-01` `TCGA-A6-2684-01` `TCGA-AA-3525-01` `TCGA-AA-3532-01`
  <chr>                               <dbl>             <dbl>             <dbl>             <dbl>
1 YWHAE|14-3-3_epsilon               -0.892            -0.865            -0.865            -0.990
2 EIF4EBP1|4E-BP1                    0.0544             0.243             0.777             0.487
3 EIF4EBP1|4E-BP1_pS65                0.875             0.708            0.0937             0.125
4 EIF4EBP1|4E-BP1_pT37T46             0.314             0.291             0.330             0.539
5 TP53BP1|53BP1                        1.46             0.658              1.43              1.19
6 ACACA ACACB|ACC_pS79               -0.442            -0.304             0.276             0.324

range(pp2,na.rm = T)

[1] -3.807334 7.786275

A) Concerning the protein names/IDs in the Composite.Element.REF column: is the part before the | separator the actual name of the protein, and the part after it the antibody name?

That is correct.

For example, does the 6th row contain two UniProt IDs/gene symbols?

Yes. The same antibody detects the S79 phosphosite of both ACACA and ACACB. Please refer to https://www.cbioportal.org/faq#how-can-i-query-phosphoprotein-levels-in-the-portal on how to query the phosphoprotein level data in the portal.
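The "gene symbol(s)|antibody" convention confirmed above can be parsed mechanically. A minimal base R sketch, using identifiers copied from the head(pp) output in this thread; the names `ids` and `rppa_map` are hypothetical:

```r
# Row identifiers copied from the head(pp) output in this thread.
ids <- c("YWHAE|14-3-3_epsilon",
         "EIF4EBP1|4E-BP1_pS65",
         "ACACA ACACB|ACC_pS79")

# Split "GENES|ANTIBODY" at the pipe; gene symbols are space-separated.
parts    <- strsplit(ids, "|", fixed = TRUE)
genes    <- strsplit(vapply(parts, `[`, character(1), 1), " ", fixed = TRUE)
antibody <- vapply(parts, `[`, character(1), 2)

# Long table: one row per (gene symbol, antibody) pair.
rppa_map <- data.frame(
  gene     = unlist(genes),
  antibody = rep(antibody, lengths(genes)),
  stringsAsFactors = FALSE
)
```

In this layout the ACC_pS79 antibody appears twice, once for ACACA and once for ACACB, so no gene symbol is dropped.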

B) Concerning the expression values themselves: judging from the value range and from the attached histogram, can these be considered normalized log2 intensity values, as described here? https://bioinformatics.mdanderson.org/public-software/tcpa/

Yes. We integrated the level 3 data into the portal. You can refer to https://groups.google.com/g/cbioportal/c/powxLnyEs8I/m/V8-rbAn13kkJ for details.

3) In addition, regarding the microbiome signatures in the file [data_microbiome.txt]: the file description states that these are log RNA-Seq CPM values. I also looked at the following links:

http://cancermicrobiome.ucsd.edu/CancerMicrobiome_DataBrowser/

https://www.nature.com/articles/s41586-020-2095-1

and at the range of values:

range(mm3,na.rm = T)

[1] -6.065591 25.462914

Judging also from the attached histogram, these values can be used directly in downstream correlation and other statistical analyses, correct?

Correct. The raw data are available at ftp://ftp.microbio.me/pub/cancer_microbiome_analysis/TCGA but need a bit of processing to get the required file format.
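As an illustration of such a downstream correlation, the sketch below matches samples between a microbiome matrix and an expression matrix and computes a Spearman correlation in base R. The toy matrices, sample names and feature names are hypothetical; the choice of a rank-based correlation is an assumption made here because the two layers are on different scales:

```r
# Toy stand-ins; real matrices would be read from the study's data files.
microbe <- matrix(c(1.2, 3.4, 2.2, 5.1, 4.0), nrow = 1,
                  dimnames = list("GenusA", c("P1", "P2", "P3", "P4", "P5")))
expr <- matrix(c(10, 40, 20, 80), nrow = 1,
               dimnames = list("GENE1", c("P2", "P1", "P3", "P4")))

# Restrict both layers to the shared samples, in the same order.
shared <- intersect(colnames(microbe), colnames(expr))
x <- microbe["GenusA", shared]
y <- expr["GENE1", shared]

# Rank-based correlation; pairs with missing values are dropped.
rho <- cor(x, y, method = "spearman", use = "pairwise.complete.obs")
```

Aligning by sample name before correlating matters because the columns of different data files are not guaranteed to be in the same order.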

4) Finally, my last but equally important question concerns possible batch effects and processing differences due to different sequencing centers/research groups:

Browsing the study, I found information on the different sequencing centers (screenshot attached):

There appear to be several sequencing centers, with most samples coming from Indivumed, MSKCC and the Greater Poland Cancer Center, among others. My specific questions are the following:

A) Do the different sequencing centers apply only to gene expression, or also to the samples in all the other available omics layers, for example WES and proteomics? For proteomics, most of the data were probably generated at one center. For the other layers, in your opinion, would the center affect not only continuous variables such as gene expression, but also have at least a mild effect on mutations or other data types in a merged analysis? In addition, all the included data are, as usual, based on the hg19 reference genome, correct?

B) On this premise, if my understanding above is correct: selecting, for example, the patients from the Indivumed sequencing center who have all the aforementioned omics layers results in 127 patients. In your opinion and expertise, despite the smaller sample size of this restricted subgroup, would the selected cohort be more homogeneous and so mitigate or minimize the possibility of a batch effect?

We noticed that the sequencing centers listed here are actually the Tissue Source Sites; the clinical attribute was named incorrectly. Sorry for the inconvenience, and thanks for bringing this to our attention. We will correct the issue; it is tracked here: https://github.com/cBioPortal/datahub/issues/1512. Batch effects within a single study should be relatively small. Please refer to the paper https://www.sciencedirect.com/science/article/pii/S0092867418303027 for more details on the data normalization and batch correction.
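Independently of how the clinical attribute is named, the Tissue Source Site code can be read from the sample ID itself, because the second dash-separated field of a TCGA barcode is the TSS code. A minimal base R sketch using the four sample IDs visible in the RPPA header in this thread; the variable names are hypothetical:

```r
# Sample IDs copied from the RPPA header shown in this thread.
samples <- c("TCGA-A6-2671-01", "TCGA-A6-2684-01",
             "TCGA-AA-3525-01", "TCGA-AA-3532-01")

# Field 2 of the dash-separated barcode is the Tissue Source Site code.
tss <- vapply(strsplit(samples, "-", fixed = TRUE), `[`, character(1), 2)

# Number of samples per Tissue Source Site.
tss_counts <- table(tss)
```

The resulting codes can be looked up in the GDC Tissue Source Site code table to recover the site names.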

svlac...@eie.gr

Oct 25, 2021, 9:54:31 AM

to cBioPortal for Cancer Genomics Discussion Group

Dear Ramya,

Good afternoon, and thank you very much for your answer. Apologies for my late reply; unfortunately I did not get any notification that an answer had been posted. Based on your comprehensive answer, and to be on the safe side about some crucial points, my follow-up comments are the following:

1) Thank you for your confirmation concerning the gene expression data. I will look closely at the methods section of the paper you mentioned to see the processing steps in detail. As these are RSEM values and not otherwise normalized, an approach such as normalizeQuantiles from limma, or something like vst, would probably be more helpful, since these are not "classical" raw counts but estimated ones.
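To make the quantile normalization idea concrete, here is a base R sketch of its core step: every sample (column) is forced onto the same reference distribution, the mean of the sorted columns. The helper `quantile_normalize` and the toy matrix are hypothetical; this is a simplified stand-in rather than limma's implementation, and it assumes no missing values and applies no special treatment to ties:

```r
# Simplified quantile normalization: columns are samples, rows are genes.
# Assumes a numeric matrix without NA values.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # per-sample ranks
  sorted <- apply(x, 2, sort)                         # per-sample sorted values
  target <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks to reference
  dimnames(out) <- dimnames(x)
  out
}

# Toy genes x samples matrix on very different scales.
toy <- cbind(S1 = c(5, 2, 3, 4), S2 = c(40, 10, 30, 20))
rownames(toy) <- paste0("G", 1:4)

toy_qn <- quantile_normalize(toy)
```

After this step every column contains the same set of values, only in a different gene order. In practice the normalization would usually be applied to log-scale values.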

2) A) Thank you for the confirmation. I would like to keep only the protein ID, which in cBioPortal is essentially the gene name/symbol, correct? For example, in EIF4EBP1|4E-BP1, EIF4EBP1 is the important part, reflecting the tested gene/protein?

In addition, I would like your opinion on the following: for my downstream multi-omics integration, I need unique gene symbols or omic identifiers in the rows of each omics layer. In cases such as ACACA ACACB|ACC_pS79, could I simply keep the first gene symbol? Even if some information is lost, this would serve my purpose of unique features in the rows and would keep things less complicated?
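One practical point when reducing the identifiers: in the head(pp) output earlier in this thread, EIF4EBP1 appears with several antibodies (total protein and phosphosites), so the gene symbol alone does not give unique rows. A minimal base R sketch that keeps the first gene symbol but appends the antibody name to preserve uniqueness; the variable names and the underscore-joined ID format are hypothetical choices:

```r
# Identifiers copied from the head(pp) output in this thread.
ids <- c("EIF4EBP1|4E-BP1", "EIF4EBP1|4E-BP1_pS65", "ACACA ACACB|ACC_pS79")

# Keep only the first gene symbol (text before the first space or pipe).
first_gene <- sub("[ |].*$", "", ids)

# Antibody name: everything after the pipe.
antibody <- sub("^[^|]*\\|", "", ids)

# Gene symbols alone collide, so combine both parts into the row ID.
feature_id <- paste(first_gene, antibody, sep = "_")
```

Here `first_gene` contains EIF4EBP1 twice, while `feature_id` stays unique and still records which phosphosite each row measures.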

4) Thank you very much for pointing out the discrepancy between sequencing centers and Tissue Source Sites. As this is more complicated: if I have understood correctly, the Tissue Source Sites, as described here https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes and here https://github.com/cBioPortal/datahub/issues/1512, essentially have a similar meaning to the sequencing centers?

For the dataset discussed here, would you suggest using the Tissue Source Site instead of the sequencing center, since it would contain more "homogeneous" samples with a less evident batch effect? For example, the AA Tissue Source Site instead of the Indivumed sequencing center, even if it returns fewer samples? Our ultimate goal is to merge and integrate the different omics layers for CRC, while aiming for the most homogeneous sub-cohort/batch. I ask again because, although the other omics layers might not show the same degree of batch effect and/or processing heterogeneity as gene expression, it might still be safer to select, for example, the AA subset, even at the cost of losing samples?
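The sub-cohort selection described here, samples present in every omics layer and belonging to one Tissue Source Site, can be sketched in base R as follows. The per-layer sample lists are hypothetical toy data; in practice they would be read from each data file:

```r
# Hypothetical per-layer sample IDs; real ones would be read from each data file.
layers <- list(
  expression = c("TCGA-AA-3525-01", "TCGA-AA-3532-01", "TCGA-A6-2671-01"),
  rppa       = c("TCGA-AA-3525-01", "TCGA-A6-2671-01", "TCGA-AA-3532-01"),
  mutations  = c("TCGA-AA-3525-01", "TCGA-A6-2684-01", "TCGA-A6-2671-01")
)

# Samples present in every layer.
complete <- Reduce(intersect, layers)

# Keep only samples whose TSS code (barcode field 2) is "AA".
tss <- vapply(strsplit(complete, "-", fixed = TRUE), `[`, character(1), 2)
aa_complete <- complete[tss == "AA"]
```

Comparing `length(complete)` with `length(aa_complete)` shows how many samples the restriction to a single site costs.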

Kind Regards,

Efstathios
