Important questions concerning the robust utilization of a colon cancer cBioPortal dataset for multi-omics integration in translational cancer research

Vlachavas, Efstathios-Iason

Sep 17, 2021, 6:32:25 AM

to cbiop...@googlegroups.com

Dear cBioPortal Team,

Good afternoon, and I hope my message finds you safe and healthy! My name is Efstathios-Iason Vlachavas, and I'm a post-doc scientist at the German Cancer Research Center (DKFZ) in the department of Molecular Genome Analysis under Prof. Dr. Stefan Wiemann.

Briefly, as part of an ongoing international collaborative project between DKFZ and Greek research institutions on personalized medicine in cancer (http://www.accc.gr/aboutACCC_info.html), we have acquired around 34 samples with DNA and RNA, which were sequenced and pre-processed in the DKFZ bioinformatics facility. My major goal in my current post-doc is to investigate whether any specific mutational patterns characterize the three separate patient groups defined by KRAS/BRAF status (KRAS-only, BRAF-only and wild type), with the ultimate goal of characterizing the molecular landscape of these three groups.

On this premise, we have been navigating the cBioPortal web resource, not only to expand our sample size with independent cohort studies but also to perform multi-omics integration, in order to correlate the pre-defined mutational status (KRAS and BRAF mutations) with different omics layers and pathways. We found the Colon Cancer CPTAC-2 study:

https://www.cbioportal.org/study/summary?id=coad_cptac_2019

My main methodological questions are the following:

1) Concerning the study itself: we would like to focus on the patients/samples that have been profiled in all of the selected omics layers (copy number alterations, mutations, gene expression and proteomics). Would the most direct way be to download the clinical data for the selected cases from https://www.cbioportal.org/study/summary?id=coad_cptac_2019 and then subset each omics layer by the selected tumor sample barcodes?

2) In addition, concerning specifically the proteomic layers, such as Protein expression levels (mass spectrometry by CPTAC) and protein level z-scores:

A) Is there any information about the pre-processing of the protein expression data matrix? For example, in the corresponding data_protein_quantification.txt file:

Composite.Element.REF  01CO005  01CO006  01CO008  01CO013  01CO014  01CO015  01CO019  01CO022
A1BG | A1BG            -1.1     -1.12    -1.2     -1.89    -0.523   -1.62    -0.311   0.906
A1CF | A1CF            0.318    -0.441   0.16     0.112    -0.248   0.263    -0.27    -0.778
A2M | A2M              -0.487   -0.347   -1.85    -0.329   -0.638   -0.976   -0.921   1.45
AAAS | AAAS            0.0995   -0.0029  0.119    0.67     0.289    0.522    0.226    0.359
AACS | AACS            0.155    0.0957   -0.0924  0.116    0.378    -0.273   0.0528   0.219
AAGAB | AAGAB          0.169    0.396    0.0187   0.313    0.822    0.504    0.0428   0.771

Are these normalized protein intensities, or has some other data transformation and/or preprocessing been applied?

B) Is there a reason why the row names above contain duplicated gene symbols? For simplicity and gene symbol concordance across all assays, could we keep just one unique symbol per row name?

C) Moreover, are the protein expression (mass spectrometry) z-scores computed in the same way as for the RNA-seq data, that is, relative to the rest of the tumor samples?

3) In parallel, concerning the mutational data: among the files available for download, are there any specific differences between the MAF files “data_mutations_extended” and “data_mutations_mskcc”? From a brief inspection with the R package maftools, the mskcc file contains 14914 genes, whereas the extended file contains 14881, but I could not see any other major differences in the column names or mutational attributes. Thus, for my purpose of separating the patients based on KRAS/BRAF/WT status, could I use either of them, or are there more important differences between the two files?

4) Furthermore, the study page linked above mentions 110 patients/samples, whereas the highest number of patients profiled in any layer is 106. Is there a reason for this small discrepancy? As far as I have checked, no patients have duplicated samples. https://www.cbioportal.org/study/summary?id=coad_cptac_2019

5) Finally, concerning the clinical data: using either data_clinical_sample or data_clinical_patient would not make any difference, correct?

Thank you in advance for your time and consideration on this matter !!

With Kind Regards,

Efstathios-Iason Vlachavas

Efstathios-Iason Vlachavas

Post-doc/Guest Scientist

German Cancer Research Center (DKFZ)

Foundation under Public Law

Im Neuenheimer Feld 280

69120 Heidelberg

Germany

phone: +49 6221 42-5123

fax: +49 6221 42-5109

Efstathios-Ia...@dkfz-heidelberg.de

www.dkfz.de

Management Board: Prof. Dr. med. Michael Baumann, Ursula Weyrich

VAT-ID No.: DE143293537

JJ Gao

Sep 21, 2021, 6:18:27 PM

to Vlachavas, Efstathios-Iason, cbiop...@googlegroups.com

Dear Efstathios-Iason,

Apologies for the late reply. Please see my comments below.

On Fri, Sep 17, 2021 at 6:32 AM Vlachavas, Efstathios-Iason <Efstathios-Ia...@dkfz-heidelberg.de> wrote:

1) Concerning the study itself: we would like to focus on the patients/samples that have been profiled in all of the selected omics layers (copy number alterations, mutations, gene expression and proteomics). Would the most direct way be to download the clinical data for the selected cases from https://www.cbioportal.org/study/summary?id=coad_cptac_2019 and then subset each omics layer by the selected tumor sample barcodes?

That sounds like a good plan.
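A minimal Python sketch of this plan, assuming each omics layer is a tab-separated matrix with sample barcodes in the header row. The toy tables, layer names and the `n_annotation_cols` parameter are hypothetical stand-ins for the downloaded files (some cBioPortal matrices carry two annotation columns, in which case the parameter would be 2). MAF files list samples in a `Tumor_Sample_Barcode` column rather than in the header, so they would need separate handling.

```python
import csv
import io

def sample_ids(matrix_text, n_annotation_cols=1):
    """Return the sample IDs found in the header row of a tab-separated matrix."""
    header = next(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    return set(header[n_annotation_cols:])

def common_samples(layers):
    """Intersect sample IDs across all layers (dict: layer name -> matrix text)."""
    id_sets = [sample_ids(text) for text in layers.values()]
    return sorted(set.intersection(*id_sets))

def subset_columns(matrix_text, keep, n_annotation_cols=1):
    """Keep the annotation columns plus the requested sample columns, in `keep` order."""
    rows = list(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    header = rows[0]
    idx = list(range(n_annotation_cols)) + [header.index(s) for s in keep]
    return [[row[i] for i in idx] for row in rows]

# Hypothetical toy layers standing in for the downloaded files
layers = {
    "rna": "Hugo_Symbol\t01CO005\t01CO006\t01CO008\nKRAS\t10.1\t9.8\t11.0\n",
    "protein": "Composite.Element.REF\t01CO006\t01CO008\t01CO013\nKRAS\t0.2\t-0.4\t0.1\n",
}
shared = common_samples(layers)
rna_subset = subset_columns(layers["rna"], shared)
```

Intersecting on the barcodes first and subsetting afterwards keeps every layer in the same column order, which tends to simplify downstream integration.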

2) In addition, concerning specifically the proteomic layers, such as Protein expression levels (mass spectrometry by CPTAC) and protein level z-scores:

A) Is there any information about the pre-processing of the protein expression data matrix? For example, in the corresponding data_protein_quantification.txt file:

Composite.Element.REF  01CO005  01CO006  01CO008  01CO013  01CO014  01CO015  01CO019  01CO022
A1BG | A1BG            -1.1     -1.12    -1.2     -1.89    -0.523   -1.62    -0.311   0.906
A1CF | A1CF            0.318    -0.441   0.16     0.112    -0.248   0.263    -0.27    -0.778
A2M | A2M              -0.487   -0.347   -1.85    -0.329   -0.638   -0.976   -0.921   1.45
AAAS | AAAS            0.0995   -0.0029  0.119    0.67     0.289    0.522    0.226    0.359
AACS | AACS            0.155    0.0957   -0.0924  0.116    0.378    -0.273   0.0528   0.219
AAGAB | AAGAB          0.169    0.396    0.0187   0.313    0.822    0.504    0.0428   0.771

Are these normalized protein intensities, or has some other data transformation and/or preprocessing been applied?

The processing steps were documented in the paper: https://pubmed.ncbi.nlm.nih.gov/31031003/. Please reach out to the investigators if it's not clear.

B) Is there a reason why the row names above contain duplicated gene symbols? For simplicity and gene symbol concordance across all assays, could we keep just one unique symbol per row name?

Please ignore the duplicated gene names. It was related to some special code we had, which is no longer applied to the CPTAC data. We will fix the file (https://github.com/cBioPortal/datahub/issues/1487).
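Until the file is fixed, the row names can be collapsed locally. The following is a small sketch in Python, assuming the duplicated labels always have the form `SYMBOL | SYMBOL` as in the excerpt above; `clean_symbol` is a hypothetical helper name, and labels whose two halves differ are deliberately left untouched so they can be inspected by hand.

```python
def clean_symbol(label):
    """Collapse 'A1BG | A1BG' to 'A1BG'; leave labels whose two halves differ untouched."""
    parts = [p.strip() for p in label.split("|")]
    if len(parts) == 2 and parts[0] == parts[1]:
        return parts[0]
    return label.strip()

labels = ["A1BG | A1BG", "A2M | A2M", "AAGAB | AAGAB"]
cleaned = [clean_symbol(x) for x in labels]
```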

C) Moreover, are the protein expression (mass spectrometry) z-scores computed in the same way as for the RNA-seq data, that is, relative to the rest of the tumor samples?

Yes.
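As a rough illustration of what a per-gene z-score relative to the tumor samples looks like, here is a Python sketch that standardizes one gene against the mean and standard deviation of all samples with a value. The use of the sample standard deviation and of all profiled tumors as the reference set are assumptions made for the example, not a statement of the portal's exact procedure.

```python
import statistics

def zscores(values):
    """Z-score one gene's values across samples, skipping missing entries (None)."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample SD; the portal's exact choice is an assumption
    return [None if v is None else (v - mean) / sd for v in values]

# A1BG values from the data_protein_quantification.txt excerpt in this thread
row = [-1.1, -1.12, -1.2, -1.89, -0.523, -1.62, -0.311, 0.906]
z = zscores(row)
```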

3) In parallel, concerning the mutational data: among the files available for download, are there any specific differences between the MAF files “data_mutations_extended” and “data_mutations_mskcc”? From a brief inspection with the R package maftools, the mskcc file contains 14914 genes, whereas the extended file contains 14881, but I could not see any other major differences in the column names or mutational attributes. Thus, for my purpose of separating the patients based on KRAS/BRAF/WT status, could I use either of them, or are there more important differences between the two files?

The default transcripts were different for the two MAFs. "data_mutations_extended" uses uniprot canonical isoforms while "data_mutations_mskcc" uses msk-impact-defined isoforms. We recommend using "data_mutations_extended" since that's the one we use for the public portal by default.
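For the KRAS/BRAF/WT split itself, a minimal Python sketch could look like the following. It relies only on the standard MAF columns `Hugo_Symbol` and `Tumor_Sample_Barcode`; the toy MAF text, the list of sequenced samples, the group labels and the `mutation_groups` helper are hypothetical. Note the assumptions: every listed variant counts as a mutation (no filtering by variant classification), and a sample is called wild type only if it appears in the list of sequenced samples.

```python
import csv
import io
from collections import defaultdict

def mutation_groups(maf_text, sequenced_samples):
    """Label each sequenced sample as KRAS_only, BRAF_only, KRAS_and_BRAF or WT."""
    hits = defaultdict(set)
    # Skip '#' comment lines such as the MAF version header
    rows = (line for line in io.StringIO(maf_text) if not line.startswith("#"))
    for rec in csv.DictReader(rows, delimiter="\t"):
        gene = rec["Hugo_Symbol"]
        if gene in ("KRAS", "BRAF"):
            hits[rec["Tumor_Sample_Barcode"]].add(gene)
    labels = {}
    for sample in sequenced_samples:
        genes = hits.get(sample, set())
        if genes == {"KRAS"}:
            labels[sample] = "KRAS_only"
        elif genes == {"BRAF"}:
            labels[sample] = "BRAF_only"
        elif genes:
            labels[sample] = "KRAS_and_BRAF"
        else:
            labels[sample] = "WT"
    return labels

# Hypothetical toy MAF content
maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "KRAS\t01CO005\tMissense_Mutation\n"
    "BRAF\t01CO006\tMissense_Mutation\n"
    "TP53\t01CO008\tNonsense_Mutation\n"
    "KRAS\t01CO013\tMissense_Mutation\n"
    "BRAF\t01CO013\tMissense_Mutation\n"
)
labels = mutation_groups(maf, ["01CO005", "01CO006", "01CO008", "01CO013", "01CO014"])
```

Samples carrying both mutations get their own label so that they can be excluded or handled explicitly rather than silently falling into one group.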

4) Furthermore, the study page linked above mentions 110 patients/samples, whereas the highest number of patients profiled in any layer is 106. Is there a reason for this small discrepancy? As far as I have checked, no patients have duplicated samples. https://www.cbioportal.org/study/summary?id=coad_cptac_2019

We will look into this: https://github.com/cBioPortal/datahub/issues/1488

5) Finally, concerning the clinical data: using either data_clinical_sample or data_clinical_patient would not make any difference, correct?

They are different - one for patient level data (e.g. sex) and the other sample level (e.g. Primary Site), so you might want to choose or use both based on your analysis.
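If both levels are needed, the two files can be joined on the patient identifier. The sketch below is in Python and assumes the usual cBioPortal layout, with `#`-prefixed metadata lines above a column-ID row containing `PATIENT_ID` (and `SAMPLE_ID` in the sample file). The toy contents and the `SEX` and `CANCER_TYPE` attributes are hypothetical examples.

```python
import csv
import io

def read_clinical(text):
    """Parse a clinical file, skipping the '#' metadata lines above the column IDs."""
    rows = (line for line in io.StringIO(text) if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))

def merge_clinical(sample_text, patient_text):
    """Attach patient-level attributes to every sample row via PATIENT_ID."""
    patients = {r["PATIENT_ID"]: r for r in read_clinical(patient_text)}
    return [{**patients.get(s["PATIENT_ID"], {}), **s} for s in read_clinical(sample_text)]

# Hypothetical toy files
patient_text = (
    "#Patient Identifier\tSex\n"
    "PATIENT_ID\tSEX\n"
    "01CO005\tFemale\n"
    "01CO006\tMale\n"
)
sample_text = (
    "#Patient Identifier\tSample Identifier\tCancer Type\n"
    "PATIENT_ID\tSAMPLE_ID\tCANCER_TYPE\n"
    "01CO005\t01CO005\tColorectal Cancer\n"
    "01CO006\t01CO006\tColorectal Cancer\n"
)
merged = merge_clinical(sample_text, patient_text)
```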

Vlachavas, Efstathios-Iason

Sep 22, 2021, 6:27:07 AM

to JJ Gao, cbiop...@googlegroups.com

Dear Dr. Gao,

Thank you very much for your valuable comments and suggestions. Very briefly, two important clarification points before processing the data:

1) First, concerning my second question on the processing pipeline of the protein expression levels (mass spectrometry by CPTAC): I tried to contact the authors, but they just pointed to another repository without any further explanation or clue concerning cBioPortal.

In addition, searching through the materials and methods from the original publication:

https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930292-2 (page 22):

Label-free Proteomics Data Analysis

“Spectral count data were filtered by removing proteins with zero counts in all samples and quantile-normalized using the R package preprocessCore (version 1.42.0, https://github.com/bmbolstad/preprocessCore). We further filtered low abundant proteins with average raw count < 1.4 as we did previously (Zhang et al., 2014). The normalized and filtered counts were then log2 transformed for downstream analysis.”

I suppose, also based on your related publication: https://www.mcponline.org/article/S1535-9476(20)31786-2/fulltext#supplementaryMaterial

Do the protein expression levels hosted in cBioPortal refer to the data described above, and was the same pipeline used for processing? Apologies for insisting on this; the original paper describes several different protein assays and pipelines, and I need to be certain so that I can document all prior processing steps.

Overall, from a brief inspection of the values and their distributions, the data look normalized and Gaussian-ish, and can probably be utilized directly.

2) For the clinical txt files data_clinical_sample and data_clinical_patient: as you mention, they contain different levels of information. However, since the portal shows that each patient has exactly one sample, the tumor sample barcodes could safely be used to extract information from either file, and there is no issue because patients and samples are essentially the same, correct?

Thank you for your help and consideration :)

JJ Gao

Sep 23, 2021, 5:04:17 PM

to Vlachavas, Efstathios-Iason, cbiop...@googlegroups.com

Dear Efstathios-Iason,

(1) The one you found from the Cell paper should be the right one. Our MCP publication was used for CPTAC1 studies but not this one.

(2) Your understanding is correct.

Best,

-JJ

Vlachavas, Efstathios-Iason

Sep 27, 2021, 3:23:58 PM

to JJ Gao, cbiop...@googlegroups.com

Dear Dr. Gao,

Thank you a gazillion for your help and support! I do not know the exact reason, but I did not receive any notifications, either from the group or via email. My last comment concerns the processing of the RNA-seq data, for which I would like your confirmation:

For the RNA-seq data, one of the options is mRNA expression (RNA Seq V2 RSEM UQ Log2). As I saw from the corresponding histogram, the values look Gaussian-ish and range from 0 to ~20.04. Since the file description also mentions “data_filename: data_RNA_Seq_v2_expression_median.txt”, was an additional transformation such as “median scaling” applied, in your opinion?

Sorry to insist on this, but for downstream multi-omics integration I would like to be certain that adequate normalization has been performed prior to statistical modeling, especially for continuous omics layers such as gene expression.

With Kind Regards,

Efstathios

Histogram.RNASeq.EDA.png

JJ Gao

Sep 27, 2021, 4:33:41 PM

to Vlachavas, Efstathios-Iason, cbiop...@googlegroups.com

Hi Efstathios-Iason,

It was log transformed (log(value+1)) RSEM data we downloaded from http://www.linkedomics.org/data_download/CPTAC-COAD/. Please ignore the _median part - it was a misnomer on our side.
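For reference, a log(value+1) transform and its inverse can be sketched in Python as below. Base 2 is an assumption taken from the "Log2" in the profile name (the reply above only says "log"), and the function names are hypothetical. Under that assumption a zero count maps to 0, and the observed maximum of about 20.04 would correspond to an untransformed value on the order of 10^6.

```python
import math

def log2p1(value):
    """Forward transform: log2(value + 1), so that a zero count maps to 0."""
    return math.log2(value + 1.0)

def inverse_log2p1(logged):
    """Back-transform a log2(value + 1) entry to the original scale."""
    return 2.0 ** logged - 1.0

logged = [log2p1(v) for v in (0.0, 1023.0, 250.0)]
restored = [inverse_log2p1(x) for x in logged]
```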

-JJ

Vlachavas, Efstathios-Iason

Oct 4, 2021, 6:42:00 AM

to JJ Gao, cbiop...@googlegroups.com

Dear Dr. Gao,

Thank you very much for your confirmation. Just one final important question on which I would like your feedback: based on input from our collaborators, in addition to the multi-omics integration we would like to perform a statistical comparison in the proteomics and phosphoproteomics datasets, to identify features that are differentially expressed (DE) or differentially activated between mutational groups (i.e. KRAS vs BRAF).

As per our discussion in this thread, both the proteomics and phosphoproteomics data are normalized/processed, and both intensity histograms look Gaussian-ish. Since there are a lot of negative values, we could probably assume that these are similar to z-scores or ratios, as in the following for the proteomics intensities:

range(as.matrix(proteome.dat.clean))

[1] -5.41 3.98

Thus, based on your expertise and experience: since the included proteomics z-scores are not relevant for the type of analysis we need, and there are a lot of negative values (which probably indicate under-expression rather than no expression), would a simple t-test suffice for these comparisons, i.e. a feature-wise comparison between the groups of interest?

Thank you one more time for your overall help and time :)

Histogram.Proteomics.EDA.png

svlac...@eie.gr

Oct 4, 2021, 9:01:51 AM

to cBioPortal for Cancer Genomics Discussion Group

Dear Dr. Gao,

just an important update to take into consideration:

After a focused search of the linkedomics page as well, I found that the cBioPortal mass spec proteomics data have identical values (apart from minor changes in the header and format) to the following file on linkedomics:

http://www.linkedomics.org/data_download/CPTAC-COAD/

Proteome (PNNL, Gene level, Tumor TMT Unshared Log Ratio):

Proteome data for tumor samples, log-ratio normalized (TMT data for tumor samples, from Pacific Northwest National Laboratory, gene-level, unshared log-ratio).

As there is no other information in the paper, I would assume that these could be interpreted similarly to z-scores, with a negative value denoting under-expression of that protein rather than no expression, correct? I will also contact the authors for further information.

Kind Regards,

Efstathios

JJ Gao

Oct 4, 2021, 10:09:27 AM

to svlac...@eie.gr, cBioPortal for Cancer Genomics Discussion Group

Dear Efstathios-Iason,

Technically the values you are looking at are log-transformed RSEM scores rather than z-scores, but your understanding is correct: a negative value means under-expression (rather than no expression), and a t-test should be appropriate when comparing samples.

I am not sure if I answered all your questions - please feel free to follow up if you have additional questions.

Best,

-JJ

svlac...@eie.gr

Oct 4, 2021, 11:48:09 AM

to cBioPortal for Cancer Genomics Discussion Group

Dear JJ,

Thank you very much for your immediate feedback. Just to be fully certain that we cover all parts of my previous question:

You mention the RSEM values, but I was mainly referring to the proteomics data and the directionality of the processed values:

1) If I understood correctly, the normalized values of both the proteomics and the phosphoproteomics data are read in a similar way? That is, based on their sign, negative and positive values have a similar interpretation to z-scores?

2) Just to add a further validation of this, based on the linkedomics data mentioned in the publication: http://www.linkedomics.org/data_download/CPTAC-COAD/

There are various proteomics options to download. From an initial check I saw that the file "Protein expression levels (mass spectrometry by CPTAC)" in cBioPortal has the same values as the linkedomics file named Proteome (PNNL, Gene level, Tumor TMT Unshared Log Ratio), so I suppose these are the same files, correct? And they refer to the TMT proteomics, not the label-free, right? Apologies for insisting on this, but it motivates my next crucial question concerning the interpretation of the proteomics values:

3) Concerning the above file on the linkedomics portal, the authors of the corresponding publication pointed me to the following section for the processing:

“Quantification of TMT Global Proteomics Data” in STAR Methods:

Basically, channel 131 was used for labeling an internal reference sample (pooled from all tumor and normal samples with equal contribution) throughout the TMT analysis. Relative protein abundance was calculated as the ratio of sample abundance to reference abundance using the summed reporter ion intensities from peptides that could be uniquely mapped to a gene. The relative abundances were log2 transformed and zero-centered for each gene to obtain relative abundance values. Finally, the median log2 relative protein abundance for each sample was computed and re-centered to achieve a common median of 0.

Thus, my final quick questions are the following:

A) As the full dataset contains both label-free and TMT proteomics, cBioPortal hosts the same proteomics data (that is, TMT) as the original portal, linkedomics, correct? This is really pivotal for identifying the correct processing methods.

B) If so, and the cBioPortal proteomics and phosphoproteomics are indeed the same TMT data, would you describe both sets of proteome values, given the description above, as "scaled" intensity values? And would a t-test indeed be appropriate for our analysis purpose?

Thank you for your overall help and support on this demanding part, and apologies for the many questions so far :)

Kind Regards,

Efstathios

JJ Gao

Oct 4, 2021, 5:24:41 PM

to svlac...@eie.gr, cBioPortal for Cancer Genomics Discussion Group

Dear Efstathios,

Please see my comments below.

On Mon, Oct 4, 2021 at 11:48 AM svlac...@eie.gr <svlac...@eie.gr> wrote:

You mention the RSEM values, but I was mainly referring to the proteomics data and the directionality of the processed values:

Sorry I got mixed up. Please ignore the comment about RSEM and see my comments below.

1) If I understood correctly, the normalized values of both the proteomics and the phosphoproteomics data are read in a similar way? That is, based on their sign, negative and positive values have a similar interpretation to z-scores?

The data were log2 transformed and therefore contain both negative and positive values. They are not z-scores, though.

A) As the full dataset contains both label-free and TMT proteomics, cBioPortal hosts the same proteomics data (that is, TMT) as the original portal, linkedomics, correct? This is really pivotal for identifying the correct processing methods.

We are using the TMT log2 data downloaded from linkedomics.
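To see why such values are centered around zero and frequently negative, the processing described in the quoted STAR Methods passage (log2 ratio to the pooled reference, zero-centering per gene, then re-centering each sample to a median of 0) can be sketched in Python as below. The intensities and the `tmt_normalize` helper are hypothetical, and centering each gene on its mean is an assumption, since the quoted text does not say whether the mean or the median was used.

```python
import math
import statistics

def tmt_normalize(sample, reference):
    """sample/reference: {gene: [intensity per sample]}; returns centered log2 ratios."""
    genes = sorted(sample)
    # 1) log2 ratio of each sample to the pooled reference channel
    logr = {g: [math.log2(s / r) for s, r in zip(sample[g], reference[g])] for g in genes}
    # 2) zero-center each gene (row); using the mean here is an assumption
    for g in genes:
        center = statistics.mean(logr[g])
        logr[g] = [v - center for v in logr[g]]
    # 3) re-center each sample (column) so that every sample has a median of 0
    n_samples = len(logr[genes[0]])
    for j in range(n_samples):
        med = statistics.median([logr[g][j] for g in genes])
        for g in genes:
            logr[g][j] -= med
    return logr

# Hypothetical toy intensities: 3 genes x 3 samples, with matching reference values
sample = {"A": [200.0, 100.0, 50.0], "B": [80.0, 160.0, 40.0], "C": [30.0, 30.0, 120.0]}
reference = {"A": [100.0] * 3, "B": [80.0] * 3, "C": [60.0] * 3}
norm = tmt_normalize(sample, reference)
```

A negative entry therefore indicates abundance below the centered reference level for that gene, not absence of the protein.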

B) If so, and the cBioPortal proteomics and phosphoproteomics are indeed the same TMT data, would you describe both sets of proteome values, given the description above, as "scaled" intensity values? And would a t-test indeed be appropriate for our analysis purpose?

Yes, I think so.
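A feature-wise comparison of this kind could be sketched in Python as follows. The example uses Welch's unequal-variance form of the t-test, which is a choice made for this sketch rather than something specified in the thread, and it returns only the statistic and the Welch-Satterthwaite degrees of freedom. A p-value needs the t distribution, which the Python standard library does not provide (`scipy.stats.ttest_ind` with `equal_var=False` is one option), and with thousands of features a multiple-testing correction would normally follow. The group values are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for one feature."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se2 = va / na + vb / nb
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical log2 ratios of one protein in two mutational groups
kras_only = [1.0, 2.0, 3.0]
braf_only = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(kras_only, braf_only)
```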

Στάθης Βλαχάβας

Oct 5, 2021, 9:17:50 AM

to JJ Gao, cBioPortal for Cancer Genomics Discussion Group

Dear JJ,

Thank you very much for your confirmation and for validating my crucial points! It is great to utilize cBioPortal in translational cancer research, and hopefully we will get some biologically useful results! For any additional questions related to other datasets, I will create a separate post :)