Dear cBioPortal Team,
Good afternoon, and I hope this message finds you safe and healthy! My name is Efstathios-Iason Vlachavas, and I am a post-doc scientist at the German Cancer Research Center (DKFZ), in the department of Molecular Genome Analysis under Prof. Dr. Stefan Wiemann.
Briefly, as part of an ongoing international collaborative project between DKFZ and Greek research institutions on personalized medicine in cancer (http://www.accc.gr/aboutACCC_info.html), we have acquired around 34 samples with DNA and RNA, which were sequenced and pre-processed at the DKFZ bioinformatics facility. The main goal of my current post-doc is to investigate whether specific mutational patterns characterize any of 3 patient groups defined by mutational status (KRASonly, BRAFonly and wild type), with the ultimate aim of describing the molecular landscape of these 3 groups.
On this premise, we have been exploring the cBioPortal web resource, not only to expand our sample size with independent cohort studies but also to perform multi-omics integration, in order to correlate the pre-defined mutational status (KRAS and BRAF mutations) with different omics layers and pathways. We found the Colon Cancer CPTAC-2 study:
https://www.cbioportal.org/study/summary?id=coad_cptac_2019
My main methodological questions are the following:
1) Concerning the study itself: we would like to focus on the patients/samples that have been profiled in all of the selected omics layers (copy number alterations, mutations, gene expression and proteomics). Would the most direct way be to download the clinical data for the selected cases from https://www.cbioportal.org/study/summary?id=coad_cptac_2019 and then subset each omics layer by the selected tumor sample barcodes?
2) Concerning the proteomic layers specifically, i.e. Protein expression levels (mass spectrometry by CPTAC) and protein level z-scores:
A) Is there any information about the pre-processing of the protein expression data matrix? For example, from the corresponding data_protein_quantification.txt file:
Composite.Element.REF  01CO005  01CO006  01CO008  01CO013  01CO014  01CO015  01CO019  01CO022
A1BG | A1BG            -1.1     -1.12    -1.2     -1.89    -0.523   -1.62    -0.311   0.906
A1CF | A1CF            0.318    -0.441   0.16     0.112    -0.248   0.263    -0.27    -0.778
A2M | A2M              -0.487   -0.347   -1.85    -0.329   -0.638   -0.976   -0.921   1.45
AAAS | AAAS            0.0995   -0.0029  0.119    0.67     0.289    0.522    0.226    0.359
AACS | AACS            0.155    0.0957   -0.0924  0.116    0.378    -0.273   0.0528   0.219
AAGAB | AAGAB          0.169    0.396    0.0187   0.313    0.822    0.504    0.0428   0.771
Are these normalized protein intensities, or has some other data transformation and/or pre-processing been applied?
B) Is there a reason why the row names above contain duplicated gene symbols? For simplicity and gene symbol concordance across all assays, could we keep just one unique symbol per row name?
C) Moreover, are the protein expression (mass spectrometry) z-scores computed in the same way as for the RNA-seq data, i.e. relative to the rest of the tumor samples?
3) Concerning the mutation data: among the files available for download, are there any specific differences between the MAF files “data_mutations_extended” and “data_mutations_mskcc”? From a brief inspection with the R package maftools, the mskcc file contains 14914 genes, whereas the extended file contains 14881, but I could not find any other major differences in the column names or mutational attributes. Thus, for my purpose of separating the patients into KRAS/BRAF/WT groups, could I use either of them, or are there more important differences between the two files?
4) Furthermore, the study page linked above mentions 110 patients/samples, whereas the highest number of patients profiled in any layer is 106. Is there a reason for this small discrepancy? As far as I have checked, no patient has duplicated samples. https://www.cbioportal.org/study/summary?id=coad_cptac_2019
5) Finally, concerning the clinical data: using either data_clinical_sample or data_clinical_patient would not have any impact, correct?
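For illustration, the subsetting and grouping described in questions 1 to 3 could be sketched as below. This is only a sketch under stated assumptions, not the portal's own procedure: the function names are hypothetical, the matrices are assumed to be tab-separated with sample IDs in the header row, the expression header is assumed to carry two annotation columns (Hugo_Symbol and Entrez_Gene_Id), and the mutation records are toy values rather than data from the study.

```python
def sample_ids(header_line, n_annotation_cols=1):
    """Return the sample IDs from the header of a tab-separated matrix."""
    return header_line.rstrip("\n").split("\t")[n_annotation_cols:]


def common_samples(*layers):
    """Intersect sample IDs across layers, keeping the order of the first layer."""
    shared = set(layers[0]).intersection(*layers[1:])
    return [s for s in layers[0] if s in shared]


def clean_symbol(row_name):
    """Collapse a duplicated label such as 'A1BG | A1BG' to 'A1BG'."""
    parts = [p.strip() for p in row_name.split("|")]
    # Only collapse when both halves agree; otherwise leave the label untouched.
    return parts[0] if len(set(parts)) == 1 else row_name


def mutation_group(mutated_genes):
    """Label a sample by the genes in which it carries a mutation."""
    has_kras = "KRAS" in mutated_genes
    has_braf = "BRAF" in mutated_genes
    if has_kras and has_braf:
        return "KRAS+BRAF"
    if has_kras:
        return "KRASonly"
    if has_braf:
        return "BRAFonly"
    return "WT"


# Toy headers (assumed layouts) and toy (sample, gene) mutation records.
protein_header = "Composite.Element.REF\t01CO005\t01CO006\t01CO008"
rna_header = "Hugo_Symbol\tEntrez_Gene_Id\t01CO006\t01CO008\t01CO013"
mutations = [("01CO006", "KRAS"), ("01CO006", "TP53"), ("01CO008", "APC")]

shared = common_samples(
    sample_ids(protein_header),
    sample_ids(rna_header, n_annotation_cols=2),
)

genes_by_sample = {}
for sample, gene in mutations:
    genes_by_sample.setdefault(sample, set()).add(gene)

groups = {s: mutation_group(genes_by_sample.get(s, set())) for s in shared}
print(shared, groups, clean_symbol("A1BG | A1BG"))
```

One caveat with this sketch: a sample should only be labelled WT if it was actually sequenced, so in practice the mutation profile's own sample list would need to be one of the intersected layers. Whether silent variants should count towards a group is also a choice left open here.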
Thank you in advance for your time and consideration on this matter!
With Kind Regards,
Efstathios-Iason Vlachavas
Efstathios-Iason Vlachavas
Post-doc/Guest Scientist
German Cancer Research Center (DKFZ)
Foundation under Public Law
Im Neuenheimer Feld 280
69120 Heidelberg
Germany
phone: +49 6221 42-5123
fax: +49 6221 42-5109
Efstathios-Ia...@dkfz-heidelberg.de

Management Board: Prof. Dr. med. Michael Baumann, Ursula Weyrich
VAT-ID No.: DE143293537
Dear Dr. Gao,
Thank you very much for your valuable comments and suggestions. Very briefly, two important points of clarification before I process the data:
1) First, regarding my second question on the processing pipeline for Protein expression levels (mass spectrometry by CPTAC): I tried to contact the authors, but they only pointed me to another repository, without any further explanation or pointers concerning cBioPortal.
In addition, I searched the Materials and Methods of the original publication:
https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930292-2 (page 22):
Label-free Proteomics Data Analysis
“Spectral count data were filtered by removing proteins with zero counts in all samples and quantile-normalized using the R package preprocessCore (version 1.42.0, https://github.com/bmbolstad/preprocessCore). We further filtered low abundant proteins with average raw count < 1.4 as we did previously (Zhang et al., 2014). The normalized and filtered counts were then log2 transformed for downstream analysis.”
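To make the quoted label-free steps concrete, here is a rough sketch of the same sequence (drop all-zero proteins, quantile-normalize, filter on mean raw count, log2). It is not the authors' code and does not call preprocessCore: ties are broken by row order rather than averaged, the function names are hypothetical, and the pseudocount of 1 before the log2 is my own assumption to avoid log2(0), since the quote does not say how zeros were handled.

```python
import math


def quantile_normalize(matrix):
    """Quantile-normalize a rows (proteins) x columns (samples) list of lists.

    Simplified: ties are broken by row order instead of being averaged.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share the same rank across samples.
    rank_means = [sum(sc[i] for sc in sorted_cols) / n_cols for i in range(n_rows)]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = rank_means[rank]
    return normalized


def preprocess_counts(counts, min_mean_raw=1.4, pseudocount=1.0):
    """counts: dict mapping protein -> list of raw spectral counts per sample."""
    # 1) Remove proteins with zero counts in all samples.
    kept = {p: v for p, v in counts.items() if any(x > 0 for x in v)}
    if not kept:
        return {}
    names = list(kept)
    # 2) Quantile-normalize the remaining proteins.
    normalized = quantile_normalize([kept[p] for p in names])
    result = {}
    for name, norm_row in zip(names, normalized):
        # 3) Drop low-abundance proteins based on their mean RAW count.
        if sum(kept[name]) / len(kept[name]) >= min_mean_raw:
            # 4) log2 transform (pseudocount is an assumption, see lead-in).
            result[name] = [math.log2(x + pseudocount) for x in norm_row]
    return result


# Toy counts, not data from the study.
toy_counts = {"P1": [10, 20], "P2": [0, 0], "P3": [1, 1], "P4": [4, 2]}
processed = preprocess_counts(toy_counts)
print(processed)
```

Values produced this way are non-negative log2 counts, which is one practical way to tell label-free style data apart from a matrix that contains many negative values.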
Judging also from your related publication (https://www.mcponline.org/article/S1535-9476(20)31786-2/fulltext#supplementaryMaterial), I suppose that the protein expression levels hosted in cBioPortal correspond to the data described above, and that the same pipeline was used for processing. Is that right? Apologies for insisting on this; the original paper describes several different protein assays and pipelines, and I need to be certain so that I can document all prior processing steps.
Overall, from a brief inspection of the values and their distributions, the data look normalized and roughly Gaussian, and can probably be used directly.
2) For the clinical text files data_clinical_sample and data_clinical_patient: as you mention, they contain different levels of information. However, since in the portal each patient appears to have exactly one sample, the tumor sample barcodes could safely be used to extract information from either file, and there is no issue because patients and samples are effectively the same, correct?
Thank you for your help and consideration :)
Dear Dr. Gao,
Thank you a gazillion for your help and support! I do not know the exact reason, but I did not receive any notifications, either from the group or via email. My last comment concerns the processing of the RNA-seq data, for which I would like your confirmation:
For the RNA-seq data, one of the options is mRNA expression (RNA Seq V2 RSEM UQ Log2). From a histogram, the values look roughly Gaussian and range from about 0 to ~20.04. Since the file description also mentions “data_filename: data_RNA_Seq_v2_expression_median.txt”, was an additional transformation such as “median scaling” applied, in your opinion?
Sorry to insist on this, but for the downstream multi-omics integration I would like to be certain that adequate normalization has been performed before statistical modeling, especially for continuous omics layers such as gene expression.
With Kind Regards,
Efstathios
Dear Dr. Gao,
Thank you very much for your confirmation and notification. Just one final important question on which I would like your feedback: at our collaborators' request, in addition to the multi-omics integration we would like to perform a statistical comparison in the proteomics and phosphoproteomics datasets, to identify features that are differentially expressed (DE) or differentially activated between mutational groups (e.g. KRAS vs BRAF).
As per our discussion so far, both the proteomics and phosphoproteomics data are normalized/processed, and the intensity histograms of both look roughly Gaussian. As there are many negative values, we could probably assume that these are similar to z-scores or ratios; for example, for the proteomics intensities:
range(as.matrix(proteome.dat.clean))
[1] -5.41 3.98
Thus, based on your expertise and experience: the z-scores that are also provided for the proteomics are not relevant for the type of analysis we need, and the many negative values probably indicate under-expression rather than no expression. Would a simple t-test suffice for these comparisons, i.e. a feature-wise comparison between the groups of interest?
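As a concrete picture of what such a feature-wise comparison would look like, here is a small sketch of a Welch (unequal-variance) t-test computed per protein between two groups. This is my own illustration and not something recommended by the portal: the function names, the minimum group size of 3 and the toy values are all assumptions, and only the t statistic and the Welch-Satterthwaite degrees of freedom are computed, since a p-value needs the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def compare_groups(matrix, samples, groups, group_a, group_b, min_n=3):
    """Run welch_t for every feature with enough observed values per group.

    matrix: dict feature -> list of values aligned with `samples`
            (None marks a missing value).
    groups: dict sample -> group label.
    """
    idx_a = [i for i, s in enumerate(samples) if groups.get(s) == group_a]
    idx_b = [i for i, s in enumerate(samples) if groups.get(s) == group_b]
    results = {}
    for feature, values in matrix.items():
        a = [values[i] for i in idx_a if values[i] is not None]
        b = [values[i] for i in idx_b if values[i] is not None]
        if len(a) < min_n or len(b) < min_n:
            continue  # too few observations in at least one group
        if variance(a) == 0 and variance(b) == 0:
            continue  # t statistic undefined when both groups are constant
        results[feature] = welch_t(a, b)
    return results


# Toy values, not data from the study.
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
groups = {"S1": "KRASonly", "S2": "KRASonly", "S3": "KRASonly",
          "S4": "BRAFonly", "S5": "BRAFonly", "S6": "BRAFonly"}
matrix = {
    "PROT1": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "PROT2": [0.5, None, 0.1, 0.2, 0.3, 0.4],  # skipped: only 2 KRASonly values
}
results = compare_groups(matrix, samples, groups, "KRASonly", "BRAFonly")
print(results)
```

With thousands of proteins tested, the resulting p-values would need a multiple-testing correction (for example Benjamini-Hochberg), and with small groups a moderated test is often preferred over a plain t-test; both points are outside this sketch.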
Thank you once more for your overall help and time :)
After a focused search of the LinkedOmics page as well, I found that the cBioPortal mass spec proteomics data have identical values (apart from minor changes in the header and format) to the following file on LinkedOmics:
http://www.linkedomics.org/data_download/CPTAC-COAD/
Proteome (PNNL, Gene level, Tumor TMT Unshared Log Ratio):
Proteome data for tumor samples log-ratio normalized (TMT data for Tumor samples, from Pacific Northwest National Laboratory, Gene-level, Unshared log-ratio);
As there is no other information in the paper, I would assume that these values can be interpreted similarly to z-scores, with a negative value denoting under-expression of that protein rather than absence of expression, correct? I will also contact the authors for further information.
Kind Regards,
Efstathios
Dear JJ,
thank you very much for your immediate feedback. Just to be fully certain that we cover all parts of my previous question: you refer to the RSEM values, but I was mainly asking about the proteomics data and the directionality of the processed values:
1) If I understood correctly, the normalized values of both the proteomics and the phosphoproteomics data are interpreted in a similar way? That is, the negative and positive values have a similar interpretation to z-scores?
2) Just to add a further check on this: on the LinkedOmics page mentioned in the publication (http://www.linkedomics.org/data_download/CPTAC-COAD/) there are various proteomics files to download. From an initial check, the file "Protein expression levels (mass spectrometry by CPTAC)" in cBioPortal has the same values as the LinkedOmics file named Proteome (PNNL, Gene level, Tumor TMT Unshared Log Ratio), so I suppose these are the same data, correct? And they refer to the TMT proteomics, not the label-free, right? Apologies for insisting on this, but it underpins my next crucial question concerning the interpretation of the proteomics values:
3) Regarding the processing of the above LinkedOmics file, the authors of the publication pointed me to “Quantification of TMT Global Proteomics Data” in the STAR Methods:
Basically, channel 131 was used for labeling an internal reference sample (pooled from all tumor and normal samples with equal contribution) throughout the TMT analysis. Relative protein abundance was calculated as the ratio of sample abundance to reference abundance using the summed reporter ion intensities from peptides that could be uniquely mapped to a gene. The relative abundances were log2 transformed and zero-centered for each gene to obtain relative abundance values. Finally, the median log2 relative protein abundance for each sample was computed and re-centered to achieve a common median of 0.
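To check my reading of this description, the three steps (log2 ratio to the reference channel, zero-centering per gene, re-centering each sample to a median of 0) can be sketched as follows. This is an illustration only and not the authors' pipeline: the function name and the toy intensities are hypothetical, and centering each gene on its mean is an assumption, since the description does not say whether the mean or the median was used.

```python
import math
from statistics import mean, median


def tmt_relative_abundance(sample_intensity, reference_intensity):
    """Turn summed reporter-ion intensities into centered log2 ratios.

    sample_intensity:    dict gene -> list of intensities, one per sample
    reference_intensity: dict gene -> list of pooled-reference intensities,
                         aligned with the samples (same plex)
    """
    genes = list(sample_intensity)
    # 1) log2 ratio of sample abundance to reference abundance.
    log_ratio = {
        g: [math.log2(s / r)
            for s, r in zip(sample_intensity[g], reference_intensity[g])]
        for g in genes
    }
    # 2) Zero-center each gene (mean-centering is an assumption).
    centered = {}
    for g, vals in log_ratio.items():
        gene_mean = mean(vals)
        centered[g] = [v - gene_mean for v in vals]
    # 3) Re-center each sample so that its median across genes is 0.
    n_samples = len(centered[genes[0]])
    sample_medians = [median(centered[g][j] for g in genes)
                      for j in range(n_samples)]
    return {g: [centered[g][j] - sample_medians[j] for j in range(n_samples)]
            for g in genes}


# Toy intensities, not data from the study.
toy_samples = {"G1": [200.0, 50.0, 100.0],
               "G2": [80.0, 80.0, 160.0],
               "G3": [30.0, 120.0, 60.0]}
toy_reference = {g: [100.0, 100.0, 100.0] for g in toy_samples}
relative = tmt_relative_abundance(toy_samples, toy_reference)
print(relative)
```

Under this reading, a negative value means lower abundance relative to the pooled reference and the gene and sample centers, not that the protein is absent. Unlike a z-score, the value is not divided by a standard deviation, so it stays on the log2 scale.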
Thus, my main final comments are the following:
A) As the full dataset contains both label-free and TMT proteomics, cBioPortal hosts the same proteomics data (that is, TMT) as the original portal, LinkedOmics, correct? This is pivotal for identifying the correct processing methods.
B) If my understanding above is correct and the cBioPortal proteomics and phosphoproteomics are indeed the same TMT data, would you, based on the above description, describe both sets of proteome values as "scaled" intensity values? And would a t-test indeed be appropriate for our analysis?