Calculating Z-Scores using RNASeq Data


Muhammad Shah

Jun 3, 2013, 2:07:12 PM
to cbiop...@googlegroups.com
Hello,

I am in the process of calculating Z-Scores for matched tissue samples with RNASeq V2 (RSEM) data available from TCGA (the same data hosted on cBioPortal). I am reading your FAQ regarding calculating Z-Scores and am a little confused by the wording. It reads as follows:

"For mRNA and microRNA expression data, we typically compute the relative expression of an individual gene and tumor to the gene's expression distribution in a reference population. That reference population is either all tumors that are diploid for the gene in question, or, when available, normal adjacent tissue. The returned value indicates the number of standard deviations away from the mean of expression in the reference population (Z-score). This measure is useful to determine whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples."

Also, I understand that the formula for calculating Z-scores is:

z = ((expression in tumor sample) - (mean expression in normal samples)) / (standard deviation of expression in normal samples)

Hence, since these are all matched tissue samples, do we take the RSEM value of a specific gene in the tumor sample and subtract the RSEM value for the same gene in the matched normal sample? If so, how do we calculate the standard deviation?

Thank you,
Ahmad Shah

JianJiong Gao

Jun 14, 2013, 10:27:08 AM
to cbiop...@googlegroups.com, muhammad...@gmail.com
Hi Ahmad,

Sorry for the late response. The reference population for calculating z-scores of RSEM data in TCGA studies is the set of samples that are diploid for the gene in question.

z = ((expression in tumor sample) - (mean expression in diploid samples)) / (standard deviation of expression in diploid samples)
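
A minimal sketch of this calculation for a single gene, using only the Python standard library. The names `expression`, `diploid_samples`, and `z_scores` and the sample values are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption; this is an illustration of the formula, not the portal's actual code.

```python
from statistics import mean, stdev

def z_scores(expression, diploid_samples):
    """Return {sample: z} for one gene, with diploid samples as the reference."""
    # Reference population: samples that are diploid for this gene.
    reference = [expression[s] for s in diploid_samples if s in expression]
    if len(reference) < 2:
        raise ValueError("need at least two reference samples")
    mu = mean(reference)
    sigma = stdev(reference)  # assumption: sample SD (n - 1)
    if sigma == 0:
        raise ValueError("zero variance in reference population")
    # Every sample, diploid or not, is scored against the same reference.
    return {s: (x - mu) / sigma for s, x in expression.items()}

# Hypothetical RSEM values for one gene; S4 is not diploid for it.
expression = {"S1": 10.0, "S2": 12.0, "S3": 14.0, "S4": 30.0}
diploid_samples = {"S1", "S2", "S3"}
scores = z_scores(expression, diploid_samples)
```

Note that the subtraction happens before the division, which is why the outer parentheses in the formula matter.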

Best,
-JJ



--
You received this message because you are subscribed to the Google Groups "The cBio Cancer Genomics Portal Discussion Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cbioportal+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Nikolaus Schultz

Jun 20, 2013, 10:55:53 AM
to cbiop...@googlegroups.com, muhammad...@gmail.com
Hi Ahmad,

Normal mRNA is only available for a smaller subset of samples, not for every tumor sample.

It is also not always clear what the cell of origin of a tumor is, so the mRNA expression in normal adjacent tissue can sometimes be misleading. 
This is why we chose to compare expression within the set of tumors only.

Niki.


Yes, this makes sense, but I am fairly sure that TCGA provides
matched normal samples for its RNASeq data. For each patient_id (looking
at the TCGA barcode), the same patient_id appears twice in the raw data
file, for example:

TCGA-91-6835-11A-01R-1858-07
TCGA-91-6835-01A-11R-1858-07

Here, for patient_id 6835, the fields after the patient_id (-11A- and
-01A-) tell us that 11A is the normal sample and 01A is the tumor. The
sample type codes are listed here:
https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=Sample%20type
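
As a rough sketch of how the barcodes above could be split into tumor and normal samples: the two digits at the start of the fourth barcode field are the sample type code, and this example assumes codes 01-09 denote tumor and 10-19 denote normal, as in the code table linked above. The function names `sample_type` and `matched_pairs` are hypothetical, and the patient is identified here by the first three barcode fields.

```python
from collections import defaultdict

def sample_type(barcode):
    """Classify a TCGA barcode as 'tumor', 'normal', or 'other'."""
    code = int(barcode.split("-")[3][:2])  # e.g. "11A" -> 11
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"

def matched_pairs(barcodes):
    """Return {patient: {type: [barcodes]}} for patients with tumor AND normal."""
    by_patient = defaultdict(lambda: defaultdict(list))
    for b in barcodes:
        patient = "-".join(b.split("-")[:3])  # e.g. "TCGA-91-6835"
        by_patient[patient][sample_type(b)].append(b)
    return {
        p: dict(types)
        for p, types in by_patient.items()
        if types["tumor"] and types["normal"]
    }

barcodes = [
    "TCGA-91-6835-11A-01R-1858-07",
    "TCGA-91-6835-01A-11R-1858-07",
]
pairs = matched_pairs(barcodes)
```
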

Thus, would it make sense instead to calculate the Z-score based on the
matched normal (considering only cases where a matched normal is available)?




paola...@gmail.com

Sep 8, 2014, 1:04:45 PM
to cbiop...@googlegroups.com, muhammad...@gmail.com
Hi!
I have a question regarding this issue.
When calculating the z-score for a given gene X in a set of tumors, if you take as the reference population the group of tumors that are diploid for X, then you have a different reference population for each gene you want to analyze.
So how can you do this kind of calculation genome-wide?
I am doing it by taking as reference the samples that are labelled as "normal" in the TCGA datasets (the -11A- samples), so that I can use the same reference for all genes.
The values I get are quite far from those that cBioPortal gives.
Now I am a bit lost!
Any help would be greatly appreciated!
Thanks
Paola

Nikolaus Schultz

Nov 27, 2014, 2:26:49 PM
to cbiop...@googlegroups.com, muhammad...@gmail.com, paola...@gmail.com, Seb ...
Hi Paola,

You are correct: the way we calculate the z-scores, we use a different reference population for each gene. These z-scores are really just meant to indicate relative over- or under-expression of each gene in comparison to the rest of the cohort.
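
To illustrate how a per-gene reference population can still be applied genome-wide, here is a minimal sketch that loops over genes and looks up a separate diploid sample set for each one. The data layout, the names `genome_wide_z`, `expr`, and `diploid`, the gene and sample labels, and the choice of the sample standard deviation are all assumptions for illustration; this is not the portal's implementation.

```python
from statistics import mean, stdev

def genome_wide_z(expr, diploid):
    """expr: {gene: {sample: value}}; diploid: {gene: set of diploid samples}."""
    result = {}
    for gene, values in expr.items():
        # The reference population is chosen separately for each gene.
        ref = [values[s] for s in diploid.get(gene, ()) if s in values]
        if len(ref) < 2:
            continue  # no usable reference for this gene
        mu, sigma = mean(ref), stdev(ref)  # assumption: sample SD (n - 1)
        if sigma == 0:
            continue  # z-score undefined when the reference has no spread
        result[gene] = {s: (x - mu) / sigma for s, x in values.items()}
    return result

# Hypothetical expression values and per-gene diploid sets.
expr = {
    "GENE_A": {"S1": 1.0, "S2": 3.0, "S3": 5.0, "S4": 9.0},
    "GENE_B": {"S1": 20.0, "S2": 40.0, "S3": 10.0, "S4": 30.0},
}
diploid = {
    "GENE_A": {"S1", "S2", "S3"},  # S4 is not diploid for GENE_A
    "GENE_B": {"S2", "S3", "S4"},  # S1 is not diploid for GENE_B
}
z = genome_wide_z(expr, diploid)
```

Because each gene has its own mean and standard deviation, these values are not expected to match z-scores computed against a single shared set of normal samples.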

We are working on importing normal expression data for all TCGA studies and will then be able to compute z-scores relative to the normal samples. 

Niki.



Seb ...

Nov 20, 2014, 11:32:19 AM
to cbiop...@googlegroups.com, muhammad...@gmail.com, paola...@gmail.com
Paola

It might be that the data you are using are not the same as those used for cBioPortal. I think cBioPortal is updated monthly (or less often), so if you downloaded the TCGA RSEM level 3 data from the TCGA FTP server you might have a different version.
I too used the "normal" counterpart as the "diploid sample".



johnsc...@gmail.com

Aug 4, 2016, 10:14:30 AM
to cBioPortal for Cancer Genomics Discussion Group, muhammad...@gmail.com
Is there any literature that supports this method of calculating upregulation? Have there been published papers that state that they defined upregulation in this way? Any links to references would be most appreciated :)