Hello,
Here is my point of view on z-score calculation in tranSMART ETL pipelines. I would like to break it down into 4 points:
1. Do we need z-score calculation by ETL ("global z-scores") and storage in database?
In my view, it is in most cases not appropriate to calculate z-scores at data-loading time, because z-scores should be calculated based on the particular analysis being run. This is why we requested the "calculate z-scores on-the-fly" option in tranSMART workflows on high dim data. We are currently debating whether high dim workflows should systematically retrieve log2 values from the database and calculate z-scores on-the-fly on the samples used in the analysis. Of course, "global z-scores" could still be obtained on-the-fly by selecting all subjects and all nodes in a workflow.
Here are 2 examples where "global z-scores" cannot be used:
A.
When you want to focus your analysis on a subset of subjects from the study, let's say old women, and you are not interested in the level of expression compared to the other subjects in the study, but only among those women, global z-scores are not appropriate. Z-scores should be calculated only across samples from old women.
B.
When several tissues or sample types have been analyzed for gene expression, let's say blood and urine, global z-scores will mainly reflect differences in expression between blood and urine for a single gene, but not differences between subjects in blood (or in urine) for genes that have very different levels of expression in the 2 tissues. To differentiate subjects based on biomarker expression in blood, the appropriate normalization is to calculate z-scores only across blood samples.
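Example B can be made concrete with a small numerical sketch (the helper function and the log2 intensities below are hypothetical, my own illustration rather than tranSMART code): for a gene expressed much more strongly in blood than in urine, global z-scores push every blood sample toward +1 and every urine sample toward -1, while the differences *between* blood samples are compressed to almost nothing.

```python
# Hypothetical illustration: global vs. subset z-scores for one gene.
from statistics import mean, stdev

def zscores(values):
    """Standard z-scores: (x - mean) / stddev over the given samples."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# log2 intensities for one gene (hypothetical values):
# expression is much higher in blood than in urine.
blood = [10.1, 10.4, 9.8, 10.3]
urine = [2.0, 2.2, 1.9, 2.1]

global_z = zscores(blood + urine)   # all samples pooled together
blood_z = zscores(blood)            # blood samples only

# Globally, blood samples all land near +1 and urine samples near -1;
# the between-blood-sample differences are nearly invisible.
print([round(z, 2) for z in global_z])
# Within blood only, the same four samples spread across the z range.
print([round(z, 2) for z in blood_z])
```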
2. What formula for z-score calculation should be used?
So far, the calculation that was present from early versions of tranSMART has not been changed (to my knowledge). Z-score calculation is done by the following formula:
zscore = (log_intensity - median_log_intensity) / stddev_log_intensity
The most common formula for z-score calculation uses the mean (not the median) to center the data, so one may wonder why the median is used here. Z-score transformation is justified when the data follow a normal distribution. Log-transformation of the intensities is used to make the data normal (or close to normal), and in a normal distribution mean and median are equal, so both formulas should be equivalent. Of course, we know that the distributions of log-intensities for biomarkers are not really normal, and in particular that some very high values usually exist. The median is probably used to reduce the influence of these very high values. However, the problem is that the scale estimate, stddev, is not robust to high values either. Therefore I would either stick with the usual formula zscore = (log_intensity - mean_log_intensity) / stddev_log_intensity (assuming a close-to-normal distribution) or use a fully robust formula, for example centering on the median and scaling by the median absolute deviation (MAD).
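To make the comparison concrete, here is a small sketch (my own code, not the ETL implementation) of the three variants on hypothetical data containing one very high value. The MAD-based variant is my suggested robust alternative, not something tranSMART currently does.

```python
# Hypothetical comparison of three z-score formulas on data with an outlier.
from statistics import mean, median, stdev

def z_transmart(xs):
    # current tranSMART formula: median-centered, but scaled by ordinary stddev
    med, s = median(xs), stdev(xs)
    return [(x - med) / s for x in xs]

def z_classic(xs):
    # textbook formula: mean-centered, stddev-scaled
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def z_robust(xs, c=1.4826):
    # robust variant: median-centered, MAD-scaled
    # (c makes the MAD consistent with stddev under normality)
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    return [(x - med) / (c * mad) for x in xs]

log_int = [5.0, 5.1, 4.9, 5.2, 5.0, 12.0]  # one very high value

z_t = z_transmart(log_int)
z_c = z_classic(log_int)
z_r = z_robust(log_int)

# The outlier inflates stddev, so both stddev-based variants squash the
# z-scores of the ordinary samples; the MAD-based variant preserves them.
```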
3. Should z-scores be "clipped" (or winsorized) at -2.5 and 2.5?
Again, this was present from earlier versions of tranSMART and has not been changed. Obviously this winsorization reduces the influence of extreme z-scores (which may largely be due to a poor estimation of stddev, see above) on further calculations in workflows such as Marker Selection and Clustering. Although winsorization is questionable, I personally would keep it for this reason. But I don't have a precise idea of the appropriate cut-off to use (why not -3 and +3, for example?).
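The clipping step itself is trivial; a one-line sketch (again my own illustration, not the pipeline code):

```python
# Sketch of the clipping ("winsorization") step described above:
# z-scores are truncated to the interval [-2.5, 2.5].
def clip_zscore(z, lo=-2.5, hi=2.5):
    """Clamp a z-score into [lo, hi]."""
    return max(lo, min(hi, z))

zs = [-4.1, -1.0, 0.3, 2.2, 3.7]
clipped = [clip_zscore(z) for z in zs]
# -> [-2.5, -1.0, 0.3, 2.2, 2.5]
```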
4. Z-score winsorization and color ranges in heatmaps
There are several options in R to set the color range of a heatmap. In tranSMART version RC2 (and I believe 1.2), the color range has been systematically fixed between -2.5 and 2.5 because z-scores are clipped to -2.5 and +2.5. Therefore, in all heatmaps, bright green represents -2.5 (lowest intensity) and bright red represents +2.5 (highest intensity), which is very simple for users.
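The benefit of a fixed color range is that a given z-score always maps to the same color. tranSMART's heatmaps are drawn in R; as a language-neutral illustration of the same idea (my own sketch, not the actual rendering code):

```python
# Illustration only: with the color range fixed to [-2.5, 2.5], a z-score
# maps to the same position on the color scale whatever data are plotted.
def color_position(z, lo=-2.5, hi=2.5):
    """Map a z-score to a 0..1 position on the green-to-red color scale."""
    z = max(lo, min(hi, z))          # clip, matching the stored z-scores
    return (z - lo) / (hi - lo)      # 0.0 = bright green, 1.0 = bright red

assert color_position(-2.5) == 0.0   # bright green
assert color_position(0.0) == 0.5    # midpoint of the scale
assert color_position(2.5) == 1.0    # bright red
```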
I hope that these comments are useful!
Best regards,
Annick