Hello,
Here is my point of view on z-score calculation in tranSMART ETL pipelines. I would like to break it down into 4 points:
1. Do we need z-score calculation by ETL ("global z-scores") and storage in database?
In my view, it is in most cases not appropriate to calculate z-scores at data-loading time, because z-scores should be calculated based on the particular analysis being run. This is why we requested the "calculate z-scores on-the-fly" option in tranSMART workflows on high dim data. We are currently debating whether high dim workflows should systematically retrieve log2 values from the database and calculate z-scores on-the-fly on the samples used in the analysis. Of course, "global z-scores" could still be obtained on-the-fly by selecting all subjects and all nodes in a workflow.
Here are 2 examples where "global z-scores" cannot be used:
A.
When you want to focus your analysis on a subset of subjects from the study, let's say old women, and you are not interested in the level of expression compared to the other subjects in the study, but only among those women, global z-scores are not appropriate. Z-scores should be calculated only across samples from old women.
B.
When several tissues or sample types have been analyzed for gene expression, let's say blood and urine, global z-scores will mainly reflect differences in expression between blood and urine for a single gene, but not differences between subjects in blood (or in urine) for genes that have very different levels of expression in the 2 tissues. To differentiate subjects based on biomarker expression in blood, the appropriate normalization is to calculate z-scores only across blood samples.
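Example B can be made concrete with a small numerical sketch (the helper function and the log2 intensities below are hypothetical, my own illustration rather than tranSMART code): for a gene expressed much more strongly in blood than in urine, global z-scores push every blood sample toward +1 and every urine sample toward -1, while the differences *between* blood samples are compressed to almost nothing.

```python
# Hypothetical illustration: global vs. subset z-scores for one gene.
from statistics import mean, stdev

def zscores(values):
    """Standard z-scores: (x - mean) / stddev over the given samples."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# log2 intensities for one gene (hypothetical values):
# expression is much higher in blood than in urine.
blood = [10.1, 10.4, 9.8, 10.3]
urine = [2.0, 2.2, 1.9, 2.1]

global_z = zscores(blood + urine)   # all samples pooled together
blood_z = zscores(blood)            # blood samples only

# Globally, blood samples all land near +1 and urine samples near -1;
# the between-blood-sample differences are nearly invisible.
print([round(z, 2) for z in global_z])
# Within blood only, the same four samples spread across the z range.
print([round(z, 2) for z in blood_z])
```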
2. What formula for z-score calculation should be used?
So far, the calculation that was present from early versions of tranSMART has not been changed (to my knowledge). Z-score calculation is done by the following formula:
zscore = (log_intensity - median_log_intensity) / stddev_log_intensity
The most common formula for z-score calculation uses the mean (not the median) to center the data, so one may wonder why the median is used here. Z-score transformation is justified when the data follow a normal distribution. Log-transformation of the intensities is used to make the data normal (or close to normal), and in a normal distribution mean and median are equal, so both formulas should be equivalent. Of course, we know that the distributions of log-intensities for biomarkers are not really normal, and in particular that some very high values usually exist. The median is probably used to reduce the influence of these very high values. However, the problem is that the scale estimate, stddev, is not robust to high values either. Therefore I would either stick with the usual formula zscore = (log_intensity - mean_log_intensity) / stddev_log_intensity (assuming a close-to-normal distribution) or use a fully robust formula, for example centering on the median and scaling by the median absolute deviation (MAD).
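To make the comparison concrete, here is a small sketch (my own code, not the ETL implementation) of the three variants on hypothetical data containing one very high value. The MAD-based variant is my suggested robust alternative, not something tranSMART currently does.

```python
# Hypothetical comparison of three z-score formulas on data with an outlier.
from statistics import mean, median, stdev

def z_transmart(xs):
    # current tranSMART formula: median-centered, but scaled by ordinary stddev
    med, s = median(xs), stdev(xs)
    return [(x - med) / s for x in xs]

def z_classic(xs):
    # textbook formula: mean-centered, stddev-scaled
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def z_robust(xs, c=1.4826):
    # robust variant: median-centered, MAD-scaled
    # (c makes the MAD consistent with stddev under normality)
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    return [(x - med) / (c * mad) for x in xs]

log_int = [5.0, 5.1, 4.9, 5.2, 5.0, 12.0]  # one very high value

z_t = z_transmart(log_int)
z_c = z_classic(log_int)
z_r = z_robust(log_int)

# The outlier inflates stddev, so both stddev-based variants squash the
# z-scores of the ordinary samples; the MAD-based variant preserves them.
```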
3. Should z-scores be "clipped" (or winsorized) at -2.5 and 2.5?
Again, this was present from earlier versions of tranSMART and has not been changed. Obviously this winsorization reduces the influence of extreme z-scores (which may largely be due to a poor estimation of stddev, see above) on further calculations in workflows such as Marker Selection and Clustering. Although winsorization is questionable, I personally would keep it for this reason. But I don't have a precise idea of the appropriate cut-off to use (why not -3 and +3, for example?).
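The clipping step itself is trivial; a one-line sketch (again my own illustration, not the pipeline code):

```python
# Sketch of the clipping ("winsorization") step described above:
# z-scores are truncated to the interval [-2.5, 2.5].
def clip_zscore(z, lo=-2.5, hi=2.5):
    """Clamp a z-score into [lo, hi]."""
    return max(lo, min(hi, z))

zs = [-4.1, -1.0, 0.3, 2.2, 3.7]
clipped = [clip_zscore(z) for z in zs]
# -> [-2.5, -1.0, 0.3, 2.2, 2.5]
```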
4. Z-score winsorization and color ranges in heatmaps
There are several options in R to set the color range of a heatmap. In tranSMART version RC2 (and I believe 1.2), the color range has been systematically fixed between -2.5 and 2.5 because z-scores are clipped to -2.5 and +2.5. Therefore, in all heatmaps, bright green represents -2.5 (lowest intensity) and bright red represents +2.5 (highest intensity), which is very simple for users.
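The benefit of a fixed color range is that a given z-score always maps to the same color. tranSMART's heatmaps are drawn in R; as a language-neutral illustration of the same idea (my own sketch, not the actual rendering code):

```python
# Illustration only: with the color range fixed to [-2.5, 2.5], a z-score
# maps to the same position on the color scale whatever data are plotted.
def color_position(z, lo=-2.5, hi=2.5):
    """Map a z-score to a 0..1 position on the green-to-red color scale."""
    z = max(lo, min(hi, z))          # clip, matching the stored z-scores
    return (z - lo) / (hi - lo)      # 0.0 = bright green, 1.0 = bright red

assert color_position(-2.5) == 0.0   # bright green
assert color_position(0.0) == 0.5    # midpoint of the scale
assert color_position(2.5) == 1.0    # bright red
```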
I hope that these comments are useful!
Best regards,
Annick