Same input data, totally different results: does the normalization method affect the output?

Li Ma

Jun 30, 2022, 6:22:04 PM

to gsea-help

Hi folks,

I am using the latest version of GSEA to analyze an RNA-seq dataset. Yesterday, I ran it on a gene count matrix that was normalized using the varianceStabilizingTransformation() function from DESeq2. 18 gene sets were significant at FDR < 25%, and our target pathway, p53, was one of them.

However, today I got a totally different result with the same dataset and the same settings: only 6 gene sets were significant at FDR < 25%, and the p53 pathway was gone.

Then I regenerated the dataset by normalizing the count matrix using the counts() function from DESeq2. This time, with the new dataset and the same settings, 17 gene sets were significant at FDR < 25%, and p53 was among them.

My questions are:

1. Why did I get two different results with the same input dataset?

2. Does the normalization method affect the output?

Thanks

I use Ubuntu 20.04, and the version of java is openjdk 14.0.2.

Thanks again

Li Ma

Anthony Castanza

Jun 30, 2022, 6:57:03 PM

to gsea...@googlegroups.com

Hello,

GSEA can experience a (typically) small run-to-run variance in normalized scores and significance as a result of the random number seed used to generate the permuted matrix that underlies the null distribution.
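To make the seed effect concrete, here is a minimal Python sketch. It is not GSEA's algorithm: the statistic is a plain mean score over a gene set rather than the enrichment score, and the function name `permutation_p_value` and the toy data are made up for illustration. It only shows that an empirical p-value built from a random null distribution moves a little when the seed changes and is reproducible when the seed is fixed.

```python
import numpy as np

def permutation_p_value(scores, in_set, n_perm=1000, seed=None):
    """Empirical p-value for the mean score of a gene set.

    The null distribution is built from random gene sets of the same size,
    so the result depends on the random number seed.
    """
    rng = np.random.default_rng(seed)
    observed = scores[in_set].mean()
    set_size = int(in_set.sum())
    null = np.array([
        rng.choice(scores, size=set_size, replace=False).mean()
        for _ in range(n_perm)
    ])
    # Add-one correction so the estimate is never exactly zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Toy data (hypothetical): 2000 gene-level scores, the first 50 genes form
# the gene set and are shifted slightly upward.
data_rng = np.random.default_rng(0)
scores = data_rng.normal(size=2000)
in_set = np.zeros(2000, dtype=bool)
in_set[:50] = True
scores[in_set] += 0.3

p_a = permutation_p_value(scores, in_set, seed=1)
p_b = permutation_p_value(scores, in_set, seed=2)
p_a_again = permutation_p_value(scores, in_set, seed=1)

print(f"seed 1: {p_a:.4f}  seed 2: {p_b:.4f}  seed 1 again: {p_a_again:.4f}")
```

The two runs with seed 1 agree exactly, while the run with seed 2 will generally differ somewhat. If I recall the GSEA desktop settings correctly, there is a "Seed for permutation" field whose default is based on the timestamp; please treat that field name as an assumption and check it in your version.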

This generally has only a small impact on scores when the dataset fits the expectations of the GSEA algorithm. The variance can be exacerbated by issues such as running phenotype permutation with fewer than 7 samples per phenotype, or running with a highly restricted dataset from which many of the genes that were not differentially expressed have been removed. Datasets should be normalized, all expressed genes should be provided to GSEA, and for datasets with fewer than 7 samples per phenotype, gene set permutation should be used instead of phenotype permutation.
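One way to see why small sample counts are a problem for phenotype permutation is to count how many distinct ways the phenotype labels can be assigned. The sketch below is just that arithmetic for a two-class comparison; `distinct_labelings` is a hypothetical helper name, and the comparison against 1000 permutations assumes GSEA's usual default for the number of permutations.

```python
from math import comb

def distinct_labelings(n_a, n_b):
    """Number of distinct ways to assign n_a 'A' labels among n_a + n_b samples."""
    return comb(n_a + n_b, n_a)

# Balanced two-class designs with n samples per phenotype.
for n in (3, 6, 7, 10):
    print(f"{n} samples per phenotype: {distinct_labelings(n, n)} distinct labelings")
```

With 3 samples per phenotype there are only 20 distinct labelings, and with 6 there are 924, so a run of 1000 phenotype permutations has to repeat labelings and the null distribution is coarse. At 7 per phenotype the count is 3432. Gene set permutation does not have this limit, since it shuffles genes rather than sample labels.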

Can you tell us a little more about the dataset (number of genes and number of samples) and settings that you used?

With regard to normalization, yes, changing the normalization can have a large impact on the scores that are produced by GSEA. Generally we recommend the normalized counts output from DESeq2, and this is what the pipelines we’ve put together for the GenePattern platform use; however, I don’t recall if we did direct comparisons with the vst method.
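As a rough illustration of why the transformation matters, the Python sketch below ranks the same toy genes twice: once on size-factor-normalized counts and once on a log2(x + 1) version of them. Several things here are assumptions: the data are simulated, `size_factors` is my own re-implementation of the median-of-ratios idea rather than DESeq2 code, log2(x + 1) is only a crude stand-in for a variance-stabilizing transform (it is not the vst method), and `signal_to_noise` is a simplified form of the signal-to-noise ranking metric.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors; counts is a genes x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)      # skip genes with any zero count
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def signal_to_noise(matrix, is_a, eps=1e-8):
    """Simplified signal-to-noise: difference of means over summed std devs."""
    a, b = matrix[:, is_a], matrix[:, ~is_a]
    spread = a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1) + eps
    return (a.mean(axis=1) - b.mean(axis=1)) / spread

def spearman(x, y):
    """Rank correlation via argsort ranks (ties are broken arbitrarily)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Simulated counts (hypothetical): 500 genes, 4 samples per phenotype,
# with different sequencing depths per sample.
rng = np.random.default_rng(0)
is_a = np.array([True] * 4 + [False] * 4)
base = rng.gamma(shape=2.0, scale=50.0, size=(500, 1))
depth = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0])
counts = rng.poisson(base * depth)

normalized = counts / size_factors(counts)    # depth-corrected counts
log_like = np.log2(normalized + 1.0)          # stand-in for a log-like transform

rho = spearman(signal_to_noise(normalized, is_a),
               signal_to_noise(log_like, is_a))
print(f"Spearman correlation of the two gene rankings: {rho:.3f}")
```

Any correlation below 1 means the ranked gene list that GSEA walks down is ordered differently under the two transformations, which is enough to shift enrichment scores and FDR values for individual gene sets.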

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/7402bc55-4454-4544-88f3-166cd20c607en%40googlegroups.com.

Li Ma

Jun 30, 2022, 9:22:11 PM

to gsea...@googlegroups.com

Hi Anthony,

Thanks so much for your quick response!

Our dataset contains 46909 genes and 3 classes, with 6 replicates per class. We normalized the count matrix using DESeq2, just with different functions: one is counts(), the other is varianceStabilizingTransformation(). May I ask which function you use to normalize the dataset?

Here are our settings:

"Gene sets database": c2.cp.kegg.v7.5.1.symbols.gmt

"Phenotype labels": Tumor_versus_Normal

"Permutation type": gene_set

"Chip platform": Human_ENSEMBL_Gene_ID_MSigDB.v7.5.1.chip

For the remaining settings, we used the default values.

Best regards

Li Ma

l...@ualr.edu

