Clarifying standard vs preranked GSEA


Daniel Jian-Ho Yaw

Oct 28, 2025, 1:15:32 AM
to gsea-help
Hello, 

I'm looking to perform GSEA for proteomics on S. cerevisiae. Through MS/MS, we have detected 4561/6067 (~75% coverage) across our 6 conditions (12 replicates each). 

I was looking to clarify the following points: 
  1. I've scoured the group to understand if we are to use standard or preranked GSEA. Considering that we have more than 3 replicates per condition, should we use standard GSEA with phenotype permutation? 
  2. Else, if preranked is preferred, would log2FC or the sign(log2(fc))*-log10(pValue) metric be preferable? There appears to be no consensus on which is "better". 
  3. Also, from the other threads, I understand that GSEA is meant for a larger number of genes/proteins. Is GSEA still suitable for our small(er) data set? 
  4. If standard GSEA is set to log2FC with gene set permutation, would that be the same as preRanked GSEA (ranked based on log2FC)? 
Thank you for your time. 

Take care,
Dan Yaw

Anthony Castanza

Oct 28, 2025, 5:23:27 PM
to gsea-help
Hi Dan,

Unsupported organisms like S. cerevisiae are a bit tricky in a couple of cases because we don't offer gene set databases or mapping files for them, so you need to supply your own. That said, for the technical questions you're asking here, I can offer some answers.

With regard to the dataset size issue: GSEA is intended to be run on the whole gene expression universe of the organism. The issue isn't strictly "how many genes you have in your dataset"; it's "how many genes you are assessing in relation to the number of genes expressed in the organism". If you have 75% genome coverage, you're probably fine for comparing your dataset against gene sets that were generated for your organism based on its gene universe.

In order to run the standard GSEA method with phenotype permutation you need at least 7 samples per phenotype. If you meet this, which it sounds like you do, then running the standard method of GSEA with the default phenotype permutation is preferred.
In phenotype permutation mode, GSEA performs its assessments based on the probability that those specific gene sets would be enriched in a case where there is no phenotype correlation across the samples. In comparison, GSEA Preranked (and the standard mode in gene_set permutation mode) uses a null hypothesis essentially based on the probability that a random set of the same size would be enriched in your sample phenotype comparison.

The null hypothesis for the phenotype permutation mode is generally more informative, being directly linked to the strength of the phenotype. However, it's not possible to do this with a reasonable number of permutations in cases where there are fewer than 7 samples per phenotype; hence the use of the alternative null distribution in the gene_set mode. If you aren't limited by replicate number, the phenotype mode is superior. However, this method does have lower power, necessitating a looser FDR cutoff (GSEA makes calls based on a 0.25 FDR cutoff, as described in the documentation).
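As a rough illustration of why small groups limit phenotype permutation (a sketch, not GSEA's own code), the number of distinct ways to relabel samples in a balanced two-class comparison can be counted directly. The function name is hypothetical, and the comparison against 1000 permutations assumes the commonly used permutation count.

```python
from math import comb

def distinct_label_splits(n_a: int, n_b: int) -> int:
    """Count distinct assignments of n_a of the (n_a + n_b) samples to phenotype A."""
    return comb(n_a + n_b, n_a)

# Balanced two-class designs with n samples per phenotype.
for n in (3, 5, 7, 12):
    splits = distinct_label_splits(n, n)
    enough = "yes" if splits >= 1000 else "no"
    print(f"{n} per phenotype: {splits} distinct splits (>= 1000: {enough})")
```

With 3 samples per phenotype there are only 20 distinct splits, whereas 7 per phenotype gives 3432 and 12 per phenotype gives 2704156, so a null distribution built from 1000 label permutations is only meaningful at the larger group sizes.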

With respect to the standard (albeit in gene_set permutation) mode vs. GSEA Preranked: there is no consensus, and we don't generally offer specific advice on which to use in the circumstance where you cannot use phenotype permutation mode. GSEA Preranked uses the same gene set permutation mode, so the only decision to make is whether you prefer the internal GSEA ranking metric (signal to noise ratio) or your own calculated metric. GSEA run with gene set permutation mode and the non-default log2_ratio_of_classes method should be equivalent to running Preranked on computed log2FC rankings.

By default, however, GSEA uses the signal to noise ratio, which scales the magnitude of the expression difference by the standard deviation of the expression levels. This would generally be an improvement over running the log2FC (in Preranked or otherwise) in isolation, as the magnitude of the log2FC isn't inherently linked to the significance and can be highly affected by expression variability. The method of scaling the log2FC by the -log10(pValue) attempts to address this by rescaling the distribution using the significance of the change.

I think in cases where you do not have enough samples to run phenotype permutation, the significance-scaled log2FC distribution might be the best option. However, as these methods have not been rigorously benchmarked against each other, we can't offer a hard, official recommendation. In your case, assuming you have sufficient sample numbers in all cases, phenotype permutation mode is preferred over either standard-in-gene-set-perm mode or Preranked.
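To make the three ranking metrics discussed above concrete, here is a minimal Python sketch. It is not GSEA's implementation: the function names are hypothetical, the intensities are made-up toy values on a linear scale, the sample standard deviation is an assumption, and any adjustments GSEA applies to the standard deviation internally are omitted.

```python
import math
from statistics import mean, stdev

def signal_to_noise(a, b):
    """Difference of class means scaled by the sum of class standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def log2_ratio_of_classes(a, b):
    """log2 of the ratio of class means (inputs on a linear, non-log scale)."""
    return math.log2(mean(a) / mean(b))

def signed_significance(log2fc, p_value):
    """sign(log2FC) * -log10(p-value), the metric quoted in the question."""
    if log2fc == 0:
        return 0.0
    return math.copysign(1.0, log2fc) * -math.log10(p_value)

# Two toy proteins with the same fold change but different variability.
stable_a, stable_b = [200, 210, 190, 200], [100, 105, 95, 100]
noisy_a, noisy_b = [200, 350, 50, 200], [100, 175, 25, 100]

for label, a, b in (("stable", stable_a, stable_b), ("noisy", noisy_a, noisy_b)):
    print(label, log2_ratio_of_classes(a, b), round(signal_to_noise(a, b), 3))
```

Both toy proteins have a log2 ratio of 1.0, but the signal to noise ratio is much lower for the noisy one, which is the behavior the paragraph above describes. The p-value passed to `signed_significance` would come from whatever differential abundance test you already run.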

Hopefully this helps. Let me know if you have any additional questions, or if there was anything I failed to address.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego


Daniel Jian-Ho Yaw

Oct 29, 2025, 3:29:17 AM
to gsea-help
Hi Anthony, 

Thank you very much for your detailed response. It has clarified most of my queries. But I do have some follow-up questions, if I may.
  
- As proteomics does not have coverage as exhaustive as transcriptomics, would you recommend filtering the gene set (which I've obtained from ELTE bioinformatics https://github.com/ELTEbioinformatics/GMT_files_for_mulea/tree/main/GMT_files) to include only the detected proteins, or should we use the gene set in its entirety? I tried filtering the gene set based on the detected proteins, and when comparing with the results using the original gene set, there are fewer gene set hits for the filtered gene sets. I'm unsure whether the difference between the filtering that GSEA does automatically and my filtering method comes from the statistics or from a different manner of processing the data. 
- For multiple gene sets (i.e., GO, KEGG, Transcription factors, Reactome pathways), should one process all the gene sets together or individually? I tried processing them together and individually. While the smaller gene sets (KEGG and Reactome pathways) give similar results, the larger gene sets (GO, Transcription factors) don't do well (far fewer gene set hits). 

Thank you very much!

Thank you, 
Dan Yaw 

Anthony Castanza

Oct 29, 2025, 5:35:20 PM
to gsea-help
Hi Dan,

GSEA performs automatic filtration of the input gene sets to restrict them to the gene universe presented in the input expression/ranked gene list. You should not need to perform any additional filtration there. If you manually filtered the sets for only your detected proteins, and this didn't match the result of GSEA performing what should be the same filtering (if I'm understanding correctly), I might suspect an error in the manual processing. Are GSEA's post-filtering set member counts matching yours?
That said, some variability in GSEA results between runs is expected because a random seed is used to construct the permutation distribution, but the effects of this should be minor and only impact sets where the scores would be considered "borderline" anyway. You can fix the seed in the advanced fields parameters, and the seed for a run with a random initialization should be given on the report index in the comments section. Fixing this seed value should maintain absolute consistency in the generation of the distribution.
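One way to check whether manual filtering matches the post-filtering set sizes reported by GSEA is to intersect each set with the detected universe and count what remains. The sketch below uses made-up identifiers and a hypothetical function name; the default size limits of 15 and 500 are an assumption about the usual min/max set size settings, so substitute whatever your run actually used.

```python
def restrict_to_universe(gene_sets, universe, min_size=15, max_size=500):
    """Intersect each set with the detected universe, then apply size limits.

    Returns (kept, dropped): dicts mapping set name to its restricted members.
    """
    universe = set(universe)
    kept, dropped = {}, {}
    for name, members in gene_sets.items():
        present = set(members) & universe
        target = kept if min_size <= len(present) <= max_size else dropped
        target[name] = present
    return kept, dropped

# Toy data: identifiers and set names are placeholders, not real annotations.
gene_sets = {
    "SET_A": {"P1", "P2", "P3", "P4"},
    "SET_B": {"P4", "P5"},
    "SET_C": {"P6", "P7", "P8"},
}
detected = {"P1", "P2", "P4", "P6"}

kept, dropped = restrict_to_universe(gene_sets, detected, min_size=2)
for name, members in kept.items():
    print(name, len(members), "members after restriction")
print("dropped by size limits:", sorted(dropped))
```

If the counts from a check like this differ from the set sizes in the GSEA report, mismatched identifiers between the gene set file and the dataset are a likely place to look.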

I would generally recommend running collections generated from independent database sources separately. The massive levels of inter-set redundancy between, for example, KEGG and Reactome, can adversely affect GSEA's FDR calculation. That said, performing this test with the multiple sources lumped together can give some idea of whether one source is a slightly better fit for the data overall, if the sets from that database are generally being favored by the FDR computation.
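To get a feel for how much two collections overlap before deciding whether to combine them, pairwise Jaccard similarity between their sets is one simple measure. This is an illustrative sketch only: it is not something GSEA computes, the set names and members are invented, and the 0.5 threshold is arbitrary.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundant_pairs(coll_a, coll_b, threshold=0.5):
    """List (name_a, name_b, similarity) for cross-collection pairs at or above threshold."""
    pairs = []
    for name_a, set_a in coll_a.items():
        for name_b, set_b in coll_b.items():
            score = jaccard(set_a, set_b)
            if score >= threshold:
                pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy collections with placeholder names and members.
collection_1 = {"KEGG_X": {"P1", "P2", "P3", "P4"}}
collection_2 = {
    "REACTOME_X": {"P1", "P2", "P3", "P5"},
    "REACTOME_Y": {"P7", "P8"},
}

for name_a, name_b, score in redundant_pairs(collection_1, collection_2):
    print(f"{name_a} vs {name_b}: Jaccard = {score:.2f}")
```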

As always, let me know if you have any additional questions.


-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego