Preranked GSEA - pathways that look significant with heatmaps are not significant


Jiwoon Park

Aug 6, 2021, 4:09:51 PM
to gsea-help
Hello,

I'm running a preranked GSEA analysis on proteomics data generated by mass spectrometry. I have triplicates from a control and a treatment. I ranked my 2000 proteins using log(pvalue) * (sign of FC) and ran the analysis, but my results are very different from what I expected, and I'm trying to figure out what happened.

I had a specific pathway I wanted to look for (I'll call it pathway A), so I looked up all the proteins involved in pathway A, created a heatmap of their abundances, and found that protein expression was much higher in the treatment than in the control.

But when I ran the GSEA analysis, the p value for pathway A was very large (0.6). I checked the rank of the leading edges that GSEA identified in my 2000 ranked protein list, and they were all very high (all before rank 200). I find it really hard to imagine that GSEA can randomly assign ranks to my 2000 protein list and still find all my proteins in pathway A within the top 200. 

Is there anything I should be thinking about when I'm trying to interpret this result?

Anthony Castanza

Aug 6, 2021, 4:25:04 PM
to gsea...@googlegroups.com

Hello,

 

When you said you ranked using log(pvalue) * (sign of FC), did you mean -log? The transformation should be a negative log, so that smaller p-values give larger scores.
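As a minimal sketch of the ranking metric being discussed, the snippet below computes -log10(p) * sign(FC) for a few made-up proteins. The function name, the protein names, and the floor applied to p = 0 are all assumptions for illustration, not part of GSEA. With a plain log instead of -log, the signs flip, and the most significant up-regulated proteins land at the bottom of the list.

```python
import math

def signed_rank_metric(p_value, log2_fold_change):
    """Signed -log10(p): large positive = strongly up, large negative = strongly down."""
    # Assumption: floor the p-value so an exact 0 does not raise a math domain error.
    p = max(p_value, 1e-300)
    # Sign of the fold change as -1, 0, or +1.
    sign = (log2_fold_change > 0) - (log2_fold_change < 0)
    return -math.log10(p) * sign

# Hypothetical input rows: (protein, p-value, log2 fold change).
proteins = [
    ("PROT_A", 0.001, 1.8),
    ("PROT_B", 0.5, 0.1),
    ("PROT_C", 0.0001, -2.2),
]

# Sort descending so up-regulated, significant proteins come first.
ranked = sorted(
    ((name, signed_rank_metric(p, fc)) for name, p, fc in proteins),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}\t{score:.3f}")
```

The printed two-column layout mirrors a tab-separated ranked list with one feature and one score per line.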

 

Also, I'm not sure what you mean by GSEA "randomly assigning ranks". In GSEA Preranked mode, GSEA uses the user-supplied ranks.

 

With regard to the p-value calculation, GSEA isn't designed to operate on proteomic datasets and may struggle with the small size of the dataset you're using as input. In Preranked mode, GSEA performs significance testing by gene-set permutation: it generates random gene sets of the same size as the gene set being tested and scores them against your ranked list, to estimate how likely it is that a gene set of that size would be enriched by chance. If your "pathway A" is relatively large in comparison to your relatively small mass spec gene list (the expected size of an input dataset is 10,000-20,000 genes/features), then GSEA might not be able to compute reasonable nulls.
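To make the gene-set permutation idea concrete, here is a rough sketch, not GSEA's actual implementation. It assumes a weighted running-sum enrichment score and a nominal p-value taken from the null scores that share the sign of the observed score. All function names and the toy data are hypothetical.

```python
import random

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running sum over the ranked list; returns the largest deviation from zero."""
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    if total_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for w, h in zip(hit_weights, hits):
        # Step up at set members (scaled by their score), step down otherwise.
        running += w / total_hit if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_permutation_p(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Compare the observed ES with the ES of random gene sets of the same size."""
    rng = random.Random(seed)
    members = set(gene_set) & set(ranked_genes)
    observed = enrichment_score(ranked_genes, scores, members)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, len(members))))
        for _ in range(n_perm)
    ]
    # Use only the part of the null with the same sign as the observed score.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    return observed, extreme / len(same_sign)

# Toy ranked list: 200 hypothetical genes with scores running from +4.975 to -4.975.
genes = [f"G{i}" for i in range(200)]
scores = [(99.5 - i) / 20 for i in range(200)]
top_set = set(genes[:10])  # a set sitting entirely at the top of the list

es, p = gene_set_permutation_p(genes, scores, top_set, n_perm=200)
print(f"ES = {es:.3f}, nominal p = {p:.3f}")
```

Note that the ranks themselves are never shuffled here. Only the membership of the gene set is randomized, which is why the null depends on both the set size and the shape of the ranked list.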

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego


Jiwoon Park

Aug 6, 2021, 4:49:00 PM
to gsea-help
Hello Anthony,

Yes I meant -log, sorry about that!

I thought GSEA was randomly assigning ranks to the genes, then calculating null enrichment scores with that new ranked list to compare with the preranked list-computed enrichment scores, but according to your explanation, looks like I misunderstood how permutation tests are done.

Perhaps this is also a misunderstanding, but is there really any pathway that is relatively large enough in comparison to the input list? For instance, I'm looking specifically at the KEGG Glycolysis pathway, which only contains 62 genes. But if GSEA expects 10,000 genes as input, 62 sounds like a really small number compared to 10,000, doesn't it?

Anthony Castanza

Aug 6, 2021, 5:06:04 PM
to gsea...@googlegroups.com

The mode of calculating nulls that you're describing is similar to the one used in the Phenotype permutation mode of GSEA, which computes the nulls by scrambling your sample labels and computing differential expression between the randomized groups. Because a preranked list has no groups and no per-sample information, this mode of null generation is not available in GSEA Preranked.
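For contrast, a phenotype-style permutation can be sketched as follows. This is an illustration only: it assumes a simple difference of group means as the ranking metric, which is a stand-in rather than the metric GSEA itself uses, and the expression values and function names are made up. The point is that every shuffle needs the per-sample values, which a preranked list does not carry.

```python
import random
from statistics import mean

def diff_of_means(values, labels):
    """Assumed ranking metric: mean(treatment) - mean(control) for one gene."""
    treated = [v for v, lab in zip(values, labels) if lab == "treatment"]
    control = [v for v, lab in zip(values, labels) if lab == "control"]
    return mean(treated) - mean(control)

def permuted_rankings(expression, labels, n_perm=5, seed=0):
    """Yield one re-ranked gene list per shuffle of the sample labels."""
    rng = random.Random(seed)
    for _ in range(n_perm):
        shuffled = labels[:]          # copy so the caller's labels stay intact
        rng.shuffle(shuffled)
        metric = {gene: diff_of_means(vals, shuffled) for gene, vals in expression.items()}
        yield sorted(metric, key=metric.get, reverse=True)

# Hypothetical per-sample abundances: three control samples, then three treatment samples.
labels = ["control"] * 3 + ["treatment"] * 3
expression = {
    "G1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],
    "G2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "G3": [4.0, 3.8, 4.1, 1.0, 1.1, 0.9],
}

for ranking in permuted_rankings(expression, labels):
    print(ranking)
```

With triplicates there are only 20 distinct ways to split six samples into two groups of three, so a label-shuffling null would be very coarse for a design like this even if per-sample data were supplied.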

 

62 genes is on the smaller end of gene set sizes; the default minimum is 15. GSEA calculates how overrepresented a collection of genes is, as a whole, at one end of the distribution compared to random chance. Until you get to extremely small gene sets, or gene sets that are extremely large relative to the size of the ranked list, the size of the expression list is more important than the size of the gene set. 62 genes out of a 2000-member ranked list is a relative size that should be reasonable, so I would bet that the issue with the p-values has to do with the list's ranking.

 

If a large proportion of those 2000 genes are highly ranked, then the spread of enriched nulls might be skewed. On the enrichment results page for your gene set of interest there will be two plots: the "Enrichment Plot: " and, at the very bottom, the "Random ES Distribution". The Enrichment plot shows you how the positions of your genes of interest compare to the ranked list as a whole. What you'll want to look for in this plot is a skew in the zero cross, or a "stepping"-like appearance in what should be a smooth distribution. These could indicate issues with your ranked list. In the Random ES distribution plot, the normal appearance is a relatively tight but smooth bimodal distribution of the random ES. Deviations from this can indicate issues with generating the null.
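Before opening the plots, the ranked list itself can be checked numerically for the two symptoms mentioned above. The sketch below is a rough proxy under assumed definitions: it reports where the scores cross zero as a fraction of the list, and how many entries share an identical score, since blocks of ties are one plausible source of a "stepping" appearance. The function name and the example scores are hypothetical.

```python
from collections import Counter

def ranked_list_diagnostics(scores):
    """Summarize zero-cross position and tied scores in a ranked list."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    # Fraction of the list with positive scores; about 0.5 means a centered zero cross.
    n_positive = sum(1 for s in ordered if s > 0)
    # Entries whose score is shared with at least one other entry.
    counts = Counter(ordered)
    n_tied = sum(c for c in counts.values() if c > 1)
    return {
        "n": n,
        "zero_cross_fraction": n_positive / n,
        "tied_fraction": n_tied / n,
        "largest_tie_block": max(counts.values()),
    }

# Hypothetical scores: mostly positive, with a block of three identical values.
example_scores = [3.0, 2.0, 2.0, 2.0, 1.0, 0.5, 0.2, 0.1, -0.3, -4.0]
report = ranked_list_diagnostics(example_scores)
print(report)
```

A zero-cross fraction far from 0.5 or a large tied fraction does not prove the ranking is wrong, but either is a reason to look more closely at how the scores were produced.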

 

If you're willing to send those plots so I can take a look and give you my thoughts, you can find the raw images in GSEA's results directory.
