background set and ranking metrics

Candice Contet

Apr 7, 2022, 7:33:18 PM

to gsea-help

Hi,

we have been using GSEA to analyze proteomic datasets from mouse brain tissue. We ranked our protein lists using the following metric: S = -log10(p-value) * sign(fold change), such that the most significantly upregulated proteins are at the top of the list and the most significantly downregulated proteins are at the bottom. Several of the MSigDB C8 gene sets that are enriched at either end of our lists are related to astrocytes.
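
For readers who want to reproduce this kind of ranking, here is a minimal sketch of the signed significance metric described above. The function names, the tuple layout, and the protein names and values are hypothetical and for illustration only; the sketch also assumes every p-value is strictly greater than zero.

```python
import math

def signed_log_p(p_value, log_fc):
    """Return -log10(p-value) carrying the sign of the fold change."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0, or -1
    return -math.log10(p_value) * sign

def rank_proteins(results):
    """results: iterable of (name, log_fc, p_value) tuples.

    Returns (name, score) pairs sorted from most significantly
    upregulated to most significantly downregulated.
    """
    scored = [(name, signed_log_p(p, fc)) for name, fc, p in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical proteins and values, not real data.
example = [("ProtA", 1.2, 0.001), ("ProtB", -0.8, 0.0001), ("ProtC", 0.1, 0.5)]
ranked = rank_proteins(example)
```

Every protein with a test statistic stays in the list; nothing is filtered by significance before ranking.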

We are worried that this enrichment may result from the enrichment of astrocytic markers in our input lists compared to the entire genome (i.e., sample source bias). Can you clarify what background set GSEA is using to determine the significance of gene set enrichment? Is it restricted to the proteins from our ranked list or is it the entire genome?

As we were pondering this potential issue, we came across the SetRank program (Simillion et al. BMC Bioinformatics (2017) 18:151), which explicitly addresses the issue of background set selection. Would you be able to comment on the advantages and disadvantages of using SetRank instead of GSEA?

One more question: can you indicate whether including the sign of fold change in our ranking metrics is appropriate or whether we should instead rank by significance only, regardless of the direction of change (such that the most significantly altered proteins would be at the top of our ranked lists, while those at the bottom would be the least affected)?

Thanks in advance for your input!

Candice

Anthony Castanza

Apr 7, 2022, 7:48:00 PM

to gsea...@googlegroups.com

Hi Candice,

GSEA's background is restricted to the genes in the ranked list. This is why it is important not to apply p-value cutoffs when preparing the ranked list: all the genes for which a test statistic is available should be provided.

To the best of my understanding, there isn't really a good way to correct for detection bias in the original experiment in the context of enrichment analysis: if your experiment consists primarily of astrocytes, astrocytic pathways are what is going to be assessed. However, GSEA will not introduce a bias from the overrepresentation of cell-type-specific genes relative to the genomic background, because GSEA is not aware of the genomic background. It sees only the experiment-specific universe of expressed genes.
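
To make the point about the background concrete, here is a simplified sketch of the weighted running-sum enrichment score described in the original GSEA publication. It is not the GSEA software's implementation, and the function name, argument layout, and toy values are assumptions for illustration. Note that the only genes that enter the calculation are those in the ranked list; gene set members missing from the list have no effect.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked: list of (gene, score) pairs sorted by descending score.
    gene_set: set of gene names; members absent from `ranked` are ignored.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_total = len(ranked)          # background = the ranked list only
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    hit_weights = [abs(score) ** weight if hit else 0.0
                   for (_, score), hit in zip(ranked, hits)]
    norm = sum(hit_weights)
    if norm == 0:
        raise ValueError("all gene set members have a score of zero")
    miss_step = 1.0 / (n_total - n_hits)

    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, hits):
        # Step up at gene set members, step down at all other ranked genes.
        running += w / norm if hit else -miss_step
        if abs(running) > abs(best):
            best = running         # keep the maximum deviation from zero
    return best

# Toy ranked list, hypothetical values.
toy = [("a", 3.0), ("b", 2.0), ("c", -1.0), ("d", -2.0)]
es_top = enrichment_score(toy, {"a", "b", "not_detected"})
es_bottom = enrichment_score(toy, {"c", "d"})
```

A positive score indicates enrichment at the top of the list and a negative score indicates enrichment at the bottom. Significance testing by permutation is left out of this sketch.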

You would want to use the sign of the log2FC here, yes. This allows GSEA to score upregulated and downregulated pathways separately. In fact, while sign(log2FC) * -log10(p-value) is common, it's also quite common to run GSEA with just the log2FC as the metric. In the standard (i.e., non-preranked) mode of GSEA, the signal-to-noise ratio is used. This scales the difference between the means of the two groups by the sum of their standard deviations. You might want to consider an alternative statistic that introduces more information like this into the gene ranking. What are you using to compute differential expression here? If it's something like DESeq2, it might return a test statistic that could be used directly in GSEA.
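
As a rough illustration of the signal-to-noise ratio mentioned above, the following sketch computes the difference in group means divided by the sum of the group standard deviations for a single gene. The function name and the example values are hypothetical, and the sketch leaves out any variance adjustments that the GSEA software itself may apply.

```python
import statistics

def signal_to_noise(group_a, group_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene.

    Each group needs at least two measurements so that a sample
    standard deviation can be computed.
    """
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    sd_sum = statistics.stdev(group_a) + statistics.stdev(group_b)
    if sd_sum == 0:
        raise ValueError("both groups have zero variance")
    return mean_diff / sd_sum

# Hypothetical abundance values for one protein in two conditions.
s2n = signal_to_noise([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Unlike a plain fold change, this metric shrinks toward zero when the measurements within each group are highly variable.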

While I can't speak to the specifics of SetRank, from my understanding it takes a similar approach, accepting as input a list of all genes for which a statistic is available. It might be worthwhile to do a comparative analysis of the results from both tools!

Let me know if you have any additional questions.

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Candice Contet

Apr 7, 2022, 9:13:37 PM

to gsea-help

Hi Anthony,

thanks for clarifying the issue of background set. This scenario is what I was hoping for, and it means that the enrichment of astrocytic markers among our most significantly affected proteins is not related to sample source bias (we analyzed brain tissue, not pure astrocytes - the enrichment of astrocytic markers is a very interesting/informative outcome).

We used limma to analyze differential expression, so for each protein, we have log FC, average expression, nominal p-value, and adjusted p-value. I didn't want to rely solely on fold change to rank the proteins because small changes of strong statistical significance may be more biologically meaningful than large, but highly variable, changes. Thanks for confirming that we did the right thing by incorporating the fold change direction in our ranking metrics. That makes sense based on the GSEA output (na_pos vs na_neg phenotypes).

We are currently comparing the output from GSEA and SetRank for the same datasets. Reassuringly, those same astrocyte-related gene sets are also among the most significant with SetRank (in fact, at much higher significance than with GSEA). Now that I know that the same background set is being used in both analyses, I am debating whether SetRank might be advantageous in some respects (e.g., in the way it handles the overlap and relationships between gene sets) but disadvantageous in others (e.g., it does not separate enrichment among upregulated proteins from enrichment among downregulated proteins, which could be a good or a bad thing). If you can think of other factors I should weigh in my comparison of the two approaches, don't hesitate to bring them to my attention (I am a bioinformatics novice and eager to learn/understand!).