Using AUC/Wilcoxon rank-sum test to pre-rank genes


Evan

Aug 10, 2023, 11:56:35 AM
to gsea-help

Hi GSEA Team,


I would like to run GSEA on a large single-cell data set (~600,000 cells) using some custom gene sets. I was looking for a way to do the analysis in a timely manner, as the FindAllMarkers command from Seurat was taking a very long time to rank the genes. I came across the tutorial linked below, which uses the wilcoxauc command from the presto package to rank genes and the fgsea package to perform the actual enrichment analysis.


In the tutorial, the AUC statistic is used to rank the genes and as input to the GSEA. Is the AUC from the Wilcoxon rank-sum test a reasonable metric to use? The wilcoxauc command also provides a logFC variable. However, when I use the logFC to rank the genes, I get larger p-values. I would like to maximize sensitivity, so I would prefer to use AUC, but I want to make sure this is a sound method. Any info you could provide would be greatly appreciated!
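For readers unfamiliar with the statistic, here is a minimal Python sketch of how an AUC relates to the Wilcoxon rank-sum (Mann-Whitney U) test: AUC = U / (n_in * n_out), with tied pairs counted as half a win. This is the textbook relation, not presto's implementation. The function names, gene names, and expression values are made up for illustration.

```python
def auc_statistic(in_group, out_group):
    """AUC for one gene: P(x_in > x_out) + 0.5 * P(x_in == x_out).

    Equivalent to the Mann-Whitney U statistic divided by n_in * n_out.
    Quadratic in the number of cells, so only suitable for toy inputs.
    """
    wins = 0.0
    for x in in_group:
        for y in out_group:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5  # a tied pair counts as half a win
    return wins / (len(in_group) * len(out_group))


def rank_genes_by_auc(expr_in, expr_out):
    """Return (gene, AUC) pairs sorted from highest to lowest AUC."""
    aucs = {g: auc_statistic(expr_in[g], expr_out[g]) for g in expr_in}
    return sorted(aucs.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expression values: cells in the cluster of interest vs. all other cells.
expr_in = {
    "GENE_UP": [3, 5, 0, 4],
    "GENE_ZERO": [0, 0, 0, 0],
    "GENE_DOWN": [0, 1, 0, 0],
}
expr_out = {
    "GENE_UP": [0, 0, 1, 2],
    "GENE_ZERO": [0, 0, 0, 0],
    "GENE_DOWN": [2, 3, 0, 4],
}

ranked = rank_genes_by_auc(expr_in, expr_out)
for gene, auc in ranked:
    print(gene, auc)
```

An AUC above 0.5 means the gene tends to be higher in the cluster of interest, and below 0.5 means lower. A gene with no expression in either group gets exactly 0.5, so every such gene receives the same ranking statistic.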


Link to tutorial: https://crazyhottommy.github.io/scRNA-seq-workshop-Fall-2019/scRNAseq_workshop_3.html


Anthony Castanza

Aug 10, 2023, 12:54:24 PM
to gsea-help
Hi Evan,

I would note this warning from the tutorial you linked:

"Warning in fgsea(fgsea_sets, stats = ranks, nperm = 1000): There are ties in the preranked stats (21% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results."

This is a significant issue when performing GSEA on single-cell data, because so many genes have zero expression. This can introduce substantial uncertainty that is not accounted for in the significance statistics. We are actively working to mitigate it with methodological advancements, but we don't have a specific approach to advise as of yet.
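To gauge how much of a pre-ranked list is affected before running an enrichment tool, one can count the entries that share their statistic with at least one other gene. The Python sketch below is only an illustration: `tied_fraction` is a hypothetical helper, the values are made up, and it is not necessarily how fgsea computes the percentage in its warning.

```python
from collections import Counter


def tied_fraction(stats):
    """Fraction of genes whose ranking statistic is shared with another gene.

    `stats` maps gene name -> ranking statistic (for example an AUC).
    """
    counts = Counter(stats.values())
    tied = sum(1 for value in stats.values() if counts[value] > 1)
    return tied / len(stats)


# Hypothetical ranking statistics; B, C and D are unexpressed and all sit at 0.5.
stats = {"A": 0.91, "B": 0.50, "C": 0.50, "D": 0.50, "E": 0.12}

frac = tied_fraction(stats)
print(f"{frac:.0%} of the ranked genes are tied")
```

Because the relative order of tied genes is arbitrary, a large tied fraction means the enrichment score of any gene set that overlaps the tied block can change from run to run.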

That said, this issue aside, Tommy's approach appears reasonable to me but it is not a ranking metric I have specifically tested. Unfortunately the area of how to apply GSEA methods to single cell data is not one that's particularly well-developed.

Sorry I couldn't be of more assistance.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego


Evan

Aug 11, 2023, 3:43:19 PM
to gsea-help

Hi Anthony,

Thank you so much for your reply! Would it make sense to filter out genes with zero/no expression before running the analysis? If so, do you have any suggested criteria for filtering? 

I have tried a few other methods of measuring gene set activity in single-cell data, including the AUCell package and Seurat's AddModuleScore. The results were consistent with the GSEA: the cell types with the greatest activity under these two methods were also the cell types with the greatest normalized enrichment scores. I just wanted to confirm that the Wilcoxon rank-sum test/AUC makes sense as a ranking metric.

Anthony Castanza

Aug 18, 2023, 1:17:30 PM
to gsea-help
Hi Evan,

My apologies for the delay in getting back to you. I still don't have official recommendations for how to do this analysis for single-cell data, but my inclination would be to remove genes that are not expressed in any cluster, i.e. to use a threshold of <1 TPM summed across all cells. We recommend a similar step for bulk RNA-sequencing data, so I think it is a reasonable approach here. I probably wouldn't go as far as filtering the non-expressed genes individually for each cluster: that would cause the gene universe to vary substantially between clusters and would likely adversely impact your ability to compare scores.
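A minimal Python sketch of that idea, under stated assumptions: expression is held as a plain dictionary of per-cell values in TPM-like units, the cutoff of 1.0 mirrors the "<1 TPM summed across all cells" suggestion, and all names and numbers are hypothetical. The filter is computed once over all cells and then applied unchanged to every cluster's ranking, so the gene universe stays identical across clusters.

```python
def expressed_genes(expr, min_total=1.0):
    """Genes whose expression summed over ALL cells reaches `min_total`."""
    return {gene for gene, values in expr.items() if sum(values) >= min_total}


def restrict_ranking(ranking, keep):
    """Drop genes outside the shared universe from one cluster's ranking."""
    return {gene: stat for gene, stat in ranking.items() if gene in keep}


# Hypothetical per-cell expression (all clusters pooled).
expr = {
    "G1": [0.0, 0.0, 0.0, 0.0],   # never expressed -> removed
    "G2": [0.2, 0.3, 0.0, 0.1],   # total 0.6 -> removed
    "G3": [5.0, 0.0, 2.0, 1.0],   # total 8.0 -> kept
    "G4": [0.5, 0.5, 0.2, 0.0],   # total 1.2 -> kept
}

# Hypothetical per-cluster ranking statistics (e.g. AUC values).
rankings = {
    "cluster_0": {"G1": 0.50, "G2": 0.50, "G3": 0.90, "G4": 0.55},
    "cluster_1": {"G1": 0.50, "G2": 0.48, "G3": 0.20, "G4": 0.45},
}

keep = expressed_genes(expr)  # computed once, globally
filtered = {name: restrict_ranking(r, keep) for name, r in rankings.items()}
print(sorted(keep))
```

Filtering per cluster instead would give each ranking a different set of genes, which is the situation the advice above warns against.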

-Anthony


Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

