Generating ranked lists for preranked GSEA


Alena

Apr 6, 2021, 2:55:41 PM
to gsea-help

I am using sign(log2(FC))*-log(p value) to generate a rank for genes obtained from single cell RNA sequencing analysis for input into GSEA.

I would like to check that the individual ranking values are not important (as long as classic is selected, where the weighting exponent p = 0), but what is relevant is that the ranking produces a gene order, and that the genes at the top and bottom of the list are more important than the ones in the middle.

Because some of the p values are so small (less than 10^-312), some of the rankings are calculated to be -Inf and Inf which cannot be processed by GSEA. Several posts (https://www.biostars.org/p/276783/, https://www.biostars.org/p/343383/, https://www.biostars.org/p/297029/) suggest replacing the Inf and -Inf values with a number larger/smaller than the maximum/minimum values already present in the ranking. Is this an appropriate approach? Does it matter what this value is?

Thank you

Alena

Anthony Castanza

Apr 6, 2021, 3:12:56 PM
to gsea...@googlegroups.com

Hi Alena,

 

If you set the Enrichment statistic to “classic” the ranking value is not directly used in the computation, only the positions in the gene lists are used. I should mention here that we don’t recommend this procedure.
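As a rough illustration of the difference, here is a minimal Python sketch of the running-sum enrichment score with a weighting exponent p (p = 0 for classic, p = 1 for weighted). The function name, gene names, and scores are all made up for the example, and this is not the GSEA implementation itself; it assumes the list is already sorted from highest to lowest score and that the gene set overlaps the list.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score.

    ranked: list of (gene, score) pairs, sorted from highest to lowest score.
    gene_set: set of gene names (assumed to overlap the ranked list).
    p: weighting exponent; 0 = classic, 1 = weighted.
    """
    # Each hit contributes |score|**p; with p = 0 every hit contributes 1.
    hit_weights = [abs(s) ** p if g in gene_set else 0.0 for g, s in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)

    running, best = 0.0, 0.0
    for weight, (gene, _) in zip(hit_weights, ranked):
        if gene in gene_set:
            running += weight / total_hit
        else:
            running -= 1.0 / n_miss
        # Keep the maximum deviation from zero, with its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Two hypothetical rankings: same gene order, very different values.
ranked_a = [("A", 9.0), ("B", 4.0), ("C", 1.0), ("D", -2.0), ("E", -7.0)]
ranked_b = [("A", 300.0), ("B", 2.5), ("C", 0.1), ("D", -0.2), ("E", -50.0)]
gene_set = {"A", "C"}

classic_a = enrichment_score(ranked_a, gene_set, p=0)
classic_b = enrichment_score(ranked_b, gene_set, p=0)
weighted_a = enrichment_score(ranked_a, gene_set, p=1)
weighted_b = enrichment_score(ranked_b, gene_set, p=1)

print("classic: ", classic_a, classic_b)
print("weighted:", weighted_a, weighted_b)
```

With these toy values the two classic scores agree because only the order matters, while the two weighted scores differ because the values themselves enter the sum.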

 

I don’t know what tool you’ve used to calculate your differential expression, but if you have access to the test statistic column (it may be called wald stat, test stat, or similar), I’d use that instead of the ranking formula you’ve proposed, together with the weighted enrichment statistic. If not, then assigning arbitrary values just beyond the observed extremes (max + n for Inf, min - n for -Inf), as you’ve suggested, while not ideal, is also what I’d do in your situation.
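A minimal sketch of that replacement, in Python rather than R, might look like the following. The function names, the `step` parameter, and the example genes are hypothetical; it assumes -log10 for the log, that a p value of exactly 0 is what produces the infinities, and that at least one gene has a finite score.

```python
import math


def signed_log_p(log2fc, p_value):
    """sign(log2FC) * -log10(p); infinite when p underflows to 0."""
    sign = (log2fc > 0) - (log2fc < 0)
    if sign == 0:
        return 0.0  # avoid 0 * inf = nan when the fold change is exactly 0
    if p_value == 0.0:
        return sign * math.inf
    return sign * -math.log10(p_value)


def rank_genes(rows, step=1.0):
    """rows: iterable of (gene, log2fc, p_value) tuples.

    Returns (gene, score) pairs sorted from highest to lowest score, with
    +Inf replaced by max(finite) + step and -Inf by min(finite) - step.
    """
    scores = {gene: signed_log_p(fc, p) for gene, fc, p in rows}
    finite = [s for s in scores.values() if math.isfinite(s)]
    upper, lower = max(finite) + step, min(finite) - step

    cleaned = {}
    for gene, score in scores.items():
        if score == math.inf:
            cleaned[gene] = upper
        elif score == -math.inf:
            cleaned[gene] = lower
        else:
            cleaned[gene] = score
    return sorted(cleaned.items(), key=lambda item: item[1], reverse=True)


# Hypothetical differential expression rows: (gene, log2FC, p value).
rows = [
    ("GeneA", 2.1, 0.0),
    ("GeneB", 1.3, 1e-40),
    ("GeneC", -0.8, 1e-5),
    ("GeneD", -1.9, 0.0),
    ("GeneE", 0.2, 0.5),
]
ranked = rank_genes(rows)
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

Note that every gene with an infinite score receives the same replacement value, so those genes end up tied with each other at the ends of the list.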

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

http://gsea-msigdb.org/

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/08a7b72f-19ca-4d5a-a736-641aa1340f0dn%40googlegroups.com.

Alena

Apr 6, 2021, 3:32:36 PM
to gsea-help
Hi Anthony,

Thank you for your prompt response.

I am using Seurat (from the Satija lab) to analyze my single cell data. Specifically, I am using FindMarkers() to identify differentially expressed genes between different cell clusters/populations on my UMAP plot. The outputs of this function are p values, adjusted p values, and average log fold change for each gene, in addition to percentage of cells which express the gene in the two groups I am comparing. I am following Juri Reimand's protocol (from the Bader lab) which takes GSEA data and plots it with the EnrichmentMap plugin on Cytoscape (https://baderlab.github.io/EnrichmentMap_Protocol/gsea-enrich.html#gsea-enrich).

Are you suggesting I take the adjusted p value and use that directly? I considered doing that but I think that I will lose information about the direction of regulation (whether up or down).

Thanks again.

Alena

Anthony Castanza

Apr 6, 2021, 3:39:21 PM
to gsea...@googlegroups.com
Hi Alena,

No, unfortunately Seurat doesn't give the test statistic information I was referring to. In that case the ranking formula you've proposed is probably the best option.


-Anthony


Alena

Apr 6, 2021, 3:52:14 PM
to gsea-help
So in this case is it appropriate to replace the Inf and -Inf values with values that are as close as possible to the maximum and minimum rankings for that list of genes generated by the FindMarkers function? Would I still be able to use this list for a weighted (p=1) analysis?

Alena

Anthony Castanza

Apr 6, 2021, 3:59:55 PM
to gsea...@googlegroups.com
Hi Alena,

If you're using the classic (non-weighted) statistic, then running with arbitrary values to replace the infinities is no problem since the exact values aren't used anyway.

For a weighted statistic we generally recommend that the values be directly biologically meaningful, which a p-value isn't. You can try it though; people do use that ranking metric in weighted mode and it seems to work, but it's not something we actively support. Replacing the infinities will only under-weight those genes with respect to their "true" infinite value.
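To make the under-weighting concrete: in weighted mode each gene-set member advances the running sum in proportion to |score|^p divided by the sum of |score|^p over the set, so the chosen replacement value decides how much of the set's weight that gene carries. A small Python sketch, with a hypothetical helper name, made-up gene names, and made-up scores:

```python
def hit_weight_share(scores, gene, p=1.0):
    """Fraction of a gene set's total hit weight contributed by one gene.

    scores: dict of gene -> ranking score for the members of one gene set.
    p: weighting exponent; 0 = classic, 1 = weighted.
    """
    weights = {g: abs(s) ** p for g, s in scores.items()}
    return weights[gene] / sum(weights.values())


# Finite scores for three members of a hypothetical gene set.
finite_members = {"GeneB": 40.0, "GeneF": 25.0, "GeneG": 10.0}

weighted_shares = []
classic_shares = []
for replacement in (41.0, 100.0, 1000.0):
    # GeneA originally scored +Inf and gets an arbitrary finite replacement.
    scores = dict(finite_members, GeneA=replacement)
    weighted_shares.append(hit_weight_share(scores, "GeneA", p=1))
    classic_shares.append(hit_weight_share(scores, "GeneA", p=0))

print("weighted:", [round(s, 3) for s in weighted_shares])
print("classic: ", classic_shares)
```

The weighted share grows with the replacement value, whereas the classic share stays at one quarter regardless of what value is chosen.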


-Anthony


Alena

Apr 6, 2021, 7:40:59 PM
to gsea-help
Hi Anthony,
Thank you for your insight. I think I have one last question. My graph of avglogFC (x axis) vs pvalue (y axis) looks like this (see attachment). 
Is it still appropriate to use avglog(FC) as a weighted statistic?
Thank you
Alena

Screen Shot 2021-04-06 at 7.29.34 PM.png

Anthony Castanza

Apr 6, 2021, 8:04:03 PM
to gsea...@googlegroups.com
Hi Alena,

Yes, I would be concerned about the genes with a large log2(FC) but a p value close to 1. Dealing with those kinds of genes is precisely why people have developed alternative ranking metrics like the one you proposed in your original message.
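A toy comparison of the two rankings shows the effect. The gene names and numbers below are invented for illustration, and the sketch assumes -log10 for the log and p values strictly greater than 0:

```python
import math

# Hypothetical rows: (gene, log2FC, p value).
genes = [
    ("GeneNoisy", 3.0, 0.9),     # large fold change, but p close to 1
    ("GeneStrong", 1.5, 1e-30),
    ("GeneModest", 0.5, 1e-4),
    ("GeneDown", -1.2, 1e-12),
]


def signed_metric(log2fc, p_value):
    """sign(log2FC) * -log10(p), assuming p > 0."""
    return math.copysign(1.0, log2fc) * -math.log10(p_value)


# Rank by fold change alone, then by the signed p-value metric.
by_fold_change = sorted(genes, key=lambda row: row[1], reverse=True)
by_signed_p = sorted(
    genes, key=lambda row: signed_metric(row[1], row[2]), reverse=True
)

print("log2FC only:   ", [gene for gene, _, _ in by_fold_change])
print("signed -log10p:", [gene for gene, _, _ in by_signed_p])
```

Ranked by fold change alone, the poorly supported gene sits at the very top of the list; with the signed p-value metric it drops toward the middle, where it has little influence on the enrichment score.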


