Using GSEA tool without expression data


Didem Döken

Jan 7, 2022, 2:20:47 AM
to gsea-help
Hello!
I have a list of 1200 significant genes and want to compute its overlaps with gene sets in the C2 curated collection. Normally I do this with the online version of MSigDB by clicking "Investigate Gene Sets". However, the online tool accepts a maximum of 500 genes. As far as I can tell, I can't use the GSEA desktop app without expression data. Am I correct? I could also try R, but I couldn't find a way to run the Investigate Gene Sets analysis with R packages.

Can you help me do that? I need to compute overlaps/investigate my gene set without expression data.

Thank you so much,
I appreciate any help or advice!

Anthony Castanza

Jan 7, 2022, 2:25:09 PM
to gsea...@googlegroups.com

Hello,

 

The MSigDB web tools use a hypergeometric overlap test. Tests of this form are not particularly robust and tend to give unreasonably permissive statistics when used with large gene lists. That, combined with the computational limitations of our web server being a shared resource, has resulted in the submission limit you're hitting here.
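For readers unfamiliar with the statistic mentioned above, here is a minimal sketch of a hypergeometric overlap p-value using only the Python standard library. It is a textbook formulation, not the web tool's actual implementation. The function name, the universe size, and the counts in the usage lines are all hypothetical.

```python
from math import comb


def hypergeom_overlap_pvalue(universe_size, set_size, list_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    universe_size: total number of genes considered (an assumption you must choose)
    set_size:      number of genes in the gene set
    list_size:     number of genes in the submitted list
    overlap:       number of genes shared by the list and the set
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: a 1200-gene list, a 200-gene set, a 20000-gene universe.
p_value = hypergeom_overlap_pvalue(20000, 200, 1200, 25)
print(f"overlap p-value: {p_value:.3g}")
```

The result depends directly on the chosen universe size, so the number is only as meaningful as that choice.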

 

You don't necessarily need the full expression data to run GSEA. GSEA supports a mode called "Preranked", where you give it a scored list of genes. So if you have access to the full differential expression results (i.e., log2FC values for every gene, without filtering on significance), I would strongly urge you to consider that approach instead. It is likely to give you results similar to what you would have gotten from the overlap test, but with much more reasonable statistics, along with the other benefits of a "full" enrichment analysis, such as quantitative metrics of the magnitude of enrichment.
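As a rough illustration of preparing input for the Preranked mode, the sketch below formats a table of per-gene scores as two tab-separated columns (gene, score), which is the general shape of a ranked list (.rnk) file. The function name, gene names, and scores are hypothetical, and the exact file requirements should be checked against the GSEA documentation.

```python
def to_rnk_text(scores):
    """Format {gene: score} as tab-separated 'gene<TAB>score' lines, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Hypothetical log2FC values; a real file would hold every gene from the
# differential expression results, not only the significant ones.
example_scores = {"GENE_A": 2.31, "GENE_B": -1.75, "GENE_C": 0.04}
rnk_text = to_rnk_text(example_scores)
print(rnk_text, end="")
```

The resulting text can then be saved with a .rnk extension, for example with `open("ranked_genes.rnk", "w").write(rnk_text)` (the file name is hypothetical).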

 

If you don't have access to that data, does the 1200-gene list contain both positive and negative effects? You might be able to run the positive and negative sides separately through the web tool. There is a little flexibility in the 500-gene cap, so a split list might make it through.
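The split suggested here can be sketched in a few lines, assuming each gene in the list has a signed effect such as a log2FC. The function name and the example values are hypothetical.

```python
def split_by_direction(log2fc_by_gene):
    """Split genes into (positive, negative) lists by the sign of their effect."""
    up = sorted(gene for gene, fc in log2fc_by_gene.items() if fc > 0)
    down = sorted(gene for gene, fc in log2fc_by_gene.items() if fc < 0)
    return up, down


# Hypothetical significant genes with signed log2FC values.
significant = {"GENE_A": 1.8, "GENE_B": -2.2, "GENE_C": 0.9, "GENE_D": -1.1}
up_genes, down_genes = split_by_direction(significant)
print(len(up_genes), "positive;", len(down_genes), "negative")
```

Each of the two lists can then be submitted on its own, provided it fits under the submission cap.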

 

Let me know if neither of those approaches will work for you and perhaps we can come up with an alternative.

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego


Nicole Ho

Aug 6, 2022, 12:19:52 PM
to gsea-help
Hi Anthony,

Thank you very much for your reply! I have the same problem: I have more than 1500 genes and want to compute overlaps with the Hallmark gene sets. Could you explain in more detail how I could do the analysis using log2FC? I have the log2FC for each gene, and a more negative value means the gene ranks higher in my study. Should I use absolute(log2FC) as the ranking metric for each gene so that I can run Preranked? I actually tried this, and it gave me a warning saying there are too few features for GSEA. Moreover, some gene sets were filtered out during the analysis. The message said: "Gene set size filters (min=15, max=10000) resulted in filtering out 36 / 50 gene sets; The remaining 14 gene sets were used in the analysis". I am confused about why the gene sets were filtered. Could you please suggest what I should do to analyze the 1500 genes?

Thank you in advance!

Best regards,
Nicole

On Saturday, January 8, 2022 at 3:25:09 AM [UTC+8], Anthony Castanza wrote:

Anthony Castanza

Aug 8, 2022, 2:58:07 PM
to gsea...@googlegroups.com
Hi Nicole,

In order to do an analysis with ranking information like log2FC, you would have to use GSEAPreranked from the GSEA application. That said, we don't support performing GSEA with restricted datasets like this. GSEA expects information for all expressed genes, not a filtered list of just 1500. If you only have 1500 genes, GSEA cannot reasonably analyze many of the available sets because there is no information for many of the genes in those sets, so they are discarded by GSEA's minimum size threshold. There isn't really much we can do about that, sorry.
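To make the filtering concrete, here is a minimal sketch of the idea, not GSEA's actual code: each gene set is first restricted to the genes present in the dataset, and sets whose remaining size falls outside the thresholds are dropped. The min=15 and max=10000 defaults are taken from the message quoted earlier in this thread. The function name and the toy gene names are hypothetical.

```python
def filter_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=10000):
    """Restrict each set to genes in the dataset, then apply the size thresholds.

    Returns (kept, dropped), each mapping set name -> genes found in the dataset.
    """
    dataset_genes = set(dataset_genes)
    kept, dropped = {}, {}
    for name, members in gene_sets.items():
        present = set(members) & dataset_genes
        target = kept if min_size <= len(present) <= max_size else dropped
        target[name] = present
    return kept, dropped


# Toy example: one 20-gene set, tested against a full and a filtered dataset.
toy_sets = {"TOY_SET": [f"G{i}" for i in range(20)]}
full_dataset = [f"G{i}" for i in range(100)]
filtered_dataset = [f"G{i}" for i in range(10)]

kept_full, dropped_full = filter_gene_sets(toy_sets, full_dataset)
kept_small, dropped_small = filter_gene_sets(toy_sets, filtered_dataset)
print("full dataset keeps:", sorted(kept_full))
print("filtered dataset drops:", sorted(dropped_small))
```

With the filtered dataset only 10 of the set's 20 genes remain, which is below the minimum of 15, so the set is excluded from the analysis.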

-Anthony


Nicole Ho

Aug 8, 2022, 11:45:07 PM
to gsea-help
Many thanks Anthony!!

On Tuesday, August 9, 2022 at 2:58:07 AM [UTC+8], Anthony Castanza wrote: