GSEA preranked, order of the ranked genes

Varun Gupta

Feb 14, 2024, 4:29:40 PM
to gsea-help
Hi,
I am using the GSEA Preranked tool. My data comes from DESeq2 output. If I use the Wald statistic or log2FC * -log10(pvalue) as the ranking metric, do I need to sort the list in any particular order, such as descending or ascending?
Can I just provide the two-column file (gene and metric) as is? If I remember correctly, GSEA does a descending sort anyway.

Hope to hear from you soon.

Regards,
Varun

Anthony Castanza

Feb 14, 2024, 5:31:26 PM
to gsea-help
Hi Varun,

As long as you're following GSEA's rnk data format you shouldn't need to do anything else. GSEA will handle the sorting internally.
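For reference, here is a minimal sketch of building the two-column rnk content from DESeq2-style values using the log2FC * -log10(pvalue) metric mentioned above. The function names, the use of `None` for DESeq2's NA values, and the clamping of p-values that are exactly 0 are all assumptions made for illustration, not GSEA requirements. The rows are deliberately left unsorted, since GSEA handles the sorting.

```python
import math
import sys


def rank_metric(log2fc, pvalue):
    """Signed significance score: log2FC * -log10(pvalue)."""
    # Clamp p-values of exactly 0 so the logarithm stays finite
    # (an illustrative choice, not a GSEA rule).
    p = max(pvalue, sys.float_info.min)
    return log2fc * -math.log10(p)


def rnk_lines(rows):
    """Return tab-delimited 'gene<TAB>score' lines for a .rnk file.

    rows: iterable of (gene, log2fc, pvalue) tuples. Rows with a missing
    value (represented here as None) are skipped. No sorting is applied.
    """
    lines = []
    for gene, log2fc, pvalue in rows:
        if log2fc is None or pvalue is None:
            continue
        lines.append(f"{gene}\t{rank_metric(log2fc, pvalue):.6g}")
    return lines


# Hypothetical genes and values, for illustration only.
demo = [("GENE_A", 2.0, 0.001), ("GENE_B", -1.0, 0.01), ("GENE_C", None, 0.5)]
for line in rnk_lines(demo):
    print(line)
```

Writing the result out is then a matter of joining the lines with newlines and saving them to a file with the `.rnk` extension.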

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/b7193e7a-2128-476b-99bd-23e46082767bn%40googlegroups.com.

Varun Gupta

Feb 15, 2024, 12:04:43 PM
to gsea-help
Hello Anthony,
Thank you for your reply. I have a few follow-up questions.
1. It is mentioned that the FDR cutoff to use is < 25%. Is this because our expression data is being tested against many gene sets?
2. When I look at the details of a pathway (by default, the top 20 pathways have a Details link; can I increase that to the top 40?), I see the genes from my data set that make up the pathway. One of the columns is labeled "Rank in gene list". I sort my list (rnk file) in descending order, but the rank does not match what is displayed. First of all, does the ranking start at 0?
3. When comparing my rnk data against, for example, the C2 collection, what is the difference between using the whole C2 collection and just the c2.cgp subcollection?

Thanks for your help and support.

Regards,
Varun

Anthony Castanza

Feb 15, 2024, 12:57:06 PM
to gsea...@googlegroups.com
Hi Varun,
To answer your questions in order:
1. The recommendation for a 0.25 FDR cutoff applies to GSEA in phenotype permutation mode; if you use gene_set permutation mode, a more standard 0.05 cutoff is recommended. Yes, the FDR is intended to account for multiple testing.
2. Yes, it is possible to change the number of sets that GSEA creates a detailed results page for, but this requires rerunning GSEA. The parameter is found under GSEA's "Advanced fields" settings and is called "Plot graphs for the top sets of each phenotype". Be warned that setting this to a large number can cause significant performance issues. Also, since this requires rerunning GSEA, the results may vary slightly because of the random seeds used. If you want to reproduce the same values, you will need to get the permutation seed value from the "Comments" section of your results index.html page and supply that value to the "seed for permutation" parameter (also in the advanced fields section), replacing the default "timestamp" value. The internal ranking (the "Rank in gene list" value) is zero-indexed.
3. The C2.CGP subcollection omits the "canonical pathways" component of MSigDB and contains just the sets curated from the literature. You can see the full breakdown of what is included in each collection on the MSigDB website https://msigdb.org/ by clicking the collection of interest on the main page.
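To make the zero-indexing in point 2 concrete, here is a small sketch of how a zero-based rank falls out of a descending sort. The function name and the example scores are hypothetical, and tie handling (Python's stable sort) is an assumption that may not match how GSEA orders tied scores.

```python
def zero_based_ranks(scores):
    """Map gene -> 0-based position after sorting scores in descending order."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {gene: position for position, (gene, _) in enumerate(ordered)}


# Hypothetical scores, for illustration only.
scores = {"GENE_A": 4.2, "GENE_B": -1.3, "GENE_C": 0.7}
ranks = zero_based_ranks(scores)
print(ranks)  # the top-scoring gene gets rank 0, not 1
```

So a gene that sits on row N of a descending-sorted spreadsheet (counting from 1, with no header row) would be expected to show N - 1 here.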

Let me know if you have additional questions, or if I missed anything in my response.

-Anthony


VG

Feb 15, 2024, 1:16:43 PM
to gsea...@googlegroups.com
Dear Anthony,
Thanks for your reply. Regarding FDR < 0.25, I found this in the GSEA FAQ:

Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?

An FDR of 25% indicates that the result is likely to be valid 3 out of 4 times, which is reasonable in the setting of exploratory discovery where one is interested in finding candidate hypotheses to be further validated as a result of future research. Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results.

Can you explain what the bold highlighted line means?

-> For the UI version of GSEA, can I run multiple analyses together, or do I have to run them one by one? If I can, could you tell me how?

-> Regarding the 3rd point above: I understand that c2.cgp does not include the canonical pathways component. But does restricting the comparison to c2.cgp rather than the whole C2 collection change the FDR, since the full C2 collection contains more gene sets?

I really appreciate all your help, time and support.

Regards,

Varun




Anthony Castanza

Feb 15, 2024, 2:38:05 PM
to gsea...@googlegroups.com
Hi Varun,

The bolded sentence is just saying that given the dataset properties of a typical GSEA run, with the phenotype permutation mode, the 0.25 cutoff is generally reasonable. By coherence, it means that the actual signal in most expression data is pretty weak, and if you limit yourself to only the most extreme results, you're more likely to miss real findings than the hard math would suggest.

To perform multiple runs, you can set them up in the UI and click Run; in the lower left there is a jobs queue that shows all the running analyses. You can certainly have multiple analyses running at the same time, but how many will depend on the amount of system memory you have available.

Yes, the FDR will change due to the change in the gene set universe being tested (the omission of the c2.cp sets when running just .cgp compared to all of c2).
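As a rough illustration of why the number of sets tested matters, the sketch below uses the Benjamini-Hochberg adjustment. This is not GSEA's own FDR procedure (which works from normalized enrichment scores and permutation-based null distributions); it is swapped in only because it is simple and shows the same general effect. All p-values here are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values: the same first set, tested in a small
# universe and then in a larger one that adds six more sets.
small_universe = [0.01, 0.20, 0.50, 0.80]
large_universe = small_universe + [0.30, 0.40, 0.60, 0.70, 0.90, 0.95]

q_small = bh_adjust(small_universe)[0]
q_large = bh_adjust(large_universe)[0]
print(q_small, q_large)  # the same set gets a larger adjusted value in the larger universe
```

The direction and size of the change in a real GSEA run depend on which sets are added or removed, so this should be read as intuition only.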

-Anthony


VG

Feb 15, 2024, 3:32:30 PM
to gsea...@googlegroups.com
Dear Anthony,
Thank you for the wonderful explanation. I have another follow-up question. Let's say that after running my rnk file against the Hallmark collection, I get the pathway HALLMARK_INFLAMMATORY_RESPONSE upregulated in my phenotype na_pos.
I look at the genes where the core enrichment column says Yes; there are around 40. I take these 40 genes and, on this page: https://www.gsea-msigdb.org/gsea/msigdb/human/annotate.jsp?geneSetName=HALLMARK_APOPTOSIS
enter them on the left and compute the overlap with, let's say, the C2 collection. Will the results be valid, or is this cherry-picking?

Thanks for all the help you have been providing me.

Regards,
Varun

Anthony Castanza

Feb 15, 2024, 3:42:31 PM
to gsea-help
Would the results be valid for what? It would tell you what other sets might also involve the genes that drove the enrichment of the Hallmark set you obtained, but it wouldn't be a particularly statistically rigorous analysis. It is potentially suggestive of areas to investigate further, though.
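For context, a common way to score an overlap like this is a hypergeometric upper-tail probability. The sketch below assumes that approach; whether it matches the website's exact calculation is an assumption, and the universe size and gene counts are hypothetical.

```python
from math import comb


def overlap_pvalue(universe_size, set_size, query_size, overlap):
    """Hypergeometric upper tail: P(X >= overlap).

    query_size genes are drawn from universe_size genes, of which
    set_size belong to the gene set being compared.
    """
    max_overlap = min(set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 40 core-enrichment genes, a 200-gene set,
# 10 shared genes, and an assumed universe of 20,000 genes.
p = overlap_pvalue(20000, 200, 40, 10)
print(p)
```

Because the 40 genes were themselves selected by an earlier enrichment analysis on the same data, a p-value from this kind of test will look better than it should, which is why it is a pointer for follow-up rather than a rigorous result.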

-Anthony
