GSEA pre-ranked

S

Aug 21, 2018, 4:58:46 PM

to gsea-help

Hi,

I am trying to run GSEA Preranked after an RNA-Seq experiment (control vs. samples treated with drugs).

Should I use the whole list of genes from the differential expression analysis (~20,000 genes), or only genes that pass a certain threshold (for example, p-value < 0.05 or a fold-change cutoff), which reduces the list to ~1,000-2,000 differentially expressed genes?

Thanks,
Subhi

David Eby

Aug 22, 2018, 12:04:46 PM

to gsea-help

Hi Subhi,

Our standard advice is to use the full, unfiltered list as GSEA's statistical tests benefit from the additional data, and processing time for a dataset of this size is unlikely to be an issue.
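As a rough illustration of what "full, unfiltered list" means in practice, here is a minimal sketch that turns every tested gene into one row of a two-column, tab-delimited ranked list. The metric used here (signed -log10 of the p-value) is only an assumption for the example and is not prescribed anywhere in this thread; the function names and example genes are hypothetical as well.

```python
import math

def build_rnk(rows):
    """rows: iterable of (gene, log2_fold_change, p_value) for ALL tested genes.
    Returns [(gene, score)] sorted from most up- to most down-regulated."""
    ranked = []
    for gene, log2fc, pval in rows:
        # Hypothetical metric (an assumption, not from the thread):
        # -log10(p) carrying the sign of the fold change.
        pval = max(pval, 1e-300)  # guard against log10(0)
        score = math.copysign(-math.log10(pval), log2fc)
        ranked.append((gene, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

def write_rnk(ranked, path):
    # Two tab-separated columns per line: gene identifier, ranking score.
    with open(path, "w") as handle:
        for gene, score in ranked:
            handle.write(f"{gene}\t{score:.6g}\n")

# No p-value or fold-change cutoff is applied: weak genes stay in the list.
example = [("GENE_A", 2.1, 1e-8), ("GENE_B", -1.7, 1e-5), ("GENE_C", 0.05, 0.9)]
ranked = build_rnk(example)
```

Note that GENE_C, which would fail any usual significance cutoff, is kept and simply lands in the middle of the list.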

There are a few pieces of additional info & advice, however:

In general, poorly expressed or undifferentiated genes will end up populating the middle of the ranked list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score.
However, for Preranked our advice is to choose a metric that avoids duplicate ranking values (ties). GSEA will not resolve ties, so the order of the tied genes will be arbitrary and the results possibly erroneous. In practice, a modest number of ties in the middle of the list won't have a significant effect, while a large number of ties can lead to a skewed or unbalanced distribution. If this is the case, then filtering or, more likely, the choice of a different ranking metric might be appropriate.
If you have RNA-Seq data with Ensembl IDs, it may be better to use our Ensembl CHIP files so that you can run "regular" GSEA instead of Preranked as that is the more generally preferred method anyway. Preranked is better for folks who are particular about the ranking method or who have data not accommodated by our other CHIP platform files.
As that above page notes, in this case it may be beneficial to filter out low-count measurements before further preprocessing, normalization, and analysis. You might find this beneficial even if you stick with using Preranked (ahead of computing the ranking metric).
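Two of the points above (dropping low-count measurements before computing the metric, and checking the ranked list for ties) can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the `min_total=10` cutoff is an arbitrary placeholder rather than a GSEA recommendation.

```python
from collections import Counter

def drop_low_counts(counts_by_gene, min_total=10):
    """Keep genes whose summed raw count across samples reaches min_total.
    min_total=10 is an arbitrary placeholder, not a recommended value."""
    return {gene: counts for gene, counts in counts_by_gene.items()
            if sum(counts) >= min_total}

def tie_summary(scores):
    """Report how many ranking values are shared by more than one gene."""
    counts = Counter(scores)
    tied = sum(n for n in counts.values() if n > 1)
    value, size = counts.most_common(1)[0]
    return {
        "n_genes": len(scores),
        "n_tied": tied,
        "tied_fraction": tied / len(scores),
        "largest_tie_value": value,
        "largest_tie_size": size,
    }

raw_counts = {"GENE_A": [50, 42, 61], "GENE_B": [0, 1, 0]}
kept = drop_low_counts(raw_counts)

scores = [3.2, 1.1, 0.0, 0.0, 0.0, -0.4, -2.5, -2.5]
summary = tie_summary(scores)
```

A small `tied_fraction` concentrated near zero is the "modest amount of ties in the middle" case; a large one suggests trying a different ranking metric.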

I hope this helps!

Regards,

David

David Eby
www.gsea-msigdb.org
igv.org
genepattern.org

Young

Aug 19, 2019, 11:49:46 AM

to gsea-help

Hi David,

As a follow-up question: if the gene list contains fewer than 20 genes (we got the ranking metric using a novel statistical method), can we still use the Preranked method? If not, what is the reasoning behind this?

I remember searching for the answer to this question in the Google group a year ago, but I can't find it now.

Thank you,

Young

David Eby

Aug 20, 2019, 5:00:15 AM

to gsea...@googlegroups.com

Hi Young,

The basic answer is the same as my earlier note to Subhi. There is no "problem" behind this; it's simply due to the way the method works. I am not a statistician, so I'll give a bit more of an intuitive reply. Perhaps someone else will chime in with a more rigorous follow-up.

To give a bit more insight, having so few genes in the dataset means that very few Gene Sets, and very likely none, will wind up matching your data, simply because such a small number of genes is barely above the matching threshold. Also, note that there is typically a large amount of information in the rest of the dataset that, even if it's not at the top of the list, is nonetheless still important to the analysis.
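To make the matching issue concrete, here is a minimal sketch of a size filter applied after each gene set is restricted to the genes present in the data. The `min_size=15` and `max_size=500` values are assumptions for the example (check the size limits in your own GSEA run), and the function, gene, and set names are hypothetical.

```python
def usable_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Restrict each gene set to genes present in the dataset, then keep
    only the sets whose remaining size falls inside [min_size, max_size].
    The size limits are assumed values for this sketch."""
    dataset = set(dataset_genes)
    kept = {}
    for name, members in gene_sets.items():
        overlap = set(members) & dataset
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept

all_genes = [f"G{i}" for i in range(1000)]          # stand-in for a full list
top_20 = all_genes[:20]                             # stand-in for a "Top N" list
sets = {"PATHWAY_X": [f"G{i}" for i in range(0, 200, 4)]}  # 50 members

with_top_20 = usable_gene_sets(sets, top_20)
with_full_list = usable_gene_sets(sets, all_genes)
```

With only 20 genes in the data, PATHWAY_X keeps 5 of its 50 members and is dropped; with the full list, all 50 members match and the set is analyzed.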

Recall that Gene Sets typically represent biological pathways or processes of some kind, comprising multiple genes or features. Thus, even if your statistical method is capturing the Top N Most Relevant Features, there are still some other (typically large) number of somewhat-lesser features that ALSO contribute to these pathways or processes. Focusing on just the Top N loses the additional information present in the rest of the data set.

By instead looking across the entire genome, the information from all the genes contributing to the pathway / process will be included, thus distinguishing which is most enriched overall. In a pathway analysis, it's important to have information from multiple participant genes - as many as possible - in each pathway.
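For intuition only, here is a minimal sketch of a weighted running-sum enrichment score of the kind GSEA is built around: walking down the ranked list, the sum goes up at each gene in the set (in proportion to its ranking value) and down at each gene outside it, and the score is the largest deviation from zero. This is a simplified illustration under my own assumptions, not the GSEA implementation, and it omits permutation testing and normalization; the function name and toy data are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: [(gene, score)] sorted from highest to lowest score.
    Returns the maximum deviation from zero of the running sum."""
    members = set(gene_set)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in members)
    n_miss = sum(1 for g, _ in ranked if g not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need set members with nonzero scores and non-members")
    running = 0.0
    best = 0.0
    for gene, score in ranked:
        if gene in members:
            running += abs(score) ** weight / hit_total  # step up at a hit
        else:
            running -= 1.0 / n_miss                      # step down at a miss
        if abs(running) > abs(best):
            best = running
    return best

toy = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es_top = enrichment_score(toy, {"A", "B"})
es_bottom = enrichment_score(toy, {"E", "F"})
```

Every gene in the list, hit or miss, moves the running sum, which is why the genes outside the Top N still carry information for the analysis.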

I hope this helps.

Regards,

David

