Inquiry regarding Gene Content (lncRNA/miRNA) in MSigDB Collections (C2, C5)


장한 (케로베로스)

May 19, 2026, 11:51:28 AM
to gsea-help
I am currently running GSEA on RNA-seq data using the C2 and C5 collections and have a question regarding gene biotype inclusion.

Specifically, my questions are:
1. For C2 and C5 collections, is it sufficient to include only protein-coding genes in the ranked gene list?

2. If lncRNAs and miRNAs are excluded from the input gene list when using C2, would this significantly affect the results?

Thank you for your time and I look forward to your response.

Anthony Castanza

May 19, 2026, 7:37:58 PM
to gsea...@googlegroups.com
Hello,

MSigDB itself is not specifically filtered to include only protein-coding genes; however, the ncRNA component is generally quite small across most sets.
The largest impact will actually be on the ranked gene distribution that GSEA works from: filtering these genes out changes the underlying differential expression distribution.
I can't say specifically how large the impact will be; it is extremely variable from dataset to dataset. That said, it is not inherently incorrect to filter if you are interested in just the protein-coding component.
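
To get a rough feel for this on your own data, something like the following Python sketch may help. It is only an illustration and not part of the GSEA software: the gene names, scores, and biotype labels are made-up toy values, the function names are hypothetical, and the score is a simplified version of the weighted running-sum statistic with no permutation testing or normalization. It reports the non-coding fraction of a gene set and compares the simplified score before and after restricting the ranked list to protein-coding genes.

```python
# Toy illustration only: all gene names, scores, and biotypes are invented.

def filter_protein_coding(ranked, biotype):
    """Keep only genes annotated as protein_coding; ranking order is preserved."""
    return [(g, s) for g, s in ranked if biotype.get(g) == "protein_coding"]

def noncoding_fraction(gene_set, biotype):
    """Fraction of gene-set members that are not annotated as protein_coding."""
    members = list(gene_set)
    non_coding = [g for g in members if biotype.get(g) != "protein_coding"]
    return len(non_coding) / len(members)

def enrichment_score(ranked, gene_set, p=1.0):
    """Simplified weighted running-sum statistic (assumption: no permutations,
    no normalization). `ranked` must be sorted by score, highest first."""
    hit_weight = sum(abs(s) ** p for g, s in ranked if g in gene_set)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_weight == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g, s in ranked:
        if g in gene_set:
            running += abs(s) ** p / hit_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss              # step down on a miss
        if abs(running) > abs(best):
            best = running                       # track the maximum deviation from zero
    return best

# Hypothetical ranked list (gene, differential-expression score), sorted descending.
ranked = [
    ("GENE_A", 3.0), ("LNC_1", 2.5), ("GENE_B", 2.0), ("MIR_1", 1.0),
    ("GENE_C", -0.5), ("LNC_2", -1.5), ("GENE_D", -2.0),
]
# Hypothetical biotype annotation, e.g. as taken from a GTF file.
biotype = {
    "GENE_A": "protein_coding", "GENE_B": "protein_coding",
    "GENE_C": "protein_coding", "GENE_D": "protein_coding",
    "LNC_1": "lncRNA", "LNC_2": "lncRNA", "MIR_1": "miRNA",
}
# Hypothetical gene set containing one non-coding member.
gene_set = {"GENE_A", "GENE_C", "LNC_1"}

filtered = filter_protein_coding(ranked, biotype)
es_full = enrichment_score(ranked, gene_set)
es_filtered = enrichment_score(filtered, gene_set)

print("non-coding fraction of set:", noncoding_fraction(gene_set, biotype))
print("score on full ranked list:", es_full)
print("score on protein-coding only:", es_filtered)
```

In this toy case the score shifts after filtering both because one set member is dropped and because the background of non-member genes shrinks. On real data the size of the shift will depend on the dataset, as noted above.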

Let me know if you have any other questions,

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
