Hi Yeeshouw,
Yes, we recommend the use of natural scale data with GSEA. I can’t speak to the specifics of the recommendation that Gordon gave when using the voom cpm transform, but it is not the normalization I would use.
Generally, the recommendation that I give is to use the DESeq2 sizeFactor correction “counts( , normalized = TRUE)” method. I will admit that our testing of normalization methods for RNA-seq data for input into GSEA was not particularly comprehensive, but the sizeFactor correction appeared to perform well (hence our recommendation).
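For readers unfamiliar with what the sizeFactor correction does, the following is a minimal pure-Python sketch of the median-of-ratios idea behind DESeq2 size factors. It is an illustration under stated assumptions, not the DESeq2 implementation: the function names `size_factors` and `normalized_counts` and the toy count matrix are hypothetical, and in practice you would call the DESeq2 function quoted above in R.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a list of per-gene rows of raw counts.

    Each row holds one gene's counts, one value per sample.
    """
    n_samples = len(counts[0])
    # Only genes with a nonzero count in every sample have a finite
    # log geometric mean, so only those genes are used.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has a nonzero count in every sample")
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(col)) for col in log_ratios]


def normalized_counts(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
toy_factors = size_factors(toy_counts)
toy_normalized = normalized_counts(toy_counts)
```

The result stays on the natural (non-log) scale, which is the property that matters for GSEA input here.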
As to using the log2FC data directly in GSEAPreranked, this approach can work, but the results will likely differ somewhat. In standard GSEA, genes are ranked by default using the signal-to-noise ratio, which incorporates information about both the magnitude of the change in expression between the sample groups and the standard deviation of expression within each group. If you supply a log2FC list, you only provide GSEA with information about the magnitude. If you have access to the underlying data, the full method is likely superior to the preranked method (assuming reasonable sample numbers).
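To make the difference between the two rankings concrete, here is a small sketch comparing a signal-to-noise score, (mean A - mean B) / (sd A + sd B), with a plain log2 fold change. It is a simplification under stated assumptions: it uses the sample standard deviation and omits any adjustment GSEA itself applies to very small standard deviations. The function names and the two toy genes are hypothetical.

```python
import math
from statistics import mean, stdev


def signal_to_noise(group_a, group_b):
    """(mean_a - mean_b) / (sd_a + sd_b), using sample standard deviations."""
    noise = stdev(group_a) + stdev(group_b)
    if noise == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / noise


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means; assumes both means are positive."""
    return math.log2(mean(group_a) / mean(group_b))


# Two hypothetical genes with the same group means (100 vs 50)
# but very different within-group variability.
tight_a, tight_b = [100, 102, 98], [50, 51, 49]
noisy_a, noisy_b = [100, 20, 180], [50, 10, 90]

tight_lfc = log2_fold_change(tight_a, tight_b)
noisy_lfc = log2_fold_change(noisy_a, noisy_b)
tight_s2n = signal_to_noise(tight_a, tight_b)
noisy_s2n = signal_to_noise(noisy_a, noisy_b)
```

Both genes get the same log2 fold change, so a preranked list built from log2FC cannot tell them apart, while the signal-to-noise score ranks the consistently changed gene far above the noisy one.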
Finally, Entrez Gene IDs are not the same as gene symbols. NCBI/Entrez Gene IDs are numerical gene identifiers (e.g. 7157), whereas gene symbols (which are curated by HGNC, not NCBI) are the gene names that most people are familiar with (e.g. TP53).
When running GSEA, we almost always recommend setting the Collapse/Remap parameter to “Collapse”. When you do so, however, you also have to specify the correct CHIP file for your data. This allows GSEA to map whatever identifiers you’re using in your dataset to the versions of the gene symbols used in a given version of MSigDB. If you provide a sampling of the gene identifiers used in your data, I can tell you more specifically which chip file you should use.
Let me know if you have any more questions.
-Anthony
Anthony S. Castanza, PhD
Department of Medicine
University of California, San Diego
--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/51cbdd96-de76-437a-bbe2-1b3232bba595n%40googlegroups.com.

Hi Yeeshouw,
Those appear to be mouse gene symbols. MGI, not HGNC, is the authority for mouse gene symbols (HGNC is the authority for human symbols).
Since this is mouse data, you’d likely want to run with the “Mouse_Gene_Symbol_Remapping_Human_Orthologs_MSigDB.v2022.1.Hs.chip” chip file if you’re running gene sets from the standard “Human collections” from MSigDB (gene sets that end in the .Hs suffix), or with the “Mouse_Gene_Symbol_Remapping_MSigDB.v2022.1.Mm.chip” chip file if you’re running gene sets from the “Mouse collections” component of MSigDB (gene sets that end with the .Mm suffix).
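The choice described above can be written down as a small lookup. This is only an illustrative sketch: the two chip file names are copied from the message, while the helper name `chip_for_mouse_symbols` and the example gene set file names are assumptions made for the example.

```python
# Chip file names as given in the message above.
HUMAN_ORTHOLOG_CHIP = (
    "Mouse_Gene_Symbol_Remapping_Human_Orthologs_MSigDB.v2022.1.Hs.chip"
)
MOUSE_NATIVE_CHIP = "Mouse_Gene_Symbol_Remapping_MSigDB.v2022.1.Mm.chip"


def chip_for_mouse_symbols(gene_set_file):
    """Pick a chip file for mouse gene symbol data from a gene set file name.

    Looks for an "Hs" or "Mm" token between dots in the file name.
    """
    tokens = gene_set_file.split(".")
    if "Hs" in tokens:
        return HUMAN_ORTHOLOG_CHIP
    if "Mm" in tokens:
        return MOUSE_NATIVE_CHIP
    raise ValueError(f"cannot tell Hs from Mm in {gene_set_file!r}")


# Example gene set file names (assumed for illustration).
human_choice = chip_for_mouse_symbols("h.all.v2022.1.Hs.symbols.gmt")
mouse_choice = chip_for_mouse_symbols("mh.all.v2022.1.Mm.symbols.gmt")
```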
-Anthony
From: Yeeshouw Wang
Sent: Friday, December 16, 2022 12:45 PM
To: gsea-help
Subject: Re: [gsea-help] GSEA Normalization Input through DESeq2 and extra
Hello Anthony,
Thank you very much for those clarifications! I'll continue to use the sizeFactor normalized read counts.
Here is a sample of the expression data, with sizeFactor normalization applied, that I input into GSEA.
Hi Yeeshouw,
This would seem like a good use case for the orthology mapping files we provide (the Mouse_Gene_Symbol_Remapping_Human_Orthologs_MSigDB.v2022.1.Hs.chip file) to run your data against the Human MSigDB collections. As to the Hallmark collection specifically: unlike most of the other collections in the Mouse Database, which are derived from mouse-native sources, the Hallmarks are just an orthology-converted version of the human collection, so there shouldn’t be much difference running them in the mouse namespace.
As to the different formats of gene identifiers and which chip file is appropriate for each one, unfortunately we don’t provide a detailed breakdown of what each one is for. The file names of each chip should generally provide enough information to determine the correct platform, but when in doubt, it is always possible to download the file from our downloads page, open it in a text editor and check for yourself.
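As an alternative to opening the file in a text editor, a few lines of Python can show what a downloaded chip file contains. This sketch assumes only that the file is tab-delimited with a header row and that the first column holds the source identifiers. The function names `preview_chip` and `fraction_mapped` are hypothetical.

```python
import csv


def preview_chip(path, n_rows=5):
    """Return the header row and the first n_rows data rows of a chip file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


def fraction_mapped(identifiers, path):
    """Fraction of the given identifiers found in the first chip column.

    Assumes the first column holds the source identifiers.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row
        known = {row[0] for row in reader if row}
    identifiers = list(identifiers)
    if not identifiers:
        return 0.0
    return sum(1 for ident in identifiers if ident in known) / len(identifiers)
```

Calling `fraction_mapped` with a sample of the identifiers from your own dataset gives a quick sanity check: a value near 1.0 suggests the chip file matches your identifier type, while a value near 0.0 suggests the wrong file.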
If this is helpful information: according to our sequencing provider, they used NCBI's mouse genome assembly version GRCm38.p6 to align the paired-end fastq files using HISAT2 and provided us with the BAM files. I used the Rsubread package's featureCounts() to quantify the reads using the GTF annotation file from the same mouse genome assembly, outputting those GeneIDs (which I now believe are actually HGNC gene symbols, and not NCBI Entrez Gene IDs?).