GSEA Normalization Input through DESeq2 and extra

Yeeshouw Wang

Dec 16, 2022, 1:47:03 PM

to gsea-help

Hello,

I have a couple of questions about preprocessing data before running GSEA, to make sure I am not supplying incorrect input and getting misleading results. I am not yet well versed in statistics but am doing my best to understand the theory, so please correct any mistakes I make here.

My first question concerns DESeq2 normalization methods. According to this FAQ on GSEA, natural-scale data should be used instead of log2-scale data, and according to this post (correct me if I am wrong), TPM-normalized data is not suitable as input to GSEA. The user manual states that DESeq2- or voom-normalized data can be used. However, DESeq2 offers several ways to normalize data.

Two such functions in DESeq2 are varianceStabilizingTransformation() (or its wrapper, vst()) and rlog(). Another option is counts(dds, normalized = TRUE), which, as Michael Love states here, only divides the counts by a sizeFactor correction (I am not exactly sure what this means). Could this be the DESeq2 normalization method that GSEA suggests? Personally, I have used the counts( , normalized = TRUE) method, which yielded 33939 non-zero rows, and the enriched gene sets have reasonable biological significance. However, in this discussion on Bioconductor, Michael Love says he prefers vst(), but "for anything involving a distance", so it might not be applicable to GSEA?

Gordon Smyth says in this Biostars post that cpm(y, log = TRUE) should be sufficient, which goes slightly against the GSEA FAQ's suggestion to use natural-scale rather than log-scale data.

My second question is about the difference in results or methodology (if any) between using the ranked log2FC list generated by the DESeq() function and inputting normalized counts into GSEA or related GSEA R packages. In the same Biostars post, an individual suggested using the GSEAPreranked algorithm, but I am not sure how up to date these packages are with respect to the Broad Institute software.

Clarification question: it seems that if I use NCBI-based gene symbols (which I believe are also called EntrezGene IDs), I have to select the "No Collapse" option before running GSEA; otherwise it gives me an error stating that all my genes were collapsed (I forget the error type). I was wondering what the reason behind this is.

Any suggestions, comments, or directions to other resources are greatly appreciated.

Best,

Yeeshouw Wang

Castanza, Anthony

Dec 16, 2022, 2:46:09 PM

to gsea...@googlegroups.com

Hi Yeeshouw,

Yes, we recommend the use of natural scale data with GSEA. I can’t speak to the specifics of the recommendation that Gordon gave when using the voom cpm transform, but it is not the normalization I would use.

Generally, the recommendation that I give is to use the DESeq2 sizeFactor correction “counts( , normalized = TRUE)” method. I will admit that our testing of normalization methods for RNA-seq data for input into GSEA was not particularly comprehensive, but the sizeFactor correction appeared to perform well (hence our recommendation).
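To make the sizeFactor correction concrete, here is a minimal Python sketch of the median-of-ratios idea behind it: each sample gets a single scaling factor, and every count in that sample is divided by it, so the data stay on the natural scale. This only illustrates the arithmetic and is not DESeq2's code; the function names and the toy counts are made up for the example. In practice you would simply call counts(dds, normalized = TRUE) in R.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped when estimating
    the factors, because their geometric mean would be zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue
        # Log of the gene's geometric mean across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor is the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide every count by its sample's size factor (natural scale)."""
    sf = size_factors(counts)
    return {g: [c / s for c, s in zip(row, sf)] for g, row in counts.items()}


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
raw = {
    "GeneA": [10, 20],
    "GeneB": [100, 200],
    "GeneC": [50, 100],
    "GeneD": [0, 4],
}
sf = size_factors(raw)
norm = normalize(raw)
print(sf)
print(norm["GeneA"])
```

In this toy case the only difference between the two samples is sequencing depth, so after dividing by the size factors GeneA has the same value in both samples.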

As to using the log2FC data directly in GSEAPreranked, this approach can work, but the results will likely differ slightly. In standard GSEA, genes are ranked by default by the signal-to-noise ratio, which incorporates information about both the magnitude of the change in expression between the sample groups and the variability (standard deviation) of expression within each group. If you supply the log2FC list, you only provide GSEA with information about the magnitude. If you have access to the underlying data, the full method is likely superior to the preranked method (assuming reasonable sample numbers).
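The difference between the two rankings can be sketched in a few lines of Python. The example below compares a simple signal-to-noise ratio, (mean A - mean B) / (sd A + sd B), with a plain log2 fold change of group means for two hypothetical genes. It is a simplified illustration: GSEA applies further adjustments to the standard deviations that are omitted here, the use of the sample standard deviation is an assumption, and the fold change shown is a ratio of means rather than the estimate DESeq2 reports.

```python
import math
from statistics import mean, stdev


def signal_to_noise(a, b):
    """Difference of group means scaled by the sum of group standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def log2_fold_change(a, b):
    """log2 of the ratio of group means; ignores variability entirely."""
    return math.log2(mean(a) / mean(b))


# Hypothetical normalized counts: both genes double between groups,
# but the second gene is far more variable within each group.
steady_a, steady_b = [200, 210, 190], [100, 105, 95]
noisy_a, noisy_b = [200, 350, 50], [100, 180, 20]

for name, a, b in [("steady", steady_a, steady_b), ("noisy", noisy_a, noisy_b)]:
    print(name, log2_fold_change(a, b), round(signal_to_noise(a, b), 3))
```

Both genes have the same log2 fold change, so a log2FC-ranked list cannot tell them apart, while the signal-to-noise ratio ranks the consistently changed gene well above the noisy one.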

Finally, EntrezGene IDs are not the same as gene symbols. NCBI/Entrez gene IDs are numerical gene identifiers (e.g., 7157), whereas gene symbols (which are curated by HGNC, not NCBI) are the gene names most people are familiar with (e.g., TP53).
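A rough way to tell these identifier types apart in a dataset is to look at their shape. The Python sketch below is only a heuristic with a hypothetical function name: purely numeric identifiers are treated as Entrez gene IDs, identifiers that look like Ensembl gene IDs are flagged as such, and everything else is assumed to be a gene symbol.

```python
import re


def classify_identifier(identifier):
    """Guess the identifier type from its shape (heuristic only)."""
    if re.fullmatch(r"\d+", identifier):
        return "entrez_id"
    # Ensembl gene IDs look like ENSG... (human) or ENSMUSG... (mouse),
    # optionally followed by a version suffix such as ".12".
    if re.fullmatch(r"ENS[A-Z]*G\d+(\.\d+)?", identifier):
        return "ensembl_gene_id"
    return "gene_symbol"


for example in ["7157", "TP53", "Trp53", "ENSG00000141510"]:
    print(example, "->", classify_identifier(example))
```

A check like this on the first column of an expression file gives a quick hint about which family of chip files to look at.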

When running GSEA, we almost always recommend setting the Collapse/Remap parameter to “Collapse”. But when you do so, you also have to specify the correct CHIP file for your data. This allows GSEA to map whatever identifiers you’re using in your dataset to the correct versions of gene symbols that match the ones used in a given version of MSigDB. If you provide a sampling of the gene identifiers used in your data, I can tell you more specifically which chip file you should use.

Let me know if you have any more questions.

-Anthony

Anthony S. Castanza, PhD

Department of Medicine

University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/51cbdd96-de76-437a-bbe2-1b3232bba595n%40googlegroups.com.

Yeeshouw Wang

Dec 16, 2022, 3:45:49 PM

to gsea-help

Hello Anthony,

Thank you very much for those clarifications! I'll continue to use the sizeFactor normalized read counts.

Here is a sample from the Expression Data, under the sizeFactor normalization I inputted into GSEA.

In case it is helpful: our sequencing provider aligned the paired-end FASTQ files to NCBI's mouse genome assembly GRCm38.p6 using HISAT2 and provided us with the BAM files. I used featureCounts() from the Rsubread package to quantify the reads with the GTF annotation file from the same mouse genome assembly, which output those GeneIDs (which I now believe are actually HGNC gene symbols, and not NCBI Entrez Gene IDs?).

Best,

Yeeshouw Wang

Castanza, Anthony

Dec 16, 2022, 4:09:03 PM

to gsea...@googlegroups.com

Hi Yeeshouw,

Those appear to be mouse gene symbols. MGI is the authority for mouse gene symbols, not HGNC (which is the authority for human symbols).

Since this is mouse data, you’d likely want to run with the “Mouse_Gene_Symbol_Remapping_Human_Orthologs_MSigDB.v2022.1.Hs.chip” chip file if you’re running gene sets from the standard “Human collections” from MSigDB (gene sets that end in the .Hs suffix), or with the “Mouse_Gene_Symbol_Remapping_MSigDB.v2022.1.Mm.chip” chip file if you’re running gene sets from the “Mouse collections” component of MSigDB (gene sets that end with the .Mm suffix).

-Anthony

Anthony S. Castanza, PhD

Department of Medicine

University of California, San Diego

Yeeshouw Wang

Dec 16, 2022, 6:13:04 PM

to gsea-help

Hello Anthony,

Thank you again for the correction. I'll switch to one of those two chip platforms.

Two follow-up questions:

1.) My project is based on a transgenic mouse model of a human mutation, and ultimately I would like to see how that transgenic gene in mice translates to humans. Given the two chip platforms that would work and the gene sets available in the human vs. mouse collections, which one would you suggest using? Or does it not necessarily matter? For example, if I wanted to see how the Hallmark gene sets are enriched, would it be more fitting to use the mouse-based or the human-based Hallmark gene sets (with the respective chip platform) to find enriched biological pathways? I will most likely try both, but I would like a second opinion.

2.) For future reference, what resources do you recommend for becoming familiar with the different gene identifiers so that the correct chip platform can be selected? I see now that it depends on the database I use to align and quantify my sequences, but it might be helpful nonetheless. I may have missed it in the "Chip Annotations" section of the User Manual.

Apologies if these are trivial questions.

Best,

Yeeshouw Wang

Castanza, Anthony

Dec 16, 2022, 6:53:37 PM

to gsea...@googlegroups.com

Hi Yeeshouw,

This would seem like a good use case for taking advantage of the orthology mapping files we provide (the Mouse_Gene_Symbol_Remapping_Human_Orthologs_MSigDB.v2022.1.Hs.chip file) to run your data against the Human MSigDB collections. As to the Hallmark collection specifically, unlike most of the other collections in the Mouse Database which are derived from mouse native sources, the Hallmarks are just an orthology converted version of the human collection, so there shouldn’t be much difference running them in the mouse namespace.

As to the different formats of gene identifiers and which chip file is appropriate for each one, unfortunately we don’t provide a detailed breakdown of what each one is for. The file names of each chip should generally provide enough information to determine the correct platform, but when in doubt, it is always possible to download the file from our downloads page, open it in a text editor and check for yourself.
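Checking a chip file against your own identifiers can also be scripted. The Python sketch below assumes the chip file is tab-delimited with "Probe Set ID" and "Gene Symbol" columns and uses a tiny inline stand-in for a downloaded file; the helper names and the dataset identifiers are hypothetical. It reports what fraction of the dataset's identifiers appear in the chip file.

```python
import csv
import io


def load_chip_map(handle):
    """Read a tab-delimited chip file into {source identifier: gene symbol}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}


def chip_coverage(dataset_ids, chip_map):
    """Return the fraction of dataset identifiers in the chip, plus the misses."""
    unmatched = sorted(i for i in set(dataset_ids) if i not in chip_map)
    matched = sum(1 for i in dataset_ids if i in chip_map)
    return matched / len(dataset_ids), unmatched


# Toy stand-in for a mouse-symbol-to-human-ortholog chip file (hypothetical rows).
toy_chip = io.StringIO(
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "Trp53\tTP53\ttumor protein p53\n"
    "Brca1\tBRCA1\tBRCA1 DNA repair associated\n"
)
chip_map = load_chip_map(toy_chip)
fraction, missing = chip_coverage(["Trp53", "Brca1", "Gm12345"], chip_map)
print(fraction, missing)
```

With a real file, replace the inline text with open("path/to/file.chip", newline=""). A very low matched fraction usually means the wrong chip file was chosen for the identifier type in the dataset.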

Yeeshouw Wang

Dec 16, 2022, 7:05:48 PM

to gsea-help

Hello Anthony,

Thank you very much for your quick and extensive help, and for your time. This clears things up for me.

Have a wonderful holiday.

Best,

Yeeshouw Wang
