GSEA cannot recognize some Ensembl gene IDs present in both the gene sets and the expression dataset

M Esmaeli

May 31, 2021, 9:50:07 AM

to gsea-help

I've realized that GSEA cannot recognize some Ensembl IDs in gene sets. I read this in the user guide:
"""A table of genes in the gene set ordered by their position in the ranked list of genes. The analysis includes only those genes in the gene set that are also in the expression dataset."""
I thought that GSEA would recognize all gene IDs from the gene sets and compare them with the expression dataset for analysis. But I found some genes that are in the top-ranked gene list and also present in the gene sets, yet do not come up in the results.
I am wondering whether this is a technical problem or just the nature of the algorithm.
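To check which gene-set members can actually take part in the analysis, a quick overlap test outside GSEA can help. This is a minimal sketch, not GSEA's own code; the function name and the IDs are hypothetical placeholders. It assumes IDs are compared as exact strings, so differences such as version suffixes or stray whitespace would also show up as "missing".

```python
def split_by_presence(gene_set, dataset_ids):
    """Split gene-set members into those found in the dataset and those not found."""
    dataset = set(dataset_ids)
    present = [g for g in gene_set if g in dataset]
    missing = [g for g in gene_set if g not in dataset]
    return present, missing


# Hypothetical IDs, for illustration only
dataset_ids = ["ENSDARG_EXAMPLE_1", "ENSDARG_EXAMPLE_2", "ENSDARG_EXAMPLE_3"]
gene_set = ["ENSDARG_EXAMPLE_2", "ENSDARG_EXAMPLE_3", "ENSDARG_EXAMPLE_9"]

present, missing = split_by_presence(gene_set, dataset_ids)
print("used in analysis:", present)
print("not in dataset:", missing)
```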
Thank you

acas...@cloud.ucsd.edu

May 31, 2021, 3:59:06 PM

to gsea...@googlegroups.com

Hi,

By default, an Ensembl ID will be omitted from the CHIP if it does not have both an assigned gene symbol and an assigned NCBI gene ID. Genes missing this information will also never be found in a gene set. However, since these genes can still have an impact on the gene ranking, there is an option to allow these Ensembl IDs to be passed through the collapse operation “as-is”. This option is under “Advanced fields” and is called “Omit features with no symbol match”. By default, it is set to “true”, which restricts the expression set to just the genes that fit the Ensembl/HGNC/NCBI consensus criteria; setting it to “false” will include the Ensembl IDs. We think that omitting these genes is generally fine, as they are generally lower-quality constructs (genes with less strong evidence of being “real”, as they are only present in Ensembl’s transcriptome model and not in the NCBI model), but switching this parameter to “false” will give you a technically more accurate depiction of how your enrichment compares to all the constructs that Ensembl says are possible.
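The effect of that option can be pictured roughly as follows. This is only an illustrative sketch of the behavior described above, not the actual GSEA implementation; the function name, the dictionary keys, and the IDs are hypothetical.

```python
def collapse_ids(rows, omit_unmatched=True):
    """Map Ensembl IDs to symbols; optionally pass unmatched IDs through as-is.

    rows: list of dicts with hypothetical keys 'ensembl_id', 'symbol', 'ncbi_id'.
    """
    kept = []
    for row in rows:
        has_both = bool(row.get("symbol")) and bool(row.get("ncbi_id"))
        if has_both:
            kept.append((row["ensembl_id"], row["symbol"]))
        elif not omit_unmatched:
            # No consensus symbol: keep the Ensembl ID itself as the feature name
            kept.append((row["ensembl_id"], row["ensembl_id"]))
    return kept


# Hypothetical annotation rows, for illustration only
rows = [
    {"ensembl_id": "ENS_EXAMPLE_1", "symbol": "SYM1", "ncbi_id": "101"},
    {"ensembl_id": "ENS_EXAMPLE_2", "symbol": "SYM2", "ncbi_id": None},
    {"ensembl_id": "ENS_EXAMPLE_3", "symbol": None, "ncbi_id": None},
]

print(collapse_ids(rows, omit_unmatched=True))
print(collapse_ids(rows, omit_unmatched=False))
```

With the "omit" behavior on, only the first row survives; with it off, the other two rows stay in the ranked list under their Ensembl IDs, where they affect the ranking but still cannot match any gene set member.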

If this doesn’t address your question, can you give an example of an Ensembl ID that was omitted that you think shouldn’t have been?

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

http://gsea-msigdb.org/


M Esmaeli

Jun 1, 2021, 12:15:57 AM

to gsea...@googlegroups.com

Hi Anthony
Many thanks; I understand how it works now. I prepared the gene sets from BioMart for zebrafish, and my expression datasets contain 2000-2500 genes. I have 27 samples and the data are log2 transformed. Do you recommend setting “Omit features with no symbol match” to false for my situation? (I think zebrafish and Atlantic salmon are still not as comprehensively investigated as human or mouse.)
Also, I have another question regarding the ranking process. I read in the original GSEA paper that the ranking option "Diff_of_Classes" is good for small datasets, although it has a problem with false positives. Can you please give me your opinion on selecting "Diff_of_Classes" or "Signal2Noise" for my dataset?
Thank you in advance


Anthony Castanza

Jun 1, 2021, 1:27:36 PM

to gsea...@googlegroups.com

Hello,

27 samples is well within the range of what we’d consider an acceptably large dataset (assuming they’re divided somewhat equally into two phenotypes), so for this signal2noise would be the recommended metric.
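For reference, the two ranking metrics discussed here can be sketched per gene as below. This is a simplified illustration: the function names and sample values are made up, and GSEA's own signal-to-noise calculation applies additional adjustments to the standard deviations that are omitted here.

```python
import statistics


def diff_of_classes(a, b):
    """Difference of class means for one gene."""
    return statistics.mean(a) - statistics.mean(b)


def signal_to_noise(a, b):
    """Difference of class means scaled by the sum of the class standard deviations."""
    spread = statistics.stdev(a) + statistics.stdev(b)
    return (statistics.mean(a) - statistics.mean(b)) / spread


# Hypothetical expression values for one gene in two phenotypes
class_a = [10.0, 12.0, 14.0]
class_b = [4.0, 5.0, 6.0]

print(diff_of_classes(class_a, class_b))
print(signal_to_noise(class_a, class_b))
```

Signal-to-noise needs enough samples per phenotype to estimate the standard deviations, which is why it suits a dataset of this size, while the difference of classes ignores within-class variability entirely.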

There are a couple of issues here, though. 2000-2500 genes in the input dataset is well below what we’d recommend. GSEA is designed to be run on all the expressed genes, and a list this short usually indicates some sort of significance filtering, which isn’t a part of the recommended GSEA data processing pipeline. We also don’t generally recommend using log2 transformed data; the input you’d want for GSEA is some sort of between-sample normalized counts (various methods exist to do this normalization, such as options in DESeq2, which are enabled by default in the GenePattern.org and usegalaxy implementations of DESeq2).

When you say “I prepared the genesets from BioMart for zebrafish” do you mean you prepared gene sets for the gene sets database input, or you produced orthology conversion chips to be used with MSigDB gene sets database files?

The “omit” parameter is only applicable when collapsing the dataset with a chip file. If you’ve done orthology conversion, I recommend leaving this as-is (false).

I should also note that we don’t officially offer support for zebrafish so I don’t know how much I’m going to be able to help with specific details.

-Anthony


M Esmaeli

Jun 1, 2021, 9:00:16 PM

to gsea...@googlegroups.com

Hi Anthony
Apologies, I should have given you more information. The data are proteomics data, which is why I have only 2000-2500 genes. My data have already been normalized between samples (global normalization; median MS2 intensity values). Can I use this original LFQ LC-MS data for GSEA? If not, can you please suggest which normalization method I should use instead of log2?
I prepared gene sets from BioMart for the "Gene sets database" input. For running GSEA, I chose "No_Collapse" and, for permutation type, "gene_set".
While we know GSEA is not a perfect option for zebrafish and proteomics, I just wanted to try it to hopefully get close to something acceptable.
Many thanks


Anthony Castanza

Jun 1, 2021, 9:24:32 PM

to gsea...@googlegroups.com

I see. I’m not super familiar with proteomics pipelines or the normalization scheme you’ve described, but my suggestion would be to use the non-log-transformed normalized data with signal-to-noise ratio as the ranking metric. If you only have the log-transformed value for each sample, it should be fairly trivial to unlog the sample-by-gene matrix without affecting the global normalization that was done.
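Unlogging can be sketched like this, assuming the values are plain log2(x) with no pseudocount. If the pipeline used something like log2(x + 1), the pseudocount would need to be subtracted afterwards. The function name and the numbers are hypothetical.

```python
def unlog2(matrix):
    """Invert a log2 transform element-wise on a gene-by-sample matrix (list of rows)."""
    return [[2.0 ** value for value in row] for row in matrix]


# Hypothetical log2 intensities: rows are genes, columns are samples
log_matrix = [
    [10.0, 11.0],
    [3.0, 0.0],
]

linear_matrix = unlog2(log_matrix)
print(linear_matrix)
```

Because the transform is applied to every value independently and is monotonic, the ordering of values within and between samples is preserved.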

One other suggestion: since you have so few genes, you might lower the min size parameter under basic fields to “10” or “5”. This will change the threshold that GSEA uses to reject gene sets that don’t have enough members present in the input assay. Normally we don’t recommend changing this parameter too much, and I’d be cautious about sets that small, but with only 2000ish genes, you might have to.
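The size filter can be approximated outside GSEA to see how many sets would survive at each threshold. This is a rough sketch rather than GSEA's code: the function and set names are hypothetical, and the default bounds of 15 and 500 are the usual GSEA defaults as far as I know.

```python
def filter_by_size(gene_sets, dataset_ids, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the dataset falls within [min_size, max_size]."""
    dataset = set(dataset_ids)
    kept = {}
    for name, members in gene_sets.items():
        overlap = dataset.intersection(members)
        if min_size <= len(overlap) <= max_size:
            kept[name] = sorted(overlap)
    return kept


# Hypothetical dataset of 20 genes and three gene sets
dataset_ids = [f"g{i}" for i in range(20)]
gene_sets = {
    "SET_A": [f"g{i}" for i in range(12)] + ["x1", "x2"],  # 12 members in dataset
    "SET_B": [f"g{i}" for i in range(6)],                  # 6 members in dataset
    "SET_C": [f"g{i}" for i in range(3)],                  # 3 members in dataset
}

for threshold in (15, 10, 5):
    survivors = filter_by_size(gene_sets, dataset_ids, min_size=threshold)
    print(threshold, sorted(survivors))
```

Note that the size is counted after restricting each set to genes present in the dataset, so a set with many annotated members can still be rejected when the input has only a couple of thousand genes.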


M Esmaeli

Jun 1, 2021, 10:44:53 PM

to gsea...@googlegroups.com

Thanks Anthony
I just tested the LFQ LC-MS data in GSEA and it worked well with no errors. You helped me a lot by answering all my questions and with your suggestions; I appreciate that.
Kind regards


M Esmaeli

Jun 2, 2021, 6:21:31 AM

to gsea...@googlegroups.com

Hi Anthony
As a last question: I was able to run GSEA and the results were quite biologically meaningful. I have two options now, and it would be great to hear which one you recommend for my dataset:
1. Putting all GO terms + KEGG pathways in as the "Gene sets database", in which case GSEA would use around 3500 gene sets for the analysis (as a result, more gene sets are significant at FDR < 25%).
2. Putting only GOBP terms + KEGG pathways in as the "Gene sets database", in which case GSEA would use around 2100 gene sets for the analysis (as a result, fewer gene sets are significant at FDR < 25%).
Thanks

Anthony Castanza

Jun 2, 2021, 6:06:37 PM

to gsea...@googlegroups.com

Hi,

We generally recommend running sub-collections individually. GSEA's false discovery rate, like all of GSEA's statistics, is empirical, so the FDR it gives you is an FDR for the universe of sets that you've put in. By putting in more sets, you might get more significant results overall, but each individual set might be less significant than it would be if you'd just compared it against its own collection.
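The general point that a set's FDR depends on what else is tested alongside it can be illustrated with a toy example. Note that this uses the Benjamini-Hochberg adjustment as a stand-in; GSEA does not compute its FDR this way (it works from permutation-based normalized enrichment scores), and the p-values below are made up.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical p-values: one collection on its own, then pooled with a second one
own_collection = [0.01, 0.04, 0.30]
pooled = own_collection + [0.5, 0.6, 0.7, 0.8, 0.9]

q_own = bh_qvalues(own_collection)
q_pooled = bh_qvalues(pooled)
print(q_own[0], q_pooled[0])
```

The first set has the same p-value in both runs, yet its adjusted value is larger in the pooled run because it is being judged against a bigger universe of sets.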

-Anthony

