missing NOM p-val

席甲甲

Aug 10, 2021, 11:25:44 AM

to gsea-help

Hi, everyone,

I ran a GSEA analysis on a data set. The top 28 gene sets were listed without an "NES" or "NOM p-val", as shown in the attachment. I have two questions:

First: do these GO gene sets really have no NES or NOM p-val, or are the values too low to display?

Second: for comparison purposes, if I need a value for presenting these data later, could I use 0.0001 or the smallest value in this group to represent them?

Thank you very much for your help!

Jiajia

Screen Shot 2021-08-10 at 11.18.46 AM.png

Anthony Castanza

Aug 10, 2021, 1:28:25 PM

to gsea...@googlegroups.com

Hi Jiajia,

The explanation for this is fairly complicated. When GSEA produces a null distribution of enrichment scores from random permutations of the data, that distribution has a positive and a negative component. GSEA compares the true enrichment score of the gene set to the side of the distribution with the same sign. In rare cases the entire null distribution falls on one side; if that side has the opposite sign from the true enrichment score, GSEA gets a NaN from its pValue calculation function.

This most frequently occurs when there is a large skew in the underlying expression data but GSEA found a putative enrichment on the opposite (smaller) side of the skew. When gene set permutation is used, it can be difficult to sample enough random genes from the smaller side to actually produce a null distribution for that side.
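The same-sign comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not GSEA's actual code: the function name `nominal_p_value` and the example numbers are hypothetical.

```python
import math

def nominal_p_value(observed_es, null_es):
    """Compare the observed ES only to null scores that share its sign.

    Hypothetical helper for illustration; not taken from the GSEA source.
    """
    if observed_es >= 0:
        same_side = [es for es in null_es if es >= 0]
        extreme = [es for es in same_side if es >= observed_es]
    else:
        same_side = [es for es in null_es if es < 0]
        extreme = [es for es in same_side if es <= observed_es]
    if not same_side:
        # No null score shares the observed sign, so the ratio would be 0/0.
        # This is the situation that shows up as "---" in the report.
        return math.nan
    return len(extreme) / len(same_side)

# Made-up null distribution that lies entirely on the positive side.
null = [0.21, 0.35, 0.18, 0.40, 0.27]
print(nominal_p_value(0.38, null))   # 0.2
print(nominal_p_value(-0.45, null))  # nan
```

With a one-sided null like this, any gene set whose true enrichment score is negative has nothing to be compared against, which is why no p-value can be reported for it.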

Were you running in Gene Set permutation mode? If so, can you try increasing the number of permutations to 10,000 instead of the default 1000?

It's worth noting that the FDRs for those sets aren't valid either.

Can you send us an enrichment plot from one of the gene sets that returned "---", along with the associated "Random ES distribution" plot? It would also help to have the same two plots from one of the negative enrichment score gene sets that did return valid values. That will help confirm that this distribution error is in fact what occurred here.

Thanks

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

席甲甲

Aug 13, 2021, 12:22:21 PM

to gsea-help

Hi Anthony,

Thank you very much for your explanation!

1) I have attached two pairs of enrichment plots and ES distribution plots. The plots for "Activation of immune response" are from a gene set with "---" for NES and NOM p-val, while the plots for "Establishment of protein location to ER" are from a gene set with valid values. "Activation of immune response" is actually the first gene set listed in this analysis, while "Establishment of protein location to ER" is only 29th in the gene set list. Frankly, I do not know how to read these plots. Could you explain them a little?

2) Which value(s) should I use to compare different gene sets, or to show how reliable a gene set is and how relevant it is to my input data set? In the literature, I have seen GSEA results presented with the size, NES, NOM p-val, or FDR q-val of a particular gene set. I am wondering which of these should be included in a scientific presentation of GSEA results. Is there a standard? And if NES is used, is a smaller or a larger NES better?

3) I am curious about the order of the gene sets in the "detailed enrichment results in html format". Are they ordered by NES or by NOM p-val? Does the NES reflect only the expression of genes in the indicated phenotype (phenotype A), in the pair of phenotypes used in this analysis (phenotypes A and B), or in all input phenotypes (four phenotypes in total were included in this analysis, with duplicate samples)?

4) In the gene set details, the report shows a table of the genes in the gene set, ordered by their position in the ranked list of genes. My question is about each gene's RANK IN THE LIST: does it indicate the gene's contribution to the selected gene set? Could I read it as: the smaller the rank, the more critical the gene is to the selected gene set? If so, when the core enrichment genes have large RANK IN THE LIST values, is our input data still closely related to the selected gene set?

5) Another question I have about GSEA: in a paired analysis (phenotype A vs. B), if a gene set is enriched in phenotype A but not in phenotype B, can I interpret this as the selected gene set being decreased (or down-regulated) in phenotype B?

Thank you again for your assistance! I am looking forward to your reply!

Jiajia

enplot_GOBP_ACTIVATION_OF_IMMUNE_RESPONSE_847.png

Screen Shot 2021-08-13 at 10.13.20 AM.png

Screen Shot 2021-08-13 at 10.18.07 AM.png

Screen Shot 2021-08-13 at 10.18.48 AM.png

Anthony Castanza

Aug 13, 2021, 6:59:55 PM

to gsea-help

Hi Jiajia,

Thanks for sending these plots!

1) The description of the enrichment plots and how to interpret them is available in our user guide here: https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results

2) Generally, we consider gene sets with a NOM pValue < 0.05 and an FDR < 0.25 (for gene set permutation, FDR < 0.05) to be "significant". Within those sets, the NES is useful for comparing how strongly sets are enriched relative to one another. A larger absolute NES (whether positive or negative) indicates stronger enrichment of the gene set's members.

3) Gene sets are ordered by their NES (the enrichment score normalized for gene set size). I'm not sure exactly what you mean by the rest of this question. GSEA will have performed whatever differential expression calculation you selected from your CLS file and will then search across that ranked list for how perturbed each gene set is. Gene sets consisting of genes that are overall skewed towards the positive side will be assigned positive scores (on balance upregulated in the selected comparison), and gene sets consisting of genes overall skewed towards the negative side will be assigned negative scores (on balance downregulated in the selected comparison). By default, GSEA only supports a binary comparison (i.e. Phenotype A vs. Phenotype B); this is extended somewhat by the ability to do One vs. REST comparisons from the UI, as well as correlation analysis with continuous vectors. GSEA cannot directly compare four groups. If you only had two samples per phenotype, you would need to select gene_set permutation instead of Phenotype permutation for the "Permutation type" parameter. GSEA is only able to construct a valid null distribution in phenotype permutation mode with at least 7 samples on each "side" of the A vs. B comparison, or at least 7 samples total if running in the correlation/continuous vector mode. Selecting the wrong permutation mode for your dataset may explain the inability to calculate valid statistics.

4) In the details report, the genes are ordered by their rank in the gene list. This rank is derived from the internal metric of differential expression that GSEA calculated for each gene. The more extreme the "rank metric score" of a gene, the more it contributed to the gene set's enrichment score. The "core enrichment genes" are those genes that were the most important to the gene set's enrichment score for your specific dataset. That is, when walking down the ranked list calculating the enrichment score, the core enrichment genes are the ones in the gene set that the algorithm found before it reached the enrichment score peak value (the maximum deviation from zero on the "mountain" (Enrichment) plot).

5) GSEA is symmetrical: if a gene set is enriched ("upregulated") in Phenotype A, it is depleted ("downregulated") in Phenotype B.
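As a rough illustration of the thresholds in point 2, the Python sketch below filters a small results table and orders the surviving sets by absolute NES. The function name `significant_sets`, the dictionary keys, and the SET_A to SET_D rows are all made up for the example; they are not a GSEA file format or API.

```python
def significant_sets(results, permutation_type="phenotype"):
    """Keep sets passing the NOM p and FDR cutoffs, strongest |NES| first.

    Hypothetical helper: cutoffs follow the rule of thumb quoted above
    (FDR < 0.25 for phenotype permutation, FDR < 0.05 for gene set permutation).
    """
    fdr_cutoff = 0.25 if permutation_type == "phenotype" else 0.05
    # Comparisons with NaN are always False, so rows with missing
    # statistics ("---") are dropped rather than treated as significant.
    kept = [r for r in results if r["nom_p"] < 0.05 and r["fdr_q"] < fdr_cutoff]
    return sorted(kept, key=lambda r: abs(r["nes"]), reverse=True)

# Made-up rows for illustration only.
results = [
    {"name": "SET_A", "nes": 1.9, "nom_p": 0.01, "fdr_q": 0.10},
    {"name": "SET_B", "nes": -2.3, "nom_p": 0.001, "fdr_q": 0.02},
    {"name": "SET_C", "nes": 1.2, "nom_p": 0.20, "fdr_q": 0.40},
    {"name": "SET_D", "nes": float("nan"), "nom_p": float("nan"), "fdr_q": 0.0},
]

print([r["name"] for r in significant_sets(results)])              # ['SET_B', 'SET_A']
print([r["name"] for r in significant_sets(results, "gene_set")])  # ['SET_B']
```

Note that a row with missing statistics is excluded here instead of being given a placeholder p-value such as 0.0001.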
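Points 4 and 5 can be sketched together with a simplified running-sum walk in Python. This is a minimal sketch under stated assumptions, not GSEA's implementation: hits are weighted by the absolute rank metric score (weight exponent 1), there is no normalization or permutation step, and the function name `enrichment_walk` and the genes G1 to G8 are hypothetical. It also assumes the gene set has at least one member in the ranked list and does not cover the whole list.

```python
def enrichment_walk(ranked, gene_set):
    """Return (ES, core genes) for a ranked list of (gene, score) pairs.

    Simplified illustration: step up at each gene-set member in proportion
    to |score|, step down by a constant at every other gene.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    hit_weight = sum(abs(score) for (_, score), h in zip(ranked, hits) if h)
    miss_step = 1.0 / (len(ranked) - sum(hits))
    running, total = [], 0.0
    for (_, score), is_hit in zip(ranked, hits):
        total += abs(score) / hit_weight if is_hit else -miss_step
        running.append(total)
    # The enrichment score is the maximum deviation from zero.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    es = running[peak]
    if es >= 0:
        # Positive ES: members found up to and including the peak.
        core = [g for (g, _), h in zip(ranked[:peak + 1], hits[:peak + 1]) if h]
    else:
        # Negative ES: members found from the peak to the end of the list.
        core = [g for (g, _), h in zip(ranked[peak:], hits[peak:]) if h]
    return es, core

# Made-up ranked list for an "A vs. B" comparison, highest score first.
ranked_a_vs_b = [("G1", 3.0), ("G2", 2.5), ("G3", 2.0), ("G4", 1.0),
                 ("G5", 0.5), ("G6", -0.5), ("G7", -1.0), ("G8", -2.0)]
gene_set = {"G1", "G2", "G5"}

# Swapping the phenotypes negates every score and reverses the order.
ranked_b_vs_a = [(g, -s) for g, s in reversed(ranked_a_vs_b)]

es_a, core_a = enrichment_walk(ranked_a_vs_b, gene_set)
es_b, core_b = enrichment_walk(ranked_b_vs_a, gene_set)
print(round(es_a, 3), core_a)  # 0.917 ['G1', 'G2']
print(round(es_b, 3), core_b)  # -0.917 ['G2', 'G1']
```

In this toy example the same two genes drive the score in both directions, and only the sign of the enrichment score changes when the comparison is reversed.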

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego
