GSEAPreranked is successful with warnings


陳冠雅

Jan 17, 2022, 12:58:56 AM
to gsea-help
Hello,

I am running GSEAPreranked in GSEA Desktop with the enrichment statistic set to weighted. The values in my ranked gene list are correlation coefficients between -1 and 1, with no NA or inf values, and they carry biological meaning. When the analysis finishes, it reports success but with warnings (shown in pink).
I checked the index.html page, which shows the following text: "Scoring produced infinite or NaNs values which may have prevented plotting for certain gene sets. See the log for more details". Also, the log shows WARN entries for 72 GOBP gene sets, such as:

     985900   [WARN  ] - Scoring of GOBP_MONONUCLEAR_CELL_DIFFERENTIATION produced infinite Or NaN value(s)
     985900   [WARN  ] - Scoring of GOBP_REGULATION_OF_HEMOPOIESIS produced infinite Or NaN value(s)
     985900   [WARN  ] - Scoring of GOBP_REGULATION_OF_APOPTOTIC_SIGNALING_PATHWAY produced infinite Or NaN value(s)
     985901   [INFO  ] - Scoring produced infinite or NaNs values which may have prevented plotting for certain gene sets.  See the log for more details.

However, when I set the parameter for enrichment statistic to classic, the analysis is successful without any warnings. 
So I would like to know why the warnings appear, and whether I should change the enrichment statistic from weighted to classic.
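For a log with many such entries, the affected gene sets can be collected programmatically. The following is a minimal sketch, assuming the log lines look exactly like the ones quoted above; the function name and the regular expression are illustrative and not part of GSEA.

```python
import re

# Pattern modelled only on the WARN lines quoted in this post (an assumption
# about the general log layout, not a documented format).
WARN_RE = re.compile(
    r"\[WARN\s*\]\s*-\s*Scoring of (\S+) produced infinite Or NaN value\(s\)"
)

def gene_sets_with_warnings(log_text):
    """Return the gene set names named in NaN/infinite scoring warnings, in order."""
    return [match.group(1) for match in WARN_RE.finditer(log_text)]

sample_log = """\
     985900   [WARN  ] - Scoring of GOBP_MONONUCLEAR_CELL_DIFFERENTIATION produced infinite Or NaN value(s)
     985900   [WARN  ] - Scoring of GOBP_REGULATION_OF_HEMOPOIESIS produced infinite Or NaN value(s)
     985901   [INFO  ] - Scoring produced infinite or NaNs values which may have prevented plotting for certain gene sets.  See the log for more details.
"""

affected = gene_sets_with_warnings(sample_log)
print(len(affected), affected)
```

The resulting list can then be compared against the report tables to see which gene sets lack NES and NOM p-val.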



Thanks and best regards,
Wendy

Anthony Castanza

Jan 17, 2022, 4:26:08 PM
to gsea-help
Hi Wendy,

We would not recommend setting the enrichment statistic to "classic"; doing so removes the weighting of the K-S statistic by the strength of the ranking metric, which was one of the core improvements to the GSEA algorithm.
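To make the difference concrete, here is a minimal sketch of the running-sum enrichment score with an exponent p on the ranking metric, where p = 0 corresponds to "classic" and p = 1 to "weighted". It follows the published description of the statistic and is not GSEA's own Java implementation; the function name and the toy ranked list are hypothetical.

```python
import math

def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum ES for `ranked`, a list of (gene, score) sorted by descending score.

    p=0 gives the unweighted ("classic") statistic; p=1 weights each hit by |score|.
    """
    hit_weights = [abs(score) ** p if gene in gene_set else 0.0 for gene, score in ranked]
    total_hit_weight = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if total_hit_weight == 0 or n_miss == 0:
        return math.nan  # ES is undefined without weighted hits or without misses
    running, best = 0.0, 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in gene_set:
            running += weight / total_hit_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss               # step down on a miss
        if abs(running) > abs(best):
            best = running                        # keep the largest deviation from zero
    return best

# Toy data, invented for illustration only.
ranked = [("A", 0.9), ("B", 0.5), ("C", 0.1), ("D", -0.2), ("E", -0.8)]
gene_set = {"A", "C"}

classic = enrichment_score(ranked, gene_set, p=0.0)
weighted = enrichment_score(ranked, gene_set, p=1.0)
print(round(classic, 4), round(weighted, 4))
```

In the toy example the weighted score is larger because the strongly ranked gene A contributes most of the hit weight, whereas the classic score treats A and C identically.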
Generally, this warning isn't anything to worry about. GSEA intercepts these values to prevent them having an adverse effect on the enrichment score calculation. However, it is unusual to see these warnings in Preranked data. GSEA interprets blank values as NA/NaN. Is it possible that there is a missing value for a gene somewhere in your gene list?
The feature that prints these warnings is a new addition to GSEA, so I suppose it is possible that something is triggering them incorrectly. Would you mind sharing your ranked list with us? You can do so confidentially by sending it to gsea...@broadinstitute.org. That would help us say definitively what might have happened here.
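A quick way to check a ranked list for blank or non-numeric scores before sending it is sketched below. It assumes a two-column, tab-delimited list (gene, then score) and skips lines starting with "#"; the function name and sample rows are hypothetical.

```python
import math

def find_bad_rows(rnk_text):
    """Return (line_number, raw_line) for rows whose score is blank or not a finite number."""
    bad = []
    for lineno, line in enumerate(rnk_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comment/header lines (assumed convention)
        fields = line.split("\t")
        score = fields[1].strip() if len(fields) > 1 else ""
        try:
            value = float(score)
        except ValueError:
            bad.append((lineno, line))  # blank or non-numeric score
            continue
        if not math.isfinite(value):
            bad.append((lineno, line))  # NaN or +/-inf
    return bad

# Invented sample rows for illustration.
sample = "GENE_A\t0.82\nGENE_B\t\nGENE_C\tNaN\nGENE_D\t-0.41\nGENE_E\tinf\n"
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

An empty result suggests that missing input values are not the cause of the warning.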

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/9fe31215-a6d7-4c28-8bd8-d9028ddc34dbn%40googlegroups.com.

陳冠雅

Jan 17, 2022, 9:19:05 PM
to gsea-help
Hi, 

I have sent my ranked lists to gsea...@broadinstitute.org with the title "[gsea-help] GSEAPreranked is successful with warnings".
I think the warnings appear because the gene sets at the top and bottom of the results have "NaN" for NES and NOM p-val. This issue has also been reported before: https://groups.google.com/g/gsea-help/c/X2rF94Pxn-Q. Should I increase the number of permutations, perhaps to 10000 or higher?

Thanks!


Wendy

Attachments: ranked_list1.JPG, ranked_list2.JPG

Anthony Castanza

Jan 18, 2022, 12:44:00 PM
to gsea-help
Hi Wendy,

Ah, I see, I misunderstood the error message. This is a slightly different error where rather than NA/NaN values in the input ranked list, GSEA has instead failed to generate a valid null distribution for those gene sets.
Looking at the data you sent to gsea-team, it appears as if Ranked List 1 has many more negative correlations than positive correlations, and conversely, Ranked List 2 has many more negative correlations than positive. With highly unbalanced distributions like this, GSEA can fail to generate null distributions that are capable of producing enrichments for sets on the smaller side. Increasing the permutation number is really the only thing we can offer as a potential solution here, since this occurs due to the nature of the datasets themselves.
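The imbalance described here can be checked directly by counting the signs of the ranking scores. This is a minimal sketch with invented scores; the function name is hypothetical and the numbers are not taken from the lists discussed in this thread.

```python
def sign_balance(scores):
    """Return (n_positive, n_negative, n_zero) for a list of ranking scores."""
    n_pos = sum(1 for s in scores if s > 0)
    n_neg = sum(1 for s in scores if s < 0)
    n_zero = len(scores) - n_pos - n_neg
    return n_pos, n_neg, n_zero

# Invented example: far more negative than positive correlations.
scores = [0.31, 0.12, -0.05, -0.22, -0.40, -0.47, -0.63, -0.71]
n_pos, n_neg, n_zero = sign_balance(scores)
print(f"positive={n_pos} negative={n_neg} zero={n_zero}")
print(f"fraction positive: {n_pos / len(scores):.2f}")
```

A fraction far from 0.5 indicates the kind of skew that can leave one side of the null distribution empty.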

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

陳冠雅

Jan 19, 2022, 2:54:19 AM
to gsea...@googlegroups.com
Hi Anthony, 

Thanks for answering my questions. I now understand why the warning appears and what it means.
I set the permutation number to 10000, 20000, 40000, and 60000, but there are still "NaN" values for NES and NOM p-val, although for fewer and fewer pathways. When I set the permutation number to 100000, I get an error: OutOfMemoryError: Java heap space. I have already increased the heap space with -Xmx8g, which is the maximum my memory allows. So I don't think there is anything more I can do to improve the result or resolve the warnings. However, I only need the NES and FDR q-val for the top 50 GOBP gene sets, so those "NaN" values shouldn't affect the other results I need, right?

Thanks!


Wendy




Anthony Castanza

Jan 19, 2022, 12:42:52 PM
to gsea...@googlegroups.com

Hi Wendy,

 

I wouldn't have recommended going much beyond 10,000. I haven't crunched the numbers on the actual maximum number of valid permutations for a gene list of N features, but it's going to hit a point of diminishing returns, especially given the memory usage tradeoff you've noticed.

The issue is that those gene sets which have NaN for their statistics could very well be considered hits (or at least candidates). Think of it this way: a set received an enrichment score of, say, +0.8 in the real data, but when we generated the permutation matrix for a set of that size, there were zero sets in that matrix that also had a positive sign (or, vice versa, for a set with a negative enrichment score in the real data, there would have been no results with a negative sign in the permuted matrix for that set).

GSEA in Preranked/gene set permutation mode calculates statistics by taking the real enrichment score and using it as a threshold for evaluating the null distribution, asking "how many times in the null distribution did the permutations for this set score as well as the real set?" So let's say you have an ES of +0.8 again, and you did 1000 permutations. Of those 1000 permutations, 500 have a positive score, and of those 500, only 2 have a score >= +0.8, so the p-value for that set would be 2/500, or 0.004. The FDR is a little more complicated in that it looks at the global distribution of set scores, but it basically follows the same principle.

The issue here arises from that "2/500" calculation. In your case there are not 500 sets in the denominator of that calculation; there are 0, which causes a divide by zero. By increasing the permutation number, we were trying to eliminate that divide by zero.

 

In this case, it unfortunately appears as if the skew in the dataset is too extreme to overcome by increasing the permutation number, so those sets can't be fully evaluated, but GSEA also isn't able to rule them out entirely either.
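The arithmetic described above can be sketched as follows. This is an illustration of the same-sign p-value calculation as explained in this message, not GSEA's actual code; the function name and the synthetic null scores are hypothetical.

```python
import math

def nominal_p_value(observed_es, null_es):
    """P-value using only the null scores that share the sign of the observed ES."""
    if observed_es >= 0:
        same_sign = [es for es in null_es if es >= 0]
        as_extreme = [es for es in same_sign if es >= observed_es]
    else:
        same_sign = [es for es in null_es if es < 0]
        as_extreme = [es for es in same_sign if es <= observed_es]
    if not same_sign:
        return math.nan  # empty denominator: the divide-by-zero case
    return len(as_extreme) / len(same_sign)

# Balanced null: 1000 permutations, 500 positive, 2 of them at or above +0.8.
balanced_null = [0.85, 0.90] + [0.30] * 498 + [-0.40] * 500
p_balanced = nominal_p_value(0.8, balanced_null)

# Skewed null: no permutation produced a positive score at all.
skewed_null = [-0.40] * 1000
p_skewed = nominal_p_value(0.8, skewed_null)

print(p_balanced, p_skewed)
```

With the skewed null, the denominator is empty no matter how extreme the observed score is, which is why such a set can be neither scored nor ruled out.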

 

-Anthony

 

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

 


陳冠雅

Jan 20, 2022, 5:36:28 AM
to gsea...@googlegroups.com
Hi Anthony,

OK, I understand the reason now. So the gene sets with NaN might actually be among the most significantly enriched GO biological processes, right? And if I use the results but exclude the gene sets with NaN entirely, I could be omitting biological processes that are in fact significantly enriched in the input gene list?


Thanks!


Wendy


Anthony Castanza

Jan 20, 2022, 5:59:50 PM
to gsea-help
The general conclusion is that GSEA is unable to statistically rule those sets out, and the reason that it is unable to rule them out leans towards a circumstance where it looks like they might be biologically relevant. Beyond that, it's difficult to draw any conclusions. If you're going to be reporting these results, I might report them separately from the statistically significant results.
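One way to keep the two groups apart when preparing results is sketched below. It assumes a tab-delimited report with NAME and NES columns; the column layout, the function name, and all sample values are hypothetical and should be adapted to the actual report files.

```python
import csv
import io
import math

def split_report(tsv_text):
    """Split gene set names into those with a numeric NES and those without one."""
    scored, unscored = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            nes = float(row["NES"])
        except (TypeError, ValueError):
            nes = math.nan  # treat blank or non-numeric entries as missing
        (unscored if math.isnan(nes) else scored).append(row["NAME"])
    return scored, unscored

# Invented rows for illustration; the values are not real results.
sample_report = (
    "NAME\tES\tNES\tNOM p-val\tFDR q-val\n"
    "EXAMPLE_SET_1\t0.61\t2.10\t0.000\t0.010\n"
    "GOBP_REGULATION_OF_HEMOPOIESIS\t0.80\tNaN\tNaN\t1.000\n"
    "EXAMPLE_SET_2\t0.45\t1.70\t0.004\t0.060\n"
)

scored, unscored = split_report(sample_report)
print("with statistics:", scored)
print("report separately:", unscored)
```

Keeping the unscored names in their own list makes it straightforward to present them apart from the statistically significant sets.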


-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Kayondo Fadhir

Jun 30, 2023, 7:28:33 PM
to gsea-help
Hi Anthony,

I came across this thread and found it useful, as it addresses the same concern I am having.
In addition to that warning message, all gene sets in my results tend to have an FDR of 1. Although the GO terms at the top have known biological relationships with my trait of interest, I cannot call any of them significant, as none has an FDR < 25%.

To follow up on why all the FDRs are 1, I plotted a histogram of the nominal p-values and found that it is U-shaped (please see below). Is it usual for this plot to be U-shaped in a normal run? If not, could you help me trace where the issue is?

Attachment: Histogram of Nominal p-values.png

I would appreciate any assistance. Thanks.

Fazhir 

David Eby

Jul 3, 2023, 2:13:09 PM
to gsea...@googlegroups.com
Hi Fazhir,

Anthony is traveling for an extended holiday and won't be back until later this week.  I'm sorry to say that I can't offer much help on your question, but I'll make sure he sees this when he returns.



Castanza, Anthony

Jul 5, 2023, 12:35:12 PM
to gsea...@googlegroups.com

Hi Fazhir,

 

Generally we ask that people create new threads with their specific issues so that they can be directly addressed, rather than bumping old threads.

 

Could you tell me a little bit more about the dataset you’re using here, particularly number of samples in each phenotype group,  permutation mode used, ranking metric, etc? It is difficult to make any determinations from the single plot you’ve shown. Based on general failure modes of GSEA however, my initial thought is that your experiment might be under-powered for the statistical assumptions of the GSEA method.

 

-Anthony

 

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

 


Kayondo Fadhir

Jul 11, 2023, 1:14:50 PM
to gsea-help
I am sorry for not creating a new thread. I found that this thread covered the same issue I wanted help with, so I joined it.

My data are GWAS results (genomic windows, each weighted by the percentage of genetic variance it explains). The windows are thus ranked from the highest % variance (top) to the lowest (bottom). Unlike fold changes in gene expression data, this list has no threshold that separates top windows from bottom windows. My goal is to find the terms most enriched in the windows that explain a high share of the genetic variance for my trait (particularly the top windows). Thanks.
