distribution of p_val and fdr values; 0.00001 as min fdr

ilian atanassov

May 6, 2024, 7:52:22 AM

to webgestalt

Hello,

My question is related to these questions:

https://groups.google.com/g/webgestalt/c/PyQSkID__6w/m/gRaKmobJCwAJ
https://groups.google.com/g/webgestalt/c/9b3QJAUvcFI/m/M_gdfcHAAgAJ
https://groups.google.com/g/webgestalt/c/fXsQX6SxpSw/m/3RczUXeyAAAJ

I would like to ask about the large gap in the distribution of the -log10(fdr) values. This gap is well illustrated in the results obtained from the Sample Run, also attached as 01_GSEA_example_webgestalt.PNG. There are a large number of fdr values with a -log10 value under 5, and then a gap until we reach a value of around 16, which comes from the precision limit of floating-point numbers, where smaller values are reported as zero. Why are there no values with a -log10 value between (just under) 5 and 16?

Looking at the results from a larger dataset with more than 20 ranked sets, I still can't obtain any fdr values with a -log10 value between (just under) 5 and 16; see the attached plot 02_p_val_vs_fdr.png. Looking only at results with p_val equal to zero, these get adjusted for multiple testing to a range with the above-mentioned gap; see the attached plot 03_p_val_0_only_vs_fdr.png. Why are we observing this result? Why can't the adjustment of these very low p_val results produce values in the gap?

Lastly, do you see any arguments against reporting, for plotting purposes, all fdr values that are 0 as 0.00001 instead of 2.220446e-16? This would keep the ranking of the results based on fdr identical, but would make plotting and interpretation of the results more intuitive. One would not overinterpret the gap in -log10(fdr) as a strong indication of one gene set being much more significant than another, and sizes and color schemes when plotting the data would not be dominated by this large gap.

Best wishes,

Ilian

02_p_val_vs_fdr.png

01_GSEA_example_webgestalt.PNG

03_p_val_0_only_vs_fdr.png

John Elizarraras

May 13, 2024, 4:40:53 PM

to webgestalt

Hello Ilian,

That is something that I have also found to be common with GSEA. The range of values tends to vary with my data set, but the effect is the same. My best guess at the reason for this gap is based on the specifics of our GSEA implementation. GSEA uses enrichment scores, which are shown on the results page, as a measure of the enrichment of your input list. It then permutes that list multiple times (the default is 1000) and calculates the enrichment scores for those random permutations. It then keeps only the permutations whose enrichment scores share the same sign as that of the original list, and calculates a p-value from them.

For example, imagine the original list had an enrichment score of 0.67. Out of the 1000 permutations, 700 also had a positive enrichment score. The p-value is calculated as the number of those permutations with a larger enrichment score than the original, divided by the number of positive permutations. For this example, if 25 out of the 700 permutations had a score above 0.67, the p-value would be 25/700 = 0.0357. However, it gets tricky when the original enrichment score is so high that there are zero permutations with an enrichment score above it. Using the above calculation, the p-value would be 0. WebGestalt represents this value as 2.2E-16 due to computational limitations.
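The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation as described in this post, not WebGestalt's actual code: the function name `empirical_p_value`, the choice of the double-precision machine epsilon as the floor, and the synthetic permutation scores are all assumptions made for the example.

```python
import sys

# Assumed floor: the double-precision machine epsilon (about 2.2e-16),
# used here in place of an exact zero.
P_FLOOR = sys.float_info.epsilon


def empirical_p_value(observed_es, permuted_es, floor=P_FLOOR):
    """Permutation p-value using only permutations with the same sign as the observed score."""
    if observed_es >= 0:
        same_sign = [es for es in permuted_es if es >= 0]
        more_extreme = sum(1 for es in same_sign if es > observed_es)
    else:
        same_sign = [es for es in permuted_es if es < 0]
        more_extreme = sum(1 for es in same_sign if es < observed_es)
    if not same_sign:
        return float("nan")  # no same-sign permutations to compare against
    # A count of zero would give p = 0, so it is replaced by the floor.
    return max(more_extreme / len(same_sign), floor)


# Synthetic scores matching the example: 1000 permutations, 700 positive,
# 25 of which exceed the observed score of 0.67.
permuted = [0.9] * 25 + [0.3] * 675 + [-0.4] * 300

p_example = empirical_p_value(0.67, permuted)  # 25/700
p_extreme = empirical_p_value(0.95, permuted)  # 0/700, reported as the floor
print(round(p_example, 4), p_extreme)
```

With these made-up scores, the first call reproduces the 25/700 case and the second shows how an observed score above every permutation ends up at the floor value rather than at 0.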

This happens fairly often, which results in many p-values being reported as 2.2E-16. This leads to the gap that you see. In the example, if 1/700 were above 0.67, the p-value would be 0.00143, but changing that to 0/700 would result in 2.2E-16, which is a large jump. FDR is calculated in a similar way, with considerations of set size and other factors, but I would expect this gap to be maintained.
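To put the size of that jump on the -log10 scale used in the plots, here is a small sketch for the p-value case only. It assumes 700 same-sign permutations as in the example and uses the double-precision machine epsilon for the reported zero; the FDR values go through additional steps, so their exact boundaries will differ.

```python
import math
import sys

n_same_sign = 700  # same-sign permutations, as in the example above

# Smallest p-value that is still non-zero: exactly one permutation beats the original.
smallest_nonzero_p = 1 / n_same_sign
# Value reported when no permutation beats the original (assumed: machine epsilon).
reported_zero_p = sys.float_info.epsilon

upper_edge_of_observed = -math.log10(smallest_nonzero_p)  # about 2.85
reported_zero_level = -math.log10(reported_zero_p)        # about 15.65

print(f"{upper_edge_of_observed:.2f} {reported_zero_level:.2f}")
```

No p-value can fall between these two levels, because the count of permutations above the original score is an integer: it is either at least 1 or exactly 0.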

I think plotting the p-values the way you suggested would be acceptable and, as you mentioned, would help avoid over-interpreting the gap. Depending on the plot type, it may be helpful to plot the 0 values as 0.000001 to get a nice color/size gradient, but have the legend represent these values as < 0.00001 (see the example at the end of this message), which may be a good compromise.
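As a rough sketch of that idea in Python (the function name `fdr_for_plotting` and the default floor of 0.00001 are assumptions for illustration, not part of WebGestalt), the values can be clamped before taking -log10, while the label for clamped entries says "<" so the legend stays honest:

```python
import math


def fdr_for_plotting(fdr_values, floor=1e-5):
    """Clamp FDR values below `floor` and return (-log10 values, display labels)."""
    clamped = [max(v, floor) for v in fdr_values]
    neg_log = [-math.log10(v) for v in clamped]
    # Clamped entries are labelled as "< floor" instead of showing a made-up number.
    labels = [f"< {floor:g}" if v < floor else f"{v:.3g}" for v in fdr_values]
    return neg_log, labels


neg_log, labels = fdr_for_plotting([0.0, 2.220446e-16, 0.003, 0.04])
print(neg_log)
print(labels)
```

Note that any genuine non-zero value below the floor would be clamped as well, so entries at the floor become ties; the order of everything above the floor is unchanged.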

If you have any questions or if I didn't answer something, please let me know.

Best,

John

<0.00001 example:

Screenshot 2024-05-13 at 15-37-10 Examples - Apache ECharts.png

ilian atanassov

Jun 7, 2024, 9:31:54 AM

to webgestalt

Hi John,

Thanks for the detailed explanation.

Best wishes,

Ilian
