Filtering the normalised counts for GSEA

Theo

May 7, 2025, 8:50:55 AM

to gsea-help

Hi Anthony,
Which of these would be the better option for filtering the normalised counts?
A. Filter the raw counts with rowSums > 10, then run the DESeq2 normalisation steps and get the counts from DESeq2::counts(dds, normalized = TRUE).
B. Skip prefiltering, get the normalised counts, and then filter those with rowSums > 1.
C. Run DESeq2 and keep only the genes whose p-value is not NA (!is.na() on the pvalue column).
D. Same as C, but filtering on the adjusted p-value instead.
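To make the difference between options A and B concrete, here is a minimal sketch in plain Python rather than R. It covers only the row-sum filtering step, not DESeq2 itself. The gene names and all numbers are invented, the normalised values merely stand in for what DESeq2::counts(dds, normalized = TRUE) would return, and keep_by_row_sum is a hypothetical helper.

```python
# Toy data (hypothetical): gene -> one value per sample.
raw_counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [3, 4, 2, 3],
    "geneC": [1, 0, 1, 0],
    "geneD": [250, 310, 190, 275],
}

# Invented stand-in for DESeq2-normalised counts of the same genes.
normalised_counts = {
    "geneA": [0.0, 0.0, 0.0, 0.0],
    "geneB": [2.9, 4.2, 2.1, 2.8],
    "geneC": [0.9, 0.0, 1.1, 0.0],
    "geneD": [240.5, 322.0, 201.3, 262.7],
}


def keep_by_row_sum(matrix, threshold):
    """Return the set of genes whose row sum is strictly above the threshold."""
    return {gene for gene, values in matrix.items() if sum(values) > threshold}


# Option A: filter on the raw counts (row sum > 10) before normalising.
option_a = keep_by_row_sum(raw_counts, 10)

# Option B: filter on the normalised counts (row sum > 1) afterwards.
option_b = keep_by_row_sum(normalised_counts, 1)

print("A keeps:", sorted(option_a))
print("B keeps:", sorted(option_b))
print("kept by B only:", sorted(option_b - option_a))
```

With these toy numbers, option B is the more permissive filter: a gene with only a couple of reads in total passes the normalised-count threshold of 1 but not the raw-count threshold of 10.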

From the DESeq2 manual:

If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance.
If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA
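The three rules quoted above imply that option D is always at least as strict as option C. A small sketch, again in plain Python with invented rows, where None stands in for R's NA:

```python
# Hypothetical results rows: (gene, base_mean, pvalue, padj).
# None plays the role of NA in a DESeq2 results table.
results = [
    ("geneA", 0.0, None, None),      # all-zero counts: pvalue and padj are NA
    ("geneB", 85.2, None, None),     # extreme count outlier: pvalue and padj are NA
    ("geneC", 1.3, 0.40, None),      # independent filtering: only padj is NA
    ("geneD", 120.7, 0.001, 0.004),  # tested and adjusted
]

# Option C: keep genes with a non-NA p-value.
option_c = [gene for gene, _, pvalue, _ in results if pvalue is not None]

# Option D: keep genes with a non-NA adjusted p-value.
option_d = [gene for gene, _, _, padj in results if padj is not None]

print("C keeps:", option_c)
print("D keeps:", option_d)
```

Option C removes only the all-zero and outlier rows, while option D additionally removes every gene dropped by independent filtering for its low mean normalised count.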

Any clues about which is best?

Thank you in advance

Theo

May 7, 2025, 11:07:21 AM

to gsea-help

Just to add the tables from C and D.
C: I have a list of DESeq2-normalised counts with 33093 genes, which includes a lot of entries like these:
ENSMUSG00002076809 na 7.61887276751807 1.04517064877761 0 3.14493664958679 3.36302840492525 0 2.37475962414038 1.7429215906826 2.87070705421875 0 0 1.08452671952552 4.55575010076612 0 2.92333011518995 0 4.28588557896389 1.83435785961577 0 7.93467245907308 4.661211134479 5.9605274857457 1.07302103388684 3.2660740101401 0 3.66232281671361 1.10042057046313 5.20013030348944 3.29810434950778 4.60248566047471 1.82206791835396 0 1.00128296642959 3.69803868468925 3.57883167019843 3.42257232250422 3.57327484685004 3.48329972829529 2.10859910390241 4.93710512858037
ENSMUSG00002076839 na 0 0 0 0 0 0 0 1.73939236825943 0 0 0.983370085853419 0 0 0.86449124181697 0 0 0 0 0 2.72230181222609 0 0 0 0 0 0 0 0 0 0 0 0 0 0.937200528251027 0 0 0 0 0 0
ENSMUSG00002076874 na 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.74165130160389 0 0 0 0 0 0 0 0 0 0 0 0.870642492258776 0 0
ENSMUSG00002076896 na 0 0 0 1.06391825949919 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.879299869476766 1.02627741837094 0 0 0 0 0 0 0 0 0 0 0 0 0

ENSMUSG00002076983 na 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.98836831478261

D: I also have the cleaned, normalised version with 21536 genes, which looks like this:
ENSMUSG00002076161 na 116.960586790291 81.8538310258277 182.126797113322 80.4120498454981 70.7189234527203 101.426300857542 99.6672870772646 105.861338420998 77.3435872517743 55.5893628054226 151.500958375239 107.145285844184 46.814653795675 84.6166860076932 91.7321721173851 145.616731869837 142.005964856461 148.049162136052 63.9612658948641 59.4642635854178 192.110852019837 135.152975552708 113.039102843619 90.161583450996 81.7529002804642 138.231848310067 91.2779746315778 89.1396445461211 119.297959471515 93.1191227674348 164.280314478339 162.476600304416 130.611496617717 164.817416025053 83.9485642860851 182.703754078179 150.986350865459 182.466985600745 111.257232385293 125.090003642368
ENSMUSG00002076304 na 0 1.17401806479128 0.870074973579109 0.964306035967663 0 1.01858271536842 0 1.12799117641536 0.86109744367906 0 0 0 0.928135113347835 1.87607880245635 0 0 0 0 0 1.13480536240617 0 0 0 0 0 0 0.915923643569095 0 0 0 0 0 0 2.18286496845529 0 0 1.97608068196912 0 3.36302362135016 0
ENSMUSG00002076556 na 12.8000834178458 10.5372998796836 11.3096017825665 9.67037806036593 10.9872224231804 11.08317370437 13.7684177212397 10.115851939434 10.3461237765007 15.9330665866416 12.3206426636197 12.8822279545899 21.4311954667566 16.9382232803137 13.4340031246785 15.714461310248 20.6014752309649 20.5681233082612 17.9284988694637 10.2493316889462 11.6594911397061 13.7921341678752 20.5026404038179 5.5402494251325 9.34677147697609 16.5497912143527 6.39846055953904 15.0929794985184 16.7668188416405 14.5585516389334 21.7606843155166 14.1228658495539 26.9472937765298 16.5826836996339 14.955206454981 20.4971093157152 9.85171448695108 25.9883586273371 19.5106224065994 22.0494815004326
ENSMUSG00002076601 na 0 1.17736880893544 0 0 0.983915132390427 1.02755666280787 0 0 0.860375729535024 0 0.891957677096703 1.71485380322894 2.77484753774539 0 0.88765347098742 0 0 0 0 0 0 1.08105883971678 0 2.20562534133545 3.47923139004612 0 0 1.00380514633613 5.56692223746409 1.20282824410669 0 0 0.869823722079897 0 0 0 0 1.07687249420383 0 0
ENSMUSG00002076809 na 0 2.37475962414038 1.7429215906826 2.87070705421875 0 0 1.08452671952552 4.55575010076612 0 2.92333011518995 0 4.28588557896389 1.83435785961577 0 7.93467245907308 7.61887276751807 1.04517064877761 0 3.14493664958679 3.36302840492525 3.66232281671361 1.10042057046313 5.20013030348944 3.29810434950778 4.60248566047471 1.82206791835396 0 1.00128296642959 3.69803868468925 3.57883167019843 3.42257232250422 3.57327484685004 3.48329972829529 2.10859910390241 4.93710512858037 4.661211134479 5.9605274857457 1.07302103388684 3.2660740101401 0

Anthony Castanza

May 7, 2025, 4:13:45 PM

to gsea...@googlegroups.com

I would refer you to the DESeq2 manual on prefiltering here: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

They consider this step optional, but if you are using the matrix downstream for GSEA, it is recommended, and I would suggest following their best practices there.
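As a rough illustration of that kind of pre-filter: to my understanding, the linked section currently suggests keeping genes that reach a minimum count in at least as many samples as the smallest group, rather than thresholding the row sum. The sketch below is plain Python on invented data, not the vignette's R code. The function name prefilter and the defaults of 10 reads and 3 samples are assumptions for the example.

```python
def prefilter(counts, min_count=10, smallest_group_size=3):
    """Keep genes with at least `min_count` reads in at least
    `smallest_group_size` samples (both defaults are illustrative)."""
    kept = {}
    for gene, row in counts.items():
        samples_passing = sum(1 for value in row if value >= min_count)
        if samples_passing >= smallest_group_size:
            kept[gene] = row
    return kept


# Hypothetical raw counts: gene -> one value per sample (6 samples).
raw_counts = {
    "geneX": [0, 1, 0, 2, 0, 1],      # low everywhere
    "geneY": [12, 15, 11, 0, 1, 0],   # expressed in three samples
    "geneZ": [40, 0, 0, 0, 0, 0],     # row sum is 40, but from one sample only
}

filtered = prefilter(raw_counts)
print(sorted(filtered))
```

Note that geneZ would pass a simple row-sum threshold of 10 but is removed here, because its reads all come from a single sample.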

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/gsea-help/8b69f4bb-9808-487f-9931-ec4c16c696e8n%40googlegroups.com.

Theo

May 8, 2025, 11:03:11 AM

to gsea...@googlegroups.com

It seems they have updated it.
But there is also this line, which adds a plot twist:
One can also omit this step entirely and just rely on the independent filtering procedures available in results(), either IHW or genefilter. See independent filtering section.

This is what I was reading a few years ago, when they implemented the independent filtering algorithm.
I believe the latter approach is the best one.
I might test them a bit and report back with some results.

